ENH: Added create_homogeneous_subsets_dataframe to the TukeyHSDResults class
#9573
|
Do you have a reference? Is this similar to #9493?
I need to look at how this works here. But in general, groupings have overlapping groups. BTW: Thanks for the PR. |
|
Thank you for the feedback! Regarding the You're absolutely right that groupings can overlap, and the method supports that. It creates one row per group and one column per subset. A group may appear in multiple subsets, and in those cases its mean is shown in more than one column. For example, in the table below:
Here, each cell shows the group mean if the group belongs to the subset. The last row, min p-value, shows the smallest p-value among all pairwise comparisons within that subset, giving a sense of how "homogeneous" the subset is (higher values suggest greater similarity among its members). Regarding #9493, I agree it uses a similar conceptual idea of homogeneous subsets, but with a different representation. I'll take a closer look in case there's overlap or opportunity for integration. As for style, I'll fix the mis-indented docstring and will revert any Black formatting in modules you maintain. Thanks for the clarification. Let me know if you see a better place for this functionality or any needed adjustments to align with |
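A minimal sketch of the idea described above, not the code from this PR: it enumerates candidate subsets, keeps the maximal ones in which every pairwise p-value exceeds alpha, and reports the smallest p-value inside each. The helper names `homogeneous_subsets` and `min_pvalue` and the p-values are made up for illustration; in practice the p-values would come from the pairwise comparison table.

```python
from itertools import combinations


def homogeneous_subsets(groups, pvalues, alpha=0.05):
    """Return maximal subsets in which every pairwise p-value exceeds alpha.

    `pvalues` maps frozenset({a, b}) to the p-value of that comparison.
    Hypothetical helper: brute-force search, fine for a handful of groups.
    """
    kept = []
    # Go from the largest size down so maximal subsets are accepted first.
    for size in range(len(groups), 0, -1):
        for combo in combinations(groups, size):
            pairs = combinations(combo, 2)
            if all(pvalues[frozenset(p)] > alpha for p in pairs):
                candidate = set(combo)
                # Skip subsets already contained in a larger accepted subset.
                if not any(candidate <= s for s in kept):
                    kept.append(candidate)
    return [sorted(s) for s in kept]


def min_pvalue(subset, pvalues):
    """Smallest pairwise p-value within a subset (None for a single group)."""
    pairs = list(combinations(subset, 2))
    return min(pvalues[frozenset(p)] for p in pairs) if pairs else None


# Made-up p-values for four groups, chosen so that subsets overlap.
groups = ["A", "B", "C", "D"]
pvalues = {
    frozenset(("A", "B")): 0.60,
    frozenset(("A", "C")): 0.03,
    frozenset(("A", "D")): 0.001,
    frozenset(("B", "C")): 0.40,
    frozenset(("B", "D")): 0.02,
    frozenset(("C", "D")): 0.30,
}

subsets = homogeneous_subsets(groups, pvalues)
min_ps = [min_pvalue(s, pvalues) for s in subsets]
```

With these values, groups B and C each fall into two subsets, which is the overlapping case discussed above. The exhaustive enumeration grows exponentially with the number of groups, so it only suits small designs.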
|
good The letter display in my PR is more general because it also applies to Games-Howell (or similar) with unequal variances across groups. In terms of interface and result format. I will look for my notebook with examples for the |
|
looking at parts of your pr code. AFAICS:
(I guess the all-subsets algorithm will be slower than the iterative algorithm when we have a large number of groups.) The example and test case need unbalanced groups, i.e. different group sizes, so that the standard errors of the means are not all the same. |
|
here is my dirty notebook https://gist.github.com/josef-pkt/183dd4b6cc04429385725f502c578b39 I got stuck at the end in finding a sorting option, how to define |
|
I've included a test for unbalanced data (i.e. different sample sizes), and you're right, there are some issues in this case. You can check the notebook here: https://gist.github.com/victormvy/25cbf2dc11d2706ccd3e5e8a6e86e55e (cell 47). I also ran the same test in SPSS to compare the results, and here's what I got: Not only is the ordering messed up, but the p-values are also different. So it seems SPSS handles different sample sizes just fine. I'll need to look into how it manages that. But yeah, it looks like we're facing the same problem. |
|
I am checking the full Tukey test table from SPSS and I found something weird. I'm not a statistician, so maybe I'm missing something, but... For example, groups 2 and 4 are both included in subset 2. However, according to the Tukey test table, the p-value for the comparison between groups 2 and 4 is below alpha = 0.05. Meanwhile, the p-value between groups 2 and 5 is 0.051. So shouldn't groups 2 and 4 be in different subsets, and groups 2 and 5 be in the same one? My other concern is that the significance levels shown in the homogeneous subsets table don't appear anywhere in the Tukey pairwise comparison table. So it doesn't look like they're just the minimum p-values from the pairwise comparisons within each subset. If it were the minimum p-value, it would match the p-values in our table. I'm a bit lost 😕 |
|
see footnote b in the SPSS groupings table I guess SPSS assumes equal variances and equal sample sizes in computing the The point of the article underlying my PR was that their "letter" display is correct even if standard errors of means differ. Old statistical methods. For example, standard 2-way anova or, IIRC, repeated measures anova only works for balanced samples (i.e. equal variances and equal cell sizes). update |
|
Yes, I think SPSS must be doing something additional under the hood when generating the homogeneous subsets table. In fact, the pairwise comparisons table matches ours, but the homogeneous subsets table, especially the p-values it shows, does not. Also, is the Tukey test still reliable when dealing with unbalanced data? I wonder whether the discrepancies we're seeing might be due to limitations of the method in those cases. By the way, does it make sense to sort the groups by the first and last subset they appear in? I've been experimenting with that approach, and it seems to produce visually continuous subsets. Of course, when sample sizes differ, the groups aren't sorted by mean. You can see a somewhat hacky implementation of this idea in cell 13: https://gist.github.com/victormvy/25cbf2dc11d2706ccd3e5e8a6e86e55e |
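The sorting idea above could be sketched as follows. This is an illustration under assumptions, not the implementation from the linked notebook: `sort_by_subset_span` is a hypothetical helper, the subsets are made up, and every group is assumed to belong to at least one subset.

```python
def sort_by_subset_span(groups, subsets):
    """Order groups by the first and last subset index they appear in."""

    def span(group):
        # Indices of all subsets that contain this group.
        idx = [i for i, subset in enumerate(subsets) if group in subset]
        return (idx[0], idx[-1])

    return sorted(groups, key=span)


# Made-up overlapping subsets, listed in the order of the table columns.
subsets = [["A", "B"], ["B", "C"], ["C", "D"]]
groups = ["D", "B", "A", "C"]

ordered = sort_by_subset_span(groups, subsets)
```

The result depends on the order in which the subsets themselves are listed, so that order has to be fixed first.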
|
tukey-hsd is robust to unequal sample sizes, I think the reference is to tukey-kramer method. |
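For reference, the Tukey-Kramer adjustment replaces the common standard error with one that depends on the two group sizes, sqrt(MSE / 2 * (1 / n_i + 1 / n_j)). A small sketch with made-up numbers (the function name `tukey_kramer_se` is hypothetical, not a statsmodels API):

```python
import math


def tukey_kramer_se(mse, n_i, n_j):
    """Standard error used for the studentized range statistic of one pair."""
    return math.sqrt(mse / 2.0 * (1.0 / n_i + 1.0 / n_j))


mse = 4.0  # made-up pooled within-group mean squared error

# Balanced pair: reduces to sqrt(mse / n).
se_equal = tukey_kramer_se(mse, 8, 8)

# Unbalanced pair with the same total number of observations.
se_unequal = tukey_kramer_se(mse, 4, 16)
```

Because the standard error differs from pair to pair when group sizes differ, two groups with a larger mean difference can end up with a larger p-value than two groups with a smaller one, which is why sorting by mean no longer guarantees contiguous subsets.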


I have added the method `create_homogeneous_subsets_dataframe` to the `TukeyHSDResults` class, which is the type of object returned by the `pairwise_tukeyhsd` test. This method summarises the results of Tukey's HSD test by constructing a DataFrame that groups factor levels into homogeneous subsets: sets of groups whose pairwise differences are not statistically significant (i.e., p > alpha). Each group appears only once in the table, with its mean value displayed under each subset it belongs to. A final row, labelled "min p-value", shows the smallest p-value among all comparisons within each subset. This table offers a concise and intuitive visual summary of which groups are statistically similar, making it easier to interpret the results of the post hoc analysis.

I have also included a test for the new method and added an example notebook that demonstrates how to use Tukey's HSD test, the newly implemented `create_homogeneous_subsets_dataframe` method, and the existing `plot_simultaneous` method. This notebook provides a complete workflow for performing post hoc analysis and visualising group differences effectively.