Fix Tree Median Calculation for MAE criterion #11649
Tree MAE fix to ensure sample_weights are used during impurity calculation
More detail...

1) Proof that there is an existing problem with the median calculations. In addition to my original bug illustration (#10725), I have added a branch to my GitHub page that makes testing for this issue much easier: https://github.com/JohnStott/scikit-learn/tree/median_issue_example The median_issue_example branch is the existing version of sklearn (without this pull request's fix)... I have created a new function in class:

2) Proof that the fix solves this problem.

3) Test unit(s) to ensure median calculation integrity. For the production code, I have modified sklearn/tree/tests/test_tree.py with an additional test. For the existing implementation, this test data will produce an incorrect median, and thus a different tree than the fixed version. I am not sure this test is sufficient, or whether my comments there are ideal, so this would be good for further discussion/review.

I will also run some benchmark tests to show that the fix doesn't hinder performance and, more importantly, that the fix works, i.e. the MAE of the fixed version should be <= that of the older version.

Other considerations: both the original version of MAE and the fixed version sometimes SILENTLY fail to calculate correctly when negative sample weights are included (this can be observed by removing my surrounding np.abs(..) call in the brute force script when creating
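For reference, the brute-force ("naive") weighted median that the test scripts check against can be sketched in a few lines of NumPy. This is a minimal sketch under my own naming, assuming strictly positive weights; the tie rule of averaging the two straddling values when the weight mass splits exactly in half is a common convention and may differ from the exact behaviour of the Cython code:

```python
import numpy as np

def weighted_median(y, sample_weight):
    """Brute-force weighted median: the smallest y value at which the
    cumulative sample weight reaches half of the total weight.
    Assumes strictly positive weights (negative weights break the
    cumulative-sum argument, as noted above)."""
    order = np.argsort(y)
    y_sorted = np.asarray(y, dtype=float)[order]
    w_sorted = np.asarray(sample_weight, dtype=float)[order]
    cum_w = np.cumsum(w_sorted)
    total = cum_w[-1]
    # index of the first sample whose cumulative weight reaches total / 2
    k = np.searchsorted(cum_w, total / 2.0)
    if np.isclose(cum_w[k], total / 2.0):
        # weight mass splits exactly in half: average the two straddling values
        return 0.5 * (y_sorted[k] + y_sorted[k + 1])
    return y_sorted[k]
```

With uniform weights this reduces to the ordinary median, e.g. `weighted_median([1, 2, 3], [1, 1, 1])` gives 2.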
Brute force / naive test script: this simply creates lots of random datasets and pushes them through a DecisionTreeRegressor. Note that this test script places added emphasis on choosing datasets with duplicate values, as per the issue above.
Running the above on the various branches produces the following results:
(Note: I only ran the tests once, as I was looking to quickly check that everything is as expected.) It is encouraging to see that the additional logic hasn't affected the efficiency of the calculations time-wise (i.e., 28.52 seconds versus 28.55). It is also good to see that the MAE for median_fix versus median_fix_debug is the same (i.e., I haven't accidentally introduced error in the two slightly different implementations). It is interesting to see the difference in MAE between the fixed and original versions. The above script loops around 5000 times, so some exaggeration of the difference is to be expected(?). median_fix_debug is expected to take a lot longer, since it is also calculating the naive version of the median, k, and sum_w_0_k (n.b. I could make this much more efficient by just having one loop in
As a final test, I have used the Boston dataset to compare the existing implementation ("master (Origin)") against the newly fixed median version ("median_fix"). Note that here I am using a RandomForestRegressor; this is a good test of sample_weight and duplication of y, since the bootstrapping mechanism works by simply adding to or subtracting from the sample weight in order to simulate sampling with replacement. Here is the test script:
The results:
It's nice to see (with this dataset at least) that the fixed median model fits the training data better.
```
        original_median)
        return return_value
        push_index = self.samples.push(data, weight)
        if push_index == -1:
```
```
if push_index == -1:
    return -1
```
I added this to replicate what was previously being returned. However, we should only get a -1 when an exception occurs, i.e. a MemoryError. So in hindsight I think I can remove this check, since an exception in self.samples.push should terminate immediately...? I am not 100% sure though, being new to Cython.
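For what it's worth, the two error-signalling conventions in play can be contrasted in plain Python (hypothetical function names). In Cython, a function declared with `except -1` both returns the sentinel and raises, and callers propagate the exception automatically, which is why the explicit check above looks redundant:

```python
def push_with_sentinel(samples, value):
    # Sentinel convention: swallow the failure and signal it with -1,
    # which every caller must then remember to check.
    try:
        samples.append(value)
    except MemoryError:
        return -1
    return len(samples) - 1

def push_with_exception(samples, value):
    # Exception convention: a MemoryError raised inside append()
    # propagates to the caller immediately, so a trailing
    # "if index == -1" check after this call is dead code.
    samples.append(value)
    return len(samples) - 1
```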
While appreciating that it is an important bug, I don't think this fix can
make it into 0.20.0 (or at least not the RC), so it's best if we review it
with a clear view after the release of 0.20. It might be best to ping us on
your PRs in a couple of weeks.
Thanks @jnothman, no problem. Have to cut off somewhere. I will keep an eye on the release and ping again later. Cheers.
Hi @JohnStott, this looks like an interesting contribution! Are you still interested in working on this PR? 🙂
Hi Julien,

I believe this solution worked; it was just too late to be included in the release at the time. I haven't been able to look at the code for a few years now, so I'm not sure if it's still applicable? Please feel free to take over; unfortunately I don't have enough time to do so myself... It looks like the fix was mainly keeping track of position indexes: https://github.com/scikit-learn/scikit-learn/pull/11649/files

It might be worth looking into the efficiency of this solution, as I had to use "where" loops, which may impact datasets with many duplicate values (from memory?); perhaps there is a better way?
This PR should probably be closed, since it was superseded by #32100, which has been merged.
Thanks, closing this one then!
K is not being updated properly in certain situations where we have non-uniform sample weights. This occurs during a removal/pop from, or a push onto, the WeightedMedianCalculator. The proposed fix solves this problem by identifying the exact index at which a new/old sample is added/removed and applying additional logic to update K correctly.

Reference Issues/PRs

Fixes #10725 (BUG: Median not always being calculated correctly for DecisionTrees in the WeightedMedianCalculator)
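A pure-Python analogue of the data structure involved may help illustrate the mechanics. This is a simplified sketch with hypothetical class and method names: it keeps samples sorted and reports the exact insertion index on push, as the fix does, but it recomputes k on demand rather than updating it incrementally:

```python
import bisect

class SimpleWeightedMedian:
    """Toy pure-Python analogue of the WeightedMedianCalculator.

    Keeps (value, weight) pairs sorted by value and derives k, the
    smallest number of leading samples whose weight sum reaches half
    the total weight. The actual fix updates k and sum_w_0_k
    incrementally from the exact push/pop index; this sketch recomputes
    them on every query, trading speed for clarity. Positive weights
    are assumed throughout.
    """

    def __init__(self):
        self.values = []
        self.weights = []
        self.total_weight = 0.0

    def push(self, value, weight):
        # bisect yields the exact insertion index, the key piece of
        # information the fix uses to keep k correct under duplicates.
        idx = bisect.bisect_left(self.values, value)
        self.values.insert(idx, value)
        self.weights.insert(idx, weight)
        self.total_weight += weight
        return idx

    def remove(self, value):
        idx = self.values.index(value)
        self.total_weight -= self.weights[idx]
        del self.values[idx]
        del self.weights[idx]

    def median(self):
        half = self.total_weight / 2.0
        cum = 0.0
        for k, w in enumerate(self.weights):
            cum += w
            if cum >= half:
                if cum == half:
                    # weight mass splits exactly: average the two
                    # straddling values
                    return 0.5 * (self.values[k] + self.values[k + 1])
                return self.values[k]
```

With duplicate values and non-uniform weights, e.g. pushing (1, 1), (2, 1), (3, 4), this returns 3, the value at which the cumulative weight first reaches half of the total.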