
Conversation

@JohnStott
Contributor

K is not being updated properly in certain situations where we have non-uniform sample weights. This occurs during a removal/pop from, or push onto, the WeightedMedianCalculator. The proposed fix solves this problem by identifying the exact index at which a new sample is added or an old sample is removed, and applying additional logic to update K correctly.
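
For context, here is a minimal sketch of the invariant that the incrementally maintained state has to satisfy. The names k, sum_w_0_k and the tie handling below are only my illustration of the idea, not the exact internals; the real calculator keeps these values up to date on every push/pop rather than recomputing them from scratch:

import numpy as np

def naive_weighted_median(values, weights):
    # Reference computation: sort the samples, then find the smallest k such
    # that the weight of the first k samples reaches half of the total weight.
    order = np.argsort(values)
    values = np.asarray(values, dtype=float)[order]
    weights = np.asarray(weights, dtype=float)[order]
    total = weights.sum()
    sum_w_0_k = 0.0
    k = 0
    while sum_w_0_k < total / 2.0:
        sum_w_0_k += weights[k]
        k += 1
    if sum_w_0_k == total / 2.0:
        # The weight is split exactly in half: take the midpoint of the two
        # neighbouring values (one common convention).
        median = (values[k - 1] + values[k]) / 2.0
    else:
        median = values[k - 1]
    return k, sum_w_0_k, median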

Reference Issues/PRs

Fixes #10725 (BUG Median not always being calculated correctly for DecisionTrees in the WeightedMedianCalculator)

@JohnStott
Contributor Author

More detail...

1) Proof that there is an existing problem with the median calculations.

In addition to my original bug illustration (#10725), I have added a branch to my GitHub fork that makes testing for this issue much easier: https://github.com/JohnStott/scikit-learn/tree/median_issue_example

The median_issue_example branch is the existing version of sklearn (without this pull request's fix). I have created a new function, verify_state(), in the WeightedMedianCalculator class (tree_utils.pyx). It is called each time a sample is popped, removed or pushed, and simply checks that the median, k and sum_w_0_k are correct. It does this by looping over the whole of the current node's data (remember, the existing median, k and sum_w_0_k are calculated incrementally from memory state for efficiency). If a discrepancy is found, it raises an exception. This function is purely for testing and shouldn't be used in production, hence the separate branch.

2) Proof the fix solves this problem.
I have another branch that is identical to this pull request but which also contains the verify_state() code: https://github.com/JohnStott/scikit-learn/tree/median_fix_debug. This means we can test that this new version does in fact solve the issue.

3) Unit test(s) to ensure median calculation integrity.
I have created a "brute force / naive" script that stresses the median calculation (see my next post). Running it on my median_issue_example and median_fix_debug branches should highlight any problems.

For the production code, I have modified sklearn/tree/tests/test_tree.py with an additional test. With the existing implementation, this test data produces an incorrect median and thus a different tree from the fixed version. I am not sure whether this test is sufficient, or whether my comments there are ideal, so it would be good to discuss/review this further.
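
For reference, here is a minimal sketch of the kind of property such a test can check (this is not the exact test added to test_tree.py; the data and expected value are purely illustrative). With constant features the tree cannot split, so the single leaf should predict the weighted median of y:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def test_mae_leaf_value_is_weighted_median():
    # Constant X forces a single leaf; its value should be the weighted
    # median of y under sample_weight.
    X = np.ones((5, 1))
    y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    sample_weight = np.array([1.0, 1.0, 5.0, 1.0, 1.0])

    tree = DecisionTreeRegressor(criterion="mae", random_state=0)
    tree.fit(X, y, sample_weight=sample_weight)

    # Half of the total weight (4.5) is first reached at y == 3, so the
    # weighted median is 3 under any usual tie-handling convention.
    np.testing.assert_allclose(tree.predict(X), 3.0)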

I will also run some benchmark tests to show that the fix doesn't hinder performance and, more importantly, that the fix works, i.e. the MAE should be <= that of the older version.

Other considerations: both the original version of MAE and the fixed version sometimes SILENTLY fail to calculate the median correctly when negative sample weights are included (this can be observed by removing the surrounding np.abs(..) call in the brute force script when creating sample_weight). Garbage in, garbage out. I believe a fix for this would require more checking logic and could make the process less efficient time-wise. Is it worth the effort / loss of efficiency for such an edge case? We could throw an error in this situation, like the one thrown when the sample weights sum to 0. Perhaps this should be raised as a separate issue to encourage further discussion?
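
As a rough illustration of the error-raising option (a sketch only, not part of this PR; where such a check would actually live is up for discussion):

import numpy as np

def check_nonnegative_sample_weight(sample_weight):
    # Reject negative weights up front rather than silently computing a
    # meaningless weighted median further down the line.
    sample_weight = np.asarray(sample_weight, dtype=float)
    if np.any(sample_weight < 0):
        raise ValueError("sample_weight contains negative values; "
                         "the MAE criterion assumes non-negative weights")
    return sample_weight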

@JohnStott
Contributor Author

JohnStott commented Jul 21, 2018

Brute force / naive test script. This simply creates lots of random data sets and pushes them through a DecisionTreeRegressor. Note that this test script places added emphasis on datasets with duplicate values, as per the issue above.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
import time

mae = 0
maeSeconds = 0

#change these:
trials = 100
x_distincts = 10
y_distincts = 10
sample_weight_distincts = 10

#for n_samples in range(40000,40001):
#for n_samples in range(33,34):
for n_samples in range(2,50):
#for n_samples in range(2, 50000, 1000):
    seed = 0
    medianExceptionRaised = False
    while seed < trials:
        seed += 1
        np.random.seed(seed)

        # Randomly select X data structure:
        rand_X_type = int(np.ceil(np.random.rand(1) * 2)[0])
        # 1 to 50 variables:
        rand_X_vars = int(np.ceil(np.random.rand(1) * 50)[0])
        
        if rand_X_type == 1:
            # random norm:
            train_X = np.random.randn(n_samples, rand_X_vars)
        else:
            # test where x has mostly duplicate values!
            train_X = np.ceil((
                              np.random.rand(n_samples, rand_X_vars) 
                              * x_distincts) - np.floor(x_distincts / 2))

        # Randomly select y data structure:
        rand_y_type = int(np.ceil(np.random.rand(1) * 2)[0])
        # 1 to 10 y outputs per sample:
        rand_y_vars = int(np.ceil(np.random.rand(1) * 10)[0])

        if rand_y_type == 1:
            # random norm:
            train_y = np.random.randn(n_samples, rand_y_vars)
        else:
            # test where y has mostly duplicate values!
            train_y = np.ceil((
                              np.random.rand(n_samples, rand_y_vars) 
                              * y_distincts) - np.floor(y_distincts / 2))

        # Randomly select sample_weight data structure:
        rand_wt_type = int(np.ceil(np.random.rand(1) * 2)[0])

        if rand_wt_type == 1:
            # random norm:
            # Ensure sum of sample_weights are not smaller than or equal to 0:
            sample_weight = 0
            while np.sum(sample_weight) <= 0:
                sample_weight = np.abs(np.ravel(np.random.randn(n_samples, 1)))
        else:
            # 1 to 5 distinct wts:
            # Ensure sum of sample_weights are not smaller than or equal to 0:
            sample_weight = 0
            while np.sum(sample_weight) <= 0:
                sample_weight = np.abs(np.ravel(np.ceil((
                    np.random.rand(n_samples, 1) * sample_weight_distincts) - np.floor(sample_weight_distincts / 2)
                    )))

        start = time.time()
        
        wineTree = DecisionTreeRegressor(
            criterion='mae',
            random_state=seed)
        try:
            wineTree.fit(train_X, train_y, sample_weight=sample_weight)
        except ValueError as err:
            print ("n_samples: " + str(n_samples))
            print ("seed: " + str(seed))
            raise

        prediction = wineTree.predict(train_X)
        mae += mean_absolute_error(train_y, prediction,
                                    sample_weight=sample_weight)

        
        end = time.time()
        maeSeconds += (end - start)

    print ("n_samples: " + str(n_samples))

print ("Seconds Elapsed: " + str(maeSeconds))
print ("Total MAE: " + str(mae))   



@JohnStott
Contributor Author

JohnStott commented Jul 22, 2018

Running the above on the various branches produces the following results:

Branch                 Seconds Elapsed     Total MAE
master (Original)      28.5206606388092    495.060674631426
median_issue_example   Fails!              Fails!
median_fix             28.547548532486     10.3636367106389
median_fix_debug       234.1753885746      10.3636367106389

(Note: I only ran the tests once, as I was just looking to quickly check that everything is as expected.)

It is encouraging to see that the additional logic hasn't affected the time efficiency of the calculations (i.e., 28.52 seconds versus 28.55). It is also good to see that the MAE for median_fix versus median_fix_debug is the same (i.e., I haven't accidentally introduced error in the two slightly different implementations).

It is interesting to see the difference in MAE between the fixed and original versions. The above script loops around 5000 times, so some exaggeration of the difference is to be expected(?).

The median_fix_debug branch is expected to take a lot longer since it also calculates the naive versions of the median, k, and sum_w_0_k (NB: I could make this much more efficient by using a single loop in verify_state(), but it is just for demonstration purposes).

@JohnStott
Contributor Author

As a final test, I have used the Boston dataset to compare the existing implementation ("master (Original)") versus the newly fixed median version ("median_fix"). Note that here I am using a RandomForestRegressor; this is a good test of sample_weight handling and of duplicate y values, since the bootstrapping mechanism works by simply adding to or removing from a sample's weight in order to simulate sampling with replacement.
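
To spell out that equivalence (a sketch of the idea, not the forest's actual code): a bootstrap draw can be expressed as per-sample counts which are then used as sample weights, so a duplicated row simply gets a larger weight and an omitted row gets weight 0:

import numpy as np

rng = np.random.RandomState(0)
n_samples = 8

# Sampling with replacement, expressed as per-sample counts...
indices = rng.randint(0, n_samples, n_samples)
counts = np.bincount(indices, minlength=n_samples)

# ...which can be handed to fit() as sample_weight instead of actually
# duplicating/removing rows.
print(counts)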

Here is the test script:

import time
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

dataset = load_boston()
X_full, y_full = dataset.data, dataset.target

x = 1
total_mae = 0
total_seconds = 0

noOfTrees = 1000
while x <= 10: 
    estimator = RandomForestRegressor(random_state=x, n_estimators=noOfTrees,
                                      criterion="mae")

    start = time.time()
    estimator.fit(X_full, y_full)
    seconds = time.time() - start
    
    prediction = estimator.predict(X_full)
    mae = mean_absolute_error(y_full, prediction)

    print ("Loop #", str(x))
    print ("Seconds elapsed:", str(seconds))
    print ("MAE: ", str(mae))
    print ("")

    total_mae += mae
    total_seconds += seconds
    x += 1

print ("")
print ("Total Seconds elapsed:", str(total_seconds))
print ("Total MAE: ", str(total_mae))

@JohnStott
Contributor Author

The results:

     master (Original)              median_fix
     Test MAE   Seconds elapsed     Test MAE   Seconds elapsed
1    0.82701    22.50109            0.81779    22.69898
2    0.82477    22.48553            0.81234    22.45428
3    0.82858    22.48545            0.81862    22.43860
4    0.82266    22.43859            0.81243    22.40735
5    0.81545    22.46982            0.80810    22.39163
6    0.82332    22.42289            0.81008    22.36039
7    0.82873    22.46974            0.81871    22.39168
8    0.82093    22.45420            0.81791    22.36042
9    0.82839    22.39163            0.81739    22.31357
10   0.81844    22.43855            0.81175    22.32927
Sum  8.23828    224.55748           8.14513    224.14616

@JohnStott
Contributor Author

It's nice to see (with this dataset at least) that the fixed median model fits the training data better.

@JohnStott
Contributor Author

Review comment on these new lines:

push_index = self.samples.push(data, weight)
if push_index == -1:
    return -1

I added this to replicate what was previously being returned. However, we should only get a -1 when an exception occurs, i.e. a MemoryError, so in hindsight I think I can remove this check, since an exception in self.samples.push should terminate immediately...? I am not 100% sure though, being new to Cython.

@jnothman
Member

jnothman commented Jul 22, 2018 via email

@JohnStott
Contributor Author

Thanks @jnothman, no problem. Have to cut off somewhere. Will keep an eye on the release and remind later. Cheers.

Base automatically changed from master to main January 22, 2021 10:50
@jjerphan
Member

Hi @JohnStott, this looks like an interesting contribution!

Are you still interested in working on this PR? 🙂

@JohnStott
Contributor Author

JohnStott commented Jun 10, 2021 via email

@cakedev0
Contributor

This PR should probably be closed since it was superseded by #32100, which has been merged.

@lesteve
Member

lesteve commented Nov 17, 2025

Thanks, closing this one then!
