Incorrect Poisson objective for decision tree/random forest #22186

Closed
RAMitchell opened this issue Jan 11, 2022 · 3 comments · Fixed by #22191
Comments

@RAMitchell

RAMitchell commented Jan 11, 2022

Describe the bug

@lorentzenchr

The Poisson objective has a slight mistake in its derivation.

Given an unregularised decision tree, as the depth increases we expect to see the training loss go to zero. This does not occur in sklearn. We noticed this while implementing the same objective in the cuml project.

The problem is that the loss of each child is normalised by the number of examples in that branch, so the left and right children receive equal weight even when the left child has many more examples than the right.

This can be corrected by replacing this code:

y_mean_left = self.sum_left[k] / self.weighted_n_left

With something equivalent to this: https://github.com/rapidsai/cuml/blob/416ce61a478a879a49d685e9b06dc4e6d25cb758/cpp/src/decisiontree/batched-levelalgo/objectives.cuh#L316
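To make the weighting concern concrete, here is a minimal numpy sketch (the helper name is hypothetical, not scikit-learn's internal API): with equal weighting, a one-sample child contributes as much to the split score as a 100-sample sibling, while weighting each child's loss by its sample count removes that bias.

```python
import numpy as np

def poisson_half_deviance(y):
    """Mean Poisson half deviance of a node that predicts its own mean."""
    y = np.asarray(y, dtype=float)
    mu = y.mean()
    # Per-sample half deviance: y * log(y / mu) - y + mu (targets here are > 0).
    return float(np.mean(y * np.log(y / mu) - y + mu))

left = np.array([1.0] * 99 + [5.0])  # 100 samples, one outlier
right = np.array([10.0])             # a single sample

# Equal-weight combination of the children (the behaviour described above):
equal = poisson_half_deviance(left) + poisson_half_deviance(right)

# Count-weighted combination (the proposed correction):
n = len(left) + len(right)
weighted = (len(left) * poisson_half_deviance(left)
            + len(right) * poisson_half_deviance(right)) / n
```

The single-sample right child has zero deviance, so under equal weighting it dilutes the split score far more than its one sample warrants.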

@RAMitchell RAMitchell added Bug Needs Triage Issue requires triage labels Jan 11, 2022
@lorentzenchr
Member

@RAMitchell Thanks for raising this issue. Do you have a minimal reproducible example where the described behaviour happens?
As we explicitly forbid splits that would produce a predicted value of 0 in a terminal node, your statement

as the depth increases we expect to see the training loss go to zero

does not hold.
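The zero-prediction constraint matters because the Poisson deviance contains a log of the prediction, so a leaf predicting 0 for a positive target has infinite loss. A minimal sketch of the deviance (hypothetical helper mirroring sklearn.metrics.mean_poisson_deviance, assuming strictly positive targets):

```python
import numpy as np

def poisson_deviance(y, mu):
    # 2 * (y * log(y / mu) - y + mu); zero exactly when mu == y elementwise
    y, mu = np.asarray(y, dtype=float), np.asarray(mu, dtype=float)
    return float(np.mean(2 * (y * np.log(y / mu) - y + mu)))

print(poisson_deviance([1.0, 2.0], [1.0, 2.0]))  # 0.0 at the optimum
print(poisson_deviance([1.0], [1e-12]))          # diverges as mu -> 0
```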

@RAMitchell
Author

RAMitchell commented Jan 11, 2022

We can illustrate the problem here: the MSE criterion reduces the Poisson training loss faster than the Poisson criterion itself does. This is corrected if the derivation is changed as suggested above.

from sklearn.tree import DecisionTreeRegressor as sklDT
from sklearn.metrics import mean_poisson_deviance
import numpy as np
import pandas as pd
from scipy.stats import beta
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

matplotlib.use("Agg")
sns.set()


def beta_dataset():
    np.random.seed(33)
    n = 1000

    X = np.random.random((n, 4)).astype(np.float32)
    a, b = 2.31, 0.627
    y = beta.rvs(a, b, size=n).astype(np.float32) * 10
    return X, y


rs = np.random.RandomState(92)
depths = range(1, 8)
min_impurity_decrease = 0  # 1e-5
algo = {
    "skl_dt_poisson": sklDT(
        random_state=rs,
        min_impurity_decrease=min_impurity_decrease,
        criterion="poisson",
    ),
    "skl_dt_mse": sklDT(
        random_state=rs,
        min_impurity_decrease=min_impurity_decrease,
        criterion="squared_error",  # named "mse" in scikit-learn < 1.0
    ),
}

datasets = {
    "poisson": beta_dataset(),
}
for data_name, (X, y) in datasets.items():
    X = X.astype(np.float32)
    y = y.astype(np.float32)
    rows = []
    for d in depths:
        for name, alg in algo.items():
            alg.set_params(max_depth=d)
            alg.fit(X, y)

            # Evaluate the training loss with the Poisson deviance.
            pred = alg.predict(X)
            accuracy = mean_poisson_deviance(y, pred)
            rows.append({"algorithm": name, "accuracy": accuracy, "depth": d})
    df = pd.DataFrame(rows)

    print(df)
    sns.lineplot(data=df, x="depth", y="accuracy", hue="algorithm")
    plt.ylabel("train poisson")
    plt.tight_layout()
    plt.savefig("poisson_convergence.png")
    plt.clf()

[Figure: poisson_convergence.png — train Poisson deviance vs. tree depth for both criteria]

@lorentzenchr
Member

You mean instead of

proxy_impurity_left -= y_mean_left * log(y_mean_left)
proxy_impurity_right -= y_mean_right * log(y_mean_right)

it should be

proxy_impurity_left -= self.sum_left[k] * log(y_mean_left)
proxy_impurity_right -= self.sum_right[k] * log(y_mean_right)

At first sight, I guess that's right and should be corrected.
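The suggested change can be checked numerically: if each child's Poisson half deviance is weighted by its sample count, the only split-dependent part is -sum(y) * log(mean(y)) per child, which is exactly the sum-based proxy above. A small sketch (helper names hypothetical):

```python
import numpy as np

def child_loss(y):
    """Count-weighted Poisson half deviance of a child predicting its mean."""
    y = np.asarray(y, dtype=float)
    mu = y.mean()
    return float(np.sum(y * np.log(y / mu) - y + mu))

def proxy(y):
    # Split-dependent part only; sum(y * log(y)) is constant for a fixed
    # parent node, so dropping it does not change the split ranking.
    y = np.asarray(y, dtype=float)
    return float(-y.sum() * np.log(y.mean()))

y = np.array([1.0, 2.0, 3.0, 8.0, 9.0])
full_a = child_loss(y[:2]) + child_loss(y[2:])   # split after 2 samples
full_b = child_loss(y[:3]) + child_loss(y[3:])   # split after 3 samples
prox_a = proxy(y[:2]) + proxy(y[2:])
prox_b = proxy(y[:3]) + proxy(y[3:])

# Both scores differ by the same constant sum(y * log(y)), so they
# rank the two candidate splits identically.
```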
