Incorrect Poisson objective for decision tree/random forest #22186

Closed
RAMitchell opened this issue Jan 11, 2022 · 3 comments · Fixed by #22191
Comments

@RAMitchell

RAMitchell commented Jan 11, 2022

Describe the bug

@lorentzenchr

The Poisson objective has a slight mistake in its derivation.

Given an unregularised decision tree, as the depth increases we expect to see the training loss go to zero. This does not occur in sklearn. We noticed this while implementing the same objective in the cuml project.

The problem is that the loss of each child is normalised by the number of examples in that branch, so the left and right children receive equal weight even when the left child has many more examples than the right.

This can be corrected by replacing this code:

y_mean_left = self.sum_left[k] / self.weighted_n_left

With something equivalent to this: https://github.com/rapidsai/cuml/blob/416ce61a478a879a49d685e9b06dc4e6d25cb758/cpp/src/decisiontree/batched-levelalgo/objectives.cuh#L316
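To make the weighting concern concrete, here is a minimal numpy sketch (the helper name is hypothetical, not scikit-learn's internal API): with equal weighting, a one-sample child contributes as much to the split score as a 100-sample sibling, while weighting each child's loss by its sample count removes that bias.

```python
import numpy as np

def poisson_half_deviance(y):
    """Mean Poisson half deviance of a node that predicts its own mean."""
    y = np.asarray(y, dtype=float)
    mu = y.mean()
    # Per-sample half deviance: y * log(y / mu) - y + mu (targets here are > 0).
    return float(np.mean(y * np.log(y / mu) - y + mu))

left = np.array([1.0] * 99 + [5.0])  # 100 samples, one outlier
right = np.array([10.0])             # a single sample

# Equal-weight combination of the children (the behaviour described above):
equal = poisson_half_deviance(left) + poisson_half_deviance(right)

# Count-weighted combination (the proposed correction):
n = len(left) + len(right)
weighted = (len(left) * poisson_half_deviance(left)
            + len(right) * poisson_half_deviance(right)) / n
```

The single-sample right child has zero deviance, so under equal weighting it dilutes the split score far more than its one sample warrants.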

@RAMitchell RAMitchell added Bug Needs Triage Issue requires triage labels Jan 11, 2022
@lorentzenchr
Member

@RAMitchell Thanks for raising this issue. Do you have a minimal reproducible example where the described behaviour happens?
As we explicitly forbid splits that would produce a predicted value of 0 in a terminal node, your statement

as the depth increases we expect to see the training loss go to zero

does not hold.
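The zero-prediction constraint matters because the Poisson deviance contains a log of the prediction, so a leaf predicting 0 for a positive target has infinite loss. A minimal sketch of the deviance (hypothetical helper mirroring sklearn.metrics.mean_poisson_deviance, assuming strictly positive targets):

```python
import numpy as np

def poisson_deviance(y, mu):
    # 2 * (y * log(y / mu) - y + mu); zero exactly when mu == y elementwise
    y, mu = np.asarray(y, dtype=float), np.asarray(mu, dtype=float)
    return float(np.mean(2 * (y * np.log(y / mu) - y + mu)))

print(poisson_deviance([1.0, 2.0], [1.0, 2.0]))  # 0.0 at the optimum
print(poisson_deviance([1.0], [1e-12]))          # diverges as mu -> 0
```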

@RAMitchell
Author

RAMitchell commented Jan 11, 2022

We can illustrate the problem here: the MSE criterion reduces the Poisson training loss faster than the Poisson criterion itself does. This is corrected if the derivation is changed as suggested above.

from sklearn.tree import DecisionTreeRegressor as sklDT
from sklearn.metrics import mean_poisson_deviance
import numpy as np
import pandas as pd
from scipy.stats import beta
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

matplotlib.use("Agg")
sns.set()


def beta_dataset():
    np.random.seed(33)
    n = 1000

    X = np.random.random((n, 4)).astype(np.float32)
    a, b = 2.31, 0.627
    y = beta.rvs(a, b, size=n).astype(np.float32) * 10
    return X, y


rs = np.random.RandomState(92)
depths = range(1, 8)
min_impurity_decrease = 0  # 1e-5
algo = {
    "skl_dt_poisson": sklDT(
        random_state=rs,
        min_impurity_decrease=min_impurity_decrease,
        criterion="poisson",
    ),
    "skl_dt_mse": sklDT(
        random_state=rs,
        min_impurity_decrease=min_impurity_decrease,
        criterion="squared_error",  # named "mse" in scikit-learn < 1.0
    ),
}

datasets = {
    "poisson": beta_dataset(),
}
for data_name, (X, y) in datasets.items():
    X = X.astype(np.float32)
    y = y.astype(np.float32)
    rows = []
    for d in depths:
        for name, alg in algo.items():
            alg.set_params(max_depth=d)
            alg.fit(X, y)

            # Evaluate the training loss with the Poisson deviance.
            pred = alg.predict(X)
            accuracy = mean_poisson_deviance(y, pred)
            rows.append({"algorithm": name, "accuracy": accuracy, "depth": d})
    df = pd.DataFrame(rows)

    print(df)
    sns.lineplot(data=df, x="depth", y="accuracy", hue="algorithm")
    plt.ylabel("train poisson")
    plt.tight_layout()
    plt.savefig("poisson_convergence.png")
    plt.clf()

[Figure: poisson_convergence.png — train Poisson deviance vs. tree depth for both criteria]

@lorentzenchr
Member

You mean instead of

proxy_impurity_left -= y_mean_left * log(y_mean_left)
proxy_impurity_right -= y_mean_right * log(y_mean_right)

it should be

proxy_impurity_left -= self.sum_left[k] * log(y_mean_left)
proxy_impurity_right -= self.sum_right[k] * log(y_mean_right)

At first sight, I guess that's right and should be corrected.
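The suggested change can be checked numerically: if each child's Poisson half deviance is weighted by its sample count, the only split-dependent part is -sum(y) * log(mean(y)) per child, which is exactly the sum-based proxy above. A small sketch (helper names hypothetical):

```python
import numpy as np

def child_loss(y):
    """Count-weighted Poisson half deviance of a child predicting its mean."""
    y = np.asarray(y, dtype=float)
    mu = y.mean()
    return float(np.sum(y * np.log(y / mu) - y + mu))

def proxy(y):
    # Split-dependent part only; sum(y * log(y)) is constant for a fixed
    # parent node, so dropping it does not change the split ranking.
    y = np.asarray(y, dtype=float)
    return float(-y.sum() * np.log(y.mean()))

y = np.array([1.0, 2.0, 3.0, 8.0, 9.0])
full_a = child_loss(y[:2]) + child_loss(y[2:])   # split after 2 samples
full_b = child_loss(y[:3]) + child_loss(y[3:])   # split after 3 samples
prox_a = proxy(y[:2]) + proxy(y[2:])
prox_b = proxy(y[:3]) + proxy(y[3:])

# Both scores differ by the same constant sum(y * log(y)), so they
# rank the two candidate splits identically.
```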
