Incorrect Poisson objective for decision tree/random forest #22186
@RAMitchell Thanks for raising this issue. Do you have a minimal reproducible example where the described behaviour happens?
does not hold.
We can illustrate the problem with the script below, where the MSE objective converges faster than the Poisson objective on the Poisson training loss. This is corrected if the derivation is changed as suggested above.

```python
import numpy as np
import pandas as pd
from scipy.stats import beta

import matplotlib
matplotlib.use("Agg")  # select the non-interactive backend before importing pyplot
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeRegressor as sklDT
from sklearn.metrics import mean_poisson_deviance

sns.set()


def beta_dataset():
    # targets drawn from a scaled beta distribution: strictly positive,
    # so the Poisson criterion is applicable
    np.random.seed(33)
    n = 1000
    X = np.random.random((n, 4)).astype(np.float32)
    a, b = 2.31, 0.627
    y = beta.rvs(a, b, size=n).astype(np.float32) * 10
    return X, y


rs = np.random.RandomState(92)
depths = range(1, 8)
min_impurity_decrease = 0  # 1e-5

algo = {
    "skl_dt_poisson": sklDT(
        random_state=rs,
        min_impurity_decrease=min_impurity_decrease,
        criterion="poisson",
    ),
    "skl_dt_mse": sklDT(
        random_state=rs,
        min_impurity_decrease=min_impurity_decrease,
        criterion="mse",
    ),
}
datasets = {
    "poisson": beta_dataset(),
}

for data_name, (X, y) in datasets.items():
    X = X.astype(np.float32)
    y = y.astype(np.float32)
    rows = []
    for d in depths:
        for name, alg in algo.items():
            alg.set_params(max_depth=d)
            alg.fit(X, y)
            pred = alg.predict(X)
            accuracy = mean_poisson_deviance(y, pred)
            rows.append({"algorithm": name, "accuracy": accuracy, "depth": d})
    df = pd.DataFrame(rows)
    print(df)
    sns.lineplot(data=df, x="depth", y="accuracy", hue="algorithm")
    plt.ylabel("train poisson")
    plt.tight_layout()
    plt.savefig("poisson_convergence.png")
    plt.clf()
```
You mean instead of

```
proxy_impurity_left -= y_mean_left * log(y_mean_left)
proxy_impurity_right -= y_mean_right * log(y_mean_right)
```

it should be

```
proxy_impurity_left -= self.sum_left[k] * log(y_mean_left)
proxy_impurity_right -= self.sum_right[k] * log(y_mean_right)
```

At first sight, I guess that's right and should be corrected.
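To sanity-check the proposed weighting, here is a small standalone sketch (not from the issue; function names are illustrative) showing that, on 1-D data, maximizing the sum-weighted proxy picks the same split as directly minimizing the Poisson deviance:

```python
import numpy as np

def mean_poisson_dev(y, pred):
    # mean Poisson deviance; assumes y > 0 and pred > 0
    return np.mean(2 * (y * np.log(y / pred) - y + pred))

def proxy_sum_weighted(y_left, y_right):
    # proposed fix: weight each child's log-mean by its label sum
    return (y_left.sum() * np.log(y_left.mean())
            + y_right.sum() * np.log(y_right.mean()))

rng = np.random.default_rng(0)
# strictly positive targets with a change point at index 90
y = rng.poisson(lam=np.r_[np.full(90, 1.0), np.full(10, 8.0)]) + 1.0

def best_split_by_proxy():
    scores = [proxy_sum_weighted(y[:i], y[i:]) for i in range(1, len(y))]
    return 1 + int(np.argmax(scores))

def best_split_by_deviance():
    def dev(i):
        # each child predicts its own mean
        pred = np.r_[np.full(i, y[:i].mean()), np.full(len(y) - i, y[i:].mean())]
        return mean_poisson_dev(y, pred)
    return 1 + int(np.argmin([dev(i) for i in range(1, len(y))]))

print(best_split_by_proxy(), best_split_by_deviance())
```

Since each child predicts its own mean, the `-y + pred` terms of the deviance cancel within each child, so the sum-weighted proxy equals a constant minus the deviance and the two criteria select the same split exactly.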
Describe the bug
@lorentzenchr
The Poisson objective has a slight mistake in its derivation.
Given an unregularised decision tree, as the depth increases we expect to see the training loss go to zero. This does not occur in sklearn. We noticed this while implementing the same objective in the cuml project.
The problem is that the losses of the left and right children get normalised by the number of examples in each branch, so the two children carry equal weight even when the left child has many more examples than the right child.
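For reference, a sketch from the standard Poisson deviance (not quoted from the issue): with each leaf predicting its target mean $\hat{y}_i = \bar{y}_L$ or $\bar{y}_R$, the half deviance of a candidate split is

$$
D = \sum_{i} \left( y_i \log\frac{y_i}{\hat{y}_i} - y_i + \hat{y}_i \right),
$$

and the $-y_i + \hat{y}_i$ terms cancel when summed over each child. Up to a split-independent constant, the split-dependent part is therefore

$$
-S_L \log \bar{y}_L - S_R \log \bar{y}_R, \qquad S_L = \sum_{i \in L} y_i, \quad S_R = \sum_{i \in R} y_i,
$$

so each child's contribution must be weighted by its label sum $S$, not just by its mean.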
This can be corrected by replacing this code:
scikit-learn/sklearn/tree/_criterion.pyx, line 1394 at ff09c8a
With something equivalent to this: https://github.com/rapidsai/cuml/blob/416ce61a478a879a49d685e9b06dc4e6d25cb758/cpp/src/decisiontree/batched-levelalgo/objectives.cuh#L316
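For concreteness, a hedged Python sketch of what the corrected proxy computation amounts to (variable names echo the Cython snippet discussed in this thread; this is not the actual patch):

```python
import numpy as np

def poisson_proxy_improvement(y_left, y_right):
    """Score to maximize over candidate splits.

    Up to a constant, the Poisson deviance of a split with each leaf
    predicting its mean is -sum(y_L)*log(mean(y_L)) - sum(y_R)*log(mean(y_R)),
    so we return the negation of that quantity: larger is better.
    """
    y_mean_left = y_left.mean()
    y_mean_right = y_right.mean()
    proxy = 0.0
    # weight by the child's label sum, not its mean (the proposed fix)
    proxy += y_left.sum() * np.log(y_mean_left)
    proxy += y_right.sum() * np.log(y_mean_right)
    return proxy
```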