DOC: Poisson criterion is not slower than MSE in decision trees #32203
Conversation
For completeness and self-containment of the PR, can you link/copy the code and the relevant results comparing poisson and MSE? Thanks!
@thomasjpfan do you have any historical context for why the docs say poisson is much slower than MSE?
The PR that added the poisson loss: #17386. The original code was basically the same as today's. I quickly browsed the reviews and I think this statement was just never challenged (which seems fair; I tend to challenge people when they say something is fast, but much less when they say it's slow 😂). My hypothesis: it's true that the criterion-related computations are much slower for the Poisson loss than for MSE, which is why the PR author added this comment. But because tree building is sort-dominated, this doesn't change the total execution time much.
From memory, it was because of how the Poisson criterion computes its impurity (scikit-learn/sklearn/tree/_criterion.pyx, lines 1580 to 1588 in be9ac7a), whereas MSE can compute it more cheaply (scikit-learn/sklearn/tree/_criterion.pyx, lines 1076 to 1080 in be9ac7a).

Although, "much slower" could be an overstatement. @lorentzenchr Do you recall why poisson was marked as "much slower" in the docs?
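To illustrate the difference in per-node cost being discussed, here is a rough pure-Python sketch (my own illustration, not the actual Cython code from `_criterion.pyx`): the MSE impurity reduces to two running sums, while the half Poisson deviance needs a logarithm per sample in the node.

```python
import numpy as np

def mse_impurity(y):
    # Illustrative: the MSE impurity (the variance) follows from two
    # sufficient statistics, sum(y) and sum(y^2), which a tree builder
    # can maintain incrementally: Var(y) = E[y^2] - E[y]^2.
    s, sq, n = y.sum(), (y ** 2).sum(), len(y)
    return sq / n - (s / n) ** 2

def poisson_impurity(y):
    # Illustrative: the half Poisson deviance needs a log per sample,
    # so it cannot be reduced to a couple of running sums in the same way.
    mean = y.mean()
    return np.mean(y * np.log(y / mean) - y + mean)

rng = np.random.default_rng(0)
y = rng.random(1_000) + 0.1  # strictly positive, as Poisson requires
print(mse_impurity(y), poisson_impurity(y))
```

This per-sample log is the plausible source of the "slower" remark; the thread's point is that the sort, not the criterion, dominates total fit time anyway.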
Bump here. While it remains unclear why this statement was added to the docs, I feel that we have enough evidence to remove it:
Here is a new, more extensive benchmark, trying to explore cases where the sort might be less dominant in the execution time (duplicates, no max_depth). In all cases, poisson is less than 25% slower than MSE.

```python
from time import perf_counter

import numpy

from sklearn.tree import DecisionTreeRegressor

if __name__ == "__main__":
    n_fit = 15
    n_skip = 5  # discard warm-up runs when averaging
    for d in [2, 20]:
        for with_duplicates in [False, True]:
            for max_depth in [4, None]:
                for criterion in ["squared_error", "poisson"]:
                    n = 2_000_000 // d
                    dts = []
                    for _ in range(n_fit):
                        X = numpy.random.rand(n, d)
                        if with_duplicates:
                            X = X.round(2)
                        y = numpy.random.rand(n) + X.sum(axis=1)
                        t = perf_counter()
                        tree = DecisionTreeRegressor(
                            criterion=criterion,
                            max_features=d,
                            max_depth=max_depth,
                        )
                        tree.fit(X, y)
                        dts.append(perf_counter() - t)
                    avg = numpy.mean(dts[n_skip:])
                    std = numpy.std(dts[n_skip:])
                    print(
                        f"d={d}; with_duplicates={with_duplicates}; "
                        f"max_depth={max_depth}; criterion={criterion}:"
                        f" {avg:.2f} ± {std:.3f}s"
                    )
                print()
```

Results:
In the user guide, remove the sentence:
From:
As it's not true: the poisson criterion is only ~10% slower than the MSE criterion. I did the experiment with the same script as for PR #32181, for both criteria; the execution time is vastly dominated by the sort (`sort_samples_and_feature_values`).