
[MRG+3] Fix min_weight_fraction_leaf to work when sample_weights are not provided #7301


Merged · 11 commits · Sep 28, 2016
7 changes: 7 additions & 0 deletions doc/whats_new.rst
@@ -24,6 +24,13 @@ Enhancements
Bug fixes
.........

- The ``min_weight_fraction_leaf`` parameter of tree-based classifiers and
regressors now assumes uniform sample weights by default if the
``sample_weight`` argument is not passed to the ``fit`` function.
Previously, the parameter was silently ignored. (`#7301
<https://github.com/scikit-learn/scikit-learn/pull/7301>`_) by `Nelson
Liu`_.

.. _changes_0_18:

Version 0.18
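To illustrate the behaviour described in the what's new entry above, here is a minimal sketch (not part of this diff; it assumes a scikit-learn build that includes this fix):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # 150 samples

# No sample_weight is passed, so each sample implicitly has weight 1.
# With this fix, min_weight_fraction_leaf=0.1 therefore requires every
# leaf to hold at least 0.1 * 150 = 15 samples; previously the setting
# was silently ignored in this case.
clf = DecisionTreeClassifier(min_weight_fraction_leaf=0.1, random_state=0)
clf.fit(X, y)

is_leaf = clf.tree_.children_left == -1
print(clf.tree_.n_node_samples[is_leaf].min())  # expected to be >= 15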
25 changes: 15 additions & 10 deletions sklearn/ensemble/forest.py
@@ -807,8 +807,9 @@ class RandomForestClassifier(ForestClassifier):
Added float values for percentages.

min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the input samples required to be at a
leaf node.
The minimum weighted fraction of the sum total of weights (of all
the input samples) required to be at a leaf node. Samples have
equal weight when sample_weight is not provided.

max_leaf_nodes : int or None, optional (default=None)
Grow trees with ``max_leaf_nodes`` in best-first fashion.
@@ -1018,8 +1019,9 @@ class RandomForestRegressor(ForestRegressor):
Added float values for percentages.

min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the input samples required to be at a
leaf node.
The minimum weighted fraction of the sum total of weights (of all
the input samples) required to be at a leaf node. Samples have
equal weight when sample_weight is not provided.

max_leaf_nodes : int or None, optional (default=None)
Grow trees with ``max_leaf_nodes`` in best-first fashion.
@@ -1189,8 +1191,9 @@ class ExtraTreesClassifier(ForestClassifier):
Added float values for percentages.

min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the input samples required to be at a
leaf node.
The minimum weighted fraction of the sum total of weights (of all
the input samples) required to be at a leaf node. Samples have
equal weight when sample_weight is not provided.

max_leaf_nodes : int or None, optional (default=None)
Grow trees with ``max_leaf_nodes`` in best-first fashion.
@@ -1399,8 +1402,9 @@ class ExtraTreesRegressor(ForestRegressor):
Added float values for percentages.

min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the input samples required to be at a
leaf node.
The minimum weighted fraction of the sum total of weights (of all
the input samples) required to be at a leaf node. Samples have
equal weight when sample_weight is not provided.

max_leaf_nodes : int or None, optional (default=None)
Grow trees with ``max_leaf_nodes`` in best-first fashion.
@@ -1556,8 +1560,9 @@ class RandomTreesEmbedding(BaseForest):
Added float values for percentages.

min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the input samples required to be at a
leaf node.
The minimum weighted fraction of the sum total of weights (of all
the input samples) required to be at a leaf node. Samples have
equal weight when sample_weight is not provided.

max_leaf_nodes : int or None, optional (default=None)
Grow trees with ``max_leaf_nodes`` in best-first fashion.
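As a companion to the reworded forest docstrings above, a small sketch of how the fraction relates to the total weight with and without explicit weights (illustrative values only; not part of this diff):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = rng.randint(0, 2, 200)

forest = RandomForestClassifier(min_weight_fraction_leaf=0.05, random_state=0)

# Without sample_weight: each sample counts as weight 1, so each tree
# requires at least 0.05 * 200 = 10 units of weight at every leaf.
forest.fit(X, y)

# With explicit weights: the requirement becomes 5% of the summed
# weights, i.e. 0.05 * (200 * 2.0) = 20 units of weight per leaf.
forest.fit(X, y, sample_weight=np.full(200, 2.0))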
10 changes: 6 additions & 4 deletions sklearn/ensemble/gradient_boosting.py
@@ -1330,8 +1330,9 @@ class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
Added float values for percentages.

min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the input samples required to be at a
leaf node.
The minimum weighted fraction of the sum total of weights (of all
the input samples) required to be at a leaf node. Samples have
equal weight when sample_weight is not provided.

subsample : float, optional (default=1.0)
The fraction of samples to be used for fitting the individual base
@@ -1698,8 +1699,9 @@ class GradientBoostingRegressor(BaseGradientBoosting, RegressorMixin):
Added float values for percentages.

min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the input samples required to be at a
leaf node.
The minimum weighted fraction of the sum total of weights (of all
the input samples) required to be at a leaf node. Samples have
equal weight when sample_weight is not provided.

subsample : float, optional (default=1.0)
The fraction of samples to be used for fitting the individual base
100 changes: 100 additions & 0 deletions sklearn/tree/tests/test_tree.py
@@ -670,6 +670,30 @@ def check_min_weight_fraction_leaf(name, datasets, sparse=False):
"min_weight_fraction_leaf={1}".format(
name, est.min_weight_fraction_leaf))

# test case with no weights passed in
total_weight = X.shape[0]

for max_leaf_nodes, frac in product((None, 1000), np.linspace(0, 0.5, 6)):
(Review thread on the line above)

Member: What is the duration of this test? If it's more than a couple hundred milliseconds, please reduce to np.linspace(0, 0.5, 3).

Contributor Author: Runs in .106ms on my machine, late 2013 basic MacBook Pro.

Member: Ok, that's fine then.
est = TreeEstimator(min_weight_fraction_leaf=frac,
max_leaf_nodes=max_leaf_nodes,
random_state=0)
est.fit(X, y)

if sparse:
out = est.tree_.apply(X.tocsr())
else:
out = est.tree_.apply(X)

node_weights = np.bincount(out)
# drop inner nodes
leaf_weights = node_weights[node_weights != 0]
assert_greater_equal(
np.min(leaf_weights),
total_weight * est.min_weight_fraction_leaf,
"Failed with {0} "
"min_weight_fraction_leaf={1}".format(
name, est.min_weight_fraction_leaf))


def test_min_weight_fraction_leaf():
# Check on dense input
@@ -681,6 +705,82 @@ def test_min_weight_fraction_leaf():
yield check_min_weight_fraction_leaf, name, "multilabel", True


def check_min_weight_fraction_leaf_with_min_samples_leaf(name, datasets,
sparse=False):
"""Test the interaction between min_weight_fraction_leaf and min_samples_leaf
when sample_weights is not provided in fit."""
if sparse:
X = DATASETS[datasets]["X_sparse"].astype(np.float32)
else:
X = DATASETS[datasets]["X"].astype(np.float32)
y = DATASETS[datasets]["y"]

total_weight = X.shape[0]
TreeEstimator = ALL_TREES[name]
for max_leaf_nodes, frac in product((None, 1000), np.linspace(0, 0.5, 3)):
# test integer min_samples_leaf
est = TreeEstimator(min_weight_fraction_leaf=frac,
max_leaf_nodes=max_leaf_nodes,
min_samples_leaf=5,
random_state=0)
est.fit(X, y)

if sparse:
out = est.tree_.apply(X.tocsr())
else:
out = est.tree_.apply(X)

node_weights = np.bincount(out)
# drop inner nodes
leaf_weights = node_weights[node_weights != 0]
assert_greater_equal(
np.min(leaf_weights),
max((total_weight *
est.min_weight_fraction_leaf), 5),
"Failed with {0} "
"min_weight_fraction_leaf={1}, "
"min_samples_leaf={2}".format(name,
est.min_weight_fraction_leaf,
est.min_samples_leaf))
for max_leaf_nodes, frac in product((None, 1000), np.linspace(0, 0.5, 3)):
# test float min_samples_leaf
est = TreeEstimator(min_weight_fraction_leaf=frac,
max_leaf_nodes=max_leaf_nodes,
min_samples_leaf=.1,
random_state=0)
est.fit(X, y)

if sparse:
out = est.tree_.apply(X.tocsr())
else:
out = est.tree_.apply(X)

node_weights = np.bincount(out)
# drop inner nodes
leaf_weights = node_weights[node_weights != 0]
assert_greater_equal(
np.min(leaf_weights),
max((total_weight * est.min_weight_fraction_leaf),
(total_weight * est.min_samples_leaf)),
"Failed with {0} "
"min_weight_fraction_leaf={1}, "
"min_samples_leaf={2}".format(name,
est.min_weight_fraction_leaf,
est.min_samples_leaf))


def test_min_weight_fraction_leaf_with_min_samples_leaf():
# Check on dense input
for name in ALL_TREES:
yield (check_min_weight_fraction_leaf_with_min_samples_leaf,
name, "iris")

# Check on sparse input
for name in SPARSE_TREES:
yield (check_min_weight_fraction_leaf_with_min_samples_leaf,
name, "multilabel", True)


def test_min_impurity_split():
# test if min_impurity_split creates leaves with impurity
# [0, min_impurity_split) when min_samples_leaf = 1 and
17 changes: 10 additions & 7 deletions sklearn/tree/tree.py
@@ -301,11 +301,12 @@ def fit(self, X, y, sample_weight=None, check_input=True,
sample_weight = expanded_class_weight

# Set min_weight_leaf from min_weight_fraction_leaf
-        if self.min_weight_fraction_leaf != 0. and sample_weight is not None:
+        if sample_weight is None:
             min_weight_leaf = (self.min_weight_fraction_leaf *
-                               np.sum(sample_weight))
+                               n_samples)
         else:
-            min_weight_leaf = 0.
+            min_weight_leaf = (self.min_weight_fraction_leaf *
+                               np.sum(sample_weight))

if self.min_impurity_split < 0.:
raise ValueError("min_impurity_split must be greater than or equal "
@@ -592,8 +593,9 @@ class DecisionTreeClassifier(BaseDecisionTree, ClassifierMixin):
Added float values for percentages.

min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the input samples required to be at a
leaf node.
The minimum weighted fraction of the sum total of weights (of all
the input samples) required to be at a leaf node. Samples have
equal weight when sample_weight is not provided.

max_leaf_nodes : int or None, optional (default=None)
Grow a tree with ``max_leaf_nodes`` in best-first fashion.
@@ -862,8 +864,9 @@ class DecisionTreeRegressor(BaseDecisionTree, RegressorMixin):
Added float values for percentages.

min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the input samples required to be at a
leaf node.
The minimum weighted fraction of the sum total of weights (of all
the input samples) required to be at a leaf node. Samples have
equal weight when sample_weight is not provided.

max_leaf_nodes : int or None, optional (default=None)
Grow a tree with ``max_leaf_nodes`` in best-first fashion.