Commit df9c27a

Merge remote-tracking branch 'upstream/master' into datasets
2 parents 25667a5 + 4c29be4 commit df9c27a

89 files changed: +4389 -1917 lines changed

.landscape.yml

Lines changed: 0 additions & 5 deletions
This file was deleted.

azure-pipelines.yml

Lines changed: 6 additions & 0 deletions
@@ -39,6 +39,7 @@ jobs:
         PILLOW_VERSION: '*'
         PYTEST_VERSION: '*'
         JOBLIB_VERSION: '*'
+        THREADPOOLCTL_VERSION: '2.0.0'
         COVERAGE: 'true'

 - template: build_tools/azure/posix.yml
@@ -54,6 +55,7 @@ jobs:
         DISTRIB: 'ubuntu'
         PYTHON_VERSION: '3.6'
         JOBLIB_VERSION: '0.11'
+        THREADPOOLCTL_VERSION: '2.0.0'
       # Linux + Python 3.6 build with OpenBLAS and without SITE_JOBLIB
       py36_conda_openblas:
         DISTRIB: 'conda'
@@ -70,6 +72,7 @@ jobs:
         SCIKIT_IMAGE_VERSION: '*'
         # latest version of joblib available in conda for Python 3.6
         JOBLIB_VERSION: '0.13.2'
+        THREADPOOLCTL_VERSION: '2.0.0'
         COVERAGE: 'true'
       # Linux environment to test the latest available dependencies and MKL.
       # It runs tests requiring lightgbm, pandas and PyAMG.
@@ -92,6 +95,7 @@ jobs:
         DISTRIB: 'ubuntu-32'
         PYTHON_VERSION: '3.6'
         JOBLIB_VERSION: '0.13'
+        THREADPOOLCTL_VERSION: '2.0.0'

 - template: build_tools/azure/posix.yml
   parameters:
@@ -109,6 +113,7 @@ jobs:
         PILLOW_VERSION: '*'
         PYTEST_VERSION: '*'
         JOBLIB_VERSION: '*'
+        THREADPOOLCTL_VERSION: '2.0.0'
         COVERAGE: 'true'
       pylatest_conda_mkl_no_openmp:
         DISTRIB: 'conda'
@@ -120,6 +125,7 @@ jobs:
         PILLOW_VERSION: '*'
         PYTEST_VERSION: '*'
         JOBLIB_VERSION: '*'
+        THREADPOOLCTL_VERSION: '2.0.0'
         COVERAGE: 'true'
         SKLEARN_TEST_NO_OPENMP: 'true'
         SKLEARN_SKIP_OPENMP_TEST: 'true'

benchmarks/bench_hist_gradient_boosting.py

Lines changed: 26 additions & 7 deletions
@@ -32,6 +32,9 @@
 parser.add_argument('--n-samples-max', type=int, default=int(1e6))
 parser.add_argument('--n-features', type=int, default=20)
 parser.add_argument('--max-bins', type=int, default=255)
+parser.add_argument('--random-sample-weights', action="store_true",
+                    default=False,
+                    help="generate and use random sample weights")
 args = parser.parse_args()

 n_leaf_nodes = args.n_leaf_nodes
@@ -46,6 +49,7 @@ def get_estimator_and_data():
                                    n_features=args.n_features,
                                    n_classes=args.n_classes,
                                    n_clusters_per_class=1,
+                                   n_informative=args.n_classes,
                                    random_state=0)
         return X, y, HistGradientBoostingClassifier
     elif args.problem == 'regression':
@@ -60,15 +64,30 @@ def get_estimator_and_data():
         np.bool)
     X[mask] = np.nan

-X_train_, X_test_, y_train_, y_test_ = train_test_split(
-    X, y, test_size=0.5, random_state=0)
+if args.random_sample_weights:
+    sample_weight = np.random.rand(len(X)) * 10
+else:
+    sample_weight = None
+
+if sample_weight is not None:
+    (X_train_, X_test_, y_train_, y_test_,
+     sample_weight_train_, _) = train_test_split(
+         X, y, sample_weight, test_size=0.5, random_state=0)
+else:
+    X_train_, X_test_, y_train_, y_test_ = train_test_split(
+        X, y, test_size=0.5, random_state=0)
+    sample_weight_train_ = None


 def one_run(n_samples):
     X_train = X_train_[:n_samples]
     X_test = X_test_[:n_samples]
     y_train = y_train_[:n_samples]
     y_test = y_test_[:n_samples]
+    if sample_weight is not None:
+        sample_weight_train = sample_weight_train_[:n_samples]
+    else:
+        sample_weight_train = None
     assert X_train.shape[0] == n_samples
     assert X_test.shape[0] == n_samples
     print("Data size: %d samples train, %d samples test."
@@ -79,7 +98,7 @@ def one_run(n_samples):
                     max_iter=n_trees,
                     max_bins=max_bins,
                     max_leaf_nodes=n_leaf_nodes,
-                    n_iter_no_change=None,
+                    early_stopping=False,
                     random_state=0,
                     verbose=0)
     loss = args.loss
@@ -93,7 +112,7 @@ def one_run(n_samples):
         if loss == 'default':
             loss = 'least_squares'
     est.set_params(loss=loss)
-    est.fit(X_train, y_train)
+    est.fit(X_train, y_train, sample_weight=sample_weight_train)
     sklearn_fit_duration = time() - tic
     tic = time()
     sklearn_score = est.score(X_test, y_test)
@@ -110,7 +129,7 @@ def one_run(n_samples):
         lightgbm_est = get_equivalent_estimator(est, lib='lightgbm')

         tic = time()
-        lightgbm_est.fit(X_train, y_train)
+        lightgbm_est.fit(X_train, y_train, sample_weight=sample_weight_train)
         lightgbm_fit_duration = time() - tic
         tic = time()
         lightgbm_score = lightgbm_est.score(X_test, y_test)
@@ -127,7 +146,7 @@ def one_run(n_samples):
         xgb_est = get_equivalent_estimator(est, lib='xgboost')

         tic = time()
-        xgb_est.fit(X_train, y_train)
+        xgb_est.fit(X_train, y_train, sample_weight=sample_weight_train)
         xgb_fit_duration = time() - tic
         tic = time()
         xgb_score = xgb_est.score(X_test, y_test)
@@ -144,7 +163,7 @@ def one_run(n_samples):
         cat_est = get_equivalent_estimator(est, lib='catboost')

         tic = time()
-        cat_est.fit(X_train, y_train)
+        cat_est.fit(X_train, y_train, sample_weight=sample_weight_train)
         cat_fit_duration = time() - tic
         tic = time()
         cat_score = cat_est.score(X_test, y_test)
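
Review note: the benchmark change above splits the sample weights together with X and y only when weights were requested, and otherwise passes `sample_weight=None` to every `fit` call. The following is a minimal standard-library sketch of that branching pattern. `split_half` and `make_splits` are hypothetical helpers written for this note; `split_half` is a simplified stand-in for `train_test_split(..., test_size=0.5)`, not scikit-learn's implementation.

```python
import random


def split_half(*arrays, seed=0):
    """Hypothetical stand-in for train_test_split(test_size=0.5).

    One shared index permutation is applied to every array, so rows of
    X, y and the optional weights stay aligned after the split.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same length")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    train_idx, test_idx = idx[: n // 2], idx[n // 2:]
    out = []
    for a in arrays:
        out.append([a[i] for i in train_idx])
        out.append([a[i] for i in test_idx])
    return out


def make_splits(X, y, use_weights, seed=0):
    """Mirror the benchmark's branch: weights are split only if requested."""
    if use_weights:
        rng = random.Random(seed)
        sample_weight = [rng.random() * 10 for _ in X]
        X_tr, X_te, y_tr, y_te, w_tr, _ = split_half(
            X, y, sample_weight, seed=seed)
    else:
        X_tr, X_te, y_tr, y_te = split_half(X, y, seed=seed)
        w_tr = None  # forwarded unchanged as sample_weight=None
    return X_tr, X_te, y_tr, y_te, w_tr


X = [[i, i + 1] for i in range(10)]
y = [i % 2 for i in range(10)]
X_tr, X_te, y_tr, y_te, w_tr = make_splits(X, y, use_weights=True)
```

Keeping a single `w_tr` variable that is either a list or `None` lets every downstream `fit(..., sample_weight=w_tr)` call stay unconditional, which is the design the diff adopts.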

build_tools/azure/install.cmd

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -15,14 +15,16 @@ IF "%PYTHON_ARCH%"=="64" (

     call activate %VIRTUALENV%

+    pip install threadpoolctl
+
     IF "%PYTEST_VERSION%"=="*" (
         pip install pytest
     ) else (
         pip install pytest==%PYTEST_VERSION%
     )
     pip install pytest-xdist
 ) else (
-    pip install numpy scipy cython pytest wheel pillow joblib
+    pip install numpy scipy cython pytest wheel pillow joblib threadpoolctl
 )
 if "%COVERAGE%" == "true" (
     pip install coverage codecov pytest-cov

build_tools/azure/install.sh

Lines changed: 4 additions & 2 deletions
@@ -65,6 +65,8 @@ if [[ "$DISTRIB" == "conda" ]]; then

     make_conda $TO_INSTALL

+    pip install threadpoolctl==$THREADPOOLCTL_VERSION
+
     if [[ "$PYTEST_VERSION" == "*" ]]; then
         python -m pip install pytest
     else
@@ -81,13 +83,13 @@ elif [[ "$DISTRIB" == "ubuntu" ]]; then
     sudo apt-get install python3-scipy python3-matplotlib libatlas3-base libatlas-base-dev python3-virtualenv
     python3 -m virtualenv --system-site-packages --python=python3 $VIRTUALENV
     source $VIRTUALENV/bin/activate
-    python -m pip install pytest==$PYTEST_VERSION pytest-cov cython joblib==$JOBLIB_VERSION
+    python -m pip install pytest==$PYTEST_VERSION pytest-cov cython joblib==$JOBLIB_VERSION threadpoolctl==$THREADPOOLCTL_VERSION
 elif [[ "$DISTRIB" == "ubuntu-32" ]]; then
     apt-get update
     apt-get install -y python3-dev python3-scipy python3-matplotlib libatlas3-base libatlas-base-dev python3-virtualenv
     python3 -m virtualenv --system-site-packages --python=python3 $VIRTUALENV
     source $VIRTUALENV/bin/activate
-    python -m pip install pytest==$PYTEST_VERSION pytest-cov cython joblib==$JOBLIB_VERSION
+    python -m pip install pytest==$PYTEST_VERSION pytest-cov cython joblib==$JOBLIB_VERSION threadpoolctl==$THREADPOOLCTL_VERSION
 elif [[ "$DISTRIB" == "conda-pip-latest" ]]; then
     # Since conda main channel usually lacks behind on the latest releases,
     # we use pypi to test against the latest releases of the dependencies.

build_tools/azure/posix-32.yml

Lines changed: 1 addition & 0 deletions
@@ -36,6 +36,7 @@ jobs:
         -e JUNITXML=$JUNITXML
         -e VIRTUALENV=testvenv
         -e JOBLIB_VERSION=$JOBLIB_VERSION
+        -e THREADPOOLCTL_VERSION=$THREADPOOLCTL_VERSION
         -e PYTEST_VERSION=$PYTEST_VERSION
         -e OMP_NUM_THREADS=$OMP_NUM_THREADS
         -e OPENBLAS_NUM_THREADS=$OPENBLAS_NUM_THREADS

doc/developers/develop.rst

Lines changed: 19 additions & 14 deletions
@@ -481,58 +481,63 @@ runtime. The default values for the estimator tags are defined in the

 The current set of estimator tags are:

-allow_nan (default=``False``)
+allow_nan (default=False)
     whether the estimator supports data with missing values encoded as np.NaN

-binary_only (default=``False``)
+binary_only (default=False)
     whether estimator supports binary classification but lacks multi-class
     classification support.

-multilabel (default=``False``)
+multilabel (default=False)
     whether the estimator supports multilabel output

-multioutput (default=``False``)
+multioutput (default=False)
     whether a regressor supports multi-target outputs or a classifier supports
     multi-class multi-output.

-multioutput_only (default=``False``)
+multioutput_only (default=False)
     whether estimator supports only multi-output classification or regression.

-no_validation (default=``False``)
+no_validation (default=False)
     whether the estimator skips input-validation. This is only meant for
     stateless and dummy transformers!

-non_deterministic (default=``False``)
+non_deterministic (default=False)
     whether the estimator is not deterministic given a fixed ``random_state``

-poor_score (default=``False``)
+poor_score (default=False)
     whether the estimator fails to provide a "reasonable" test-set score, which
     currently for regression is an R2 of 0.5 on a subset of the boston housing
     dataset, and for classification an accuracy of 0.83 on
     ``make_blobs(n_samples=300, random_state=0)``. These datasets and values
     are based on current estimators in sklearn and might be replaced by
     something more systematic.

-requires_fit (default=``True``)
+requires_fit (default=True)
     whether the estimator requires to be fitted before calling one of
     `transform`, `predict`, `predict_proba`, or `decision_function`.

-requires_positive_X (default=``False``)
+requires_positive_X (default=False)
     whether the estimator requires positive X.

-requires_positive_y (default=``False``)
+requires_positive_y (default=False)
     whether the estimator requires a positive y (only applicable for regression).

-_skip_test (default=``False``)
+_skip_test (default=False)
     whether to skip common tests entirely. Don't use this unless you have a
     *very good* reason.

-stateless (default=``False``)
+_xfail_test (default=False)
+    dictionary ``{check_name : reason}`` of common checks to mark as a
+    known failure, with the associated reason. Don't use this unless you have a
+    *very good* reason.
+
+stateless (default=False)
     whether the estimator needs access to data for fitting. Even though an
     estimator is stateless, it might still need a call to ``fit`` for
     initialization.

-X_types (default=``['2darray']``)
+X_types (default=['2darray'])
     Supported input types for X as list of strings. Tests are currently only
     run if '2darray' is contained in the list, signifying that the estimator
     takes continuous 2d numpy arrays as input. The default value is
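
Review note: the new `_xfail_test` tag is described above as a dictionary ``{check_name : reason}``. The sketch below shows how such a tag could be declared and looked up. It assumes the estimator overrides tags through a `_more_tags` method (not shown in this hunk); `MyEstimator`, `DEFAULT_TAGS`, `collect_tags` and `xfail_reason` are hypothetical names for illustration only, and the check name and reason are made up rather than taken from this commit.

```python
class MyEstimator:
    """Hypothetical estimator; only the tag plumbing is shown."""

    def _more_tags(self):
        # Maps a common-check name to the reason it is a known failure.
        return {
            "_xfail_test": {
                "check_methods_subset_invariance":
                    "fails for the decision_function method",
            }
        }


# Defaults as listed in the documentation hunk (subset shown).
DEFAULT_TAGS = {"_skip_test": False, "_xfail_test": False}


def collect_tags(estimator):
    """Merge the estimator's overrides on top of the defaults."""
    tags = dict(DEFAULT_TAGS)
    tags.update(estimator._more_tags())
    return tags


def xfail_reason(estimator, check_name):
    """Return the recorded reason, or None if the check runs normally."""
    xfail = collect_tags(estimator)["_xfail_test"] or {}
    return xfail.get(check_name)


est = MyEstimator()
reason = xfail_reason(est, "check_methods_subset_invariance")
```

Because the default is `False` rather than an empty dictionary, a lookup helper has to tolerate both values, hence the `or {}` above.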

doc/developers/tips.rst

Lines changed: 2 additions & 0 deletions
@@ -86,6 +86,8 @@ Other `pytest` options that may become useful include:
   - ``-s`` so that pytest does not capture the output of ``print()``
     statements
   - ``--tb=short`` or ``--tb=line`` to control the length of the logs
+  - ``--runxfail`` also run tests marked as a known failure (XFAIL) and report
+    errors.

 Since our continuous integration tests will error if
 ``FutureWarning`` isn't properly caught,

doc/modules/clustering.rst

Lines changed: 8 additions & 14 deletions
@@ -205,23 +205,17 @@ computing cluster centers and values of inertia. For example, assigning a
 weight of 2 to a sample is equivalent to adding a duplicate of that sample
 to the dataset :math:`X`.

-A parameter can be given to allow K-means to be run in parallel, called
-``n_jobs``. Giving this parameter a positive value uses that many processors
-(default: 1). A value of -1 uses all available processors, with -2 using one
-less, and so on. Parallelization generally speeds up computation at the cost of
-memory (in this case, multiple copies of centroids need to be stored, one for
-each job).
-
-.. warning::
-
-   The parallel version of K-Means is broken on OS X when `numpy` uses the
-   `Accelerate` Framework. This is expected behavior: `Accelerate` can be called
-   after a fork but you need to execv the subprocess with the Python binary
-   (which multiprocessing does not do under posix).
-
 K-means can be used for vector quantization. This is achieved using the
 transform method of a trained model of :class:`KMeans`.

+Low-level parallelism
+---------------------
+
+:class:`KMeans` benefits from OpenMP based parallelism through Cython. Small
+chunks of data (256 samples) are processed in parallel, which in addition
+yields a low memory footprint. For more details on how to control the number of
+threads, please refer to our :ref:`parallelism` notes.
+
 .. topic:: Examples:

  * :ref:`sphx_glr_auto_examples_cluster_plot_kmeans_assumptions.py`: Demonstrating when
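
Review note: the new documentation says samples are processed in chunks of 256 in parallel, which keeps memory use low. The pure-Python sketch below illustrates only that chunking idea for the label-assignment step. It is not the Cython/OpenMP code from this commit: it swaps in `concurrent.futures` threads, which do not give real CPU parallelism under the GIL, and every function name here is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 256  # chunk size quoted in the documentation text


def nearest_center(sample, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    dists = [sum((s - c) ** 2 for s, c in zip(sample, center))
             for center in centers]
    return dists.index(min(dists))


def label_chunk(chunk, centers):
    # Only one chunk's worth of distances is alive per worker at a time,
    # which is where the low memory footprint comes from.
    return [nearest_center(sample, centers) for sample in chunk]


def assign_labels(X, centers, n_threads=2):
    chunks = [X[i:i + CHUNK_SIZE] for i in range(0, len(X), CHUNK_SIZE)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map() preserves chunk order, so labels line up with X.
        results = pool.map(lambda chunk: label_chunk(chunk, centers), chunks)
        return [label for chunk_labels in results for label in chunk_labels]


X = [[float(i % 2) * 10.0, 0.0] for i in range(1000)]
centers = [[0.0, 0.0], [10.0, 0.0]]
labels = assign_labels(X, centers)
```

Compared with the removed `n_jobs` approach, chunk-level work sharing needs no per-job copy of the centroids, which matches the memory argument made in the old and new text.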

doc/modules/ensemble.rst

Lines changed: 34 additions & 2 deletions
@@ -856,8 +856,7 @@ leverage integer-based data structures (histograms) instead of relying on
 sorted continuous values when building the trees. The API of these
 estimators is slightly different, and some of the features from
 :class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`
-are not yet supported: in particular sample weights, and some loss
-functions.
+are not yet supported, for instance some loss functions.

 These estimators are still **experimental**: their predictions
 and their API might change without any deprecation cycle. To use them, you
@@ -957,6 +956,39 @@ If no missing values were encountered for a given feature during training,
 then samples with missing values are mapped to whichever child has the most
 samples.

+Sample weight support
+---------------------
+
+:class:`HistGradientBoostingClassifier` and
+:class:`HistGradientBoostingRegressor` sample support weights during
+:term:`fit`.
+
+The following toy example demonstrates how the model ignores the samples with
+zero sample weights:
+
+    >>> X = [[1, 0],
+    ...      [1, 0],
+    ...      [1, 0],
+    ...      [0, 1]]
+    >>> y = [0, 0, 1, 0]
+    >>> # ignore the first 2 training samples by setting their weight to 0
+    >>> sample_weight = [0, 0, 1, 1]
+    >>> gb = HistGradientBoostingClassifier(min_samples_leaf=1)
+    >>> gb.fit(X, y, sample_weight=sample_weight)
+    HistGradientBoostingClassifier(...)
+    >>> gb.predict([[1, 0]])
+    array([1])
+    >>> gb.predict_proba([[1, 0]])[0, 1]
+    0.99...
+
+As you can see, the `[1, 0]` is comfortably classified as `1` since the first
+two samples are ignored due to their sample weights.
+
+Implementation detail: taking sample weights into account amounts to
+multiplying the gradients (and the hessians) by the sample weights. Note that
+the binning stage (specifically the quantiles computation) does not take the
+weights into account.
+
 Low-level parallelism
 ---------------------
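
Review note: the implementation detail above says sample weights are applied by multiplying the gradients and hessians. The sketch below works that out for a single leaf under a squared-error loss `0.5 * (raw_pred - y) ** 2` and an unregularized Newton step. Both are assumptions chosen for illustration, and the helper names are hypothetical rather than taken from the library.

```python
def weighted_grad_hess(y_true, raw_pred, sample_weight=None):
    """Per-sample gradients and hessians of 0.5 * (raw_pred - y) ** 2."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    gradients = [w * (p - y)
                 for w, p, y in zip(sample_weight, raw_pred, y_true)]
    hessians = [w * 1.0 for w in sample_weight]  # unweighted hessian is 1
    return gradients, hessians


def leaf_value(gradients, hessians):
    """Newton step for one leaf, without regularization."""
    return -sum(gradients) / sum(hessians)


y = [0.0, 0.0, 1.0, 0.0]
raw_pred = [0.0, 0.0, 0.0, 0.0]

# Zero weights on the first two samples ...
g, h = weighted_grad_hess(y, raw_pred, sample_weight=[0, 0, 1, 1])
# ... behave like dropping those samples altogether.
g_sub, h_sub = weighted_grad_hess(y[2:], raw_pred[2:])
```

Under these assumptions both calls produce the same leaf value, which is the effect the toy doctest in the hunk relies on: samples with zero weight contribute nothing to either sum.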
