
[MRG] Estimator tags #8022


Merged Feb 23, 2019. Changes shown from all 219 commits.

Commits
98d9aff
make common tests work on estimator instances, not classes
amueller Dec 8, 2016
165727a
checking whether an instance is default-constructible doesn't make a …
amueller Dec 8, 2016
660bc44
more instantiations
amueller Dec 8, 2016
74d10b6
minor fixes to for type vs instance, allow both as input to check_est…
amueller Dec 9, 2016
d29ca95
might actually run now
amueller Dec 9, 2016
b68c822
add _required_parameters as class attributes, get rid of METAESTIMATO…
amueller Dec 9, 2016
1ea2e28
fixes to EllipticEnvelope, SparseCoder, TruncatedSVD, DummyClassifier…
amueller Dec 9, 2016
ca84e37
wow.. fix bug in dict_learning with one component
amueller Dec 9, 2016
72944e0
add tag for skipping accuracy test
amueller Dec 9, 2016
f5c5b7c
add default tags to BaseEstimator
amueller Dec 12, 2016
8ec5d7c
add test_accuracy=False to PLS
amueller Dec 12, 2016
3e5194c
minor fixes and tags for dummy estimators.
amueller Dec 12, 2016
3868203
Fix dummy regressor super call
amueller Dec 12, 2016
1c7d02f
make tests pass finally
amueller Dec 12, 2016
4758abd
document estimator tests
amueller Dec 12, 2016
a62cd91
add stateless and missing values tags
amueller Dec 12, 2016
b857a05
include meta and dont_test everywhere
amueller Dec 12, 2016
62cfcc9
don't make gradient base estimators estimators
amueller Dec 12, 2016
ee2c97b
try to make common tests work when transform produces sparse matrix (…
amueller Dec 12, 2016
e8efb8b
input validation fixes in TfidfTransformer
amueller Dec 12, 2016
9aaae44
insist that estimators allow 2d input for the current checks. we can …
amueller Dec 13, 2016
fcf5169
hashing vectorizer and dict vectorizer input types, allow np.float64 …
amueller Dec 13, 2016
8a52e34
add _skip_test tag to force skipping tests - for CheckingClassifier o…
amueller Dec 13, 2016
b90f0d5
add label input type for label preprocessing
amueller Dec 13, 2016
f6e9b15
d'uh
amueller Dec 13, 2016
9502c6e
working on better meta-estimator support
amueller Dec 13, 2016
281a7c2
don't use the deprecated include_dont_test parameter of all_estimators
amueller Dec 13, 2016
5aa2390
check_estimator fix for when being called with instance and for multi…
amueller Dec 13, 2016
e36ea42
ducktyping partial_fit in multiclass, fix OvO decision function shape…
amueller Dec 13, 2016
f871162
check classification targets in OvO
amueller Dec 13, 2016
9194d73
input validation in OutputCodeClassifier
amueller Dec 13, 2016
e601a4b
add multioutput_only tag, fix some of the multi-output estimators, ad…
amueller Dec 13, 2016
a8648d5
give up on multioutput classifier for now
amueller Dec 13, 2016
7b7e152
make at least MultiOutputRegressor work
amueller Dec 13, 2016
1b23d88
input validation in EllipticEnvelope
amueller Dec 13, 2016
c877e77
fix order of checks in test_class_weight_balanced_linear_classifier
amueller Dec 13, 2016
8d42707
fix silly tag inplace errors
amueller Dec 13, 2016
923a946
complete fitting of FromModel in fit
amueller Dec 13, 2016
aa9f6ba
detect if score is a function instead of a method and shift parameter…
amueller Dec 13, 2016
74aa03d
DummyRegressor is actually multi-output
amueller Dec 13, 2016
ed0d91d
support dense arrays in TfidfTransformer
amueller Dec 13, 2016
0966ee9
fix from_model test on invalid input
amueller Dec 13, 2016
b944ee3
give EllipticEnvelope the accuracy score back... for some reason?
amueller Dec 13, 2016
bd5ccb0
pass instance to check in test, not class
amueller Dec 13, 2016
537daf9
run no smoothing test on sparse matrix because headache
amueller Dec 13, 2016
fd717e8
tag all multioutput estimators (regressors?) with MultiOutputMixin
amueller Dec 13, 2016
283217a
use ``safe_tags`` everywhere
amueller Dec 14, 2016
b3281c0
remove left-over "self=None"
amueller Dec 14, 2016
b32a2ca
introduce _update_tags helper
amueller Dec 14, 2016
ab594c2
sdd missing self in _update_tags call
amueller Dec 14, 2016
246d368
fix missing return, some typos
amueller Dec 14, 2016
3c353e8
fix OneVsRestClassifier decision function shape for n_classes=2
amueller Dec 14, 2016
4591799
removed unused mixin
amueller Dec 14, 2016
f368dd9
hopefully version-independent fix for explicit self argument
amueller Dec 14, 2016
dedb873
allow unicode parameters in python2
amueller Dec 14, 2016
928b3c8
some fixes in the docs
amueller Dec 14, 2016
d039962
pep8
amueller Dec 14, 2016
2edf651
added 13 whatsnew entries...
amueller Dec 15, 2016
c1f7842
some whitespace
amueller Dec 16, 2016
633f945
add partial_fit tests for OvR and SelectFromModel
amueller Dec 19, 2016
0d607eb
added tests for multiclass and multioutput input validation fixes
amueller Dec 19, 2016
5c12cba
add test for n_components = 1 transform in dict learning
amueller Dec 19, 2016
a52eff1
Merge branch 'master' into tags
amueller Jun 5, 2017
6749ff3
sync with master, fix merging issues
amueller Jun 5, 2017
a57a253
fix merge issue (though the new docstring seems worsE)
amueller Jun 5, 2017
e7cc0d7
add test for sparse_encode shapes
amueller Jun 5, 2017
779074a
fix test lol
amueller Jun 5, 2017
0d08435
fix test for sparse_encode shapes
amueller Jun 5, 2017
e5721be
fix multioutput_estimator_convert_y_2d calls (merge errors?)
amueller Jun 6, 2017
12112ac
ignore more deprecation warnings in common tests
amueller Jun 6, 2017
5866538
add if_delegate_has_method to MultiOutputRegressor.partial_fit
amueller Jun 6, 2017
b926691
ignore more deprecation warnings in common tests for good measure
amueller Jun 6, 2017
980a2dc
skip tests in GaussianProcess as it adds too many stuff during fit an…
amueller Jun 6, 2017
28b1dd1
Merge branch 'master' into tags
amueller Jun 6, 2017
2dce52c
merge fixes, don't do the tf-idf thing
amueller Jun 6, 2017
9046dcb
remove duplicate whatsnew entries
amueller Jun 6, 2017
b58c9d1
remove more duplicate whatsnew entries
amueller Jun 6, 2017
e054afd
fixes from merge
amueller Jun 6, 2017
095dd3f
give up on TfidfTransformer and GaussianRandomProjectionHash for now
amueller Jun 6, 2017
b5092cc
tests passing again... whew
amueller Jun 6, 2017
8666465
start work on separating instance-level tests
amueller Jun 6, 2017
bbfaf59
minor refactoring / fixes to work without tags
amueller Jun 6, 2017
4dd732d
add clone into check_supervised_y_2d estimator check (which made othe…
amueller Jun 6, 2017
7ce1123
remove duplicate check_estimator_unfitted assert
amueller Jun 6, 2017
48bd931
add issue reference to whatsnew entry
amueller Jun 6, 2017
b1171ed
added some clones, minor fixes from vene's review
amueller Jun 7, 2017
c636b20
rename estimator arg to estimator_org to make a visible distinction b…
amueller Jun 7, 2017
7eb6bed
more renaming for more explicit clones
amueller Jun 7, 2017
7cb4505
org -> orig
amueller Jun 7, 2017
c8b1f96
allclose, fix orig stuff
amueller Jun 7, 2017
ca0767a
don't use set_testing_parameters in the checks!
amueller Jun 7, 2017
79e1c8f
minor fixes for allclose
amueller Jun 7, 2017
9840f43
fix some test, add more tests on classes
amueller Jun 7, 2017
efe4614
added the test using pickles.
amueller Jun 7, 2017
8fede49
move assert_almost_equal_dense_sparse to utils.testing, rename to ass…
amueller Jun 8, 2017
27743d4
make assert_allclose_dense_sparse more stringent
amueller Jun 8, 2017
02a93e8
more allclose fixes
amueller Jun 8, 2017
764898e
run test_check_estimator on all estimators
amueller Jun 8, 2017
7ef1c2b
rename set_testing_parameters to set_checking_parameters so nose does…
amueller Jun 8, 2017
57736d1
fix in set_checking_parameters so that common tests pass
amueller Jun 8, 2017
0691b71
Merge branch 'master' into instance_level_tests
amueller Jun 8, 2017
3f74443
more fixes to assert_allclose_dense_sparse
amueller Jun 8, 2017
5a59d2f
rename alg to clusterer, don't scream even though I really want to
amueller Jun 8, 2017
cb74e53
ok this is not a pretty strict test that runs check_estimator with an…
amueller Jun 8, 2017
49b48c9
simplify test as they didn't help at all
amueller Jun 8, 2017
7e5e0a1
it works!!! omfg
amueller Jun 8, 2017
b96a335
run check_estimator clone test only on one of the configs, don't run …
amueller Jun 8, 2017
d660059
Add `slow_test` decorator and documentation
vene Jun 8, 2017
71a72a8
Merge pull request #31 from vene/testswip
amueller Jun 8, 2017
e2b8d63
Merge branch 'master' into instance_level_tests
amueller Jun 9, 2017
a0c5eeb
Merge branch 'instance_level_tests' of github.com:amueller/scikit-lea…
amueller Jun 9, 2017
5d91633
run test_check_estimator only on some estimators
amueller Jun 9, 2017
1ff8463
fix diags in test for older scipy
amueller Jun 9, 2017
cce8954
fix pep8 and shorten
vene Jun 9, 2017
ef97a81
Merge pull request #32 from vene/flakefix
amueller Jun 9, 2017
46189b8
use joblib.hash for inequality check because the pickle state machine…
amueller Jun 9, 2017
b151752
Merge branch 'instance_level_tests' of github.com:amueller/scikit-lea…
amueller Jun 9, 2017
2cd6e1c
Merge branch 'master' into tags
amueller Jun 9, 2017
4c509e6
Merge branch 'master' into tags
amueller Jun 9, 2017
16f487b
minor syncs with master
amueller Jun 9, 2017
dfc661a
remove duplicate test
amueller Jun 9, 2017
c499b08
don't test GaussianProcess as deprecated and being difficult
amueller Jun 10, 2017
720e34c
clean up some ifs
amueller Jun 10, 2017
22eee88
add "deterministic" and "requires_positive_data" tags (but don't use …
amueller Jun 10, 2017
9eab395
mark non-deterministic estimator with tag
amueller Jun 10, 2017
a47e9f8
simplify test_common
amueller Jun 10, 2017
3b5762d
deprecate / remove include_meta_estimators
amueller Jun 10, 2017
03e1716
add fix for models that can predict without fit but are not stateless…
amueller Jun 10, 2017
ff37f01
remove SpectralClustering special case, test meta-estimators using ba…
amueller Jun 12, 2017
5d73c1a
some additions to contributing doc
amueller Jun 12, 2017
2157614
remove estimator from _update_tags
amueller Jun 12, 2017
83744ef
don't use get() on tags, always use _safe_tags. address other minor c…
amueller Jun 12, 2017
e1f80d3
Merge branch 'master' into tags
amueller Jun 19, 2017
54bce7a
make _safe_tags more safe
amueller Jun 19, 2017
4e00dff
fix test_accuracy -> test_predictions rename in PLS
amueller Jun 19, 2017
c04f361
fix DummyClassifier to work on y.ndim == 2 with y.shape[1] == 1
amueller Jun 19, 2017
91804f8
special case TruncatedSVD :-( make fit error depend on input_validation
amueller Jun 19, 2017
5df999c
ugh silly typo == TruncatedSVD
amueller Jun 19, 2017
e053cce
set parameters on estimator only once (shouldn't change anything beca…
amueller Jun 19, 2017
5aa313b
merge head of master with amueller's tags branch. Tests are failing b…
GKjohns Nov 29, 2017
f547204
Merge branch 'master' into kyle_tags
amueller Jun 15, 2018
81b1c51
fix ore merge issues
amueller Jun 15, 2018
0617512
add tags to new imputers
amueller Jun 15, 2018
2e8d206
remove duplicate import
amueller Jun 15, 2018
af2aaa6
add required parameter to ColumnTransformer
amueller Jun 15, 2018
c29dac4
add input validation tag to TransformedTargetRegressor
amueller Jun 15, 2018
61c5628
cleanup imports, pep8 in estimator checks
amueller Jun 15, 2018
1dd02c0
don't worry about meta-estimators in common tests for now.
amueller Jun 15, 2018
500921e
fix whitespace issues
amueller Jun 15, 2018
afea648
skip some more input validation checks
amueller Jun 15, 2018
16ba879
fix pandas sample weight test
amueller Jun 15, 2018
a8ea48c
add missing vallue tag to MinMaxScaler
amueller Jun 15, 2018
860dd6b
test common cleanup
amueller Jun 15, 2018
48e6fca
remove duplicate transformer test
amueller Jun 15, 2018
f574be8
missing value tag for quantile transformer
amueller Jun 15, 2018
b217bb7
ensure min_features=2 in TruncatedSVD
amueller Jun 15, 2018
d09eb6f
require input validation tag in more places
amueller Jun 15, 2018
cf3ded6
Merge branch 'master' into tags
amueller Jun 29, 2018
2d67c2f
remove old files
amueller Jun 29, 2018
42138a4
Merge branch 'master' into tags
amueller Sep 28, 2018
e13df63
merge fixes
amueller Sep 28, 2018
17e5a9c
another merge error
amueller Sep 28, 2018
42fff09
don't check preprocessing methods for missing values as they pass the…
amueller Sep 28, 2018
9f34866
reset whatsnew
amueller Oct 4, 2018
f68d5c0
don't change self.n_values in OneHotEncoder.fit
amueller Oct 4, 2018
d71f0c4
raise more consistent error messages
amueller Oct 4, 2018
3b3ac3d
Merge branch 'master' into tags
amueller Oct 4, 2018
678b74f
more common test fixes
amueller Oct 4, 2018
e1d15b9
densify prediction in sample weight test
amueller Oct 4, 2018
7851b7f
skip tests on RandomTreesEmbedding for now
amueller Oct 4, 2018
aeb3b36
minor fixes and formatting
amueller Oct 4, 2018
b406af1
rename missing_values tag to allow_nan tag
amueller Oct 4, 2018
7e09f23
remove duplicate test
amueller Oct 4, 2018
83f8883
rename deterministic to non_deterministic
amueller Oct 4, 2018
259668c
rename tags to be false by default
amueller Oct 4, 2018
e3b6459
indentation fixes
amueller Oct 4, 2018
e7bf51d
add note on default values and unused tags
amueller Oct 4, 2018
6930c8a
remove non_meta as not applicable, cleanup
amueller Oct 4, 2018
89c3050
fix accuracy tests
amueller Oct 4, 2018
5da0089
cleanup old code
amueller Oct 4, 2018
1b7725d
try to refactor _get_tags
amueller Oct 4, 2018
a79b82d
fixes for missing tags
amueller Oct 4, 2018
af8856f
pep8 fixes
amueller Oct 4, 2018
d794c8b
don't use bare except
amueller Oct 4, 2018
f118b76
rm criterion and max_features from __init__ and store them as class a…
rohan-varma Oct 8, 2018
d1b67dc
make sure that the docstring comes first
rohan-varma Oct 8, 2018
7fe2cd4
Merge branch 'master' into tags
amueller Oct 10, 2018
20ca277
Merge branch 'move-criterion-max-features' into tags
amueller Oct 10, 2018
727267b
remove _skip_test from RandomTreeEmbedding because it got fixed
amueller Oct 11, 2018
5da2c16
Merge branch 'master' into tags
amueller Oct 11, 2018
11f5e5c
remove unused import
amueller Oct 11, 2018
18187a2
fix pep8
amueller Oct 15, 2018
4263515
fix merge messup
amueller Oct 15, 2018
56f6903
Merge branch 'master' into tags
amueller Nov 16, 2018
18e2d66
skip ovo test
amueller Nov 16, 2018
173c126
do tests on instances when possible, cover meta-estimators
amueller Nov 16, 2018
049e3aa
don't allow inf in targets. Too strict?
amueller Nov 16, 2018
2196a22
add classes_ to RFE and RFECV
amueller Nov 16, 2018
0a25ad3
minor fixed to RFE and RFECV, OrdinalEncoder
amueller Nov 16, 2018
1052f43
skip tests on multioutput classifier and RegressorChain
amueller Nov 16, 2018
584d702
fix instantiation in check_default_constructible
amueller Nov 16, 2018
3666f63
rename no_accuracy_assured to poor_score, document X_types
amueller Nov 16, 2018
8928ed4
fix error message in common tests for nan in class labels
amueller Nov 19, 2018
1109981
don't allow overwriting tags in the MRO
amueller Nov 19, 2018
8de0d04
Merge branch 'master' into tags
amueller Dec 28, 2018
b2c6b43
fix some merge issues
amueller Dec 28, 2018
22501b8
review comments by jnothman
amueller Dec 28, 2018
aef7378
Merge branch 'master' into tags
amueller Jan 17, 2019
42aa99a
don't use deprecated "message' in pytest.raises
amueller Jan 17, 2019
e10f20e
add some comments in the tag docs, make dict_vectorizer input_type "d…
amueller Jan 22, 2019
5e94012
Merge branch 'master' into tags
amueller Jan 22, 2019
e61b35d
Merge branch 'master' into tags
amueller Feb 21, 2019
4c1ed2d
Update doc/developers/contributing.rst
glemaitre Feb 21, 2019
281a7ef
Apply suggestions from code review
glemaitre Feb 21, 2019
d759329
very certain I fixed this before... more generic error message for in…
amueller Feb 21, 2019
4715e1b
add tags to iterativeimputer
amueller Feb 21, 2019
873d916
fix pep8 from a suggestion ;)
amueller Feb 21, 2019
d67df1c
fix missing indicator test
amueller Feb 21, 2019
83fa5f3
remove outdated comment
amueller Feb 22, 2019
105 changes: 88 additions & 17 deletions doc/developers/contributing.rst
@@ -1419,22 +1419,18 @@ advised to maintain notes on the `GitHub wiki
Specific models
---------------

Classifiers should accept ``y`` (target) arguments to ``fit`` that are
sequences (lists, arrays) of either strings or integers. They should not
assume that the class labels are a contiguous range of integers; instead, they
should store a list of classes in a ``classes_`` attribute or property. The
order of class labels in this attribute should match the order in which
``predict_proba``, ``predict_log_proba`` and ``decision_function`` return their
values. The easiest way to achieve this is to put::

self.classes_, y = np.unique(y, return_inverse=True)

in ``fit``. This returns a new ``y`` that contains class indexes, rather than
labels, in the range [0, ``n_classes``).
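
For example, an illustrative sketch of what ``np.unique`` with
``return_inverse=True`` returns, namely the sorted unique labels together with
the encoded targets::

    import numpy as np

    y = ["spam", "ham", "spam", "eggs"]
    classes_, y_encoded = np.unique(y, return_inverse=True)
    print(classes_)   # ['eggs' 'ham' 'spam'] -- the sorted class labels
    print(y_encoded)  # [2 1 2 0] -- indexes into classes_, in [0, n_classes)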

A classifier's ``predict`` method should return
arrays containing class labels from ``classes_``.
@@ -1445,14 +1441,89 @@ this can be achieved with::
D = self.decision_function(X)
return self.classes_[np.argmax(D, axis=1)]

In linear models, coefficients are stored in an array called ``coef_``, and the
independent term is stored in ``intercept_``. ``sklearn.linear_model.base``
contains a few base classes and mixins that implement common linear model
patterns.

The :mod:`sklearn.utils.multiclass` module contains useful functions
for working with multiclass and multilabel problems.

Estimator Tags
--------------
.. warning::

The estimator tags are experimental and the API is subject to change.

Scikit-learn introduced estimator tags in version 0.21. These are annotations
of estimators that allow programmatic inspection of their capabilities, such as
sparse matrix support, supported output types and supported methods. The
estimator tags are a dictionary returned by the method ``_get_tags()``. These
tags are used by the common tests and the
:func:`sklearn.utils.estimator_checks.check_estimator` function to decide what
tests to run and what input data is appropriate. Tags can depend on estimator
parameters or even system architecture and can in general only be determined at
runtime.

Review (Member): You might want to note that these tags may be dependent on
estimator parameters and even system architecture, and hence are a method on an
instance, rather than a property of the class. You should probably also define
the default implementation and ``_more_tags``, or do you consider that even
more experimental?

The default value of all tags except for ``X_types`` is ``False``.
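
As an illustrative sketch (assuming a scikit-learn version that includes this
PR; ``_get_tags`` is a private, experimental API), tags are queried on an
instance rather than on the class::

    from sklearn.tree import DecisionTreeClassifier

    tags = DecisionTreeClassifier()._get_tags()
    print(tags['allow_nan'])  # False: this estimator does not accept NaN values
    print(tags['X_types'])    # ['2darray']: the default input type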

The current set of estimator tags is:

non_deterministic
whether the estimator is not deterministic given a fixed ``random_state``

requires_positive_data - unused for now
whether the estimator requires positive X.

no_validation
whether the estimator skips input validation. This is only meant for stateless and dummy transformers!

multioutput - unused for now
whether a regressor supports multi-target outputs or a classifier supports multi-class multi-output.

multilabel
whether the estimator supports multilabel output

stateless
whether the estimator fits without using any information in the data, i.e.
does not need access to the data for fitting. Even though an estimator is
stateless, it might still need a call to ``fit`` for initialization.

Review (Member): (should we deprecate the need for a call to fit for
initialisation in stateless estimators?)

Review (amueller, author): I don't think we can, because "stateless" can still
mean it depends on n_features. RBFSampler for example needs to sample a random
matrix of shape (n_features, n_components). I'm not sure how to do it unless
we call fit. Do we error if Normalizer is called with different n_features
during transform? Do we want to?

Review (Member): I'm not sure "stateless" is the right word then? We mean
"data independent"?

Review (amueller, author): do we want to have separate tags for data
independent and no state at all?

Review (Member): Dunno. What's the use case? If so, we could consider a
ternary tag...

Review (amueller, author): I guess the main use-case here was that some
estimators didn't complain if the number of features was different in fit and
transform, possibly only AdditiveChi2Sampler. Though this estimator still
required calling fit. I guess we can define stateless as "doesn't require
calling fit", and everything that requires calling fit should check the shape?

Review (amueller, author): Hm, AdditiveChi2Sampler requires calling fit for no
reason actually...

Review (amueller, author): Opened #12616 to follow up. I don't think there's a
good reason for a ternary tag. Right now this is used for testing two things:
checking that calling transform before fit will raise a nice error, and
checking that the number of features needs to be consistent between fit and
transform. These checks are somewhat independent, but both related to being
stateless.

allow_nan
whether the estimator supports data with missing values encoded as np.NaN
Review (Member): There may be some subtlety to this. What if it supports NaN
at transform but not at fit (with some parameters)?

Review (amueller, author): yes, or the other way around?

Review (Member): I don't think the other way around would be a case of
interest. In #11635 we identified that you might have a feature selector that
could not train on missing data (if only because the parameters weren't
right), but there's no reason it shouldn't transform with missing data.

poor_score
whether the estimator fails to provide a "reasonable" test-set score, which
currently for regression is an R2 of 0.5 on a subset of the boston housing
dataset, and for classification an accuracy of 0.83 on
``make_blobs(n_samples=300, random_state=0)``. These datasets and values
are based on current estimators in sklearn and might be replaced by
something more systematic.

multioutput_only
whether the estimator supports only multi-output classification or regression.

_skip_test
whether to skip common tests entirely. Don't use this unless you have a *very
good* reason.

Review (Member): if the _ doesn't mean private, perhaps we can use something
like !

Review (amueller, author): it kinda means private in the sense that no-one
should ever use it ;)

Review (Member): X_types is undocumented at present, and is mysterious...
should it not be a series of boolean tags instead of a list?

Review (amueller, author): That would require us to define a list of possible
input types now, and it would be harder to change in the future though, right?

Review (Member): Why is a set of boolean tags harder than a list?

Review (amueller, author): I felt it might be more natural to add new things
to a set/list than add another boolean variable to a set/list of boolean
variables. If for the boolean variable the variables that are present might
change, then the logic will be strictly more complex than a list, right?

Review (Member): Dunno. A list is fine... and could have benefits if the
objects in the list are not merely strings.

X_types
Supported input types for X as list of strings. Tests are currently only run
if '2darray' is contained in the list, signifying that the estimator takes
continuous 2d numpy arrays as input. The default value is ['2darray']. Other
possible types are ``'string'``, ``'sparse'``, ``'categorical'``, ``'dict'``,
``'1dlabels'`` and ``'2dlabels'``. The goal is that in the future the
supported input type will determine the data used during testing, in
particular for ``'string'``, ``'sparse'`` and ``'categorical'`` data. For now,
the tests for sparse data do not make use of the ``'sparse'`` tag.
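
A third-party estimator would typically declare only the tags that differ from
the defaults via ``_more_tags``. A hypothetical sketch (the class and its tag
choices are made up for illustration)::

    from sklearn.base import BaseEstimator, TransformerMixin

    class IdentityTransformer(BaseEstimator, TransformerMixin):
        """Hypothetical stateless transformer that skips input validation."""

        def fit(self, X, y=None):
            return self  # nothing is learned from the data

        def transform(self, X):
            return X

        def _more_tags(self):
            # only the deviations from the default tags need to be listed
            return {'stateless': True, 'no_validation': True}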


In addition to the tags, estimators also need to declare any non-optional
parameters to ``__init__`` in the ``_required_parameters`` class attribute,
which is a list or tuple. If ``_required_parameters`` is only
``["estimator"]`` or ``["base_estimator"]``, then the estimator will be
instantiated with an instance of ``LinearDiscriminantAnalysis`` (or
``RidgeRegression`` if the estimator is a regressor) in the tests. The choice
of these two models is somewhat idiosyncratic but both should provide robust
closed-form solutions.

Review (Member): Can we not determine this automatically by inspecting the
``Estimator.__init__`` signature? Or are there cases where the two don't
match?

Review (amueller, author): @jnothman asked the same, so maybe my intentions
are indeed unclear. The point here is that we currently check that everything
can be constructed without parameters in sklearn, with a few exceptions. I'd
like to keep checking that. This parameter is there to say "I'm really sure I
want this to require parameters". I think otherwise we'd significantly weaken
our tests.

Review (Member): Would the following be an appropriate substitute, shooting
two birds with one stone?

    class BaseEstimator:
        ...
        @classmethod
        def _get_instances_for_checking(cls):
            yield cls()

Review (Member): This also has the potential to make most of
set_checking_parameters disappear. (Although I suppose it does not then
strictly test that it is default-constructible.)

Review (Member): _get_instances_for_checking can certainly be implemented as a
separate PR. I think this would also allow us to use check_estimator in
test_common.py.

Review (amueller, author): +1 for possibly separate PR. I kinda don't want to
mess with the default construction test too much...
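
For instance, a meta-estimator that genuinely cannot be default-constructed
would declare the attribute as follows (hypothetical class for illustration;
``MetaEstimatorMixin`` in this PR already sets
``_required_parameters = ["estimator"]``)::

    from sklearn.base import BaseEstimator, MetaEstimatorMixin

    class ThresholdedEstimator(BaseEstimator, MetaEstimatorMixin):
        """Hypothetical meta-estimator; the common tests will construct it
        with a LinearDiscriminantAnalysis (or RidgeRegression) instance."""

        _required_parameters = ['estimator']

        def __init__(self, estimator, threshold=0.5):
            self.estimator = estimator
            self.threshold = threshold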

.. _reading-code:

Reading the existing code base
69 changes: 58 additions & 11 deletions sklearn/base.py
@@ -6,12 +6,25 @@
import copy
import warnings
from collections import defaultdict
from inspect import signature
import struct
import inspect

import numpy as np

from . import __version__

_DEFAULT_TAGS = {
'non_deterministic': False,
'requires_positive_data': False,
'X_types': ['2darray'],
'poor_score': False,
'no_validation': False,
'multioutput': False,
"allow_nan": False,
'stateless': False,
'multilabel': False,
'_skip_test': False,
'multioutput_only': False}
Review (amueller, author): binary only would be an important tag for external
libraries (and came up in the context of the GP here).

Review (Member): Make sure you're clear that it's binary targets, not
features.

Review (Member): Binary only is also relevant for calibration methods.


@@ -61,7 +74,6 @@ def clone(estimator, safe=True):
return new_object


###############################################################################
def _pprint(params, offset=0, printer=repr):
"""Pretty print the dictionary 'params'

@@ -112,7 +124,17 @@ def _pprint(params, offset=0, printer=repr):
return lines


###############################################################################
def _update_if_consistent(dict1, dict2):
    common_keys = set(dict1.keys()).intersection(dict2.keys())
    for key in common_keys:
        if dict1[key] != dict2[key]:
            raise TypeError("Inconsistent values for tag {}: {} != {}".format(
                key, dict1[key], dict2[key]))
    dict1.update(dict2)
    return dict1

Review (Member): Right, but then this would error if the BaseEstimator and the
final estimator class define the same tag? Maybe re-ordering slightly the MRO
classes (cf #8022 (comment)), then not overwriting existing tags could be a
way around it...

Review (amueller, author): yes, which I solved by having the BaseEstimator not
defining anything. The current solution does what I want... (I think)
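
To illustrate the consistency rule (a sketch assuming the private helper is
importable from ``sklearn.base`` as defined above)::

    from sklearn.base import _update_if_consistent

    tags = {'allow_nan': True, 'stateless': False}
    # compatible dicts are merged in place and returned:
    print(_update_if_consistent(tags, {'multilabel': True}))

    # conflicting values for the same tag raise TypeError:
    _update_if_consistent({'allow_nan': True}, {'allow_nan': False})
    # TypeError: Inconsistent values for tag allow_nan: True != False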


class BaseEstimator:
"""Base class for all estimators in scikit-learn

@@ -135,7 +157,7 @@ def _get_param_names(cls):

# introspect the constructor arguments to find the model parameters
# to represent
init_signature = signature(init)
init_signature = inspect.signature(init)
# Consider the constructor parameters excluding 'self'
parameters = [p for p in init_signature.parameters.values()
if p.name != 'self' and p.kind != p.VAR_KEYWORD]
@@ -255,8 +277,22 @@ def __setstate__(self, state):
except AttributeError:
self.__dict__.update(state)

def _get_tags(self):
    collected_tags = {}
    for base_class in inspect.getmro(self.__class__):
        if (hasattr(base_class, '_more_tags')
                and base_class != self.__class__):
            more_tags = base_class._more_tags(self)
            collected_tags = _update_if_consistent(collected_tags,
                                                   more_tags)
    if hasattr(self, '_more_tags'):
        more_tags = self._more_tags()
        collected_tags = _update_if_consistent(collected_tags, more_tags)
    tags = _DEFAULT_TAGS.copy()
    tags.update(collected_tags)
    return tags

Review (Member): Don't we need to reverse this list to give precedence to tags
set earlier in the MRO? The precedence should be tested either way. (I think
the official idiom might be type(self) rather than self.__class__, but not
sure.)

Review (amueller, author): hm, after thinking about this again, this looks
like we're running into the same MRO issue that I was having earlier. I don't
think @rth's solution actually works. I was hoping we would be able to ignore
the left-right MRO order with this approach and never overwrite tags. But we
have the full tags defined in the BaseEstimator. One solution to make this
work is to remove all the tags in the base estimator, not allow overwriting
any tags, and then fill in missing tags with the default.

Review (rth, Member, Nov 19, 2018): Yes, you are right, it is method
resolution order, e.g.,

    <class '__main__.LinearRegression'>
    <class '__main__.BaseEstimator'>
    <class '__main__.ClassifierMixin'>

So first, if a tag is defined in the first estimator, we don't want to
overwrite it, i.e.

    for key, val in base_class._more_tags(self).items():
        if key not in tags:
            tags[key] = val

(or something similar), instead of

    tags.update(base_class._more_tags(self))

Then you are right that we want tags from within the mixin to apply before the
base estimators. Maybe we want to sort inspect.getmro(self.__class__) with
some custom comparison function that would put mixins before BaseEstimator,
e.g.,

    def _mro_class_compare(args):
        """Adjust the ordering for some estimator classes,
        while preserving the MRO ordering for the rest"""
        position_init, cls = args
        offset = 0
        if cls.__name__ == 'BaseEstimator':
            # put the BaseEstimator last
            offset = 2000
        elif cls.__name__.startswith('Base'):
            # put any "Base.*" classes just before
            offset = 1000
        return position_init + offset


    # [...]
            for _, base_class in sorted(enumerate(inspect.getmro(type(self))),
                                        key=_mro_class_compare):
                # setting tags here

it's a bit hackish, but might work. Here the output would be

    <class '__main__.LinearRegression'>
    <class '__main__.ClassifierMixin'>
    <class 'object'>
    <class '__main__.BaseEstimator'>
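
A sketch of how this collection behaves (hypothetical class, assuming a
version with this PR merged): tags contributed by mixins and by the class
itself are merged, and anything left unspecified falls back to
``_DEFAULT_TAGS``::

    from sklearn.base import BaseEstimator, RegressorMixin, MultiOutputMixin

    class ToyRegressor(MultiOutputMixin, RegressorMixin, BaseEstimator):
        def _more_tags(self):
            return {'poor_score': True}

    tags = ToyRegressor()._get_tags()
    assert tags['multioutput'] is True  # from MultiOutputMixin._more_tags
    assert tags['poor_score'] is True   # from ToyRegressor._more_tags
    assert tags['allow_nan'] is False   # filled in from _DEFAULT_TAGS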


###############################################################################
class ClassifierMixin:
"""Mixin class for all classifiers in scikit-learn."""
_estimator_type = "classifier"
@@ -289,7 +325,6 @@ def score(self, X, y, sample_weight=None):
return accuracy_score(y, self.predict(X), sample_weight=sample_weight)


###############################################################################
class RegressorMixin:
"""Mixin class for all regression estimators in scikit-learn."""
_estimator_type = "regressor"
@@ -330,7 +365,6 @@ def score(self, X, y, sample_weight=None):
multioutput='variance_weighted')


###############################################################################
class ClusterMixin:
"""Mixin class for all cluster estimators in scikit-learn."""
_estimator_type = "clusterer"
@@ -432,7 +466,6 @@ def get_submatrix(self, i, data):
return data[row_ind[:, np.newaxis], col_ind]


###############################################################################
class TransformerMixin:
"""Mixin class for all transformers in scikit-learn."""

@@ -510,13 +543,27 @@ def fit_predict(self, X, y=None):
return self.fit(X).predict(X)


###############################################################################
class MetaEstimatorMixin:
_required_parameters = ["estimator"]
"""Mixin class for all meta estimators in scikit-learn."""
# this is just a tag for the moment


###############################################################################
class MultiOutputMixin(object):
"""Mixin to mark estimators that support multioutput."""
def _more_tags(self):
return {'multioutput': True}
Review (Member): maybe this should set 'multilabel' if is_classifier(self)

def _is_32bit():
    """Detect if process is 32bit Python."""
    return struct.calcsize('P') * 8 == 32


class _UnstableOn32BitMixin(object):
    """Mark estimators that are non-deterministic on 32bit."""
    def _more_tags(self):
        return {'non_deterministic': _is_32bit()}


def is_classifier(estimator):
"""Returns True if the given estimator is (probably) a classifier.
1 change: 1 addition & 0 deletions sklearn/compose/_column_transformer.py
@@ -158,6 +158,7 @@ class ColumnTransformer(_BaseComposition, TransformerMixin):
[0.5, 0.5, 0. , 1. ]])

"""
_required_parameters = ['transformers']

def __init__(self, transformers, remainder='drop', sparse_threshold=0.3,
n_jobs=None, transformer_weights=None):
3 changes: 3 additions & 0 deletions sklearn/compose/_target.py
@@ -233,3 +233,6 @@ def predict(self, X):
pred_trans = pred_trans.squeeze(axis=1)

return pred_trans

def _more_tags(self):
return {'poor_score': True, 'no_validation': True}
3 changes: 2 additions & 1 deletion sklearn/cross_decomposition/cca_.py
@@ -1,9 +1,10 @@
from .pls_ import _PLS
from ..base import _UnstableOn32BitMixin

__all__ = ['CCA']


class CCA(_PLS):
class CCA(_PLS, _UnstableOn32BitMixin):
"""CCA Canonical Correlation Analysis.

CCA inherits from PLS with mode="B" and deflation_mode="canonical".
6 changes: 5 additions & 1 deletion sklearn/cross_decomposition/pls_.py
@@ -13,6 +13,7 @@
from scipy.sparse.linalg import svds

from ..base import BaseEstimator, RegressorMixin, TransformerMixin
from ..base import MultiOutputMixin
from ..utils import check_array, check_consistent_length
from ..utils.extmath import svd_flip
from ..utils.validation import check_is_fitted, FLOAT_DTYPES
@@ -116,7 +117,7 @@ def _center_scale_xy(X, Y, scale=True):
return X, Y, x_mean, y_mean, x_std, y_std


class _PLS(BaseEstimator, TransformerMixin, RegressorMixin,
class _PLS(BaseEstimator, TransformerMixin, RegressorMixin, MultiOutputMixin,
metaclass=ABCMeta):
"""Partial Least Squares (PLS)

@@ -454,6 +455,9 @@ def fit_transform(self, X, y=None):
"""
return self.fit(X, y).transform(X, y)

def _more_tags(self):
return {'poor_score': True}


class PLSRegression(_PLS):
"""PLS regression
4 changes: 2 additions & 2 deletions sklearn/decomposition/kernel_pca.py
@@ -10,12 +10,12 @@
from ..utils import check_random_state
from ..utils.validation import check_is_fitted, check_array
from ..exceptions import NotFittedError
from ..base import BaseEstimator, TransformerMixin
from ..base import BaseEstimator, TransformerMixin, _UnstableOn32BitMixin
from ..preprocessing import KernelCenterer
from ..metrics.pairwise import pairwise_kernels


class KernelPCA(BaseEstimator, TransformerMixin):
class KernelPCA(BaseEstimator, TransformerMixin, _UnstableOn32BitMixin):
"""Kernel Principal component analysis (KPCA)

Non-linear dimensionality reduction through the use of kernels (see
3 changes: 2 additions & 1 deletion sklearn/decomposition/truncated_svd.py
@@ -156,7 +156,8 @@ def fit_transform(self, X, y=None):
X_new : array, shape (n_samples, n_components)
Reduced version of X. This will always be a dense array.
"""
X = check_array(X, accept_sparse=['csr', 'csc'])
X = check_array(X, accept_sparse=['csr', 'csc'],
ensure_min_features=2)
random_state = check_random_state(self.random_state)

if self.algorithm == "arpack":