[MRG] Support for strings in OneHotEncoder #8793


Closed
36 commits
ea98484 Refactored OneHotEncoder to work with strings (vighneshbirodkar, Mar 30, 2016)
e03b5c7 ported functions to fixes.py (vighneshbirodkar, May 2, 2016)
06e6d3a unique arrays are now sorted (vighneshbirodkar, May 2, 2016)
074f194 revert selection logic (vighneshbirodkar, May 2, 2016)
083142e Added copy argument (vighneshbirodkar, May 2, 2016)
f768f3b Inbetween adding the seen option (vighneshbirodkar, Aug 31, 2016)
1e34cae remove seen argument and support range case with FutureWarning (vighneshbirodkar, Sep 1, 2016)
fed7959 Made label_encoders_ private (vighneshbirodkar, Sep 2, 2016)
c62d2ba Added new attributes and tests for OHE (vighneshbirodkar, Sep 2, 2016)
e929f23 Fixed doctests (vighneshbirodkar, Sep 2, 2016)
bc7a26b Fixed rst doc tests (vighneshbirodkar, Sep 2, 2016)
feaf014 Replaced type in array with ellipsis (vighneshbirodkar, Sep 2, 2016)
7b608e1 flake fixes (vighneshbirodkar, Sep 2, 2016)
5f305d8 Add NORMALIZE_WHITESPACE for python3 tests (vighneshbirodkar, Sep 2, 2016)
1392292 normalize whitespace for rst docs (vighneshbirodkar, Sep 2, 2016)
50d2360 normalizing whitespace again (vighneshbirodkar, Sep 2, 2016)
8f2f1d3 docstring changes and minor optimizations (vighneshbirodkar, Sep 6, 2016)
1c8accf Made tests pass by creating arrays with object dtype (vighneshbirodkar, Dec 28, 2016)
6edda8b Assign both values and n_values to self._values and remove redundant … (vighneshbirodkar, Dec 28, 2016)
1d2ca1a removed extra spaces for flake8 compat (vighneshbirodkar, Dec 28, 2016)
93ae49e REF Refactor OHE and avoid copies (Apr 25, 2017)
fd11366 WIP (Apr 26, 2017)
b96a8d2 Remove error-strict, add auto-strict (Apr 26, 2017)
7902352 Fixes for test failures (Apr 26, 2017)
4206d79 ENH Handle object and string types in LabelEncoder.transform (Apr 26, 2017)
d96fbc6 Fix tests (Apr 26, 2017)
0807604 Fix for doc test and scipy 0.11 sparse behavior (Apr 27, 2017)
b6d198a ENH Enforce dtypes in _apply_selected (Apr 27, 2017)
7db5ced TST More tests for OneHotEncoder (Apr 27, 2017)
ac9e455 DOC Add What's new and polish docstring for OHE (Apr 27, 2017)
2525019 Deprecate active_features_ (May 4, 2017)
7a53fe8 Switch from auto-strict to error-strict (May 4, 2017)
05af448 Deprecate integer and list of integer inputs to `values` (May 4, 2017)
d9d77ae Address CR (May 4, 2017)
840382e Fix whitespace in doc test (May 4, 2017)
ff4b30b Fix doctest for Python 2.7 (May 4, 2017)
49 changes: 30 additions & 19 deletions doc/modules/preprocessing.rst
@@ -378,10 +378,10 @@ Encoding categorical features
Often features are not given as continuous values but categorical.
For example a person could have features ``["male", "female"]``,
``["from Europe", "from US", "from Asia"]``,
``["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]``.
``["Firefox", "Chrome", "Safari", "Internet Explorer"]``.
Such features can be efficiently coded as integers, for instance
``["male", "from US", "uses Internet Explorer"]`` could be expressed as
``[0, 1, 3]`` while ``["female", "from Asia", "uses Chrome"]`` would be
``["male", "from US", "Internet Explorer"]`` could be expressed as
``[0, 1, 3]`` while ``["female", "from Asia", "Chrome"]`` would be
``[1, 2, 1]``.

Such integer representation can not be used directly with scikit-learn estimators, as these
@@ -397,31 +397,42 @@ only one active.
Continuing the example above::

>>> enc = preprocessing.OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]) # doctest: +ELLIPSIS
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
handle_unknown='error', n_values='auto', sparse=True)
>>> enc.transform([[0, 1, 3]]).toarray()
array([[ 1., 0., 0., 1., 0., 0., 0., 0., 1.]])
>>> enc.fit([['female', 'from US', 'Chrome'],
... ['male', 'from Asia', 'Firefox']]) \
... # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
OneHotEncoder(categorical_features='all',
dtype=<... 'numpy.float64'>, handle_unknown='error', n_values=None,
sparse=True, values='auto')
>>> enc.transform([['female', 'from Asia', 'Firefox']]).toarray()
array([[ 1., 0., 1., 0., 0., 1.]])

By default, how many values each feature can take is inferred automatically from the dataset.
It is possible to specify this explicitly using the parameter ``n_values``.
It is possible to specify this explicitly using the parameter ``values``.
There are two genders, three possible continents and four web browsers in our
dataset.
Then we fit the estimator, and transform a data point.
In the result, the first two numbers encode the gender, the next set of three
numbers the continent and the last four the web browser.
In the result, the first two values are genders, the next set of three
values are the continents and the last values are web browsers.

Note that, if there is a possibility that the training data might have missing categorical
features, one has to explicitly set ``n_values``. For example,

>>> enc = preprocessing.OneHotEncoder(n_values=[2, 3, 4])
>>> # Note that there are missing categorical values for the 2nd and 3rd
>>> # features
>>> enc.fit([[1, 2, 3], [0, 2, 0]]) # doctest: +ELLIPSIS
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
handle_unknown='error', n_values=[2, 3, 4], sparse=True)
>>> enc.transform([[1, 0, 0]]).toarray()
array([[ 0., 1., 1., 0., 0., 1., 0., 0., 0.]])
>>> browsers = ['Internet Explorer', 'Chrome' , 'Safari', 'Firefox']
>>> genders = ['male', 'female']
>>> locations = ['from Europe', 'from Asia', 'from US']
>>> enc = preprocessing.OneHotEncoder(values=[genders, locations, browsers])
>>> # Note that there are missing categorical values for the
>>> # 2nd and 3rd features
>>> enc.fit([['female', 'from US', 'Chrome'],
... ['male', 'from Asia', 'Internet Explorer']]) \
... # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
OneHotEncoder(categorical_features='all',
dtype=<... 'numpy.float64'>, handle_unknown='error', n_values=None,
sparse=True,
values=[...])

>>> enc.transform([['male', 'from Europe', 'Safari']]).toarray()
array([[ 0., 1., 0., 1., 0., 0., 0., 0., 1.]])

See :ref:`dict_feature_extraction` for categorical features that are represented
as a dict, not as integers.
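
The encoding shown in the hunk above can be sketched in plain Python. This is only an illustration, not the code in this PR and not scikit-learn's API: the class name `TinyOneHotEncoder` is hypothetical, and the sketch assumes that categories are kept in sorted order (as the commit "unique arrays are now sorted" and the documented outputs suggest) and that any unseen value is an error.

```python
from typing import List, Optional, Sequence


class TinyOneHotEncoder:
    """Hypothetical stand-in for illustration; not scikit-learn's OneHotEncoder."""

    def __init__(self, values: Optional[Sequence[Sequence[str]]] = None):
        # `values` mirrors the parameter name proposed in this PR:
        # one list of allowed categories per feature, or None to infer them.
        self.values = values

    def fit(self, X: Sequence[Sequence[str]]) -> "TinyOneHotEncoder":
        n_features = len(X[0])
        if self.values is None:
            # Infer the categories of each column from the training rows.
            columns = [{row[j] for row in X} for j in range(n_features)]
        else:
            columns = self.values
        # Assumption: categories are stored sorted, whatever order they arrive in.
        self.categories_ = [sorted(set(col)) for col in columns]
        return self

    def transform(self, X: Sequence[Sequence[str]]) -> List[List[float]]:
        out = []
        for row in X:
            encoded: List[float] = []
            for cats, value in zip(self.categories_, row):
                if value not in cats:
                    # Assumption: this sketch rejects every unseen value.
                    raise ValueError(f"unknown category {value!r}")
                # One slot per category, 1.0 only at the matching category.
                encoded.extend(1.0 if value == c else 0.0 for c in cats)
            out.append(encoded)
        return out


# Categories inferred from the data (first doctest in the hunk).
auto = TinyOneHotEncoder().fit(
    [["female", "from US", "Chrome"], ["male", "from Asia", "Firefox"]]
).transform([["female", "from Asia", "Firefox"]])

# Categories given explicitly (second doctest in the hunk).
genders = ["male", "female"]
locations = ["from Europe", "from Asia", "from US"]
browsers = ["Internet Explorer", "Chrome", "Safari", "Firefox"]
explicit = TinyOneHotEncoder(values=[genders, locations, browsers]).fit(
    [["female", "from US", "Chrome"], ["male", "from Asia", "Internet Explorer"]]
).transform([["male", "from Europe", "Safari"]])

print(auto)      # [[1.0, 0.0, 1.0, 0.0, 0.0, 1.0]]
print(explicit)  # [[0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]]
```

Under the sorting assumption, both results match the arrays in the doctests above. The second call only works because the categories were supplied up front: "from Europe" and "Safari" never appear in the rows passed to `fit`.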
21 changes: 20 additions & 1 deletion doc/whats_new.rst
@@ -171,6 +171,16 @@ Enhancements
removed by setting it to `None`.
:issue:`7674` by :user:`Yichuan Liu <yl565>`.

- :class:`preprocessing.OneHotEncoder` now fits and transforms inputs of
[Review comment (Member): I would split this between enhancements (the new stuff) and API changes (deprecations).]

any numerical or string type instead of only integer arrays.
It has additional fitted attributes ``feature_index_range_``,
``one_hot_feature_index_``, and ``categories_``.
In addition to previously allowed values, ``handle_unknown`` accepts "error-strict"
to error if any unknown values are seen during transformation.
:issue:`7327` and :issue:`8793` by
:user:`Vighnesh Birodkar <vighneshbirodkar>` and
:user:`Stephen Hoover <stephen-hoover>`.

Bug fixes
.........
- Fixed a bug where :class:`sklearn.ensemble.IsolationForest` uses an
@@ -329,6 +339,15 @@ API changes summary
the weighted impurity decrease from splitting is no longer at least
``min_impurity_decrease``. :issue:`8449` by `Raghav RV_`

- In :class:`preprocessing.OneHotEncoder`, deprecate the
``feature_indices_`` and ``active_features_`` attributes.
Deprecate integer and list of integer inputs to ``values``
in favor of lists of lists of categories.
The present behavior of ``handle_unknown="error"`` will
change to be the same as ``handle_unknown="error-strict"`` in v0.21.
:issue:`7327` and :issue:`8793` by
:user:`Vighnesh Birodkar <vighneshbirodkar>` and
:user:`Stephen Hoover <stephen-hoover>`.

.. _changes_0_18_1:

@@ -5070,4 +5089,4 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson.
.. _Anish Shah: https://github.com/AnishShah

.. _Neeraj Gangwar: http://neerajgangwar.in
.. _Arthur Mensch: https://amensch.fr
.. _Arthur Mensch: https://amensch.fr