scikit-learn · stephen-hoover · Mar 30, 2016 · May 2, 2016 · May 2, 2016 · May 2, 2016
diff --git a/doc/modules/preprocessing.rst b/doc/modules/preprocessing.rst
@@ -378,10 +378,10 @@ Encoding categorical features
 Often features are not given as continuous values but categorical.
 For example a person could have features ``["male", "female"]``,
 ``["from Europe", "from US", "from Asia"]``,
-``["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]``.
+``["Firefox", "Chrome", "Safari", "Internet Explorer"]``.
 Such features can be efficiently coded as integers, for instance
-``["male", "from US", "uses Internet Explorer"]`` could be expressed as
-``[0, 1, 3]`` while ``["female", "from Asia", "uses Chrome"]`` would be
+``["male", "from US", "Internet Explorer"]`` could be expressed as
+``[0, 1, 3]`` while ``["female", "from Asia", "Chrome"]`` would be
 ``[1, 2, 1]``.
 
 Such integer representation can not be used directly with scikit-learn estimators, as these
@@ -397,31 +397,42 @@ only one active.
 Continuing the example above::
 
   >>> enc = preprocessing.OneHotEncoder()
-  >>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  # doctest: +ELLIPSIS
-  OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
-         handle_unknown='error', n_values='auto', sparse=True)
-  >>> enc.transform([[0, 1, 3]]).toarray()
-  array([[ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])
+  >>> enc.fit([['female', 'from US', 'Chrome'],
+  ... ['male', 'from Asia', 'Firefox']])  \
+  ... # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
+  OneHotEncoder(categorical_features='all',
+         dtype=<... 'numpy.float64'>, handle_unknown='error', n_values=None,
+         sparse=True, values='auto')
+  >>> enc.transform([['female', 'from Asia', 'Firefox']]).toarray()
+  array([[ 1.,  0.,  1.,  0.,  0.,  1.]])
 
 By default, how many values each feature can take is inferred automatically from the dataset.
-It is possible to specify this explicitly using the parameter ``n_values``.
+It is possible to specify this explicitly using the parameter ``values``.
 There are two genders, three possible continents and four web browsers in our
 dataset.
 Then we fit the estimator, and transform a data point.
-In the result, the first two numbers encode the gender, the next set of three
-numbers the continent and the last four the web browser.
+In the result, the first two values encode the gender, the next set of three
+values the continent, and the remaining values the web browser.
 
 Note that, if there is a possibilty that the training data might have missing categorical
 features, one has to explicitly set ``n_values``. For example,
 
-    >>> enc = preprocessing.OneHotEncoder(n_values=[2, 3, 4])
-    >>> # Note that there are missing categorical values for the 2nd and 3rd
-    >>> # features
-    >>> enc.fit([[1, 2, 3], [0, 2, 0]])  # doctest: +ELLIPSIS
-    OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
-           handle_unknown='error', n_values=[2, 3, 4], sparse=True)
-    >>> enc.transform([[1, 0, 0]]).toarray()
-    array([[ 0.,  1.,  1.,  0.,  0.,  1.,  0.,  0.,  0.]])
+    >>> browsers = ['Internet Explorer', 'Chrome', 'Safari', 'Firefox']
+    >>> genders = ['male', 'female']
+    >>> locations = ['from Europe', 'from Asia', 'from US']
+    >>> enc = preprocessing.OneHotEncoder(values=[genders, locations, browsers])
+    >>> # Note that there are missing categorical values for the
+    >>> # 2nd and 3rd features
+    >>> enc.fit([['female', 'from US', 'Chrome'],
+    ... ['male', 'from Asia', 'Internet Explorer']]) \
+    ... # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
+    OneHotEncoder(categorical_features='all',
+           dtype=<... 'numpy.float64'>, handle_unknown='error', n_values=None,
+           sparse=True,
+           values=[...])
+
+    >>> enc.transform([['male', 'from Europe', 'Safari']]).toarray()
+    array([[ 0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])
 
 See :ref:`dict_feature_extraction` for categorical features that are represented
 as a dict, not as integers.
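
Review note: the column layout in the second doctest above can be checked by hand with a small pure-Python sketch. It is not the encoder's implementation. The helper name `one_hot_encode` is hypothetical, and the choice to sort each category list before assigning columns is an assumption inferred from the expected doctest output (the output only matches if the categories are ordered alphabetically rather than in the order passed to ``values``).

```python
def one_hot_encode(rows, categories):
    """Encode rows of categorical values as a dense list of 0/1 floats.

    `categories` holds one list of allowed values per feature. Each list is
    sorted first (an assumption, see the note above), so the column order
    does not depend on the order in which the categories were supplied.
    """
    sorted_cats = [sorted(cats) for cats in categories]
    encoded = []
    for row in rows:
        out = []
        for value, cats in zip(row, sorted_cats):
            if value not in cats:
                raise ValueError("unknown category %r" % (value,))
            # One column per category; exactly one of them is set to 1.
            out.extend(1.0 if value == cat else 0.0 for cat in cats)
        encoded.append(out)
    return encoded


genders = ['male', 'female']
locations = ['from Europe', 'from Asia', 'from US']
browsers = ['Internet Explorer', 'Chrome', 'Safari', 'Firefox']

row = one_hot_encode([['male', 'from Europe', 'Safari']],
                     [genders, locations, browsers])[0]
print(row)  # [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```

The nine columns split as two for gender, three for location and four for browser, which agrees with the array shown in the doctest. If the real encoder keeps the user-supplied order instead of sorting, the expected doctest output would need to change.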

diff --git a/doc/whats_new.rst b/doc/whats_new.rst
@@ -171,6 +171,16 @@ Enhancements
      removed by setting it to `None`.
      :issue:`7674` by:user:`Yichuan Liu <yl565>`.
 
+   - :class:`preprocessing.OneHotEncoder` now fits and transforms inputs of
+     any numerical or string type instead of only integer arrays.
+     It has additional fitted attributes ``feature_index_range_``,
+     ``one_hot_feature_index_``, and ``categories_``.
+     In addition to previously allowed values, ``handle_unknown`` accepts "error-strict"
+     to raise an error if any unknown values are seen during transformation.
+     :issue:`7327` and :issue:`8793` by
+     :user:`Vighnesh Birodkar <vighneshbirodkar>` and
+     :user:`Stephen Hoover <stephen-hoover>`.
+
 Bug fixes
 .........
    - Fixed a bug where :class:`sklearn.ensemble.IsolationForest` uses an
@@ -329,6 +339,15 @@ API changes summary
      the weighted impurity decrease from splitting is no longer alteast
      ``min_impurity_decrease``.  :issue:`8449` by `Raghav RV_`
 
+   - In :class:`preprocessing.OneHotEncoder`, deprecate the
+     ``feature_indices_`` and ``active_features_`` attributes.
+     Deprecate integer and list of integer inputs to ``values``
+     in favor of lists of lists of categories.
+     The present behavior of ``handle_unknown="error"`` will
+     change to be the same as ``handle_unknown="error-strict"`` in v0.21.
+     :issue:`7327` and :issue:`8793` by
+     :user:`Vighnesh Birodkar <vighneshbirodkar>` and
+     :user:`Stephen Hoover <stephen-hoover>`.
 
 .. _changes_0_18_1:
 
@@ -5070,4 +5089,4 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson.
 .. _Anish Shah: https://github.com/AnishShah
 
 .. _Neeraj Gangwar: http://neerajgangwar.in
-.. _Arthur Mensch: https://amensch.fr
+.. _Arthur Mensch: https://amensch.fr
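
Review note: the changelog entries above describe ``handle_unknown="error-strict"`` as raising on any unknown value seen during transformation. A minimal sketch of that idea for a single feature follows. The function `encode_feature` is hypothetical and is not part of scikit-learn. The sketch contrasts the strict mode only with ``"ignore"`` (an all-zero block for an unknown value); it does not model how the present ``"error"`` behavior differs from ``"error-strict"``, since the diff does not spell that out.

```python
def encode_feature(value, categories, handle_unknown="error-strict"):
    """One-hot encode a single value against a list of known categories."""
    if value not in categories:
        if handle_unknown == "error-strict":
            # Strict mode: any value outside the known categories is an error.
            raise ValueError("unknown category: %r" % (value,))
        if handle_unknown == "ignore":
            # Ignore mode: emit an all-zero block for this feature.
            return [0.0] * len(categories)
        raise ValueError("unsupported handle_unknown: %r" % (handle_unknown,))
    return [1.0 if value == cat else 0.0 for cat in categories]


browsers = ["Chrome", "Firefox", "Safari"]

known = encode_feature("Safari", browsers)
ignored = encode_feature("Opera", browsers, handle_unknown="ignore")

try:
    encode_feature("Opera", browsers)
    strict_raised = False
except ValueError:
    strict_raised = True
```

It may be worth stating in the changelog entry exactly which inputs the current ``"error"`` setting lets through, so readers can judge the impact of the planned v0.21 change.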