
Commit 5b9a65e

DOC copyedit FeatureHasher narrative
1 parent c59b2f6 commit 5b9a65e

1 file changed (+18, -8 lines)


doc/modules/feature_extraction.rst

Lines changed: 18 additions & 8 deletions
@@ -124,9 +124,9 @@ and has no ``inverse_transform`` method.
 
 Since the hash function might cause collisions between (unrelated) features,
 a signed hash function is used and the sign of the hash value
-determines the sign of the value stored in the output matrix for a feature;
-this way, collisions are likely to cancel out rather than accumulate error,
-and the expected mean of any output feature's value is zero
+determines the sign of the value stored in the output matrix for a feature.
+This way, collisions are likely to cancel out rather than accumulate error,
+and the expected mean of any output feature's value is zero.
 
 If ``non_negative=True`` is passed to the constructor,
 the absolute value is taken.
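The signed-hashing behaviour this hunk describes can be sketched in a few lines of plain Python. This is an illustrative toy, not scikit-learn's implementation: the helper name ``hashed_features`` and the MD5-based hash (a deterministic stand-in for MurmurHash3) are assumptions made for the example.

```python
import hashlib


def hashed_features(pairs, n_features=8):
    """Toy signed feature hashing (hypothetical helper, not sklearn's).

    MD5 stands in for MurmurHash3: one hash bit picks the sign, the
    remaining bits pick the column, so colliding features tend to
    cancel rather than accumulate.
    """
    out = [0.0] * n_features
    for feat, value in pairs:
        h = int.from_bytes(hashlib.md5(feat.encode("utf-8")).digest()[:4], "big")
        sign = 1.0 if h & 0x80000000 else -1.0       # top bit -> sign
        out[(h & 0x7FFFFFFF) % n_features] += sign * value
    return out


row = hashed_features([("cat", 2.0), ("cat", 3.0)])
# the same feature always lands in the same column with the same sign,
# so its values accumulate in a single entry of magnitude 5.0
```

With ``non_negative=True`` semantics, one would take the absolute value of each entry afterwards.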
@@ -139,14 +139,20 @@ or ``chi2`` feature selectors that expect non-negative inputs.
 ``(feature, value)`` pairs, or strings,
 depending on the constructor parameter ``input_type``.
 Mappings are treated as lists of ``(feature, value)`` pairs,
-while single strings have an implicit value of 1.
-If a feature occurs multiple times in a sample, the values will be summed.
+while single strings have an implicit value of 1,
+so ``['feat1', 'feat2', 'feat3']`` is interpreted as
+``[('feat1', 1), ('feat2', 1), ('feat3', 1)]``.
+If a single feature occurs multiple times in a sample,
+the associated values will be summed
+(so ``('feat', 2)`` and ``('feat', 3.5)`` become ``('feat', 5.5)``).
+The output from :class:`FeatureHasher` is always a ``scipy.sparse`` matrix
+in the CSR format.
 
 Feature hashing can be employed in document classification,
 but unlike :class:`text.CountVectorizer`,
 :class:`FeatureHasher` does not do word
-splitting or any other preprocessing except Unicode-to-UTF-8 encoding.
-The output from :class:`FeatureHasher` is always a ``scipy.sparse`` matrix
-in the CSR format.
+splitting or any other preprocessing except Unicode-to-UTF-8 encoding;
+see :ref:`hashing_vectorizer`, below, for a combined tokenizer/hasher.
 
 As an example, consider a word-level natural language processing task
 that needs features extracted from ``(token, part_of_speech)`` pairs.
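The input-handling rules this hunk documents (bare strings count as value 1; values for a repeated feature are summed) can be checked with a small sketch. ``normalize_sample`` is a hypothetical helper written for illustration, not part of the :class:`FeatureHasher` API.

```python
from collections import Counter


def normalize_sample(sample):
    """Hypothetical sketch of the input handling described above:
    a bare string means (feature, 1), and values for a repeated
    feature are summed."""
    totals = Counter()
    for item in sample:
        feat, value = (item, 1) if isinstance(item, str) else item
        totals[feat] += value
    return dict(totals)


normalize_sample(["feat1", "feat2", "feat3"])
# -> {'feat1': 1, 'feat2': 1, 'feat3': 1}
normalize_sample([("feat", 2), ("feat", 3.5)])
# -> {'feat': 5.5}
```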
@@ -193,6 +199,10 @@ to determine the column index and sign of a feature, respectively.
 The present implementation works under the assumption
 that the sign bit of MurmurHash3 is independent of its other bits.
 
+Since a simple modulo is used to transform the hash function to a column index,
+it is advisable to use a power of two as the ``n_features`` parameter;
+otherwise the features will not be mapped evenly to the columns.
+
 
 .. topic:: References:
 
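The modulo argument in the paragraph added by the last hunk is easy to verify with a toy count. ``bucket_counts`` is a hypothetical helper for illustration: it maps every possible hash value of a given bit width to a column with the same simple modulo and tallies the occupancy of each column.

```python
def bucket_counts(n_features, hash_bits=4):
    """Count how many distinct hash_bits-bit hash values fall in each
    column under the simple modulo mapping (illustrative helper)."""
    counts = [0] * n_features
    for h in range(2 ** hash_bits):
        counts[h % n_features] += 1
    return counts


bucket_counts(8)  # power of two: every column receives 2 of the 16 values
bucket_counts(6)  # not a power of two: [3, 3, 3, 3, 2, 2], uneven
```

With 16 possible hash values, 8 columns split them evenly (2 each), while 6 columns receive an uneven ``[3, 3, 3, 3, 2, 2]`` split, which is exactly the bias the paragraph warns about.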