@@ -124,9 +124,9 @@ and has no ``inverse_transform`` method.
Since the hash function might cause collisions between (unrelated) features,
a signed hash function is used and the sign of the hash value
- determines the sign of the value stored in the output matrix for a feature;
- this way, collisions are likely to cancel out rather than accumulate error,
- and the expected mean of any output feature's value is zero
+ determines the sign of the value stored in the output matrix for a feature.
+ This way, collisions are likely to cancel out rather than accumulate error,
+ and the expected mean of any output feature's value is zero.
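
To see the cancellation at work, here is a small sketch (the feature names
are arbitrary, and the exact column totals depend on the hashed names)::

    >>> from sklearn.feature_extraction import FeatureHasher
    >>> h = FeatureHasher(n_features=4)
    >>> sample = {'feature_%d' % i: 1 for i in range(10000)}
    >>> X = h.transform([sample])     # 10000 values hashed into only 4 columns
    >>> X.shape
    (1, 4)
    >>> bool(abs(X.toarray()).max() < 10000)   # the signs mostly cancel out
    True
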
If ``non_negative=True`` is passed to the constructor,
the absolute value is taken.
@@ -139,14 +139,20 @@ or ``chi2`` feature selectors that expect non-negative inputs.
``(feature, value)`` pairs, or strings,
depending on the constructor parameter ``input_type``.
Mappings are treated as lists of ``(feature, value)`` pairs,
- while single strings have an implicit value of 1.
- If a feature occurs multiple times in a sample, the values will be summed.
+ while single strings have an implicit value of 1,
+ so ``['feat1', 'feat2', 'feat3']`` is interpreted as
+ ``[('feat1', 1), ('feat2', 1), ('feat3', 1)]``.
+ If a single feature occurs multiple times in a sample,
+ the associated values will be summed
+ (so ``('feat', 2)`` and ``('feat', 3.5)`` become ``('feat', 5.5)``).
+ The output from :class:`FeatureHasher` is always a ``scipy.sparse`` matrix
+ in the CSR format.
+
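As a brief usage sketch with the default ``input_type='dict'`` (the feature
names are made up; only the shape is shown, since which columns get populated
depends on the hash values)::

    >>> from sklearn.feature_extraction import FeatureHasher
    >>> h = FeatureHasher(n_features=10)
    >>> D = [{'dog': 1, 'cat': 2, 'elephant': 4}, {'dog': 2, 'run': 5}]
    >>> X = h.transform(D)            # scipy.sparse CSR matrix
    >>> X.shape
    (2, 10)
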
Feature hashing can be employed in document classification,
but unlike :class:`text.CountVectorizer`,
:class:`FeatureHasher` does not do word
- splitting or any other preprocessing except Unicode-to-UTF-8 encoding.
- The output from :class:`FeatureHasher` is always a ``scipy.sparse`` matrix
- in the CSR format.
+ splitting or any other preprocessing except Unicode-to-UTF-8 encoding;
+ see :ref:`hashing_vectorizer`, below, for a combined tokenizer/hasher.

As an example, consider a word-level natural language processing task
that needs features extracted from ``(token, part_of_speech)`` pairs.
@@ -193,6 +199,10 @@ to determine the column index and sign of a feature, respectively.
The present implementation works under the assumption
that the sign bit of MurmurHash3 is independent of its other bits.
+ Since a simple modulo is used to transform the hash value into a column index,
+ it is advisable to use a power of two as the ``n_features`` parameter;
+ otherwise the features will not be mapped evenly to the columns.
+
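The following toy sketch (not scikit-learn's implementation; ``hashlib.md5``
stands in for MurmurHash3, so the resulting columns differ, and the helper
name is made up) illustrates how a single hash value yields both the column
index and the sign::

    import hashlib

    def toy_hash_features(sample, n_features):
        """Toy signed feature hashing of one {feature: value} mapping."""
        out = [0.0] * n_features
        for feature, value in sample.items():
            # 32-bit hash of the feature name (MD5 as a stand-in for MurmurHash3).
            h = int(hashlib.md5(feature.encode('utf-8')).hexdigest()[:8], 16)
            index = (h & 0x7fffffff) % n_features   # lower bits -> column, via modulo
            sign = -1.0 if h & 0x80000000 else 1.0  # sign bit -> sign of stored value
            out[index] += sign * value              # collisions add with mixed signs
        return out

    print(toy_hash_features({'dog': 1, 'cat': 2, 'elephant': 4}, n_features=8))
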
.. topic:: References: