@@ -33,9 +33,6 @@ description, quoted from the `website
experiments in text applications of machine learning techniques,
such as text classification and text clustering.
- To download the dataset, go to ``$TUTORIAL_HOME/data/twenty_newsgroups``
- and run the ``fetch_data.py`` script.
-
In the following we will use the built-in dataset loader for 20 newsgroups
from scikit-learn. Alternatively it is possible to download the dataset
manually from the web-site and use the :func:`sklearn.datasets.load_files`
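
For readers who want to try the loader right away, here is a minimal sketch of how the built-in fetcher might be called. The four-category list and the ``twenty_train`` variable name are illustrative assumptions, though the resulting 2257 training documents match the shape quoted later in this section::

>>> from sklearn.datasets import fetch_20newsgroups
>>> # Illustrative subset of categories (an assumption, not taken from the diff)
>>> categories = ['alt.atheism', 'soc.religion.christian',
...               'comp.graphics', 'sci.med']
>>> twenty_train = fetch_20newsgroups(subset='train',
...     categories=categories, shuffle=True, random_state=42)
>>> len(twenty_train.data)
2257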
@@ -158,32 +155,7 @@ and ``scikit-learn`` has built-in support for these structures.
Tokenizing text with ``scikit-learn``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- ``scikit-learn`` offers a couple of basic yet useful utilities to
- work with text data. The first one is a preprocessor that removes
- accents and converts to lowercase on roman languages::
-
- >>> from sklearn.feature_extraction.text import RomanPreprocessor
- >>> text = u"J'ai bien mang\xe9."
- >>> print RomanPreprocessor().preprocess(text)
- j'ai bien mange.
-
- The second one is a utility that splits the text into words after
- having applied the preprocessor::
-
- >>> from sklearn.feature_extraction.text import WordNGramAnalyzer
- >>> WordNGramAnalyzer().analyze(text)
- ['ai', 'bien', 'mange']
-
- Note that punctuation and single letter words have automatically
- been removed.
-
- It is further possible to configure ``WordNGramAnalyzer`` to extract n-grams
- instead of single words::
-
- >>> WordNGramAnalyzer(min_n=1, max_n=2).analyze(text)
- [u'ai', u'bien', u'mange', u'ai bien', u'bien mange']
-
- These tools are wrapped into a higher level component that is able to build a
+ Text preprocessing, tokenizing and filtering of stopwords are included in a high-level component that is able to build a
dictionary of features and transform documents to feature vectors::
>>> from sklearn.feature_extraction.text import CountVectorizer
@@ -192,6 +164,7 @@ dictionary of features and transform documents to feature vectors::
>>> X_train_counts.shape
(2257, 33883)
+ ``CountVectorizer`` supports counts of N-grams of words or consecutive characters (see the sketch below).
Once fitted, the vectorizer has built a dictionary of feature indices::
>>> count_vect.vocabulary.get(u'algorithm')
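
To tie the pieces together, here is a self-contained sketch of the vectorization step plus the N-gram options mentioned in the added line above. The ``ngram_range`` and ``analyzer`` parameter names follow the current ``CountVectorizer`` API and are an assumption here, since the release this diff targets may spell these options differently; the shape is simply the one quoted earlier::

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> count_vect = CountVectorizer()
>>> # twenty_train.data is the list of raw training documents
>>> X_train_counts = count_vect.fit_transform(twenty_train.data)
>>> X_train_counts.shape  # exact feature count may vary across versions
(2257, 33883)
>>> # Word unigrams and bigrams together (parameter name assumed):
>>> bigram_vect = CountVectorizer(ngram_range=(1, 2))
>>> # Character 3-grams instead of words:
>>> char_vect = CountVectorizer(analyzer='char', ngram_range=(3, 3))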