Commit f5a47bd

Merge pull request scikit-learn#9 from benjaminwilson/master
fixes #8
2 parents ae7ab21 + 54fb2de commit f5a47bd

File tree

1 file changed: +2 −29 lines


tutorial/working_with_text_data.rst

Lines changed: 2 additions & 29 deletions
@@ -33,9 +33,6 @@ description, quoted from the `website
 experiments in text applications of machine learning techniques,
 such as text classification and text clustering.
 
-To download the dataset, go to ``$TUTORIAL_HOME/data/twenty_newsgroups``
-and run the ``fetch_data.py`` script.
-
 In the following we will use the built-in dataset loader for 20 newsgroups
 from scikit-learn. Alternatively it is possible to download the dataset
 manually from the web-site and use the :func:`sklearn.datasets.load_files`
@@ -158,32 +155,7 @@ and ``scikit-learn`` has built-in support for these structures.
 Tokenizing text with ``scikit-learn``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-``scikit-learn`` offers a couple of basic yet useful utilities to
-work with text data. The first one is a preprocessor that removes
-accents and converts to lowercase on roman languages::
-
-  >>> from sklearn.feature_extraction.text import RomanPreprocessor
-  >>> text = u"J'ai bien mang\xe9."
-  >>> print RomanPreprocessor().preprocess(text)
-  j'ai bien mange.
-
-The second one is a utility that splits the text into words after
-having applied the preprocessor::
-
-  >>> from sklearn.feature_extraction.text import WordNGramAnalyzer
-  >>> WordNGramAnalyzer().analyze(text)
-  ['ai', 'bien', 'mange']
-
-Note that punctuation and single letter words have automatically
-been removed.
-
-It is further possible to configure ``WordNGramAnalyzer`` to extract n-grams
-instead of single words::
-
-  >>> WordNGramAnalyzer(min_n=1, max_n=2).analyze(text)
-  [u'ai', u'bien', u'mange', u'ai bien', u'bien mange']
-
-These tools are wrapped into a higher level component that is able to build a
+Text preprocessing, tokenizing and filtering of stopwords are included in a high level component that is able to build a
 dictionary of features and transform documents to feature vectors::
 
   >>> from sklearn.feature_extraction.text import CountVectorizer
@@ -192,6 +164,7 @@ dictionary of features and transform documents to feature vectors::
   >>> X_train_counts.shape
   (2257, 33883)
 
+``CountVectorizer`` supports counts of N-grams of words or consecutive characters.
 Once fitted, the vectorizer has built a dictionary of feature indices::
 
   >>> count_vect.vocabulary.get(u'algorithm')
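The workflow the new tutorial text describes (vectorize a corpus, inspect the matrix shape, look up a token's feature index, and count n-grams) can be sketched against a current scikit-learn install. Note that the modern API exposes the fitted dictionary as ``vocabulary_`` (trailing underscore, unlike the ``vocabulary`` attribute shown in the diff) and takes an ``ngram_range`` parameter in place of the ``min_n``/``max_n`` arguments the deleted ``WordNGramAnalyzer`` used; the two-document corpus below is invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus standing in for the 20 newsgroups training texts.
docs = [
    "machine learning with text data",
    "text classification and text clustering",
]

# Default behaviour: lowercase, tokenize, and count single words.
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(docs)
print(X_counts.shape)  # (n_documents, n_distinct_tokens)

# Once fitted, the vectorizer maps each token to a feature index.
print(count_vect.vocabulary_.get("text"))

# Counting word n-grams (here unigrams and bigrams) instead of single words
# grows the feature space.
bigram_vect = CountVectorizer(ngram_range=(1, 2))
X_bigrams = bigram_vect.fit_transform(docs)
print(X_bigrams.shape)
```

Character n-grams, mentioned in the added sentence, are requested the same way with ``analyzer="char"``.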

Comments (0)