Closed
Description
When a CountVectorizer is used to extract n-grams and n is greater than the number of words (or characters) in the document, it raises a ValueError with the "empty vocabulary" message, which is the solution to #1207. This is frustrating behavior when that CountVectorizer is part of a FeatureUnion whose other steps may have successfully extracted features. Here is sample code that shows the issue:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion
steps = [('uni', CountVectorizer(ngram_range=(1, 1))),
         ('tri', CountVectorizer(ngram_range=(3, 3))),
         ('five', CountVectorizer(ngram_range=(5, 5)))]
union = FeatureUnion(steps)
texts = ['This is a test']
union.fit(texts)
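To see why one step ends up with an empty vocabulary, here is a rough pure-Python sketch of word n-gram extraction. It is a simplified stand-in, not scikit-learn's actual code: the helper `word_ngrams` is hypothetical, and it assumes CountVectorizer's documented defaults (lowercasing and the token pattern `(?u)\b\w\w+\b`, which drops single-character tokens such as "a").

```python
import re

# Assumption: this mirrors CountVectorizer's documented defaults
# (lowercase=True, token_pattern=r"(?u)\b\w\w+\b"). It is an illustration,
# not the library's implementation.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def word_ngrams(text, n):
    """Return the word n-grams of `text` (hypothetical helper)."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    # When n exceeds the number of tokens, the range is empty,
    # so no n-grams are produced at all.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Print the n-grams each step of the union would have to work with.
for n in (1, 3, 5):
    print(n, word_ngrams("This is a test", n))
```

Under these assumptions the sample text yields three tokens, so the unigram and trigram steps each find something, while the 5-gram step finds nothing. That one empty step appears to be what makes the whole `union.fit(texts)` call fail.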