FeatureUnion of CountVectorizers returns "empty vocabulary" error #3164

When a CountVectorizer is used to extract n-grams and n is greater than the number of words (or characters) in the document, it raises a ValueError with the "empty vocabulary" message. That error is how https://github.com/scikit-learn/scikit-learn/issues/1207 was resolved. It is frustrating behavior when the CountVectorizer is part of a FeatureUnion whose other steps may have extracted features successfully, because the one failing step aborts the whole union. Here is sample code that shows the issue:

```
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

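# Three word-level vectorizers; the document below is too short
# to contain any 5-gram, so the 'five' step ends up with no features.
steps = [('uni', CountVectorizer(ngram_range=(1, 1))),
         ('tri', CountVectorizer(ngram_range=(3, 3))),
         ('five', CountVectorizer(ngram_range=(5, 5)))]

union = FeatureUnion(steps)
texts = ['This is a test']
union.fit(texts)  # raises ValueError ("empty vocabulary") from the 'five' step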
```
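
To make the failure easier to see, here is a rough standard-library sketch of word n-gram extraction. It is not scikit-learn's code: `word_ngrams` is a hypothetical helper, and it only assumes CountVectorizer's default settings (lowercasing, and a token pattern that keeps words of two or more characters, so "a" is dropped).

```python
import re

# Assumption: mirrors CountVectorizer's default token_pattern,
# which keeps only tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def word_ngrams(text, n):
    """Return the word n-grams of `text` (hypothetical helper, not sklearn API)."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    # When n exceeds the number of tokens the range is empty,
    # so no n-grams are produced at all.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "This is a test"
for n in (1, 3, 5):
    print(n, word_ngrams(text, n))
```

Under these assumptions the text yields three tokens, so the unigram and trigram steps both find features while the 5-gram step finds none, which is the step that ends up with an empty vocabulary.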

