Description
I am getting an error similar to the one discussed in #9147, but I am using the latest release of scikit-learn (0.19.2), so I'm not sure whether this is a separate issue. I am working with a large corpus. Here is a partial traceback:
File "core.py", line 38, in <module>
feature_vectorizer.ingest(data_dir='./data/docs/')
File "/data/r_and_d/dochier/.venv/lib/python3.4/site-packages/freediscovery/engine/vectorizer.py", line 435, in ingest
self.transform()
File "/data/r_and_d/dochier/.venv/lib/python3.4/site-packages/freediscovery/engine/vectorizer.py", line 527, in transform
res = vect.fit_transform(text_gen)
File "/data/r_and_d/dochier/.venv/lib/python3.4/site-packages/sklearn/feature_extraction/text.py", line 869, in fit_transform
self.fixed_vocabulary_)
File "/data/r_and_d/dochier/.venv/lib/python3.4/site-packages/sklearn/feature_extraction/text.py", line 805, in _count_vocab
indptr.append(len(j_indices))
OverflowError: signed integer is greater than maximum
Steps/Code to Reproduce
This only happens when processing a large number of documents (>100k). I'm currently doing more testing to figure out at what value of len(j_indices) this error occurs; a sketch of what I suspect the mechanism is follows below.
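For what it's worth, here is a minimal sketch of what I believe the failure mechanism is (an assumption based on reading scikit-learn 0.19.x's feature_extraction/text.py, not a full-corpus reproduction): _count_vocab accumulates the CSR indptr in an array.array('i'), a signed 32-bit integer array on most platforms, so once the corpus yields more than 2**31 - 1 token occurrences the append overflows:

```python
# Minimal sketch of the suspected mechanism (assumption: scikit-learn 0.19.x,
# where _make_int_array() returns array.array('i'), i.e. signed 32-bit ints
# on most platforms). This avoids building a multi-billion-token corpus.
import array

indptr = array.array('i')   # same typecode _count_vocab uses for indptr
indptr.append(2**31 - 1)    # largest value a signed 32-bit int holds: fine

try:
    indptr.append(2**31)    # one past the limit, as when len(j_indices) > 2**31 - 1
except OverflowError as e:
    print(e)                # "signed integer is greater than maximum"
```

If that is the cause, the threshold would depend on the total number of token occurrences across the corpus rather than on the document count alone.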
Versions
Linux-4.4.0-111-generic-x86_64-with-Ubuntu-14.04-trusty
Python 3.4.3 (default, Nov 28 2017, 16:41:13)
[GCC 4.8.4]
NumPy 1.15.0
SciPy 1.1.0
Scikit-Learn 0.19.2