Description
I am getting an error similar to the one discussed in #9147, but I am using the latest release of scikit-learn (0.19.2), so I'm not sure whether this is a separate issue. I am working with a large corpus. Here is a partial traceback:
File "core.py", line 38, in <module>
feature_vectorizer.ingest(data_dir='./data/docs/')
File "/data/r_and_d/dochier/.venv/lib/python3.4/site-packages/freediscovery/engine/vectorizer.py", line 435, in ingest
self.transform()
File "/data/r_and_d/dochier/.venv/lib/python3.4/site-packages/freediscovery/engine/vectorizer.py", line 527, in transform
res = vect.fit_transform(text_gen)
File "/data/r_and_d/dochier/.venv/lib/python3.4/site-packages/sklearn/feature_extraction/text.py", line 869, in fit_transform
self.fixed_vocabulary_)
File "/data/r_and_d/dochier/.venv/lib/python3.4/site-packages/sklearn/feature_extraction/text.py", line 805, in _count_vocab
indptr.append(len(j_indices))
OverflowError: signed integer is greater than maximum
Steps/Code to Reproduce
This only happens when processing a large number of documents (>100k). I'm currently doing more testing to figure out at what value of len(j_indices) this error occurs; a sketch of what I suspect the mechanism is follows below.
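For what it's worth, here is a minimal sketch of what I believe the failure mechanism is (an assumption based on reading scikit-learn 0.19.x's feature_extraction/text.py, not a full-corpus reproduction): _count_vocab accumulates the CSR indptr in an array.array('i'), a signed 32-bit integer array on most platforms, so once the corpus yields more than 2**31 - 1 token occurrences the append overflows:

```python
# Minimal sketch of the suspected mechanism (assumption: scikit-learn 0.19.x,
# where _make_int_array() returns array.array('i'), i.e. signed 32-bit ints
# on most platforms). This avoids building a multi-billion-token corpus.
import array

indptr = array.array('i')   # same typecode _count_vocab uses for indptr
indptr.append(2**31 - 1)    # largest value a signed 32-bit int holds: fine

try:
    indptr.append(2**31)    # one past the limit, as when len(j_indices) > 2**31 - 1
except OverflowError as e:
    print(e)                # "signed integer is greater than maximum"
```

If that is the cause, the threshold would depend on the total number of token occurrences across the corpus rather than on the document count alone.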
Versions
Linux-4.4.0-111-generic-x86_64-with-Ubuntu-14.04-trusty
Python 3.4.3 (default, Nov 28 2017, 16:41:13)
[GCC 4.8.4]
NumPy 1.15.0
SciPy 1.1.0
Scikit-Learn 0.19.2