[MRG+1] Text vectorizers memory usage improvement (v2) #7272

Merged: 3 commits, Sep 15, 2016

25 changes: 18 additions & 7 deletions sklearn/feature_extraction/text.py
@@ -685,9 +685,11 @@ def _sort_features(self, X, vocabulary):
         sorted_features = sorted(six.iteritems(vocabulary))
         map_index = np.empty(len(sorted_features), dtype=np.int32)
         for new_val, (term, old_val) in enumerate(sorted_features):
-            map_index[new_val] = old_val
             vocabulary[term] = new_val
-        return X[:, map_index]
+            map_index[old_val] = new_val
+
+        X.indices = map_index.take(X.indices, mode='clip')
+        return X

     def _limit_features(self, X, vocabulary, high=None, low=None,
                         limit=None):

jnothman (Member):

Please don't return from an in-place operation.

Author (Member):

@jnothman Thanks for the review! I agree with all your other comments; for this one, though, the documentation of CountVectorizer._sort_features states "Returns a reordered matrix and modifies the vocabulary in place", hence the return X (same as before this PR). Do you want me to change this behavior in the current PR?
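
For context, a minimal standalone sketch (not from the PR; the toy matrix and mappings below are made up) contrasting the two approaches discussed here:

    import numpy as np
    import scipy.sparse as sp

    # Toy document-term matrix whose columns 0..2 hold the unsorted
    # terms 'c', 'a', 'b'.
    X = sp.csr_matrix(np.array([[1, 0, 2],
                                [0, 3, 1]]))

    # Old approach: fancy indexing with a new->old column permutation.
    # This allocates a second, full copy of the matrix.
    new_to_old = np.array([1, 2, 0])          # columns of 'a', 'b', 'c'
    X_copy = X[:, new_to_old]

    # New approach: rewrite the stored column indices with an old->new
    # mapping; the data and indptr arrays are reused, no copy is made.
    old_to_new = np.array([2, 0, 1], dtype=np.int32)
    X.indices = old_to_new.take(X.indices, mode='clip')
    X.sort_indices()   # keep per-row column indices sorted

    assert (X.toarray() == X_copy.toarray()).all()

Peak memory for the fancy-indexing form is roughly twice the matrix size, while the in-place form only allocates the small map_index array.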


@@ -741,16 +743,25 @@ def _count_vocab(self, raw_documents, fixed_vocab):
             vocabulary.default_factory = vocabulary.__len__

         analyze = self.build_analyzer()
-        j_indices = _make_int_array()
+        j_indices = []
         indptr = _make_int_array()
+        values = _make_int_array()
         indptr.append(0)
         for doc in raw_documents:
+            feature_counter = {}
             for feature in analyze(doc):
                 try:
-                    j_indices.append(vocabulary[feature])
+                    feature_idx = vocabulary[feature]
+                    if feature_idx not in feature_counter:
+                        feature_counter[feature_idx] = 1
+                    else:
+                        feature_counter[feature_idx] += 1
                 except KeyError:
                     # Ignore out-of-vocabulary items for fixed_vocab=True
                     continue
+
+            j_indices.extend(feature_counter.keys())
+            values.extend(feature_counter.values())
             indptr.append(len(j_indices))

         if not fixed_vocab:
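
The counting loop above can be tried in isolation; here is a toy standalone version (count_docs is a hypothetical helper, not part of scikit-learn, and documents are assumed to be pre-tokenized):

    from array import array

    def count_docs(tokenized_docs, vocabulary):
        """Build the three CSR construction arrays, storing each term
        once per document instead of once per occurrence."""
        j_indices = []             # column index of every stored count
        values = array('i')        # the counts themselves
        indptr = array('i', [0])   # row boundaries, CSR-style
        for doc in tokenized_docs:
            feature_counter = {}
            for token in doc:
                try:
                    idx = vocabulary[token]
                except KeyError:   # out-of-vocabulary token: skip it
                    continue
                feature_counter[idx] = feature_counter.get(idx, 0) + 1
            j_indices.extend(feature_counter.keys())
            values.extend(feature_counter.values())
            indptr.append(len(j_indices))
        return j_indices, values, indptr

    vocab = {'to': 0, 'be': 1, 'or': 2, 'not': 3}
    j, v, p = count_docs([['to', 'be', 'or', 'not', 'to', 'be']], vocab)
    # 'to' and 'be' each occur twice but are stored once with count 2,
    # so four entries are materialized instead of the six 1-entries
    # that X.sum_duplicates() would have had to merge afterwards.
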
@@ -760,14 +771,14 @@ def _count_vocab(self, raw_documents, fixed_vocab):
                 raise ValueError("empty vocabulary; perhaps the documents only"
                                  " contain stop words")

-        j_indices = frombuffer_empty(j_indices, dtype=np.intc)
+        j_indices = np.asarray(j_indices, dtype=np.intc)
         indptr = np.frombuffer(indptr, dtype=np.intc)
-        values = np.ones(len(j_indices))
+        values = frombuffer_empty(values, dtype=np.intc)

         X = sp.csr_matrix((values, j_indices, indptr),
                           shape=(len(indptr) - 1, len(vocabulary)),
                           dtype=self.dtype)
-        X.sum_duplicates()
+        X.sort_indices()
         return vocabulary, X

     def fit(self, raw_documents, y=None):
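
Continuing the toy sketch above, the three arrays feed straight into scipy.sparse.csr_matrix; because each (row, column) pair now appears at most once, the expensive sum_duplicates() pass is unnecessary and only the cheaper index sort remains (the literal values below assume the example input used earlier):

    import numpy as np
    import scipy.sparse as sp

    j_indices = np.asarray([0, 1, 2, 3], dtype=np.intc)  # from count_docs
    values = np.asarray([2, 2, 1, 1], dtype=np.intc)
    indptr = np.asarray([0, 4], dtype=np.intc)

    X = sp.csr_matrix((values, j_indices, indptr),
                      shape=(len(indptr) - 1, 4))
    X.sort_indices()    # orders column indices within each row
    print(X.toarray())  # [[2 2 1 1]]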