CountVectorizer Performance Problem #5306

Closed
@fayeshine

Description


I was doing a text classification task recently and found CountVectorizer / TfidfVectorizer quite slow. Looking into the source code, I found the function CountVectorizer._count_vocab:

def _count_vocab(self, raw_documents, fixed_vocab):
    """Create sparse feature matrix, and vocabulary where fixed_vocab=False
    """
    if fixed_vocab:
        vocabulary = self.vocabulary_
    else:
        # Add a new value when a new vocabulary item is seen
        vocabulary = defaultdict()
        vocabulary.default_factory = vocabulary.__len__

    analyze = self.build_analyzer()
    j_indices = _make_int_array()
    indptr = _make_int_array()
    indptr.append(0)
    for doc in raw_documents:
        for feature in analyze(doc):
            try:
                j_indices.append(vocabulary[feature])
            except KeyError:
                # Ignore out-of-vocabulary items for fixed_vocab=True
                continue
        indptr.append(len(j_indices))

    if not fixed_vocab:
        # disable defaultdict behaviour
        vocabulary = dict(vocabulary)
        if not vocabulary:
            raise ValueError("empty vocabulary; perhaps the documents only"
                             " contain stop words")

    j_indices = frombuffer_empty(j_indices, dtype=np.intc)
    indptr = np.frombuffer(indptr, dtype=np.intc)
    values = np.ones(len(j_indices))

    X = sp.csr_matrix((values, j_indices, indptr),
                      shape=(len(indptr) - 1, len(vocabulary)),
                      dtype=self.dtype)
    X.sum_duplicates()
    return vocabulary, X
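
As a side note, the defaultdict trick at the top of this function assigns a new consecutive integer index to each previously unseen token on first lookup; here is a minimal standalone illustration of just that trick (not scikit-learn code, only a sketch):

from collections import defaultdict

# Every missing key is given the current size of the dict as its value,
# so tokens receive consecutive indices 0, 1, 2, ... the first time they appear.
vocabulary = defaultdict()
vocabulary.default_factory = vocabulary.__len__

for token in ["cat", "dog", "cat", "bird"]:
    vocabulary[token]

print(dict(vocabulary))  # {'cat': 0, 'dog': 1, 'bird': 2}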

Then I simply changed two lines:

def _count_vocab(self, raw_documents, fixed_vocab):
    """Create sparse feature matrix, and vocabulary where fixed_vocab=False
    """
    if fixed_vocab:
        vocabulary = self.vocabulary_
    else:
        # Add a new value when a new vocabulary item is seen
        vocabulary = defaultdict()
        vocabulary.default_factory = vocabulary.__len__

    analyze = self.build_analyzer()
    # j_indices = _make_int_array()   # original line, replaced below
    j_indices = []
    indptr = _make_int_array()
    indptr.append(0)
    for doc in raw_documents:
        for feature in analyze(doc):
            try:
                j_indices.append(vocabulary[feature])
            except KeyError:
                # Ignore out-of-vocabulary items for fixed_vocab=True
                continue
        indptr.append(len(j_indices))

    if not fixed_vocab:
        # disable defaultdict behaviour
        vocabulary = dict(vocabulary)
        if not vocabulary:
            raise ValueError("empty vocabulary; perhaps the documents only"
                             " contain stop words")

    # j_indices = frombuffer_empty(j_indices, dtype=np.intc)   # original line, replaced below
    j_indices = np.array(j_indices, dtype=np.intc)
    indptr = np.frombuffer(indptr, dtype=np.intc)
    values = np.ones(len(j_indices))

    X = sp.csr_matrix((values, j_indices, indptr),
                      shape=(len(indptr) - 1, len(vocabulary)),
                      dtype=self.dtype)
    X.sum_duplicates()
    return vocabulary, X

This change brings the function's running time down from 27s to 21s. My question is: why is a plain Python list plus np.array() faster than array.array('i') plus frombuffer_empty()? Should we just switch the code to the simpler, faster approach?
(I have 10,000 documents, about 50 million words in total.)
This is my first question in the scikit-learn community, so apologies if it is a silly one.
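
For what it's worth, the two container strategies can be timed in isolation with a small timeit sketch like the one below (just a measurement template with made-up sizes, using plain np.frombuffer instead of scikit-learn's frombuffer_empty helper; absolute numbers will vary by machine and Python version):

import timeit

setup = "import numpy as np; from array import array; n = 5 * 10**6"

# Variant A: append into array.array('i'), then view it with np.frombuffer
variant_a = (
    "buf = array('i')\n"
    "for i in range(n):\n"
    "    buf.append(i)\n"
    "arr = np.frombuffer(buf, dtype=np.intc)"
)

# Variant B: append into a plain list, then copy it into an ndarray
variant_b = (
    "buf = []\n"
    "for i in range(n):\n"
    "    buf.append(i)\n"
    "arr = np.array(buf, dtype=np.intc)"
)

print("array.array + np.frombuffer:", timeit.timeit(variant_a, setup, number=3))
print("list + np.array:            ", timeit.timeit(variant_b, setup, number=3))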
