CountVectorizer Performance Problem #5306

Closed
@fayeshine

Description


I was doing a text classification task recently and found CountVectorizer / TfidfVectorizer quite slow. Looking into the source code, I found the function CountVectorizer._count_vocab:

def _count_vocab(self, raw_documents, fixed_vocab):
    """Create sparse feature matrix, and vocabulary where fixed_vocab=False
    """
    if fixed_vocab:
        vocabulary = self.vocabulary_
    else:
        # Add a new value when a new vocabulary item is seen
        vocabulary = defaultdict()
        vocabulary.default_factory = vocabulary.__len__

    analyze = self.build_analyzer()
    j_indices = _make_int_array()
    indptr = _make_int_array()
    indptr.append(0)
    for doc in raw_documents:
        for feature in analyze(doc):
            try:
                j_indices.append(vocabulary[feature])
            except KeyError:
                # Ignore out-of-vocabulary items for fixed_vocab=True
                continue
        indptr.append(len(j_indices))

    if not fixed_vocab:
        # disable defaultdict behaviour
        vocabulary = dict(vocabulary)
        if not vocabulary:
            raise ValueError("empty vocabulary; perhaps the documents only"
                             " contain stop words")

    j_indices = frombuffer_empty(j_indices, dtype=np.intc)
    indptr = np.frombuffer(indptr, dtype=np.intc)
    values = np.ones(len(j_indices))

    X = sp.csr_matrix((values, j_indices, indptr),
                      shape=(len(indptr) - 1, len(vocabulary)),
                      dtype=self.dtype)
    X.sum_duplicates()
    return vocabulary, X
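
As a side note, the defaultdict trick at the top of this function assigns a new consecutive integer index to each previously unseen token on first lookup; here is a minimal standalone illustration of just that trick (not scikit-learn code, only a sketch):

from collections import defaultdict

# Every missing key is given the current size of the dict as its value,
# so tokens receive consecutive indices 0, 1, 2, ... the first time they appear.
vocabulary = defaultdict()
vocabulary.default_factory = vocabulary.__len__

for token in ["cat", "dog", "cat", "bird"]:
    vocabulary[token]

print(dict(vocabulary))  # {'cat': 0, 'dog': 1, 'bird': 2}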

Then I simply changed two lines:

def _count_vocab(self, raw_documents, fixed_vocab):
    """Create sparse feature matrix, and vocabulary where fixed_vocab=False
    """
    if fixed_vocab:
        vocabulary = self.vocabulary_
    else:
        # Add a new value when a new vocabulary item is seen
        vocabulary = defaultdict()
        vocabulary.default_factory = vocabulary.__len__

    analyze = self.build_analyzer()
    # j_indices = _make_int_array()   # original line, replaced below
    j_indices = []
    indptr = _make_int_array()
    indptr.append(0)
    for doc in raw_documents:
        for feature in analyze(doc):
            try:
                j_indices.append(vocabulary[feature])
            except KeyError:
                # Ignore out-of-vocabulary items for fixed_vocab=True
                continue
        indptr.append(len(j_indices))

    if not fixed_vocab:
        # disable defaultdict behaviour
        vocabulary = dict(vocabulary)
        if not vocabulary:
            raise ValueError("empty vocabulary; perhaps the documents only"
                             " contain stop words")

    # j_indices = frombuffer_empty(j_indices, dtype=np.intc)   # original line, replaced below
    j_indices = np.array(j_indices, dtype=np.intc)
    indptr = np.frombuffer(indptr, dtype=np.intc)
    values = np.ones(len(j_indices))

    X = sp.csr_matrix((values, j_indices, indptr),
                      shape=(len(indptr) - 1, len(vocabulary)),
                      dtype=self.dtype)
    X.sum_duplicates()
    return vocabulary, X

This change brings the function's running time down from 27s to 21s. My question is: why is a plain Python list plus np.array() faster than array.array('i') plus frombuffer_empty()? Should we just switch the code to the simpler, faster approach?
(I have 10,000 documents, about 50 million words in total.)
This is my first question in the scikit-learn community, so apologies if it is a silly one.
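
For what it's worth, the two container strategies can be timed in isolation with a small timeit sketch like the one below (just a measurement template with made-up sizes, using plain np.frombuffer instead of scikit-learn's frombuffer_empty helper; absolute numbers will vary by machine and Python version):

import timeit

setup = "import numpy as np; from array import array; n = 5 * 10**6"

# Variant A: append into array.array('i'), then view it with np.frombuffer
variant_a = (
    "buf = array('i')\n"
    "for i in range(n):\n"
    "    buf.append(i)\n"
    "arr = np.frombuffer(buf, dtype=np.intc)"
)

# Variant B: append into a plain list, then copy it into an ndarray
variant_b = (
    "buf = []\n"
    "for i in range(n):\n"
    "    buf.append(i)\n"
    "arr = np.array(buf, dtype=np.intc)"
)

print("array.array + np.frombuffer:", timeit.timeit(variant_a, setup, number=3))
print("list + np.array:            ", timeit.timeit(variant_b, setup, number=3))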
