The package provides the following two-step abstraction:
- The vocabulary object is first built from the entire corpus with the help of the `vocab()`, `update_vocab()` and `prune_vocab()` functions.
- Then, the vocabulary is passed alongside the corpus to a variety of corpus pre-processing functions. Most of the `mlvocab` functions accept an `nbuckets` argument for partial or full hashing of the corpus.
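
The two steps above can be illustrated with a minimal base R sketch. It does not call the package and is not its implementation: `build_vocab()` and `doc_term_matrix()` are hypothetical helpers, and the toy corpus and the column names `term`, `term_count` and `doc_count` are assumptions made only for this example.

```r
# Toy corpus: a named list of already-tokenized documents (assumed input shape).
corpus <- list(
  a = c("the", "quick", "brown", "fox"),
  b = c("the", "lazy", "dog", "the", "end")
)

# Step 1: build the vocabulary from the entire corpus.
# build_vocab() is a hypothetical helper, not a function of the package.
build_vocab <- function(corpus) {
  term_count <- table(unlist(corpus, use.names = FALSE))
  # Number of documents in which each term occurs at least once.
  doc_count <- table(unlist(lapply(corpus, unique), use.names = FALSE))
  data.frame(
    term = names(term_count),
    term_count = as.integer(term_count),
    doc_count = as.integer(doc_count[names(term_count)]),
    stringsAsFactors = FALSE
  )
}

# Step 2: pass the vocabulary alongside the corpus to a pre-processing function.
# doc_term_matrix() is also hypothetical; it returns a dense count matrix
# whose columns follow the order of the vocabulary terms.
doc_term_matrix <- function(corpus, vocab) {
  m <- t(vapply(
    corpus,
    function(doc) as.integer(table(factor(doc, levels = vocab$term))),
    integer(nrow(vocab))
  ))
  dimnames(m) <- list(names(corpus), vocab$term)
  m
}

v <- build_vocab(corpus)
m <- doc_term_matrix(corpus, v)
```

Because the vocabulary fixes the set and order of terms up front, every downstream structure built from it shares the same column layout, which is the point of separating the two steps.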
Current functionality includes:
- term index sequences: `tix_seq()`, `tix_mat()` and `tix_df()` produce integer sequences suitable for direct consumption by various sequence models.
- term matrices: `dtm()`, `tdm()` and `tcm()` create document-term, term-document and term-co-occurrence matrices respectively.
- subsetting embedding matrices: given pre-trained word vectors, `prune_embeddings()` creates smaller embedding matrices, handling missing and unknown vocabulary terms gracefully.
- tfidf weighting: `tfidf()` computes various versions of term frequency, inverse document frequency weighting of `dtm` and `tdm` matrices.
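
As a rough illustration of what a tf-idf weighting does, here is one common variant written in base R. It is only a sketch: `tfidf_sketch()` is a hypothetical helper rather than the package's `tfidf()`, the count matrix is made up, and the choice of row-proportion term frequency with unsmoothed logarithmic inverse document frequency is an assumption, not necessarily one of the versions the package computes.

```r
# Toy document-term count matrix (rows = documents, columns = terms).
dtm_counts <- matrix(
  c(1, 1, 0,
    2, 0, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("a", "b"), c("the", "fox", "dog"))
)

# tfidf_sketch() is a hypothetical helper, not the package's tfidf().
tfidf_sketch <- function(m) {
  tf <- m / rowSums(m)       # term frequency as a proportion of each document
  df <- colSums(m > 0)       # number of documents containing each term
  idf <- log(nrow(m) / df)   # unsmoothed inverse document frequency
  sweep(tf, 2, idf, `*`)     # multiply each term column by its idf
}

w <- tfidf_sketch(dtm_counts)
```

In this variant a term that occurs in every document, such as `the` here, gets a weight of zero, while terms confined to a single document are weighted up.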
The package is in an alpha state. API changes are likely.