Corpus and Vocabulary Preprocessing Utilities for Natural Language Pipelines (an R package)

The package provides a two-step abstraction:

First, the vocabulary object is built from the entire corpus with the vocab(), update_vocab() and prune_vocab() functions.
Then, the vocabulary is passed alongside the corpus to a variety of corpus pre-processing functions. Most mlvocab functions accept an nbuckets argument for partial or full hashing of the corpus.
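
The two steps can be sketched conceptually in base R. This is only an illustration under stated assumptions, not the package's implementation: it does not call mlvocab, the helpers build_vocab(), prune_by_count() and term_index() are hypothetical names, and the toy hash (sum of code points modulo nbuckets) merely stands in for whatever hashing the package actually uses. It assumes that "partial hashing" means terms missing from the vocabulary are assigned to one of nbuckets extra indices placed after the known terms.

```r
# Toy corpus: a named list with one character vector of tokens per document.
corpus <- list(
  a = c("the", "quick", "brown", "fox"),
  b = c("the", "lazy", "dog"),
  c = c("the", "quick", "dog", "barks")
)

# Step 1a (hypothetical helper): count terms and the documents they occur in.
build_vocab <- function(corpus) {
  term_count <- table(unlist(corpus, use.names = FALSE))
  doc_count <- table(unlist(lapply(corpus, unique), use.names = FALSE))
  terms <- names(term_count)
  data.frame(
    term = terms,
    term_count = as.integer(term_count),
    doc_count = as.integer(doc_count[terms]),
    stringsAsFactors = FALSE
  )
}

# Step 1b (hypothetical helper): drop rare terms, most frequent terms first.
prune_by_count <- function(v, term_count_min = 1L) {
  v <- v[v$term_count >= term_count_min, , drop = FALSE]
  v <- v[order(-v$term_count, v$term), , drop = FALSE]
  rownames(v) <- NULL
  v
}

# Step 2 (hypothetical helper): map tokens to vocabulary indices.
# Unknown tokens go to one of `nbuckets` extra indices after the vocabulary.
# With nbuckets = 0 they stay NA. The hash is a toy, chosen for readability.
term_index <- function(tokens, v, nbuckets = 0L) {
  ix <- match(tokens, v$term)
  unknown <- is.na(ix)
  if (nbuckets > 0L && any(unknown)) {
    h <- vapply(
      tokens[unknown],
      function(t) as.integer(sum(utf8ToInt(t)) %% nbuckets),
      integer(1),
      USE.NAMES = FALSE
    )
    ix[unknown] <- nrow(v) + 1L + h
  }
  ix
}

v <- prune_by_count(build_vocab(corpus), term_count_min = 2L)
term_index(c("the", "fox", "dog"), v, nbuckets = 2L)
```

Keeping the vocabulary as a plain table with counts is what makes pruning cheap: the corpus is scanned once, and every later pre-processing step only needs the pruned table plus the bucket count.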

Current functionality includes:

term index sequences: tix_seq(), tix_mat() and tix_df() produce integer sequences suitable for direct consumption by various sequence models.
term matrices: dtm(), tdm() and tcm() create document-term, term-document and term-co-occurrence matrices respectively.
subsetting embedding matrices: given pre-trained word vectors, prune_embeddings() creates smaller embedding matrices, handling missing and unknown vocabulary terms gracefully.
tfidf weighting: tfidf() computes various versions of term frequency-inverse document frequency (TF-IDF) weighting of dtm and tdm matrices.
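
To make the shape of these outputs concrete, here is a small base-R sketch of term index sequences, a padded index matrix, a document-term matrix and one TF-IDF variant. It is an illustration only and does not call mlvocab: the names seqs, pad, maxlen, dtm_dense, tf and idf are hypothetical, the matrices are dense rather than sparse, zero is assumed as the padding value, and the weighting shown (row-normalized term frequency times log(N / document frequency)) is just one common variant, not necessarily one the package implements.

```r
# Toy corpus: a named list with one character vector of tokens per document.
corpus <- list(
  a = c("the", "quick", "brown", "fox"),
  b = c("the", "lazy", "dog"),
  c = c("the", "quick", "dog", "barks")
)

# Unpruned vocabulary: all distinct terms in sorted order.
terms <- sort(unique(unlist(corpus, use.names = FALSE)))

# Term index sequences: one integer vector per document.
seqs <- lapply(corpus, match, table = terms)

# Fixed-width index matrix: right-pad with 0 and truncate to maxlen columns.
maxlen <- 4L
pad <- function(ix, maxlen) c(ix, rep(0L, maxlen))[seq_len(maxlen)]
mat <- t(vapply(seqs, pad, integer(maxlen), maxlen = maxlen))

# Dense document-term matrix: rows are documents, columns are terms.
dtm_dense <- t(vapply(
  corpus,
  function(doc) tabulate(match(doc, terms), nbins = length(terms)),
  integer(length(terms))
))
colnames(dtm_dense) <- terms

# One TF-IDF variant: l1-normalized term frequency times log(N / df).
tf <- dtm_dense / rowSums(dtm_dense)
idf <- log(nrow(dtm_dense) / colSums(dtm_dense > 0))
tfidf_dense <- sweep(tf, 2, idf, `*`)

round(tfidf_dense, 3)
```

A term that occurs in every document, such as "the" here, gets an inverse document frequency of log(1) = 0 under this variant, which is the usual motivation for smoothed alternatives.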

The package is in an alpha state. API changes are likely.
