Tags: ropensci/textreuse
Store minhashes separately from hashes

Keeping minhashes and hashes in the same element of the list that makes up a TextReuseTextDocument was a mistake. The two are not at all the same, conceptually. Furthermore, the entire advantage of hashing the tokens is that the tokens can then be discarded. But if one needs to rehash, the tokens have to be kept around or the text retokenized, and tokenizing is the most expensive part of the process in terms of memory and computation.

This commit adds an element `x$minhashes` to TextReuseTextDocuments, provides appropriate accessor and existence functions, makes functions like `lsh()` use only minhashes instead of hashes, and rewrites documentation and vignettes as appropriate. The README is expanded to describe the three main kinds of analysis.
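As a rough illustration of the layout described above, here is a minimal base R sketch of a document list that keeps token hashes and minhashes in separate elements. It is not the textreuse implementation: `make_doc()`, `get_minhashes()`, `has_minhashes_sketch()` and `toy_minhash()` are hypothetical names invented for this example, and the toy minhash is only a stand-in for whatever minhash function the package actually uses.

```r
# Hypothetical sketch in base R; NOT the textreuse implementation.
# A document is a list whose token hashes and minhash signature live in
# separate elements, so one can be filled in without touching the other.
make_doc <- function(hashes, minhashes = NULL) {
  structure(list(hashes = hashes, minhashes = minhashes),
            class = "sketch_doc")
}

# Accessor and existence functions for the separate minhashes element.
get_minhashes <- function(x) x$minhashes
has_minhashes_sketch <- function(x) !is.null(x$minhashes)

# Toy minhash (an assumption for illustration): for each of n random
# (a, b) pairs, take the minimum of (a * h + b) mod p over all token hashes.
toy_minhash <- function(hashes, n = 4, seed = 1) {
  set.seed(seed)
  p <- 2147483647
  a <- sample.int(1000, n)
  b <- sample.int(1000, n)
  h <- as.numeric(hashes)  # use doubles to avoid integer overflow
  vapply(seq_len(n), function(i) min((a[i] * h + b[i]) %% p), numeric(1))
}

# The token hashes are computed once; the tokens themselves are not needed.
doc <- make_doc(hashes = c(101, 202, 303, 404))
has_minhashes_sketch(doc)  # FALSE: no minhashes stored yet

# The minhashes are derived from the stored hashes and kept in their own
# element, leaving doc$hashes untouched.
doc$minhashes <- toy_minhash(doc$hashes)
```

Because the signature sits in its own element, code that only needs minhashes (as `lsh()` does after this change) can check for and read that element without overwriting or reinterpreting the token hashes.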