textreuse

- 1. What does this package do? (explain in 50 words or less)  

This package detects document similarity and text reuse, and implements the minhash and locality-sensitive hashing (LSH) algorithms.
- 2. Paste the full DESCRIPTION file inside a code block (bounded by ``` on either end).

```
Package: textreuse
Type: Package
Title: Detect Text Reuse and Document Similarity
Version: 0.0.1.9001
Date: 2015-09-17
Authors@R: c(person("Lincoln", "Mullen", role = c("aut", "cre"),
    email = "lincoln@lincolnmullen.com"))
Description: Tools for measuring similarity among documents and detecting
    passages which have been reused. Implements shingled n-gram, skip n-gram,
    and other tokenizers; similarity/dissimilarity functions; pairwise
    comparisons; and minhash and locality sensitive hashing algorithms.
License: MIT + file LICENSE
LazyData: TRUE
URL: https://github.com/lmullen/textreuse
BugReports: https://github.com/lmullen/textreuse/issues
VignetteBuilder: knitr
Depends: R (>= 3.1.2)
Imports: assertthat (>= 0.1),
    digest(>= 0.6.8),
    hash (>= 2.2.6),
    NLP (>= 0.1.8),
    Rcpp (>= 0.12.0),
    stringr (>= 1.0.0)
Suggests: testthat (>= 0.10.0),
    knitr (>= 1.11),
    rmarkdown (>= 0.8)
LinkingTo: Rcpp,
    BH
```
- 3. URL for the package (the development repository, not a stylized html page)

https://github.com/lmullen/textreuse/
- 4. What data source(s) does it work with (if applicable)?

This package expects the user to have documents in plain text. Future versions could provide other readers (for example, XML readers, as the tm package does), but I think that functionality probably does not belong in this package.
- 5. Who is the target audience?

Detecting document similarity is a common problem when working with natural language, so I anticipate that this package will be broadly useful for anyone working in NLP.
- 6. Are there other R packages that accomplish the same thing? If so, what is different about yours?

No, there are no other R packages that implement minhash/locality-sensitive hashing. The tm package does implement some document similarity measures, but these are similarity in terms of content rather than in terms of actual borrowing of text. In other words, it would mark two documents that both talked about football as being similar, even if they had no shared text.

That said, this package extends classes from the NLP and tm packages, so it is intended to play nice with other R NLP packages.
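
To make the distinction between shared content and shared text concrete, here is a minimal sketch in base R. It does not use the package's API: the helpers `shingle()` and `jaccard()` and the three example sentences are hypothetical, written only for illustration. The idea is to compare sets of overlapping word n-grams, so two documents score above zero only when they share actual runs of words.

```r
# Illustrative sketch only: base R, not the textreuse API.
# Hypothetical helper: split text into overlapping word n-grams (shingles).
shingle <- function(text, n = 3) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: Jaccard similarity, i.e. the number of shared
# shingles divided by the number of distinct shingles in either set.
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

original <- "The quarterback threw a long pass down the field"
borrowed <- "Reports say the quarterback threw a long pass on Sunday"
on_topic <- "Football fans watched their team win the game"

# Shares a run of words with the original, so the score is above zero.
sim_borrowed <- jaccard(shingle(original), shingle(borrowed))

# Same subject (football) but no shared three-word shingles.
sim_on_topic <- jaccard(shingle(original), shingle(on_topic))
```

Exact pairwise comparison like this grows quadratically with the number of documents, which is the cost that minhash and locality-sensitive hashing are meant to reduce by approximating these scores and narrowing the set of candidate pairs.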
- 1. Check the box next to each policy below, confirming that you agree. These are mandatory.
- [x] This package does not violate the Terms of Service of any service it interacts with.
- [x] The repository has continuous integration with Travis and/or another service
- [x] The package contains a vignette
- [x] The package contains a reasonably complete readme with devtools install instructions
- [x] The package contains unit tests
- [x] The package only exports functions to the NAMESPACE that are intended for end users
- 1. Do you agree to follow the [rOpenSci packaging guidelines](https://github.com/ropensci/packaging_guide)? These aren't mandatory, but we strongly suggest you follow them. If you disagree with anything, please explain.

Yes, I comply with all those guidelines. The one exception is class naming: classes such as `TextReuseTextDocument` follow the precedent set by the NLP package. I don't like the name any better than you do, but that is how those packages do it.
- [x] Are there any package dependencies not on CRAN?
- [x] Do you intend for this package to go on CRAN?
- [x] Does the package have a CRAN accepted license?
- [x] Did `devtools::check()` produce any errors or warnings? If so paste them below.
- 1. Please add explanations below for any exceptions to the above:
- 10. If this is a resubmission following rejection, please explain the change in circumstances.

