Releases: trinker/textclean
version 0.8.0
NEWS
Versioning
Releases will be numbered with the following semantic versioning format:
<major>.<minor>.<patch>
And constructed with the following guidelines:
- Breaking backward compatibility bumps the major (and resets the minor
  and patch)
- New additions without breaking backward compatibility bumps the minor
  (and resets the patch)
- Bug fixes and misc changes bumps the patch
textclean 0.8.0
BUG FIXES
fgsub had a bug in which the original pattern matched a location in the
string, but the replacement was then performed on the entire string rather
than at the location of the first pattern match. This means the extracted
string was used as a search term and might be found in places other than the
original location (e.g., a leading boundary in '^T' replaced with '__' may
have led to '__he __itle' rather than '__he Title' as expected in the string
'The Title'). See #35 for details. The fix adds some time to the computation
but is safer.
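The difference between the two behaviours can be reproduced in base R. This is a minimal sketch of the failure mode described above, not the fgsub implementation; the variable names are illustrative only.

```r
x <- "The Title"

# Locate the first match of the pattern; '^T' matches only the leading 'T'.
m <- regexpr("^T", x)
hit <- regmatches(x, m)  # the extracted string, "T"

# Problematic approach: search for the extracted string across the whole input,
# which also hits the 'T' in 'Title'.
searched_everywhere <- gsub(hit, "__", x, fixed = TRUE)

# Safer approach: substitute only at the location that actually matched.
replaced_in_place <- x
regmatches(replaced_in_place, m) <- "__"

searched_everywhere  # "__he __itle"
replaced_in_place    # "__he Title"
```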
NEW FEATURES
- replace_to/replace_from added to remove from/to begin/end of string to/from
  a character(s).
- The following replacement functions were added to provide remediation for
  problems found in check_text: replace_email, replace_hash, replace_tag, &
  replace_url.
MINOR FEATURES
check_text picks up checks and n arguments. The former allows the user to
specify which checks to conduct. The latter allows the user to truncate the
output to n elements with a closing '...[truncated]...'. This makes the
function more useful and the code easier to maintain.
IMPROVEMENTS
replace_non_ascii did not replace all non-ASCII characters. This has been
fixed by an explicit replacement of '[^ -~]+', which matches any run of
characters outside the printable ASCII range (space through tilde).
See issue #34 for details.
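The character class can be tried on its own with base R's gsub. This is a sketch of the regex only, assuming an empty-string replacement; it is not the body of replace_non_ascii, which also transliterates characters before this step.

```r
# Non-ASCII characters are written as \u escapes so the example does not
# depend on the encoding of the source file.
x <- c("caf\u00e9 au lait", "plain text", "na\u00efve fa\u00e7ade")

# '[ -~]' spans the printable ASCII range (space, 0x20, through tilde, 0x7E);
# negating it matches everything else.
has_non_ascii <- grepl("[^ -~]", x)
cleaned <- gsub("[^ -~]+", "", x)

has_non_ascii  # TRUE FALSE TRUE
cleaned        # "caf au lait" "plain text" "nave faade"
```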
CHANGES
textclean 0.7.3
Maintenance release to bring package up to date with the lexicon package API changes.
textclean 0.7.0 - 0.7.2
NEW FEATURES
- match_tokens added to find all the tokens that match a regex(es) within a
  given text vector. This is useful when combined with the replace_tokens
  function.
- Fixed versions of drop_element/keep_element added to allow for dropping
  elements specified by a known vector rather than a regex.
- The collapse and glue functions from the glue package are reexported
  for easy string manipulation.
- replace_date added for normalizing dates.
- replace_time added for normalizing time stamps.
- replace_money added for normalizing money references.
- mgsub picks up a safe argument using the mgsub package as the backend.
  In addition, mgsub_regex_safe was added to make the usage explicit. The safe
  mode comes at the cost of speed.
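The kind of problem a safe multi-substitution addresses can be shown in base R. The sketch below assumes the risk is a replacement being re-matched by a later pattern when substitutions run one after another; it does not call mgsub or reproduce its backend, and all names are illustrative.

```r
x <- "hey, how are you?"
patterns     <- c("hey", "how", "are", "you")
replacements <- c("how", "are", "you", "hey")

# Sequential substitution: each pass can re-match text produced by an earlier pass.
sequential <- x
for (i in seq_along(patterns)) {
  sequential <- gsub(patterns[i], replacements[i], sequential, fixed = TRUE)
}

# Simultaneous substitution: find every match once, then look up its replacement.
# Pasting the patterns into an alternation is only safe here because they
# contain no regex meta-characters.
lookup <- setNames(replacements, patterns)
m <- gregexpr(paste(patterns, collapse = "|"), x)
simultaneous <- x
regmatches(simultaneous, m) <- lapply(regmatches(x, m), function(h) unname(lookup[h]))

sequential    # "hey, hey hey hey?"
simultaneous  # "how, are you hey?"
```

The single-pass version does more bookkeeping per string, which is consistent with the speed trade-off noted above.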
IMPROVEMENTS
- replace_names drops the replacement of
  c('An', 'To', 'Oh', 'So', 'Do', 'He', 'Ha', 'In', 'Pa', 'Un') which are
  likely words and not names.
- replace_html picks up some additional symbol replacements including:
  c("™", "“", "”", "‘", "’", "•", "·", "⋅", "–", "—", "≠", "½", "¼", "¾", "°", "←", "→", "…").
textclean 0.6.0 - 0.6.3
NEW FEATURES
- replace_kern added to replace a form of informal emphasis in which the
  writer takes words >2 letters long, capitalizes the entire word, and places
  spaces in between each letter. This was contributed by Stack Overflow's
  @ctwheels: https://stackoverflow.com/a/47438305/1000343.
- replace_internet_slang added to replace Internet acronyms and abbreviations
  with machine friendly word equivalents.
- replace_word_elongation added to replace word elongations (a.k.a. "word
  lengthening") with the most likely normalized word form. See
  http://www.aclweb.org/anthology/D11-105 for details.
- fgsub added for the ability to match, extract, operate a function over the
  extracted strings, & replace the original matches with the extracted strings.
  This performs similar functionality to gsubfn::gsubfn but is less powerful.
  For more powerful needs see the gsubfn package.
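The match, extract, operate, and replace cycle that fgsub provides can be approximated in base R. The helper below, apply_to_matches, is a hypothetical name for illustration and is not part of textclean; it assumes the supplied function is vectorized and returns one string per match.

```r
# Hypothetical helper: apply `fun` to every match of `pattern` and write the
# results back at the matched locations.
apply_to_matches <- function(x, pattern, fun) {
  m <- gregexpr(pattern, x, perl = TRUE)  # locate all matches per element
  hits <- regmatches(x, m)                # extract the matched strings
  regmatches(x, m) <- lapply(hits, fun)   # replace them in place
  x
}

x <- c("I have 2 cats and 10 dogs", "no digits here")
doubled <- apply_to_matches(x, "[0-9]+", function(s) as.character(as.numeric(s) * 2))

doubled  # "I have 4 cats and 20 dogs" "no digits here"
```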
textclean 0.4.0 - 0.5.1
BUG FIXES
replace_grade did not use fixed = TRUE for its call to mgsub. This could
result in the plus signs being interpreted as meta-characters. This has been
corrected.
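The effect of the plus sign can be seen with base R's gsub, which takes the same fixed argument. This is an illustration of the meta-character issue only; the sentence and the replacement word are made up for the example.

```r
grades <- "She earned an A+ and a B"

# As a regular expression, 'A+' means "one or more 'A'", so the literal
# plus sign is left behind.
as_regex <- gsub("A+", "excellent", grades)

# With fixed = TRUE the pattern is matched character for character.
as_fixed <- gsub("A+", "excellent", grades, fixed = TRUE)

as_regex  # "She earned an excellent+ and a B"
as_fixed  # "She earned an excellent and a B"
```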
NEW FEATURES
- replace_names added to remove/replace common first and last names from text
  data.
- make_plural added to make a vector of singular noun forms plural.
- replace_emoji and replace_emoji_identifier added for replacing emojis with
  text or an identifier token for use in the sentimentr package.
MINOR FEATURES
- mgsub_regex and mgsub_fixed added to provide wrappers for mgsub that make
  their use apparent without setting the fixed argument.
- replace_curly_quote added to replace curly quotes with straight versions.
IMPROVEMENTS
- replace_non_ascii now uses stringi::stri_trans_general to coerce more
  non-ASCII characters to ASCII format.
- check_text now checks for HTML characters/tags. Thanks to @peter Gensler
  for suggesting this (see issue #15).
CHANGES
filter_ functions deprecated in favor of drop_/keep_ versions of filter
functions. This change was made to address the opposite meaning that dplyr's
filter has, which retains rows matching a pattern by default.
textclean 0.3.1
BUG FIXES
replace_tokens added to complement mgsub for times when the user wants to
replace fixed tokens with a single value or remove them entirely. This yields
an optimized solution that is much faster than mgsub.
CHANGES
mgsub no longer uses trim = TRUE by default.
textclean 0.2.1 - 0.3.0
BUG FIXES
check_text reported to use replace_incomplete rather than
add_missing_endmark when an endmark is missing.
NEW FEATURES
- The replace_emoticon, replace_grade and replace_rating functions have
  been moved from the sentimentr package to textclean as these are
  cleaning functions. This makes the functions more modular and generalizable
  to all types of text cleaning. These functions are still imported and
  exported by sentimentr.
- replace_html added to remove html tags and replace symbols with appropriate
  ASCII symbols.
- add_missing_endmarks added to detect missing endmarks and replace with the
  desired symbol.
IMPROVEMENTS
replace_number now uses the english package making it faster and more
maintainable. In addition, the function now handles decimal places as well.
textclean 0.1.0 - 0.2.0
BUG FIXES
check_text reported NA as non-ASCII. This has been fixed.
NEW FEATURES
- check_text added to report on potential problems in a text vector.
- replace_ordinal added to replace ordinal numbers (e.g., 1st) with word
  representation (e.g., first).
- swap added to swap two patterns simultaneously.
- filter_element added to exclude matching elements from a vector.
textclean 0.0.1
This package is a collection of tools to clean and process text. Many of these tools have been taken from the qdap package and revamped to be more intuitive, better named, and faster.