Thanks to visit codestin.com
Credit goes to github.com

Skip to content

❗ This is a read-only mirror of the CRAN R package repository. textrecipes — Extra 'Recipes' for Text Processing. Homepage: https://github.com/tidymodels/textrecipeshttps://textrecipes.tidymodels.org/ Report bugs for this package: https://github.com/tidymodels/textrecipes/issues

License

Notifications You must be signed in to change notification settings

cran/textrecipes

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

textrecipes

R-CMD-check Codecov test coverage CRAN status Downloads Lifecycle: maturing

Introduction

textrecipes contain extra steps for the recipes package for preprocessing text data.

Installation

You can install the released version of textrecipes from CRAN with:

install.packages("textrecipes")

Install the development version from GitHub with:

# install.packages("pak")
pak::pak("tidymodels/textrecipes")

Example

In the following example we will go through the steps needed, to convert a character variable to the TF-IDF of its tokenized words after removing stopwords, and, limiting ourself to only the 10 most used words. The preprocessing will be conducted on the variable medium and artist.

library(recipes)
library(textrecipes)
library(modeldata)

data("tate_text")

okc_rec <- recipe(~ medium + artist, data = tate_text) %>%
  step_tokenize(medium, artist) %>%
  step_stopwords(medium, artist) %>%
  step_tokenfilter(medium, artist, max_tokens = 10) %>%
  step_tfidf(medium, artist)

okc_obj <- okc_rec %>%
  prep()

str(bake(okc_obj, tate_text))
#> tibble [4,284 × 20] (S3: tbl_df/tbl/data.frame)
#>  $ tfidf_medium_colour     : num [1:4284] 2.31 0 0 0 0 ...
#>  $ tfidf_medium_etching    : num [1:4284] 0 0.86 0.86 0.86 0 ...
#>  $ tfidf_medium_gelatin    : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_medium_lithograph : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_medium_paint      : num [1:4284] 0 0 0 0 2.35 ...
#>  $ tfidf_medium_paper      : num [1:4284] 0 0.422 0.422 0.422 0 ...
#>  $ tfidf_medium_photograph : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_medium_print      : num [1:4284] 0 0 0 0 0 ...
#>  $ tfidf_medium_screenprint: num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_medium_silver     : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_artist_akram      : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_artist_beuys      : num [1:4284] 0 0 0 0 0 ...
#>  $ tfidf_artist_ferrari    : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_artist_john       : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_artist_joseph     : num [1:4284] 0 0 0 0 0 ...
#>  $ tfidf_artist_león       : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_artist_richard    : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_artist_schütte    : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_artist_thomas     : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_artist_zaatari    : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...

Breaking changes

As of version 0.4.0, step_lda() no longer accepts character variables and instead takes tokenlist variables.

the following recipe

recipe(~text_var, data = data) %>%
  step_lda(text_var)

can be replaced with the following recipe to achive the same results

lda_tokenizer <- function(x) text2vec::word_tokenizer(tolower(x))
recipe(~text_var, data = data) %>%
  step_tokenize(text_var,
    custom_token = lda_tokenizer
  ) %>%
  step_lda(text_var)

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

About

❗ This is a read-only mirror of the CRAN R package repository. textrecipes — Extra 'Recipes' for Text Processing. Homepage: https://github.com/tidymodels/textrecipeshttps://textrecipes.tidymodels.org/ Report bugs for this package: https://github.com/tidymodels/textrecipes/issues

Resources

License

Stars

Watchers

Forks

Packages

No packages published