Description
Hello, I am preparing to do Chinese text mining with jiebaR on a Korean-language Windows system (I am based in Korea).
My computing environment, as reported by library(jiebaR); sessionInfo(), is as follows.
library(jiebaR); sessionInfo();
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=Korean_Korea.949 LC_CTYPE=Korean_Korea.949 LC_MONETARY=Korean_Korea.949
[4] LC_NUMERIC=C LC_TIME=Korean_Korea.949
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_1.0.7 tidytext_0.3.2 stringr_1.4.0 jiebaR_0.11 jiebaRD_0.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 rstudioapi_0.13 magrittr_2.0.1 tidyselect_1.1.1 lattice_0.20-44
[6] R6_2.5.1 rlang_0.4.11 fansi_0.5.0 tools_4.1.1 grid_4.1.1
[11] utf8_1.2.2 cli_3.0.1 DBI_1.1.1 janeaustenr_0.1.5 ellipsis_0.3.2
[16] assertthat_0.2.1 tibble_3.1.3 lifecycle_1.0.0 crayon_1.4.1 Matrix_1.3-4
[21] purrr_0.3.4 SnowballC_0.7.0 tokenizers_0.2.1 vctrs_0.3.8 glue_1.4.2
[26] stringi_1.7.4 compiler_4.1.1 pillar_1.6.2 generics_0.1.0 pkgconfig_2.0.3
I managed to set up the word segmentation process as shown below. The result, however, is disappointing: the user-defined dictionary does not seem to have any effect.
# Read the source text as UTF-8 and split it into sentences
bri_text <- readLines("BRIVA_revised3.txt", encoding = "UTF-8")
bri_stnc <- bri_text %>% as_tibble() %>% unnest_tokens(input = value, output = sentence, token = "sentences")
bri_stnc <- bri_stnc %>% mutate(sentence_id = row_number())
# Segment each sentence with jiebaR (user dictionary supplied), then split the result into words
user_dict <- "C:/Users/user/Documents/R/win-library/4.1/jiebaRD/dict/user.dict.utf8"
seg <- worker(bylines = TRUE, user = user_dict)
bri_df <- bri_stnc %>% mutate(text = sapply(segment(sentence, seg), paste, collapse = " ")) %>% unnest_tokens(word, text)
bri_df
A tibble: 4,175 x 3
sentence sentence_id word
1 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 2000
2 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 多年
3 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 前
4 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 <U+4E9A>
5 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 <U+6B27>
6 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 <U+9646>上
7 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 勤<U+52B3>
8 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 勇敢
9 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 的
10 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 探索
... with 4,165 more rows.
The problem is that the result is identical with and without the user-defined dictionary: the tibble stays at 4,175 x 3 even when user.dict.utf8 is supplied. By contrast, I confirmed that the stop-word dictionary (stopwords.dict) works as expected. I have no idea what the problem might be.
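One thing that may be worth checking, as an assumption rather than a confirmed cause: the output above shows code points such as <U+2F24> and <U+2F08>, which belong to the Kangxi Radicals block rather than the usual CJK Unified Ideographs (for comparison, 大 is U+5927 and 人 is U+4EBA). If the source text contains such look-alike characters, dictionary entries typed with the ordinary ideographs might not match them. The sketch below uses only base R to list any such code points; `find_radical_codepoints` and `sample_text` are hypothetical names introduced for illustration.

```r
# Hypothetical helper: list the code points in `x` that fall in the
# CJK Radicals Supplement (U+2E80..U+2EFF) or Kangxi Radicals
# (U+2F00..U+2FDF) blocks.
find_radical_codepoints <- function(x) {
  # Convert every string to a vector of integer code points
  cps <- unlist(lapply(enc2utf8(x), utf8ToInt))
  cps <- cps[!is.na(cps)]
  in_radical_block <- (cps >= 0x2E80 & cps <= 0x2EFF) |
    (cps >= 0x2F00 & cps <= 0x2FDF)
  # Format the distinct hits in the usual U+XXXX notation
  sprintf("U+%04X", unique(cps[in_radical_block]))
}

# Illustrative sample only: three ordinary ideographs with one
# Kangxi radical (U+2F24) in the third position
sample_text <- "\u4e9a\u6b27\u2f24\u9646"
find_radical_codepoints(sample_text)
```

Running `find_radical_codepoints(bri_text)` on the text read above would show whether the input file contains such characters; an empty result would rule this explanation out.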
For reference, I attach a screen capture of the "user.dict.utf8" file below.
Thanks for your advice!