Authored by Chen Z.Y. and Ni B.L.
This is a project for Chinese tokenization, developed as our homework for a Natural Language Processing course.
We've implemented the following:
- Rule-based tokenization with shortest-path segmentation
- A simple 2-gram language model without part-of-speech tags
- An HMM model with part-of-speech tags
- Character-based tagging
We developed with Python 3.7 on Windows 10; the code also runs successfully on Linux.
The only non-standard package we use is pyahocorasick. Run the following command to install it. Note that it requires a C++ toolchain to build; on Windows, install Microsoft Visual C++ Build Tools first.
```
pip install pyahocorasick
```
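As a quick sanity check that the package works, here is a minimal sketch of Aho-Corasick dictionary matching with pyahocorasick; the three-word lexicon is purely illustrative, not the dictionary the project actually uses.

```python
import ahocorasick

# Build an Aho-Corasick automaton over a toy lexicon; a real run
# would load a full dictionary instead of these three words.
automaton = ahocorasick.Automaton()
for word in ["北京", "大学", "北京大学"]:
    automaton.add_word(word, word)
automaton.make_automaton()

# iter() yields (end_index, value) for every dictionary word found
# in the sentence, including overlapping matches.
sentence = "北京大学的学生"
for end_index, word in automaton.iter(sentence):
    start = end_index - len(word) + 1
    print(start, end_index, word)
```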
Clone the repository:

```
git clone https://github.com/volgachen/Chinese-Tokenization
```
Run `python test_exp1.py --use-re --score Markov` to experiment with the rule-based algorithms. You can drop `--use-re` if you do not want re-replacement (regular-expression preprocessing). `--score` can be chosen from `None`, `Markov`, and `HMM`, which decides whether the tokenizer is combined with a language model.
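To make the rule-based approach concrete, below is a minimal sketch of shortest-path segmentation: treat each dictionary word (and each single character, as a fallback) as an edge in a lattice over the sentence, then pick the path with the fewest words by dynamic programming. The function and toy dictionary are illustrative; `test_exp1.py` contains the project's actual implementation.

```python
def shortest_path_segment(sentence, dictionary):
    """Segment a sentence by finding the path through the word
    lattice that uses the fewest words. Single characters are
    always allowed as fallback edges, so a path always exists."""
    n = len(sentence)
    # best[i] = (word count, segmentation) for the prefix sentence[:i]
    best = [None] * (n + 1)
    best[0] = (0, [])
    for i in range(n):
        if best[i] is None:
            continue
        for j in range(i + 1, n + 1):
            piece = sentence[i:j]
            if j == i + 1 or piece in dictionary:
                candidate = (best[i][0] + 1, best[i][1] + [piece])
                if best[j] is None or candidate[0] < best[j][0]:
                    best[j] = candidate
    return best[n][1]

print(shortest_path_segment("北京大学生", {"北京", "大学", "大学生", "北京大学"}))
# ['北京', '大学生']
```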
Run `python test_exp2.py` to see results with the simple 2-gram language model; results on several datasets will be shown. You can run `python test_exp2.py train` or `python test_exp2.py test` to evaluate on the training set or outside it, respectively.
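For intuition, the sketch below shows how a 2-gram model with add-one smoothing can score a candidate segmentation; the counts are toy values standing in for statistics estimated from a training corpus, and the helper function is ours, not the script's API.

```python
import math
from collections import Counter

def bigram_log_prob(words, unigrams, bigrams, vocab_size):
    """Score a candidate segmentation under an add-one-smoothed
    bigram model: sum log P(w_i | w_{i-1}) over the sequence."""
    score = 0.0
    prev = "<s>"
    for word in words + ["</s>"]:
        numerator = bigrams[(prev, word)] + 1
        denominator = unigrams[prev] + vocab_size
        score += math.log(numerator / denominator)
        prev = word
    return score

# Toy counts standing in for statistics gathered from a corpus.
unigrams = Counter({"<s>": 2, "北京": 2, "大学": 1, "大学生": 1})
bigrams = Counter({("<s>", "北京"): 2, ("北京", "大学"): 1,
                   ("北京", "大学生"): 1, ("大学", "</s>"): 1})

# A higher log probability marks the more plausible segmentation.
print(bigram_log_prob(["北京", "大学"], unigrams, bigrams, len(unigrams)))
print(bigram_log_prob(["北京", "大学生"], unigrams, bigrams, len(unigrams)))
```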
Run `python BMES_exps/BMES.py` for the experiments on character-based tagging, which cover different train/validation splits and runs with and without re-replacement. You first need to run `python BMES_exps/convert_BMES.py` to prepare the corpus in the format this experiment expects.
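As background, character-based tagging casts segmentation as labeling each character B (begin of word), M (middle), E (end), or S (single-character word). The sketch below converts a segmented sentence into BMES character/tag pairs, the kind of preprocessing `convert_BMES.py` presumably performs; the function itself is illustrative.

```python
def words_to_bmes(words):
    """Convert a segmented sentence into per-character BMES tags:
    S = single-character word, B/M/E = begin/middle/end of a word."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append((w, "S"))
        else:
            tags.append((w[0], "B"))
            tags.extend((c, "M") for c in w[1:-1])
            tags.append((w[-1], "E"))
    return tags

print(words_to_bmes(["北京大学", "的", "学生"]))
# [('北', 'B'), ('京', 'M'), ('大', 'M'), ('学', 'E'),
#  ('的', 'S'), ('学', 'B'), ('生', 'E')]
```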