
This repo contains a list of useful resources for Mongolian NLP. Contributions are welcome.

Datasets

DATASET ~8 hours Mongolian TTS dataset: MnTTS, created by the Inner Mongolia University, China
- Application Entry
- Source code of TTS model
- Paper
DATASET LJSpeech-like male voice TTS dataset created from the Mongolian Bible
- used in tugstugi/pytorch-dc-tts
- use dl_and_preprop_dataset.py to download the audio files
DATASET LJSpeech-like Kalmyk (West Mongolian) female voice TTS dataset created from the Kalmyk Bible (2 hours)
DATASET 300 hours Kalmyk synthetic STT dataset created by a voice conversion model
- each WAV has a different text created from Kalmyk books
- source voice is the Kalmyk Bible female TTS
- target voices are from the VCTK dataset
- an example WAV: https://twitter.com/tugstugi/status/1409111296897912835
DATASET Eduge news classification dataset provided by Bolorsoft LLC
- used to train the Eduge.mn production news classifier
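- 75K news articles in 9 categories: урлаг соёл (arts and culture), эдийн засаг (economy), эрүүл мэнд (health), хууль (law), улс төр (politics), спорт (sports), технологи (technology), боловсрол (education) and байгал орчин (environment)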
DATASET 11-11.mn government agency complaint dataset
- 80K entries in 5 categories: санал хүсэлт (suggestion), гомдол (complaint), шүүмжлэл (criticism), талархал (gratitude) and өргөдөл (petition)
DATASET online news corpus
- 700 million words
DATASET Digital Archive of Mongolian Newspapers 1990-1995 of the British Library
Common Crawl Mongolian dataset
opendata.burtgel.gov.mn
- DATASET 220K Mongolian personal names
- DATASET 90K Mongolian clan/family names
- DATASET 192K Mongolian company names
DATASET Mongolian provinces (aimags and sums) names
DATASET 195 country (with capital cities) names in Mongolian
DATASET The 250 most frequent Mongolian words from Mongolian news, books and Wikipedia articles (670M words in total, 2M unique words)
- These words can also be used as stop words.
DATASET 500 Mongolian abbreviations
DATASET Mongolian NER dataset created from Mongolian politics and sport news
- 10K sentences annotated by tugstugi and enod using doccano
- 4 categories: LOCATION (6453/1753), PERSON (2839/1698), ORGANIZATION (4453/1970) and MISC (3716/2617)
DATASET Mongolian POS dataset of the National University of Mongolia
- 100k words
- used POS tagsets
DATASET Traditional Mongolian synthetic OCR dataset created from Mongolian song lyrics and dictionary
- 80K images
- no data augmentation is applied; to augment the data, use an external library such as albumentations
DATASET Traditional Mongolian OCR dataset
- 164631 samples, 200 people
DATASET Handwritten Mongolian Cyrillic Characters Database of the Mongolian University of Science and Technology
- 28x28 gray scale, 350k images
- dataset description
DATASET Mongolian Wordnet of the National University of Mongolia
- 26875 words, 2979 glosses, 23665 synsets, 213 examples
DATASET Mongolian Inflectional Morphology from UniMorph 4.0
- 2085 lemmas and 14592 inflections (+ morpheme segmentations)
DATASET Mongolian Derivational Morphology from MorphyNet
- 1410 lemmas, 1629 derivations, and 229 derivational suffixes.
DATASET Multilingual Spoken Words multilingual keyword spotting dataset
- 2200 Mongolian keywords, 44000 audio files
- example keywords: аав, байна, бэлдэж, дүрслэх, ламын, олов, сонирхож, түүний, хаанаас, хуулиар, чиглэсэн
DATASET Small Kalmyk text corpus
- newspaper, poetry etc.
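
As noted for the frequent-word list above, the most frequent words can double as stop words. The snippet below is a minimal sketch of that idea, assuming the list is available as plain text with one word per line. The three stop words and the sample sentence are hypothetical placeholders, not taken from the actual dataset, and `load_stop_words` / `remove_stop_words` are illustrative helper names.

```python
def load_stop_words(lines):
    """Build a stop-word set from an iterable of lines, one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stop_words(text, stop_words):
    """Lowercase the text, split on whitespace and drop stop-word tokens."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


# Hypothetical three-word excerpt; the real list has 250 entries.
stop_words = load_stop_words(["байна", "нь", "энэ"])
tokens = remove_stop_words("Энэ ном нь сонирхолтой байна", stop_words)
print(tokens)
```

In practice the words would be read from the downloaded file (for example with `open(path, encoding="utf-8")`), and punctuation would need to be stripped before the lookup.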

Mongolian Text-to-Speech

PYTORCH tugstugi/pytorch-dc-tts
- DEMO Colab online demo
- DATASET LJSpeech-like male voice dataset created from the Mongolian Bible
TF tugstugi/Tacotron-2 fork of Rayhane-mamah/Tacotron-2 adapted for the Mongolian Bible dataset
- DEMO Colab online demo
- DEMO speaker adaptation Colab online demo for the former Mongolian president Elbegdorj. The Tacotron model trained on the 5-hour Mongolian Bible dataset was fine-tuned on a 10-minute dataset created from one of Elbegdorj's speeches.
PYTORCH Chimege TTS demo
- 1x female
- NVIDIA/tacotron2 + NVIDIA/waveglow
DEMO HMM TTS online demo of the National University of Mongolia
- 1x male and 2x female voices
DEMO ~~Yet another HMM? TTS online demo from “Мон Спийч Ай Ти” ХХК~~
- demo server is currently down
- 1x male and 1x female
- female voice samples
SAMPLES Tacotron2 TTS demo samples of Ikon.MN
- 1x female (35h)
- NVIDIA/tacotron2 + NVIDIA/waveglow
DEMO HMM-based TTS online demo of the Inner Mongolia University
- 1x female
DEMO MTL-Tacotron TTS demo samples of the Inner Mongolia University & National University of Singapore
- 1x female
TF ttslr/MonTTS Inner Mongolian TTS training code
- SAMPLES Speech samples
- DATASET SAMPLES MonSpeech of the Inner Mongolia University
- dataset and pretrained models are not available
TF walker-hyf/MnTTS Inner Mongolian TTS dataset and training code
- SAMPLES Speech samples
- DATASET MnTTS of the Inner Mongolia University
- Pretrained Model download link
- dataset and pretrained models are available :)
PRODUCT NVDA/HTS screen reader developed by Innovation Development Center for the blind
- 1x female (National University of Mongolia voice)
PYTORCH/DEMO Kalmyk TTS demo (Kalmyk is a Mongolic language spoken in Russia)
- dataset created from the Kalmyk Bible (2 hours)
- NVIDIA/tacotron2 + NVIDIA/waveglow
PYTORCH/DEMO Kalmyk TTS demo from Silero (Kalmyk is a Mongolic language spoken in Russia)
- snakers4/silero-models

Mongolian Language Model

MODEL 5-gram binary LM generated by KenLM on a 670M word dirty corpus.
- it can be used either with mozilla/DeepSpeech: ./generate_trie alphabet.txt mn_5gram.binary trie
- or in tugstugi/mongolian-speech-recognition
TF / PYTORCH tugstugi/mongolian-bert pretrained Mongolian BERT models
- trained by tugstugi, enod and sharavsambuu
- nabar sponsored 5x TPUs.
PYTORCH bayartsogt-ya/albert-mongolian pretrained Mongolian ALBERT
PYTORCH robertritz/NLP ULMFiT experiments
PYTORCH huggingface.co/bayartsogt/mongolian-gpt2 Mongolian GPT-2 model
PYTORCH huggingface.co/bayartsogt/mongolian-roberta-base Mongolian RoBERTa base model

Mongolian Speech Recognition

PYTORCH tugstugi/mongolian-speech-recognition
- DEMO Chimege Speech Recognition
- a proprietary dataset is used
PRODUCT Chinese and traditional Mongolian voice input from aicloud.com
- direct link to the APK file
- seems to be working only for simple cases (or it works only for Southern Mongolian dialects...)
- same system but for Windows (reportedly, you have to register with a Chinese identity card to use it)
DEMO ~~Speech recognition of the Inner Mongolia University~~
- seems to be non-functional
PRODUCT Huawei cloud ASR supports minority languages such as Mongolian, Tibetan, and Uyghur.
PRODUCT Google Cloud Speech-to-text
- 20% WER on a private test set of 3000 audio recordings
PYTORCH Wav2Vec2 XLSR finetuned on Mongolian Common Voice
- DEMO Colab online demo
- 50% WER
PYTORCH Wav2Vec2 XLSR trained on Kalmyk dataset
- pretrained on 500 hours of Kalmyk TV recordings and a 1000-hour Mongolian speech recognition dataset
- finetuned on the 300-hour synthetic Kalmyk STT dataset created by voice conversion
- 50% WER on a private test set created from Kalmyk TV recordings; on clean voice recordings, the WER should be much lower
- DEMO https://huggingface.co/tugstugi/wav2vec2-large-xlsr-53-kalmyk
TF coqui.ai Mongolian speech recognition trained on Mongolian Common Voice
- 90.08% WER
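
The WER (word error rate) figures quoted above count the word-level substitutions, deletions and insertions needed to turn the recognized text into the reference transcript, divided by the number of reference words. The snippet below is a minimal sketch of that calculation, assuming whitespace tokenization and no text normalization. It is not the evaluation script used by any of the listed projects, and the two sample sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution and one deletion over four words.
wer = word_error_rate("түүний аав хаанаас ирсэн", "түүний ах хаанаас")
print(f"{wer:.0%}")
```

Because insertions are counted against the reference length, WER can exceed 100% when the hypothesis contains many extra words.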

Mongolian Script

DEMO Cyrillic to Mongolian script converter demo of the Inner Mongolia University
DEMO Mongolian script OCR demo of the Inner Mongolia University
PYTORCH tugstugi/bichig2cyrillic Mongolian script to Cyrillic (and back) converter
- DEMO Cyrillic to Mongolian Colab online demo
PYTORCH tugstugi/image2bichig Traditional Mongolian OCR using CRNN
- DEMO OCR Colab online demo
- DATASET Traditional Mongolian synthetic OCR dataset

Mongolian Text Classification

TF2 sharavsambuu/mongolian-text-classification
SKLEARN / DEMO simple SVM Colab notebook classifying the Eduge dataset with around 91% accuracy.
- SentencePiece model from tugstugi/mongolian-bert is used as the text tokenizer.

Mongolian Named Entity Recognition

DATASET Mongolian NER dataset created from Mongolian politics and sport news
- for more info see datasets
PYTORCH enod/mongolian-bert-ner BERT based Mongolian NER
- uses tugstugi/mongolian-bert Mongolian pre-trained BERT models
DEMO NER demo of the National University of Mongolia

Misc

PYTORCH tugstugi/forced_aligner Mongolian forced alignment tool using Rayhane-mamah/Tacotron-2 and readbeyond/aeneas
- DEMO Colab online demo
TF2 Cyrillic transliteration Colab notebook sharavsambuu/cyrillic-mongolian-transliteration
DATASET 1M back-translated MN->EN sentence dataset download link
- sharavsambuu/english-mongolian-nmt-dataset-augmentation
DICTIONARY Digitized Mongolian dictionaries from the Center for Northeast Asian Studies of Tohoku University in Japan
- for usage see Digitizing the Mongolian Language: An Introduction to the Polyglot “Online Dictionaries and Full-text Search of Mongolian Languages and Written Manchu”
- it also includes IPA pronunciations for Mongolian words
