This repo will contain a list of useful resources for Mongolian NLP. Feel free to contribute.
DATASET ~8 hours Mongolian TTS dataset: MnTTS created from the Inner Mongolia University, China
DATASET LJSpeech-like male voice TTS dataset created from the Mongolian Bible
- used in tugstugi/pytorch-dc-tts
- use dl_and_preprop_dataset.py to download the audio files
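For readers unfamiliar with the LJSpeech layout, here is a minimal sketch of reading such a transcript file. It assumes the original LJSpeech convention of a pipe-separated `metadata.csv` (`id|raw text|normalized text`); whether the Mongolian Bible dataset follows that layout exactly is an assumption, and the sample IDs and sentences below are made up for illustration.

```python
import csv
import io

# Hypothetical sample in the LJSpeech layout: id|raw text|normalized text.
# The IDs and sentences are invented, not taken from the dataset.
SAMPLE_METADATA = (
    "MB-0001|Сайн байна уу|Сайн байна уу\n"
    "MB-0002|Баярлалаа|Баярлалаа\n"
)


def read_ljspeech_metadata(text):
    """Return (wav_id, transcript) pairs from pipe-separated metadata."""
    reader = csv.reader(io.StringIO(text), delimiter="|", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        wav_id = row[0]
        # Prefer the normalized column when present, else use the raw text.
        transcript = row[2] if len(row) > 2 else row[1]
        pairs.append((wav_id, transcript))
    return pairs


pairs = read_ljspeech_metadata(SAMPLE_METADATA)
for wav_id, transcript in pairs:
    # LJSpeech keeps audio under wavs/<id>.wav; assumed to hold here as well.
    print(f"wavs/{wav_id}.wav\t{transcript}")
```

With a real download, the string would be replaced by the contents of the dataset's own metadata file.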
DATASET LJSpeech-like Kalmyk (West Mongolian) female voice TTS dataset created from the Kalmyk Bible (2 hours)
DATASET 300 hours Kalmyk synthetic STT dataset created by a voice conversion model
- each WAV has a different text created from Kalmyk books
- source voice is the Kalmyk Bible female TTS
- target voices are from the VCTK dataset
- an example WAV: https://twitter.com/tugstugi/status/1409111296897912835
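DATASET Eduge news classification dataset provided by Bolorsoft LLC
- used to train the Eduge.mn production news classifier
- 75K news articles with 9 categories: урлаг соёл (arts and culture), эдийн засаг (economy), эрүүл мэнд (health), хууль (law), улс төр (politics), спорт (sports), технологи (technology), боловсрол (education) and байгал орчин (environment)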
DATASET 11-11.mn government agency complaint dataset
- 80K with 5 categories: санал хүсэлт (suggestions and requests), гомдол (complaint), шүүмжлэл (criticism), талархал (gratitude) and өргөдөл (petition)
DATASET online news corpus
- 700 million words
DATASET Digital Archive of Mongolian Newspapers 1990-1995 of the British Library
- Common Crawl Mongolian dataset
- opendata.burtgel.gov.mn
DATASET 220K Mongolian personal names
DATASET 90K Mongolian clan/family names
DATASET 192K Mongolian company names
DATASET Mongolian provinces (aimags and sums) names
DATASET Names of 195 countries (with capital cities) in Mongolian
DATASET 250 most frequent Mongolian words from Mongolian news, books and Wikipedia articles (total 670M words / 2M unique words)
- These words could also be used as stop words.
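As a rough illustration of the stop-word use case, the sketch below filters tokens against a small set. The five words in `STOP_WORDS` are placeholders chosen for this example, not the contents of the 250-word list, and whitespace tokenization is a simplifying assumption.

```python
# Placeholder stop words for illustration only; in practice, load the
# 250-word frequency list from the dataset instead.
STOP_WORDS = {"нь", "энэ", "юм", "байна", "болон"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop stop words."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in stop_words]


filtered = remove_stop_words("Энэ нь Монгол хэлний жишээ юм")
print(filtered)
```

A real pipeline would likely also strip punctuation before the lookup.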
DATASET 500 Mongolian abbreviations
DATASET Mongolian NER dataset created from Mongolian politics and sport news
DATASET Mongolian POS dataset of the National University of Mongolia
- 100k words
- used POS tagsets
DATASET Traditional Mongolian synthetic OCR dataset created from Mongolian song lyrics and dictionary
- 80K images
- no data augmentation is applied; to augment the data, use an external library such as albumentations.
DATASET Traditional Mongolian OCR dataset
- 164631 samples, 200 people
DATASET Handwritten Mongolian Cyrillic Characters Database of the Mongolian University of Science and Technology
- 28x28 grayscale, 350k images
- dataset description
DATASET Mongolian Wordnet of the National University of Mongolia
- 26875 words, 2979 glosses, 23665 synsets, 213 examples
DATASET Mongolian Inflectional Morphology from UniMorph 4.0
- 2085 lemmas and 14592 inflections (+ morpheme segmentations)
DATASET Mongolian Derivational Morphology from MorphyNet
- 1410 lemmas, 1629 derivations, and 229 derivational suffixes.
DATASET Multilingual Spoken Words, a multilingual keyword spotting dataset
- 2200 Mongolian keywords, 44000 audio files
- example keywords: аав, байна, бэлдэж, дүрслэх, ламын, олов, сонирхож, түүний, хаанаас, хуулиар, чиглэсэн
DATASET Small Kalmyk text corpus
- newspaper, poetry etc.
PYTORCH tugstugi/pytorch-dc-tts
DEMO Colab online demo
DATASET LJSpeech-like male voice dataset created from the Mongolian Bible
TF tugstugi/Tacotron-2, a fork of Rayhane-mamah/Tacotron-2 adapted for the Mongolian Bible dataset
DEMO Colab online demo
DEMO speaker adaptation Colab online demo for the former Mongolian president Elbegdorj. The Tacotron model trained on the 5-hour Mongolian Bible dataset was fine-tuned on a 10-minute dataset created from one of Elbegdorj's speeches.
PYTORCH Chimege TTS demo
- 1x female
- NVIDIA/tacotron2 + NVIDIA/waveglow
DEMO HMM TTS online demo of the National University of Mongolia
- 1x male and 2x female voices
DEMO Yet another HMM? TTS online demo from “Мон Спийч Ай Ти” ХХК
- demo server is currently down
- 1x male and 1x female
- female voice samples
SAMPLES Tacotron2 TTS demo samples of Ikon.MN
- 1x female (35h)
- NVIDIA/tacotron2 + NVIDIA/waveglow
DEMO HMM-based TTS online demo of the Inner Mongolia University
- 1x female
DEMO MTL-Tacotron TTS demo samples of the Inner Mongolia University & National University of Singapore
- 1x female
TF ttslr/MonTTS Inner Mongolian TTS training code
SAMPLES Speech samples
DATASET SAMPLES MonSpeech of the Inner Mongolia University
- dataset and pretrained models are not available
TF walker-hyf/MnTTS Inner Mongolian TTS dataset and training code
SAMPLES Speech samples
DATASET MnTTS of the Inner Mongolia University
Pretrained Model download link
- dataset and pretrained models are available :)
PRODUCT NVDA/HTS screen reader developed by the Innovation Development Center for the blind
- 1x female (National University of Mongolia voice)
PYTORCH/DEMO Kalmyk TTS demo (Kalmyk is a Mongolic language spoken in Russia)
- dataset created from the Kalmyk Bible (2 hours)
- NVIDIA/tacotron2 + NVIDIA/waveglow
PYTORCH/DEMO Kalmyk TTS demo from Silero (Kalmyk is a Mongolic language spoken in Russia)
MODEL 5-gram binary LM generated by KenLM on a 670M word dirty corpus.
- it can be used either with mozilla/DeepSpeech:
    ./generate_trie alphabet.txt mn_5gram.binary trie
- or in tugstugi/mongolian-speech-recognition
TF/PYTORCH tugstugi/mongolian-bert pretrained Mongolian BERT models
- trained by tugstugi, enod and sharavsambuu
- nabar sponsored 5x TPUs.
PYTORCH bayartsogt-ya/albert-mongolian pretrained Mongolian ALBERT
PYTORCH robertritz/NLP ULMFiT experiments
PYTORCH huggingface.co/bayartsogt/mongolian-gpt2 Mongolian GPT-2 model
PYTORCH huggingface.co/bayartsogt/mongolian-roberta-base Mongolian RoBERTa base model
PYTORCH tugstugi/mongolian-speech-recognition
DEMO Chimege Speech Recognition
- a proprietary dataset is used
PRODUCT Chinese and traditional Mongolian voice input from aicloud.com
DEMO Speech recognition of the Inner Mongolia University
- seems to be non-functional
PRODUCT Huawei cloud ASR supports minority languages such as Mongolian, Tibetan, and Uyghur.
PRODUCT Google Cloud Speech-to-text
- 20% WER on a private test dataset of 3000 audio files
PYTORCH Wav2Vec2 XLSR finetuned on Mongolian Common Voice
DEMO Colab online demo
- 50% WER
PYTORCH Wav2Vec2 XLSR trained on Kalmyk dataset
- pretrained on 500 hours Kalmyk TV recordings and 1000 hours Mongolian speech recognition dataset
- finetuned on 300 hours synthetic Kalmyk STT dataset created by voice conversion
- 50% WER on a private test set created from Kalmyk TV recordings; on clean voice recordings, the WER should be much lower
DEMO https://huggingface.co/tugstugi/wav2vec2-large-xlsr-53-kalmyk
TF coqui.ai Mongolian speech recognition trained on Mongolian CommonVoice
- 90.08% WER
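The WER (word error rate) figures quoted for the speech recognition models can be read with the sketch below in mind. It shows the standard word-level definition (substitutions, deletions and insertions divided by the number of reference words); the listed projects may normalize text differently before scoring, so this is not necessarily how their numbers were produced. The function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One of four reference words is missing from the hypothesis.
wer = word_error_rate("сайн байна уу та", "сайн байна та")
print(f"WER: {wer:.2%}")
```

Because insertions count as errors, the WER can exceed 100% when the hypothesis is much longer than the reference.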
DEMO Cyrillic to Mongolian script converter demo of the Inner Mongolia University
DEMO Mongolian script OCR demo of the Inner Mongolia University
PYTORCH tugstugi/bichig2cyrillic Mongolian script to Cyrillic (and back) converter
PYTORCH tugstugi/image2bichig Traditional Mongolian OCR using CRNN
TF2 sharavsambuu/mongolian-text-classification
SKLEARN/DEMO simple SVM Colab notebook classifying the Eduge dataset with around 91% accuracy.
- SentencePiece model from tugstugi/mongolian-bert is used as the text tokenizer.
DATASET Mongolian NER dataset created from Mongolian politics and sport news
- for more info see datasets
PYTORCH enod/mongolian-bert-ner BERT-based Mongolian NER
- uses tugstugi/mongolian-bert Mongolian pre-trained BERT models
DEMO NER demo of the National University of Mongolia
PYTORCH tugstugi/forced_aligner Mongolian forced alignment tool using Rayhane-mamah/Tacotron-2 and readbeyond/aeneas
DEMO Colab online demo
TF2 Cyrillic transliteration Colab notebook sharavsambuu/cyrillic-mongolian-transliteration
DATASET 1M back-translated MN->EN sentence dataset download link
DICTIONARY Mongolian digitalized dictionaries from the Center for Northeast Asian Studies of Tohoku University in Japan
- for usage see Digitizing the Mongolian Language: An Introduction to the Polyglot “Online Dictionaries and Full-text Search of Mongolian Languages and Written Manchu”
- it also includes IPA pronunciations for Mongolian words