Links to publicly available Russian corpora + code for loading and parsing. 20+ datasets, 350Gb+ of text.
For example, let's use the dump of lenta.ru by @yutkin. Manually download the archive (the link is in the Reference section):
wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz
Use corus to load the data:
>>> from corus import load_lenta
>>> path = 'lenta-ru-news.csv.gz'
>>> records = load_lenta(path)
>>> next(records)
LentaRecord(
    url='https://lenta.ru/news/2018/12/14/cancer/',
    title='Названы регионы России с\xa0самой высокой смертностью от\xa0рака',
    text='Вице-премьер по социальным вопросам Татьяна Голикова рассказала, в каких регионах России зафиксирована наиболее высокая смертность от рака, сооб...',
    topic='Россия',
    tags='Общество'
)
Iterate over texts:
>>> records = load_lenta(path)
>>> for record in records:
...     text = record.text
...     ...
For links to other datasets and their loaders, see the Reference section.
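Because the loader is consumed with `next(records)` and a `for` loop, it can be combined with ordinary iterator tools. The sketch below counts topics over a limited number of records. `count_topics` and the stand-in `Record` type are hypothetical helpers, not part of corus; the stand-in only mirrors the field names of the `LentaRecord` shown above so that the example runs without the downloaded archive.

```python
from collections import Counter, namedtuple
from itertools import islice

# Hypothetical stand-in with the same field names as LentaRecord above,
# so the sketch runs without downloading the archive.
Record = namedtuple('Record', ['url', 'title', 'text', 'topic', 'tags'])


def count_topics(records, limit=None):
    """Count records per topic, consuming at most `limit` records."""
    if limit is not None:
        records = islice(records, limit)
    return Counter(record.topic for record in records)


sample = [
    Record('u1', 't1', 'text one', 'Россия', 'Общество'),
    Record('u2', 't2', 'text two', 'Мир', 'Политика'),
    Record('u3', 't3', 'text three', 'Россия', 'Общество'),
]

# Only the first two records are read; the third is never touched.
counts = count_topics(iter(sample), limit=2)
print(counts.most_common())
```

With the real data, something like `count_topics(load_lenta(path), limit=10000)` should behave the same way, assuming every record exposes a `topic` attribute as in the example output above.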
Materials are in Russian:
corus supports Python 3.5+, PyPy 3.
$ pip install corus

| Dataset | API (from corus import) | Tags | Texts | Uncompressed | Description |
|---|---|---|---|---|---|
| Lenta.ru | |||||
| Lenta.ru v1.0 | load_lenta | news | 739 351 | 1.66 Gb | wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz |
| Lenta.ru v1.1+ | load_lenta2 | news | 800 975 | 1.94 Gb | wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.1/lenta-ru-news.csv.bz2 |
| Lib.rus.ec | load_librusec | fiction | 301 871 | 144.92 Gb | Dump of lib.rus.ec prepared for RUSSE workshop<br>wget http://panchenko.me/data/russe/librusec_fb2.plain.gz |
| Rossiya Segodnya | load_ria_raw, load_ria | news | 1 003 869 | 3.70 Gb | wget https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz |
| Mokoron Russian Twitter Corpus | load_mokoron | social, sentiment | 17 633 417 | 1.86 Gb | Russian Twitter sentiment markup<br>Manually download https://www.dropbox.com/s/9egqjszeicki4ho/db.sql |
| Wikipedia | load_wiki | | 1 541 401 | 12.94 Gb | Russian Wiki dump<br>wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2 |
| GramEval2020 | load_gramru | | 162 372 | 30.04 Mb | wget https://github.com/dialogue-evaluation/GramEval2020/archive/master.zip<br>unzip master.zip<br>mv GramEval2020-master/dataTrain train<br>mv GramEval2020-master/dataOpenTest dev<br>rm -r master.zip GramEval2020-master<br>wget https://github.com/AlexeySorokin/GramEval2020/raw/master/data/GramEval_private_test.conllu |
| OpenCorpora | load_corpora | morph | 4 030 | 20.21 Mb | wget http://opencorpora.org/files/export/annot/annot.opcorpora.xml.zip |
| RusVectores SimLex-965 | load_simlex | emb, sim | | | wget https://rusvectores.org/static/testsets/ru_simlex965_tagged.tsv<br>wget https://rusvectores.org/static/testsets/ru_simlex965.tsv |
| Omnia Russica | load_omnia | morph, web, fiction | | 489.62 Gb | Taiga + Wiki + Araneum. Read "Even larger Russian corpus" https://events.spbu.ru/eventsContent/events/2019/corpora/corp_sborn.pdf<br>Manually download http://bit.ly/2ZT4BY9 |
| factRuEval-2016 | load_factru | ner, news | 254 | 969.27 Kb | Manual PER, LOC, ORG markup prepared for 2016 Dialog competition<br>wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip<br>unzip master.zip<br>rm master.zip |
| Gareev | load_gareev | ner, news | 97 | 455.02 Kb | Manual PER, ORG markup (no LOC)<br>Email Rinat Gareev and ask for the dataset<br>tar -xvf rus-ner-news-corpus.iob.tar.gz<br>rm rus-ner-news-corpus.iob.tar.gz |
| Collection5 | load_ne5 | ner, news | 1 000 | 2.96 Mb | News articles with manual PER, LOC, ORG markup<br>wget http://www.labinform.ru/pub/named_entities/collection5.zip<br>unzip collection5.zip<br>rm collection5.zip |
| WiNER | load_wikiner | ner | 203 287 | 36.15 Mb | Sentences from Wiki auto annotated with PER, LOC, ORG tags<br>wget https://github.com/dice-group/FOX/raw/master/input/Wikiner/aij-wikiner-ru-wp3.bz2 |
| BSNLP-2019 | load_bsnlp | ner | 464 | 1.16 Mb | Markup prepared for 2019 BSNLP Shared Task<br>wget http://bsnlp.cs.helsinki.fi/TRAININGDATA_BSNLP_2019_shared_task.zip<br>wget http://bsnlp.cs.helsinki.fi/TESTDATA_BSNLP_2019_shared_task.zip<br>unzip TRAININGDATA_BSNLP_2019_shared_task.zip<br>unzip TESTDATA_BSNLP_2019_shared_task.zip -d test_pl_cs_ru_bg<br>rm TRAININGDATA_BSNLP_2019_shared_task.zip TESTDATA_BSNLP_2019_shared_task.zip |
| Persons-1000 | load_persons | ner, news | 1 000 | 2.96 Mb | Same as Collection5, only PER markup + normalized names<br>wget http://ai-center.botik.ru/Airec/ai-resources/Persons-1000.zip |
| The Russian Drug Reaction Corpus (RuDReC) | load_rudrec | ner | 4 809 | 1.73 Kb | RuDReC is a new partially annotated corpus of consumer reviews in Russian about pharmaceutical products for the detection of health-related named entities and the effectiveness of pharmaceutical products. Here you can download and work with the annotated part; to get the raw part (1.4M reviews), please refer to https://github.com/cimm-kzn/RuDReC.<br>wget https://github.com/cimm-kzn/RuDReC/raw/master/data/rudrec_annotated.json |
| Taiga | | | | | Large collection of Russian texts from various sources: news sites, magazines, literature, social networks<br>wget https://linghub.ru/static/Taiga/retagged_taiga.tar.gz<br>tar -xzvf retagged_taiga.tar.gz |
| Arzamas | load_taiga_arzamas | news | 311 | 4.50 Mb | |
| Fontanka | load_taiga_fontanka | news | 342 683 | 786.23 Mb | |
| Interfax | load_taiga_interfax | news | 46 429 | 77.55 Mb | |
| KP | load_taiga_kp | news | 45 503 | 61.79 Mb | |
| Lenta | load_taiga_lenta | news | 36 446 | 95.15 Mb | |
| Taiga/N+1 | load_taiga_nplus1 | news | 7 696 | 24.96 Mb | |
| Magazines | load_taiga_magazines | | 39 890 | 2.19 Gb | |
| Subtitles | load_taiga_subtitles | | 19 011 | 909.08 Mb | |
| Social | load_taiga_social | social | 1 876 442 | 648.18 Mb | |
| Proza | load_taiga_proza | fiction | 1 732 434 | 38.25 Gb | |
| Stihi | load_taiga_stihi | | 9 157 686 | 12.80 Gb | |
| Russian NLP Datasets | | | | | Several Russian news datasets from webhose.io, lenta.ru and other news sites. |
| News | load_buriy_news | news | 2 154 801 | 6.84 Gb | Dump of top 40 news + 20 fashion news sites.<br>wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2014.tar.bz2<br>wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part1.tar.bz2<br>wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part2.tar.bz2 |
| Webhose | load_buriy_webhose | news | 285 965 | 859.32 Mb | Dump from webhose.io, 300 sources for one month.<br>wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/webhose-2016.tar.bz2 |
| ODS #proj_news_viz | | | | | Several news sites scraped by members of #proj_news_viz ODS project. |
| Interfax | load_ods_interfax | news | 543 961 | 1.22 Gb | wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/interfax.csv.gz |
| Gazeta | load_ods_gazeta | news | 865 847 | 1.63 Gb | wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/gazeta.csv.gz |
| Izvestia | load_ods_izvestia | news | 86 601 | 307.19 Mb | wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/iz.csv.gz |
| Meduza | load_ods_meduza | news | 71 806 | 270.11 Mb | wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/meduza.csv.gz |
| RIA | load_ods_ria | news | 101 543 | 233.88 Mb | wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/ria.csv.gz |
| Russia Today | load_ods_rt | news | 106 644 | 187.12 Mb | wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/rt.csv.gz |
| TASS | load_ods_tass | news | 1 135 635 | 3.27 Gb | wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/tass-001.csv.gz |
| Universal Dependencies | |||||
| GSD | load_ud_gsd | morph, syntax | 5 030 | 1.01 Mb | wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-dev.conllu<br>wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-test.conllu<br>wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-train.conllu |
| Taiga | load_ud_taiga | morph, syntax | 3 264 | 353.80 Kb | wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-dev.conllu<br>wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-test.conllu<br>wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-train.conllu |
| PUD | load_ud_pud | morph, syntax | 1 000 | 207.78 Kb | wget https://github.com/UniversalDependencies/UD_Russian-PUD/raw/master/ru_pud-ud-test.conllu |
| SynTagRus | load_ud_syntag | morph, syntax | 61 889 | 11.33 Mb | wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-dev.conllu<br>wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-test.conllu<br>wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-train.conllu |
| morphoRuEval-2017 | |||||
| General Internet-Corpus | load_morphoru_gicrya | morph | 83 148 | 10.58 Mb | wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/GIKRYA_texts_new.zip<br>unzip GIKRYA_texts_new.zip<br>rm GIKRYA_texts_new.zip |
| Russian National Corpus | load_morphoru_rnc | morph | 98 892 | 12.71 Mb | wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/RNC_texts.rar<br>unrar x RNC_texts.rar<br>rm RNC_texts.rar |
| OpenCorpora | load_morphoru_corpora | morph | 38 510 | 4.80 Mb | wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/OpenCorpora_Texts.rar<br>unrar x OpenCorpora_Texts.rar<br>rm OpenCorpora_Texts.rar |
| RUSSE Russian Semantic Relatedness | |||||
| HJ: Human Judgements of Word Pairs | load_russe_hj | emb, sim | | | wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/hj.csv |
| RT: Synonyms and Hypernyms from the Thesaurus RuThes | load_russe_rt | emb, sim | | | wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/rt.csv |
| AE: Cognitive Associations from the Sociation.org Experiment | load_russe_ae | emb, sim | | | wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-train.csv<br>wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-test.csv<br>wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/ae2.csv |
| Toloka Datasets | |||||
| Lexical Relations from the Wisdom of the Crowd (LRWC) | load_toloka_lrwc | emb, sim | | | wget https://tlk.s3.yandex.net/dataset/LRWC.zip<br>unzip LRWC.zip<br>rm LRWC.zip |
| The Russian Adverse Drug Reaction Corpus of Tweets (RuADReCT) | load_ruadrect | social | 9 515 | 2.09 Mb | This corpus was developed for the Social Media Mining for Health Applications (#SMM4H) Shared Task 2020<br>wget https://github.com/cimm-kzn/RuDReC/raw/master/data/RuADReCT.zip<br>unzip RuADReCT.zip<br>rm RuADReCT.zip |
- Chat — https://t.me/natural_language_processing
- Issues — https://github.com/natasha/corus/issues
- Commercial support — https://lab.alexkuk.ru
- Implement corus/sources/<source>.py
- Add import into corus/sources/__init__.py
- Add meta into corus/sources/meta.py
- Add example into docs.ipynb (check that the meta table is correct)
- Run tests (readme is updated)
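As a rough illustration of the first step, a new source module usually needs a record type and a generator function that reads the downloaded file lazily. The sketch below is an assumption about the general shape only: `ExampleRecord`, `load_example`, and the CSV column names are hypothetical, and it uses the standard library rather than any helpers that corus itself may provide.

```python
import csv
import gzip
from collections import namedtuple

# Hypothetical record type; a real source would define fields
# that match the columns of its own data.
ExampleRecord = namedtuple('ExampleRecord', ['url', 'title', 'text'])


def load_example(path):
    """Lazily yield ExampleRecord items from a gzipped CSV with a header row."""
    # newline='' lets the csv module handle line breaks inside quoted fields.
    with gzip.open(path, mode='rt', encoding='utf8', newline='') as file:
        for row in csv.DictReader(file):
            yield ExampleRecord(row['url'], row['title'], row['text'])
```

Yielding records one at a time keeps memory usage flat even for multi-gigabyte dumps, which matches how `load_lenta` is consumed in the usage example.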
Dev env
python -m venv ~/.venvs/natasha-corus
source ~/.venvs/natasha-corus/bin/activate
pip install -r requirements/dev.txt
pip install -e .
python -m ipykernel install --user --name natasha-corus
Lint + update docs
make lint
make exec-docs
Release
# Update setup.py version
git commit -am 'Up version'
git tag v0.10.0
git push
git push --tags