Links to publicly available Russian corpora, plus code for loading and parsing them. 20+ datasets, 350+ Gb of text.
For example, let's use the dump of lenta.ru by @yutkin. Manually download the archive (link in the Reference section):
wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz

Use corus to load the data:
>>> from corus import load_lenta
>>> path = 'lenta-ru-news.csv.gz'
>>> records = load_lenta(path)
>>> next(records)
LentaRecord(
    url='https://lenta.ru/news/2018/12/14/cancer/',
    title='Названы регионы России с\xa0самой высокой смертностью от\xa0рака',
    text='Вице-премьер по социальным вопросам Татьяна Голикова рассказала, в каких регионах России зафиксирована наиболее высокая смертность от рака, сооб...',
    topic='Россия',
    tags='Общество'
)

Iterate over texts:
>>> records = load_lenta(path)
>>> for record in records:
...     text = record.text
...     ...

For links to other datasets and their loaders see the Reference section.
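Because the session above calls `next(records)`, the loader appears to return an iterator rather than a list, so records can be sampled without reading the whole archive. The sketch below illustrates that pattern with `itertools.islice`. To stay runnable without the 1.66 Gb download, it uses a hypothetical `Record` type and a hypothetical `fake_load` generator in place of `load_lenta(path)`; neither name is part of corus.

```python
from collections import Counter, namedtuple
from itertools import islice

# Hypothetical stand-in for the records a loader yields;
# the field names follow the LentaRecord shown above.
Record = namedtuple('Record', ['url', 'title', 'text', 'topic', 'tags'])


def fake_load():
    # Hypothetical stand-in for load_lenta(path): a generator that
    # yields one record at a time instead of building a list.
    yield Record('https://example.org/1', 'First', 'Text one', 'Россия', 'Общество')
    yield Record('https://example.org/2', 'Second', 'Text two', 'Мир', 'Политика')
    yield Record('https://example.org/3', 'Third', 'Text three', 'Россия', 'Общество')


# Take a small sample without reading the rest of the stream.
records = fake_load()
sample = list(islice(records, 2))

# An iterator is consumed as it is read, so create a fresh one
# before making a second pass, as the session above does.
records = fake_load()
topics = Counter(record.topic for record in records)
```

With corus installed and the archive downloaded, replacing `fake_load()` with `load_lenta(path)` should give the same behaviour on real data.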
corus supports Python 2.7+, 3.4+ and PyPy 2, 3.
$ pip install corus

Reference:

| Dataset | API `from corus import` | Tags | Texts | Uncompressed | Description |
|---|---|---|---|---|---|
| Lenta.ru | `load_lenta` | #news | 739 351 | 1.66 Gb | Dump of lenta.ru.<br>`wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz` |
| Lib.rus.ec | `load_librusec` | #lit | 301 871 | 144.92 Gb | Dump of lib.rus.ec prepared for RUSSE workshop.<br>`wget http://panchenko.me/data/russe/librusec_fb2.plain.gz` |
| Rossiya Segodnya | `load_ria_raw`<br>`load_ria` | #news | 1 003 869 | 3.70 Gb | `wget https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz` |
| factRuEval-2016 | `load_factru` | #ner #news | 254 | 969.27 Kb | Manual PER, LOC, ORG markup prepared for 2016 Dialog competition.<br>`wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip`<br>`unzip master.zip`<br>`rm master.zip` |
| Gareev | `load_gareev` | #ner #news | 97 | 455.02 Kb | Manual PER, ORG markup.<br>Email Rinat Gareev ([email protected]) and ask for the dataset.<br>`tar -xvf rus-ner-news-corpus.iob.tar.gz`<br>`rm rus-ner-news-corpus.iob.tar.gz` |
| Collection5 | `load_ne5` | #ner #news | 1 000 | 2.96 Mb | News articles with manual PER, LOC, ORG markup.<br>`wget http://www.labinform.ru/pub/named_entities/collection5.zip`<br>`unzip collection5.zip`<br>`rm collection5.zip` |
| WiNER | `load_wikiner` | #ner | 203 287 | 36.15 Mb | Sentences from Wiki auto annotated with PER, LOC, ORG tags.<br>`wget https://github.com/dice-group/FOX/raw/master/input/Wikiner/aij-wikiner-ru-wp3.bz2` |
| Mokoron Russian Twitter Corpus | `load_mokoron` | #social | 17 633 417 | 1.86 Gb | Russian tweets.<br>Manually download https://www.dropbox.com/s/9egqjszeicki4ho/db.sql |
| Wikipedia | `load_wiki` | | 1 541 401 | 12.94 Gb | Russian Wiki dump.<br>`wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2` |
| Taiga | | | | | Large collection of Russian texts from various sources: news sites, magazines, literature, social networks.<br>`wget https://linghub.ru/static/Taiga/retagged_taiga.tar.gz`<br>`tar -xzvf retagged_taiga.tar.gz` |
| Arzamas | `load_taiga_arzamas` | #news | 311 | 4.50 Mb | Dump of arzamas.academy. |
| Fontanka | `load_taiga_fontanka` | #news | 342 683 | 786.23 Mb | Dump of fontanka.ru. |
| Interfax | `load_taiga_interfax` | #news | 46 429 | 77.55 Mb | Dump of interfax.ru. |
| KP | `load_taiga_kp` | #news | 45 503 | 61.79 Mb | Dump of kp.ru. |
| Lenta | `load_taiga_lenta` | #news | 36 446 | 95.15 Mb | Dump of lenta.ru. |
| Taiga/N+1 | `load_taiga_nplus1` | #news | 7 696 | 24.96 Mb | Dump of nplus1.ru. |
| Magazines | `load_taiga_magazines` | | 39 890 | 2.19 Gb | Dump of magazines.russ.ru. |
| Subtitles | `load_taiga_subtitles` | | 19 011 | 909.08 Mb | |
| Social | `load_taiga_social` | #social | 1 876 442 | 648.18 Mb | |
| Proza | `load_taiga_proza` | #lit | 1 732 434 | 38.25 Gb | Dump of proza.ru. |
| Stihi | `load_taiga_stihi` | | 9 157 686 | 12.80 Gb | Dump of stihi.ru. |
| Russian NLP Datasets | | | | | Several Russian news datasets from webhose.io, lenta.ru and other news sites. |
| Lenta | `load_buriy_lenta` | #news | 699 777 | 1.57 Gb | Dump of lenta.ru.<br>`wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/lenta.tar.bz2` |
| News | `load_buriy_news` | #news | 2 154 801 | 6.84 Gb | Dump of top 40 news + 20 fashion news sites.<br>`wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2014.tar.bz2`<br>`wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part1.tar.bz2`<br>`wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part2.tar.bz2` |
| Webhose | `load_buriy_webhose` | #news | 285 965 | 859.32 Mb | Dump from webhose.io, 300 sources for one month.<br>`wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/stress.tar.gz` |
| ODS #proj_news_viz | | | | | Several news sites scraped by members of #proj_news_viz ODS project. |
| Interfax | `load_ods_interfax` | #news | 543 962 | 1.22 Gb | Dump of interfax.ru.<br>Manually download interfax_v1.csv.zip from https://drive.google.com/file/d/1M7z0YoOgpm53IsJ3qOhT_nfiDnGUPeys/view |
| Gazeta | `load_ods_gazeta` | #news | 865 847 | 1.63 Gb | Dump of gazeta.ru.<br>Manually download gazeta_v1.csv.zip from https://drive.google.com/file/d/18B8CvHgmwwyz9GWBZ0TS6dE_x6gYnWCb/view |
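The `wget` commands in the table can also be reproduced from Python when `wget` is not available. The following is a minimal Python 3 sketch using only the standard library; the `download` helper is a hypothetical name, not part of corus, and it is only suitable for direct file links (the Dropbox and Google Drive entries marked "Manually download" will likely not work this way).

```python
import shutil
import urllib.request
from pathlib import Path


def download(url, path):
    # Hypothetical helper, not part of corus.
    # Stream the response to disk in chunks so that large archives
    # do not have to fit in memory.
    with urllib.request.urlopen(url) as response, open(path, 'wb') as file:
        shutil.copyfileobj(response, file)
    return Path(path)


# Example call (not executed here, the archive is large):
# download(
#     'https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz',
#     'lenta-ru-news.csv.gz',
# )
```

The returned path can then be passed to the matching loader from the table.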
License: MIT
- Chat — https://telegram.me/natural_language_processing
- Issues — https://github.com/natasha/corus/issues
Tests:
make test

Add new source:
- Implement `corus/sources/<source>.py`
- Add import into `corus/sources/__init__.py`
- Add meta into `corus/sources/meta.py`
- Add example into `docs.ipynb` (check meta table is correct)
- Run tests (readme is updated)
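As a rough idea of what the first step might involve, here is a sketch of a source module for a gzipped JSON-lines dump. Everything in it is an assumption made for illustration: the `ExampleRecord` type, the `parse_example` and `load_example` names, and the file format are hypothetical, and corus has its own record classes and I/O helpers that a real source should use instead.

```python
import gzip
import json
from collections import namedtuple

# Hypothetical record type; corus defines its own record classes.
ExampleRecord = namedtuple('ExampleRecord', ['url', 'title', 'text'])


def parse_example(lines):
    # Turn each non-empty JSON line into a record, lazily.
    for line in lines:
        line = line.strip()
        if not line:
            continue
        data = json.loads(line)
        yield ExampleRecord(data['url'], data['title'], data['text'])


def load_example(path):
    # Hypothetical loader following the load_<source>(path) naming
    # used in the table above; it reads a gzipped JSON-lines file.
    with gzip.open(path, 'rt', encoding='utf8') as file:
        for record in parse_example(file):
            yield record
```

Keeping parsing separate from file handling makes the parser easy to test on a few in-memory lines, and yielding records one by one matches the lazy iteration shown in the usage example.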
Package:
make version
git push
git push --tags
make clean wheel upload