corus

Links to publicly available Russian corpora, plus code for loading and parsing them. More than 20 datasets, 350+ Gb of text.

Usage

For example, let's use the dump of lenta.ru by @yutkin. Manually download the archive (the link is also listed in the Reference section):

wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz
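Where wget is not available, the same download can be sketched with Python's standard library. The helper name `download` is hypothetical and not part of corus; it only wraps `urllib.request.urlretrieve` and skips the request when the file already exists.

```python
import os
import urllib.request

# URL copied from the wget command above.
LENTA_URL = (
    'https://github.com/yutkin/Lenta.Ru-News-Dataset'
    '/releases/download/v1.0/lenta-ru-news.csv.gz'
)


def download(url, path=None):
    """Save url to path (default: last URL segment); skip if the file exists."""
    if path is None:
        path = url.rsplit('/', 1)[-1]
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path


# Usage (performs a network request, so it is left commented out):
# path = download(LENTA_URL)
```

The returned path can then be passed to the loader in the next step.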

Use corus to load the data:

>>> from corus import load_lenta

>>> path = 'lenta-ru-news.csv.gz'
>>> records = load_lenta(path)
>>> next(records)

LentaRecord(
    url='https://lenta.ru/news/2018/12/14/cancer/',
    title='Названы регионы России с\xa0самой высокой смертностью от\xa0рака',
    text='Вице-премьер по социальным вопросам Татьяна Голикова рассказала, в каких регионах России зафиксирована наиболее высокая смертность от рака, сооб...',
    topic='Россия',
    tags='Общество'
)

Iterate over texts:

>>> records = load_lenta(path)
>>> for record in records:
...     text = record.text
...     ...
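Since `next(records)` works on the result, `load_lenta` appears to return a lazy iterator, so a small sample can be taken without reading the whole file. Below is a minimal sketch using `itertools.islice`. The helper `head_texts` and the stand-in `Record` tuple are hypothetical; they only mimic the `.text` and `.topic` fields shown in the record above.

```python
from collections import namedtuple
from itertools import islice


def head_texts(records, n, topic=None):
    """Return texts of the first n records that match topic (all records if topic is None)."""
    if topic is not None:
        records = (r for r in records if r.topic == topic)
    return [r.text for r in islice(records, n)]


# Stand-in records for illustration; with corus installed, pass load_lenta(path) instead.
Record = namedtuple('Record', ['title', 'text', 'topic'])
sample = iter([
    Record('a', 'text a', 'Россия'),
    Record('b', 'text b', 'Мир'),
    Record('c', 'text c', 'Россия'),
])
texts = head_texts(sample, 2, topic='Россия')
```

The generator expression keeps the filtering lazy, so iteration stops as soon as `n` matching records have been collected.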

For links to other datasets and their loaders see the Reference section.

Install

corus supports Python 2.7+, 3.4+ and PyPy 2, 3.

$ pip install corus

Reference

Dataset	API `from corus import`	Tags	Texts	Uncompressed	Description
Lenta.ru	`load_lenta`	#news	739 351	1.66 Gb	Dump of lenta.ru `wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz`
Lib.rus.ec	`load_librusec`	#lit	301 871	144.92 Gb	Dump of lib.rus.ec prepared for RUSSE workshop `wget http://panchenko.me/data/russe/librusec_fb2.plain.gz`
Rossiya Segodnya	`load_ria_raw` `load_ria`	#news	1 003 869	3.70 Gb	`wget https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz`
factRuEval-2016	`load_factru`	#ner #news	254	969.27 Kb	Manual PER, LOC, ORG markup prepared for 2016 Dialog competition. `wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip` `unzip master.zip` `rm master.zip`
Gareev	`load_gareev`	#ner #news	97	455.02 Kb	Manual PER, ORG markup. Email Rinat Gareev ([email protected]) to ask for the dataset. `tar -xvf rus-ner-news-corpus.iob.tar.gz` `rm rus-ner-news-corpus.iob.tar.gz`
Collection5	`load_ne5`	#ner #news	1 000	2.96 Mb	News articles with manual PER, LOC, ORG markup. `wget http://www.labinform.ru/pub/named_entities/collection5.zip` `unzip collection5.zip` `rm collection5.zip`
WiNER	`load_wikiner`	#ner	203 287	36.15 Mb	Sentences from Wiki auto annotated with PER, LOC, ORG tags. `wget https://github.com/dice-group/FOX/raw/master/input/Wikiner/aij-wikiner-ru-wp3.bz2`
Mokoron Russian Twitter Corpus	`load_mokoron`	#social	17 633 417	1.86 Gb	Russian tweets. Manually download https://www.dropbox.com/s/9egqjszeicki4ho/db.sql
Wikipedia	`load_wiki`		1 541 401	12.94 Gb	Russian Wiki dump. `wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2`
Taiga	Large collection of Russian texts from various sources: news sites, magazines, literature, social networks. `wget https://linghub.ru/static/Taiga/retagged_taiga.tar.gz` `tar -xzvf retagged_taiga.tar.gz`
Arzamas	`load_taiga_arzamas`	#news	311	4.50 Mb	Dump of arzamas.academy.
Fontanka	`load_taiga_fontanka`	#news	342 683	786.23 Mb	Dump of fontanka.ru.
Interfax	`load_taiga_interfax`	#news	46 429	77.55 Mb	Dump of interfax.ru.
KP	`load_taiga_kp`	#news	45 503	61.79 Mb	Dump of kp.ru.
Lenta	`load_taiga_lenta`	#news	36 446	95.15 Mb	Dump of lenta.ru.
Taiga/N+1	`load_taiga_nplus1`	#news	7 696	24.96 Mb	Dump of nplus1.ru.
Magazines	`load_taiga_magazines`		39 890	2.19 Gb	Dump of magazines.russ.ru
Subtitles	`load_taiga_subtitles`		19 011	909.08 Mb
Social	`load_taiga_social`	#social	1 876 442	648.18 Mb
Proza	`load_taiga_proza`	#lit	1 732 434	38.25 Gb	Dump of proza.ru
Stihi	`load_taiga_stihi`		9 157 686	12.80 Gb	Dump of stihi.ru
Russian NLP Datasets	Several Russian news datasets from webhose.io, lenta.ru and other news sites.
Lenta	`load_buriy_lenta`	#news	699 777	1.57 Gb	Dump of lenta.ru. `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/lenta.tar.bz2`
News	`load_buriy_news`	#news	2 154 801	6.84 Gb	Dump of top 40 news + 20 fashion news sites. `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2014.tar.bz2` `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part1.tar.bz2` `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part2.tar.bz2`
Webhose	`load_buriy_webhose`	#news	285 965	859.32 Mb	Dump from webhose.io, 300 sources for one month. `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/stress.tar.gz`
ODS #proj_news_viz	Several news sites scraped by members of #proj_news_viz ODS project.
Interfax	`load_ods_interfax`	#news	543 962	1.22 Gb	Dump of interfax.ru. Manually download interfax_v1.csv.zip from https://drive.google.com/file/d/1M7z0YoOgpm53IsJ3qOhT_nfiDnGUPeys/view
Gazeta	`load_ods_gazeta`	#news	865 847	1.63 Gb	Dump of gazeta.ru. Manually download gazeta_v1.csv.zip from https://drive.google.com/file/d/18B8CvHgmwwyz9GWBZ0TS6dE_x6gYnWCb/view

Licence

MIT

Support

Chat — https://telegram.me/natural_language_processing
Issues — https://github.com/natasha/corus/issues

Development

Tests:

make test

Add new source:

Implement corus/sources/<source>.py
Add import into corus/sources/__init__.py
Add meta into corus/sources/meta.py
Add example into docs.ipynb (check meta table is correct)
Run tests (readme is updated)
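As a rough illustration of the first step, a source module could expose a record type and a generator that parses the archive lazily. This is a sketch using only the standard library, not the actual corus internals. The names `ExampleRecord` and `load_example`, and the CSV columns `url`, `title`, `text`, are hypothetical.

```python
import csv
import gzip
from collections import namedtuple

# Hypothetical record type; a real source would define fields matching its data.
ExampleRecord = namedtuple('ExampleRecord', ['url', 'title', 'text'])


def load_example(path):
    """Lazily yield one ExampleRecord per row of a gzip-compressed CSV file."""
    # newline='' lets the csv module handle line breaks inside quoted fields.
    with gzip.open(path, mode='rt', encoding='utf8', newline='') as file:
        for row in csv.DictReader(file):
            yield ExampleRecord(row['url'], row['title'], row['text'])
```

Writing the loader as a generator means it does not hold the whole file in memory, which matters for multi-gigabyte dumps like those in the Reference table.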

Package:

make version
git push
git push --tags

make clean wheel upload
