
natasha/corus

Links to publicly available Russian corpora, plus code for loading and parsing them. 20+ datasets, 350+ Gb of text.

Usage

For example, let's use the dump of lenta.ru by @yutkin. Manually download the archive (the link is in the Reference section):

wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz

Use corus to load the data:

>>> from corus import load_lenta

>>> path = 'lenta-ru-news.csv.gz'
>>> records = load_lenta(path)
>>> next(records)

LentaRecord(
    url='https://lenta.ru/news/2018/12/14/cancer/',
    title='Названы регионы России с\xa0самой высокой смертностью от\xa0рака',
    text='Вице-премьер по социальным вопросам Татьяна Голикова рассказала, в каких регионах России зафиксирована наиболее высокая смертность от рака, сооб...',
    topic='Россия',
    tags='Общество'
)
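Since next(records) works on the result, the loader appears to return a lazy iterator. Assuming that holds, a small helper can peek at the first few records without reading the whole archive. This is a minimal sketch: head is a hypothetical name used for illustration, not part of corus.

```python
from itertools import islice


def head(records, n=3):
    """Return the first n items of any iterable as a list.

    Illustrative helper, not part of corus.
    """
    return list(islice(records, n))


# Usage with corus (assumes the archive was downloaded as shown above):
#   from corus import load_lenta
#   for record in head(load_lenta('lenta-ru-news.csv.gz')):
#       print(record.title)
```

Only the first n items are consumed, so the rest of the iterator stays available for further processing.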

Iterate over texts:

>>> records = load_lenta(path)
>>> for record in records:
...     text = record.text
...     ...
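The loop above can be put to work, for example, to count how many records fall under each topic. The sketch below is an illustration rather than part of corus: count_topics is a hypothetical helper, and the stand-in Record type only mimics the topic field shown in the LentaRecord output so that the code runs without the archive.

```python
from collections import Counter, namedtuple


def count_topics(records):
    """Count how often each `topic` value occurs in an iterable of records."""
    return Counter(record.topic for record in records)


# Stand-in records so the sketch runs without the archive; with corus,
# pass load_lenta(path) instead of `sample`.
Record = namedtuple('Record', ['text', 'topic'])
sample = [
    Record('...', 'Россия'),
    Record('...', 'Мир'),
    Record('...', 'Россия'),
]
print(count_topics(sample).most_common(1))  # [('Россия', 2)]
```

If the loader does return a one-shot iterator, a second pass over the data needs a fresh load_lenta(path) call, which is what the example above does before the loop.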

For links to other datasets and their loaders see the Reference section.

Install

corus supports Python 2.7+, 3.4+ and PyPy 2, 3.

$ pip install corus

Reference

Dataset | API (from corus import) | Tags | Texts | Uncompressed | Description

Lenta.ru | load_lenta | #news | 739 351 | 1.66 Gb | Dump of lenta.ru

wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz

Lib.rus.ec | load_librusec | #lit | 301 871 | 144.92 Gb | Dump of lib.rus.ec prepared for RUSSE workshop

wget http://panchenko.me/data/russe/librusec_fb2.plain.gz

Rossiya Segodnya | load_ria_raw, load_ria | #news | 1 003 869 | 3.70 Gb |

wget https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz

factRuEval-2016 | load_factru | #ner #news | 254 | 969.27 Kb | Manual PER, LOC, ORG markup prepared for 2016 Dialog competition.

wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip
unzip master.zip
rm master.zip

Gareev | load_gareev | #ner #news | 97 | 455.02 Kb | Manual PER, ORG markup.

Email Rinat Gareev ([email protected]) to ask for the dataset
tar -xvf rus-ner-news-corpus.iob.tar.gz
rm rus-ner-news-corpus.iob.tar.gz

Collection5 | load_ne5 | #ner #news | 1 000 | 2.96 Mb | News articles with manual PER, LOC, ORG markup.

wget http://www.labinform.ru/pub/named_entities/collection5.zip
unzip collection5.zip
rm collection5.zip

WiNER | load_wikiner | #ner | 203 287 | 36.15 Mb | Sentences from Wiki auto-annotated with PER, LOC, ORG tags.

wget https://github.com/dice-group/FOX/raw/master/input/Wikiner/aij-wikiner-ru-wp3.bz2

Mokoron Russian Twitter Corpus | load_mokoron | #social | 17 633 417 | 1.86 Gb | Russian tweets.

Manually download https://www.dropbox.com/s/9egqjszeicki4ho/db.sql

Wikipedia | load_wiki | | 1 541 401 | 12.94 Gb | Russian Wiki dump.

wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2

Taiga | | | | | Large collection of Russian texts from various sources: news sites, magazines, literature, social networks.

wget https://linghub.ru/static/Taiga/retagged_taiga.tar.gz
tar -xzvf retagged_taiga.tar.gz

Arzamas | load_taiga_arzamas | #news | 311 | 4.50 Mb | Dump of arzamas.academy.
Fontanka | load_taiga_fontanka | #news | 342 683 | 786.23 Mb | Dump of fontanka.ru.
Interfax | load_taiga_interfax | #news | 46 429 | 77.55 Mb | Dump of interfax.ru.
KP | load_taiga_kp | #news | 45 503 | 61.79 Mb | Dump of kp.ru.
Lenta | load_taiga_lenta | #news | 36 446 | 95.15 Mb | Dump of lenta.ru.
Taiga/N+1 | load_taiga_nplus1 | #news | 7 696 | 24.96 Mb | Dump of nplus1.ru.
Magazines | load_taiga_magazines | | 39 890 | 2.19 Gb | Dump of magazines.russ.ru
Subtitles | load_taiga_subtitles | | 19 011 | 909.08 Mb |
Social | load_taiga_social | #social | 1 876 442 | 648.18 Mb |
Proza | load_taiga_proza | #lit | 1 732 434 | 38.25 Gb | Dump of proza.ru
Stihi | load_taiga_stihi | | 9 157 686 | 12.80 Gb | Dump of stihi.ru

Russian NLP Datasets | | | | | Several Russian news datasets from webhose.io, lenta.ru and other news sites.

Lenta | load_buriy_lenta | #news | 699 777 | 1.57 Gb | Dump of lenta.ru.

wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/lenta.tar.bz2

News | load_buriy_news | #news | 2 154 801 | 6.84 Gb | Dump of top 40 news + 20 fashion news sites.

wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2014.tar.bz2
wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part1.tar.bz2
wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part2.tar.bz2

Webhose | load_buriy_webhose | #news | 285 965 | 859.32 Mb | Dump from webhose.io, 300 sources for one month.

wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/stress.tar.gz

ODS #proj_news_viz | | | | | Several news sites scraped by members of the #proj_news_viz ODS project.

Interfax | load_ods_interfax | #news | 543 962 | 1.22 Gb | Dump of interfax.ru.

Manually download interfax_v1.csv.zip from https://drive.google.com/file/d/1M7z0YoOgpm53IsJ3qOhT_nfiDnGUPeys/view

Gazeta | load_ods_gazeta | #news | 865 847 | 1.63 Gb | Dump of gazeta.ru.

Manually download gazeta_v1.csv.zip from https://drive.google.com/file/d/18B8CvHgmwwyz9GWBZ0TS6dE_x6gYnWCb/view
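For the entries that list a direct wget link, the download step could also be scripted from Python. The following is a minimal sketch for Python 3 using only the standard library; archive_name and download are hypothetical helper names, not part of corus, and the sketch does not cover the entries that need a manual download (Dropbox, Google Drive, email) or unpacking with unzip or tar.

```python
import os
import posixpath
from urllib.parse import urlparse
from urllib.request import urlretrieve


def archive_name(url):
    """Return the last path component of a URL, e.g. 'lenta-ru-news.csv.gz'."""
    return posixpath.basename(urlparse(url).path)


def download(url, directory='.'):
    """Fetch url into directory unless the file already exists; return its path."""
    path = os.path.join(directory, archive_name(url))
    if not os.path.exists(path):
        urlretrieve(url, path)
    return path


# Usage (performs a network request on first call):
#   path = download('https://github.com/yutkin/Lenta.Ru-News-Dataset/'
#                   'releases/download/v1.0/lenta-ru-news.csv.gz')
```

Skipping files that already exist avoids re-fetching multi-gigabyte archives, at the cost of not noticing a partially downloaded file.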

Licence

MIT

Support

Development

Tests:

make test

Add a new source:

  1. Implement corus/sources/<source>.py
  2. Add the import to corus/sources/__init__.py
  3. Add the meta to corus/sources/meta.py
  4. Add an example to docs.ipynb (check that the meta table is correct)
  5. Run the tests (the readme is updated)
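As a rough illustration of step 1, a source module boils down to a record type plus a loader function that yields records lazily. The sketch below is an assumption about the general shape only: ExampleRecord, load_example and the gzipped CSV layout with url, title and text columns are all hypothetical, and the real corus sources may rely on the project's own record classes and I/O helpers, which this readme does not show.

```python
import csv
import gzip
from collections import namedtuple

# Hypothetical record type; the field names are placeholders.
ExampleRecord = namedtuple('ExampleRecord', ['url', 'title', 'text'])


def load_example(path):
    """Lazily yield ExampleRecord items from a gzipped CSV file with a header row."""
    with gzip.open(path, 'rt', encoding='utf-8', newline='') as file:
        for row in csv.DictReader(file):
            yield ExampleRecord(row['url'], row['title'], row['text'])
```

Writing the loader as a generator keeps memory use flat on large dumps and matches the next(records) usage shown earlier.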

Package:

make version
git push
git push --tags

make clean wheel upload
