CC-100: Monolingual Datasets from Web Crawl Data

This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository, by processing the January-December 2018 Common Crawl snapshots. Each file consists of documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open source CC-Net repository. No claims of intellectual property are made on the work of preparation of the corpus.
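
The file layout described above (a blank line between documents, one paragraph per line) can be read with a small streaming parser. The following is a minimal sketch, not part of the corpus tooling: the function name iter_documents and the sample text are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each document as a list of its paragraphs.

    Assumes the layout described above: one paragraph per line,
    and an empty line between consecutive documents.
    """
    paragraphs: List[str] = []
    for line in lines:
        text = line.rstrip("\n")
        if text:
            # A non-empty line is one paragraph of the current document.
            paragraphs.append(text)
        elif paragraphs:
            # An empty line closes the current document.
            yield paragraphs
            paragraphs = []
    if paragraphs:
        # The last document may not be followed by an empty line.
        yield paragraphs


# Hypothetical sample text in the described layout (not real corpus data).
sample = (
    "First document, paragraph one.\n"
    "First document, paragraph two.\n"
    "\n"
    "Second document, single paragraph.\n"
)
docs = list(iter_documents(sample.splitlines(keepends=True)))
print(len(docs), [len(d) for d in docs])
```

Because the function accepts any iterable of lines, it can be given an open file handle directly, so a multi-gigabyte language file never has to be loaded into memory at once. If the downloaded file is compressed, it would first need to be opened with a matching decompressing reader; the compression format is not stated here, so that step is left out.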

References

Please cite the following if you find the resources in this corpus useful.

Download

af Afrikaans (305M)
am Amharic (133M)
ar Arabic (5.4G)
as Assamese (7.6M)
az Azerbaijani (1.3G)
be Belarusian (692M)
bg Bulgarian (9.3G)
bn Bengali (860M)
bn_rom Bengali Romanized (164M)
br Breton (21M)
bs Bosnian (18M)
ca Catalan (2.4G)
cs Czech (4.4G)
cy Welsh (179M)
da Danish (12G)
de German (18G)
el Greek (7.4G)
en English (82G)
eo Esperanto (250M)
es Spanish (14G)
et Estonian (1.7G)
eu Basque (488M)
fa Persian (20G)
ff Fulah (3.1M)
fi Finnish (15G)
fr French (14G)
fy Frisian (38M)
ga Irish (108M)
gd Scottish Gaelic (22M)
gl Galician (708M)
gn Guarani (1.5M)
gu Gujarati (242M)
ha Hausa (61M)
he Hebrew (6.1G)
hi Hindi (2.5G)
hi_rom Hindi Romanized (129M)
hr Croatian (5.7G)
ht Haitian (9.1M)
hu Hungarian (15G)
hy Armenian (776M)
id Indonesian (36G)
ig Igbo (6.6M)
is Icelandic (779M)
it Italian (7.8G)
ja Japanese (15G)
jv Javanese (37M)
ka Georgian (1.1G)
kk Kazakh (889M)
km Khmer (153M)
kn Kannada (360M)
ko Korean (14G)
ku Kurdish (90M)
ky Kyrgyz (173M)
la Latin (609M)
lg Ganda (7.3M)
li Limburgish (2.2M)
ln Lingala (2.3M)
lo Lao (63M)
lt Lithuanian (3.4G)
lv Latvian (2.1G)
mg Malagasy (29M)
mk Macedonian (706M)
ml Malayalam (831M)
mn Mongolian (397M)
mr Marathi (334M)
ms Malay (2.1G)
my Burmese (46M)
my_zaw Burmese (Zawgyi) (178M)
ne Nepali (393M)
nl Dutch (7.9G)
no Norwegian (13G)
ns Northern Sotho (1.8M)
om Oromo (11M)
or Oriya (56M)
pa Punjabi (90M)
pl Polish (12G)
ps Pashto (107M)
pt Portuguese (13G)
qu Quechua (1.5M)
rm Romansh (4.8M)
ro Romanian (16G)
ru Russian (46G)
sa Sanskrit (44M)
si Sinhala (452M)
sc Sardinian (143K)
sd Sindhi (67M)
sk Slovak (6.1G)
sl Slovenian (2.8G)
so Somali (78M)
sq Albanian (1.3G)
sr Serbian (1.5G)
ss Swati (86K)
su Sundanese (15M)
sv Swedish (21G)
sw Swahili (332M)
ta Tamil (1.3G)
ta_rom Tamil Romanized (68M)
te Telugu (536M)
te_rom Telugu Romanized (79M)
th Thai (8.7G)
tl Tagalog (701M)
tn Tswana (8.0M)
tr Turkish (5.4G)
ug Uyghur (46M)
uk Ukrainian (14G)
ur Urdu (884M)
ur_rom Urdu Romanized (141M)
uz Uzbek (155M)
vi Vietnamese (28G)
wo Wolof (3.6M)
xh Xhosa (25M)
yi Yiddish (51M)
yo Yoruba (1.1M)
zh-Hans Chinese (Simplified) (14G)
zh-Hant Chinese (Traditional) (5.3G)
zu Zulu (4.3M)