UNILEX data, extended to ease more downstream uses.
git clone [email protected]:lingua-libre/unilex-extended.git # clone base repository
git submodule update --init --recursive # initialize and update submodules locally
This directory contains scripts to convert the original UNILEX data:
./add-from-corpuscrawler.sh: if data is in google/corpuscrawler but not in the current unicode-org/unilex repository, pull it in.
./to-sorted.sh: takes unilex/data/frequency/{IETF}.txt and converts it into ./frequency-sorted-count/{IETF}.txt and ./frequency-sorted-hash/{IETF}.txt.
./unilex-to-letters.sh: for a given {IETF}.txt target, transforms the frequency file into n files, one per letter. See the human-friendly inline comments. Default: mr (Marathi).
./frequency-sorted-count/: lines of the form "na 77968661", sorted by descending count.
./frequency-sorted-hash/: lines of the form "# na" (wiki list), sorted by descending count.
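To make the two output formats concrete, here is a minimal sketch of the kind of conversion described above. It is not the repository's to-sorted.sh: the function name sort_frequency is hypothetical, and it assumes each input line holds a word and an integer count separated by whitespace, with any other line (such as a header) skipped.

```shell
#!/bin/sh
# Hypothetical helper for illustration only; NOT the repository's to-sorted.sh.
# Assumption: input lines are "word<whitespace>count"; other lines are ignored.
#   $1 = input frequency file
#   $2 = output file, "word count" lines sorted by descending count
#   $3 = output file, "# word" wiki-list lines in the same order
sort_frequency() {
    in="$1"; out_count="$2"; out_hash="$3"
    # Keep only two-field lines whose second field is an integer,
    # then sort numerically on the count, largest first.
    awk 'NF == 2 && $2 ~ /^[0-9]+$/ { print $1 " " $2 }' "$in" \
        | LC_ALL=C sort -k2,2nr > "$out_count"
    # Rewrite the sorted words as wiki numbered-list items.
    awk '{ print "# " $1 }' "$out_count" > "$out_hash"
}
```

Under those assumptions, a call for the default target could look like: sort_frequency unilex/data/frequency/mr.txt frequency-sorted-count/mr.txt frequency-sorted-hash/mr.txt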
Data is under the Unicode License (GNU-like).