Web crawler, text extractor and (Persian) text cleaner.
The aim of this project is to provide a corpus for Persian (or any other) language. Aside from the Wikipedia dataset for Persian (or any other language), which might not be large enough, it is a good idea to crawl the World Wide Web and extract web page content to build our own dataset for problems that require a large amount of raw text.
In this project, besides the open source Python packages, aaronsw/html2text is also used. For more info please visit https://github.com/aaronsw/html2text
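To give a rough idea of how that dependency is typically used, here is an illustrative snippet (not the project's actual extraction code; `page.html` is just a placeholder file name):

```python
import html2text

# read a previously downloaded page (placeholder file name)
with open("page.html", encoding="utf-8") as f:
    html = f.read()

# convert the HTML to readable plain text (markdown-flavoured)
text = html2text.html2text(html)
print(text[:500])
```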
For those of you who are only looking for Persian raw text, there is already a repo containing roughly 70 GB of Persian raw text. Here is the link: https://github.com/persiannlp/persian-raw-text
In order to create such a dataset we go through the following steps:
- We create a list of initial web page URLs which we will start from
- We extract sublinks from the said initial list (in this project the BFS method is used; see the sketch after this list)
- We go through all the extracted URLs and download the HTML file each one points to
- We go through the downloaded HTML files and remove unwanted text such as HTML tags, front-end/back-end code, redundant words, characters, etc.
- Finally, we save the clean raw output text for further use.
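For illustration, a minimal sketch of the BFS sublink extraction could look like the snippet below. This is not the project's subLink_extractor.py; the function name, the `max_pages` limit and the use of requests/BeautifulSoup are assumptions.

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def extract_sublinks(start_urls, max_pages=1000):
    """Breadth-first crawl that collects the links reachable from the initial URLs."""
    queue = deque(start_urls)
    seen = set(start_urls)
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)  # BFS: enqueue newly discovered pages
    return seen
```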
- Open the terminal in your preferred directory and run the command: `scrapy startproject text_extractor` (needless to say, you should have the scrapy package installed)
- Go into the `./text_extractor` directory (everything is done here), then fill the `listOfInitialSites.txt` file with the desired URLs
- Run `subLink_extractor.py` (the results should be in a `subLink` folder)
- Run `url_set_creator.py` to gather all the extracted URLs in one place (`url_set.txt`)
- Run `spider_creator.py` to create the spiders (a minimal sketch of what such a spider might look like is shown after this list)
- Depending on the `num_spiders` value set at the beginning of `spider_creator.py`, open that many terminals to run independently (the default value is 7)
- In each terminal run the command: `scrapy crawl spider$ -s JOBDIR=crawls/spider$`, putting the spider number in place of `$` (for example, to run the first spider you should type `scrapy crawl spider1 -s JOBDIR=crawls/spider1`)

NOTE: since this part can take several hours to several days (depending on the size of your url_set), the crawl can be halted at any moment; progress is saved as it runs. So if the program stops for any reason (Ctrl+C, an unhandled exception, a power cut, etc.), the progress is not lost, and by running the exact same command - `scrapy crawl spider$ -s JOBDIR=crawls/spider$` - the crawl will resume from the point it was interrupted.

- The downloaded content should be in a `downloaded_content` folder
- Run `text_cleaner.py` to get the final output (placed in a `raw_text` folder)
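For context, a spider generated by spider_creator.py might look roughly like the sketch below. This is an assumption about its structure, not the generated code itself: it reads URLs from url_set.txt and saves each page's raw HTML into the downloaded_content folder.

```python
import hashlib
import os

import scrapy


class Spider1(scrapy.Spider):
    """Hypothetical sketch of a generated spider, not the actual output of spider_creator.py."""

    name = "spider1"

    def start_requests(self):
        # each spider would normally get its own slice of url_set.txt; here we read the whole file
        with open("url_set.txt", encoding="utf-8") as f:
            for url in f:
                url = url.strip()
                if url:
                    yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # save the raw HTML so text_cleaner.py can process it later
        os.makedirs("downloaded_content", exist_ok=True)
        fname = hashlib.md5(response.url.encode("utf-8")).hexdigest() + ".html"
        with open(os.path.join("downloaded_content", fname), "wb") as f:
            f.write(response.body)
```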
Have in mind that the text_cleaner.py I wrote only works correctly for Persian; if you wish to use the project for any other language, you should implement the text-cleaning step on your own (or change the code I've written, which is not as hard as you might think!).
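If you do adapt it, the core idea is just a handful of regex substitutions. Here is a simplified sketch (not the actual text_cleaner.py): it drops leftover HTML tags, keeps characters from the Persian/Arabic Unicode block plus digits and basic punctuation, and collapses whitespace.

```python
import re

# simplified sketch, not the project's text_cleaner.py
PERSIAN_RANGE = "\u0600-\u06FF\u200c"  # Arabic/Persian block plus the ZWNJ character

def clean_persian(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)                          # drop leftover HTML tags
    text = re.sub(f"[^{PERSIAN_RANGE}0-9\\s.,!?؟،:]", " ", text)  # drop non-Persian characters
    text = re.sub(r"\s+", " ", text)                              # collapse runs of whitespace
    return text.strip()

print(clean_persian("<p>سلام دنیا! hello world</p>"))  # -> "سلام دنیا!"
```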
Since I wrote this whole project to have some data to train a word embedding using fastText, the code needed to train the model is also included as fasttext.ipynb.
You might notice that it requires the data to be in an input.txt file.
There is also an (optional) section to download some of the data from the 70 GB repo and use that, plus any additional data you might want to add, to train the model.
All that matters is that you have all the data in an input.txt file before you try to train the model.
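For reference, the core of such a training step boils down to a call like the one below (a minimal sketch using the fasttext Python package; the hyperparameters and output file name are placeholders, so check fasttext.ipynb for the actual settings):

```python
import fasttext

# assumes the cleaned corpus sits in input.txt, one document/sentence per line
model = fasttext.train_unsupervised(
    "input.txt",
    model="skipgram",  # or "cbow"
    dim=300,           # embedding dimension
    epoch=5,
    minCount=5,        # ignore very rare words
)

model.save_model("persian_embeddings.bin")
print(model.get_nearest_neighbors("کتاب"))  # e.g. inspect neighbours of the word for "book"
```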
Happy scraping and training!
Please feel free to DM me if you come across any problems or questions.