Web crawler, text extractor and (Persian) text cleaner.
The aim of this project is to provide a corpus for Persian (or any other) language. Aside from the Wikipedia dataset for Persian (or any other language), which might not be large enough, it is a good idea to crawl the World Wide Web and extract web page content to build our own dataset for problems that require a large amount of raw text.
In this project, besides the open source Python packages, aaronsw/html2text is also used. For more info please visit https://github.com/aaronsw/html2text
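To give a rough idea of how that dependency is typically used, here is an illustrative snippet (not the project's actual extraction code; `page.html` is just a placeholder file name):

```python
import html2text

# read a previously downloaded page (placeholder file name)
with open("page.html", encoding="utf-8") as f:
    html = f.read()

# convert the HTML to readable plain text (markdown-flavoured)
text = html2text.html2text(html)
print(text[:500])
```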
For those of you who are only looking for Persian raw text, there is already a repo containing roughly 70 GB of Persian raw text. Here is the link: https://github.com/persiannlp/persian-raw-text
In order to create such a dataset we go through the following steps:
- We create a list of initial web page URLs which we will start from
- We extract sublinks from the said initial list (in this project the BFS method is used; see the sketch after this list)
- We go through all the extracted URLs and download the HTML file each one points to
- We go through the downloaded HTML files and remove unwanted text such as HTML tags, front-end/back-end code, redundant words, characters, etc.
- Finally, we save the clean raw output text for further use.
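For illustration, a minimal sketch of the BFS sublink extraction could look like the snippet below. This is not the project's subLink_extractor.py; the function name, the `max_pages` limit and the use of requests/BeautifulSoup are assumptions.

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def extract_sublinks(start_urls, max_pages=1000):
    """Breadth-first crawl that collects the links reachable from the initial URLs."""
    queue = deque(start_urls)
    seen = set(start_urls)
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)  # BFS: enqueue newly discovered pages
    return seen
```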
- Open the terminal in your preferred directory and run the command: `scrapy startproject text_extractor` (needless to say, you should have the scrapy package installed)
- Go into the `./text_extractor` directory (everything is done here), then fill the `listOfInitialSites.txt` file with the desired URLs
- Run `subLink_extractor.py` (the results should be in a `subLink` folder)
- Run `url_set_creator.py` to gather all the extracted URLs in one place (`url_set.txt`)
- Run `spider_creator.py` to create the spiders (a minimal sketch of what such a spider might look like is shown after this list)
- Depending on the `num_spiders` value set at the beginning of `spider_creator.py`, open that many terminals to run independently (the default value is 7)
- In each terminal run the command: `scrapy crawl spider$ -s JOBDIR=crawls/spider$`, putting the spider number in place of `$` (for example, to run the first spider you should type `scrapy crawl spider1 -s JOBDIR=crawls/spider1`)

NOTE: since this part can take several hours to several days (depending on the size of your url_set), the crawl can be halted at any moment; progress is saved as it runs. So if the program stops for any reason (Ctrl+C, an unhandled exception, a power cut, etc.), the progress is not lost, and by running the exact same command - `scrapy crawl spider$ -s JOBDIR=crawls/spider$` - the crawl will resume from the point it was interrupted.

- The downloaded content should be in a `downloaded_content` folder
- Run `text_cleaner.py` to get the final output (placed in a `raw_text` folder)
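For context, a spider generated by spider_creator.py might look roughly like the sketch below. This is an assumption about its structure, not the generated code itself: it reads URLs from url_set.txt and saves each page's raw HTML into the downloaded_content folder.

```python
import hashlib
import os

import scrapy


class Spider1(scrapy.Spider):
    """Hypothetical sketch of a generated spider, not the actual output of spider_creator.py."""

    name = "spider1"

    def start_requests(self):
        # each spider would normally get its own slice of url_set.txt; here we read the whole file
        with open("url_set.txt", encoding="utf-8") as f:
            for url in f:
                url = url.strip()
                if url:
                    yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # save the raw HTML so text_cleaner.py can process it later
        os.makedirs("downloaded_content", exist_ok=True)
        fname = hashlib.md5(response.url.encode("utf-8")).hexdigest() + ".html"
        with open(os.path.join("downloaded_content", fname), "wb") as f:
            f.write(response.body)
```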
Have in mind that the text_cleaner.py I wrote only works correctly for Persian; if you wish to use the project for any other language, you should implement the text-cleaning step on your own (or change the code I've written, which is not as hard as you might think!).
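If you do adapt it, the core idea is just a handful of regex substitutions. Here is a simplified sketch (not the actual text_cleaner.py): it drops leftover HTML tags, keeps characters from the Persian/Arabic Unicode block plus digits and basic punctuation, and collapses whitespace.

```python
import re

# simplified sketch, not the project's text_cleaner.py
PERSIAN_RANGE = "\u0600-\u06FF\u200c"  # Arabic/Persian block plus the ZWNJ character

def clean_persian(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)                          # drop leftover HTML tags
    text = re.sub(f"[^{PERSIAN_RANGE}0-9\\s.,!?؟،:]", " ", text)  # drop non-Persian characters
    text = re.sub(r"\s+", " ", text)                              # collapse runs of whitespace
    return text.strip()

print(clean_persian("<p>سلام دنیا! hello world</p>"))  # -> "سلام دنیا!"
```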
Since I wrote this whole project to have some data to train a word embedding using fastText, the code needed to train the model is also included as fasttext.ipynb.
You might notice that it requires the data to be in an input.txt file.
There is also an (optional) section to download some of the data from the 70 GB repo and use that, plus any additional data you might want to add, to train the model.
All that matters is that you have all the data in an input.txt file before you try to train the model.
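For reference, the core of such a training step boils down to a call like the one below (a minimal sketch using the fasttext Python package; the hyperparameters and output file name are placeholders, so check fasttext.ipynb for the actual settings):

```python
import fasttext

# assumes the cleaned corpus sits in input.txt, one document/sentence per line
model = fasttext.train_unsupervised(
    "input.txt",
    model="skipgram",  # or "cbow"
    dim=300,           # embedding dimension
    epoch=5,
    minCount=5,        # ignore very rare words
)

model.save_model("persian_embeddings.bin")
print(model.get_nearest_neighbors("کتاب"))  # e.g. inspect neighbours of the word for "book"
```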
Happy scraping and training!
Please feel free to DM me if you come across any problems or questions.