Web Crawler for gathering financial news for scientific research purposes.
The stack below was chosen mainly for its ease of installation and configuration, as well as its efficiency and portability.
- Language: Python (3.7.4)
- Core crawler engine: Scrapy (1.8.0)
This system was developed on Ubuntu 16.04 but will work properly on any other operating system (OS X, Windows, etc.).
However, this guide only includes instructions for plugins and packages that are not already installed on this OS. For this reason, we assume that a Python interpreter and pip are ready for use; setting them up is outside the scope of this document.
- Now install the pipenv dependency manager:
$ pip install --user pipenv
Now we'll start setting up the project.
- Clone the repo from GitHub and change to the project root directory. Then install the project dependencies and enter the Python virtual environment by running:

$ pipenv install
$ pipenv shell

Now we will run the project and get the query results.
- The project will do the following actions:
  i. Download raw data such as JSON or HTML in a controlled, not abusive manner, respecting the instructions in `robots.txt` files (see the settings sketch after the usage examples below)
  ii. Save the data to a specific folder according to the webpage source and the selected stock or asset
  iii. Parse the HTML or JSON and extract the data to a CSV file, keeping only preselected fields
- To run all the steps above, run:

$ python scraper

- To run just step iii above, pass the argument `--step=gen_csv`:

$ python scraper --step=gen_csv
- Only stocks present in `companies_list.csv` are allowed to run if `--website=wsj_news`. So, if necessary, include new entries in this file, following the existing entries as examples.
- Required params:
  - `--crawl_type`: allowed values are `api` and `website`
  - `--stock`: a single stock ticker present in the file `companies_list.csv`, or the value `list` to run for all stocks in `companies_list.csv`
- When no params are passed, the scraper runs a crawler with the following default characteristics:
  - `--website=wsj_news`: the Wall Street Journal API
  - `--mode=default`: default operation mode
  - `--step=all`: run all steps mentioned before
  - `--max_requests=2`: maximum number of consecutive requests
  - `--max_elements=100`: maximum number of elements to fetch
  - `--start_time`: the current time
  - `--end_time`: unused, as the system defaults to the maximum number of consecutive requests
- Example running with all possible fields:

$ python scraper --crawl_type=api --website=wsj_news --stock=btcusd --mode=default --step=all --max_requests=2 --end_time=2019-10-18T18:59:08 --start_time=2019-11-12T22:30:00
- Example running in greedy mode. Beware: greedy mode will run until exhaustion and may lead to blockage if not used properly. This mode ignores `end_time` and `max_requests`, but it can consider `start_time`.

$ python scraper --website=wsj_news --stock=btcusd --mode=greedy
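For context on step i above, Scrapy exposes its polite-crawling behaviour through standard settings. The snippet below is only a minimal sketch of what such a configuration typically looks like; the values actually used by this project may differ.

```python
# settings.py -- illustrative values only; the project's real settings may differ.
BOT_NAME = "scraper"

# Honour the rules published in each site's robots.txt (built-in Scrapy setting).
ROBOTSTXT_OBEY = True

# Keep the crawl controlled and not abusive: space out requests and
# limit concurrency per domain.
DOWNLOAD_DELAY = 2
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let Scrapy adapt the delay to how quickly each server responds.
AUTOTHROTTLE_ENABLED = True
```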
Tests

Unit and integration tests will guarantee consistency and good maintainability.
Job scheduler
Adding a feature for scheduling each job separately will improve performance.
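Until such a scheduler exists, runs can be triggered externally. The sketch below is a purely illustrative interim approach using only the standard library; the script name, interval, and arguments are assumptions, not part of the project.

```python
# run_periodically.py -- hypothetical interim scheduler, not part of the project.
import subprocess
import time

INTERVAL_SECONDS = 60 * 60  # assumption: one run per hour

while True:
    # Invoke the scraper exactly as in the usage examples above.
    subprocess.run(
        ["python", "scraper", "--website=wsj_news", "--stock=btcusd", "--mode=default"],
        check=False,
    )
    time.sleep(INTERVAL_SECONDS)
```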
Work for any site
Right now the code only works for one specific news website; it will be adjusted to support some other candidate websites (a rough spider sketch follows the list below):
- Yahoo Finance
- Business Insider
- News BTC
- Be In Crypto
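For reference, supporting an additional site with Scrapy usually means writing another spider. The sketch below is hypothetical: the class name, start URL, and CSS selectors are placeholders and do not reflect the project's actual code.

```python
# Hypothetical spider for one of the candidate sites; selectors are placeholders.
import scrapy


class YahooFinanceNewsSpider(scrapy.Spider):
    name = "yahoo_finance_news"
    # Placeholder start URL; a real crawl would be driven by the selected stock.
    start_urls = ["https://finance.yahoo.com/topic/crypto/"]

    def parse(self, response):
        # Placeholder selectors; they must be adapted to the site's actual markup.
        for article in response.css("h3 a"):
            yield {
                "headline": article.css("::text").get(),
                "url": response.urljoin(article.attrib.get("href", "")),
            }
```

Such a spider can be tried in isolation with `scrapy runspider <file>.py -o sample.json` before being wired into the scraper's website options.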
Design improvements
- It seems Scrapy expects all loops within the same website to be self-contained.
Known issues
- See the open issues on GitHub.
- Go play around! =)