This project was part of a Big Data course taken in the Fall of 2014, during my master's.
According to Moat et al. (1), “…data on changes in how often financially related Wikipedia pages were viewed may have contained early signs of stock market moves”, and according to Preis et al. (2), “By analyzing changes in Google query volumes for search terms related to finance, we find patterns that may be interpreted as ‘early warning signs’ of stock market moves.” The primary motivation for this project was simply to verify the above claims using Hadoop as a tool. Apple Inc.'s stock performance was measured against sentiment analysis of data from internet archives and against Wikipedia page counts related to Apple products.
- Web archive (WARC) files: publicly available archived web content stored in Amazon S3 buckets.
- Wikipedia page counts for Apple's products
- Web archive files were read from Amazon's S3 storage
- The data extraction stage involved reading data from the S3 buckets, filtering content related to Apple's products, and storing the extracted data in text files (see the first sketch after this list). The cluster consisted of 10 worker nodes and 1 master node
- The sentiment analysis stage involved reading the extracted text files, computing sentiment, and storing the results in a text file keyed by date (second sketch below)
- Wikipedia page counts related to Apple's products were scraped by date using a Python script (third sketch below)
- The resulting data was analyzed using Google Fusion Tables.
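As an illustration of the extraction stage, here is a minimal Hadoop Streaming-style mapper sketch. The keyword list, the one-record-per-line input format, and the file name `extract_mapper.py` are assumptions for illustration, not the exact logic in ExtractionMRCode/ or cleanWithoutTextblob/.

```python
#!/usr/bin/env python
"""Hypothetical Hadoop Streaming mapper for the extraction stage.

Assumptions (for illustration only): each input line is one text record, and a
record is kept if it mentions any of the hard-coded Apple product keywords.
"""
import sys

KEYWORDS = ("iphone", "ipad", "macbook", "apple inc")  # assumed product terms


def main():
    for line in sys.stdin:
        # Case-insensitive keyword match; matching lines are passed through
        # unchanged so an identity reducer (or no reducer) can collect them.
        text = line.lower()
        if any(keyword in text for keyword in KEYWORDS):
            sys.stdout.write(line)


if __name__ == "__main__":
    main()
```

A mapper like this would be submitted with the Hadoop Streaming jar (`hadoop jar hadoop-streaming.jar -input … -output … -mapper extract_mapper.py`); the input and output paths here are placeholders.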
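For the sentiment stage, which used TextBlob, a hedged sketch of computing the average polarity per date is shown below. The tab-separated `date<TAB>text` input format is an assumption; the real sentiment/ code may aggregate differently.

```python
#!/usr/bin/env python
"""Hypothetical sentiment step using TextBlob.

Assumes each input line is "<YYYYMMDD>\t<extracted text>" and emits the
average polarity per date.
"""
import sys
from collections import defaultdict

from textblob import TextBlob


def main():
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in sys.stdin:
        try:
            date, text = line.rstrip("\n").split("\t", 1)
        except ValueError:
            continue  # skip malformed lines
        # TextBlob polarity is a float in [-1.0, 1.0]; values below 0 indicate negative sentiment.
        totals[date] += TextBlob(text).sentiment.polarity
        counts[date] += 1
    for date in sorted(totals):
        print("%s\t%.4f" % (date, totals[date] / counts[date]))


if __name__ == "__main__":
    main()
```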
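For the Wikipedia page counts, a minimal sketch of parsing an hourly Wikimedia pagecounts dump is below. The dump file name, the article titles, and the space-separated record format are assumptions; collectWikipediaPageCounts/ may use a different source or format.

```python
#!/usr/bin/env python
"""Hypothetical parser for an hourly Wikimedia pagecounts dump.

Assumes the classic space-separated "project page_title view_count bytes"
format and English-Wikipedia lines only; file name and article titles are
placeholders.
"""
import gzip
from collections import Counter

PAGES = {"IPhone", "IPad", "MacBook", "Apple_Inc."}  # assumed article titles


def count_views(path):
    """Sum the hourly view counts for the tracked Apple-related articles."""
    views = Counter()
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split(" ")
            if len(parts) >= 3 and parts[0] == "en" and parts[1] in PAGES:
                views[parts[1]] += int(parts[2])
    return views


if __name__ == "__main__":
    # Placeholder file name; hourly dumps would be fetched for each date of interest.
    print(count_views("pagecounts-20141001-000000.gz"))
```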
| Folder | Description |
|---|---|
| ExtractionMRCode/ | MapReduce code to extract data relevant to Apple products, using the TextBlob library |
| cleanWithoutTextblob/ | MapReduce code to extract data relevant to Apple products without the TextBlob library. This version was used on the AWS EMR cluster, since TextBlob is not needed for extraction and the code is simpler |
| collectWikipediaPageCounts/ | Code to collect page view counts for Apple products on Wikipedia |
| sentiment/ | MapReduce code to calculate the sentiment of the extracted data |
| stats/ | Extracted sentiment data and stock values |
| utilityCode/ | Python scripts for extracting, cleaning, and filtering data |
The presentation contains further details on the data source (WARC files) and the overall data flow.
Here is the visualization produced from the final data: data visualization
This course focused on the data engineering part, i.e., running Hadoop MapReduce code on a cluster, which was successfully achieved.
However, at the end of the project I felt there was a lot more to this, and a need to learn more and build upon it:
- Learn statistical methods and machine learning to gain further insights and possibly build predictive models.
- Use a custom visualization library like D3.js
- Use consistent data. The web archive files were recorded at certain intervals, so the collected data was not in line with the dates on which the actual posts were made.
- (1) Quantifying Wikipedia Usage Patterns Before Stock Market Moves (Helen Susannah Moat, Chester Curme, Adam Avakian, Dror Y. Kenett, H. Eugene Stanley & Tobias Preis). http://rdcu.be/x9Eo
- (2) Quantifying Trading Behavior in Financial Markets Using Google Trends (Tobias Preis, Helen Susannah Moat & H. Eugene Stanley). http://rdcu.be/x9EN