This project was part of a Big Data course taken in the Fall of 2014, during my master's.
According to Moat et al. (1), “…data on changes in how often financially related Wikipedia pages were viewed may have contained early signs of stock market moves”, and according to Preis et al. (2), “By analyzing changes in Google query volumes for search terms related to finance, we find patterns that may be interpreted as ‘early warning signs’ of stock market moves.” The primary motivation for this project was simply to verify the above claims using Hadoop as a tool. Apple Inc.'s stock performance was measured against sentiment analysis of data from internet archives and against Wikipedia page counts related to Apple products.
- Web archive (WARC) files: publicly available archived web content stored in Amazon S3 buckets.
- Wikipedia page counts for Apple's products
- Web archive files were read from Amazon's S3 storage
- The data extraction stage involved reading data from the S3 buckets, filtering content related to Apple's products, and storing the extracted data in text files (see the first sketch after this list). The cluster consisted of 10 worker nodes and 1 master node
- The sentiment analysis stage involved reading the extracted text files, computing sentiment, and storing the results in a text file keyed by date (second sketch below)
- Wikipedia page counts related to Apple's products were scraped by date using a Python script (third sketch below)
- The resulting data was analyzed using Google Fusion Tables.
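As an illustration of the extraction stage, here is a minimal Hadoop Streaming-style mapper sketch. The keyword list, the one-record-per-line input format, and the file name `extract_mapper.py` are assumptions for illustration, not the exact logic in ExtractionMRCode/ or cleanWithoutTextblob/.

```python
#!/usr/bin/env python
"""Hypothetical Hadoop Streaming mapper for the extraction stage.

Assumptions (for illustration only): each input line is one text record, and a
record is kept if it mentions any of the hard-coded Apple product keywords.
"""
import sys

KEYWORDS = ("iphone", "ipad", "macbook", "apple inc")  # assumed product terms


def main():
    for line in sys.stdin:
        # Case-insensitive keyword match; matching lines are passed through
        # unchanged so an identity reducer (or no reducer) can collect them.
        text = line.lower()
        if any(keyword in text for keyword in KEYWORDS):
            sys.stdout.write(line)


if __name__ == "__main__":
    main()
```

A mapper like this would be submitted with the Hadoop Streaming jar (`hadoop jar hadoop-streaming.jar -input … -output … -mapper extract_mapper.py`); the input and output paths here are placeholders.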
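For the sentiment stage, which used TextBlob, a hedged sketch of computing the average polarity per date is shown below. The tab-separated `date<TAB>text` input format is an assumption; the real sentiment/ code may aggregate differently.

```python
#!/usr/bin/env python
"""Hypothetical sentiment step using TextBlob.

Assumes each input line is "<YYYYMMDD>\t<extracted text>" and emits the
average polarity per date.
"""
import sys
from collections import defaultdict

from textblob import TextBlob


def main():
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in sys.stdin:
        try:
            date, text = line.rstrip("\n").split("\t", 1)
        except ValueError:
            continue  # skip malformed lines
        # TextBlob polarity is a float in [-1.0, 1.0]; values below 0 indicate negative sentiment.
        totals[date] += TextBlob(text).sentiment.polarity
        counts[date] += 1
    for date in sorted(totals):
        print("%s\t%.4f" % (date, totals[date] / counts[date]))


if __name__ == "__main__":
    main()
```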
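For the Wikipedia page counts, a minimal sketch of parsing an hourly Wikimedia pagecounts dump is below. The dump file name, the article titles, and the space-separated record format are assumptions; collectWikipediaPageCounts/ may use a different source or format.

```python
#!/usr/bin/env python
"""Hypothetical parser for an hourly Wikimedia pagecounts dump.

Assumes the classic space-separated "project page_title view_count bytes"
format and English-Wikipedia lines only; file name and article titles are
placeholders.
"""
import gzip
from collections import Counter

PAGES = {"IPhone", "IPad", "MacBook", "Apple_Inc."}  # assumed article titles


def count_views(path):
    """Sum the hourly view counts for the tracked Apple-related articles."""
    views = Counter()
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split(" ")
            if len(parts) >= 3 and parts[0] == "en" and parts[1] in PAGES:
                views[parts[1]] += int(parts[2])
    return views


if __name__ == "__main__":
    # Placeholder file name; hourly dumps would be fetched for each date of interest.
    print(count_views("pagecounts-20141001-000000.gz"))
```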
| Folder | Description |
|---|---|
| ExtractionMRCode/ | MapReduce code to extract data relevant to Apple products, using the TextBlob library |
| cleanWithoutTextblob/ | MapReduce code to extract data relevant to Apple products without the TextBlob library. This version was used on the AWS EMR cluster, since TextBlob is not needed for extraction and the code is simpler |
| collectWikipediaPageCounts/ | Code to collect page view counts for Apple products on Wikipedia |
| sentiment/ | MapReduce code to calculate the sentiment of the extracted data |
| stats/ | Extracted sentiment data and stock values |
| utilityCode/ | Python scripts for extracting, cleaning, and filtering data |
The presentation contains further details on the data source (WARC files) and the overall data flow.
Here is the visualization produced from the final data: data visualization
This course focused on the data engineering part, i.e., running Hadoop MapReduce code on a cluster, which was successfully achieved.
However, at the end of the project I felt there was a lot more to this, and a need to learn more and build upon it:
- Learn statistical methods and machine learning to gain further insights and possibly build predictive models.
- Use a custom visualization library like D3.js
- Use consistent data. The web archive files were recorded at certain intervals, so the collected data was not in line with the dates on which the actual posts were made.
- (1) Quantifying Wikipedia Usage Patterns Before Stock Market Moves (Helen Susannah Moat, Chester Curme, Adam Avakian, Dror Y. Kenett, H. Eugene Stanley & Tobias Preis). http://rdcu.be/x9Eo
- (2) Quantifying Trading Behavior in Financial Markets Using Google Trends (Tobias Preis, Helen Susannah Moat & H. Eugene Stanley). http://rdcu.be/x9EN