Codestin Search App

Coronavirus Twitter Analysis

This project included scanning all geotagged tweets sent in 2020 to monitor the spread of coronavirus on social media. The process included:

Working with large scale datasets
Working with multilingual text
Using the MapReduce divide-and-conquer paradigm to create parallel code

Process:

Created a mapper, src/map.py that tracks the usage of the hashtags on both a language and country level. The output of running them mapper included two files, one for the langauge dictionary and one for the country dictionary.
Created shell scrip run_maps.sh to loop over each file in the dataset and run the mapper on each file. (utilized the nohup command to ensure the program continued to run after any disconnect)
Reduced mapped files to combine all .lang files into a single file and all .country files into a different file. This was done using src/reduce.py.
Visualized output files of the MapReduce process as bar graphs using visualize.py. The horizontal axis of bar graph included keys of input file and the vertical axis included values of the input file. This included only the top 10 keys.
Created an alternative visualization file, src/alternative_reduce.py to combine the reduce and visualization steps. This file takes a list of hashtags as input and outputs a line plot.

Findings:

To visualize output files of MapReduce, I set the --input_path of the visualize.py file equal to both the country and lang files created in the reduce phase, and the --key to #coronavirus and #코로나바이러스. Results are arranged low to high.


Mentions of Coronavirus by Country
In this figure we can see the number of tweets that mention coronavirus for each country. The country with the largest number of tweets is shown to be the United States and the smallest is Türkiye.


Mentions of Coronavirus by Language
In this figure, we can see the top 10 languages of tweets mentioning coronavirus. English is shown to be the language with the largest number of tweets and Dutch is the smallest.


Mentions of 코로나바이러스 by Country
In this figure, we see the number of tweets that mention 코로나바이러스 by country. We can see that Korea has the largest number of tweets mentioning 코로나바이러스 and Australia has the smallest.


Mentions of 코로나바이러스 by Language
In this figure, we can see that the language that mentions 코로나바이러스 the most is Korean. Alternatively, the language that mentions 코로나바이러스 the least is Spanish.


Number of Tweets of #hashtag during year by Day of Year
In this figure, we can see the number of tweets a specified hashtag was mentioned per day. We can see tweets mentioning #coronavirus spiked to over 50,000 tweets.

Name		Name	Last commit message	Last commit date
Latest commit History 57 Commits
figures		figures
outputs		outputs
src		src
.gitignore		.gitignore
README.md		README.md
nohup.out		nohup.out
reduced.country		reduced.country
reduced.lang		reduced.lang
run_maps.sh		run_maps.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Coronavirus Twitter Analysis

Process:

Findings:

About

Uh oh!

Releases

Packages

Languages

sjanefullerton/twitter_coronavirus

Folders and files

Latest commit

History

Repository files navigation

Coronavirus Twitter Analysis

Process:

Findings:

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages