This project scanned all geotagged tweets sent in 2020 to monitor the spread of coronavirus on social media. The process included:
- Working with large-scale datasets
- Working with multilingual text
- Using the MapReduce divide-and-conquer paradigm to create parallel code
- Created a mapper, `src/map.py`, that tracks the usage of the hashtags at both the language and country level. Running the mapper on a file produces two output files, one for the language dictionary and one for the country dictionary.
- Created a shell script, `run_maps.sh`, that loops over each file in the dataset and runs the mapper on it. The `nohup` command ensures the program continues to run after any disconnect.
- Reduced the mapped files with `src/reduce.py`, combining all `.lang` files into a single file and all `.country` files into a different file.
- Visualized the output files of the MapReduce process as bar graphs using `visualize.py`. The horizontal axis of each bar graph shows the keys of the input file and the vertical axis shows the values. Only the top 10 keys are included.
- Created an alternative visualization file, `src/alternative_reduce.py`, that combines the reduce and visualization steps. This file takes a list of hashtags as input and outputs a line plot.
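
The map and reduce steps described in the list above can be illustrated with a minimal sketch. It is not the code in `src/map.py` or `src/reduce.py`. The function names `map_tweets` and `reduce_counts`, the tweet fields `text`, `lang`, and `country`, and the toy tweets are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def map_tweets(tweets, hashtags):
    """Count, per hashtag, how many tweets in each language and country use it.

    `tweets` is an iterable of dicts with hypothetical keys
    "text", "lang", and "country". `hashtags` must be lowercase,
    because the tweet text is lowercased before matching.
    """
    by_lang = defaultdict(Counter)
    by_country = defaultdict(Counter)
    for tweet in tweets:
        text = tweet["text"].lower()
        for tag in hashtags:
            if tag in text:
                by_lang[tag][tweet["lang"]] += 1
                by_country[tag][tweet["country"]] += 1
    return by_lang, by_country


def reduce_counts(partials):
    """Merge several per-file dictionaries into one by summing the counts."""
    total = defaultdict(Counter)
    for partial in partials:
        for tag, counts in partial.items():
            total[tag].update(counts)  # Counter.update adds counts together
    return total


# Two toy "files" of tweets, invented for this example.
file_a = [
    {"text": "Stay safe #coronavirus", "lang": "en", "country": "US"},
    {"text": "#코로나바이러스 조심하세요", "lang": "ko", "country": "KR"},
]
file_b = [
    {"text": "#Coronavirus update", "lang": "en", "country": "GB"},
]
hashtags = ["#coronavirus", "#코로나바이러스"]

# Map each file independently, then reduce the partial results.
lang_a, country_a = map_tweets(file_a, hashtags)
lang_b, country_b = map_tweets(file_b, hashtags)
lang_total = reduce_counts([lang_a, lang_b])
country_total = reduce_counts([country_a, country_b])
```

Because each file is mapped independently, the map calls can run in parallel, and the reduce step only needs to sum the partial counts.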
To visualize the output files of MapReduce, I ran `visualize.py` with `--input_path` set in turn to the country and lang files created in the reduce phase, and with `--key` set to `#coronavirus` and `#코로나바이러스`. Results are arranged low to high.
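
The selection behind each bar graph (top 10 keys, arranged low to high) can be sketched as follows. This is an assumption about how such a selection might be written, not the code in `visualize.py`. The helper name `top_items` and the toy counts are hypothetical.

```python
def top_items(counts, n=10):
    """Return the n largest (key, value) pairs, ordered from low to high."""
    largest = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return sorted(largest, key=lambda kv: kv[1])


# Toy counts invented for this example; n=3 keeps the output short.
counts = {"a": 50, "b": 20, "c": 5, "d": 1, "e": 8}
top3 = top_items(counts, n=3)

# Split into the two sequences a bar chart needs.
keys = [k for k, _ in top3]      # horizontal axis
values = [v for _, v in top3]    # vertical axis
```

The `keys` and `values` lists can then be handed to a plotting library's bar-chart function.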
| Mentions of 코로나바이러스 by Language |
| --- |
| In this figure, we can see that the language that mentions 코로나바이러스 the most is Korean. Among the top 10 languages shown, the one that mentions 코로나바이러스 the least is Spanish. |