PySpark script files for searching json Twitter stream data. Primarily for finding the context of a particular emoji/character.
Searches Twitter archive data (from archive.org) to find the characters which occur before and after a chosen target within a certain window. This is useful for analyzing how emoji are used in context and how they are combined. For more information see the original non-Spark version here.
Uses Spark for parallelized read of large Twitter datasets.
Edit setup-submit.sh to change the Spark job.
Arguments to the full_search_spark.py job are:
data_path: Path to the Twitter archiveemoji_match: Name of emoji to matchwindow: Window size for adjacencytop: Number of top characters to output/displayverbose: Print info messages