twarc

twarc is a command line tool for archiving the tweets in a Twitter search result. Twitter search results live for only a week or so and are highly volatile. Results are stored as line-oriented JSON (each line is a complete JSON document), exactly as received from the Twitter API. twarc handles rate limiting and paging through large result sets, and it handles repeated runs of the same query by using the most recent tweet from the last run to determine when to stop.
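Because each line is a self-contained JSON document, the dumps are easy to process with ordinary tools. As a minimal sketch of the "when to stop" logic, assuming the standard v1.1 tweet payload with an id_str field (the function is illustrative, not twarc's actual code):

import json

# Find the newest tweet id in a previous dump, so a repeated run
# of the same query knows where the last run left off.
def newest_tweet_id(path):
    newest = 0
    with open(path) as f:
        for line in f:
            tweet = json.loads(line)
            newest = max(newest, int(tweet["id_str"]))
    return newest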

twarc was originally created to save tweets related to Aaron Swartz.

How To Use

  1. pip install -r requirements.txt
  2. cp config.py.example config.py
  3. add your Twitter API credentials to config.py (see the sketch after this list)
  4. ./twarc.py aaronsw
  5. cat aaronsw.json
  6. :-(
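The exact settings are defined by config.py.example, but a config.py for the Twitter API of this era typically holds the OAuth credentials from your application's page on dev.twitter.com, along these lines (the variable names here are illustrative; use whatever config.py.example actually asks for):

# config.py -- paste in the values from your registered Twitter app
CONSUMER_KEY = "..."
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_TOKEN_SECRET = "..."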

Scrape Mode

The first time you fetch tweets for a query, passing the --scrape option makes twarc use search.twitter.com to discover tweet ids and then use the Twitter REST API to fetch the JSON for each tweet. This is expensive, because each id must be fetched from the API individually, and each fetch counts as a request against your quota.

Twitter Search now supports drilling backwards in time, past the week cutoff of the REST API. Since individual tweets are still retrieved through the REST API, rate limits apply, so this is quite a slow process. Still, if you are willing to let it run for a while, it can be useful for collecting older tweets, at least until the official search REST API supports a more historical perspective.
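A minimal sketch of that hydration step, using the third-party requests and requests-oauthlib packages rather than twarc's own HTTP code (the endpoint and the 429 rate limit response are standard v1.1 behavior; everything else is illustrative):

import time
import requests
from requests_oauthlib import OAuth1

# Fetch the full JSON for each discovered tweet id, one REST call
# per id, backing off when Twitter returns a rate limit response.
def hydrate(tweet_ids, auth):
    for tweet_id in tweet_ids:
        while True:
            resp = requests.get(
                "https://api.twitter.com/1.1/statuses/show.json",
                params={"id": tweet_id},
                auth=auth,
            )
            if resp.status_code == 429:  # rate limited, wait and retry
                time.sleep(60)
                continue
            break
        if resp.ok:
            yield resp.json()

# auth = OAuth1(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET)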

Utils

In the utils directory there are some simple command line utilities for working with the JSON dumps, such as printing the archived tweets as text or HTML, extracting the usernames or referenced URLs, and the like. If you create a script that is handy, please send me a pull request :-)
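For instance, a hypothetical username extractor in the same spirit as the scripts in utils (this sketch is not one of the shipped scripts):

import json
import sys

# Read a twarc dump on stdin, print each tweet's author.
for line in sys.stdin:
    tweet = json.loads(line)
    print(tweet["user"]["screen_name"])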

For example, let's say you want to create a wall of tweets that mention 'nasa':

% ./twarc.py nasa
% utils/wall.py nasa-20130306102105.json > nasa.html

If you want the tweets ordered from oldest to newest:

% tail -r nasa-20130306102105.json | utils/wall.py > nasa.html
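(The -r flag is BSD tail; on GNU systems, which lack it, tac nasa-20130306102105.json | utils/wall.py > nasa.html does the same thing.)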

Or say you want to create a word cloud from the tweets you collected about nasa:

% ./twarc.py nasa
% utils/wordcloud.py nasa-20130306102105.json > nasa-wordcloud.html

Or if you want to keep just the tweets that appear to have been written by women, and create a word cloud from them:

% ./twarc.py nasa
% utils/gender.py --gender female nasa-20130306102105.json | utils/wordcloud.py > nasa-female.html

Or if you want to create a D3 directed graph of mentions, retweets, or replies, in which nodes are users and arrows point from the original user to the user who mentions, retweets, or replies to them:

% ./twarc.py nasa
% utils/directed.py --mode mentions nasa-20130306102105.json > nasa-directed-mentions.html
% utils/directed.py --mode retweets nasa-20130306102105.json > nasa-directed-retweets.html
% utils/directed.py --mode replies nasa-20130306102105.json > nasa-directed-replies.html
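If you are curious what such a graph is built from, the edge list for mentions can be derived from the v1.1 entities structure roughly like this (a sketch; directed.py's actual implementation may differ):

import json
import sys

# Emit one edge per mention, pointing from the mentioned (original)
# user to the author who mentions them, matching the arrows above.
for line in sys.stdin:
    tweet = json.loads(line)
    author = tweet["user"]["screen_name"]
    for mention in tweet.get("entities", {}).get("user_mentions", []):
        print(mention["screen_name"], "->", author)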

Or if you want to output GeoJSON from tweets where geo coordinates are available:

% ./twarc.py nasa
% utils/geojson.py nasa-20130306102105.json > nasa-20130306102105.geojson
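A handy detail if you roll your own conversion: in the v1.1 payload the coordinates field is already a GeoJSON Point ([longitude, latitude]), while the deprecated geo field is [latitude, longitude]. A sketch of the conversion (geojson.py's real logic may differ):

import json
import sys

# Wrap every geotagged tweet's coordinates field, which is already
# a GeoJSON Point, in a Feature carrying the tweet text.
features = []
for line in sys.stdin:
    tweet = json.loads(line)
    if tweet.get("coordinates"):
        features.append({
            "type": "Feature",
            "geometry": tweet["coordinates"],
            "properties": {"text": tweet["text"]},
        })
print(json.dumps({"type": "FeatureCollection", "features": features}))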

License

  • CC0
