This lesson is in the early stages of development (Alpha version)

Harvesting Twitter Data with twarc


twarc is a command-line tool and Python library for harvesting and archiving Tweets through the Twitter API. It is free, accessible, and fairly easy to use once you get the hang of it. It is also an active open-source project, meaning anyone can use it and contribute to it, and it is well documented on GitHub by an organization called Documenting the Now.

Black Lives Matter and Documenting the Now

This lesson introduces twarc, a Python application created by Documenting the Now during the civil unrest in Ferguson, Missouri, following the shooting and killing of Michael Brown, Jr. by a police officer. Documenting the Now develops open-source tools and community-centered practices that support the ethical collection, use, and preservation of publicly available content shared on the web and social media.

This lesson was prepared for UCSB Carpentries

We Carpentry practice what we Carpentry preach. Feel free to jump in.

This workshop was first presented in June 2022, online and in person at UCSB.


This lesson assumes you have access to a working version of Python on a JupyterLab instance. It should also work for those with a standalone Jupyter installation. For our workshop, we used JupyterHub hosted by UCSB’s LSIT.
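If you are unsure whether your environment meets these assumptions, a quick check along the following lines may help. This is only a minimal sketch: the helper name `check_environment` is hypothetical and not part of the lesson materials, and it assumes twarc is installed from PyPI under the package name `twarc`.

```python
import importlib.util
import sys


def check_environment(packages=("twarc",)):
    """Return a dict mapping each top-level package name to True if it is importable.

    Hypothetical helper for illustration; it only looks the package up
    and does not import it or contact the Twitter API.
    """
    return {
        name: importlib.util.find_spec(name) is not None
        for name in packages
    }


if __name__ == "__main__":
    # Report the running Python version, then the status of each package.
    print("Python version:", sys.version.split()[0])
    for name, found in check_environment().items():
        status = "installed" if found else "missing (try: pip install twarc)"
        print(f"{name}: {status}")
```

Run in a JupyterLab notebook cell or a terminal, this reports the Python version and whether twarc can be found. It does not verify your Twitter API credentials.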


Setup: Download files required for the lesson

00:00  1. Introduction
  What is a tweet?
  What is an API and how can I get started?
  What is twarc?

00:20  2. Getting familiar with JupyterLab
  What is JupyterLab?
  How do I move around in JupyterLab?
  How can I set up twarc on JupyterLab?
  What’s a good way of running twarc in our Jupyter environment?

01:20  3. Anatomy of a tweet: structure of a tweet as JSONL
  What does raw Twitter data look like?
  What are some built-in ways of looking at Twitter JSONL data with Jupyter?
  Which pieces of a tweet should I pay attention to?

01:50  4. The Twitter Public API
  How exactly are we using the Twitter API?
  What endpoints are available to me?
  What are the limitations of our accounts?
  How can I standardize my workflow?

02:45  5. Ethics and Twitter
  Can I avoid seeing hate speech and unsettling imagery and still analyze Twitter?
  What are some privacy or other ethical issues that you need to keep in mind when harvesting tweets with twarc?
  How much personal information can we actually gather about a user given our twarc scrape?
  What are some use cases that might be inappropriate?

03:35  6. Search and Stream
  How can we specify what Tweets to collect?
  How can we collect Tweets as they are posted?

04:00  7. Plugins and Searches
  What other harvests are available?
  Are there more twarc2 plugins?
  How can I separate original content from retweets/replies?

04:30  8. TextBlob Sentiment Analysis
  How can I analyze my tweets beyond what twarc offers?
  Is it possible to tell if tweets are expressing positive or negative feelings?

05:20  9. Data Management
  What are some best practices for handling our data ‘for the long term’?

05:35  10. Don't Map Twitter
  What can I do with the geographic information in tweets?
  How does Twitter represent places?

05:55  Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.