This lesson is in the early stages of development (Alpha version)

Harvesting Twitter Data with twarc: Documentation

Key Points

Introduction
  • Twitter is a microblogging platform that allows data collection from its API.

  • Twarc is a Python application and library that allows users to programmatically collect and archive Tweets.

Getting familiar with JupyterLab
  • Navigating Python in a JupyterLab environment

  • Configuring an application to work with an API

  • Arranging a directory structure and loading libraries

Anatomy of a tweet: structure of a tweet as JSONL
  • Tweets arrive as JSON Lines (JSONL), a very common format.

  • We can use online viewers to get a human-readable look at JSONL.

  • Tweets come with a great deal of associated data.
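As a minimal sketch of the JSONL format, each line of the file is one complete JSON document. The sample below is a simplified, hypothetical stand-in for a real tweet payload, which carries many more fields (entities, metrics, user data, and so on):

```python
import json

# Each line of a JSONL file is a complete JSON document.
# These two lines are simplified stand-ins for real tweet payloads.
jsonl_data = """\
{"id": "1", "text": "Hello twarc!", "author_id": "100"}
{"id": "2", "text": "JSONL is one JSON object per line", "author_id": "101"}
"""

# Parse the archive line by line into a list of dictionaries.
tweets = [json.loads(line) for line in jsonl_data.splitlines()]
for tweet in tweets:
    print(tweet["id"], tweet["text"])
```

Because every line stands alone, JSONL archives can be processed one tweet at a time without loading the whole file into memory.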

The Twitter Public API
  • There are many online sources of Twitter data

  • Utilities and plugins come with twarc to help us out

  • A consistent ‘harvest > convert > examine’ pipeline will help us work with our data
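The ‘convert’ step of that pipeline can be sketched in plain Python: after harvesting tweets to JSONL (for example with twarc2), pull a few fields into CSV for examination. The field names below are a simplified assumption, not the full tweet schema:

```python
import csv
import io
import json

# Hypothetical harvested JSONL, reduced to three fields for illustration.
harvested = """\
{"id": "1", "text": "first tweet", "created_at": "2022-01-01T00:00:00Z"}
{"id": "2", "text": "second tweet", "created_at": "2022-01-02T00:00:00Z"}
"""

# Convert: write selected fields out as CSV, ready to examine in a
# spreadsheet or with pandas.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["id", "created_at", "text"])
writer.writeheader()
for line in harvested.splitlines():
    tweet = json.loads(line)
    writer.writerow({k: tweet[k] for k in ["id", "created_at", "text"]})

print(out.getvalue())
```

twarc also ships a csv plugin that does this conversion for you; the sketch above just shows what the step involves.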

Ethics and Twitter
Search and Stream
  • Search: collect pre-existing tweets that satisfy parameters

  • Stream: collect Tweets that satisfy parameters, as they are posted

Plugins and Searches
  • twarc2 has plug-ins that need separate installation

  • The network plug-in shows us how tweeters are related to each other

TextBlob Sentiment Analysis
  • TextBlob provides a range of sentiment-analysis functions (e.g. polarity and subjectivity scores) that we should learn

Data Management
  • Dehydrating your tweets (reducing them to a list of tweet IDs) makes datasets easier to share and re-use
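Dehydration reduces a JSONL archive to bare tweet IDs, which can be shared publicly; others can later rehydrate the IDs back into full tweets through the API. A minimal sketch of the idea, using hypothetical sample data (twarc itself provides a `twarc2 dehydrate` command for this):

```python
import json

# A tiny hypothetical tweet archive in JSONL form.
archive = """\
{"id": "1357", "text": "some tweet"}
{"id": "2468", "text": "another tweet"}
"""

# Dehydrate: keep only the tweet IDs, one per line.
ids = [json.loads(line)["id"] for line in archive.splitlines()]
print("\n".join(ids))
```

Sharing only IDs keeps datasets small and respects users' ability to delete their tweets, since deleted tweets simply fail to rehydrate.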

Don't Map Twitter
  • Determining where a tweet was posted is imprecise at best.

  • At best, it’s a proxy for ‘aboutness’

  • Proceed with Caution and Respect Humans’ Privacy

Documentation

Check out Documenting the Now’s extensive twarc documentation. The UCSB Library has also created a guide to using twarc with the v.1 API.

Glossary