# Twitter with twarc
A UCSB original Carpentry workshop

hashtag_gasprices.jsonl file was created as part of workshop prep. 
To run the workshop, run the code in the next cell.

After that, this version of the notebook will harvest and process all of the data for the workshop as it was written on 5/25/2021.

- This notebook WILL consume Twitter API quota
- It starts with the files that are in the setup instructions



In [1]:
# we made this file for you
# ! twarc2 search "#gasprices" > raw/hashtag_gasprices.jsonl




In [2]:
# administrivia
# upon re-start we need to install twarc2 and other extensions
! pip install twarc-csv
! pip install emoji



# Episode 2


In [3]:
# hashes are comments

print('hello world')

hello world


In [4]:
# BASH commands start with a BANG!
!twarc2 --help

Usage: twarc2 [OPTIONS] COMMAND [ARGS]...

  Collect data from the Twitter V2 API.

Options:
  --consumer-key TEXT         Twitter app consumer key (aka "App Key")
  --consumer-secret TEXT      Twitter app consumer secret (aka "App Secret")
  --access-token TEXT         Twitter app access token for user
                              authentication.
  --access-token-secret TEXT  Twitter app access token secret for user
                              authentication.
  --bearer-token TEXT         Twitter app access bearer token.
  --app-auth / --user-auth    Use application authentication or user
                              authentication. Some rate limits are higher with
                              user authentication, but not all endpoints are
                              supported.  [default: app-auth]
  -l, --log TEXT
  --verbose
  --metadata / --no-metadata  Include/don't include metadata about when and
                              how data was collected.  [default: metadata]
  

In [5]:
#  what libraries will we need to be loading in our notebook?
#  we need to always distinguish between 
#  running BASH vs. running a line of python.

import pandas
import twarc_csv
import textblob
import nltk
import os
import emoji



In [6]:
# this will come into play for ep 8
# !python -m textblob.download_corpora
# nltk.download('stopwords')

In [7]:
# and of course, it's important to know where we are working
# I can send a BASH command from my notebook with a !:
!pwd

/home/jovyan/twarc_run


In [8]:
# you can also do this with Python
os.getcwd()

'/home/jovyan/twarc_run'

In [9]:
# we can change if we need
# os.chdir(".....")

In [10]:
os.getcwd()

'/home/jovyan/twarc_run'

## Running twarc
Let's get the timeline of one of twarc's creators.

In [11]:
!twarc2 timeline BergisJules > raw/bjules.jsonl

API limit of 3200 reached:  18%|█▉         | 3146/17680 [00:38<02:56, 82.19it/s]


### Challenge 1
- Can you find the file called “bjules.jsonl”?
- How many tweets did you get from Bergis? (we can't tell without flattening or looking at the output)
- Download a timeline for a person of your choice. How many tweets did you get? 
- What’s the oldest one?
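The last two questions can also be answered from Python. This is a minimal sketch, not part of the workshop code: it assumes the timeline has already been flattened to one tweet per line with `twarc2 flatten`, and that each tweet carries the `id` and `created_at` fields that show up as CSV columns later in this notebook. The helper name `summarize_flat_file` and the demo tweets are hypothetical.

```python
import json
import os
import tempfile


def summarize_flat_file(path):
    """Return (tweet_count, oldest_tweet) for a flattened twarc2 JSONL file."""
    count = 0
    oldest = None
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)
            count += 1
            # created_at is an ISO 8601 UTC string in a fixed format, so
            # plain string comparison orders tweets chronologically
            if oldest is None or tweet["created_at"] < oldest["created_at"]:
                oldest = tweet
    return count, oldest


# demo on a tiny made-up file (hypothetical data, not real tweets)
demo_tweets = [
    {"id": "3", "created_at": "2022-05-25T19:52:18.000Z", "text": "newest"},
    {"id": "1", "created_at": "2021-01-14T17:34:48.000Z", "text": "oldest"},
    {"id": "2", "created_at": "2021-01-15T15:36:17.000Z", "text": "middle"},
]
demo_path = os.path.join(tempfile.mkdtemp(), "demo_flat.jsonl")
with open(demo_path, "w", encoding="utf-8") as handle:
    for tweet in demo_tweets:
        handle.write(json.dumps(tweet) + "\n")

demo_count, demo_oldest = summarize_flat_file(demo_path)
print(demo_count, demo_oldest["created_at"])
```

On a real harvest the path would be something like `output/ecodatasci_flat.jsonl`.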

In [12]:
!twarc2 timeline ecodatasci > raw/ecodatasci.jsonl
! twarc2 flatten raw/ecodatasci.jsonl > output/ecodatasci_flat.jsonl
! wc output/ecodatasci_flat.jsonl

100%|█████████████████████████████████████████| 473/473 [00:06<00:00, 72.10it/s]
    473  205817 2887298 output/ecodatasci_flat.jsonl


A straight harvest using search or stream doesn't need to be flattened 
to do our most basic analysis: wc. Do gas prices here?

## To flatten or not flatten

### Make your jsonl 1 tweet per line
Flattening will let you do our most basic unix-y analysis, turn
timelines into countable lists, and enable you to run twarc1
utilities later on in the workshop

In [13]:
# timeline objects need to be flattened in order to be analyzed as tweets
!twarc2 flatten raw/bjules.jsonl output/bjules_flat.jsonl

100%|██████████████| Processed 8.96M/8.96M of input file [00:01<00:00, 6.31MB/s]


## Convert to csv

In [14]:
!twarc2 csv raw/bjules.jsonl output/bjules.csv

100%|██████████████| Processed 8.96M/8.96M of input file [00:02<00:00, 3.63MB/s]

ℹ️
Parsed 3146 tweets objects from 33 lines in the input file.
Wrote 3146 rows and output 74 columns in the CSV.



## When we look at bjules, we really do need to flatten it.

In [15]:
! wc raw/bjules.jsonl

     33  846052 9399605 raw/bjules.jsonl


33 lines doesn't mean 33 tweets. I suspected there was more there because
I got an error message about hitting a limit of 3200. 

And below, the csv converter tells us there are 3146 tweets.

In [16]:
# convert
!twarc2 csv raw/bjules.jsonl output/bjules.csv

100%|██████████████| Processed 8.96M/8.96M of input file [00:05<00:00, 1.65MB/s]

ℹ️
Parsed 3146 tweets objects from 33 lines in the input file.
Wrote 3146 rows and output 74 columns in the CSV.



In [17]:
# once I flatten it, my wc will show the correct number
! wc output/bjules_flat.jsonl

    3146  1719030 23286097 output/bjules_flat.jsonl


In [18]:
# When I did this, I got 3146 tweets (as opposed to the 33 lines that the original file was)
! wc output/bjules_flat.jsonl
! wc output/bjules.csv

    3146  1719030 23286097 output/bjules_flat.jsonl
    3147   579107 11549036 output/bjules.csv


The csv is 1 line longer because it has column headers.
twarc2 csv takes flat or unflattened Twitter data files.
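The gap between 33 lines and 3146 tweets can also be checked from Python without flattening. The sketch below assumes that each line of an unflattened twarc2 file is one page of API response, with the tweets of that page stored as a list under the `data` key. The helper name `count_pages_and_tweets` and the demo pages are hypothetical.

```python
import json
import os
import tempfile


def count_pages_and_tweets(path):
    """Return (lines, tweets) for an unflattened twarc2 JSONL file."""
    pages = 0
    tweets = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            page = json.loads(line)
            pages += 1
            data = page.get("data", [])
            # "data" is assumed to be a list of tweets; a lone tweet
            # object is counted as a list of one
            tweets += len(data) if isinstance(data, list) else 1
    return pages, tweets


# demo: two made-up response pages holding 3 and 2 tweets
demo_pages = [
    {"data": [{"id": "1"}, {"id": "2"}, {"id": "3"}], "meta": {}},
    {"data": [{"id": "4"}, {"id": "5"}], "meta": {}},
]
demo_path = os.path.join(tempfile.mkdtemp(), "demo_raw.jsonl")
with open(demo_path, "w", encoding="utf-8") as handle:
    for page in demo_pages:
        handle.write(json.dumps(page) + "\n")

demo_result = count_pages_and_tweets(demo_path)
print(demo_result)
```

The first number corresponds to what `wc` reports for the raw file, the second to what it reports after flattening.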

### Challenge 2

In [19]:
# the first line is a solution to challenge 1
!twarc2 timeline ecodatasci > raw/ecodatasci.jsonl

!twarc2 flatten raw/ecodatasci.jsonl > output/ecodatasci_flat.jsonl
!twarc2 csv output/ecodatasci_flat.jsonl > output/ecodatasci.csv 
ecodatasci_df = pandas.read_csv("output/ecodatasci.csv")


100%|█████████████████████████████████████████| 473/473 [00:04<00:00, 97.04it/s]


# Episode 3: examining tweets
What comes along with a tweet
- Look at one_tweet in Jupyter viewer
- Look at one_tweet with nano
- Look at tweet as csv
- Look at all the entities of a tweet

In [20]:
! wc raw/hashtag_gasprices.jsonl

! twarc2 flatten raw/hashtag_gasprices.jsonl > output/hashtag_gasprices_flat.jsonl

! wc output/hashtag_gasprices_flat.jsonl

     108  3346644 36403969 raw/hashtag_gasprices.jsonl
   10787  5007559 67146087 output/hashtag_gasprices_flat.jsonl


In [21]:
### Let's look at a single tweet as a csv:
!twarc2 flatten raw/one_tweet.jsonl output/one_tweet_flat.jsonl
!twarc2 csv output/one_tweet_flat.jsonl output/one_tweet.csv




100%|██████████████| Processed 4.63k/4.63k of input file [00:00<00:00, 14.0MB/s]
100%|██████████████| Processed 7.09k/7.09k of input file [00:00<00:00, 92.0kB/s]

ℹ️
Parsed 1 tweets objects from 1 lines in the input file.
Wrote 1 rows and output 74 columns in the CSV.



## first and last tweets

In [22]:
! cat output/4_tweets.jsonl

cat: output/4_tweets.jsonl: No such file or directory


In [23]:
!head -n 2 'output/bjules_flat.jsonl' > 'output/4_more_tweets.jsonl'
!tail -n 2 'output/bjules_flat.jsonl' >> 'output/4_more_tweets.jsonl'

Can we go back further on his timeline by looking
only for Bergis's original content?

Not really--it looks like the limit applies to the search,
not the results. 


But this does tell us that Jules is a prolific re-tweeter and/or replier. 
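One way to measure how much of a harvested timeline is original content is to filter the CSV in pandas. This is a sketch under assumptions: it treats a tweet as original when both `referenced_tweets.retweeted.id` and `referenced_tweets.replied_to.id` are empty (quote tweets therefore count as original), and it runs on a small made-up dataframe rather than on `output/bjules.csv`. Because the filter is applied after the harvest, it does not get around the 3200-tweet limit.

```python
import pandas

# made-up rows that reuse the column names written by twarc2 csv
demo_df = pandas.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "text": [
            "my own words",
            "RT @someone: hi",
            "@someone thanks",
            "another original",
        ],
        "referenced_tweets.retweeted.id": [None, 111.0, None, None],
        "referenced_tweets.replied_to.id": [None, None, 222.0, None],
    }
)

# a filled-in reference id marks a retweet or a reply
is_retweet = demo_df["referenced_tweets.retweeted.id"].notna()
is_reply = demo_df["referenced_tweets.replied_to.id"].notna()

original_df = demo_df[~is_retweet & ~is_reply]
print(len(original_df), "of", len(demo_df), "tweets are original")
```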

In [24]:
! wc output/bjules_original_flat.jsonl
! wc output/bjules_flat.jsonl

wc: output/bjules_original_flat.jsonl: No such file or directory
    3146  1719030 23286097 output/bjules_flat.jsonl


In [25]:
# save it as a csv so we can easily see the original writings of Jules
!twarc2 csv output/bjules_original_flat.jsonl output/bjules_original.csv

Usage: twarc2 csv [OPTIONS] [INFILE] [OUTFILE]
Try 'twarc2 csv --help' for help.

Error: Invalid value for '[INFILE]': 'output/bjules_original_flat.jsonl': No such file or directory


In [26]:
!head -n 20 'output/hashtag_gasprices_flat.jsonl' > 'output/20_tweets.jsonl'
!tail -n 20 'output/hashtag_gasprices_flat.jsonl' >> 'output/20_tweets.jsonl'

# Episode 4

In [27]:
# fishing around for good searches
# you can count without harvesting.
# kittens is an evergreen search. you should always see at least
# dozens of mentions per hour
!twarc2 counts --text "ucsb"

2022-05-18T22:28:15.000Z - 2022-05-18T23:00:00.000Z: 14
2022-05-18T23:00:00.000Z - 2022-05-19T00:00:00.000Z: 14
2022-05-19T00:00:00.000Z - 2022-05-19T01:00:00.000Z: 13
2022-05-19T01:00:00.000Z - 2022-05-19T02:00:00.000Z: 7
2022-05-19T02:00:00.000Z - 2022-05-19T03:00:00.000Z: 8
2022-05-19T03:00:00.000Z - 2022-05-19T04:00:00.000Z: 11
2022-05-19T04:00:00.000Z - 2022-05-19T05:00:00.000Z: 12
2022-05-19T05:00:00.000Z - 2022-05-19T06:00:00.000Z: 9
2022-05-19T06:00:00.000Z - 2022-05-19T07:00:00.000Z: 6
2022-05-19T07:00:00.000Z - 2022-05-19T08:00:00.000Z: 2
2022-05-19T08:00:00.000Z - 2022-05-19T09:00:00.000Z: 5
2022-05-19T09:00:00.000Z - 2022-05-19T10:00:00.000Z: 4
2022-05-19T10:00:00.000Z - 2022-05-19T11:00:00.000Z: 5
2022-05-19T11:00:00.000Z - 2022-05-19T12:00:00.000Z: 5
2022-05-19T12:00:00.000Z - 2022-05-19T13:00:00.000Z: 2
2022-05-19T13:00:00.000Z - 2022-05-19T14:00:00.000Z: 11
2022-05-19T14:00:00.000Z - 2022-05-19T15:00:00.000Z: 6
2022-05-19T15:00:00.000Z - 2022-05-19T16:00:00.000Z: 13
202

In [28]:
# granularity makes the output shorter
# the twitter api is not case sensitive

!twarc2 counts --granularity "day" --text "(UCSB)"
!twarc2 counts --granularity "day" --text "(ucsb)"
!twarc2 counts --granularity "day" --text "(ucsb OR UCSB)"

2022-05-18T22:28:16.000Z - 2022-05-19T00:00:00.000Z: 28
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 232
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 342
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 384
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 352
2022-05-23T00:00:00.000Z - 2022-05-24T00:00:00.000Z: 306
2022-05-24T00:00:00.000Z - 2022-05-25T00:00:00.000Z: 307
2022-05-25T00:00:00.000Z - 2022-05-25T22:28:16.000Z: 240
Total Tweets: 2,191
2022-05-18T22:28:18.000Z - 2022-05-19T00:00:00.000Z: 28
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 232
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 342
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 384
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 352
2022-05-23T00:00:00.000Z - 2022-05-24T00:00:00.000Z: 306
2022-05-24T00:00:00.000Z - 2022-05-25T00:00:00.000Z: 307
2022-05-25T00:00:00.000Z - 2022-05-25T22:28:18.000Z: 240
Total Tweets: 2,191
2022-05-18T22:28:20.000Z - 2

In [29]:
## searching for a plain string also matches the hashtag, but NOT vice versa
!twarc2 counts --text "(Poker OR poker)" --granularity "day"
!twarc2 counts --text "(Poker OR #poker)" --granularity "day"



2022-05-18T22:28:22.000Z - 2022-05-19T00:00:00.000Z: 1,164
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 18,920
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 16,036
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 17,466
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 12,625
2022-05-23T00:00:00.000Z - 2022-05-24T00:00:00.000Z: 12,384
2022-05-24T00:00:00.000Z - 2022-05-25T00:00:00.000Z: 18,578
2022-05-25T00:00:00.000Z - 2022-05-25T22:28:22.000Z: 14,272
Total Tweets: 111,445
2022-05-18T22:28:23.000Z - 2022-05-19T00:00:00.000Z: 1,164
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 18,918
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 16,037
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 17,465
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 12,625
2022-05-23T00:00:00.000Z - 2022-05-24T00:00:00.000Z: 12,381
2022-05-24T00:00:00.000Z - 2022-05-25T00:00:00.000Z: 18,579
2022-05-25T00:00:00.000Z - 2022-05-25T22:28:23.000Z: 14,271
Tot

In [30]:
!twarc2 counts --text "(Golf)" --granularity "day"
!twarc2 counts --text "(Basketball)" --granularity "day"
!twarc2 counts --text "(Baseball)" --granularity "day"
!twarc2 counts --text "(Football OR futbol)" --granularity "day"

2022-05-18T22:28:25.000Z - 2022-05-19T00:00:00.000Z: 3,288
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 55,893
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 48,813
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 43,010
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 46,567
2022-05-23T00:00:00.000Z - 2022-05-24T00:00:00.000Z: 45,246
2022-05-24T00:00:00.000Z - 2022-05-25T00:00:00.000Z: 82,755
2022-05-25T00:00:00.000Z - 2022-05-25T22:28:25.000Z: 51,695
Total Tweets: 377,267
2022-05-18T22:28:27.000Z - 2022-05-19T00:00:00.000Z: 4,295
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 54,671
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 63,133
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 55,612
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 76,655
2022-05-23T00:00:00.000Z - 2022-05-24T00:00:00.000Z: 64,778
2022-05-24T00:00:00.000Z - 2022-05-25T00:00:00.000Z: 57,482
2022-05-25T00:00:00.000Z - 2022-05-25T22:28:27.000Z: 60,180
Tot

In [31]:
## What's a lot?
!twarc2 counts --text "dog" --granularity "day"
!twarc2 counts --text "cat" --granularity "day"
!twarc2 counts --text "amazon" --granularity "day"
!twarc2 counts --text "right" --granularity "day"
!twarc2 counts --text "good" --granularity "day"


2022-05-18T22:28:32.000Z - 2022-05-19T00:00:00.000Z: 15,780
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 226,051
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 216,973
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 220,105
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 226,958
2022-05-23T00:00:00.000Z - 2022-05-24T00:00:00.000Z: 210,692
2022-05-24T00:00:00.000Z - 2022-05-25T00:00:00.000Z: 219,148
2022-05-25T00:00:00.000Z - 2022-05-25T22:28:32.000Z: 189,959
Total Tweets: 1,525,666
2022-05-18T22:28:34.000Z - 2022-05-19T00:00:00.000Z: 20,640
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 355,159
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 314,793
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 287,581
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 299,186
2022-05-23T00:00:00.000Z - 2022-05-24T00:00:00.000Z: 305,661
2022-05-24T00:00:00.000Z - 2022-05-25T00:00:00.000Z: 330,011
2022-05-25T00:00:00.000Z - 2022-05-25T22:28:34.000Z:

In [32]:
# search for hashtags when you really want hashtags.
# searching for a string returns both text and hashtag matches (an OR)
# NOT case sensitive
!twarc2 counts --granularity "day" --text "(#UCSB OR UCSB OR ucsb)"
!twarc2 counts --granularity "day" --text "(#ucsb)"
!twarc2 counts --granularity "day" --text "(UCSB)"

2022-05-18T22:28:43.000Z - 2022-05-19T00:00:00.000Z: 28
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 232
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 342
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 384
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 352
2022-05-23T00:00:00.000Z - 2022-05-24T00:00:00.000Z: 306
2022-05-24T00:00:00.000Z - 2022-05-25T00:00:00.000Z: 307
2022-05-25T00:00:00.000Z - 2022-05-25T22:28:43.000Z: 240
Total Tweets: 2,191
2022-05-18T22:28:45.000Z - 2022-05-19T00:00:00.000Z: 1
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 11
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 14
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 6
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 3
2022-05-23T00:00:00.000Z - 2022-05-24T00:00:00.000Z: 14
2022-05-24T00:00:00.000Z - 2022-05-25T00:00:00.000Z: 16
2022-05-25T00:00:00.000Z - 2022-05-25T22:28:45.000Z: 5
Total Tweets: 70
2022-05-18T22:28:46.000Z - 2022-05-19T00:0
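To compare queries without reading every row, the per-period lines that `twarc2 counts` prints can be totalled in Python. The sketch below assumes the text layout shown in the outputs above (`start - end: count`, with commas as thousands separators). The helper name `total_from_counts` is hypothetical, and the sample is copied from the `(UCSB)` run.

```python
import re

# one period per line: "<start> - <end>: <count>"
COUNT_LINE = re.compile(r"^(\S+) - (\S+): ([\d,]+)$")


def total_from_counts(output_text):
    """Sum the per-period counts in the text printed by twarc2 counts."""
    total = 0
    for line in output_text.splitlines():
        match = COUNT_LINE.match(line.strip())
        if match:  # lines such as "Total Tweets: ..." are skipped
            total += int(match.group(3).replace(",", ""))
    return total


sample = """\
2022-05-18T22:28:16.000Z - 2022-05-19T00:00:00.000Z: 28
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 232
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 342
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 384
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 352
2022-05-23T00:00:00.000Z - 2022-05-24T00:00:00.000Z: 306
2022-05-24T00:00:00.000Z - 2022-05-25T00:00:00.000Z: 307
2022-05-25T00:00:00.000Z - 2022-05-25T22:28:16.000Z: 240
Total Tweets: 2,191
"""

sample_total = total_from_counts(sample)
print(sample_total)
```

The sum agrees with the `Total Tweets: 2,191` line that twarc2 prints, which is a quick check that the parsing matches the format.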

## Pipeline

In [33]:
## a SFW timeline
!twarc2 timeline --limit 500 ucsblibrary raw/ucsblib_timeline.jsonl

!twarc2 flatten raw/ucsblib_timeline.jsonl output/ucsblib_timeline_flat.jsonl
!twarc2 csv output/ucsblib_timeline_flat.jsonl output/ucsblib_timeline_flat.csv
library_timeline_df = pandas.read_csv("output/ucsblib_timeline_flat.csv")

Set --limit of 500 reached:  15%|█▊          | 500/3280 [00:05<00:28, 96.25it/s]
100%|██████████████| Processed 1.19M/1.19M of input file [00:00<00:00, 6.58MB/s]
100%|██████████████| Processed 2.57M/2.57M of input file [00:00<00:00, 5.45MB/s]

ℹ️
Parsed 500 tweets objects from 500 lines in the input file.
Wrote 500 rows and output 74 columns in the CSV.



In [34]:
ucsblib_timeline_df = pandas.read_csv("output/ucsblib_timeline_flat.csv")

In [35]:
# confirm the dataframe's existence
len(ucsblib_timeline_df)

500

In [36]:
# and view all column headers
list(ucsblib_timeline_df.columns)

['id',
 'conversation_id',
 'referenced_tweets.replied_to.id',
 'referenced_tweets.retweeted.id',
 'referenced_tweets.quoted.id',
 'author_id',
 'in_reply_to_user_id',
 'retweeted_user_id',
 'quoted_user_id',
 'created_at',
 'text',
 'lang',
 'source',
 'public_metrics.like_count',
 'public_metrics.quote_count',
 'public_metrics.reply_count',
 'public_metrics.retweet_count',
 'reply_settings',
 'possibly_sensitive',
 'withheld.scope',
 'withheld.copyright',
 'withheld.country_codes',
 'entities.annotations',
 'entities.cashtags',
 'entities.hashtags',
 'entities.mentions',
 'entities.urls',
 'context_annotations',
 'attachments.media',
 'attachments.media_keys',
 'attachments.poll.duration_minutes',
 'attachments.poll.end_datetime',
 'attachments.poll.id',
 'attachments.poll.options',
 'attachments.poll.voting_status',
 'attachments.poll_ids',
 'author.id',
 'author.created_at',
 'author.username',
 'author.name',
 'author.description',
 'author.entities.description.cashtags',
 'author

In [37]:
ucsblib_timeline_df.head()

Unnamed: 0,id,conversation_id,referenced_tweets.replied_to.id,referenced_tweets.retweeted.id,referenced_tweets.quoted.id,author_id,in_reply_to_user_id,retweeted_user_id,quoted_user_id,created_at,...,geo.geo.bbox,geo.geo.type,geo.id,geo.name,geo.place_id,geo.place_type,__twarc.retrieved_at,__twarc.url,__twarc.version,Unnamed: 73
0,1529550949823356928,1529550949823356928,,,,101367986,,,,2022-05-25T19:52:18.000Z,...,,,,,,,2022-05-25T22:28:59+00:00,https://api.twitter.com/2/users/101367986/twee...,2.10.4,
1,1528843296411471876,1528843296411471876,,1.52765e+18,,101367986,,483231999.0,,2022-05-23T21:00:21.000Z,...,,,,,,,2022-05-25T22:28:59+00:00,https://api.twitter.com/2/users/101367986/twee...,2.10.4,
2,1527348937379631108,1527348937379631108,,,,101367986,,,,2022-05-19T18:02:18.000Z,...,,,,,,,2022-05-25T22:28:59+00:00,https://api.twitter.com/2/users/101367986/twee...,2.10.4,
3,1527319079727468544,1527319079727468544,,,,101367986,,,,2022-05-19T16:03:39.000Z,...,,,,,,,2022-05-25T22:28:59+00:00,https://api.twitter.com/2/users/101367986/twee...,2.10.4,
4,1526956500354314241,1526956500354314241,,,,101367986,,,,2022-05-18T16:02:53.000Z,...,,,,,,,2022-05-25T22:28:59+00:00,https://api.twitter.com/2/users/101367986/twee...,2.10.4,


In [38]:
ucsblib_timeline_df.tail()

Unnamed: 0,id,conversation_id,referenced_tweets.replied_to.id,referenced_tweets.retweeted.id,referenced_tweets.quoted.id,author_id,in_reply_to_user_id,retweeted_user_id,quoted_user_id,created_at,...,geo.geo.bbox,geo.geo.type,geo.id,geo.name,geo.place_id,geo.place_type,__twarc.retrieved_at,__twarc.url,__twarc.version,Unnamed: 73
495,1350247410786656257,1350247410786656257,,,1.350207e+18,101367986,,,4085696000.0,2021-01-16T01:04:03.000Z,...,,,,,,,2022-05-25T22:29:03+00:00,https://api.twitter.com/2/users/101367986/twee...,2.10.4,
496,1350139793909301248,1350139793909301248,,1.350138e+18,,101367986,,2542162000.0,,2021-01-15T17:56:25.000Z,...,,,,,,,2022-05-25T22:29:03+00:00,https://api.twitter.com/2/users/101367986/twee...,2.10.4,
497,1350104526250942468,1350104526250942468,,,,101367986,,,,2021-01-15T15:36:17.000Z,...,,,,,,,2022-05-25T22:29:03+00:00,https://api.twitter.com/2/users/101367986/twee...,2.10.4,
498,1349772447541653505,1349217632264744961,1.349763e+18,,,101367986,1.285968e+18,,,2021-01-14T17:36:43.000Z,...,,,,,,,2022-05-25T22:29:03+00:00,https://api.twitter.com/2/users/101367986/twee...,2.10.4,
499,1349771964525674500,1349502619018907650,1.349503e+18,,,101367986,130271800.0,,,2021-01-14T17:34:48.000Z,...,,,,,,,2022-05-25T22:29:03+00:00,https://api.twitter.com/2/users/101367986/twee...,2.10.4,


In [39]:
#grab only specified column
ucsblib_timeline_df['public_metrics.retweet_count']

0      0
1      2
2      0
3      0
4      0
      ..
495    0
496    1
497    2
498    0
499    0
Name: public_metrics.retweet_count, Length: 500, dtype: int64

In [40]:
sort_by_rt = ucsblib_timeline_df.sort_values('public_metrics.retweet_count', ascending=False)

#the first tweet from the sorted dataframe
most_rt = sort_by_rt.head(1)

#output the text of the most retweeted tweet
most_rt['text']

167    For the first time in history, sound recording...
Name: text, dtype: object

In [41]:
sort_by_rt['public_metrics.retweet_count'].head(20)

167    163
287    114
440     84
284     65
290     54
174     29
200     26
451     24
314     23
437     22
450     21
442     21
201     18
417     18
408     17
134     16
316     15
14      15
383     14
454     13
Name: public_metrics.retweet_count, dtype: int64

## final challenge: Cats of Instagram
Let’s make a bigger datafile. Harvest 5000 tweets that use the hashtag “catsofinstagram” and put the dataset through the pipeline to answer the following questions:

- Did you get exactly 5000?
- How far back in time did you get?
- What is the most re-tweeted recent tweet on #catsofinstagram?
- Which person has the most followers in your dataset?
- Is it really a person?
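The questions above map onto a few pandas calls. This is a sketch on a made-up dataframe, not the harvested data. The column names `created_at`, `public_metrics.retweet_count` and `author.username` appear in the column listings in this notebook; `author.public_metrics.followers_count` is an assumption, since the listings are cut off before the author metrics, so check `list(hashtagcats_df.columns)` before relying on it.

```python
import pandas

# made-up rows standing in for output/hashtagcats.csv
demo_df = pandas.DataFrame(
    {
        "created_at": [
            "2022-05-25T22:28:51.000Z",
            "2022-05-25T16:09:20.000Z",
            "2022-05-25T18:00:00.000Z",
        ],
        "text": ["cat one", "cat two", "cat three"],
        "public_metrics.retweet_count": [3, 41, 7],
        "author.username": ["cat_fan", "cat_shop", "cat_fan"],
        # assumed column name, see the note above
        "author.public_metrics.followers_count": [120, 98000, 120],
    }
)

# Did you get exactly the number you asked for?
n_rows = len(demo_df)

# How far back in time did you get?
oldest = pandas.to_datetime(demo_df["created_at"]).min()

# The most retweeted tweet: idxmax gives the row label of the largest value
top_rt = demo_df.loc[demo_df["public_metrics.retweet_count"].idxmax()]

# The author with the most followers
top_author = demo_df.loc[
    demo_df["author.public_metrics.followers_count"].idxmax()
]

print(n_rows, oldest)
print(top_rt["text"], top_author["author.username"])
```

Whether the top account is really a person still needs a human look at its profile.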

In [42]:
!twarc2 search --limit 500 "#catsofinstagram" raw/hashtagcats.jsonl
!twarc2 flatten raw/hashtagcats.jsonl output/hashtagcats_flat.jsonl
!twarc2 csv raw/hashtagcats.jsonl > output/hashtagcats.csv
hashtagcats_df = pandas.read_csv("output/hashtagcats.csv")
! wc output/hashtagcats.csv
hashtagcats_df["created_at"].head()

Set --limit of 500 reached:   4%| | Processed 6 hours/6 days [00:05<02:18, 500 t
100%|██████████████| Processed 1.54M/1.54M of input file [00:00<00:00, 9.71MB/s]
    501  106492 1895553 output/hashtagcats.csv


0    2022-05-25T22:28:51.000Z
1    2022-05-25T22:28:31.000Z
2    2022-05-25T22:28:15.000Z
3    2022-05-25T22:28:14.000Z
4    2022-05-25T22:27:50.000Z
Name: created_at, dtype: object

In [43]:
hashtagcats_df["created_at"].tail()

495    2022-05-25T16:10:21.000Z
496    2022-05-25T16:09:51.000Z
497    2022-05-25T16:09:45.000Z
498    2022-05-25T16:09:23.000Z
499    2022-05-25T16:09:20.000Z
Name: created_at, dtype: object

In [44]:
hashtagcats_df

Unnamed: 0,id,conversation_id,referenced_tweets.replied_to.id,referenced_tweets.retweeted.id,referenced_tweets.quoted.id,author_id,in_reply_to_user_id,retweeted_user_id,quoted_user_id,created_at,...,geo.geo.bbox,geo.geo.type,geo.id,geo.name,geo.place_id,geo.place_type,__twarc.retrieved_at,__twarc.url,__twarc.version,Unnamed: 73
0,1529590345360711685,1529590345360711685,,1.529357e+18,,1230882149022171136,,9.144608e+17,,2022-05-25T22:28:51.000Z,...,,,,,,,2022-05-25T22:29:11+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
1,1529590262221221893,1529590262221221893,,1.529559e+18,,1498447340646113286,,1.369667e+18,,2022-05-25T22:28:31.000Z,...,,,,,,,2022-05-25T22:29:11+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
2,1529590194688647169,1529590194688647169,,1.529576e+18,,1236404015161843712,,3.541330e+08,,2022-05-25T22:28:15.000Z,...,,,,,,,2022-05-25T22:29:11+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
3,1529590190909497346,1529590190909497346,,1.529534e+18,,1104712327104847872,,3.541330e+08,,2022-05-25T22:28:14.000Z,...,,,,,,,2022-05-25T22:29:11+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
4,1529590088455401472,1529590088455401472,,1.529534e+18,,1073307423379767296,,3.541330e+08,,2022-05-25T22:27:50.000Z,...,,,,,,,2022-05-25T22:29:11+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,1529495094801612800,1529495094801612800,,1.529489e+18,,1368189367471255555,,3.541330e+08,,2022-05-25T16:10:21.000Z,...,,,,,,,2022-05-25T22:29:15+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
496,1529494968921989120,1529494968921989120,,1.529402e+18,,1035329018428567552,,7.724110e+17,,2022-05-25T16:09:51.000Z,...,,,,,,,2022-05-25T22:29:15+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
497,1529494943919751168,1529494943919751168,,1.529489e+18,,1497575266985644034,,3.541330e+08,,2022-05-25T16:09:45.000Z,...,,,,,,,2022-05-25T22:29:15+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
498,1529494850671767553,1529494850671767553,,1.529489e+18,,599571728,,3.541330e+08,,2022-05-25T16:09:23.000Z,...,,,,,,,2022-05-25T22:29:15+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,


In [45]:
list(hashtagcats_df.columns)

['id',
 'conversation_id',
 'referenced_tweets.replied_to.id',
 'referenced_tweets.retweeted.id',
 'referenced_tweets.quoted.id',
 'author_id',
 'in_reply_to_user_id',
 'retweeted_user_id',
 'quoted_user_id',
 'created_at',
 'text',
 'lang',
 'source',
 'public_metrics.like_count',
 'public_metrics.quote_count',
 'public_metrics.reply_count',
 'public_metrics.retweet_count',
 'reply_settings',
 'possibly_sensitive',
 'withheld.scope',
 'withheld.copyright',
 'withheld.country_codes',
 'entities.annotations',
 'entities.cashtags',
 'entities.hashtags',
 'entities.mentions',
 'entities.urls',
 'context_annotations',
 'attachments.media',
 'attachments.media_keys',
 'attachments.poll.duration_minutes',
 'attachments.poll.end_datetime',
 'attachments.poll.id',
 'attachments.poll.options',
 'attachments.poll.voting_status',
 'attachments.poll_ids',
 'author.id',
 'author.created_at',
 'author.username',
 'author.name',
 'author.description',
 'author.entities.description.cashtags',
 'author

In [46]:
# what dataframes do we have at this point?
%whos DataFrame

Variable              Type         Data/Info
--------------------------------------------
ecodatasci_df         DataFrame                          id <...>\n[473 rows x 74 columns]
hashtagcats_df        DataFrame                          id <...>\n[500 rows x 74 columns]
library_timeline_df   DataFrame                          id <...>\n[500 rows x 74 columns]
most_rt               DataFrame                          id <...>\n\n[1 rows x 74 columns]
sort_by_rt            DataFrame                          id <...>\n[500 rows x 74 columns]
ucsblib_timeline_df   DataFrame                          id <...>\n[500 rows x 74 columns]


# Episode 5: Ethics & Twitter

In [47]:
# what dataframes do we have?
# this version is more succinct, but not nicely formatted
%who DataFrame


ecodatasci_df	 hashtagcats_df	 library_timeline_df	 most_rt	 sort_by_rt	 ucsblib_timeline_df	 


In [48]:
# our first full-text analysis
# a list of words with TextBlob

# first we need to munge the data. remember from:
# list(library_timeline_df.columns)
# the tweet is library_timeline_df['text']

# TextBlob has its own data format.

# break the tweets' text column into a list,
# then .join into one long string
library_string = ' '.join(library_timeline_df['text'].tolist())
# turn the string into a blob
library_blob = textblob.TextBlob(library_string)


In [49]:
# This produces a mess if we output it. 
# (library_blob)

In [50]:
# Let's count the words and sort by their frequency of use:
library_freq = library_blob.word_counts
library_sorted_freq = sorted(library_freq.items(), 
                             key = lambda kv: kv[1], 
                             reverse = True)
library_sorted_freq[1:25]

[('https', 603),
 ('to', 389),
 ('of', 349),
 ('and', 338),
 ('for', 249),
 ('a', 249),
 ('in', 190),
 ('ucsb', 183),
 ('our', 166),
 ('library', 162),
 ('s', 157),
 ('at', 132),
 ('you', 126),
 ('on', 125),
 ('from', 115),
 ('more', 111),
 ('this', 109),
 ('we', 105),
 ('is', 101),
 ('with', 101),
 ('by', 101),
 ('here', 95),
 ('amp', 94),
 ('are', 78)]

We can remove the English stopwords.

In [51]:
# load the stopwords to use
from nltk.corpus import stopwords
sw_nltk = stopwords.words('english')

In [52]:
# create a new text list that does
# NOT contain stopwords
library_str_stopped = [word for word in library_string.split() 
                       if word.lower() not in sw_nltk]
library_words_stopped = " ".join(library_str_stopped)

The output with the stopwords removed is a little better, but there's still some cruft that could be removed:

In [53]:
library_blob_stopped = textblob.TextBlob(library_words_stopped)
library_blob_stopped_freq = library_blob_stopped.word_counts
library_blob_stopped_sorted_freq = sorted(library_blob_stopped_freq.items(), 
                             key = lambda kv: kv[1], 
                             reverse = True)
library_blob_stopped_sorted_freq[1:50]

[('ucsb', 183),
 ('library', 162),
 ('s', 146),
 ('amp', 94),
 ('here', 86),
 ('’', 65),
 ('ucsblibrary', 63),
 ('research', 60),
 ('us', 57),
 ('uc', 56),
 ('today', 55),
 ('join', 55),
 ('open', 50),
 ('book', 50),
 ('new', 48),
 ('reads', 46),
 ('learn', 44),
 ('more', 43),
 ('week', 43),
 ('we', 40),
 ('access', 39),
 ('students', 38),
 ('nhttps', 34),
 ('tomorrow', 34),
 ('study', 34),
 ('community', 33),
 ('2022', 33),
 ('event', 33),
 ('campus', 33),
 ('info', 32),
 ('santa', 32),
 ('exhibit', 32),
 ('recordings', 31),
 ('barbara', 31),
 ('re', 31),
 ('art', 30),
 ('check', 29),
 ('collections', 27),
 ('resources', 27),
 ('day', 27),
 ('talk', 26),
 ('ted', 26),
 ('exhalation', 26),
 ('collection', 26),
 ('black', 26),
 ('chiang', 25),
 ('discussion', 25),
 ('register', 25),
 ('online', 25)]

In [54]:
# a more meaningful segment
library_blob_stopped_sorted_freq[7:57]

[('ucsblibrary', 63),
 ('research', 60),
 ('us', 57),
 ('uc', 56),
 ('today', 55),
 ('join', 55),
 ('open', 50),
 ('book', 50),
 ('new', 48),
 ('reads', 46),
 ('learn', 44),
 ('more', 43),
 ('week', 43),
 ('we', 40),
 ('access', 39),
 ('students', 38),
 ('nhttps', 34),
 ('tomorrow', 34),
 ('study', 34),
 ('community', 33),
 ('2022', 33),
 ('event', 33),
 ('campus', 33),
 ('info', 32),
 ('santa', 32),
 ('exhibit', 32),
 ('recordings', 31),
 ('barbara', 31),
 ('re', 31),
 ('art', 30),
 ('check', 29),
 ('collections', 27),
 ('resources', 27),
 ('day', 27),
 ('talk', 26),
 ('ted', 26),
 ('exhalation', 26),
 ('collection', 26),
 ('black', 26),
 ('chiang', 25),
 ('discussion', 25),
 ('register', 25),
 ('online', 25),
 ('one', 25),
 ('history', 24),
 ('pick', 24),
 ('work', 23),
 ('free', 23),
 ('librarian', 22),
 ('year', 22)]

In [55]:
# Challenge: for the Python wizzes. #FIXME
# do that in a tidy way?
# what do pandas pipes look like?
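One possible answer to this challenge, as a sketch rather than the workshop's solution: a single pandas method chain that lowercases the tweets, strips URLs and HTML entities (the source of cruft such as `https` and `amp` in the counts above), splits the text into tokens, drops stopwords and counts what is left. The helper name `word_frequencies`, the tiny `DEMO_STOPWORDS` set and the demo tweets are all hypothetical. In the notebook, the `sw_nltk` list and `library_timeline_df['text']` could be passed in instead.

```python
import pandas

# tiny illustrative stopword set (an assumption, not the NLTK list)
DEMO_STOPWORDS = {"the", "to", "of", "and", "a", "in", "for", "our", "at"}


def word_frequencies(text_column, stopwords):
    """Return word counts for a Series of tweet texts, most frequent first."""
    words = (
        text_column
        .str.lower()
        .str.replace(r"https?://\S+", " ", regex=True)  # drop URLs
        .str.replace(r"&\w+;", " ", regex=True)         # drop entities like &amp;
        .str.findall(r"[a-z0-9#@']+")                   # word-like tokens
        .explode()                                      # one token per row
        .dropna()                                       # tweets with no tokens
    )
    return words[~words.isin(list(stopwords))].value_counts()


demo_tweets = pandas.Series(
    [
        "Join us at the UCSB Library today! https://t.co/abc",
        "Research help &amp; study space in the library",
        "Library hours for finals week",
    ]
)

# Series.pipe passes the Series as the first argument of the function
freq = demo_tweets.pipe(word_frequencies, stopwords=DEMO_STOPWORDS)
print(freq.head())
```

Keeping the cleaning steps in one function means the same pipe can be reused on any dataframe's `text` column.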

# Just the words, hold the gore

## Challenge: Insta-rrectionists

In [63]:
# how long is this?
riots_dehydrated_df = pandas.read_csv("raw/dehydratedCapitolRiotTweets.txt")
len(riots_dehydrated_df)

82308

### Warning
Make sure you remove the first line (the header) of the csv before you attempt to hydrate, so that every remaining line is a tweet ID.
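Removing the header can be done from Python rather than by hand. This is a minimal sketch: it assumes the file holds one tweet ID per line under a single header row, and it writes a new file instead of editing the original. The helper name `strip_header`, the file names and the demo IDs are hypothetical.

```python
import os
import tempfile


def strip_header(src_path, dest_path):
    """Copy src_path to dest_path, dropping a non-numeric first line."""
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dest_path, "w", encoding="utf-8") as dest:
        for line_number, line in enumerate(src):
            value = line.strip()
            if line_number == 0 and not value.isdigit():
                continue  # header row, not a tweet ID
            if value:
                dest.write(value + "\n")
                kept += 1
    return kept


# demo with made-up IDs (not real tweets)
work_dir = tempfile.mkdtemp()
demo_src = os.path.join(work_dir, "dehydrated_demo.txt")
demo_dest = os.path.join(work_dir, "dehydrated_demo_ids.txt")
with open(demo_src, "w", encoding="utf-8") as handle:
    handle.write("id\n1000000000000000001\n1000000000000000002\n1000000000000000003\n")

kept_ids = strip_header(demo_src, demo_dest)
print(kept_ids)
```

The cleaned file would then be the input given to `twarc2 hydrate`.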

In [62]:
# this takes a very long time.
!twarc2 hydrate raw/dehydratedCapitolRiotTweets.txt raw/riots.jsonl

100%|███████| Processed 82309/82309 lines of input file [19:03<00:00, 71.97it/s]


In [64]:
# regardless of how you slice this, about
# 80% of the content is still there (65211 of 82308 tweets).

# these are very slow
! twarc2 flatten raw/riots.jsonl output/riots_flat.jsonl
! twarc2 csv output/riots_flat.jsonl output/riots_flat.csv

! wc output/riots_flat.jsonl
! wc output/riots_flat.csv


100%|████████████████| Processed 236M/236M of input file [00:17<00:00, 13.8MB/s]
100%|████████████████| Processed 525M/525M of input file [01:13<00:00, 7.54MB/s]

ℹ️
Parsed 65211 tweets objects from 65211 lines in the input file.
233 were duplicates. Wrote 64978 rows and output 74 columns in the CSV.

    65211  31133313 550352897 output/riots_flat.jsonl
    64979  11965879 358548306 output/riots_flat.csv


In [65]:
# let's deal with just 10,000 of these chuckleheads:
! head -n 10000 output/riots_flat.jsonl > output/riots10k.jsonl
! twarc2 csv output/riots10k.jsonl > output/riots10k.csv

In [66]:
riots_df = pandas.read_csv("output/riots10k.csv", low_memory=False)

In [67]:
riots_df.shape

(9997, 74)

In [68]:
riots_df.columns

Index(['id', 'conversation_id', 'referenced_tweets.replied_to.id',
       'referenced_tweets.retweeted.id', 'referenced_tweets.quoted.id',
       'author_id', 'in_reply_to_user_id', 'retweeted_user_id',
       'quoted_user_id', 'created_at', 'text', 'lang', 'source',
       'public_metrics.like_count', 'public_metrics.quote_count',
       'public_metrics.reply_count', 'public_metrics.retweet_count',
       'reply_settings', 'possibly_sensitive', 'withheld.scope',
       'withheld.copyright', 'withheld.country_codes', 'entities.annotations',
       'entities.cashtags', 'entities.hashtags', 'entities.mentions',
       'entities.urls', 'context_annotations', 'attachments.media',
       'attachments.media_keys', 'attachments.poll.duration_minutes',
       'attachments.poll.end_datetime', 'attachments.poll.id',
       'attachments.poll.options', 'attachments.poll.voting_status',
       'attachments.poll_ids', 'author.id', 'author.created_at',
       'author.username', 'author.name', 'author

In [69]:
# count the users
unique_users_df = riots_df.author_id.unique()
(unique_users_df.shape)

(9157,)

In [70]:
# How many different people did the 9,157 unique authors quote?
# unique() counts NaN (tweets that quote no one) as a value, so if the
# column has any NaN, 231 here means 230 different quoted users.
users_quoted_df = riots_df.quoted_user_id.unique()
(users_quoted_df.shape)

(231,)
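If the off-by-one from missing values is a concern, `nunique()` skips NaN by default. A minimal sketch on made-up data (the sample IDs are hypothetical, not from the riots dataset):

```python
import pandas

# made-up data: four tweets, one of which quotes nobody
sample_df = pandas.DataFrame({"quoted_user_id": [101.0, None, 101.0, 202.0]})

# unique() keeps NaN as one of the distinct values
distinct_with_nan = len(sample_df.quoted_user_id.unique())

# nunique() drops NaN before counting
distinct_users = sample_df.quoted_user_id.nunique()

print(distinct_with_nan, distinct_users)
```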

# Episode 6: Search and Filter

In [73]:
# use Twitter advanced search syntax (everything in quotes!)
# to get tailored results
!twarc2 search --limit 800 "(cute OR fluffy OR haircut) (#catsofinstagram) lang:en" raw/kittens.jsonl
!twarc2 csv raw/kittens.jsonl output/kittens.csv

Set --limit of 800 reached:  92%|▉| Processed 6 days/6 days [00:08<00:00, 896 tw
100%|██████████████| Processed 2.57M/2.57M of input file [00:01<00:00, 1.60MB/s]

ℹ️
Parsed 896 tweets objects from 9 lines in the input file.
Wrote 896 rows and output 74 columns in the CSV.
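Long queries are easy to mistype, so it can help to assemble them from their parts. The `build_query` helper below is hypothetical (it is not part of twarc); it is a sketch that only builds the query string used in the search above, which you would then paste into the `twarc2 search` command.

```python
def build_query(any_of=(), hashtags=(), lang=None):
    """Assemble a search query string from keyword, hashtag and language parts."""
    parts = []
    if any_of:
        # keywords joined with OR, grouped in parentheses
        parts.append("(" + " OR ".join(any_of) + ")")
    if hashtags:
        # hashtags separated by spaces, which act as AND
        tags = " ".join("#" + tag.lstrip("#") for tag in hashtags)
        parts.append("(" + tags + ")")
    if lang:
        parts.append("lang:" + lang)
    return " ".join(parts)


query = build_query(
    any_of=["cute", "fluffy", "haircut"],
    hashtags=["catsofinstagram"],
    lang="en",
)
print(query)
```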



In [74]:
kittens_df = pandas.read_csv("output/kittens.csv")

In [75]:
kittens_df

Unnamed: 0,id,conversation_id,referenced_tweets.replied_to.id,referenced_tweets.retweeted.id,referenced_tweets.quoted.id,author_id,in_reply_to_user_id,retweeted_user_id,quoted_user_id,created_at,...,geo.geo.bbox,geo.geo.type,geo.id,geo.name,geo.place_id,geo.place_type,__twarc.retrieved_at,__twarc.url,__twarc.version,Unnamed: 73
0,1529602531445288960,1529602531445288960,,1.529357e+18,,1419813068658282497,,9.144608e+17,,2022-05-25T23:17:16.000Z,...,,,,,,,2022-05-25T23:26:22+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
1,1529602031719174146,1529602031719174146,,1.529495e+18,,969133628738465793,,1.513539e+18,,2022-05-25T23:15:17.000Z,...,,,,,,,2022-05-25T23:26:22+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
2,1529599217974628358,1529599217974628358,,1.529357e+18,,1244261209655914498,,9.144608e+17,,2022-05-25T23:04:06.000Z,...,,,,,,,2022-05-25T23:26:22+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
3,1529599029256019969,1529599029256019969,,1.529357e+18,,1520030542972006401,,9.144608e+17,,2022-05-25T23:03:21.000Z,...,,,,,,,2022-05-25T23:26:22+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
4,1529594974123220992,1529594974123220992,,1.529357e+18,,1355336563182379012,,9.144608e+17,,2022-05-25T22:47:14.000Z,...,,,,,,,2022-05-25T23:26:22+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
891,1527266234462527488,1527266234462527488,,1.526889e+18,,3103520718,,3.958303e+09,,2022-05-19T12:33:40.000Z,...,,,,,,,2022-05-25T23:26:30+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
892,1527259783149019136,1527259783149019136,,1.526889e+18,,2260128217,,3.958303e+09,,2022-05-19T12:08:02.000Z,...,,,,,,,2022-05-25T23:26:30+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
893,1527258740105400321,1527258740105400321,,1.526889e+18,,1087492717872119813,,3.958303e+09,,2022-05-19T12:03:53.000Z,...,,,,,,,2022-05-25T23:26:30+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
894,1527257778154135552,1527257778154135552,,,,1421621215135866881,,,,2022-05-19T12:00:04.000Z,...,,,,,,,2022-05-25T23:26:30+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,


In [76]:
list(kittens_df.columns)

['id',
 'conversation_id',
 'referenced_tweets.replied_to.id',
 'referenced_tweets.retweeted.id',
 'referenced_tweets.quoted.id',
 'author_id',
 'in_reply_to_user_id',
 'retweeted_user_id',
 'quoted_user_id',
 'created_at',
 'text',
 'lang',
 'source',
 'public_metrics.like_count',
 'public_metrics.quote_count',
 'public_metrics.reply_count',
 'public_metrics.retweet_count',
 'reply_settings',
 'possibly_sensitive',
 'withheld.scope',
 'withheld.copyright',
 'withheld.country_codes',
 'entities.annotations',
 'entities.cashtags',
 'entities.hashtags',
 'entities.mentions',
 'entities.urls',
 'context_annotations',
 'attachments.media',
 'attachments.media_keys',
 'attachments.poll.duration_minutes',
 'attachments.poll.end_datetime',
 'attachments.poll.id',
 'attachments.poll.options',
 'attachments.poll.voting_status',
 'attachments.poll_ids',
 'author.id',
 'author.created_at',
 'author.username',
 'author.name',
 'author.description',
 'author.entities.description.cashtags',
 'author

# Search

Twitter search uses Boolean logic.

### Using And 
In twarc, a space between search terms acts as `AND`.

In [77]:

!twarc2 search --limit 200 "grumpy cat" > raw/grumpy_throwaway.jsonl
# this will return tweets matching both terms, grumpy and cat
# running without a limit will crash the kernel: the cats are so strong

### Using Or

In twarc, using OR will return results where either condition is met 

In [78]:
!twarc2 search  --limit 200 "grumpy OR cat" > raw/grumpy_throwaway.jsonl

# this will return tweets where either search term is matched, grumpy or cat
# running without a limit will crash the kernel: grumpy people and cats are so strong

### Not Logic 

What about negating certain terms within a search? Use a dash (-) followed by the keyword you want to avoid

In [79]:
# let's exclude Doja Cat from our internet cat search

!twarc2 search --limit 200 "grumpy OR cat -Dojacat" > raw/grumpy_throwaway.jsonl

### Searching by User 

Looking for mentions, tos/froms 

In [80]:
# to: will match any tweet that is a reply to a particular user
# You can only pass a single username per to: operator
!twarc2 search "(to:realgrumpycat)" > raw/grumpy_throwaway.jsonl

Likewise, from: will match any tweet from a specific user 

In [81]:
!twarc2 search "(from:realgrumpycat)" > raw/grumpy_throwaway.jsonl

### Searching for Hashtags

In [82]:
# the below command would be a good way to blow your quota
# !twarc2 search "#meme" > raw/grumpy_throwaway.jsonl

### Order of Operations 

Of course, you can combine search operators to narrow down results, but remember the rule: `AND` (a space) is applied before `OR`. 
You can also use () to group terms and eliminate any uncertainty.

Without parentheses, the query on the left is read as the one on the right:

* cats OR cole marmalade > cats OR (cole marmalade) 
* cats cole OR marmalade > (cats cole) OR marmalade

With parentheses, you decide the grouping:

* (cats OR cole) marmalade > marmalade plus either cats or some guys named Cole 
* cat (cole marmalade) > group them together, and give us the actual cat duo 
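Python's own `and` binds tighter than `or`, just like the rule here, so the grouping can be tried out locally on a few made-up tweets. This is only a stand-in for illustration: it splits text on whitespace and is not how Twitter actually matches tweets.

```python
def words(text):
    """Lower-case a tweet and split it into a set of words."""
    return set(text.lower().split())


def query_a(text):
    # cats OR cole marmalade  ->  cats OR (cole AND marmalade)
    w = words(text)
    return "cats" in w or "cole" in w and "marmalade" in w


def query_b(text):
    # (cats OR cole) marmalade
    w = words(text)
    return ("cats" in w or "cole" in w) and "marmalade" in w


# made-up sample tweets
samples = [
    "cats are great",
    "cole and marmalade nap",
    "cole went home",
    "marmalade on toast",
]

results_a = [query_a(text) for text in samples]
results_b = [query_b(text) for text in samples]

for text, a, b in zip(samples, results_a, results_b):
    print(f"{text!r}: ungrouped={a}, grouped={b}")
```

The first sample is the one that differs: the ungrouped query accepts any tweet about cats, while the grouped query also demands marmalade.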

### Advanced Cat Challenge 

In [83]:
# !twarc2 search "(grumpy cat #meme)"
 
# !twarc2 search "(grumpy cat) OR (#meme has:images)"

# !twarc2 search "(cats OR puppies) has:media"

# !twarc2 search "(to:_We_Rate_Cats) lang:en"

# Stream

In [84]:
!twarc2 stream-rules add "#WorldGothDay"

🚀  Added rule for "#WorldGothDay"


In [85]:
!twarc2 stream-rules add "gothcats"

🚀  Added rule for "gothcats"


In [86]:
# press the square to interrupt this!

In [87]:
# !twarc2 stream > "raw/streamed_goth.jsonl"

In [88]:
! wc raw/streamed_goth.jsonl
! twarc2 flatten raw/streamed_goth.jsonl > output/streamed_goth_flat.jsonl
! wc output/streamed_goth_flat.jsonl 

wc: raw/streamed_goth.jsonl: No such file or directory
Usage: twarc2 flatten [OPTIONS] [INFILE] [OUTFILE]
Try 'twarc2 flatten --help' for help.

Error: Invalid value for '[INFILE]': 'raw/streamed_goth.jsonl': No such file or directory
0 0 0 output/streamed_goth_flat.jsonl
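The errors above appear because the stream cell is commented out, so `raw/streamed_goth.jsonl` was never written. A small guard can check for the file before running the slower commands. The `ready_to_flatten` helper is hypothetical, and the demo uses a temporary file rather than real stream output.

```python
import os
import tempfile
from pathlib import Path


def ready_to_flatten(path):
    """Return True if the file exists and is not empty."""
    candidate = Path(path)
    return candidate.is_file() and candidate.stat().st_size > 0


# a path that was never created, like the stream file above
missing_ok = ready_to_flatten("no_such_directory/streamed_goth.jsonl")

# a temporary file with one made-up line stands in for real stream output
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample_stream.jsonl")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write('{"data": []}\n')
    present_ok = ready_to_flatten(sample_path)

print(missing_ok, present_ok)
```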


In [89]:
!twarc2 stream-rules delete "gothcats"
!twarc2 stream-rules delete "#WorldGothDay"

🗑  Deleted stream rule for gothcats
🗑  Deleted stream rule for #WorldGothDay


# Episode 7: twarc plug-ins

### Install the plug-ins
You'll need to do this each time your kernel restarts.

In [90]:
!pip install twarc-hashtags



In [91]:
!pip install twarc-network



In [92]:
# retweeted-by is a built-in command. no plug-in necessary. It takes a tweet ID:

In [93]:
!twarc2 retweeted-by 1522543998996414464 > 'raw/tinycarebot_rtby.jsonl'
!twarc2 flatten raw/tinycarebot_rtby.jsonl > output/tinycarebot_rtby_flat.jsonl

Speaking of retweeting, it's very good to figure out how much of your dataset 
is tweets, and how much of it is retweets and quotes.

# Retweets vs. tweets
How much original content is there?
Do this for both the library timeline and #catsofinstagram.

In [94]:
# via counting
retweet_count = hashtagcats_df["referenced_tweets.retweeted.id"].value_counts()
sum(retweet_count)


397

In [95]:
(sum(retweet_count) / len(hashtagcats_df))

0.794

About 79% of the tweets that used #catsofinstagram were retweets.
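The counting approach above adds up how often each retweeted ID appears. An equivalent and slightly shorter route is to ask which rows have a retweeted ID at all. A sketch on made-up data (the sample IDs are hypothetical):

```python
import pandas

# made-up data: five tweets, three of which are retweets
sample_df = pandas.DataFrame(
    {"referenced_tweets.retweeted.id": [111.0, None, 111.0, 222.0, None]}
)

# True where the tweet references a retweeted tweet
is_retweet = sample_df["referenced_tweets.retweeted.id"].notna()

retweet_total = int(is_retweet.sum())   # number of retweets
retweet_share = float(is_retweet.mean())  # fraction of all rows

print(retweet_total, retweet_share)
```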

In [96]:
# so our pipeline on a stream would look like:


### Followers

In [97]:
# this is slow and uses quota
# as the wc output below shows, this returns 1000 followers.
!twarc2 followers --limit 1 tinycarebot >  'raw/tcb_followers.jsonl'

In [98]:
!twarc2 flatten raw/tcb_followers.jsonl > output/tcb_followers_flat.jsonl

In [99]:
! wc output/tcb_followers_flat.jsonl

   1000   53682 1208389 output/tcb_followers_flat.jsonl


In [100]:
# tiny challenge: do robots follow robots?
# look at the help!
! twarc2 csv --input-data-type users output/tcb_followers_flat.jsonl > output/tcb_followers.csv
# csv doesn't work on profiles.

### Most used hashtags

In [101]:
!twarc2 hashtags raw/hashtagcats.jsonl output/hashtagcats_hashtags.csv

100%|██████████████| Processed 1.54M/1.54M of input file [00:00<00:00, 33.3MB/s]
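To get a feel for what a hashtag count involves, here is a rough sketch using only the standard library. It is not the plug-in's implementation. It assumes one tweet per line (as in a flattened file) with hashtags under `entities.hashtags[].tag`, and lower-casing the tags is a choice made for this example. The sample lines are made up.

```python
import json
from collections import Counter

# made-up tweets, one JSON object per line
sample_lines = [
    '{"id": "1", "entities": {"hashtags": [{"tag": "CatsOfInstagram"}, {"tag": "cute"}]}}',
    '{"id": "2", "entities": {"hashtags": [{"tag": "catsofinstagram"}]}}',
    '{"id": "3"}',
]


def count_hashtags(lines):
    """Count hashtags across tweets, ignoring case."""
    counts = Counter()
    for line in lines:
        tweet = json.loads(line)
        # tweets without hashtags have no entities.hashtags key
        for hashtag in tweet.get("entities", {}).get("hashtags", []):
            counts[hashtag["tag"].lower()] += 1
    return counts


hashtag_counts = count_hashtags(sample_lines)
print(hashtag_counts.most_common())
```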


In [102]:
!twarc2 network raw/hashtagcats.jsonl output/hashtagcats_network.html

In [103]:
!twarc2 network raw/ecodatasci.jsonl output/ecodatasci_network.html

In [104]:
# this one is too big
# !twarc2 network raw/hashtag_gasprices.jsonl output/hashtag_gasprices_network.html

In [105]:
# print a file's contents in the cell
print(open("output/hashtagcats_hashtags.csv").read())

In [106]:
# this reminds you what DataFrames you have in memory
%who DataFrame

ecodatasci_df	 hashtagcats_df	 kittens_df	 library_timeline_df	 most_rt	 riots_dehydrated_df	 riots_df	 sort_by_rt	 ucsblib_timeline_df	 



In [107]:
# when do we do mentions?
# maybe we just mention them.
# ! twarc2 mentions ucsblibrary raw/ucsblibrary_mentions.jsonl

# ! twarc2 csv raw/ucsblibrary_mentions.jsonl output/ucsblibrary_mentions.csv 
# ucsb_library_mentions_df = pandas.read_csv('output/ucsblibrary_mentions.csv')

# Episode 8: Python text analysis

### Sentiment Analysis
To do this, we need to do a little Python

TextBlob is a text processing library that does sentiment analysis. 
The sentiment property returns a namedtuple of the form Sentiment(polarity, subjectivity). The polarity score is a float within the range [-1.0, 1.0], where -1.0 is very negative and 1.0 is very positive. The subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.

Before we use TextBlob for sentiment analysis, we need to download
datasets of words and their associated weights. These are called *corpora*.

In [108]:
# this was already run in ep 2; running it again just confirms the corpora are up to date
!python -m textblob.download_corpora

[nltk_data] Downloading package brown to /home/jovyan/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to /home/jovyan/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
Finished.


In [109]:
# TextBlob needs a string, so this won't work.
# textblob.TextBlob(hashtagcats_df).sentiment

In [110]:
# even calling the column won't work:
# textblob.TextBlob(hashtagcats_df['text']).sentiment

In [111]:
# break the tweets' text column into a list, then .join into one long string 
hashtagcats_list = ' '.join(hashtagcats_df['text'].tolist())
# turn the string into a blob
hashtagcats_blob = textblob.TextBlob(hashtagcats_list)
# get the sentiment
hashtagcats_blob.sentiment

Sentiment(polarity=0.3337667641696417, subjectivity=0.6110313566890312)

In [112]:
# what dataframes are still here?
%whos DataFrame


Variable              Type         Data/Info
--------------------------------------------
ecodatasci_df         DataFrame                          id <...>\n[473 rows x 74 columns]
hashtagcats_df        DataFrame                          id <...>\n[500 rows x 74 columns]
kittens_df            DataFrame                          id <...>\n[896 rows x 74 columns]
library_timeline_df   DataFrame                          id <...>\n[500 rows x 74 columns]
most_rt               DataFrame                          id <...>\n\n[1 rows x 74 columns]
riots_dehydrated_df   DataFrame           134686307243517952<...>n[82308 rows x 1 columns]
riots_df              DataFrame                           id<...>n[9997 rows x 74 columns]
sort_by_rt            DataFrame                          id <...>\n[500 rows x 74 columns]
ucsblib_timeline_df   DataFrame                          id <...>\n[500 rows x 74 columns]


The overall sentiment of the language of kitty twitter is rather positive (polarity of about 0.33).
And the tweets tend to be subjective (subjectivity of about 0.61).

In [113]:
# What do you think the sentiment of gasprices might be?
# get the overall sentiment and see if it matches your prediction.

In [114]:
! twarc2 csv output/hashtag_gasprices_flat.jsonl > output/hashtag_gasprices_flat.csv
hashtag_gasprices_df = pandas.read_csv("output/hashtag_gasprices_flat.csv", low_memory=False)

In [115]:
gasprices_list = ' '.join(hashtag_gasprices_df['text'].tolist())
gasprices_blob = textblob.TextBlob(gasprices_list)
print("Hashtag Gas Prices: ") 
gasprices_blob.sentiment

Hashtag Gas Prices: 


Sentiment(polarity=0.0783091422742766, subjectivity=0.44952337845641754)

In [116]:
hashtagcats_list = ' '.join(hashtagcats_df['text'].tolist())
hashtagcats_blob = textblob.TextBlob(hashtagcats_list)
print("Hashtag Cats of Instagram: ") 
hashtagcats_blob.sentiment

Hashtag Cats of Instagram: 


Sentiment(polarity=0.3337667641696417, subjectivity=0.6110313566890312)

# Episode 9: Data Management