Combining DataFrames with pandas
OverviewTeaching: 20 min
Exercises: 25 minQuestions
Can I work with data from multiple sources?
How can I combine data from different data sets?Objectives
Combine data from multiple files into a single DataFrame using merge and concat.
Combine two DataFrames using a unique ID found in both DataFrames.
to_csvto export a DataFrame in CSV format.
Join DataFrames using common fields (join keys).
In many “real world” situations, the data that we want to use come in multiple
files. We often need to combine these files into a single DataFrame to analyze
the data. The pandas package provides various methods for combining
To work through the examples below, we first need to load the species and surveys files into pandas DataFrames. In iPython:
import pandas as pd authors_df = pd.read_csv("authors.csv", keep_default_na=False, na_values=[""]) authors_df TCP Author 0 A00002 Aylett, Robert, 1583-1655? 1 A00005 Higden, Ranulf, d. 1364. Polycronicon. English... 2 A00007 Higden, Ranulf, d. 1364. Polycronicon. 3 A00008 Wood, William, fl. 1623, attributed name. 4 A00011 places_df = pd.read_csv("places.csv", keep_default_na=False, na_values=[""]) places_df A00002 London 0 A00005 London 1 A00007 London 2 A00008 The Netherlands? 3 A00011 Amsterdam 4 A00012 London 5 A00014 London
Take note that the
read_csv method we used can take some additional options which
we didn’t use previously. Many functions in python have a set of options that
can be set by the user if needed. In this case, we have told Pandas to assign
empty values in our CSV to NaN
More about all of the read_csv options here.
We can use the
concat function in Pandas to append either columns or rows from
one DataFrame to another. Let’s grab two subsets of our data to see how this
# read in first 10 lines of surveys table place_sub = places_df.head(10) # grab the last 20 rows place_sub_last10 = places_df.tail(20) #reset the index values to the second dataframe appends properly place_sub_last10 = place_sub_last10.reset_index(drop=True) # drop=True option avoids adding new index column with old index values
When we concatenate DataFrames, we need to specify the axis.
Pandas to stack the second DataFrame under the first one. It will automatically
detect whether the column names are the same and will stack accordingly.
axis=1 will stack the columns in the second DataFrame to the RIGHT of the
first DataFrame. To stack the data vertically, we need to make sure we have the
same columns and associated column format in both datasets. When we stack
horizonally, we want to make sure what we are doing makes sense (ie the data are
related in some way).
# stack the DataFrames on top of each other vertical_stack = pd.concat([place_sub, place_sub_last10], axis=0) # place the DataFrames side by side horizontal_stack = pd.concat([place_sub, place_sub_last10], axis=1)
Row Index Values and Concat
Have a look at the
vertical_stack dataframe? Notice anything unusual?
The row indexes for the two data frames
have been repeated. We can reindex the new dataframe using the
Writing Out Data to CSV
We can use the
to_csv command to do export a DataFrame in CSV format. Note that the code
below will by default save the data into the current working directory. We can
save it to a different folder by adding the foldername and a slash to the file
vertical_stack.to_csv('foldername/out.csv'). We use the ‘index=False’ so that
pandas doesn’t include the index number for each line.
# Write DataFrame to CSV vertical_stack.to_csv('out.csv', index=False)
Check out your working directory to make sure the CSV wrote out properly, and that you can open it! If you want, try to bring it back into python to make sure it imports properly.
# for kicks read our output back into python and make sure all looks good new_output = pd.read_csv('out.csv', keep_default_na=False, na_values=[""])
Challenge - Combine Data
In the data folder, there are two catalogue data files:
1641.csv. Read the data into python and combine the files to make one new data frame. Create a plot of average plot weight by year grouped by sex. Export your results as a CSV and make sure it reads back into python properly.
When we concatenated our DataFrames we simply added them to each other - stacking them either vertically or side by side. Another way to combine DataFrames is to use columns in each dataset that contain common values (a common unique id). Combining DataFrames using a common field is called “joining”. The columns containing the common values are called “join key(s)”. Joining DataFrames in this way is often useful when one DataFrame is a “lookup table” containing additional data that we want to include in the other.
NOTE: This process of joining tables is similar to what we do with tables in an SQL database.
places.csv file is table that contains the place and EEBO id for some titles.
When we want to access that information, we can create a query that joins the additional
columns of information to the author data.
Storing data in this way has many benefits including:
Identifying join keys
To identify appropriate join keys we first need to know which field(s) are shared between the files (DataFrames). We might inspect both DataFrames to identify these columns. If we are lucky, both DataFrames will have columns with the same name that also contain the same data. If we are less lucky, we need to identify a (differently-named) column in each DataFrame that contains the same information.
>>> authors_df.columns Index([u'TCP', u'EEBO', u'VID', u'STC', u'Status', u'Author', u'Date', u'Title', u'Terms', u'Pages'], dtype='object') >>> places_df.columns Index([u'EEBO', u'Place'], dtype='object')
In our example, the join key is the column containing the identifier, which is called
Now that we know the fields with the common TCP ID attributes in each DataFrame, we are almost ready to join our data. However, since there are different types of joins, we also need to decide which type of join makes sense for our analysis.
The most common type of join is called an inner join. An inner join combines two DataFrames based on a join key and returns a new DataFrame that contains only those rows that have matching values in both of the original DataFrames.
Inner joins yield a DataFrame that contains only rows where the value being joins exists in BOTH tables. An example of an inner join, adapted from this page is below:
The pandas function for performing joins is called
merge and an Inner join is
the default option:
merged_inner = pd.merge(left=authors_df,right=places_df, left_on='TCP', right_on='TCP') # in this case `species_id` is the only column name in both dataframes, so if we skippd `left_on` # and `right_on` arguments we would still get the same result # what's the size of the output data? merged_inner.shape merged_inner
TCP Author Place 0 A00002 Aylett, Robert, 1583-1655? London 1 A00005 Higden, Ranulf, d. 1364. Polycronicon. English... London 2 A00007 Higden, Ranulf, d. 1364. Polycronicon. London 3 A00008 Wood, William, fl. 1623, attributed name. The Netherlands? 4 A00011 NaN Amsterdam
The result of an inner join of
places_df is a new DataFrame
that contains the combined set of columns from those tables. It
only contains rows that have two-letter species codes that are the same in
place_df DataFrames. In other words, if a row in
authors_df has a value of
TCP that does not appear in the
TCP, it will not be included in the DataFrame returned by an
inner join. Similarly, if a row in
places_df has a value of
that does not appear in the
TCP column of
places_df, that row will not
be included in the DataFrame returned by an inner join.
The two DataFrames that we want to join are passed to the
merge function using
right argument. The
left_on='TCP' argument tells
to use the
TCP column as the join key from
DataFrame). Similarly , the
right_on='TCP' argument tells
TCP column as the join key from
DataFrame). For inner joins, the order of the
right arguments does
merged_inner DataFrame contains all of the columns from
(TCP, Person) as well as all the columns from
merged_inner has fewer rows than
place_sub. This is an
indication that there were rows in
place_df with value(s) for
do not exist as value(s) for
What if we want to add information from
losing any of the information from
survey_sub? In this case, we use a different
type of join called a “left outer join”, or a “left join”.
Like an inner join, a left join uses join keys to combine two DataFrames. Unlike
an inner join, a left join will return all of the rows from the
DataFrame, even those rows whose join key(s) do not have values in the
DataFrame. Rows in the
left DataFrame that are missing values for the join
key(s) in the
right DataFrame will simply have null (i.e., NaN or None) values
for those columns in the resulting joined DataFrame.
Note: a left join will still discard rows from the
right DataFrame that do not
have values for the join key(s) in the
A left join is performed in pandas by calling the same
merge function used for
inner join, but using the
merged_left = pd.merge(left=places_df,right=authors_df, how='left', left_on='TCP', right_on='TCP') merged_left **OUTPUT:** TCP Place Author 0 A00002 London Aylett, Robert, 1583-1655? 1 A00005 London Higden, Ranulf, d. 1364. Polycronicon. English... 2 A00007 London Higden, Ranulf, d. 1364. Polycronicon. 3 A00008 The Netherlands? Wood, William, fl. 1623, attributed name. 4 A00011 Amsterdam NaN
The result DataFrame from a left join (
merged_left) looks very much like the
result DataFrame from an inner join (
merged_inner) in terms of the columns it
contains. However, unlike
merged_left contains the same
number of rows as the original
place_sub DataFrame. When we inspect
merged_left, we find there are rows where the information that should have
Author) is missing (they contain NaN values):
merged_inner[ pd.isnull(merged_inner.Author) ] **OUTPUT:** TCP Author Place 4 A00011 NaN Amsterdam 6 A00014 NaN London 8 A00018 NaN Germany?
These rows are the ones where the value of
authors_df does not occur in
Other join types
merge function supports two other join types:
- Right (outer) join: Invoked by passing
how='right'as an argument. Similar to a left join, except all rows from the
rightDataFrame are kept, while rows from the
leftDataFrame without matching join key(s) values are discarded.
- Full (outer) join: Invoked by passing
how='outer'as an argument. This join type returns the all pairwise combinations of rows from both DataFrames; i.e., the result DataFrame will
NaNwhere data is missing in one of the dataframes. This join type is very rarely used.
Challenge - Distributions
Create a new DataFrame by joining the contents of the
places.csvtables. Then calculate and plot the distribution of:
- title by author by place