Data Ingest & Visualization - Matplotlib & Pandas
Last updated on 2024-02-13 | Edit this page
Estimated time: 45 minutes
Overview
Questions
- What other tools can I use to create plots apart from ggplot?
- Why should I use Python to create plots?
Objectives
- Import the pyplot toolbox to create figures in Python.
Putting it all together
Up to this point, we have walked through tasks that are often
involved in handling and processing data using the workshop ready
cleaned
files that we have provided. In this wrap-up exercise, we will perform
many of the same tasks with real data sets. This lesson also covers data
visualization.
As opposed to the previous ones, this lesson does not give step-by-step directions to each of the tasks. Use the lesson materials you’ve already gone through as well as the Python documentation to help you along.
1. Obtain data
There are many repositories online from which you can obtain data. We
are providing you with one data file to use with these exercises, but
feel free to use any data that is relevant to your research. The file
eebo.csv
contains the complete EEBO metadata of books
between 1300-1700. If you’d like to use this dataset, please find it in
the data folder.
2. Clean up your data and open it using Python and Pandas
To begin, import your data file into Python using Pandas. Did it fail? Your data file probably has a header that Pandas does not recognize as part of the data table. Remove this header, but do not simply delete it in a text editor! Use either a shell script or Python to do this - you wouldn’t want to do it by hand if you had many files to process.
If you are still having trouble importing the data as a table using Pandas, check the documentation. You can open the docstring in an ipython notebook using a question mark. For example:
Look through the function arguments to see if there is a default
value that is different from what your file requires (Hint: the problem
is most likely the delimiter or separator. Common delimiters are
','
for comma, ' '
for space, and
'\t'
for tab).
Create a DataFrame that includes only the values of the data that are useful to you. In the streamgage file, those values might be the date, title, terms, and TCP. Convert and joined strings into individual ones. You can also change the name of the columns in the DataFrame like this:
PYTHON
df['Author'] = df['Author'].str.split(';') #split a column in the data frame
df = pd.DataFrame({'1stcolumn':[100,200], '2ndcolumn':[10,20]}) # this just creates a DataFrame for the example!
print('With the old column names:\n') # the \n makes a new line, so it's easier to see
print(df)
df.columns = ['FirstColumn','SecondColumn'] # rename the columns!
print('\n\nWith the new column names:\n')
print(df)
With the old column names:
1stcolumn 2ndcolumn
0 100 10
1 200 20
With the new column names:
FirstColumn SecondColumn
0 100 10
1 200 20
3. Make a line plot of your data
Matplotlib is a Python library that can be used to visualize data.
The toolbox matplotlib.pyplot
is a collection of functions
that make matplotlib work like MATLAB. In most cases, this is all that
you will need to use, but there are many other useful tools in
matplotlib that you should explore.
We will cover a few basic commands for formatting plots in this lesson. A great resource for help styling your figures is the matplotlib gallery (http://matplotlib.org/gallery.html), which includes plots in many different styles and the source code that creates them. The simplest of plots is the 2 dimensional line plot. These examples walk through the basic commands for making line plots using pyplots.
Challenge - Lots of plots
Make a variety of line plots from your data. If you are using the
streamgage data, these could include (1) a hydrograph of the entire
month of September 2013, (2) the discharge record for the week of the
2013 Front Range flood (September 9 through 15), (3) discharge vs. time
of day, for every day in the record in one figure (Hint: use loops to
combine strings and give every line a different style and color), and
(4) minimum, maximum, and mean daily discharge values. Add axis labels,
titles, and legends to your figures. Make at least one figure with
multiple plots using the function subplot()
.
Using pyplot:
First, import the pyplot toolbox:
By default, matplotlib will create the figure in a separate window. When using ipython notebooks, we can make figures appear in-line within the notebook by writing:
We can start by plotting the values of a list of numbers (matplotlib can handle many types of numeric data, including numpy arrays and pandas DataFrames - we are just using a list as an example!):
The command plt.show()
prompts Python to display the
figure. Without it, it creates an object in memory but doesn’t produce a
visible plot. The ipython notebooks (if using
%matplotlib inline
) will automatically show you the figure
even if you don’t write plt.show()
, but get in the habit of
including this command!
If you provide the plot()
function with only one list of
numbers, it assumes that it is a sequence of y-values and plots them
against their index (the first value in the list is plotted at
x=0
, the second at x=1
, etc). If the function
plot()
receives two lists, it assumes the first one is the
x-values and the second the y-values. The line connecting the points
will follow the list in order:
A third, optional argument in plot()
is a string of
characters that indicates the line type and color for the plot. The
default value is a continuous blue line. For example, we can make the
line red ('r'
), with circles at every data point
('o'
), and a dot-dash pattern ('-.'
). Look
through the matplotlib gallery for more examples.
The command plt.axis()
sets the limits of the axes from
a list of [xmin, xmax, ymin, ymax]
values (the square
brackets are needed because the argument for the function
axis()
is one list of values, not four separate numbers!).
The functions xlabel()
and ylabel()
will label
the axes, and title()
will write a title above the
figure.
A single figure can include multiple lines, and they can be plotted
using the same plt.plot()
command by adding more pairs of x
values and y values (and optionally line styles):
PYTHON
import numpy as np
# create a numpy array between 0 and 10, with values evenly spaced every 0.5
t = np.arange(0., 10., 0.5)
# red dashes with no symbols, blue squares with a solid line, and green triangles with a dotted line
plt.plot(t, t, 'r--', t, t**2, 'bs-', t, t**3, 'g^:')
plt.xlabel('This is the x axis')
plt.ylabel('This is the y axis')
plt.title('This is the figure title')
plt.show()
We can include a legend by adding the optional keyword argument
label=''
in plot()
. Caution: We cannot add
labels to multiple lines that are plotted simultaneously by the
plt.plot()
command like we did above because Python won’t
know to which line to assign the value of the argument label. Multiple
lines can also be plotted in the same figure by calling the
plot()
function several times:
PYTHON
# red dashes with no symbols, blue squares with a solid line, and green triangles with a dotted line
plt.plot(t, t, 'r--', label='linear')
plt.plot(t, t**2, 'bs-', label='square')
plt.plot(t, t**3, 'g^:', label='cubic')
plt.legend(loc='upper left', shadow=True, fontsize='x-large')
plt.xlabel('This is the x axis')
plt.ylabel('This is the y axis')
plt.title('This is the figure title')
plt.show()
The function legend()
adds a legend to the figure, and
the optional keyword arguments change its style. By default [typing just
plt.legend()
], the legend is on the upper right corner and
has no shadow.
Like MATLAB, pyplot is stateful; it keeps track of the current figure
and plotting area, and any plotting functions are directed to those
axes. To make more than one figure, we use the command
plt.figure()
with an increasing figure number inside the
parentheses:
PYTHON
# this is the first figure
plt.figure(1)
plt.plot(t, t, 'r--', label='linear')
plt.legend(loc='upper left', shadow=True, fontsize='x-large')
plt.title('This is figure 1')
plt.show()
# this is a second figure
plt.figure(2)
plt.plot(t, t**2, 'bs-', label='square')
plt.legend(loc='upper left', shadow=True, fontsize='x-large')
plt.title('This is figure 2')
plt.show()
A single figure can also include multiple plots in a grid pattern.
The subplot()
command especifies the number of rows, the
number of columns, and the number of the space in the grid that
particular plot is occupying:
PYTHON
plt.figure(1)
plt.subplot(2,2,1) # two row, two columns, position 1
plt.plot(t, t, 'r--', label='linear')
plt.subplot(2,2,2) # two row, two columns, position 2
plt.plot(t, t**2, 'bs-', label='square')
plt.subplot(2,2,3) # two row, two columns, position 3
plt.plot(t, t**3, 'g^:', label='cubic')
plt.show()
4. Make other types of plots:
Matplotlib can make many other types of plots in much the same way
that it makes 2 dimensional line plots. Look through the examples in http://matplotlib.org/users/screenshots.html
and try a few of them (click on the “Source code” link and copy and
paste into a new cell in ipython notebook or save as a text file with a
.py
extension and run in the command line).
Challenge - Final Plot
Display your data using one or more plot types from the example gallery. Which ones to choose will depend on the content of your own data file. If you are using the streamgage file, you could make a histogram of the number of days with a given mean discharge, use bar plots to display daily discharge statistics, or explore the different ways matplotlib can handle dates and times for figures.