Content from Data visualization with Pandas and Matplotlib


Last updated on 2026-02-09

Overview

Questions

  • How do you start exploring and visualizing data using Python?
  • How can you make and customize plots?

Objectives

  • Explain the difference between installing a library and importing it.
  • Load data from a CSV file into a pandas DataFrame and inspect its contents and structure.
  • Generate plots, such as scatter plots and box plots, directly from a pandas DataFrame.
  • Construct a Matplotlib figure containing multiple subplots.
  • Customize plot aesthetics like titles, axis labels, colors, and layout by passing arguments to plotting functions.
  • Export a completed figure to a file.

Loading libraries


A great feature in Python is the ability to import libraries to extend its capabilities. For now, we’ll focus on two of the most widely used libraries for data analysis: pandas and Matplotlib. We’ll be using pandas for data wrangling and manipulation, and Matplotlib for (you guessed it) making plots.

To be able to use these libraries in our code, we have to install and import them. Installation is needed as pandas and Matplotlib are third-party libraries that aren’t built into Python. You should have gone through the installation process during the setup for the workshop (if not, visit the setup page), so we’ll jump straight to showing you how to import libraries.

To import a library, we use the syntax import libraryName. If we want to give the library a nickname to shorten the command each time we call it, we can add as nickNameHere. Here is how you’d import pandas and Matplotlib using the common nicknames pd and plt, respectively.

PYTHON

import pandas as pd
import matplotlib.pyplot as plt

If you got an error similar to ModuleNotFoundError: No module named '___', it means you haven't installed the library. Check the setup page to install it, and double-check that you typed the library name correctly.

If you're asking yourself why we used matplotlib.pyplot instead of just matplotlib, good question! We are not importing the entire Matplotlib library, only the part of it we need for plotting, called pyplot. We will return to this topic later to explain more about matplotlib.pyplot and why that is the part we need to import.

For the purposes of this workshop, you only need to install a library once on your computer. However, you must import it in every notebook or script where you plan to use it, since Python doesn’t automatically load installed libraries.

In your future work with Python, this may not always be the case. You might want to keep different projects separate by using a different Python environment for each one. In that case, you’ll need to install the same library in each environment. We’ll talk more about environments later in the workshop.

Loading and exploring data


For this lesson, we will be using the Portal Teaching data, a subset of the data from Ernest et al. Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA.

We will be using files from the Portal Project Teaching Database. This section will use the surveys_complete_77_89.csv file that can be downloaded here: https://datacarpentry.github.io/R-ecology-lesson/data/cleaned/surveys_complete_77_89.csv

We are studying the weight, hindfoot length, and sex of animals caught at sites in our study area. The data is stored as a .csv file: each row holds information for a single animal, and the columns represent:

Column            Description
record_id         Unique ID for the observation
month             month of observation
day               day of observation
year              year of observation
plot_id           ID of a particular site
species_id        2-letter species code
sex               sex of animal (“M”, “F”)
hindfoot_length   length of the hindfoot in mm
weight            weight of the animal in grams
genus             the genus of the species
species           the Latin species name
taxa              general taxonomic category
plot_type         type of experimental manipulation conducted

We'll load the data from the CSV file into Python and name the new object surveys. For this, we can use the pandas library and its .read_csv() function, as shown here:

PYTHON

surveys = pd.read_csv("../data/raw/surveys_complete_77_89.csv")

Here we have created a new object that we can reference later in our code. All objects in Python have a type, which determines what you can do with them. To know the type of an object, we can use the type() function.

PYTHON

type(surveys)

OUTPUT

pandas.core.frame.DataFrame

Notice how we didn't use the = operator in this case. This means we didn't create a new object; we just asked Python to show us the output of the function. When we created surveys with the = operator, it didn't print an output, but it stored the object in Python for later use.

From the output we can read that surveys is an object of pandas DataFrame type. We'll explore in depth what a DataFrame is in the next episode. For now, we only need to keep in mind that our data is now contained in a DataFrame. This is important because the methods we'll cover next (.head(), .info(), and .plot()) are only available to DataFrame objects.

Now that our data is in Python, we can start working with it. A good place to start is taking a look at the data. With the .head() method, we can see the first five rows of the data set.

PYTHON

surveys.head()

OUTPUT

   record_id  month  day  year  plot_id species_id sex  hindfoot_length  \
0          1      7   16  1977        2         NL   M             32.0
1          2      7   16  1977        3         NL   M             33.0
2          3      7   16  1977        2         DM   F             37.0
3          4      7   16  1977        7         DM   M             36.0
4          5      7   16  1977        3         DM   M             35.0

   weight      genus   species    taxa                 plot_type
0     NaN    Neotoma  albigula  Rodent                   Control
1     NaN    Neotoma  albigula  Rodent  Long-term Krat Exclosure
2     NaN  Dipodomys  merriami  Rodent                   Control
3     NaN  Dipodomys  merriami  Rodent          Rodent Exclosure
4     NaN  Dipodomys  merriami  Rodent  Long-term Krat Exclosure  

The .tail() method will instead give us the last five rows of the data. If we want to override this default and display, say, the last two rows, we can use the n argument. An argument is an input that a function or method takes to modify how it operates, and you set arguments using the = sign.

PYTHON

surveys.tail(n=2)

OUTPUT

       record_id  month  day  year  plot_id species_id sex  hindfoot_length  \
16876      16877     12    5  1989       11         DM   M             37.0
16877      16878     12    5  1989        8         DM   F             37.0

       weight      genus   species    taxa plot_type
16876    50.0  Dipodomys  merriami  Rodent   Control
16877    42.0  Dipodomys  merriami  Rodent   Control

Another useful method to get a glimpse of the data is .info(). This tells us how many rows the data set has (# of entries), the number of columns, and, for each column, its name, the number of non-null (non-empty) values it contains, and its data type.

PYTHON

surveys.info()

OUTPUT

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16878 entries, 0 to 16877
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   record_id        16878 non-null  int64
 1   month            16878 non-null  int64
 2   day              16878 non-null  int64
 3   year             16878 non-null  int64
 4   plot_id          16878 non-null  int64
 5   species_id       16521 non-null  object
 6   sex              15578 non-null  object
 7   hindfoot_length  14145 non-null  float64
 8   weight           15186 non-null  float64
 9   genus            16521 non-null  object
 10  species          16521 non-null  object
 11  taxa             16521 non-null  object
 12  plot_type        16878 non-null  object
dtypes: float64(2), int64(5), object(6)
memory usage: 1.7+ MB
Callout

Functions and methods in Python

As we saw, functions allow us to perform a given task. They take inputs, perform the task on those inputs, and return an output. For example, with the pd.read_csv() function, we gave the file path as input, and it returned the DataFrame as output.

Functions can be built-in, which means they come natively with Python, like type() or print(). Or they can come from an imported library, as pd.read_csv() does from pandas.

A method is similar to a function. The main difference is that a method is associated with a particular object type, which is why we use the syntax object_name.method_name().

This is the case for .head(), .tail(), and .info(): they are methods that the pandas DataFrame object carries around with it.
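
As a minimal illustration (using the surveys object we created earlier), note the difference in syntax between calling a function on an object and calling a method of that object:

PYTHON

# A function takes the object as an input argument:
type(surveys)

# A method is called on the object itself, using dot syntax:
surveys.head()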

Basic plotting with Pandas


pandas makes it easy to start plotting your data with the .plot() method. By passing arguments to this method, we can tell pandas how we want the plot to look.

With the following code, we will make a scatter plot (argument kind = "scatter") to analyze the relationship between the weight (which will be plotted on the x-axis, argument x = "weight") and the hindfoot length (on the y-axis, argument y = "hindfoot_length") of the animals sampled at the study site.

PYTHON

surveys.plot(x = "weight", y = "hindfoot_length", kind = "scatter")

When coding, you'll often find that you can get to the same result using different code. In this case, the creators of pandas made it possible to produce the previous plot with the .plot.scatter() method, without having to specify the kind argument.

PYTHON

surveys.plot.scatter(x = "weight", y = "hindfoot_length")

This scatter plot shows there seems to be a positive relationship between weight and hindfoot length: heavier animals tend to have longer hindfeet. But you may have noticed that parts of our scatter plot have many overlapping points, making it difficult to see all the data. We can adjust the transparency of the points using the alpha argument, which takes a value between 0 and 1.

PYTHON

surveys.plot(x = "weight", y = "hindfoot_length",
                  kind = "scatter", alpha = 0.2)

With transparency added to the points, we can more clearly observe a clustering of data points into several more densely populated regions of the scatter plot.

There are multiple ways to learn what other arguments are available to modify the look of our plot. One of them is reading the documentation of the library we are using. In the next episode, we'll cover other ways to get help in Python.

Challenge

Challenge - Changing the color of points

Check the documentation of the pandas method .plot.scatter() to learn what argument you can use to change the color of the points in the scatter plot to green.

Here is the link to the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.scatter.html

Continuing from our last line of code where we added the “alpha” argument, we can add the argument c = "green" to achieve what we want.

PYTHON

surveys.plot(x = "weight", y = "hindfoot_length",
                  kind="scatter", alpha = 0.2, c = "green")

Similarly, we could make a box plot to explore the distribution of the hindfoot length across all samples. In this case, pandas expects different arguments: we'll use the column argument to specify which column we want to analyze.

PYTHON

surveys.plot(column = "hindfoot_length", kind = "box")

The box plot shows the median hindfoot length is around 35 mm (represented by the line inside the box) and most values lie between 21 and 37 mm (the borders of the box, representing the 1st and 3rd quartiles of the data, respectively).

We could further expand this analysis and look at the distribution of this variable across different plot types. For that, we can add the by argument, which says which variable we want to disaggregate the box plot by.

PYTHON

surveys.plot(column = "hindfoot_length", by = "plot_type", kind = "box")

In the resulting plot, the x-axis labels overlap with each other, making them unreadable. Furthermore, we'd like to start customizing the title and axis labels, or maybe make multiple subplots in the same figure. At this point we realize we need more fine-grained control over our graph, and this is where Matplotlib enters the picture.

Advanced plots with Matplotlib


Matplotlib is a Python library that is widely used throughout the scientific Python community to create high-quality and publication-ready graphics. It supports a wide range of raster and vector graphics formats including PNG, PostScript, EPS, PDF and SVG.

Moreover, Matplotlib is the actual engine behind the plotting capabilities of pandas, and of other plotting libraries like seaborn and plotnine. For example, when we call the .plot() method on pandas data objects, Matplotlib is actually being used "backstage".
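
If you want to see this for yourself, you can check the type of the object that .plot() returns: it is a Matplotlib Axes object (the exact class name printed may vary between Matplotlib versions):

PYTHON

ax = surveys.plot(x = "weight", y = "hindfoot_length", kind = "scatter")
type(ax)  # a matplotlib.axes.Axes object: pandas plotting returns Matplotlib objects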

Our first step in the process is creating our figure and our axes (or plots), using the plt.subplots() function.

PYTHON

fig, axis = plt.subplots()

The fig object we are creating is the entire plot area, which can contain one or multiple axes. In this case, the function default is to create only one set of axes, which we reference as the axis object. For now, this results in an empty plot: a blank canvas for us to start plotting data onto.

We'll add the box plot we made earlier to this axis, using the ax argument of the .plot() method. This gives us the same plot as before, but now drawn on an axes object that we can start customizing.

PYTHON

fig, axis = plt.subplots()
surveys.plot(column = "hindfoot_length", by = "plot_type",
                  kind = "box", ax = axis)

To start, we can rotate the x-axis labels 90 degrees to make them readable. For this, we use the .tick_params() method on the axis object.

PYTHON

fig, axis = plt.subplots()
surveys.plot(column = "hindfoot_length", by = "plot_type",
                  kind = "box", ax = axis)
axis.tick_params(axis = 'x', rotation = 90)

Axes objects have many methods like .tick_params() that can be used to adjust the layout and styling of the plot. For example, we can modify the title (the column name is the default) and add x- and y-axis labels with the .set_title(), .set_xlabel(), and .set_ylabel() methods. By now, you may have begun copying and pasting the parts of the code that don't change to save yourself some typing. Some lines include only subtle changes, so take care not to miss anything small but important when reusing previously written lines.

PYTHON

fig, axis = plt.subplots()
surveys.plot(column = "hindfoot_length", by = "plot_type",
                  kind = "box", ax = axis)
axis.tick_params(axis='x', rotation = 90)
axis.set_title("Distribution of hindfoot length across plot types")
axis.set_xlabel("Plot Types")
axis.set_ylabel("Hindfoot Length (mm)")

Making multiple subplots

If we want more than one plot in the same figure, we could specify the number of rows (nrows argument) and the number of columns (ncols) when calling the plt.subplots function. For example, let’s say we want two plots (or axes), organized in two columns and one row. This will be useful in a minute, when we arrange our scatter plot and box plot in a single figure.

PYTHON

fig, axes = plt.subplots(nrows = 1, ncols = 2) # note the variable name is "axes" here rather than "axis" used above

The axes object contains two objects, which correspond to the two sets of axes. We can access each of the objects inside axes using axes[0] and axes[1].

We'll get more comfortable with this Python syntax during the workshop, but the most important thing to know now is that Python indexes elements starting with 0. This means the first element in a sequence has index 0, the second element has index 1, the third element has index 2, and so forth.
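
Here is a minimal illustration of zero-based indexing, using a plain Python list:

PYTHON

letters = ["a", "b", "c"]
letters[0]  # returns 'a', the element at index 0 (the first)
letters[1]  # returns 'b', the element at index 1 (the second)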

Here is the code to have two subplots, one with the scatter plot (on axes[0]), and one with the box plot (on axes[1]):

PYTHON

fig, axes = plt.subplots(nrows = 1, ncols = 2)
surveys.plot(x = "weight", y = "hindfoot_length", kind="scatter",
                  alpha = 0.2, ax = axes[0])
surveys.plot(column = "hindfoot_length", by = "plot_type",
                  kind = "box", ax = axes[1])

As shown before, Matplotlib allows us to customize every aspect of our figure. So we’ll make it more professional by adding titles and axis labels. Notice at the end the introduction of the .suptitle() method, which allows us to add a super title to the entire figure. We also use the .tight_layout() method, to automatically adjust the padding between and around subplots. (Feel free to try the code below with and without the fig.tight_layout() line to better understand the difference it makes.)

PYTHON

fig, axes = plt.subplots(nrows = 1, ncols = 2)
surveys.plot(x = "weight", y = "hindfoot_length", kind="scatter", alpha = 0.2, ax = axes[0])
axes[0].set_title("Weight vs. Hindfoot Length")
axes[0].set_xlabel("Weight (g)")
axes[0].set_ylabel("Hindfoot Length (mm)")

surveys.plot(column = "hindfoot_length", by = "plot_type",
                  kind = "box", ax = axes[1])
axes[1].tick_params(axis = "x", rotation = 90)
axes[1].set_title("Hindfoot Length by Plot Type")
axes[1].set_xlabel("Plot Types")
axes[1].set_ylabel("Hindfoot Length (mm)")

fig.suptitle("Analysis of Hindfoot Length variable", fontsize=16)
fig.tight_layout()

You now have the basic tools to create and customize plots using pandas and Matplotlib! Let’s put it into practice.

Challenge

Challenge - Changing figure size

Our plot probably needs more vertical space, so let’s change the figure size. Take a look at Matplotlib’s figure documentation and answer:

  1. What is the argument we need to make this adjustment?
  2. In which units is the size specified (inches, pixels, centimeters)?
  3. What is the default figure size if we don’t change the argument?
  4. Use the argument in the plt.subplots() function to make the previous figure 7x7 inches in size.
Here are the answers:

  1. The argument we're looking for is figsize.
  2. Figure dimensions are specified as (width, height) in inches.
  3. The default figure size is 6.4 inches wide and 4.8 inches tall.
  4. We can add the figsize = (7,7) argument like this:

PYTHON

fig, axes = plt.subplots(nrows = 1, ncols = 2, figsize = (7,7))

And keep the rest of our plot the same.

Challenge

Challenge - Add a third axis

Let's explore the data for the hindfoot length variable more deeply. Add a third axis to the plot we've been working on, for a histogram of the hindfoot length. For this you can add the argument kind = "hist" to an additional .plot() method call.

After that, there are a few things you can do to make this graph look nicer. Try making each change one at a time to see what it does.

  • Increase the size of the figure to 10 inches wide and 7 inches tall, using the figsize argument from the previous challenge.
  • Make the histogram orientation horizontal and hide the legend. For this, add the orientation='horizontal' and legend = False arguments to your .plot() method.
  • As the y-axis of each subplot is the same, we could use a shared y-axis for all three. Explore the Matplotlib documentation some more to find out which parameter you can use to achieve this.
  • Add a nice title to your third subplot.

Here is the final code to achieve all of the previous goals. Notice what has changed from our previous code.

PYTHON

fig, axes = plt.subplots(nrows = 1, ncols = 3, figsize = (10,7), sharey = True)
surveys.plot(x = "weight", y = "hindfoot_length", kind="scatter",
                  alpha = 0.2, ax = axes[0])
axes[0].set_title("Weight vs. Hindfoot Length")
axes[0].set_xlabel("Weight (g)")
axes[0].set_ylabel("Hindfoot Length (mm)")

surveys.plot(column = "hindfoot_length", by = "plot_type",
                  kind = "box", ax = axes[1])
axes[1].tick_params(axis =  "x", rotation = 90)
axes[1].set_title("Hindfoot Length by Plot Type")
axes[1].set_xlabel("Plot Types")
axes[1].set_ylabel("Hindfoot Length (mm)2")

surveys.plot(column = "hindfoot_length", orientation="horizontal",
                  legend = False, kind = "hist", ax = axes[2])
axes[2].set_title("Hindfoot Length Histogram")

fig.suptitle("Analysis of Hindfoot Length variable", fontsize=16)
fig.tight_layout()

Exporting plots

Until now, our plots only live inside Python. Once we're satisfied with the resulting plot, we can save it with the .savefig() method of our figure object. The only required argument is the file path on your computer where you want to save it. Matplotlib recognizes the extension used in the filename and supports (on most systems) the png, pdf, ps, eps, and svg formats.

PYTHON

fig.savefig("images/hindfoot_analysis.png")
Discussion

Challenge - Make and save your own plot

Put all this knowledge into practice and come up with your own plot of the Portal Teaching data set. Use a different combination of variables, or a different kind of plot. Here is the documentation for the pandas DataFrame.plot() method, where you can see what other values you can use in the kind argument.

Save your plot to the “images” folder with a .pdf extension.
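
There is no single correct answer here, but one possible solution (using weight disaggregated by sex, purely as an illustration; any combination of variables works) could look like this:

PYTHON

fig, axis = plt.subplots()
surveys.plot(column = "weight", by = "sex", kind = "box", ax = axis)
axis.set_ylabel("Weight (g)")
fig.savefig("images/weight_by_sex.pdf")  # the file name here is just an example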

Other plotting libraries


Learning to plot with pandas and Matplotlib can be overwhelming by itself. However, we want to invite you to explore other plotting libraries available in Python, as some types of plots will be easier to make with them. Every new library comes with its own functions, methods, and arguments, but with time and practice you'll get used to researching these new libraries and learning how they work.

This is where the true power of open source software lies: a big community of contributors is constantly creating new libraries and improving existing ones.

Some of the libraries you might find useful are:

  • Plotnine: Inspired by R's ggplot2, it implements the "grammar of graphics" approach. If you come from the R world and the tidyverse, you'll definitely want to check it out.

    Calling the .draw() method on a plotnine figure returns a Matplotlib figure, so you can always start with plotnine and then customize using Matplotlib.

  • Seaborn: Focusing on statistical graphics, there’s a variety of plot types that will be simpler to make (require fewer lines of code) in seaborn. To name a few: statistical representations of averages or of simple linear regression models; plots of categorical data; multivariate plots, etc. Refer to the introduction to seaborn to learn more.

    For example, the following scatter plot which adds “sex” as a third variable would require more lines of code if done with pandas + Matplotlib alone.

    PYTHON

    import seaborn as sns
    sns.scatterplot(data = surveys, x="weight", y="hindfoot_length",
                  hue="sex", alpha = 0.2)
  • Plotly: A tool to explore if you want web-based interactive visualizations where you can hover over data points to get additional information or zoom in to get a closer look. Plots can be displayed in Jupyter notebooks, standalone HTML files, or be part of a web dashboard.
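
    As a taste of the syntax, here is a sketch (assuming Plotly is installed; plotly.express is its high-level interface, and this example is our addition rather than part of the lesson):

    PYTHON

    import plotly.express as px
    fig = px.scatter(surveys, x = "weight", y = "hindfoot_length", color = "sex")
    fig.show()  # opens an interactive plot you can hover over and zoom into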

Key Points
  • Load your required libraries into Python, using their common nicknames.
  • Use pandas to load your data (pd.read_csv()) and to explore it (the .head(), .tail(), and .info() methods).
  • The .plot() method on your DataFrame is a good plotting starting point.
  • Matplotlib allows you to customize every aspect of your plot. Start with the plt.subplots() function to create a figure object and the number of axes (or subplots) you need.
  • Export plots to a file using the .savefig() method.

Content from Exploring and understanding data


Last updated on 2026-02-09

Overview

Questions

  • How can I do exploratory data analysis in Python?
  • How do I get help when I am stuck?
  • What impact does an object’s type have on what I can do with it?
  • How are expressions evaluated and values assigned to variables?

Objectives

  • Explore the structure and content of pandas dataframes
  • Convert data types and handle missing data
  • Interpret error messages and develop strategies to get help with Python
  • Trace how Python assigns values to objects

The pandas DataFrame


We just spent quite a bit of time learning how to create visualisations from the surveys data, but we did not talk much about what surveys is. You may remember that we loaded the data into Python with the pandas.read_csv function. The output of read_csv is a data frame: a common way of representing tabular data in a programming language. To be precise, surveys is an object of type DataFrame. In Python, pretty much everything you work with is an object of some type. The type function can be used to tell you the type of any object you pass to it.

PYTHON

type(surveys)

OUTPUT

pandas.core.frame.DataFrame

This output tells us that the DataFrame object type is defined by pandas, i.e. it is a special type of object not included in the core functionality of Python.

Exploring data in a dataframe

We encountered the plot, head and tail methods in the previous episode. Dataframe objects carry many other methods, including some that are useful when exploring a dataset for the first time. Consider the output of describe:

PYTHON

surveys.describe()

OUTPUT

          record_id         month           day          year       plot_id  hindfoot_length        weight
count  16878.000000  16878.000000  16878.000000  16878.000000  16878.000000     14145.000000  15186.000000
mean     8439.500000      6.382214     15.595805   1983.582119     11.471442        31.982114     53.216647
std      4872.403257      3.411215      8.428180      3.492428      6.865875        10.709841     44.265878
min         1.000000      1.000000      1.000000   1977.000000      1.000000         6.000000      4.000000
25%      4220.250000      3.000000      9.000000   1981.000000      5.000000        21.000000     24.000000
50%      8439.500000      6.000000     15.000000   1983.000000     11.000000        35.000000     42.000000
75%     12658.750000      9.000000     23.000000   1987.000000     17.000000        37.000000     53.000000
max     16878.000000     12.000000     31.000000   1989.000000     24.000000        70.000000    278.000000

These summary statistics give an immediate impression of the distribution of the data. It is always worth performing an initial “sniff test” with these: if there are major issues with the data or its formatting, they may become apparent at this stage.

One property of the data that we might notice at this stage is that the record_id column contains a sequence of integers from 1 to 16878, the total number of rows in the dataframe. This tells us that the record_id column is an index of the rows: it contains a unique identifier for each record. We can specify that this column should be used as the index of the dataframe, which has several benefits including making it easier for us to keep track of which rows we are looking at when we begin filtering the dataframe later.

PYTHON

surveys = surveys.set_index('record_id')

If you know in advance which column you want to use as an index, you can also specify this when you load the data from the file:

PYTHON

surveys = pd.read_csv('../data/raw/surveys_complete_77_89.csv', index_col=0)

We specify with the index_col argument that the first column in the CSV should be used as the index of the dataframe.
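
If you want to confirm which column is being used as the index, you can inspect the dataframe's index attribute:

PYTHON

surveys.index  # an Index of the record_id values, with name='record_id'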

Returning to the high-level exploration of our dataframe, the info method provides an overview of the columns:

PYTHON

surveys.info()

OUTPUT

<class 'pandas.core.frame.DataFrame'>
Index: 16878 entries, 1 to 16878
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   month            16878 non-null  int64
 1   day              16878 non-null  int64
 2   year             16878 non-null  int64
 3   plot_id          16878 non-null  int64
 4   species_id       16521 non-null  object
 5   sex              15578 non-null  object
 6   hindfoot_length  14145 non-null  float64
 7   weight           15186 non-null  float64
 8   genus            16521 non-null  object
 9   species          16521 non-null  object
 10  taxa             16521 non-null  object
 11  plot_type        16878 non-null  object
dtypes: float64(2), int64(4), object(6)
memory usage: 1.7+ MB

We get quite a bit of useful information here too. First, we are told that we have a DataFrame of 16878 entries, or rows, and 12 variables, or columns.

Next, we get a bit of information on each variable, including its column title, a count of the non-null values (that is, values that are not missing), and something called the dtype of the column.

Null values

A value may be missing because it was not possible to make a complete observation, because data was lost, or for any number of other reasons. Depending on the type of data stored in the column (more on which in a moment), missing values may appear as NaN (“Not a Number”), NA, <NA>, or NaT (“Not a Time”).

You may have noticed some during our initial exploration of the dataframe. (Note the NaN values in the first five rows of the weight column below.)

PYTHON

surveys.head()

OUTPUT

           month  day  year  plot_id  species_id  sex  hindfoot_length  weight      genus   species    taxa                 plot_type
record_id
1              7   16  1977        2          NL    M             32.0     NaN    Neotoma  albigula  Rodent                   Control
2              7   16  1977        3          NL    M             33.0     NaN    Neotoma  albigula  Rodent  Long-term Krat Exclosure
3              7   16  1977        2          DM    F             37.0     NaN  Dipodomys  merriami  Rodent                   Control
4              7   16  1977        7          DM    M             36.0     NaN  Dipodomys  merriami  Rodent          Rodent Exclosure
5              7   16  1977        3          DM    M             35.0     NaN  Dipodomys  merriami  Rodent  Long-term Krat Exclosure

From the output of surveys.info() above, we can tell that almost 1700 weight measurements and more than 2700 hindfoot length measurements are missing. Some of the other columns also have missing values.
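
To get exact counts of missing values per column, we can chain the .isna() method (which marks each missing value as True) with .sum(); this is a handy recipe, though it is not used elsewhere in this lesson. The counts below follow directly from the non-null counts in the .info() output above:

PYTHON

surveys.isna().sum()

OUTPUT

month                 0
day                   0
year                  0
plot_id               0
species_id          357
sex                1300
hindfoot_length    2733
weight             1692
genus               357
species             357
taxa                357
plot_type             0
dtype: int64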

In the rest of this episode, we will learn what we need to fill in the missing weight values.

Data types

The dtype property of a dataframe column describes the data type of the values stored in that column. There are three in the example above:

  • int64: this column contains integer (whole number) values.
  • object: this column contains string (non-numeric sequence of characters) values.
  • float64: this column contains “floating point” values i.e. numeric values containing a decimal point.

The 64 after int and float represents the level of precision with which the values in the column are stored in the computer’s memory. Other types with lower levels of precision are available for numeric values, e.g. int32 and float16, which will take up less memory on your system but limit the size and level of precision of the numbers they can store.
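
To see the difference in practice, we can compare the memory used by a column before and after converting it to a lower-precision type, using the astype method that we will meet properly later in this episode (.memory_usage() reports the size in bytes; this comparison is an aside we have added here):

PYTHON

surveys['year'].memory_usage()                   # int64: 8 bytes per value (plus the index)
surveys['year'].astype('int32').memory_usage()   # int32: 4 bytes per value (plus the index)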

The dtype of a column is important because it determines the kinds of operation that can be performed on the values in that column. Let’s work with a couple of the columns independently to demonstrate this.

The Series object

To work with a single column of a dataframe, we can refer to it by name in two different ways:

PYTHON

surveys["species_id"]

or

PYTHON

surveys.species_id # this only works if there are no spaces in the column name (note the underscore used here)

OUTPUT

record_id
1        NL
2        NL
3        DM
4        DM
5        DM
         ..
16874    RM
16875    RM
16876    DM
16877    DM
16878    DM
Name: species_id, Length: 16878, dtype: object
Callout

Tip: use tab completion on column names

Tab completion, where you start typing the name of a variable, function, etc before hitting Tab to auto-complete the rest, also works on column names of a dataframe. Since this tab completion saves time and reduces the chance of including typos, we recommend you use it as frequently as possible.

The result of that operation is a series of data: a one-dimensional sequence of values that all have the same dtype (object in this case). Dataframe objects are collections of series, "glued together" with a shared index: the column of unique identifiers we associate with each row. record_id is the index of the series summarised above; the values carried by the series are NL, DM, AH, etc. (short species identification codes).

If we choose a different column of the dataframe, we get another series with a different data type:

PYTHON

surveys['weight']

OUTPUT

record_id
1         NaN
2         NaN
3         NaN
4         NaN
5         NaN
         ...
16874    15.0
16875     9.0
16876    31.0
16877    50.0
16878    42.0
Name: weight, Length: 16878, dtype: float64

The data type of the series influences the things that can be done with/to it. For example, sorting works differently for these two series: the numeric values in the weight series are sorted from smallest to largest, while the character strings in species_id are sorted alphabetically:

PYTHON

surveys['weight'].sort_values()

OUTPUT

record_id
9790     4.0
5346     4.0
4052     4.0
9853     4.0
7084     4.0
        ...
16772    NaN
16777    NaN
16808    NaN
16846    NaN
16860    NaN
Name: weight, Length: 16878, dtype: float64

PYTHON

surveys['species_id'].sort_values()

OUTPUT

record_id
12345     AB
9861      AB
10970     AB
10963     AB
5759      AB
        ...
16453    NaN
16454    NaN
16488    NaN
16489    NaN
16539    NaN
Name: species_id, Length: 16878, dtype: object

This pattern of behaviour, where the type of an object determines what can be done with it and influences how it is done, is a defining characteristic of Python. As you gain more experience with the language, you will become more familiar with this way of working with data. For now, as you begin on your learning journey with the language, we recommend using the type function frequently to make sure that you know what kind of data/object you are working with, and do not be afraid to ask for help whenever you are unsure or encounter a problem.

Aside: Getting Help


You may have already encountered several errors while following the lesson and this is a good time to take a step back and discuss good strategies to get help when something goes wrong.

The built-in help function

Use help to view documentation for an object or function. For example, if you want to see documentation for the round function:

PYTHON

help(round)

OUTPUT

Help on built-in function round in module builtins:

round(number, ndigits=None)
    Round a number to a given precision in decimal digits.

    The return value is an integer if ndigits is omitted or None.  Otherwise
    the return value has the same type as the number.  ndigits may be negative.

If you are working in Jupyter (Notebook or Lab), the platform offers two additional ways to see documentation/get help:

  • Option 1: Type the function name in a cell with a question mark after it, e.g. round?. Then run the cell.
  • Option 2: (Not available on all systems) Place the cursor near where the function is invoked in a cell (i.e., the function name or its parameters),
    • Hold down Shift, and press Tab.
    • Do this several times to expand the information returned.

Understanding error messages

The error messages returned when something goes wrong can be (very) long but contain information about the problem, which can be very useful once you know how to interpret it. For example, you might receive a SyntaxError if you mistyped a line and the resulting code was invalid:

PYTHON

# Forgot to close the quote marks around the string.
name = 'Feng

ERROR

  Cell In[129], line 1
    name = 'Feng
           ^
SyntaxError: unterminated string literal (detected at line 1)

There are three parts to this error message:

ERROR

  Cell In[129], line 1

This tells us where the error occurred. This is of limited help in Jupyter, since we know the error is in the cell we just ran (Cell In[129]), but the line number can be helpful when the cell is quite long. When running a larger program written in Python, perhaps built up from multiple individual scripts, this information can be more useful, e.g.

ERROR

  data_visualisation.py, line 42

Next, we see a copy of the line where the error was encountered, often annotated with an arrow pointing out exactly where Python thinks the problem is:

ERROR

    name = 'Feng
           ^

Python is not exactly right in this case: from context you might be able to guess that the issue is really the lack of a closing quotation mark at the end of the line. But an arrow pointing to the opening quotation mark can give us a push in the right direction. Sometimes Python gets these annotations exactly right. Occasionally, it gets them completely wrong. In the vast majority of cases they are at least somewhat helpful.

Finally, we get the error message itself:

ERROR

SyntaxError: unterminated string literal (detected at line 1)

This always begins with a statement of the type of error encountered: in this case, a SyntaxError. That provides a broad categorisation for what went wrong. The rest of the message is a description of exactly what the problem was from Python’s perspective. Error messages can be loaded with jargon and quite difficult to understand when you are first starting out. In this example, unterminated string literal is a technical way of saying “you opened some quotes, which I think means you were trying to define a string value, but the quotes did not get closed before the end of the line.”

It is normal not to understand exactly what these error messages mean the first time you encounter them. Since programming involves making lots of mistakes (for everyone!), you will start to become familiar with many of them over time. As you continue learning, we recommend that you ask others for help: more experienced programmers have made all of these mistakes before you and will probably be better at spotting what has gone wrong. (More on asking for help below.)

Error output can get really long!

Especially when using functions from libraries you have imported into your program, the middle part of the error message (the traceback) can get rather long. For example, what happens if we try to access a column that does not exist in our dataframe?

PYTHON

surveys["wegiht"] # misspelling the 'weight' column name

ERROR

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/miniforge3/envs/carpentries/lib/python3.11/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3811 try:
-> 3812     return self._engine.get_loc(casted_key)
   3813 except KeyError as err:

File pandas/_libs/index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7096, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'wegiht'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[131], line 1
----> 1 surveys["wegiht"]

File ~/miniforge3/envs/carpentries/lib/python3.11/site-packages/pandas/core/frame.py:4107, in DataFrame.__getitem__(self, key)
   4105 if self.columns.nlevels > 1:
   4106     return self._getitem_multilevel(key)
-> 4107 indexer = self.columns.get_loc(key)
   4108 if is_integer(indexer):
   4109     indexer = [indexer]

File ~/miniforge3/envs/carpentries/lib/python3.11/site-packages/pandas/core/indexes/base.py:3819, in Index.get_loc(self, key)
   3814     if isinstance(casted_key, slice) or (
   3815         isinstance(casted_key, abc.Iterable)
   3816         and any(isinstance(x, slice) for x in casted_key)
   3817     ):
   3818         raise InvalidIndexError(key)
-> 3819     raise KeyError(key) from err
   3820 except TypeError:
   3821     # If we have a listlike key, _check_indexing_error will raise
   3822     #  InvalidIndexError. Otherwise we fall through and re-raise
   3823     #  the TypeError.
   3824     self._check_indexing_error(key)

KeyError: 'wegiht'

(This is still relatively short compared to some error messages we have seen!)

When you encounter a long error like this one, do not panic! Our advice is to focus on the first couple of lines and the last couple of lines. Everything in the middle (as the name traceback suggests) is retracing steps through the program, identifying where problems were encountered along the way. That information is only really useful to somebody interested in the inner workings of the pandas library, which is well beyond the scope of this lesson! If we ignore everything in the middle, the parts of the error message we want to focus on are:

ERROR

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)

[... skipping these parts ...]

KeyError: 'wegiht'

This tells us that the problem is the "key": the value we used to look up the column in the dataframe. Hopefully, the repetition of the value we provided is enough to help us realise our mistake.

Other ways to get help

There are several other ways that people often get help when they are stuck with their Python code.

  • Search the internet: paste the last line of your error message or the word “python” and a short description of what you want to do into your favourite search engine and you will usually find several examples where other people have encountered the same problem and came looking for help.
    • StackOverflow can be particularly helpful for this: answers to questions are presented as a ranked thread ordered according to how useful other users found them to be.
    • Take care: copying and pasting code written by somebody else is risky unless you understand exactly what it is doing!
  • Ask somebody "in the real world". If you have a colleague or friend with more expertise in Python than you have, show them the problem you are having and ask them for help.
  • Sometimes, the act of articulating your question can help you to identify what is going wrong. This is known as “rubber duck debugging” among programmers.

Generative AI

It is increasingly common for people to use generative AI chatbots such as ChatGPT to get help while coding. You will probably receive some useful guidance by presenting your error message to the chatbot and asking it what went wrong. However, the way this help is provided by the chatbot is different. Answers on StackOverflow have (probably) been given by a human as a direct response to the question asked. But generative AI chatbots, which are based on an advanced statistical model, respond by generating the most likely sequence of text that would follow the prompt they are given.

While responses from generative AI tools can often be helpful, they are not always reliable. These tools sometimes generate plausible but incorrect or misleading information, so (just as with an answer found on the internet) it is essential to verify their accuracy. You need the knowledge and skills to be able to understand these responses, to judge whether or not they are accurate, and to fix any errors in the code it offers you.

In addition to asking for help, programmers can use generative AI tools to generate code from scratch; extend, improve and reorganise existing code; translate code between programming languages; figure out what terms to use in a search of the internet; and more. However, there are drawbacks that you should be aware of.

The models used by these tools have been "trained" on very large volumes of data, much of it taken from the internet, and the responses they produce reflect that training data and may recapitulate its inaccuracies or biases. The environmental costs (energy and water use) of LLMs are a lot higher than those of other technologies, both during development (known as training) and when an individual uses one (known as inference). For more information see the AI Environmental Impact Primer developed by researchers at HuggingFace, an AI hosting platform. Concerns also exist about the way the data for this training was obtained, with questions raised about whether the people developing the LLMs had permission to use it. Other ethical concerns have also been raised, such as reports that workers were exploited during the training process.

We recommend that you avoid getting help from generative AI during the workshop for several reasons:

  1. For most problems you will encounter at this stage, help and answers can be found among the first results returned by searching the internet.
  2. The foundational knowledge and skills you will learn in this lesson by writing and fixing your own programs are essential to be able to evaluate the correctness and safety of any code you receive from online help or a generative AI chatbot. If you choose to use these tools in the future, the expertise you gain from learning and practising these fundamentals on your own will help you use them more effectively.
  3. As you start out with programming, the mistakes you make will be the kinds that have also been made – and overcome! – by everybody else who learned to program before you. Since these mistakes and the questions you are likely to have at this stage are common, they are also better represented than other, more specialised problems and tasks in the data that was used to train generative AI tools. This means that a generative AI chatbot is more likely to produce accurate responses to questions that novices ask, which could give you a false impression of how reliable they will be when you are ready to do things that are more advanced.

Data input within Python


Although it is more common (and faster) to enter data in another tool (e.g. a spreadsheet) and read it in, Series and DataFrame objects can also be created directly within Python. Before we can make a new Series, we need to learn about another type of data in Python: the list.

Lists

Lists are one of the standard data structures built into Python. A data structure is an object that contains more than one piece of information. (DataFrames and Series are also data structures.) The list is designed to contain multiple values in an ordered sequence: they are a great choice if you want to build up and modify a collection of values over time and/or handle each of those values one at a time. We can create a new list in Python by capturing the values we want it to store inside square brackets []:

PYTHON

years_list = [2020, 2025, 2010]
years_list

OUTPUT

[2020, 2025, 2010]

New values can be added to the end of a list with the append method:

PYTHON

years_list.append(2015)
years_list

OUTPUT

[2020, 2025, 2010, 2015]
Challenge

Exploring list methods

The append method allows us to add a value to the end of a list but how could we insert a new value into a given position instead? Applying what you have learned about how to find out the methods that an object has, can you figure out how to place the value 2019 into the third position in years_list (shifting the values after it up one more position)? Recall that the indexing used to specify positions in a sequence begins at 0 in Python.

Using tab completion, the help function, or looking up the documentation online, we can discover the insert method and learn how it works. insert takes two arguments: the position for the new list entry and the value to be placed in that position:

PYTHON

years_list.insert(2, 2019)
years_list

OUTPUT

[2020, 2025, 2019, 2010, 2015]

Among many other methods is sort, which can be used to sort the values in the list:

PYTHON

years_list.sort()
years_list

OUTPUT

[2010, 2015, 2019, 2020, 2025]

The easiest way to create a new Series is with a list:

PYTHON

years_series = pd.Series(years_list)
years_series

OUTPUT

0    2010
1    2015
2    2019
3    2020
4    2025
dtype: int64

With the data in a Series, we can no longer do some of the things we were able to do with the list, such as adding new values. But we do gain access to some new possibilities, which can be very helpful. For example, if we wanted to increase all of the values by 1000, this would be easy with a Series but more complicated with a list:

PYTHON

years_series + 1000

OUTPUT

0    3010
1    3015
2    3019
3    3020
4    3025
dtype: int64

PYTHON

years_list + 1000

ERROR

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[126], line 1
----> 1 years_list + 1000

TypeError: can only concatenate list (not "int") to list

This illustrates an important principle of Python: different data structures are suitable for different “modes” of working with data. It can be helpful to work with a list when building up an initial set of data from scratch, but when you are ready to begin operating on that dataset as a whole (performing calculations with it, visualising it, etc), you will be rewarded for switching to a more specialised datatype like a Series or DataFrame from pandas.

Unexpected data types


Operations like the addition of 1000 we performed on years_series work because pandas knows how to add a number to the integer values in the series. That behaviour is determined by the dtype of the series, which makes the dtype really important for how you can work with your data. Let's explore how the dtype is chosen. Returning to the years_series object we created above:

PYTHON

years_series

OUTPUT

0    2010
1    2015
2    2019
3    2020
4    2025
dtype: int64

The dtype: int64 was determined automatically based on the values passed in. But what if the values provided are of several different types?

PYTHON

ages_series = pd.Series([2, 3, 5.5, 6, 8])
ages_series

OUTPUT

0    2.0
1    3.0
2    5.5
3    6.0
4    8.0
dtype: float64

Pandas assigns a dtype that allows it to account for all of the values it is given, converting some values to another dtype if needed, in a process called coercion. In the case above, all of the integer values were coerced to floating point numbers to account for the 5.5.

Challenge

Exercise: Coercion

Can you guess the dtype of each series created below? Run the code to check whether you were right.

PYTHON

int_str = pd.Series([1, "two", 3])
str_flt = pd.Series(["four", 5.0, "six"])

PYTHON

int_str

OUTPUT

0      1
1    two
2      3
dtype: object

PYTHON

str_flt

OUTPUT

0   four
1    5.0
2    six
dtype: object

In both cases, the series falls back to the general object dtype, which is flexible enough to store values of any type. When automatically coercing values between types like this, pandas aims to minimise the amount of information lost.

In practice, it is much more common to read data from elsewhere (e.g. with read_csv) than to enter it manually within Python. When reading data from a file, pandas tries to guess the appropriate dtype to assign to each column (series) of the dataframe. This is usually very helpful but the process is sensitive to inconsistencies and data entry errors in the input: a stray character in one cell can cause an entire column to be coerced to a different dtype than you might have wanted.

For example, if the raw data includes a stray symbol introduced by a typo (= instead of -):

name,latitude,longitude
Superior,47.7,-87.5
Victoria,-1.0,33.0
Tanganyika,=6.0,29.3

We see a non-numeric dtype for the latitude column (object) when we load the data into a dataframe.

PYTHON

lakes = pd.read_csv('lakes.csv')
lakes.info()

OUTPUT

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   name       3 non-null      object
 1   latitude   3 non-null      object
 2   longitude  3 non-null      float64
dtypes: float64(1), object(2)
memory usage: 200.0+ bytes

It is a good idea to run the info method on a new dataframe after you have loaded data for the first time: if one or more of the columns has a different dtype than you expected, this may be a signal that you need to clean up the raw data.
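
If cleaning the raw file is not practical, the problematic column can also be coerced after loading. The pd.to_numeric() function with errors="coerce" converts anything it cannot parse into a missing value; this is an aside that goes a little beyond what this lesson covers:

PYTHON

lakes['latitude'] = pd.to_numeric(lakes['latitude'], errors='coerce')
lakes.info()  # latitude is now float64, with the bad value replaced by NaN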

Recasting

A column can be manually coerced (or recast) into a different dtype, provided that pandas knows how to handle that conversion. For example, the integer values in the plot_id column of our dataframe can be converted to floating point numbers:

PYTHON

surveys['plot_id'] = surveys['plot_id'].astype('float')
surveys['plot_id']

OUTPUT

record_id
1         2.0
2         3.0
3         2.0
4         7.0
5         3.0
         ...
16874    16.0
16875     5.0
16876     4.0
16877    11.0
16878     8.0
Name: plot_id, Length: 16878, dtype: float64

But the string values of species_id cannot be converted to numeric data:

PYTHON

surveys.species_id = surveys.species_id.astype('int64')

ERROR

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[101], line 1
----> 1 surveys.species_id = surveys.species_id.astype('int64')

File ~/miniforge3/envs/carpentries/lib/python3.11/site-packages/pandas/core/generic.py:6662, in NDFrame.astype(self, dtype, copy, errors)
   6656     results = [
   6657         ser.astype(dtype, copy=copy, errors=errors) for _, ser in self.items()
   6658     ]
   6660 else:
   6661     # else, only a single dtype is given
-> 6662     new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   6663     res = self._constructor_from_mgr(new_data, axes=new_data.axes)
   6664     return res.__finalize__(self, method="astype")

[... a lot more lines of traceback ...]

File ~/miniforge3/envs/carpentries/lib/python3.11/site-packages/pandas/core/dtypes/astype.py:133, in _astype_nansafe(arr, dtype, copy, skipna)
    129     raise ValueError(msg)
    131 if copy or arr.dtype == object or dtype == object:
    132     # Explicit copy, or required since NumPy can't view from / to object.
--> 133     return arr.astype(dtype, copy=True)
    135 return arr.astype(dtype, copy=copy)

ValueError: invalid literal for int() with base 10: 'NL'
Challenge

Changing Types

  1. Convert the values in the column plot_id back to integers.
  2. Now try converting weight to an integer. What goes wrong here? What is pandas telling you? We will talk about some solutions to this later.

PYTHON

surveys['plot_id'].astype("int")

OUTPUT

record_id
1         2
2         3
3         2
4         7
5         3
         ..
16874    16
16875     5
16876     4
16877    11
16878     8
Name: plot_id, Length: 16878, dtype: int64

PYTHON

surveys['weight'].astype("int")

ERROR

pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

Pandas cannot convert types from float to int if the column contains missing values.
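
As an aside, one workaround is pandas' nullable integer dtype, spelled 'Int64' with a capital I. It can represent missing values alongside integers, provided the non-missing values are whole numbers:

PYTHON

surveys['weight'].astype('Int64')  # missing values become <NA> instead of raising an error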

Missing Data

In addition to data entry errors, it is common to encounter missing values in large volumes of data. It is important to consider missing values while processing data because they can influence downstream analysis – that is, data analysis that will be done later – in unwanted ways when not handled correctly. pandas can distinguish missing values from the actual data and indeed they will be ignored for some tasks, such as calculation of the summary statistics provided by describe.

PYTHON

surveys.describe()

OUTPUT

              month           day          year       plot_id  hindfoot_length        weight
count  16878.000000  16878.000000  16878.000000  16878.000000     14145.000000  15186.000000
mean       6.382214     15.595805   1983.582119     11.471442        31.982114     53.216647
std        3.411215      8.428180      3.492428      6.865875        10.709841     44.265878
min        1.000000      1.000000   1977.000000      1.000000         6.000000      4.000000
25%        3.000000      9.000000   1981.000000      5.000000        21.000000     24.000000
50%        6.000000     15.000000   1983.000000     11.000000        35.000000     42.000000
75%        9.000000     23.000000   1987.000000     17.000000        37.000000     53.000000
max       12.000000     31.000000   1989.000000     24.000000        70.000000    278.000000

In other circumstances, like the recasting we attempted in the previous exercise, the missing values can cause trouble. It is up to us to decide how best to handle those missing values. We could remove the rows containing missing data, accepting the loss of all data for that observation:

PYTHON

surveys.dropna().head()

OUTPUT

           month  day  year  plot_id species_id sex  hindfoot_length  weight      genus   species    taxa                 plot_type
record_id
63             8   19  1977        3         DM   M             35.0    40.0  Dipodomys  merriami  Rodent  Long-term Krat Exclosure
64             8   19  1977        7         DM   M             37.0    48.0  Dipodomys  merriami  Rodent          Rodent Exclosure
65             8   19  1977        4         DM   F             34.0    29.0  Dipodomys  merriami  Rodent                   Control
66             8   19  1977        4         DM   F             35.0    46.0  Dipodomys  merriami  Rodent                   Control
67             8   19  1977        7         DM   M             35.0    36.0  Dipodomys  merriami  Rodent          Rodent Exclosure

But we should take note that this removes more than 3000 rows from the dataframe!

PYTHON

len(surveys)

OUTPUT

16878

PYTHON

len(surveys.dropna())

OUTPUT

13773
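
Dropping every row with any missing value may be more aggressive than we need. The dropna method also accepts a subset argument, so we could discard only the rows missing a value in a particular column:

PYTHON

surveys.dropna(subset=['weight']).head() # drop only rows where weight is missing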

Instead, we could fill all of the missing values with something else. For example, let’s make a copy of the surveys dataframe and then populate the missing values in the weight column of that copy with the mean of all the non-missing weights. There are a few parts to that operation, which are tackled one at a time below.

PYTHON

mean_weight = surveys['weight'].mean() # the 'mean' method calculates the mean of the non-null values in the column
df1 = surveys.copy() # making a copy to work with so that we do not edit our original data
df1["weight"] = df1['weight'].fillna(mean_weight) # the 'fillna' method fills all missing values with the provided value
df1.head()

OUTPUT

           month  day  year  plot_id species_id sex  hindfoot_length     weight      genus   species    taxa                 plot_type
record_id
1              7   16  1977        2         NL   M             32.0  53.216647    Neotoma  albigula  Rodent                   Control
2              7   16  1977        3         NL   M             33.0  53.216647    Neotoma  albigula  Rodent  Long-term Krat Exclosure
3              7   16  1977        2         DM   F             37.0  53.216647  Dipodomys  merriami  Rodent                   Control
4              7   16  1977        7         DM   M             36.0  53.216647  Dipodomys  merriami  Rodent          Rodent Exclosure
5              7   16  1977        3         DM   M             35.0  53.216647  Dipodomys  merriami  Rodent  Long-term Krat Exclosure

The choice to fill in these missing values rather than removing the rows that contain them can have implications for the result of your analysis. It is important to consider your approach carefully. Think about how the data will be used and how these values will impact the scientific conclusions made from the analysis. pandas gives us all of the tools that we need to account for these issues. But we need to be cautious about how the decisions that we make impact scientific results.
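
Decisions like this also change summary statistics. Filling with the mean leaves the column's mean unchanged, but it shrinks the standard deviation, because we have added nearly 1700 values that sit exactly at the centre of the distribution. A quick check:

PYTHON

print(surveys['weight'].std()) # spread of the original, non-missing weights
print(df1['weight'].std())     # smaller: the filled-in values add no spread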

Assignment, evaluation, and mutability


Stepping away from dataframes for a moment, the time has come to explore the behaviour of Python a little more.

Challenge

Exercise: variable assignments

What is the value of y after running the following lines?

PYTHON

x = 2
y = x*3
x = 10

PYTHON

x = 2
y = x*3
x = 10
y

OUTPUT

6

Understanding what’s going on here will help you avoid a lot of confusion when working in Python. When we assign something to a variable, the first thing that happens is the righthand side gets evaluated. So when we first ran the line y = x*3, x*3 first gets evaluated to the value of 6, and this gets assigned to y. The variables x and y are independent objects, so when we change the value of x to 10, y is unaffected. This behaviour may be different to what you are used to, e.g. from experience working with data in spreadsheets where cells can be linked such that modifying the value in one cell triggers changes in others.

Multiple evaluations can take place in a single line of Python code and learning to trace the order and impact of these evaluations is a key skill.

PYTHON

x = 10
y = 5
z = 3-(x/y)
z

OUTPUT

1.0

In the example above, x/y is evaluated first before the result is subtracted from 3 and the final calculated value is assigned to z. (The brackets () are not needed in the calculation above but are included to make the order of evaluation clearer.) Python makes each evaluation as it needs to in order to proceed with the next, before assigning the final result to the variable on the lefthand side of the = operator.

This means that we could have filled the missing values in the weight column of our dataframe copy in a single line:

PYTHON

df1["weight"] = df1['weight'].fillna(df1["weight"].mean())

First, the mean weight is calculated (df1["weight"].mean() is evaluated). Then the result of that evaluation is passed into fillna and the result of the filling operation (df1['weight'].fillna(<RESULT OF PREVIOUS>)) is assigned to df1["weight"].

Variable naming

You are going to name a lot of variables in Python! There are some rules you have to stick to when doing so, as well as recommendations that will make your life easier.

  • Make names clear without being too long
    • wkg is probably too short.
    • weight_in_kilograms is probably too long.
    • weight_kg is good.
  • Names cannot begin with a number.
  • Names cannot contain spaces; use underscores instead.
  • Names are case sensitive: weight_kg is a different name from Weight_kg. Avoid uppercase characters at the beginning of variable names.
  • Names cannot contain most special characters, such as +, &, -, /, *, and . (the underscore _ is the notable exception).
  • Two common formats of variable name are snake_case and camelCase. A third “case” of naming convention, kebab-case, is not allowed in Python (see the rule above).
  • Aim to be consistent in how you name things within your projects. It is easier to follow an established style guide, such as Google’s, than to come up with your own.
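
As a quick illustration of the rules in the list above, here is a sketch of names Python will and will not accept (the invalid ones are commented out because each raises a SyntaxError):

PYTHON

weight_kg = 2.3      # valid: snake_case, short but descriptive
weightKg = 2.3       # valid: camelCase (pick one style and stick to it)
# 2nd_weight = 2.3   # invalid: names cannot begin with a number
# weight kg = 2.3    # invalid: names cannot contain spaces
# weight-kg = 2.3    # invalid: kebab-case; Python reads '-' as subtraction
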
Challenge

Exercise: good variable names

Identify at least one good variable name and at least one variable name that could be improved in this episode. Refer to the rules and recommendations listed above to suggest how these variable names could be better.

mean_weight and surveys are examples of reasonably good variable names: they are relatively short yet descriptive.

df1 and df2 are not descriptive enough and could be confusing if encountered by somebody else, or by ourselves, in a few weeks’ time. The names could be improved by making them more descriptive, e.g. surveys_filled for the copy with filled-in weights and surveys_copy for the plain duplicate.

Mutability

Why did we need to use the copy method to duplicate the dataframe above if variables are not linked to each other? Why not assign a new variable with the value of the existing dataframe object?

PYTHON

df2 = surveys

This gets to mutability: a feature of Python that has caused headaches for many novices in the past! In the interests of memory management, Python avoids making copies of objects unless it has to. Some types of objects are immutable, meaning that their value cannot be modified once set. Immutable object types we have already encountered include strings, integers, and floats. If we want to adjust the value of an integer variable, we must explicitly overwrite it.
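
For example, an integer value cannot be adjusted in place; we reassign the variable to a new value instead:

PYTHON

count = 3
count = count + 1 # the righthand side evaluates to 4, and count is reassigned to that new object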

Other types of object are mutable, meaning that their value can be changed “in-place” without needing to be explicitly overwritten. This includes lists and pandas DataFrame objects, whose contents can be reordered, extended, or overwritten after they are created.

When a new variable is assigned the value of an existing immutable object, the two variables behave as independent copies: since the value cannot be modified in place, nothing done to one variable can affect the other.

PYTHON

a = 3.5 # a new float object, named "a"
b = a   # "b" also has the value 3.5; reassigning either variable leaves the other at 3.5

When a new variable is assigned the value of an existing mutable object, Python makes the new name point at the same object instead of duplicating it.

PYTHON

some_species = ['NL', 'DM', 'PF', 'PE', 'DS'] # new list object, called "some_species"
some_more_species = some_species # another name for the same list object

This can have unintended consequences and lead to much confusion!

PYTHON

some_more_species[2] = "CV"
some_species

OUTPUT

['NL', 'DM', 'CV', 'PE', 'DS']

As you can see here, the “PF” value was replaced by “CV” under both names, even though we only made the change through some_more_species. There is just one list: some_species and some_more_species both point to it, so a change made via either name is visible through the other.
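
If you want to check whether two variables refer to the same object, the is operator compares identity rather than value:

PYTHON

some_more_species is some_species

OUTPUT

True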

This takes practice and time to get used to. The key thing to remember is to use the copy method when you want to duplicate a dataframe, so that you do not accidentally modify the data in the original.

PYTHON

df2 = surveys.copy()
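
With a true copy, modifications to df2 will no longer leak back into surveys. The same advice applies to lists, which have their own copy method, as this continuation of the earlier example shows:

PYTHON

more_species = some_species.copy() # an independent list with the same values
more_species[0] = 'PP'             # modify the copy...
some_species                       # ...and the original is unaffected

OUTPUT

['NL', 'DM', 'CV', 'PE', 'DS']
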
Key Points
  • pandas DataFrames carry many methods that can help you explore the properties and distribution of data.
  • Using the help function, reading error messages, and asking for help are all good strategies when things go wrong.
  • The type of an object determines what kinds of operations you can perform on and with it.
  • Python evaluates expressions in a line one by one before assigning the final result to a variable.