AI assisted plotting with R

Last updated on 2026-04-16 | Edit this page

Overview

Questions

  • What are potential accessibility benefits of AI for data visualization?
  • How can you use AI in ways that improve your comprehension of visualizations and code?
  • What are risks of using AI to help you do visualizations?

Objectives

  • Recreate the biodiversity and racial redlining bar chart from PNAS.
  • Use key statistical and visualization concepts when employing AI to help you write R code
  • Test and save the code using R Studio or Jupyter Lite.

A. Biodiversity and redlining bar chart revisited.


In the previous episode, we learned how to adapt the code we wrote for the Du Bois literacy bar chart to recreate a bar chart of biodiversity data in the Du Bois style.

We’re going to recreate the bar chart again by using AI to:

  1. write R code for generating the chart in the Du Boisian style.
  2. write comments explaining the code in the chart.
  3. gain an understanding of the chart by manually deleting portions of the code to make the graph simpler, and debugging problems that arise when we delete necessary parts of the code.

As noted in the prior episode, the biodiversity bar chart below comes from an article in the Proceedings of the National Academy of Science (PNAS) here The bar chart shows lower biodiversity in San Diego neighborhoods that were redlined in the middle of the 20th century because they had large numbers of non-white residents. The home owners loan corporation (HOLC) tracked neighborhoods that would have been redlined by the Federal Housing Adminsation and local lenders, making it harder to get affordable homeloans in the neighborhood and “serving as a proxy for numerous prior and existing racialized policies at the federal, state, and local level that reinforced racial segregation, discrimination, and disinvestment” (Estian et al. 2024). The chart shows lower biodersity scores (0-100) in neighborhoods with lower A to D HOLC letter grades (indicating worse redlining treatment of neighborhoods).

B. Potential benefits and risks in AI assisted visualization


Potential benefits from AI are possible from ways it can make computational visualization more accessible to beginners who do not yet know a language like R:

  • The latest AI tools allow us to write commands for AI models using natural language. So when we write an AI prompt, it’s like writing code but in the language we use in everyday discussion.
  • AI will often carry out natural language prompts using the same languages we use for data visualization like R and Python.

Major risks also exist when using AI for data visualization:

  • While AI models can generate reproduceable visualizations, they require consistent natural language commands based in an understanding of visualization practices and statistics. Even then, AI can produce inconsistent and incorrect results.
  • It’s always important to check results from AI by requesting and reviewing code in R that shows how a visualization is created.
  • If one does not know fundamental statistical or programming concepts, it is difficult or impossible to check the reliability of AI generated code and charts.



C. An epic fail without fundamental concepts


To see the risk of using AI without knowledge of key concepts, we’ll ask AI tools to help recreate the biodiversity chart:

  1. Open ChatGPT, Claude, or Gemini in another browser window.
  2. Right click the image above and paste it into the AI chat.
  3. Write and submit the command below in the chat, copy and past the url for our data.

AI Prompt: give me code for generating a bar chart in R using only the tidyverse library with this data: https://raw.githubusercontent.com/HigherEdData/Du-Bois-STEM/main/data/d_biodiversity_redlining.csv

Next, copy and paste the code into an .Rmd file or a JupyterLite Du Bois Notebook. Or use another Jupyter Lite tool with R.

After running the code, click the dropdown below to see if you get the same code and epic failure of a result that we did.

Note, if you are using Jupyter Lite and AI suggests using read_csv, instead use read.csv.

Code and results ChatGPT gave the authors:

R

library(tidyverse)

OUTPUT

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

R

df <- read.csv("https://raw.githubusercontent.com/HigherEdData/Du-Bois-STEM/main/data/d_biodiversity_redlining.csv")

df %>%
  count(grade) %>%
  ggplot(aes(x = grade, y = n)) +
  geom_col(fill = "steelblue") +
  labs(x = "Grade", y = "Count") +
  theme_minimal()



D. Using concepts for better natural language prompts


To use AI to reproduce the original graph, we need to write a natural language prompt that better communicates the content of our data. This requires some key concepts in statistics and graphing, namely that bar heights typically represent a statistical measure of a variable within different categories. But what statistical measure?

Look at the data from the csv as reported in the ChatGPT output. Or check it yourself by using df to print the full data frame. Then think about these questions:

  • What data is contained in each row of the grade column and the biodiversity column?
  • How does the original chart represent this data?
  • How does the ChatGPT chart treat the data differently than the original chart above?
  • What clue does the R code from ChatGPT provide about the statistic it is plotting?
  • What is a better AI prompt for generating R code?
Click here for some answers:

The data in the csv collapses the author’s original data into four observations, one for each HOLC redlining letter grade. The biodiversity column then contains the mean biodiversity score for each neighborhood receiving one of the four letter grades.

The original bar chart plots a bar with a bar height corresponding to the mean biodiversity score for each letter grade.

ChatGPT instead assumed that we were plotting bars that represent a count of how many observations there were for each letter grade – 1 for each letter grade.

Try a revised prompt that communicates the form of the data and the statistics that bars should represent.

AI prompt: treat the biodiversity column values as the the height of the bar. they represent the mean biodiversity score.

Don’t just try the code AI gives you back, but also read its supporting explanations to see if they make sense.

ChatGPT provided the authors with this code and result that more closely resembles the original graph:

R

library(tidyverse)

# load data
df <- read.csv("https://raw.githubusercontent.com/HigherEdData/Du-Bois-STEM/main/data/d_biodiversity_redlining.csv")

# check column names
glimpse(df)

OUTPUT

Rows: 4
Columns: 2
$ grade        <chr> "A", "B", "C", "D"
$ biodiversity <int> 48, 24, 12, 4

R

# bar chart: biodiversity values as bar height
df %>%
  ggplot(aes(
    x = reorder(grade, biodiversity),  # grouping variable
    y = biodiversity                   # mean biodiversity score
  )) +
  geom_col(fill = "steelblue") +
  labs(
    title = "Mean Biodiversity Score by Redlining Grade",
    x = "Redlining Grade",
    y = "Mean Biodiversity Score"
  ) +
  theme_minimal()



E. Add comments that explain what the code does


Once you think the code has the fundamentals right, you can use AI to help explain what each line of code does to help you read through the code and make sure you understand what it’s doing.

Let’s combine this with a small fix to the bar ordering in the prior graph:

AI Prompt: Change the order of the bars so that they are in order of grade from A to D. Add commentsthat explain what is done by each line R code for the graph.

Click to see the commented code ChatGPT gave us back:

R

# load the tidyverse (includes ggplot2, dplyr, readr, etc.)
library(tidyverse)

# read the CSV file directly from the provided URL into a dataframe
df <- read.csv("https://raw.githubusercontent.com/HigherEdData/Du-Bois-STEM/main/data/d_biodiversity_redlining.csv")

# inspect the structure of the dataset to confirm column names and types
glimpse(df)

OUTPUT

Rows: 4
Columns: 2
$ grade        <chr> "A", "B", "C", "D"
$ biodiversity <int> 48, 24, 12, 4

R

# convert the grade column into an ordered factor so bars appear A → D
df <- df %>%
  mutate(
    grade = factor(grade, levels = c("A", "B", "C", "D"))
  )

# create the bar chart
df %>%
  ggplot(aes(
    x = grade,          # place redlining grade on the x-axis
    y = biodiversity    # use biodiversity values as bar height (mean scores)
  )) +
  geom_col(             # create bars with heights equal to y values
    fill = "steelblue"  # set bar color
  ) +
  labs(
    title = "Mean Biodiversity Score by Redlining Grade",  # chart title
    x = "Redlining Grade",                                 # x-axis label
    y = "Mean Biodiversity Score"                          # y-axis label
  ) +
  theme_minimal()      # apply a clean, minimal theme



F. Learn by commenting out unnecessary code.


Legible code is succinct code. You can better learn what different parts of this code do by commenting out lines that you think are unnecessary. In R, we do this by adding the # comment symbol before a line of code.

Then you can debug errors this creates or delete the # to restore the code.

Sometimes, different lines of code are connected to each other. So if you comment out or change a line of code, you have to change another.

What do you think can be just commented out without breaking other essential code?

What we comment out:

If you try commenting out this code, you’ll find it’s totally unnecessary.

R

# convert the grade column into an ordered factor so bars appear A → D
 df <- df %>%
  mutate(
    grade = factor(grade, levels = c("A", "B", "C", "D"))
  )



G. Using AI for stylization


Once you have R code for a chart with fundamentals you understand, you will be on stronger footing to ask AI to help modify your code and graph in a particular style.

Copy and paste your code into the prompt and try this:

AI prompt: revise this code to make the chart in the Du Bois style described in the Du Bois visualization tool kit.

AI prompt: revise this code to make the chart according to Kieran Healy’s best practices of making figures

ChatGPT response authors got for Du Bois style:

Chat GPT really wants to put back in extraneous mutation code.

To get more Du Boisian, you would need to direct AI to make more changes like using horizontal bars and the tall poster aspect ratio of Du Bois’ original posters.

R

# load tidyverse
library(tidyverse)

# read data (use read_csv for tidyverse consistency)
df <- read.csv("https://raw.githubusercontent.com/HigherEdData/Du-Bois-STEM/main/data/d_biodiversity_redlining.csv")

# inspect structure
glimpse(df)

OUTPUT

Rows: 4
Columns: 2
$ grade        <chr> "A", "B", "C", "D"
$ biodiversity <int> 48, 24, 12, 4

R

# ensure grade is ordered
df <- df %>%
  mutate(grade = factor(grade, levels = c("A", "B", "C", "D")))

# summarize biodiversity by grade (important!)
summary_df <- df %>%
  group_by(grade) %>%
  summarize(mean_biodiversity = mean(biodiversity, na.rm = TRUE))

# Du Bois–style bar chart
ggplot(summary_df, aes(x = grade, y = mean_biodiversity, fill = grade)) +
  geom_col(
    width = 0.65,
    color = "black",
    linewidth = 1.2
  ) +
  geom_text(
    aes(label = round(mean_biodiversity, 1)),
    vjust = -0.6,
    family = "serif",
    size = 5
  ) +
  scale_fill_manual(values = c(
    "A" = "#2A9D8F",   # teal
    "B" = "#E9C46A",   # gold
    "C" = "#F4A261",   # orange
    "D" = "#E63946"    # red
  )) +
  labs(
    title = "Biodiversity by Redlining Grade",
    subtitle = "A Du Bois–Inspired Visualization",
    x = NULL,
    y = NULL
  ) +
  theme(
    # warm background (key Du Bois look)
    plot.background = element_rect(fill = "#F5E6C8", color = NA),
    panel.background = element_rect(fill = "#F5E6C8", color = NA),

    # remove gridlines completely
    panel.grid = element_blank(),

    # bold, centered titles
    plot.title = element_text(
      family = "serif",
      face = "bold",
      size = 20,
      hjust = 0.5
    ),
    plot.subtitle = element_text(
      family = "serif",
      size = 13,
      hjust = 0.5
    ),

    # x-axis styling (bold category labels)
    axis.text.x = element_text(
      family = "serif",
      face = "bold",
      size = 14
    ),

    # remove y-axis text/ticks for cleaner look
    axis.text.y = element_blank(),
    axis.ticks = element_blank(),

    # emphasize baseline only
    axis.line.x = element_line(color = "black", linewidth = 1.2),

    # remove legend (color already encoded in x)
    legend.position = "none"
  )



ChatGPT response authors got for Healy style::

Healy prompt adds some notes on sourcing and explanation. ChatGPT again adds unnecessary code to summarize the data rather than plot the data as is.

R

# load tidyverse
library(tidyverse)

# read data
df <- read_csv("https://raw.githubusercontent.com/HigherEdData/Du-Bois-STEM/main/data/d_biodiversity_redlining.csv")

OUTPUT

Rows: 4 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): grade
dbl (1): biodiversity

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

R

# check structure
glimpse(df)

OUTPUT

Rows: 4
Columns: 2
$ grade        <chr> "A", "B", "C", "D"
$ biodiversity <dbl> 48, 24, 12, 4

R

# clean + summarize data properly
summary_df <- df %>%
  mutate(grade = factor(grade, levels = c("A", "B", "C", "D"))) %>%
  group_by(grade) %>%
  summarize(
    mean_biodiversity = mean(biodiversity, na.rm = TRUE),
    .groups = "drop"
  )

# plot using Healy-style best practices
ggplot(summary_df, aes(x = grade, y = mean_biodiversity)) +
  geom_col(fill = "gray30", width = 0.7) +
  labs(
    title = "Mean Biodiversity Score by Redlining Grade",
    subtitle = "Average biodiversity decreases from A to D graded neighborhoods",
    x = "HOLC Redlining Grade",
    y = "Mean Biodiversity Score",
    caption = "Source: HigherEdData Du Bois STEM dataset"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    # improve readability
    plot.title = element_text(face = "bold"),
    plot.subtitle = element_text(margin = margin(b = 10)),
    
    # keep useful gridlines, remove clutter
    panel.grid.minor = element_blank(),
    
    # clearer axis text
    axis.text.x = element_text(face = "bold"),
    
    # better spacing
    plot.caption = element_text(size = 9, color = "gray40")
  )



Key Points
  • Using AI without fundamental statistics and visualization concepts can lead to epic fails
  • Using key visualization and statistics concepts for better natural language prompts
  • AI can add comments that explain what the code does - read them
  • Delete code that might be unnecessary to simplify code and only use code you can understand