This lesson is still being designed and assembled (Pre-Alpha version)

Getting Data into R

Overview

Teaching: 50 min
Exercises: 9 min
Questions
  • How do I import Data into R

Objectives
  • Introducing the data pipeline

  • Doing a quick nose around dataframes in R

  • Get data out of your spreadsheet

  • {“Importing”=>”Scripting your import (and file paths)”}

  • Google Sheets bonus round

The data pipeline

We strongly believe that the greatest strength you will gain from the work today is reproducibility. Learning ‘coding’ is a pain, and for a ‘one off’ process there will nearly always be some tempting ‘point-and-click’ alternative. However, learning to code, means being forced to write down what you did to get from your data to your result. You’ll only have to reproduce something once to appreciate the value of having everything documented. Whether it is an error in your data, an new observation, or your supervisor or collaborator asking you to repeat an analysis with a twist, you will always be revisiting and revising your work. And the initial investment in writing (i.e. coding) your analysis means that these adjustments become simple.

Traditional workflow

  1. Collect data on paper
  2. Transcribe paper forms into Excel or similar
  3. Import Excel into SPSS or another stats package
  4. Perform analyses and make graphs
  5. Copy/paste these into your final document (Word, Powerpoint) etc

Problems

A data pipeline (aka scripting)

Learn just enough code to write down your ‘workflow’. Ensure the workflow starts with the raw data, and then any changes in the workflow or the data will automatically propagate.

  1. Collect data electronically
  2. Write a script that converts your data into figures and tables
  3. Knit everything into a single report?

OK. Step 3 is showing off, and probably too much for today but as an FYI everything you are reading now was written in R including the text, the code, the tables and the figures. When we want to update something, we just edit and ‘knit’. Look at the course by Roger Peng at Johns Hopkins on Coursera if you want to learn more.

Getting data out of your spreadsheet: CSV files

We need a ‘common data language’ that all spreadsheet languages can write to, and all statistical and programming languages can read. Arguably one of the oldest, and most ubiquitous is the ‘comma separated values’ (CSV) file. This is easily exportable from Microsoft Excel, Apple Numbers, Open Office, Google Sheets…etc.

It’s a simple format. The top line are the column names, each seperated by a comma. The following lines are the observations in those columns, again, seperated by a comma.

It’s strength is in it’s simplicity. It only has data, no formulas, no tricks and is very well recognised amongst software packages as it is very easily supported. R has excellent support for CSV.

Export CSV From Excel

Exercise: Export an .xlsx file to .csv

  1. Find the cleaned .xlsx file that was produced in the Excel Hell lecture and export it as a .csv file. If all else fails then we have a copy here.

Once a sheet has been exported, it can be imported into R:

Getting CSV data into R

There’re many ways to do this but mainly:

  1. Use the read_csv function from the readr package.
  2. Make a connection to a database or an online Google Sheets spreadsheet. This is left to you to do as an exercise.

Scripting

We are going import out data using readr in 3 steps:

  1. Find the path to your file. We covered file paths in lesson 1, and this is just the formal address of your file on your computer. A quick way to do this in R is to use the file.choose function.
  2. We will take the path you have generated and name it myfile.
  3. We will use a function called read_csv from the readr library to import the data into a data frame (which we in turn name df for convenience).

Try right clicking a file in Finder (on a Mac) or Windows Explorer (on a PC). You’ll normally see an option for ‘info’ or ‘properties’ that will show you the path to your file.

install.packages("readr")       # install only needed the first time
library(readr)                  # load the readr family functions including read_csv
myfile <- "the/path/to/my/csv/file.csv"
dataframename <- read_csv(myfile)

You could have done this in one step, but it would have made things harder to read. Hard to read, means difficult to remember, and we are doing our best to avoid that!

There is a function read.csv provided by R but read_csv is better. The built in function has a couple of annoying ‘habits’. If you wish to use it then don’t forget to specify: (1) header=TRUE which tells the function to expect column names in row 1 instead of data, and (2) stringsAsFactors=FALSE which is a necessary but annoying reminder to R that you want it to leave ‘strings’ alone and not convert them to ‘labelled’ factors.

Exercise: Import the .csv into R

  1. Import the .csv file from the Excel Hell lesson as a dataframe and call it RCT
  2. Name all the columns in the dataframe.

Everything is a “data-frame”

Step 1 in your pipeline has to be getting your (now beautifully tidy) spreadsheet into R.

Spreadsheets in R are called ‘dataframes’.

A data-frame has columns, each identified by a name, and rows for observations. This means that however you import your data into R within your pipeline, those data will end up as a dataframe. Let’s do a quick tour of dataframes in R.

# Have a look at the first few rows of data you've imported
head(RCT)

# Have a look at the last few rows
tail(RCT)

# Have a look at the entire dataframe
View(RCT)

The ‘rows’ are contain measurements from different individuals, and the columns contain the measurements. Just like in Excel….

Most of the time the rows are ‘observations’ and we want to pick out ‘characteristics’ of those observations (i.e. the columns). We can just ask for a columm by name using the $ operator: e.g. dataframename$some_column

# First we need to know what column names exist
names(RCT)

RCT$Age
RCT$Gender

Exercise: Taking a quick look at the data

  1. What are the names of the columns? Hint: try names()

  2. Display in the console the items in the column PS_Prior_op

  3. Can you do the same for all the other columns in RCT?

  4. Can you find out how many patients were randomised to each arm? Hint: you can use the function table() or xtab()

  5. Can you find out how many subjects there are?

Exercise: Answers

  1. What are the names of the columns? names(RCT)

  2. Display in the console the items in the column RCT$PS_Prior_op

  3. Can you do the same for all the other columns in RCT? RCT$Gender, RCT$Age, etc.

  4. Can you find out how many patients were randomised to each arm? table(RCT$Randomisation)

  5. Can you find out how many subjects there are? nrow(RCT)

Google Sheets

We can also load data through a connection with Google Sheets. What’s good about this? Collaboration and live-updating. Let’s use this system to import a cleaned version of the breast RCT data we have been working on.

We’re going to use data in a shared sheet we’ve called rct-clean.

The sheet is available here.

The googlesheets library

First we need the functions someone else has kindly written that allow R to talk to google sheets. You will need to install the library first.

install.packages("googlesheets")

Now let’s have a look at the sheets in your account. We’ll use the gs_ls() function.

The first time you run this you will be asked to authenticate (a browser window will open and ask you to sign in to your google account). Behind the scenes R now saves a hidden file into your working directory. The next time you ever run your script, as long as you haven’t moved your code to a new directory, it won’t need to ask.

library(googlesheets)
gs_ls()

You can see a list of google sheets. It is now straightforward to read these into R. Let’s import the shared sheet rct-clean, and name it gsheet.

gsheet <- gs_title("rct-clean")

So far so good, but sheet isn’t a data frame, it’s just a connection to that sheet. There’s one more step.

df <- gs_read(gsheet)
df

Bonus: The googlesheets library is actually doing one better than making a dataframe, and instead makes a tibble. Say it quickly and it sounds like table? It’s just a fancy modern version of R’s basic dataframe. You don’t need to worry about the difference.

Exercise: Taking a quick look at the data

a. What are the names of the columns?

b. Can you have a look at the column containing the age of the patients?

c. Can you find out how many patients were randomised to each arm? Hint: you can use the function table() or xtab()

d. Can you find out how many subjects there are?

Answers

a. What are the names of the columns?

names(df)

b. Can you have a look at the column containing the age of the patients?

df$age

c. Can you find out how many patients were randomised to each arm? Hint: you can use the function table() or xtab()

table(df$random)

d. Can you find out how many subjects there are?

nrow(df)

Key Points

  • Importing data into R using commands provides long term benefits for reproducibility