Summary and Setup

More data are better than less data, right? When data are interpreted with sophisticated analytical skills, the answer can be yes. Without these skills, analysts can be fooled by patterns in “big data” that arise by chance. This lesson presents statistical skills and knowledge that help data analysts in the life sciences avoid some of the most common pitfalls of big data. Lesson material is derived from the HarvardX Biomedical Data Science series, part of which is published as the book Data Analysis for the Life Sciences (Irizarry & Love, 2016).

Prerequisites

This lesson assumes basic skills in the R statistical programming language and the RStudio integrated development environment.

To get started, follow the directions in the Setup tab to get access to the required software and data for this workshop.

Installation


R is a programming language that is especially powerful for data exploration, visualization, and statistical analysis. To interact with R, we use RStudio.

  1. Install the latest version of R from CRAN.

  2. Install the latest version of RStudio. Choose the free RStudio Desktop version for Windows, Mac, or Linux.

  3. Start RStudio. The tidyverse contains several packages that work together for everyday use in data science. You can install them from the Console or from the RStudio Packages tab.

R

install.packages("tidyverse")

Make sure that the installation was successful by loading the tidyverse package. Do this in the Console as below, or check the box next to tidyverse in the RStudio Packages tab.

R

library(tidyverse)

Also install and load the downloader and rafalib packages by following the same procedure that you followed for the tidyverse.
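
For the two additional packages, the same install-then-load procedure can be sketched in the Console as follows. This is one possible approach, not the only one: the variable names `pkgs`, `to_install`, and `loaded` are illustrative, and the `repos` URL is a CRAN mirror chosen here as an assumption (any CRAN mirror should work).

```r
# Packages named in this lesson's setup instructions
pkgs <- c("downloader", "rafalib")

# Install only the packages that are not already present
to_install <- setdiff(pkgs, rownames(installed.packages()))
if (length(to_install) > 0) {
  # Assumption: this CRAN mirror is used for illustration
  install.packages(to_install, repos = "https://cloud.r-project.org")
}

# Load each package; require() returns TRUE on success, FALSE otherwise
loaded <- sapply(pkgs, require, character.only = TRUE)
print(loaded)
```

If any entry of `loaded` is `FALSE`, that package did not install or load, and the installation step is worth repeating.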

Data files and project organization


  1. Make a new folder on your Desktop called inference. Move into this new folder.

  2. Create a data folder to hold the data, a scripts folder to house your scripts, and a results folder to hold results.

Alternatively, you can use the R console to run the following commands for steps 1 and 2.

R

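# Step 1: create the project folder on the Desktop and move into it.
# Note: on Windows, "~" may point to Documents; adjust the path if needed.
setwd("~/Desktop")
dir.create("./inference")
setwd("~/Desktop/inference")
# Step 2: create folders for data, scripts, and results
dir.create("./data")
dir.create("./scripts")
dir.create("./results")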
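
To confirm that steps 1 and 2 worked, a quick check along these lines may help. It is a minimal sketch that assumes the working directory is the inference folder; the names `project_dirs` and `dir_status` are illustrative.

```r
# Folders expected inside the inference project folder
project_dirs <- c("data", "scripts", "results")

# TRUE for each folder that exists in the current working directory
dir_status <- setNames(dir.exists(project_dirs), project_dirs)
print(dir_status)

# Show the working directory in case a folder is reported missing
print(getwd())
```

A `FALSE` entry usually means the working directory is not the inference folder, or that the folder was not created.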

Please download the three data files used in this lesson (femaleMiceWeights.csv, femaleControlsPopulation.csv, and mice_pheno.csv) and place them in your data folder. You can download the files from the URLs below and move them the same way that you would download and move any other kind of data.

Alternatively, you can copy and paste the following into the R console to download the data.

R

download.file(url = "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleMiceWeights.csv", destfile = "data/femaleMiceWeights.csv")

download.file(url = "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv", destfile = "data/femaleControlsPopulation.csv")
 
download.file(url = "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/mice_pheno.csv", destfile = "data/mice_pheno.csv")
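
After downloading, it can be useful to check that the files arrived where expected. The sketch below assumes the working directory is the inference folder and that the files were saved under the names used above; `data_files`, `files_present`, and `mice_weights` are illustrative names.

```r
# Paths to the three lesson files inside the data folder
data_files <- file.path("data", c("femaleMiceWeights.csv",
                                  "femaleControlsPopulation.csv",
                                  "mice_pheno.csv"))

# TRUE for each file that is present
files_present <- file.exists(data_files)
names(files_present) <- basename(data_files)
print(files_present)

# Preview the first file, but only if it was downloaded
if (files_present["femaleMiceWeights.csv"]) {
  mice_weights <- read.csv(data_files[1])
  print(head(mice_weights))
}
```

If a file is reported as missing, rerun the corresponding `download.file()` call and check that the `destfile` path matches your data folder.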