Summary and Setup
More data are better than less data, right? When analyzed with sound statistical skills, the answer can be yes. Without these skills, analysts can be misled by patterns in “big data” that appear by chance. This lesson presents statistical skills and knowledge that help data analysts in the life sciences avoid some of the most common pitfalls of big data. Lesson material is derived from the HarvardX Biomedical Data Science series, part of which is published as the book Data Analysis for the Life Sciences (Irizarry & Love, 2016).
Prerequisites
This lesson assumes basic skills in the R statistical programming language and the RStudio integrated development environment.
To get started, follow the directions in the Setup tab to get access to the required software and data for this workshop.
Installation
R is a programming language that is especially powerful for data exploration, visualization, and statistical analysis. To interact with R, we use RStudio.
Install the latest version of R from CRAN.
Install the latest version of RStudio here. Choose the free RStudio Desktop version for Windows, Mac, or Linux.
Start RStudio. The tidyverse contains several packages that work together for everyday use in data science. You can install them from the Console or from the RStudio Packages tab.
R
install.packages("tidyverse")
Make sure that the installation was successful by loading the tidyverse library. Do this in the Console as below, or check the box next to the tidyverse library in the RStudio Packages tab.
R
library(tidyverse)
Also install and load the libraries for downloader and rafalib by following the same procedure that you followed for the tidyverse.
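The first half of that procedure can be sketched for these two packages as follows. This is only an illustration: the helper name missing_packages is hypothetical (it is not part of the lesson), and it uses base R's requireNamespace() to report which packages are not yet installed, so that nothing is reinstalled unnecessarily.

```r
# Hypothetical helper (not part of the lesson): return the names of the
# packages in `pkgs` that are not yet installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages this lesson asks for in addition to the tidyverse.
lesson_pkgs <- c("downloader", "rafalib")
to_install <- missing_packages(lesson_pkgs)
```

If to_install is not empty, install.packages(to_install) installs what is missing. After that, library(downloader) and library(rafalib) load the packages, just as for the tidyverse.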
Data files and project organization
1. Make a new folder in your Desktop called inference. Move into this new folder.
2. Create a data folder to hold the data, a scripts folder to house your scripts, and a results folder to hold results.
Alternatively, you can use the R console to run the following commands for steps 1 and 2.
R
setwd("~/Desktop")
dir.create("./inference")
setwd("~/Desktop/inference")
dir.create("./data")
dir.create("./scripts")
dir.create("./results")
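As a variation on the commands above, the same folder layout can be created without changing the working directory. This is a minimal sketch: the function name make_project is hypothetical, and a temporary directory stands in for the Desktop so that the example can be run without touching your own files.

```r
# Hypothetical helper: create a project root together with its data,
# scripts, and results subfolders.
make_project <- function(root) {
  subfolders <- file.path(root, c("data", "scripts", "results"))
  for (folder in subfolders) {
    # recursive = TRUE also creates `root` if it does not exist yet;
    # showWarnings = FALSE keeps reruns quiet when a folder is already there.
    dir.create(folder, recursive = TRUE, showWarnings = FALSE)
  }
  # Report whether each subfolder now exists.
  dir.exists(subfolders)
}

# A temporary directory stands in for ~/Desktop in this example.
project_root <- file.path(tempdir(), "inference")
created <- make_project(project_root)
```

For the lesson itself, calling make_project("~/Desktop/inference") and then setwd("~/Desktop/inference") should give the same layout as the commands above.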
Please download the following files and place them in your data folder. You can download the files from the URLs below and move the files the same way that you would for downloading and moving any other kind of data.
Alternatively, you can copy and paste the following into the R console to download the data.
R
download.file(url = "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleMiceWeights.csv", destfile = "data/femaleMiceWeights.csv")
download.file(url = "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv", destfile = "data/femaleControlsPopulation.csv")
download.file(url = "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/mice_pheno.csv", destfile = "data/mice_pheno.csv")
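Once the downloads finish, a quick check that all three files landed in the data folder can save confusion later. The sketch below assumes that the working directory is the inference folder, so that data is a valid relative path; the helper name check_data_files is hypothetical.

```r
# The three file names used in the download commands above.
data_files <- c("femaleMiceWeights.csv",
                "femaleControlsPopulation.csv",
                "mice_pheno.csv")

# Hypothetical helper: report, by file name, whether each expected file
# is present in `dir`.
check_data_files <- function(dir = "data") {
  present <- file.exists(file.path(dir, data_files))
  names(present) <- data_files
  present
}

status <- check_data_files()
```

Any entry of status that is FALSE marks a file that still needs to be downloaded or moved into the data folder.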