Summary and Schedule
More data are better than less data, right? When the data are interpreted with sound analytical skills, the answer can be yes. Without these skills, analysts can be tricked by patterns in “big data” that appear by chance. This lesson presents statistical skills and knowledge that help data analysts in the life sciences avoid some of the most common pitfalls of big data. Lesson material is derived from the HarvardX Biomedical Data Science series, part of which is published as the book Data Analysis for the Life Sciences (Irizarry & Love, 2016).
Prerequisites
This lesson assumes basic skills in the R statistical programming language and the RStudio integrated development environment.
To get started, follow the directions in the Setup tab to get access to the required software and data for this workshop.
Setup Instructions | Download files required for the lesson | |
Duration: 00h 00m | 1. Introduction | What is statistical inference? Why do biomedical researchers need to learn statistics now? |
Duration: 00h 05m | 2. Inference | What does inference mean? Why do we need p-values and confidence intervals? What is a random variable? What exactly is a distribution? |
Duration: 02h 00m | 3. Populations, Samples and Estimates | What is a parameter from a population? What are sample estimates? How can we use sample estimates to make inferences about population parameters? |
Duration: 02h 00m | 4. Central Limit Theorem and the t-distribution | What is a parameter from a population? |
Duration: 02h 50m | 5. Central Limit Theorem in practice | How is the CLT used in practice? |
Duration: 02h 50m | 6. t-tests in practice | How are t-tests used in practice? |
Duration: 02h 50m | 7. Confidence Intervals | What is a confidence interval? When is it best to use a confidence interval? |
Duration: 03h 10m | 8. Power Calculations | What is statistical power? How is power calculated? |
Duration: 03h 50m | 9. Monte Carlo simulation | How are Monte Carlo simulations used in practice? |
Duration: 04h 45m | 10. Permutations | What is a permutation test? When is a permutation test helpful? |
Duration: 05h 25m | 11. Association tests | What is the Chi-squared test? What is Fisher’s exact test? When would these tests be used? |
Duration: 06h 05m | 12. Exploratory Data Analysis | How can data be visualized to reveal important relationships? What is exploratory data analysis? |
Duration: 06h 45m | 13. Plots to avoid | ? |
Duration: 07h 05m | Finish |
The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
Installation
R is a programming language that is especially powerful for data exploration, visualization, and statistical analysis. To interact with R, we use RStudio.
Install the latest version of R from CRAN.
Install the latest version of RStudio here. Choose the free RStudio Desktop version for Windows, Mac, or Linux.
Start RStudio. The tidyverse contains several packages that work together for everyday use in data science. You can install them from the Console or from the RStudio Packages tab.
R
install.packages("tidyverse")
Make sure that the installation was successful by loading the tidyverse library. Do this in the Console as below, or check the box next to the tidyverse library in the RStudio Packages tab.
R
library(tidyverse)
Also install and load the libraries for downloader and rafalib by following the same procedure that you followed for the tidyverse.
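As a minimal sketch of that procedure from the Console, the following installs downloader and rafalib only if they are not already present and then loads them. The variable names `pkgs` and `missing_pkgs` and the mirror URL are assumptions made for illustration, not part of the lesson; calling install.packages() on each package name, as was done for the tidyverse, works just as well.

```r
# Packages the lesson asks for in addition to the tidyverse.
pkgs <- c("downloader", "rafalib")

# Find the ones that are not installed yet.
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

# Install only what is missing.
# The mirror URL is an assumption; any CRAN mirror should work.
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs, repos = "https://cloud.r-project.org")
}

# Load both packages; library() stops with an error if one is still unavailable.
library(downloader)
library(rafalib)
```

If either library() call fails, rerun the installation step and read the messages in the Console for the cause.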
Data files and project organization
1. Make a new folder on your Desktop called inference. Move into this new folder.
2. Create a data folder to hold the data, a scripts folder to house your scripts, and a results folder to hold results.
Alternatively, you can use the R console to run the following commands for steps 1 and 2.
R
setwd("~/Desktop")
dir.create("./inference")
setwd("~/Desktop/inference")
dir.create("./data")
dir.create("./scripts")
dir.create("./results")
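The commands above change the working directory and warn if a folder already exists. A possible alternative, sketched here under the assumption that you would rather not change directories, wraps the same layout in a small helper that is safe to rerun. The function name `make_project` and the variable `demo_root` are hypothetical and not part of the lesson; the example uses a temporary directory so that it can be tried without touching the Desktop.

```r
# Hypothetical helper (not part of the lesson): create the lesson's
# data, scripts and results folders under `root`.
make_project <- function(root, subdirs = c("data", "scripts", "results")) {
  paths <- file.path(root, subdirs)
  for (p in paths) {
    # recursive = TRUE also creates `root` if needed;
    # showWarnings = FALSE keeps reruns quiet when a folder already exists.
    dir.create(p, recursive = TRUE, showWarnings = FALSE)
  }
  # Report which folders now exist, named by subfolder.
  setNames(dir.exists(paths), subdirs)
}

# Try it in a temporary location; for the lesson itself,
# use "~/Desktop/inference" as the root instead.
demo_root <- file.path(tempdir(), "inference")
created <- make_project(demo_root)
print(created)
```

The rest of the lesson's commands use relative paths such as data/femaleMiceWeights.csv, so the working directory still needs to be the inference folder when you run them.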
Please download the following files and place them in your data folder. You can download the files from the URLs below and move the files the same way that you would for downloading and moving any other kind of data.
Alternatively, you can copy and paste the following into the R console to download the data.
R
download.file(url = "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleMiceWeights.csv", destfile = "data/femaleMiceWeights.csv")
download.file(url = "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv", destfile = "data/femaleControlsPopulation.csv")
download.file(url = "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/mice_pheno.csv", destfile = "data/mice_pheno.csv")
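To confirm that the downloads worked, one option is to check that each file is in the data folder and can be read. This is only a sketch: the file names come from the URLs above, the variable names `data_files` and `found` are assumptions, and the code expects the working directory to be the inference folder.

```r
# File names taken from the download URLs in this lesson.
data_files <- file.path("data", c("femaleMiceWeights.csv",
                                  "femaleControlsPopulation.csv",
                                  "mice_pheno.csv"))

# TRUE for each file that is present in the data folder.
found <- file.exists(data_files)
names(found) <- basename(data_files)
print(found)

# For the files that are present, read them and report their dimensions.
for (f in data_files[found]) {
  dat <- read.csv(f)
  cat(basename(f), ":", nrow(dat), "rows,", ncol(dat), "columns\n")
}
```

Any file reported as FALSE is missing from the data folder; check the working directory with getwd() and repeat the download for that file.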