Summary and Schedule

More data are better than less data, right? When the data are interpreted with sophisticated analytical skills, the answer can be yes. Without these skills, analysts can be misled by patterns in “big data” that appear by chance. This lesson presents statistical skills and knowledge that help data analysts in the life sciences avoid some of the most common pitfalls of big data. Lesson material is derived from the HarvardX Biomedical Data Science series, part of which is published as the book Data Analysis for the Life Sciences (Irizarry & Love, 2016).
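
As a rough illustration of how patterns can appear by chance, the sketch below tests many pure-noise features against a pure-noise outcome. The sample size, the number of features, and all variable names are hypothetical choices made for this example and do not come from the lesson data.

```r
set.seed(1)                      # fixed seed so the run is repeatable

n_samples  <- 20                 # hypothetical number of subjects
n_features <- 1000               # hypothetical number of measured features

# An outcome and a feature matrix that are independent random noise
outcome  <- rnorm(n_samples)
features <- matrix(rnorm(n_samples * n_features), nrow = n_samples)

# Test each feature (column) for correlation with the outcome
p_values <- apply(features, 2, function(x) cor.test(x, outcome)$p.value)

# Count features that look "significant" at the 0.05 level by chance alone
n_false_hits <- sum(p_values < 0.05)
print(n_false_hits)
```

Because nothing here is truly related to the outcome, roughly 5% of the features (about 50 of 1,000) are expected to pass the 0.05 threshold anyway.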

Prerequisites

This lesson assumes basic skills in the R statistical programming language and the RStudio integrated development environment.

To get started, follow the directions in the Setup tab to get access to the required software and data for this workshop.

Setup Instructions

Download files required for the lesson

00h 00m

1. Introduction

What is statistical inference?
Why do biomedical researchers need to learn statistics now?

00h 05m

2. Inference

What does inference mean?
Why do we need p-values and confidence intervals?
What is a random variable?
What exactly is a distribution?

02h 00m

3. Populations, Samples and Estimates

What is a parameter from a population?
What are sample estimates?
How can we use sample estimates to make inferences about population parameters?

02h 00m

4. Central Limit Theorem and the t-distribution

What is a parameter from a population?

02h 50m

5. Central Limit Theorem in practice

How is the CLT used in practice?

02h 50m

6. t-tests in practice

How are t-tests used in practice?

02h 50m

7. Confidence Intervals

What is a confidence interval?
When is it best to use a confidence interval?

03h 10m

8. Power Calculations

What is statistical power?
How is power calculated?

03h 50m

9. Monte Carlo simulation

How are Monte Carlo simulations used in practice?

04h 45m

10. Permutations

What is a permutation test?
When is a permutation test helpful?

05h 25m

11. Association tests

What is the Chi-squared test?
What is Fisher’s exact test?
When would these tests be used?

06h 05m

12. Exploratory Data Analysis

How can data be visualized to reveal important relationships?
What is exploratory data analysis?

06h 45m

13. Plots to avoid

07h 05m

Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.

Installation

R is a programming language that is especially powerful for data exploration, visualization, and statistical analysis. To interact with R, we use RStudio.

1. Install the latest version of R from CRAN.
2. Install the latest version of RStudio. Choose the free RStudio Desktop version for Windows, Mac, or Linux.
3. Start RStudio. The tidyverse is a collection of packages that work together for everyday use in data science. You can install them from the Console or from the RStudio Packages tab.

R

install.packages("tidyverse")

Confirm that the installation succeeded by loading the tidyverse package. Do this in the Console as shown below, or check the box next to tidyverse in the RStudio Packages tab.

R

library(tidyverse)

Also install and load the downloader and rafalib packages by following the same procedure that you used for the tidyverse.
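
Following the same procedure as for the tidyverse, installing and loading these two packages might look like the sketch below. The explicit repos URL (CRAN's cloud mirror) and the check for packages that are already installed are assumptions added here so that the commands also run outside RStudio. They are not part of the original instructions.

```r
# Package names come from the lesson text
pkgs <- c("downloader", "rafalib")

# Install only the packages that are not already installed.
# The repos URL is an assumption; omit it to use your default mirror.
to_install <- setdiff(pkgs, rownames(installed.packages()))
if (length(to_install) > 0) {
  install.packages(to_install, repos = "https://cloud.r-project.org")
}

# Load each package; require() returns TRUE on success and FALSE otherwise
loaded <- vapply(pkgs, require, logical(1), character.only = TRUE)
print(loaded)
```

If either value printed is FALSE, that package did not install or load, and the installation step should be repeated.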

Data files and project organization

1. Make a new folder on your Desktop called inference. Move into this new folder.
2. Create a data folder to hold the data, a scripts folder to house your scripts, and a results folder to hold results.

Alternatively, you can use the R console to run the following commands for steps 1 and 2.

R

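# Create the project folder on the Desktop and move into it
setwd("~/Desktop")
dir.create("./inference")
setwd("~/Desktop/inference")
# Create subfolders for the data, scripts, and results
dir.create("./data")
dir.create("./scripts")
dir.create("./results")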

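Please download the following files and place them in your data folder: femaleMiceWeights.csv, femaleControlsPopulation.csv, and mice_pheno.csv. You can download them from the URLs shown in the commands below and move them as you would any other downloaded file.

Alternatively, you can copy and paste the following into the R console to download the data. These commands save the files to a relative path, so they assume that your working directory is the inference folder.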

R

download.file(url = "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleMiceWeights.csv", destfile = "data/femaleMiceWeights.csv")

download.file(url = "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv", destfile = "data/femaleControlsPopulation.csv")
 
download.file(url = "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/mice_pheno.csv", destfile = "data/mice_pheno.csv")
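
To confirm that the downloads worked, a quick check along the following lines may help. This is a minimal sketch that assumes your working directory is the inference folder. The variable names are illustrative only.

```r
# File names taken from the download commands above
files <- c("femaleMiceWeights.csv",
           "femaleControlsPopulation.csv",
           "mice_pheno.csv")
paths <- file.path("data", files)

# TRUE for each file that is present in the data folder
found <- file.exists(paths)
print(data.frame(file = files, found = found))

# Preview the structure of the first file only if it was downloaded
if (found[1]) {
  mice <- read.csv(paths[1])
  str(mice)
}
```

Any file reported as FALSE is missing from the data folder and should be downloaded again.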