Introduction


Inference


Figure 1


Figure 2

Empirical cumulative distribution function for height.

Figure 3

Plotting these heights as bars is what we call a histogram. It is a more useful plot because we are usually more interested in intervals, such as the percent of heights between 70 and 71 inches, rather than the percent below a particular height. Here is a histogram of heights:
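The counting behind a histogram can be sketched in a few lines of code. The book's examples are in R; below is a Python sketch with simulated stand-in heights (the real dataset is not reproduced here), binning into 1-inch intervals:

```python
import random
from collections import Counter

random.seed(1)
# Hypothetical heights in inches; the book works with a real height
# dataset, so these simulated values are for illustration only.
heights = [random.gauss(68, 3) for _ in range(1000)]

# A histogram simply counts how many observations fall in each interval,
# here 1-inch bins: [60, 61), [61, 62), and so on.
bins = Counter(int(h) for h in heights)

# The proportion between 70 and 71 inches is then just one bin's count:
prop_70_71 = bins[70] / len(heights)
print(prop_70_71)
```

Note that the interval question the text mentions (what percent are between 70 and 71 inches?) is answered directly by a single bar.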


Figure 4

Histogram for heights.

Figure 5

Note that the X is now capitalized to distinguish it as a random variable, and that the equation above defines the probability distribution of the random variable. Knowing this distribution is incredibly useful in science. For example, in the case above, if we know the distribution of the difference in mean mouse weights when the null hypothesis is true, referred to as the null distribution, we can compute the probability of observing a value as large as the one we obtained, referred to as a p-value. In a previous section we ran what is called a Monte Carlo simulation (we will provide more details on Monte Carlo simulation in a later section) and obtained 10,000 outcomes of the random variable under the null hypothesis.
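The p-value computation from those null outcomes can be sketched as follows. This is a Python sketch with simulated stand-in values, since the book's R code and mouse data are not reproduced here:

```python
import random

random.seed(42)
# 'nulls' stands in for the 10,000 Monte Carlo outcomes of the difference
# in means generated under the null hypothesis (hypothetical values).
nulls = [random.gauss(0, 1) for _ in range(10000)]
obs = 2.0  # stand-in for the observed difference

# The p-value is the proportion of null outcomes at least as large as
# the observed one
pval = sum(1 for n in nulls if n >= obs) / len(nulls)
print(pval)
```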


Figure 6

Null distribution with observed difference marked with vertical red line.

Figure 7


Populations, Samples and Estimates


Figure 1

Is the mean of Y minus the mean of X equal to zero? We can even compare more than two means. The three normal curves below help visualize questions comparing the mean of each curve to the others.


Figure 2

Normal curves with different means and standard deviations

Figure 3


Figure 4


Figure 5


Central Limit Theorem and the t-distribution


Figure 1


Figure 2


Figure 3


Figure 4


Figure 5


Figure 6


Figure 7

Histograms of all weights for both populations.

Figure 8

Quantile-quantile plots of all weights for both populations.

Central Limit Theorem in practice


Figure 1

Quantile versus quantile plot of simulated differences versus theoretical normal distribution for four different sample sizes.

Figure 2

Quantile versus quantile plot of simulated ratios versus theoretical normal distribution for four different sample sizes.

Figure 3

    The result above tells us the distribution of the following random variable:
    What does the CLT tell us is the mean of Z (you don’t need code)?

    Figure 4

    Now we introduce the concept of a null hypothesis. We don’t know μX nor μY. We want to quantify what the data say about the possibility that the diet has no effect: μX = μY. If we use the CLT, then we approximate the distribution of X̄ as normal with mean μX and standard deviation σX/√N, and the distribution of Ȳ as normal with mean μY and standard deviation σY/√M. This implies that under the null hypothesis the difference Ȳ - X̄ has mean 0. We described that the standard deviation of this statistic (the standard error) is √(σX²/N + σY²/M) and that we estimate the population standard deviations σX and σY with the sample estimates. What is the estimate of √(σX²/N + σY²/M)?

    Figure 5

    Normal with mean 0 and standard deviation √(σX²/N + σY²/M).
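The plug-in estimate of this standard error can be sketched in code. The book's code is in R and uses mouse-weight data; the Python sketch below uses hypothetical samples standing in for the two diet groups:

```python
import math
import random
from statistics import mean, stdev

random.seed(7)
# Hypothetical control (X) and treatment (Y) samples standing in for the
# book's mouse-weight data.
X = [random.gauss(24, 3) for _ in range(12)]
Y = [random.gauss(26, 3) for _ in range(12)]
N, M = len(X), len(Y)

# Plug-in estimate of SE(Ybar - Xbar): replace each population SD with
# the sample SD.
se = math.sqrt(stdev(X)**2 / N + stdev(Y)**2 / M)
tstat = (mean(Y) - mean(X)) / se
print(se, tstat)
```

Dividing the observed difference by this estimated standard error gives the t-statistic used throughout the next sections.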

    t-tests in practice


    Figure 1


    Figure 2

    Quantile-quantile plots for sample against theoretical normal distribution.

    Confidence Intervals


    Figure 1


    Figure 2


    Figure 3

    We show 250 random realizations of 95% confidence intervals. The color denotes if the interval fell on the parameter or not.

    Figure 4

    We show 250 random realizations of 95% confidence intervals, but now for a smaller sample size. The confidence interval is based on the CLT approximation. The color denotes if the interval fell on the parameter or not.

    Figure 5

    We show 250 random realizations of 95% confidence intervals, but now for a smaller sample size. The confidence interval is now based on the t-distribution approximation. The color denotes if the interval fell on the parameter or not.

    The confidence interval for the difference in means is

    Ȳ - X̄ ± 2 SE(Ȳ - X̄)

    with

    SE(Ȳ - X̄) = √(sX²/N + sY²/M)

    If this interval does not include 0, then either

    Ȳ - X̄ - 2 SE(Ȳ - X̄) > 0

    or

    Ȳ - X̄ + 2 SE(Ȳ - X̄) < 0

    This suggests that either

    (Ȳ - X̄) / SE(Ȳ - X̄) > 2

    or

    (Ȳ - X̄) / SE(Ȳ - X̄) < -2

    This then implies that the t-statistic is more extreme than 2, which in turn suggests that the p-value must be smaller than 0.05 (approximately; for a more exact calculation use qnorm(1 - .05/2) instead of 2). The same calculation can be made if we use the t-distribution instead of the CLT (with qt(1 - .05/2, df = 2 * N - 2)). In summary, if a 95% or 99% confidence interval does not include 0, then the p-value must be smaller than 0.05 or 0.01 respectively.
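This equivalence between the interval excluding 0 and the t-statistic being extreme can be checked numerically. A Python sketch with hypothetical numbers (the book's own code is in R; `NormalDist().inv_cdf` plays the role of R's `qnorm`):

```python
from statistics import NormalDist

# The "2" used in the text is an approximation to the exact 95% cutoff:
z = NormalDist().inv_cdf(1 - 0.05 / 2)  # about 1.96

# Hypothetical observed difference in means and its standard error:
diff, se = 3.1, 1.2
lower, upper = diff - z * se, diff + z * se
tstat = diff / se

# The confidence interval excludes 0 exactly when |t| exceeds the cutoff
print((lower > 0 or upper < 0) == (abs(tstat) > z))
```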


    Power Calculations


    Figure 1

    Null distribution showing type I error as alpha

    Figure 2

    Alternative hypothesis showing type II error as beta

    Figure 3


    Figure 4

    Power plotted against sample size.

    Figure 5

    Power plotted against cut-off.

    Figure 6

    p-values from random samples at varying sample sizes. The p-values decrease as we increase the sample size whenever the alternative hypothesis is true.
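The pattern in the figure above can be reproduced with a small simulation. This Python sketch (hypothetical populations; the book's code is in R) builds in a true difference in means, so the alternative hypothesis holds and the p-value shrinks as the sample size grows:

```python
import math
import random
from statistics import NormalDist, mean, stdev

random.seed(3)

def pvalue(n):
    # Hypothetical populations with a true difference in means of 2,
    # so the alternative hypothesis holds by construction.
    X = [random.gauss(24, 3) for _ in range(n)]
    Y = [random.gauss(26, 3) for _ in range(n)]
    se = math.sqrt(stdev(X)**2 / n + stdev(Y)**2 / n)
    t = (mean(Y) - mean(X)) / se
    return 2 * (1 - NormalDist().cdf(abs(t)))  # CLT-based p-value

# p-values tend to shrink as the sample size grows
for n in (10, 50, 250):
    print(n, pvalue(n))
```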

    Monte Carlo simulation


    Figure 1

    Histogram of 1000 Monte Carlo simulated t-statistics.

    Figure 2

    Quantile-quantile plot comparing 1000 Monte Carlo simulated t-statistics to theoretical normal distribution.

    Figure 3

    Quantile-quantile plot comparing 1000 Monte Carlo simulated t-statistics with three degrees of freedom to theoretical normal distribution.

    Figure 4

    Quantile-quantile plot comparing 1000 Monte Carlo simulated t-statistics with three degrees of freedom to theoretical t-distribution.
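The Monte Carlo construction behind these plots can be sketched as follows. This Python sketch (the book's code is in R) generates t-statistics from small normal samples and shows their tails are heavier than the normal distribution predicts:

```python
import math
import random
from statistics import mean, stdev

random.seed(9)
N = 4  # small samples: t-statistics have heavier tails than the normal

def tstat():
    # t-statistic from N standard-normal draws (null hypothesis true)
    x = [random.gauss(0, 1) for _ in range(N)]
    return math.sqrt(N) * mean(x) / stdev(x)

ts = [tstat() for _ in range(1000)]

# With N = 4 these follow a t-distribution with 3 degrees of freedom, so
# |t| > 2 happens far more often than the ~5% the normal would predict
print(sum(abs(t) > 2 for t in ts) / len(ts))
```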

    Figure 5

    Quantile-quantile plot of original data compared to theoretical quantile distribution.

    Permutations


    Figure 1

    Histogram of difference between averages from permutations. Vertical line shows the observed difference.

    Figure 2

    Histogram of difference between averages from permutations for smaller sample size. Vertical line shows the observed difference.
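The permutation procedure behind these histograms can be sketched in code. A Python sketch with hypothetical treatment and control values (the book permutes mouse weights in R):

```python
import random
from statistics import mean

random.seed(1)
# Hypothetical treatment and control values standing in for mouse weights
treatment = [26.1, 25.3, 27.8, 24.9, 28.2, 25.7]
control = [24.2, 23.9, 25.1, 24.8, 23.5, 24.4]
obs = mean(treatment) - mean(control)

pooled = treatment + control
n = len(treatment)
null_diffs = []
for _ in range(2000):
    random.shuffle(pooled)  # permuting destroys any real group difference
    null_diffs.append(mean(pooled[:n]) - mean(pooled[n:]))

# Permutation p-value, with the usual +1 correction so it is never zero
pval = (sum(abs(d) >= abs(obs) for d in null_diffs) + 1) / (len(null_diffs) + 1)
print(obs, pval)
```

Shuffling the pooled values scrambles the group labels, so the resulting differences form a null distribution against which the observed difference is compared.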

    Association tests


    Exploratory Data Analysis


    Figure 1

    First example of qqplot. Here we compute the theoretical quantiles ourselves.

    Figure 2

    Second example of qqplot. Here we use the function qqnorm which computes the theoretical normal quantiles automatically.

    Figure 3

    Example of the qqnorm function. Here we apply it to numbers generated to follow a normal distribution.

    Figure 4

    We generate t-distributed data for four degrees of freedom and make qqplots against normal theoretical quantiles.

    Figure 5

    Histogram and QQ-plot of executive pay.

    Figure 6

    Simple boxplot of executive pay.

    Figure 7

    Heights of father and son pairs plotted against each other.

    Figure 8

    Boxplot of son heights stratified by father heights.

    Figure 9


    Figure 10


    Figure 11

    qqplots of son heights for four strata defined by father heights.

    Figure 12


    Figure 13


    Figure 14


    Figure 15

    Average son height of each stratum plotted against the father heights defining the strata.

    Figure 16


    Plots to avoid


    Figure 1

    Pie chart of browser usage.

    Figure 2

    Barplot of browser usage.

    Figure 3

    3D version.

    Even worse than pie charts are donut plots.


    Figure 4

    Donut plot.

    Figure 5

    Bad bar plots.

    Figure 6

    Treatment data and control data shown with a boxplot.

    Figure 7

    Bar plots with outliers.

    Figure 8

    Data and boxplots for original data (left) and in log scale (right).

    Figure 9

    The plot on the left shows a regression line that was fitted to the data shown on the right. It is much more informative to show all the data.

    Figure 10

    Gene expression data from two replicated samples. Left is in original scale and right is in log scale.

    Figure 11

    MA plot of the same data shown above shows that the data are not replicated very well despite a high correlation.

    Figure 12

    Barplot for two variables.

    Figure 13

    For two variables a scatter plot or a ‘rotated’ plot similar to an MA plot is much more informative.

    Figure 14

    Another alternative is a line plot. If we don’t care about pairings, then the boxplot is appropriate.

    Figure 15

    Gratuitous 3-D.

    Figure 16

    This plot demonstrates that using color is more than enough to distinguish the three lines.

    Figure 17

    Ignoring important factors.

    Figure 18

    Because dose is an important factor, we show it in this plot.