Introduction
- Novel technologies produce data of great complexity and scale, requiring a more sophisticated understanding of statistics.
Inference
- Inference uses sampling to investigate population parameters such as mean and standard deviation.
- A p-value is the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed.
- A distribution summarizes a set of numbers by describing how often each value, or range of values, occurs.
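
As a small illustration of how a distribution summarizes a set of numbers, the sketch below computes an empirical cumulative distribution function: the proportion of values at or below a chosen point. The function name `ecdf` and the `heights` values are hypothetical, made up for this example.

```python
def ecdf(values, a):
    """Return the proportion of values less than or equal to a."""
    return sum(v <= a for v in values) / len(values)

# Hypothetical measurements, invented for illustration only.
heights = [61, 64, 64, 66, 68, 69, 70, 72, 73, 75]

# Five of the ten values are <= 68, so the proportion is 0.5.
print(ecdf(heights, 68))
```

Evaluating `ecdf` over a range of points traces out the whole distribution rather than a single summary such as the mean.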
Populations, Samples and Estimates
- Parameters are measures of populations.
- Statistics are measures of samples drawn from populations.
- Statistical inference is the process of using sample statistics to answer questions about population parameters.
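
The distinction between parameters and statistics can be sketched with a simulated population. Everything here is an assumption for illustration: the population is generated from a normal distribution with an invented mean and standard deviation, and the sample size of 12 is arbitrary.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated weights, invented for illustration.
population = [rng.gauss(24.0, 3.5) for _ in range(10_000)]

# Parameters: computed from the entire population (rarely observable in practice).
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

# Statistics: computed from a random sample drawn from that population.
sample = rng.sample(population, 12)
sample_mean = statistics.fmean(sample)
sample_sd = statistics.stdev(sample)  # uses the n - 1 denominator

print(f"population mean {pop_mean:.2f}, sample mean {sample_mean:.2f}")
print(f"population SD   {pop_sd:.2f}, sample SD   {sample_sd:.2f}")
```

The sample statistics are estimates: they vary from one sample to the next, while the population parameters stay fixed.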
Central Limit Theorem and the t-distribution
- We can calculate the probabilities of events using mathematical formulas.
- The Central Limit Theorem and t-distribution have different assumptions.
- Both can be used to calculate probabilities.
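
A minimal sketch of a probability calculation based on the Central Limit Theorem, assuming two independent samples large enough for the normal approximation to hold. The function name `clt_two_sided_p` and the input numbers are hypothetical. The t-distribution is not available in Python's standard library, so only the normal approximation is shown.

```python
from math import sqrt
from statistics import NormalDist

def clt_two_sided_p(observed_diff, sd_x, sd_y, n_x, n_y):
    """Two-sided p-value for a difference in means, using the normal approximation."""
    # Standard error of the difference between two independent sample means.
    se = sqrt(sd_x**2 / n_x + sd_y**2 / n_y)
    z = observed_diff / se
    # Probability of a standard normal value at least as extreme as z, in either tail.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical summary statistics, invented for illustration.
z, p = clt_two_sided_p(observed_diff=3.0, sd_x=4.0, sd_y=5.0, n_x=40, n_y=40)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With small samples the normal approximation tends to understate uncertainty, which is where the t-distribution, with its additional assumption of normally distributed data, is typically used instead.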
Central Limit Theorem in practice
- .
- .
- .
- .
t-tests in practice
- .
- .
- .
- .
Confidence Intervals
- A confidence interval is a better alternative to reporting a p-value alone.
- A confidence interval estimates effect size and the uncertainty associated with this estimate.
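
One way to sketch a confidence interval for a mean, assuming the sample is large enough for the normal approximation. The function name `mean_confidence_interval` and the data are hypothetical; for small samples a t-distribution quantile would give a wider interval than the normal quantile used here.

```python
from math import sqrt
from statistics import NormalDist, fmean, stdev

def mean_confidence_interval(sample, level=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    center = fmean(sample)
    se = stdev(sample) / sqrt(n)                  # standard error of the mean
    q = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for a 95% interval
    return center - q * se, center + q * se

# Hypothetical measurements, invented for illustration.
data = [21.5, 23.1, 19.8, 24.4, 22.0, 20.7, 23.9, 22.6, 21.2, 22.8]
low, high = mean_confidence_interval(data)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```

The interval reports both the estimate, at its center, and the uncertainty, as its width, which a p-value alone does not convey.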
Power Calculations
- Power is the probability of detecting a real effect.
- Effect size with a confidence interval is a more meaningful measure than a p-value.
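
Power can be estimated by simulation: generate many experiments in which the effect is real and count how often the test rejects the null hypothesis. The sketch below assumes normally distributed data, equal group sizes, and a normal-approximation test. The function name `simulated_power` and all parameter values are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, stdev

def simulated_power(effect, n, sd=1.0, alpha=0.05, n_sims=1000, seed=0):
    """Estimate power as the fraction of simulated experiments that reject the null."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        # Simulate one experiment in which the true difference equals `effect`.
        control = [rng.gauss(0.0, sd) for _ in range(n)]
        treated = [rng.gauss(effect, sd) for _ in range(n)]
        se = sqrt(stdev(control) ** 2 / n + stdev(treated) ** 2 / n)
        z = (fmean(treated) - fmean(control)) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sims

# Same hypothetical effect size, two sample sizes.
for n in (5, 50):
    print(n, simulated_power(effect=1.0, n=n))
```

Holding the effect size fixed, power grows with the sample size, which is why power calculations are commonly used to choose the sample size before an experiment.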
Monte Carlo simulation
- Random number generators can imitate random variables from the real world.
- Simulated data lets us examine properties of random variables and test out ideas or competing methods computationally.
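
A Monte Carlo sketch of testing an idea computationally: how well does the Central Limit Theorem approximation work for sample means drawn from a skewed population? The exponential population, the sample size of 30, and the function name `clt_coverage` are assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import fmean

def clt_coverage(n=30, n_sims=5000, seed=0):
    """Fraction of simulated sample means within 1.96 standard errors of the true mean."""
    rng = random.Random(seed)
    true_mean, true_sd = 1.0, 1.0  # mean and SD of an exponential with rate 1
    margin = 1.96 * true_sd / sqrt(n)
    hits = 0
    for _ in range(n_sims):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        if abs(fmean(sample) - true_mean) < margin:
            hits += 1
    return hits / n_sims

coverage = clt_coverage()
print(f"fraction within 1.96 SE: {coverage:.3f}")
```

If the normal approximation is adequate, the printed fraction should be close to 0.95. Rerunning with a smaller `n` shows how the approximation degrades.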
Permutations
- Permutation randomly shuffles the group labels of the data, producing datasets for which the null hypothesis is true.
- Permutation tests are useful when there is no good approximation such as that provided by the CLT.
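
The shuffling idea can be sketched as a two-group permutation test on the difference in means. The function name `permutation_test`, the number of permutations, and the data are hypothetical. Adding 1 to both the numerator and the denominator is one common convention that keeps the estimated p-value from being exactly zero.

```python
import random
from statistics import fmean

def permutation_test(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = fmean(y) - fmean(x)
    pooled = list(x) + list(y)
    n_x = len(x)
    extreme = 0
    for _ in range(n_perm):
        # Shuffling the pooled values reassigns group labels at random,
        # which is what the data would look like if the null were true.
        rng.shuffle(pooled)
        diff = fmean(pooled[n_x:]) - fmean(pooled[:n_x])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)

# Hypothetical measurements for two groups, invented for illustration.
control = [20.1, 22.4, 19.8, 23.0, 21.5, 20.9]
treated = [24.2, 25.1, 22.8, 26.3, 23.9, 25.6]
diff, p_value = permutation_test(control, treated)
print(f"observed difference {diff:.2f}, permutation p-value {p_value:.4f}")
```

No distributional approximation is involved: the null distribution is built directly from the shuffled data.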
Association tests
- Binary, categorical and ordinal data require different tests.
- Chi-squared and Fisher’s exact tests are useful in these cases.
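
For a 2x2 table, Fisher's exact test can be sketched directly from the hypergeometric distribution. The function name `fisher_exact_two_sided` is hypothetical, and the two-sided rule used here (sum the probabilities of all tables no more likely than the observed one) is one common convention, not the only one.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def weight(k):
        # Number of ways to get k in the top-left cell with all margins fixed.
        return comb(row1, k) * comb(row2, col1 - k)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = weight(a)
    # Integer counts avoid floating-point ties when comparing table probabilities.
    extreme = sum(weight(k) for k in range(lowest, highest + 1) if weight(k) <= observed)
    return extreme / comb(n, col1)

# Hypothetical 2x2 table of counts, invented for illustration.
print(fisher_exact_two_sided(3, 1, 1, 3))
```

Because the calculation enumerates every table with the same margins, it stays valid for small counts, where the chi-squared approximation becomes unreliable.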
Exploratory Data Analysis
- Plot data to evaluate relationships between variables.
Plots to avoid
- Pie charts, donut charts, and 3-dimensional plots should not be used to display data.
- Box plots are a better alternative to bar plots.
- If the dataset is small, the individual data points can be shown in the plot.