Introduction


  • Novel technologies produce data of great complexity and scale, requiring a more sophisticated understanding of statistics.

Inference


  • Inference uses sampling to investigate population parameters such as mean and standard deviation.
  • A p-value is the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed.
  • A distribution summarizes a set of numbers by describing how often each value occurs.

Populations, Samples and Estimates


  • Parameters are measures of populations.
  • Statistics are measures of samples drawn from populations.
  • Statistical inference is the process of using sample statistics to answer questions about population parameters.
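
The distinction between parameters and statistics can be sketched with simulated data. The population below, its mean of 24 and standard deviation of 3.5, and the sample size of 12 are all assumptions chosen for illustration, not values from the lesson.

```python
import random
import statistics

random.seed(1)

# Hypothetical population: 10,000 simulated measurements.
population = [random.gauss(24.0, 3.5) for _ in range(10_000)]

# Parameters describe the whole population.
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

# Statistics describe a sample drawn from that population.
sample = random.sample(population, 12)
sample_mean = statistics.fmean(sample)
sample_sd = statistics.stdev(sample)

print(f"population mean {pop_mean:.2f}, sample mean {sample_mean:.2f}")
print(f"population SD   {pop_sd:.2f}, sample SD   {sample_sd:.2f}")
```

In practice the population is rarely available in full, so the sample statistics are the only values we can compute, and inference asks how far they are likely to be from the parameters.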

Central Limit Theorem and the t-distribution


  • We can calculate the probabilities of events using mathematical formulas.
  • The Central Limit Theorem and t-distribution have different assumptions.
  • Both can be used to calculate probabilities.
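
As a rough sketch of the CLT route, the snippet below computes a two-sample test statistic and converts it to a p-value with the standard normal distribution. The two groups are hypothetical values made up for illustration. The Python standard library has no t-distribution, so only the normal approximation is shown; with 12 observations per group, a t-distribution would give a somewhat larger p-value because its tails are heavier.

```python
import math
import statistics

# Hypothetical measurements for two groups (illustrative values only).
control = [21.5, 28.1, 24.0, 23.5, 23.7, 19.3, 28.8, 24.5, 26.4, 22.9, 25.1, 23.0]
treated = [25.7, 26.4, 22.8, 25.3, 24.9, 29.6, 26.3, 34.0, 27.1, 30.2, 28.4, 26.9]

# Observed difference in sample means.
diff = statistics.fmean(treated) - statistics.fmean(control)

# Standard error of the difference, using each group's sample variance.
se = math.sqrt(
    statistics.variance(treated) / len(treated)
    + statistics.variance(control) / len(control)
)
t_stat = diff / se

# CLT approximation: treat the statistic as standard normal under the null.
p_normal = 2 * (1 - statistics.NormalDist().cdf(abs(t_stat)))

print(f"difference {diff:.2f}, statistic {t_stat:.2f}, p-value {p_normal:.4f}")
```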

Central Limit Theorem in practice


  • .
  • .
  • .
  • .

t-tests in practice


  • .
  • .
  • .
  • .

Confidence Intervals


  • A confidence interval is a better alternative to reporting p-values alone.
  • A confidence interval estimates effect size and the uncertainty associated with this estimate.
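
A minimal sketch of a 95% confidence interval for a mean is shown below. The sample values are hypothetical, and the interval uses the normal quantile (about 1.96), which assumes the CLT approximation holds; for a sample this small, a t-distribution quantile would give a slightly wider interval.

```python
import math
import statistics

# Hypothetical sample (illustrative values only).
sample = [25.7, 26.4, 22.8, 25.3, 24.9, 29.6, 26.3, 34.0, 27.1, 30.2, 28.4, 26.9]

n = len(sample)
mean = statistics.fmean(sample)
se = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean

# Normal quantile for a two-sided 95% interval.
z = statistics.NormalDist().inv_cdf(0.975)

lower, upper = mean - z * se, mean + z * se
print(f"mean {mean:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
```

The interval reports both the estimate and its uncertainty, which a p-value alone does not.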

Power Calculations


  • Power is the probability of detecting a real effect.
  • Effect size with a confidence interval is a more meaningful measure than a p-value.
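
Power can be estimated by simulation: generate many data sets in which the effect is real and count how often the test rejects the null. The sketch below assumes normally distributed data, a normal-approximation test, an effect of 0.5 standard deviations, and a 0.05 significance level; the function names `reject_null` and `estimate_power` are hypothetical helpers written for this example.

```python
import math
import random
import statistics


def reject_null(x, y, alpha=0.05):
    """Two-sample test using the normal approximation; True if p < alpha."""
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    z = (statistics.fmean(y) - statistics.fmean(x)) / se
    p = 2 * (1 - statistics.NormalDist().cdf(abs(z)))
    return p < alpha


def estimate_power(n, effect, sd=1.0, reps=2000, seed=0):
    """Fraction of simulated experiments that detect a true effect."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, sd) for _ in range(n)]
        y = [rng.gauss(effect, sd) for _ in range(n)]
        hits += reject_null(x, y)
    return hits / reps


power_small = estimate_power(n=10, effect=0.5)
power_large = estimate_power(n=50, effect=0.5)
print(f"power with n=10: {power_small:.2f}, with n=50: {power_large:.2f}")
```

For a fixed effect size, the larger sample gives the higher estimated power.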

Monte Carlo simulation


  • Random number generators can imitate random variables from the real world.
  • Simulated data lets us examine properties of random variables and test out ideas or competing methods computationally.
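
One way to test an idea computationally is to check how well the CLT approximation works for skewed data. The sketch below repeatedly draws samples from an exponential distribution (mean 1, standard deviation 1) and counts how often the sample mean falls within 1.96 standard errors of the true mean. The distribution, sample size of 30, and number of repetitions are assumptions chosen for illustration.

```python
import math
import random
import statistics

rng = random.Random(42)
n, reps = 30, 5000

# Each repetition draws n values from a skewed distribution and keeps the mean.
means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(reps)
]

# Standard error of the mean when the population SD is 1.
se = 1.0 / math.sqrt(n)

# If the normal approximation is good, about 95% of means fall in this range.
inside = sum(abs(m - 1.0) <= 1.96 * se for m in means) / reps
print(f"fraction of sample means within 1.96 SE: {inside:.3f}")
```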

Permutations


  • Permutation randomly shuffles the group labels of the data so that the null hypothesis is true by construction.
  • Permutation tests are useful when there is no good approximation such as that provided by the CLT.
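
A permutation test for a difference in means can be sketched as follows. The data are hypothetical, and `permutation_test` is a helper written for this example rather than a library function. Shuffling the pooled values breaks any link between group and measurement, so the shuffled differences form a null distribution.

```python
import random
import statistics


def permutation_test(x, y, reps=5000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = statistics.fmean(y) - statistics.fmean(x)
    pooled = list(x) + list(y)
    n_x = len(x)
    extreme = 0
    for _ in range(reps):
        rng.shuffle(pooled)  # reassign values to groups at random
        diff = statistics.fmean(pooled[n_x:]) - statistics.fmean(pooled[:n_x])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    return observed, (extreme + 1) / (reps + 1)


# Hypothetical measurements (illustrative values only).
group_a = [21.5, 28.1, 24.0, 23.5, 23.7, 19.3]
group_b = [25.7, 26.4, 22.8, 25.3, 29.6, 34.0]

observed, p_value = permutation_test(group_a, group_b)
print(f"observed difference {observed:.2f}, permutation p-value {p_value:.4f}")
```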

Association tests


  • Binary, categorical and ordinal data require different tests.
  • Chi-squared and Fisher’s exact tests are useful in these cases.
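
For a 2x2 table of counts, both tests can be written with the standard library alone. The table below is hypothetical, and the helper names are made up for this example. The chi-squared version omits the continuity correction, and the Fisher version uses one common two-sided convention: summing the probabilities of all tables that are no more likely than the observed one.

```python
import math

# Hypothetical 2x2 table of counts (rows: group, columns: outcome).
table = [[12, 5],
         [6, 14]]


def chi_squared_2x2(t):
    """Pearson chi-squared statistic and p-value (1 degree of freedom)."""
    (a, b), (c, d) = t
    n = a + b + c + d
    rows, cols = [a + b, c + d], [a + c, b + d]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (t[i][j] - expected) ** 2 / expected
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2))


def fisher_exact_2x2(t):
    """Two-sided Fisher's exact p-value from the hypergeometric distribution."""
    (a, b), (c, d) = t
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(k):
        # Probability of k counts in the top-left cell with margins fixed.
        return math.comb(r1, k) * math.comb(r2, c1 - k) / math.comb(n, c1)

    p_obs = prob(a)
    k_min, k_max = max(0, c1 - r2), min(r1, c1)
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= p_obs * (1 + 1e-9))


chi2, p_chi2 = chi_squared_2x2(table)
p_fisher = fisher_exact_2x2(table)
print(f"chi-squared {chi2:.2f} (p = {p_chi2:.4f}), Fisher p = {p_fisher:.4f}")
```

Fisher's exact test does not rely on a large-sample approximation, which makes it the safer choice when expected counts are small.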

Exploratory Data Analysis


  • Plot data to evaluate relationships between variables.

Plots to avoid


  • Pie charts, donut charts, and 3-dimensional plots should not be used to display data.
  • Box plots are a better alternative to bar plots.
  • If the dataset is small, individual data points can be shown in the plot.