8  Hypothesis testing I

One of the key problems in data science is assessing whether a pattern you notice is:

  1. a pattern inherent in the overall population from which the observations are drawn,

OR

  2. a spurious pattern specific to the sample of observations you have.

In this case, you learnt a fundamental tool for approaching this problem: statistical hypothesis testing. You should know how to conduct a hypothesis test, interpret its outcome, and identify its shortcomings.

8.1 Preliminary modules

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.formula.api import ols
import statsmodels
from scipy import stats 
from pingouin import pairwise_tests #this is for performing the pairwise tests
from pingouin import pairwise_ttests #older name for the same pairwise tests function

8.2 Hypothesis testing overview

Hypothesis testing helps you rule out sampling variation as an explanation for an observed pattern in a data set. It helps us lift a pattern from a sample to a population. The idea is this: every data set we have seen in this course consists of a small subset of a population we are interested in learning about. That small subset is called a sample. For instance, in the previous case, we were interested in the population of trips taken on the rentable bikes. We had access to a small subset of the trips taken - this is the sample.

Now, there is a small problem with the results of any data analysis. The subset of observations, or sample of observations, that we have access to is random. We could easily have drawn a different subset, and a different subset would give different output in the data analysis - that is, different plots and summary statistics. For instance, the average length of a bike trip could be 45 minutes in one subset and 30 minutes in another, while the true average bike trip length for the population could be something else entirely - say 39 minutes. The purpose of hypothesis testing is to rule out the situation where the sample does not resemble the population. In other words, it lets us say that the conclusions drawn in our data analysis are likely to reflect the population, and are not due to drawing a bad sample.
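
To make sampling variation concrete, here is a minimal simulation sketch; the population, its true mean of 39 minutes, its spread, and the sample size of 50 trips are all made-up assumptions. Different samples from the same population give noticeably different sample means.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of trip lengths (minutes) with a true mean of 39
population = rng.normal(loc=39, scale=15, size=100_000)

# Draw three different samples of 50 trips and compare their sample means
for i in range(3):
    sample = rng.choice(population, size=50, replace=False)
    print(f"Sample {i + 1} mean: {sample.mean():.1f} minutes")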

In this course, we will mainly focus on saying something about the mean of potentially several populations. In general, the hypothesis testing procedure concerns two hypotheses - the null and alternative hypothesis. These contain mutually exclusive statements about a population.

For example: \[H_0: \text{The population mean trip length is } 40 \text{ minutes} \quad \text{vs.} \quad H_a: \text{The population mean trip length is not } 40 \text{ minutes}.\]

\(H_0\) is called the null hypothesis. The null hypothesis often corresponds to the situation of “no effect” - but not always.

In opposition to the null hypothesis, we define an alternative hypothesis (often indicated with \(H_1\) or \(H_a\)) to challenge the null hypothesis. In general, this is the hypothesis of there existing some pattern or effect in the data.

For instance, we may suspect that the average trip length is 40 minutes, and we may be concerned about whether our data shows that the trip length is longer than this, say for computing bike maintenance costs. In that case, we would define \(H_a: \text{The population mean trip length is greater than } 40 \text{ minutes}.\)

The next step in a hypothesis test is to compute a \(p\)-value using our sample. The \(p\)-value measures the evidence against the null hypothesis. It is a number between 0 and 1, and the closer it is to 0, the more evidence there is against the null hypothesis. It can be interpreted as follows: the \(p\)-value is the probability that, assuming the null hypothesis is true, we drew a sample that differs from the null hypothesis at least as much as the one at hand.

Let’s unpack this statement using our example. First, before computing our \(p\)-value, we assume that the null hypothesis is true. In our example, we would assume that the mean population trip length is 40 minutes. Then, if the mean population trip length is 40 minutes, we can measure how far the sample mean is from 40. Say our sample mean is 45; then we have observed a sample whose mean is 5 minutes higher than the assumed population mean. The \(p\)-value is then the probability that, assuming the true mean is 40, we see a sample whose mean is at least 5 minutes away from 40. You will learn in a later course how to compute such a probability. Intuitively, the farther the sample mean is from 40, the lower the \(p\)-value; if we observe a sample mean far from 40, the \(p\)-value will be very close to 0. In other words, the \(p\)-value is the probability, assuming the mean population trip length is 40, that we drew a sample whose sample mean differs from 40 at least as much as the sample mean of the data set at hand.
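
One way to build intuition for this probability is a small simulation sketch; the normal population, its standard deviation of 15 minutes, and the sample size of 50 trips are all made-up assumptions. We repeatedly draw samples from a population whose mean really is 40 minutes and count how often the sample mean lands at least 5 minutes away from 40.

import numpy as np

rng = np.random.default_rng(0)

n_simulations = 10_000  # number of simulated samples
sample_size = 50        # hypothetical number of trips per sample
observed_gap = 5        # our sample mean (45) minus the null mean (40)

# Simulate sample means under the null hypothesis: true mean trip length is 40 minutes
sample_means = rng.normal(loc=40, scale=15, size=(n_simulations, sample_size)).mean(axis=1)

# Fraction of simulated samples whose mean is at least as far from 40 as ours (two-sided)
approx_p = np.mean(np.abs(sample_means - 40) >= observed_gap)
print(f"Approximate two-sided p-value: {approx_p:.3f}")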

Caution

The \(p\)-value does NOT mean that the probability of \(H_a\) being true is \(1 - p\text{-value}\).

In general, \(p\)-values close to 0 (e.g., below 0.1 or 0.05) present evidence against the null hypothesis. The exact threshold can depend on the application.

When doing a formal hypothesis test, there are two possible outcomes:

  1. We conclude \(H_0\) is false, and say we reject \(H_0\). In this case we will conclude that there is statistical evidence for the alternative \(H_a\).
  2. We fail to reject \(H_0\). In this case, we conclude that there is not enough statistical evidence to say that \(H_0\) is false.
Caution

Notice that in the second case we cannot say that the null hypothesis is true - it might be that we just don’t have enough data to rule it out.

In a formal hypothesis test, we define a threshold \(\alpha\); if the \(p\)-value falls below this threshold, then we reject the null hypothesis. In less formal settings, one may just compute the \(p\)-value and use it, along with other factors, to inform decision making.
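
As a minimal sketch of this formal decision rule (the \(p\)-value and the threshold here are hypothetical numbers):

alpha = 0.05    # chosen significance level (application-dependent)
p_value = 0.03  # hypothetical p-value returned by one of the tests below

if p_value < alpha:
    print("Reject H0: there is statistical evidence for Ha.")
else:
    print("Fail to reject H0: not enough evidence against H0.")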

To summarize:

  1. A hypothesis test is used to confirm that a pattern is a feature of the population, and is not due to sampling variation.
  2. To conduct a hypothesis test, we define a null and alternative hypothesis.
  3. After defining the hypotheses, we then compute the evidence against the null hypothesis, given by the \(p\)-value.
  4. If the \(p\)-value is small, we reject the null hypothesis and conclude the alternative. Otherwise, we fail to reject the null hypothesis.
  5. You should always interpret the \(p\)-value and state your conclusion as the final step in the hypothesis test.

To elaborate on point 5: the point of any statistical analysis is not just to run the correct code, but to choose the analysis procedure properly and interpret the results appropriately.

Warning

A hypothesis test cannot tell you which scenario is certainly true - we would have to have access to the whole population to know that. It can, however, tell us that certain patterns are features of the population with very high certainty, which is sufficient for most situations.

The hypothesis tests introduced in Case 8 concern only testing for the mean of one or more populations. We cover each of those in turn.

8.3 Testing for the mean in a single population

We first cover how to perform a hypothesis test concerning the mean value of a single population. Define the population mean of a single population to be \(\mu\). The null hypothesis in this case is in the form of \(H_0\colon \mu=\mu_0\). For instance, above, \(\mu_0=40\).

We have three different ways to define an alternative hypothesis:

  1. \(H_a: \mu \neq \mu_0\) (two-sided test)

  2. \(H_a: \mu > \mu_0\) (one-sided test)

  3. \(H_a: \mu < \mu_0\) (one-sided test)

A common way to perform this test is to use the one-sample \(t\)-test.

The syntax for performing the test is given as follows:

stats.ttest_1samp(series_to_test, popmean=mu_0, alternative=...)  # alternative is 'two-sided', 'greater', or 'less', matching (1)-(3) above

The \(p\)-value is listed in the output, and can be interpreted as instructed above.

Example:

# Create a Pandas Series with sample data
data = pd.Series([2.3, 1.9, 2.7, 2.5, 2.1])

# Perform a one-sample t-test
t_stat, p_value = stats.ttest_1samp(data, popmean=2.0, alternative='greater')

# Print the results
print(f'T-statistic: {t_stat}, P-value: {p_value}')
T-statistic: 2.121320343559641, P-value: 0.0505957536091478
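
Here the one-sided \(p\)-value is roughly 0.051. At a 5% significance level we would (just barely) fail to reject \(H_0\colon \mu = 2.0\), while at a 10% level we would reject it and conclude that the population mean is greater than 2.0.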

8.4 Testing for a difference in mean for two populations

We now cover how to perform a hypothesis test concerning the difference in the mean values between two populations. We would like to test whether two populations have different population means, that is, whether the population mean of group 1 (\(\mu_1\)) is different from the population mean of group 2 (\(\mu_2\)). The hypotheses look like: \[ H_0: \mu_1=\mu_2\] \[H_a: \mu_1 \neq \mu_2\]

A common way to perform this test is to use the two-sample \(t\)-test with unequal variance. (If you think the two populations have approximately the same variance, you can set equal_var=True in the following code.)

The syntax for performing the test is given as follows:

stats.ttest_ind(data_for_group_1, data_for_group_2, equal_var=False)

The \(p\)-value is listed in the output, and can be interpreted as instructed above.

Example:

# Create two Pandas Series with sample data for two groups
group1 = pd.Series([2.3, 1.9, 2.7, 2.5, 2.1])
group2 = pd.Series([3.1, 3.6, 3.2, 3.8, 3.0])

# Perform an independent two-sample t-test
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)

# Print the results
print(f'T-statistic: {t_stat}, P-value: {p_value}')
T-statistic: -4.980696683149987, P-value: 0.0011005506250116467
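
Here the \(p\)-value is about 0.0011, well below common thresholds such as 0.05, so we reject \(H_0\colon \mu_1 = \mu_2\) and conclude that there is statistical evidence that the two population means differ.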

8.5 Testing for a difference in mean for several populations

Lastly, if we would like to perform a hypothesis test for whether or not all the population means are the same when considering \(k>2\) populations, we can do the following.

First, the hypotheses are given by

\[H_0: \mu_1=\mu_2=\cdots=\mu_k,\] vs. \[H_a : \text{At least one of the means } \mu_j \text{ is different from the others}.\]

To test this hypothesis we need an extension of the \(t\)-tests (which can compare at most two groups at a time). This test is called Analysis of Variance (ANOVA).

The syntax is given as follows:

# This is code you can fill in to perform this test
mod = ols('quantity_of_interest ~ grouping_variable', data=YOUR_DATAFRAME).fit()
sm.stats.anova_lm(mod, typ=2)

The \(p\)-value is listed in the output, and can be interpreted as instructed above.

Example:

# Create a sample DataFrame
data = pd.DataFrame({
    'quantity': [2.3, 1.9, 2.7, 2.5, 2.1, 3.1, 3.6, 3.2, 3.8, 3.0],
    'group': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B']
})

# Fit the linear model using OLS
mod = ols('quantity ~ group', data=data).fit()

# Perform ANOVA
anova_results = sm.stats.anova_lm(mod, typ=2)

# Print the results
print(anova_results)
          sum_sq   df          F    PR(>F)
group      2.704  1.0  24.807339  0.001079
Residual   0.872  8.0        NaN       NaN
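
In this output, the \(p\)-value is the PR(>F) entry for group, about 0.001, so we reject \(H_0\) and conclude that at least one group mean differs from the others. Note that ANOVA does not tell us which groups differ; the pairwise_tests function imported from pingouin above can be used as a follow-up. Here is a minimal sketch, assuming the data DataFrame from the ANOVA example (with only two groups it reduces to a two-sample \(t\)-test, but the same call works for \(k>2\) groups):

from pingouin import pairwise_tests

# Pairwise comparisons of 'quantity' between the levels of 'group'
pairwise_results = pairwise_tests(data=data, dv='quantity', between='group')
print(pairwise_results)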

8.6 Errors in hypothesis testing

There are two ways that a test can lead us to an incorrect decision:

  1. When \(H_0\) is true and we reject it. This is called a Type I error. It corresponds to obtaining a false positive.
  2. When \(H_0\) is false and we do not reject it. This is called a Type II error. It corresponds to obtaining a false negative.

This can be summarized as follows:

|                        | \(H_0\) is true                  | \(H_0\) is false                 |
|------------------------|----------------------------------|----------------------------------|
| Reject \(H_0\)         | Type I error                     | Correct decision (true positive) |
| Fail to reject \(H_0\) | Correct decision (true negative) | Type II error                    |

In general, a Type I error is thought to be more serious, and so it is standard practice to control the probability of making one. In a formal hypothesis test, when the null hypothesis is true, the probability of making a Type I error equals the threshold \(\alpha\), also known as the significance level, that we introduced above. Often, we choose our significance level \(\alpha\) to be small, e.g., \(1\%, 5\%, 10\%\). Lowering \(\alpha\) (say to \(1\%\)) decreases the probability of a false positive conclusion when the null hypothesis is true. However, because we only control \(\alpha\), we cannot directly control the Type II error. Note that lowering \(\alpha\) is not without consequence: if the alternative hypothesis is true, a lower threshold increases the probability of a Type II error.
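
To see that \(\alpha\) is indeed the Type I error rate, here is a minimal simulation sketch; the normal population with mean 40 and standard deviation 15 and the sample size of 50 are made-up assumptions. When \(H_0\) is true, a one-sample \(t\)-test at the 5% level rejects roughly 5% of the time.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_simulations = 5_000

rejections = 0
for _ in range(n_simulations):
    # H0 is true here: the population mean really is 40 minutes
    sample = rng.normal(loc=40, scale=15, size=50)
    _, p_value = stats.ttest_1samp(sample, popmean=40)
    if p_value < alpha:
        rejections += 1

print(f"Proportion of false positives: {rejections / n_simulations:.3f}")  # close to alpha = 0.05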

In summary, it is important to be aware of the two types of error and to evaluate how serious each would be in your context. In particular, you should evaluate the consequences of making a Type I error and choose your threshold \(\alpha\) accordingly. Lastly, you should know that there is a trade-off between Type I and Type II error in a given hypothesis test.

8.7 Misc. Python functions introduced in Case 8

  1. plt.subplot(rows, cols, curr_plot) - use this function to position multiple plots in a single figure. For example, plt.subplot(2, 3, 4) creates a grid with 2 rows and 3 columns and activates the 4th subplot.

  2. sns.countplot - creates a bar plot of the number of observations in each category of a categorical variable.

  3. enumerate - use this function to get both the index and the value from an iterable in a loop. Example:

fruits = ['apple', 'banana', 'cherry']
for index, fruit in enumerate(fruits, start=1):
    print(f"{index}: {fruit}")
1: apple
2: banana
3: cherry
  4. plt.xticks(rotation=90) - used to rotate the \(x\)-axis tick labels. A combined sketch using these plotting functions is shown below.
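
A minimal sketch tying these together (the DataFrame and category names are made up for illustration):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical data with two categorical columns
df = pd.DataFrame({
    'membership': ['casual', 'member', 'member', 'casual', 'member', 'casual'],
    'day': ['Monday', 'Tuesday', 'Monday', 'Wednesday', 'Tuesday', 'Monday']
})

plt.figure(figsize=(8, 4))

# Left panel: counts per membership type
plt.subplot(1, 2, 1)
sns.countplot(data=df, x='membership')

# Right panel: counts per day, with rotated x-axis tick labels
plt.subplot(1, 2, 2)
sns.countplot(data=df, x='day')
plt.xticks(rotation=90)

plt.tight_layout()
plt.show()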