Stats Review: The Most Dangerous Equation

2021-06-05


Notes from Causal Inference for the Brave and True

https://matheusfacure.github.io/python-causality-handbook/03-Stats-Review-The-Most-Dangerous-Equation.html


The standard error of the mean is a dangerous equation not to know:

$SE = \frac{\sigma}{\sqrt{n}}$

With $\sigma$ as the standard deviation and $n$ as the sample size.
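
Before diving into the data, here's a minimal simulation sketch (my own, with made-up $\mu$ and $\sigma$) verifying that the spread of sample means shrinks like $\sigma/\sqrt{n}$:

import numpy as np

np.random.seed(42)
mu, sigma = 50, 10  # made-up population parameters

for n in [10, 100, 1000]:
    # standard deviation of 5,000 simulated sample means vs. the theoretical SE
    sample_means = np.random.normal(mu, sigma, size=(5000, n)).mean(axis=1)
    print(f"n={n:4d}  empirical SE={sample_means.std():.3f}  theoretical={sigma/np.sqrt(n):.3f}")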

Looking at a dataset of ENEM scores (a Brazilian standardized high school exam, similar to the SAT) from different schools over three years:

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib import style
style.use("fivethirtyeight")
df = pd.read_csv("data/enem_scores.csv")
df.sort_values(by="avg_score", ascending=False).head(10)
       year  school_id  number_of_students  avg_score
16670  2007   33062633                  68      82.97
16796  2007   33065403                 172      82.04
16668  2005   33062633                  59      81.89
16794  2005   33065403                 177      81.66
10043  2007   29342880                  43      80.32
18121  2007   33152314                  14      79.82
16781  2007   33065250                  80      79.67
3026   2007   22025740                 144      79.52
14636  2007   31311723                 222      79.41
17318  2007   33087679                 210      79.38

An initial observation is that the top-performing schools seem to have a low number of students. Taking a look at the top 1%:

plot_data = (df
             .assign(top_school = df["avg_score"] >= np.quantile(df["avg_score"], .99))
             [["top_school", "number_of_students"]]
             .query(f"number_of_students<{np.quantile(df['number_of_students'], .98)}"))

plt.figure(figsize=(6,6))
sns.boxplot(x="top_school", y="number_of_students", data=plot_data)
plt.title("Number of Students in 1% top schools (right)")

[Figure: boxplot of number_of_students for the top 1% of schools vs. the rest]

The data does suggest that top schools have fewer students, which makes sense intuitively.

The trap appears when we take this at face value and make decisions based on it. What if we looked at the bottom 1% of schools too?

q_99 = np.quantile(df['avg_score'], .99)
q_01 = np.quantile(df['avg_score'], .01)
plot_data = (df
             .sample(10000)
             .assign(group = lambda d: np.select([d['avg_score'] > q_99,
                                                  d['avg_score'] < q_01],
                                                 ['Top', 'Bottom'],
                                                 'Middle')))
plt.figure(figsize=(10,5))
sns.scatterplot(y="avg_score",
                x="number_of_students",
                hue="group",
                data=plot_data)
plt.title("ENEM Score by Number of Students in the School")

[Figure: scatter of avg_score vs. number_of_students, colored by Top/Middle/Bottom group]

plot_data.groupby('group')['number_of_students'].median()
group
Bottom     95.5
Middle    104.0
Top        77.0
Name: number_of_students, dtype: float64
  • The bottom 1% of schools also have fewer students

As the number of units in a sample grows, the variance of the sample mean decreases and averages become more precise.

  • Smaller samples can show a lot of variance in their average outcomes purely due to chance

This speaks to a fundamental fact about the reality of information: it is always imprecise.

  • The question becomes: can we quantify how imprecise it is?
  • Probability is an acceptance of the lack of certainty in our knowledge and the development of methods for dealing with our ignorance
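
To see the small-sample effect in isolation, here's a quick simulation sketch (not from the book; mean 70, standard deviation 15, and school sizes are made up): every fake school draws students from the same score distribution, yet both extremes end up dominated by small schools:

np.random.seed(42)
# every simulated school shares the same true mean score of 70
school_sizes = np.random.randint(10, 500, size=1000)
school_means = [np.random.normal(70, 15, size=s).mean() for s in school_sizes]
sim = pd.DataFrame({"number_of_students": school_sizes, "avg_score": school_means})

# both tails of the score distribution should be dominated by small schools
print("Median size, top 10 schools:   ", sim.nlargest(10, "avg_score")["number_of_students"].median())
print("Median size, bottom 10 schools:", sim.nsmallest(10, "avg_score")["number_of_students"].median())
print("Median size, all schools:      ", sim["number_of_students"].median())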

The Standard Error of our Estimates

We can test whether the $ATE$ from the last chapter (exam scores for students in traditional classrooms vs. online) is significant. The first step is calculating the standard error $SE$:

data = pd.read_csv("data/online_classroom.csv")
online = data.query("format_ol == 1")["falsexam"]
face_to_face = data.query("format_ol == 0 & format_blended == 0")["falsexam"]
def se(y: pd.Series):
    return y.std() / np.sqrt(len(y))
print(f"SE for Online: {se(online)}")
print(f"SE for Face to Face: {se(face_to_face)}")
SE for Online: 1.5371593973041635
SE for Face to Face: 0.8723511456319106

Confidence Intervals

The standard error $SE$ of an estimate is a measure of its precision, and hence of the confidence we can place in it.

It has a different interpretation depending on the view of statistics you take (frequentist or Bayesian).

The frequentist view:

  • The data is a manifestation of a true data generating process
  • If we could run multiple experiments and collect multiple datasets, all would resemble the underlying process

For the sake of an example, let's say that the true distribution of a student's test score is a normal distribution $N(\mu, \sigma^{2})$ with $\mu = 74$ and $\sigma = 2$.

Run 10,000 experiments, collecting 500 units per sample:

true_std = 2
true_mean = 74

n = 500
def run_experiment():
    return np.random.normal(true_mean, true_std, n)  # draw one sample of n students

np.random.seed(42)

plt.figure(figsize=(8,5))
freq, bins, img = plt.hist([run_experiment().mean() for _ in range(10000)],
                           bins=40,
                           label="Experiment's Mean")
plt.vlines(true_mean, ymin=0, ymax=freq.max(), linestyles="dashed", label="True Mean")
plt.legend()

[Figure: histogram of the 10,000 experiment means, with the true mean marked]

  • This is the distribution of sample means; the sampling distribution of the mean
  • The standard error is the standard deviation of this distribution
  • With the standard error we can create an interval that will contain the true mean 95% of the time (95% CI)
  • We take the desired $z$ score for the normal distribution (here $z = 1.96$, which brackets 95% of the mass) and go $\pm z \times SE$ around the point estimate to get the confidence interval
  • $SE$ serves as our estimate of the standard deviation of the distribution of experiment means
from scipy.stats import norm
z = norm.ppf(0.975)
z
1.959963984540054
np.random.seed(321)
exp_data = run_experiment()
exp_se = exp_data.std() / np.sqrt(len(exp_data))
exp_mu = exp_data.mean()
ci = (exp_mu - z * exp_se, exp_mu + z * exp_se)
print(ci)  # intervals built this way contain the true mean in 95% of repeated experiments
(73.83064660084463, 74.16994997421483)

We can construct a confidence interval for $ATE$ on GPA scores in our classroom example:

def ci(y: pd.Series, confidence=0.975):
    # note: `confidence` here is the upper-tail quantile (0.975 yields a 95% CI)
    return (y.mean() - norm.ppf(confidence) * se(y),
            y.mean() + norm.ppf(confidence) * se(y))
print(f"95% CI for Online: {ci(online)}")
print(f"95% CI for Face to Face: {ci(face_to_face)}")
95% CI for Online: (70.62248602789292, 76.64804014231983)
95% CI for Face to Face: (76.8377077560225, 80.2572614106441)
  • We can see that there's no overlap between the two groups' CIs: this is evidence that the results were not by chance
  • It is very likely that there is a significant causal decrease in academic performance when switching from face-to-face to online classes

There is a nuance to confidence intervals:

  • The population mean is constant; you shouldn't really say that the confidence interval contains the population mean with 95% probability, since, being constant, it is either in the interval or not
  • The 95% refers to the frequency with which such confidence intervals, computed across many studies, contain the true mean
  • We have 95% confidence in the algorithm used to compute the 95% CI, not in any particular interval itself
  • Bayesian statistics, with its probability (credible) intervals, can say that an interval contains the distribution mean 95% of the time
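
We can verify that frequency interpretation with a quick sketch (reusing run_experiment, z, and true_mean from above): build a 95% CI in each of 10,000 experiments and count how often it captures the true mean:

np.random.seed(42)
n_runs = 10000
hits = 0
for _ in range(n_runs):
    sample = run_experiment()
    mu_hat = sample.mean()
    se_hat = sample.std() / np.sqrt(len(sample))
    # does this experiment's 95% CI contain the true mean of 74?
    if mu_hat - z * se_hat <= true_mean <= mu_hat + z * se_hat:
        hits += 1
print(f"Coverage: {hits / n_runs:.3f}")  # should land close to 0.95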

Hypothesis Testing

Is the difference in two means significant, or statistically different from zero (or some other value)?

  • Recall that the sum or difference of two normal distributions is also a normal distribution
  • The resulting mean is the sum or difference of the two means, while the variance is always the sum of the variances

$$N(\mu_1, \sigma_{1}^{2}) - N(\mu_2, \sigma_{2}^{2}) = N(\mu_1 - \mu_2, \sigma_{1}^{2} + \sigma_{2}^{2})$$ $$N(\mu_1, \sigma_{1}^{2}) + N(\mu_2, \sigma_{2}^{2}) = N(\mu_1 + \mu_2, \sigma_{1}^{2} + \sigma_{2}^{2})$$

np.random.seed(123)

# note: np.random.normal takes the standard deviation, so "N(4,3)" below means sigma=3
n1 = np.random.normal(4, 3, 30000)
n2 = np.random.normal(1, 4, 30000)
n_diff = n1 - n2  # std of the difference: sqrt(3**2 + 4**2) = 5

sns.distplot(n1, hist=False, label="N(4,3)")
sns.distplot(n2, hist=False, label="N(1,4)")
sns.distplot(n_diff, hist=False, label="N(4,3) - N(1,4) = N(3,5)")

[Figure: densities of N(4,3), N(1,4), and their difference N(3,5)]

  • If we take the distribution of the means of our two groups and subtract one from the other, we get a third distribution whose mean is the difference in the means and whose standard deviation is the square root of the sum of the variances (i.e., of the squared standard errors):

$$ \mu_{diff} = \mu_1 - \mu_2 $$ $$ SE_{diff} = \sqrt{SE_1^2 + SE_2^2} = \sqrt{\frac{\sigma_{1}^{2}}{n_1} + \frac{\sigma_{2}^{2}}{n_2}} $$

Constructing the distribution of the difference with the classroom example:

diff_mu = online.mean() - face_to_face.mean()
diff_se = np.sqrt(online.var()/len(online) + face_to_face.var()/len(face_to_face))
ci = (diff_mu - 1.96 * diff_se, diff_mu + 1.96 * diff_se)
ci
(-8.376410208363357, -1.4480327880904964)

Plot the confidence interval with the distribution of differences between online and face-to-face groups:

diff_dist = stats.norm(diff_mu, diff_se)
x = np.linspace(diff_mu - 4 * diff_se, diff_mu + 4 * diff_se, 100)
y = diff_dist.pdf(x)

plt.plot(x,y)
plt.vlines(ci[0], ymin=0, ymax = diff_dist.pdf(ci[0]))
plt.vlines(ci[1], ymin=0, ymax = diff_dist.pdf(ci[1]), label = "95% CI")
plt.legend()
plt.show()

[Figure: distribution of the difference in means, with the 95% CI marked]

  • We can say that we're 95% confident that the true difference between groups falls within the interval (-8.38, -1.45)

We can create a z statistic by dividing the difference in means by the standard error of the differences:

$$ z = \frac{\mu_{diff} - H_0}{SE_{diff}} = \frac{(\mu_1 - \mu_2) - H_0}{\sqrt{\frac{\sigma_{1}^{2}}{n_1} + \frac{\sigma_{2}^{2}}{n_2}}} $$

  • Where $H_0$ is the value which we want to test our difference against
  • The z statistic is a measure of how extreme the observed difference is
  • With the null hypothesis we ask: "how likely is it that we would observe this difference if the true/population difference was actually zero/[insert whatever your null hypothesis is]?"
  • Under the null hypothesis, the z statistic follows the standard normal distribution; i.e. if the difference were indeed zero, we would see z fall within two standard deviations of the mean 95% of the time
z = diff_mu / diff_se
print(z)
-2.7792810791031064
x = np.linspace(-4, 4, 100)
y = stats.norm.pdf(x, 0, 1)


plt.plot(x, y, label="Standard Normal")
plt.vlines(z, ymin=0, ymax= .05, label="Z statistic", color="C1")
plt.legend()
plt.show()

[Figure: standard normal density with the z statistic marked]

  • We can see that the z statistic is a pretty extreme value (more than 2 standard deviations away from the mean)
  • Interesting point about hypothesis tests: they're less conservative than checking whether the 95% CIs of the two groups overlap; the CIs can overlap and the difference can still be statistically significant
  • If we pretend that the face-to-face group has an average score of 74 with a standard error of 7 and the online group has an average score of 71 with a standard error of 1:
ctrl_mu, ctrl_se = (71, 1)
test_mu, test_se = (74, 7)

diff_mu = test_mu - ctrl_mu
diff_se = np.sqrt(ctrl_se**2 + test_se**2)

groups = zip(['Control', 'Test', 'Diff'], [[ctrl_mu, ctrl_se],
                                           [test_mu, test_se],
                                           [diff_mu, diff_se]])

# unpack to avoid shadowing the scipy `stats` module imported above
for name, (mu, se_val) in groups:
    print(f"{name} 95% CI:", (mu - 1.96 * se_val, mu + 1.96 * se_val))
Control 95% CI: (69.04, 72.96)
Test 95% CI: (60.28, 87.72)
Diff 95% CI: (-10.859292911256332, 16.859292911256332)
  • The CI for the difference between groups contains 0, so maybe the example provided by the author isn't a great one... a tweaked version below does show the effect
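
For completeness, here's a sketch with made-up numbers that does illustrate the claim: the two groups' 95% CIs overlap, yet the CI of the difference excludes zero:

ctrl_mu, ctrl_se = (71, 1)
test_mu, test_se = (74, 1)

diff_mu = test_mu - ctrl_mu
diff_se = np.sqrt(ctrl_se**2 + test_se**2)

print("Control 95% CI:", (ctrl_mu - 1.96 * ctrl_se, ctrl_mu + 1.96 * ctrl_se))
print("Test 95% CI:   ", (test_mu - 1.96 * test_se, test_mu + 1.96 * test_se))
print("Diff 95% CI:   ", (diff_mu - 1.96 * diff_se, diff_mu + 1.96 * diff_se))
# Control: (69.04, 72.96) and Test: (72.04, 75.96) overlap,
# but Diff: (~0.23, ~5.77) excludes zero, so the difference is significant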

P-Values

  • From Wikipedia: “the p-value is the probability of obtaining test results at least as extreme as the results actually observed during the test, assuming that the null hypothesis is correct”
  • i.e. the probability of seeing such results given that the null hypothesis is true
  • It is not equal to the probability of the null hypothesis being true!
  • Not $P(H_{0} \vert data)$, but rather $P(data \vert H_{0})$

  • To obtain it, just compute the area under the standard normal distribution up to (or beyond) the z statistic

  • Simply plug the z statistic into the CDF of the standard normal distribution:
print(f'P-value: {norm.cdf(z)}')
P-value: 0.0027239680835564706
  • Means that there's a 0.27% chance of observing a z statistic at least this extreme if the null hypothesis is true; this is below conventional significance levels, so we reject the null hypothesis
  • The p-value avoids us having to specify a confidence level upfront
  • Given the p-value, we know exactly at which confidence level the test passes or fails: with a p-value of 0.0027, we have significance down to about the 0.27% level
  • A 95% CI and a 99% CI for the difference won't contain zero, but a 99.9% CI will:
diff_mu = online.mean() - face_to_face.mean()
diff_se = np.sqrt(online.var()/len(online) + face_to_face.var()/len(face_to_face))
print("95% CI:", (diff_mu - norm.ppf(.975)*diff_se, diff_mu + norm.ppf(.975)*diff_se))
print("99% CI:", (diff_mu - norm.ppf(.995)*diff_se, diff_mu + norm.ppf(.995)*diff_se))
print("99.9% CI:", (diff_mu - norm.ppf(.9995)*diff_se, diff_mu + norm.ppf(.9995)*diff_se))
95% CI: (-8.37634655308288, -1.4480964433709733)
99% CI: (-9.464853535264012, -0.3595894611898425)
99.9% CI: (-10.72804065824553, 0.9035976617916743)
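
One caveat (my own note): norm.cdf(z) above is a one-sided p-value. Doubling the tail gives the two-sided version, which lines up with the CIs: roughly 0.0054 is below 1% but above 0.1%, so the 99% CI excludes zero while the 99.9% CI does not:

p_two_sided = 2 * norm.cdf(-abs(z))
print(f"Two-sided p-value: {p_two_sided}")  # ~0.00545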

Key Ideas

  • The standard error lets us put degrees of certainty around our estimates: it enables confidence intervals around point estimates as well as assessments of statistical significance through hypothesis testing
  • Wrapping everything up, we can create an A/B testing function to automate all the work done above:
def AB_test(test: pd.Series, control: pd.Series, confidence=0.95, h0=0):
    mu1, mu2 = test.mean(), control.mean()
    se1, se2 = test.std()/np.sqrt(len(test)), control.std()/np.sqrt(len(control))

    diff = mu1 - mu2
    se_diff = np.sqrt(test.var()/len(test) + control.var()/len(control))

    z_stats = (diff - h0)/se_diff
    # note: one-sided p-value (fine here, where we expect test below control);
    # a two-sided version would be 2 * norm.cdf(-abs(z_stats))
    p_value = norm.cdf(z_stats)

    def critical(se): return -se*norm.ppf((1 - confidence)/2)

    print(f"Test {confidence*100}% CI: {mu1} +- {critical(se1)}")
    print(f"Control {confidence*100}% CI: {mu2} +- {critical(se2)}")
    print(f"Test - Control {confidence*100}% CI: {diff} +- {critical(se_diff)}")
    print(f"Z statistic: {z_stats}")
    print(f"P-Value: {p_value}")
AB_test(online, face_to_face)
Test 95.0% CI: 73.63526308510637 +- 3.0127770572134565
Control 95.0% CI: 78.5474845833333 +- 1.7097768273108005
Test - Control 95.0% CI: -4.912221498226927 +- 3.4641250548559537
Z statistic: -2.7792810791031064
P-Value: 0.0027239680835564706