Chapter 14 Two Sample Hypothesis Testing

14.1 Overview

In chapters 11 and 12 we learned about methods for statistical inference and hypothesis testing when evaluating one sample of data, whether it was proportions or continuous data, and in this latter case whether \(\sigma\) was known or not. In these situations, we were either comparing our sample results to some prior estimate of \(\mu\) or \(p\), or using our sample to estimate a likely range in which the true value of \(\mu\) or \(p\) lay.

In this chapter we will discuss methods of statistical inference and hypothesis testing for comparing two separate samples of data. For example, maybe these samples are taken from different groups and we want to evaluate if their means are the same.

Specifically, we will discuss analysis of:

The difference between two proportions, e.g. \(H_0: p_1-p_2 = 0\)
The difference of two means, e.g. \(H_0: \mu_1 = \mu_2\)
Paired data, for repeated measures on the same experimental unit, e.g. \(H_0: \mu_{diff}=0\)

In each of these situations, we will see that our general framework for hypothesis testing and inference remains the same. Similar to how we approached the t-distribution, we will slightly vary our null distribution, standard error calculation, and test statistic depending on the specific test.

14.2 Analysis of the Differences Between Proportions

We’ve previously looked at how to evaluate if an observed proportion reasonably matches our prior estimate. For example, we could ask “does an observed sample reasonably come from a population with \(p=0.5\)?”, or “Is there a statistically significant difference between our sample and \(p=0.5\)?”

Now, we’re going to learn how to evaluate whether two samples have the same value of \(p\). As an obvious example, think of a medical trial. We give an experimental drug to one group and a placebo (sugar pill) to another group, and then we want to evaluate if the responses are statistically the same (or different) between these two groups. For this section, we’ll assume the response will be binary, i.e. either the treatment worked or not.

Our general approach for hypothesis testing will be the same as we previously did:

Develop a null and alternative hypothesis and determine our “\(\alpha\)-level”
Figure out our null distribution and critical values based on \(\alpha\)
Collect the data and calculate our “test-statistic”
Calculate how likely our observed results are under the null distribution, including the p-value
Reject or fail to reject \(H_0\) (for example, reject if the p-value is less than \(\alpha = 0.05\))

Importantly our null distribution, standard error (SE), and test-statistic will all be slightly different than what we found when evaluating a single sample.

14.2.1 Learning Objectives

After this section, you should be able to:

Run a hypothesis test for the difference of proportions
Describe the calculations of the null distribution, standard error (SE), and test-statistic for hypothesis testing of difference of proportions
Calculate a confidence interval on the true difference between two observed proportions

14.2.2 Comparing Proportions in Two Different Groups

Let’s start with an example.

A survey of 827 randomly sampled registered voters in California about their views on drilling for oil and natural gas off the Coast of California asked, “Do you support? Or do you oppose? Or do you not know enough to say?” Below is the distribution of responses, separated based on whether or not the respondent graduated from college.

Results	College	Not
Support	154	132
Oppose	180	126
Do Not Know	104	131
Total	438	389

Based on this data, we might ask: “Is there strong evidence to suggest that the proportion of non-college graduates who support off-shore drilling in CA is different than the proportion of college graduates who do?”

For this question, what we’re really asking is if the proportion of college graduates who support off-shore drilling (we’ll call this \(p_1\)) is equal to the proportion of non college graduates who support off-shore drilling (we’ll call this \(p_2\)).

Based on the above definitions, our null hypothesis is \(H_0: p_1-p_2 = 0\) (i.e. no difference between the two) and our alternate hypothesis is \(H_A: p_1-p_2 \ne 0\).

Of course, we might see an observed difference that isn’t exactly zero, but what we are asking is if it makes sense that any observed difference occurs simply as the result of random sampling.

One slight difference with this test (compared with our one sample version) is that we can’t fully set up our null distribution before collecting the data. We need to know the size of each group, \(n_1\) and \(n_2\), as well as the overall proportion.

So, based on our data, what is the total proportion of individuals who support off shore drilling, regardless of education level? From above we find this as \(p=\frac{total\ \#\ who\ support}{total\ \#\ surveyed} = \frac{154+132}{438+389} = 0.3458\).

Note that by saying \(p_1=p_2\), we’re basically saying they both equal \(p\), the overall proportion, and that the group someone is in (college graduate or not) does not affect their views about offshore drilling.
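The group proportions and the pooled proportion can be computed directly from the counts in the table. The following is a minimal sketch in R; the variable names (`support`, `n`, `p_hat`, `p_pooled`) are our own choices for illustration.

```r
# Counts from the survey table: college graduates first, non-graduates second
support <- c(college = 154, not = 132)   # number who support drilling
n       <- c(college = 438, not = 389)   # group sizes

# Observed proportion in each group
p_hat <- support / n

# Pooled proportion: total supporters over total respondents
p_pooled <- sum(support) / sum(n)

round(p_hat, 4)
round(p_pooled, 4)
```

Working from the raw counts like this avoids carrying rounded values into later steps.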

14.2.3 Determining the Null Distribution

Our next step is to determine our null distribution. Based on the CLT we’ll write \(\hat p_1-\hat p_2 \sim N(0, SE)\), basically saying that the observed difference is normally distributed and centered at 0. While this is technically correct, we will modify it slightly (more details on that below). First, we need to determine the value of the standard error, \(SE\), which requires a different calculation than before.

To find the standard error note that we now have two proportions to deal with so we will need to combine them. We’ll use the following equation based on the pooled proportion \(p\) (as determined above) and the size of each group: \[SE = \sqrt{p(1-p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}\]

Using our specific values for \(p\), \(n_1\) and \(n_2\) we have:

se <- sqrt(0.3458*(1-0.3458)*(1/438+1/389))
se

## [1] 0.03313665

(I won’t derive why this works here, but for those of you interested there’s more information further below in the notes.)

Now we’re in a position to fully define our null distribution. As we discussed at the end of chapter 12, we could either use \(\hat p_1-\hat p_2 \sim N(0, SE)\) or in fact a better approach is to “z-score” this and instead write: \[\frac{(\hat p_1-\hat p_2)-0}{SE} \sim N(0,1)\]

This latter version makes finding critical values and p-values a little easier, at the expense of a slightly more complex test statistic.

Using this latter form, we can easily find our critical values (at \(\alpha=0.05\)) as

cat("lower:", round(qnorm(0.025, 0, 1), 3), "\n")
cat("upper:", round(qnorm(0.975, 0, 1), 3), "\n")

## lower: -1.96 
## upper: 1.96

but we probably knew that already…

14.2.4 Calculating the Test Statistic

Now we are at the point where we can calculate our test statistic. First, let’s find the observed proportions as \(\hat p_1 = \frac{154}{438} = 0.3516\) and \(\hat p_2 = \frac{132}{389} = 0.3393\). Obviously these are NOT the same, but is the difference significant?

From here we can calculate our test statistic, using the general form of \(\frac{(\hat p_1-\hat p_2)-0}{SE}\) (where we subtract 0 as a placeholder because that is the value of the difference in the null hypothesis), or

(0.3516-0.3393)/0.03314

## [1] 0.3711527

As mentioned above, under the null hypothesis of no difference between groups, this test statistic will have a standard Normal distribution, i.e. \(\frac{\hat p_1 - \hat p_2}{SE} \sim N(0,1)\).

So our question of interest is, how likely are we to see a value as extreme as 0.3711 given our null distribution?

Based on our critical values, we already know we will fail to reject \(H_0\) since this value is within the critical values. Similarly, since this is a two-sided test, we can calculate our p-value as:

2 * pnorm(0.3711, mean = 0, sd = 1, lower.tail = FALSE)

## [1] 0.7105631

Since this is much larger than \(\alpha=0.05\) we fail to reject \(H_0\) and conclude there is no statistically significant difference between college graduates and non-graduates in their support for off-shore drilling.
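The steps of this section can be collected into one small function. This is only a sketch: `two_prop_z` and its argument names are made up for illustration and are not part of base R. Because it works from the raw counts, its results differ slightly from the rounded values used in the manual calculation.

```r
# Two-sided z-test for a difference in proportions, using the pooled SE.
# y1, y2: number of successes; n1, n2: group sizes.
two_prop_z <- function(y1, n1, y2, n2) {
  p1 <- y1 / n1
  p2 <- y2 / n2
  p  <- (y1 + y2) / (n1 + n2)                  # pooled proportion under H0
  se <- sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # pooled standard error
  z  <- (p1 - p2 - 0) / se                     # test statistic
  p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)
  list(estimate = p1 - p2, se = se, z = z, p_value = p_value)
}

res <- two_prop_z(154, 438, 132, 389)
res$z         # close to the 0.371 found by hand
res$p_value   # well above alpha = 0.05
```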

14.2.5 Using `prop.test()` in R

There is a built-in function in R that reproduces our results (up to rounding). We can use the prop.test() function as:

prop.test(c(154, 132), c(438, 389), correct=F)

## 
##  2-sample test for equality of proportions without continuity
##  correction
## 
## data:  c(154, 132) out of c(438, 389)
## X-squared = 0.13703, df = 1, p-value = 0.7113
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.05264371  0.07717682
## sample estimates:
##    prop 1    prop 2 
## 0.3515982 0.3393316

Be careful about how you pass the data: the first parameter c(154, 132) is the vector of the number of successes for each group, and the second parameter c(438, 389) is the vector of the sample sizes of each group. As shown, use c() to create these vectors based on your data. Also, it is important to turn off the continuity correction using correct=F so that the results match our manual calculation.

With the continuity correction turned off, prop.test() runs the same test we did by hand, but it works from the raw counts rather than from rounded intermediate values. In the above analysis we rounded the proportions and the standard error to four digits, so be aware of the potential for slight differences in both the test statistic and the p-value (here, 0.7106 vs. 0.7113).

Part of the output of this function call is X-squared = 0.13703. This is (approximately) the value of our previously calculated test statistic (0.3712) “squared”:

0.3712^2

## [1] 0.1377894

These two values differ only because of the rounding in our manual calculation, and except in rare cases both approaches should lead to the same conclusion.
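One way to check the relationship between the two statistics is to redo the manual calculation without rounding and compare it to the value stored in the prop.test() result. A short sketch, with variable names chosen for illustration:

```r
# Unrounded manual z statistic with the pooled SE
y <- c(154, 132)
n <- c(438, 389)
p_pooled <- sum(y) / sum(n)
se <- sqrt(p_pooled * (1 - p_pooled) * sum(1 / n))
z  <- (y[1] / n[1] - y[2] / n[2]) / se

# Same test via prop.test(), without the continuity correction
pt_out <- prop.test(y, n, correct = FALSE)

z^2                        # squared z statistic
unname(pt_out$statistic)   # X-squared reported by prop.test()
```

When nothing is rounded along the way, the squared z statistic and X-squared agree.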

14.2.6 Optional: The null hypothesis, point estimate and test statistic

Imagine we have two different groups, each with their own number of trials and successes. Let \(Y_1\) be the number of successes in the first group and \(n_1\) be the total number of participants. Then \(p_1 = \frac{Y_1}{n_1}\). Similarly, \(p_2 = \frac{Y_2}{n_2}\).

Our null hypothesis (\(H_0\)) here is there is no statistically significant difference between the proportions of the two groups. In particular, we can quantify this as \(p_1-p_2 = 0\). Our alternative hypothesis is that there is some difference, i.e. \(p_1-p_2 \ne 0\).

To determine our test statistic, we start with \(\hat p_1-\hat p_2\) (using the “hats” because it’s observed data), which we will then scale by the standard error. The value \(\hat p_1-\hat p_2\) is our point estimate. This represents our best estimate of the difference between the two groups.

Our general approach will be: \[Z = \frac{(point\ estimate - null\ value)}{ standard\ error}\]

So, in this case, under the null hypothesis of no difference, our test statistic is then going to be \(\frac{(\hat p_1-\hat p_2) - 0}{SE}\). Importantly, this has a distribution of \(N(0, 1)\).

Note: This is equivalent to the transformation we used when we said that if \(X \sim N(\mu, \sigma)\), then \(\frac{X-\mu}{\sigma} \sim N(0,1)\).

14.2.7 Optional: Calculating the Standard Error

For comparing the proportions of two groups, we want to know about the standard error of \(p_1-p_2\). In chapter 11 we found the standard error of one group, \(p\), as \(SE = \sqrt{\frac{p(1-p)}{n}}\). We’ll see a similar, although slightly more complicated result here.

The standard error of \(p_1-p_2\) is the square root of the variance of \(p_1-p_2\).

As previously discussed, “variances add” (here the two samples are independent of each other), which means we can write:

\[Var[p_1-p_2] = Var[p_1] + Var[p_2]\]

Now, we know \(Var[p_1] = \frac{p_1(1-p_1)}{n_1}\) (see chapter 11) and so we can rewrite the above equation as:

\[Var[p_1-p_2] = \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}\]

and so for comparing the proportions of two groups, our standard error is:

\[SE = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\]

This is the general case and is true regardless of \(H_0\).

Now, when we assume that \(p_1\) and \(p_2\) are the same (which is a typical \(H_0\)), then we are really assuming that they both equal the overall \(p\) for the combined or ‘pooled’ groups. This calculation, repeated from above, is:

\[p = \frac{Y_1+Y_2}{n_1+n_2} = \frac{p_1n_1+p_2n_2}{n_1+n_2}\]

In this special case we can substitute \(p_1=p\) and \(p_2=p\) and then simplify our standard error as:

\[SE = \sqrt{\frac{p(1-p)}{n_1} + \frac{p(1-p)}{n_2}} = \sqrt{p(1-p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}\]

which is the result presented above. Note this is similar to our one sample standard error, except that here we basically “average” (in a slightly odd way) the two group sizes.
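The simplification can also be checked numerically. In the sketch below, `se_general` and `se_pooled` are hypothetical helper names (not built-in functions) that implement the two formulas; when both groups are given the same proportion, they return the same value.

```r
# General (unpooled) standard error for a difference in proportions
se_general <- function(p1, n1, p2, n2) {
  sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
}

# Pooled standard error, for the special case where both groups share p
se_pooled <- function(p, n1, n2) {
  sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
}

# With p1 = p2 = p (the off-shore drilling pooled proportion), they agree
p <- 286 / 827
se_general(p, 438, p, 389)
se_pooled(p, 438, 389)
```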

14.2.8 Inference on the True difference between \(p_1\) and \(p_2\)

As before we can also compute a confidence interval on the true value of \(p_1-p_2\). Here though we’ll use our more general calculation of SE (from the previous section), since we don’t know \(p\):

\[SE = \sqrt{\frac{\hat p_1(1-\hat p_1)}{n_1} + \frac{\hat p_2(1-\hat p_2)}{n_2}}\]

Plugging in the values for the off-shore drilling question we find:

se <- sqrt(0.3516*(1-0.3516)/438 + 0.3393*(1-0.3393)/389)
se

## [1] 0.03311772

which is very similar to our previous estimate (i.e. when we just used the overall proportion).

We can then use the \(\hat p_1-\hat p_2 \pm 1.96*SE\) approach to find the limits of the 95% confidence interval on the true value of \(p_1-p_2\) as:

(0.3516-0.3393)-1.96*0.03312
(0.3516-0.3393)+1.96*0.03312

## [1] -0.0526152
## [1] 0.0772152

Note that here the center of our interval is the observed difference in proportions (i.e. the point estimate).
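The interval calculation can be wrapped in a small function that works from the raw counts. This is a sketch only; `two_prop_ci` and its arguments are names invented here for illustration.

```r
# Confidence interval for p1 - p2 using the unpooled standard error.
# y1, y2: number of successes; n1, n2: group sizes.
two_prop_ci <- function(y1, n1, y2, n2, level = 0.95) {
  p1 <- y1 / n1
  p2 <- y2 / n2
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  z_star <- qnorm(1 - (1 - level) / 2)   # about 1.96 for a 95% interval
  (p1 - p2) + c(lower = -1, upper = 1) * z_star * se
}

ci <- two_prop_ci(154, 438, 132, 389)
ci
```

The limits should be close to the interval reported in the prop.test() output earlier in the chapter, since that output is also based on the unrounded counts.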

What do you notice about the interval? What value does it contain?

14.2.9 Guided Practice

In a random sample of 1500 First Nations children in Canada, 162 were in child welfare care. In a different random sample of 1600 non-Aboriginal children, 23 were in child welfare care. Many people believe that the large proportion of indigenous children in government care is a humanitarian crisis. Do these data give significant evidence that a greater proportion of First Nations children in Canada are in child welfare care than the proportion of non-Aboriginal children in child welfare care?

(from Barron’s AP Stats p272)

Do this (i) manually and (ii) using the prop.test() function in R. Confirm that you get the same results.

14.2.10 Summary of the Hypothesis Testing Statistics

value	Difference in Proportions (for test of no difference)
null hypothesis	\(H_0: p_1-p_2 = 0\)
null distribution	\(N(0, 1)\)
point estimate	\(\hat p_1 - \hat p_2\)
SE	\(SE = \sqrt{p(1-p)(\frac{1}{n_1} + \frac{1}{n_2})}\), where \(p = \frac{Y_1+Y_2}{n_1+n_2}= \frac{\hat p_1n_1+\hat p_2n_2}{n_1+n_2}\)
test statistic	\(\frac{(\hat p_1-\hat p_2) - 0}{SE}\)

Note that I’ve introduced the term point estimate here, to distinguish the observed difference between our proportions from the test statistic, particularly since the test statistic has its scaled form. The point estimate is our best estimate of the true difference between the groups.

14.2.11 Review of Learning Objectives

After this section, you should be able to:

Run a hypothesis test for the difference of proportions
Describe the calculations of the standard error, test statistic and null distribution for hypothesis testing of difference of proportions
Calculate a confidence interval on the true difference between two observed proportions

14.3 Difference of Means & Paired Data

In this section, we’ll continue to look at situations where we compare two samples, now examining those cases where we are analyzing continuous data. As was the case in chapter 12, because of concerns about small samples and not knowing \(\sigma\), we will typically use the t-distribution here as our null distribution.

Specifically, here we will look at two types of data and hypothesis tests:

difference of means, where we have two data sets from two different populations that are not necessarily related or connected, and where we want to test if the means are the same or different. Note that the two sample sizes do NOT need to be the same.
paired data, where we have two data sets and where each observation in one set has a corresponding observation in the second data set. Think about patients who are measured twice (before and after?) or locations where the temperature (or other environmental variable) is measured on different dates. Here, we will often want to test if the means of the before and after group are the same or different.

14.3.1 Learning Objectives

After this section, you should be able to:

Run a hypothesis test for both the difference of means and paired data
Describe the calculations of the standard error, test statistic and null distribution for hypothesis testing of difference of means and paired data
Explain the use of the point estimate
Calculate a confidence interval on the true difference between means or paired data

14.3.2 Summary of the Hypothesis Testing Statistics

Instead of describing the tests in detail, here we will start with a summary table of the statistics used in our hypothesis testing approach:

value	Difference of Means	Paired Data
null hypothesis	\(H_0: \mu_1-\mu_2 = 0\)	\(H_0: \mu_{diff} = 0\)
null distribution	\(t_{df}\)	\(t_{df}\)
degrees of freedom, \(df\)	\(min(n_1-1, n_2-1)\)	\(n_{diff} - 1\)
critical values (at \(\alpha=0.05\))	`qt(0.025, df), qt(0.975, df)`	`qt(0.025, df), qt(0.975, df)`
point estimate	\(\bar X_1 - \bar X_2\)	\(\bar X_{diff} = \frac{1}{n}\sum_i (X_{1i} - X_{2i})\)
Standard Error, \(SE\)	\(\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}\)	\(\frac{s_{diff}}{\sqrt{n_{diff}}}\)
test statistic	\(\frac{(\bar X_1 - \bar X_2)}{SE}\)	\(\frac{\bar X_{diff}}{SE_{diff}}\)

where again I’m using the term point estimate. Importantly, we compare the point estimate to the null hypothesis whereas we compare the test statistic to the null distribution. This is a useful distinction when we’re scaling the point estimate (and converting it into the test statistic - the “t-score”) before comparing it to our null distribution.

14.3.3 Notes on Difference of Means Testing

When testing difference of means, we’ll take samples from each group, and then have two separate observed means and sample sizes: \(\bar X_1\) and \(n_1\) for the first group and \(\bar X_2\) and \(n_2\) for the second group. Each group also has an observed standard deviation, \(s_1\) and \(s_2\) respectively. Since we’re using the t-distribution we need to calculate a degrees of freedom, and for that we’ll use one less than the smaller group size. Finally note that the form of the standard error is slightly different than before, particularly in that we are taking the square root of the whole expression and the standard deviations are squared (aka variances).
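The calculations in the summary table can be sketched as a function that takes summary statistics. The function name `diff_means_test` and the numbers passed to it are hypothetical, chosen only to show the mechanics.

```r
# Difference-of-means t-test from summary statistics, following the
# summary table: SE = sqrt(s1^2/n1 + s2^2/n2), df = min(n1 - 1, n2 - 1).
diff_means_test <- function(xbar1, s1, n1, xbar2, s2, n2) {
  se <- sqrt(s1^2 / n1 + s2^2 / n2)
  df <- min(n1 - 1, n2 - 1)
  t_stat <- (xbar1 - xbar2 - 0) / se
  p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)  # two-sided
  list(estimate = xbar1 - xbar2, se = se, df = df,
       t = t_stat, p_value = p_value)
}

# Hypothetical summary statistics, for illustration only
out <- diff_means_test(xbar1 = 10.2, s1 = 2.0, n1 = 16,
                       xbar2 = 9.0,  s2 = 3.0, n2 = 25)
out$se   # sqrt(4/16 + 9/25)
out$df   # one less than the smaller group size
out$t
out$p_value
```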

There is some disagreement about the proper calculation of degrees of freedom for difference of means testing. Some texts suggest using \(n_1+n_2-2\) and some suggest a more complicated formula that we will not discuss here. Using \(min(n_1-1, n_2-1)\) as suggested above is a conservative approach. Suffice it to say that p-values shouldn’t vary too much based on the approach used to choose the degrees of freedom, and you should remember that if your p-value is close to \(\alpha\), caution in your conclusion is warranted.
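To see how the choice of degrees of freedom plays out, we can compute the two-sided p-value for the same test statistic under both simple choices. The test statistic and group sizes below are made up for illustration.

```r
# Hypothetical test statistic and group sizes, for illustration only
t_stat <- 2.1
n1 <- 12
n2 <- 15

# Two candidate choices for the degrees of freedom
df_conservative <- min(n1 - 1, n2 - 1)   # 11
df_pooled       <- n1 + n2 - 2           # 25

# Two-sided p-values under each choice
p_conservative <- 2 * pt(abs(t_stat), df_conservative, lower.tail = FALSE)
p_pooled       <- 2 * pt(abs(t_stat), df_pooled, lower.tail = FALSE)

c(conservative = p_conservative, pooled = p_pooled)
```

The conservative choice gives the larger p-value because fewer degrees of freedom means heavier tails. In this made-up case the two p-values land on opposite sides of 0.05, which is exactly the kind of borderline result where caution is warranted.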

14.3.4 Notes on Paired Data Testing

The purpose of paired data testing is to take two (repeated) samples on a given experimental unit and to evaluate if the average difference is reasonably equal to zero (or some other prior value).

Here we have two samples of equal sizes \(n\) and all data points will be “paired”, again think two measurements from the same experimental unit. The typical situation is that the two samples result because each unit was measured twice.

To calculate the point estimate, we first create a vector of length \(n=n_{diff}\) which is the observed difference between the two measurements: \(X_{diff} = X_1-X_2\). We then take the average of this \(X_{diff}\) vector, and this average becomes our best estimate of the true difference between the samples. We also calculate the standard deviation of this vector to use in our standard error calculation.

In essence, under paired data testing, once we’ve created the “difference” vector, we then proceed as we would under a one sample t-test approach.
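The sketch below illustrates this equivalence on a small set of made-up paired measurements: the manual calculation on the difference vector, R's `t.test()` with `paired = TRUE`, and a one sample `t.test()` on the differences all give the same test statistic.

```r
# Hypothetical paired measurements on six units, for illustration only
x1 <- c(12.1, 11.4, 13.0, 10.8, 12.6, 11.9)
x2 <- c(11.5, 11.6, 12.1, 10.2, 12.0, 11.1)

# Step 1: the difference vector
x_diff <- x1 - x2

# Step 2: point estimate, standard error, test statistic, p-value
n_diff <- length(x_diff)
se     <- sd(x_diff) / sqrt(n_diff)
t_stat <- (mean(x_diff) - 0) / se
p_val  <- 2 * pt(abs(t_stat), df = n_diff - 1, lower.tail = FALSE)

# The same test via t.test(): paired, or one sample on the differences
paired_out    <- t.test(x1, x2, paired = TRUE)
onesample_out <- t.test(x_diff, mu = 0)

c(manual     = t_stat,
  paired     = unname(paired_out$statistic),
  one_sample = unname(onesample_out$statistic))
```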

14.3.5 Inference for Difference of Means and Paired Data

To create confidence intervals, in both cases, we’ll use our point estimate (i.e. observed results) as the center of the interval, and we’ll add/subtract an appropriate number of standard errors, depending on our confidence level.

For example, a 95% confidence interval on the true difference of means using a t-distribution with 20 degrees of freedom would be:

\[\bar x_1-\bar x_2 \pm 2.086*SE\]

since

qt(0.025, 20)

## [1] -2.085963

Similarly, a 95% confidence interval on the true difference of paired data using a t distribution with 15 degrees of freedom would be

\[\bar x_{diff}\pm 2.131 *SE \]

since

qt(0.025, 15)

## [1] -2.13145
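More generally, the multiplier depends on both the confidence level and the degrees of freedom. A small helper makes the relationship explicit; `t_multiplier` is a name introduced here for illustration, not a built-in function.

```r
# Multiplier for a two-sided confidence interval from a t-distribution.
# `level` is the confidence level, `df` the degrees of freedom.
t_multiplier <- function(level, df) {
  alpha <- 1 - level
  qt(1 - alpha / 2, df)
}

t_multiplier(0.95, 20)   # matches the 2.086 used above
t_multiplier(0.95, 15)   # matches the 2.131 used above
t_multiplier(0.90, 15)   # a 90% interval uses a smaller multiplier
```

Note that the tail area passed to `qt()` is half of \(1 - \text{level}\), because the interval leaves equal area in each tail.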

14.3.6 Guided Practice

We have now done enough hypothesis testing, and the overall framework is similar enough, that you should be able to proceed without a worked example. We’ll instead jump straight to guided practice, using the summary table above.

Difference in Means:

The following summary statistics represent a random sample of 150 cases of mothers and the birth weights of their newborn infants in North Carolina over a year. (p269 os4.pdf) Our question: Is there a statistically significant difference between the birth weights of infants born to mothers who smoke compared to those who don’t?

statistic	smoker	nonsmoker
sample mean, \(\bar x\)	6.78	7.18
sample standard deviation, \(s\)	1.43	1.60
sample size, \(n\)	50	100

Use a difference of means test to calculate this at the \(\alpha=0.05\) level.
What is a 95% confidence interval around the true difference in means?

You are interested in evaluating the tensile strength of two different filaments (ABS vs PLA) used in 3D printing. Tensile strength is the maximum stress a material can withstand before breaking when being pulled apart, measured in Newtons/\(m^2\) or Pascals.

You run an experiment and collect the following data. What are the average tensile strengths (\(N/m^2\)) of the two materials? Is there a statistically significant difference between the two materials in terms of their average tensile strength?

abs <- c(30.6, 28.1, 28.9, 23.6, 29.5, 31.7, 28.2, 28.7, 32.4, 26.6, 35.7)
pla <- c(34.1, 37.0, 32.4, 31.4, 29.8, 36.2, 34.4, 32.4, 30.5, 34.4, 28.0, 36.3)

Paired Data:

Let’s consider a limited set of climate data, examining temperature differences in 1948 vs 2018. We sampled 197 locations from the National Oceanic and Atmospheric Administration’s (NOAA) historical data, where the data was available for both years of interest. We want to know: Are there statistically significant differences in the number of days with temperatures exceeding 90°F between 2018 and 1948?

First we calculated the number of days exceeding 90°F at each location in both 1948 and 2018. The 1948 data represents one sample and the 2018 data represents a second sample, both of length \(n_{diff}=197\).

Next, we determined the difference in number of days exceeding 90°F (number of days in 2018 - number of days in 1948) for each of the 197 locations. The average of these differences was \(\bar X_{diff} = 2.9\) days with a standard deviation of \(s_{diff} = 17.2\) days.

Perform a paired data hypothesis test to see if this difference is statistically significant at \(\alpha = 0.05\). Be sure to write out your null hypothesis and follow all of our steps. What do you conclude?
Calculate a 90% confidence interval on the true value of the difference in number of days. How does your confidence interval relate to your results from part (a)?

Imagine a certain medical test where 18 patients are measured before and after treatment:

before <- c(19.6, 22.3, 21.6, 18.2, 21.2, 20.0, 20.5, 21.4, 20.3, 19.9, 18.7, 19.3, 20.8, 19.4, 20.8, 22.6, 20.9, 22.1)
after <- c(19.6, 20.0, 22.0, 20.5, 22.4, 23.1, 20.3, 23.1, 24.5, 21.6, 21.4, 23.6, 22.8, 20.4, 21.2, 22.8, 20.0, 21.9)

(NOTE: If you’re given both data sets, you should first create a new vector that is the difference between our groups. Then, proceed similar to how you would with a one sample test.)

Create a vector of length 18 that contains the changes per patient that occurred after treatment.
Plot a histogram of this new vector. What do you notice? What does this histogram suggest about whether a difference exists in the sample as a result of treatment?
Calculate the mean and standard deviation of your vector.
Run a hypothesis test at \(\alpha = 0.10\) to evaluate if there is a statistically significant difference in patient results before and after treatment.

14.3.7 Review of Learning Objectives

After this section, you should be able to:

Run a hypothesis test for both the difference of means and paired data
Describe the calculations of the standard error, test statistic and null distribution for hypothesis testing of difference of means and paired data
Explain the use of the point estimate
Calculate a confidence interval on the true difference between means or paired data

14.4 Review of One-sided and Two-sided tests

We discussed one-sided and two-sided tests in our chapter on one sample hypothesis testing. The same ideas generally apply to two sample testing although it can be a bit more complicated.

The particular issue to be aware of is that how we order our samples matters and it depends on whether we’re looking for results that are bigger or smaller than 0. This order needs to be correct and internally consistent with respect to the following three (3) steps:

the null and alternative hypothesis
the point estimate
the side where the critical value lies

14.4.1 Some Guidelines

Suppose we really want to know if the mean/proportion of group \(A\) is bigger than that of group \(B\). We will conclude that it is ONLY IF there is ample evidence that \(A\) is much bigger than \(B\), where the difference didn’t occur simply by randomness.

This guides us that we will reject \(H_0\) only if our test statistic is too big.

Hence our null hypothesis will be that \(A\) isn’t bigger than \(B\) (i.e. \(A\) is either smaller or the same as \(B\)), which we’ll write as \(H_0: A \le B\), and in line with this our alternative hypothesis will then be that \(A\) is actually bigger than \(B\), so \(H_A: A > B\).

Based on how we’ll reject, (only if the difference is too big) we only want an upper critical value. And finally our test statistic needs to be \(\frac{A-B}{SE}\), because if this is big (positive) it means that \(A\) is bigger than \(B\).

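The same approach works for testing if the mean/proportion of group \(C\) is less than that of group \(D\), with the directions reversed: \(H_0: C \ge D\), \(H_A: C < D\), the test statistic is \(\frac{C-D}{SE}\), and we only want a lower (negative) critical value.

(Note that if we wrote the hypotheses the other way, with \(H_0: A > B\), we’d be assuming what we want to show is true to start, and only rejecting \(H_0\) if the difference, the other way, was too large. This doesn’t work because \(H_0\) would not allow for \(A\) and \(B\) to be the same.)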
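As an illustration of keeping the three pieces consistent, the sketch below reuses the off-shore drilling counts to ask a one-sided question: is the proportion of college graduates who support drilling (group A) greater than that of non-graduates (group B)? The labels A and B and the variable names are ours; the `alternative` argument is how prop.test() is told the direction of the test.

```r
# H0: pA <= pB, HA: pA > pB
y <- c(A = 154, B = 132)
n <- c(A = 438, B = 389)

p_pooled <- sum(y) / sum(n)
se <- sqrt(p_pooled * (1 - p_pooled) * sum(1 / n))
z  <- (y[["A"]] / n[["A"]] - y[["B"]] / n[["B"]]) / se   # (A - B) / SE

crit_upper  <- qnorm(0.95)                    # upper critical value only
p_one_sided <- pnorm(z, lower.tail = FALSE)   # area above z

# Same test with prop.test(), listing group A first
pt_out <- prop.test(y, n, alternative = "greater", correct = FALSE)

z > crit_upper    # FALSE here, so we fail to reject H0
p_one_sided
pt_out$p.value
```

The order of the groups in `y` and `n` must match the order used in the hypotheses and the test statistic; swapping them would require `alternative = "less"` instead.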

14.4.2 General Examples

1. How do we write the null and alternative hypothesis for these cases?

We have two different observed proportions \(\hat p_a\) and \(\hat p_b\) and we want to know if \(p_a\) is greater.

\(H_0: p_a \le p_b\), \(H_A: p_a > p_b\)

We have samples from two different groups, \(\bar x_a\) and \(\bar x_b\) and we want to know if group b has a lower mean.

\(H_0: \mu_b \ge \mu_a\), \(H_A: \mu_b < \mu_a\)

We have before and after samples from a set of patients measuring health outcomes and we want to know if patients were improved (assume a lower test value is better) after the intervention.

Let \(\mu_d\) be the true mean of the difference, calculated as “after - before” (i.e. first subtract those vectors then take the mean). Then, \(H_0: \mu_d \ge 0\), \(H_A: \mu_d < 0\)

In summary, the alternative hypothesis typically describes what we want to test or what we’re hoping to find.

2. When we’re calculating a critical value, particularly when our Null distribution is \(N(0,1)\) or a \(t\) distribution, how does the sign of the critical value relate to the inequality in the null hypothesis?

We only need to care about the sign of the critical value because both the t-distribution and the \(N(0,1)\) distribution are centered around 0.

In fact it’s the inequality in the alternative hypothesis that points to where the critical value is located. If the alternative hypothesis is \(H_A: \mu_b < \mu_a\), then our point estimate will be \(\bar x_b - \bar x_a\) and our critical value will be negative. If the inequality points the other way, the critical value will be positive.

3. Specifically, how do we calculate the one-sided critical value when we have a \(N(0,1)\) or a \(t\)-distribution with \(df\) degrees of freedom?

Assuming \(\alpha=0.05\), for a lower critical value we would use either the qnorm(0.05) or qt(0.05, df) functions. Similarly, for an upper critical value, we would use the qnorm(0.95) or qt(0.95, df).
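These calls can be run side by side to see the signs and sizes of the critical values. The degrees of freedom below are a hypothetical value chosen for illustration.

```r
alpha <- 0.05
df <- 20   # hypothetical degrees of freedom, for illustration only

# Lower (negative) critical values: all of alpha sits in the left tail
qnorm(alpha)
qt(alpha, df)

# Upper (positive) critical values: all of alpha sits in the right tail
qnorm(1 - alpha)
qt(1 - alpha, df)

# For comparison, a two-sided test splits alpha between both tails
qnorm(c(alpha / 2, 1 - alpha / 2))
```

Because both distributions are symmetric around 0, the lower and upper critical values differ only in sign.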

14.4.3 Guided Practice

Answer the following questions in each of the given scenarios:

A. The federal government deems it “safe” if the average level of lead in water is less than 15 ppb (parts per billion). You collect data from a local school. What are appropriate null and alternative hypotheses if you want to evaluate if the sample you just collected is above the safe limit? What type of hypothesis test would you use? At \(\alpha=0.05\), what would your critical value be?

B. Forest thinning (the process of removing small trees and shrubs) is generally accepted as a method for reducing wildfire, but what are the other impacts? You measure 12 plots of land before and after thinning treatments and count the number of species present. What are appropriate null and alternative hypotheses to evaluate if forest thinning reduces biodiversity (species counts)? What type of hypothesis test would you use? At \(\alpha=0.05\), what would your critical value be?

C. The EPA annually releases fuel economy data on cars manufactured in that year. What are appropriate null and alternative hypotheses to test if automatic transmissions have higher estimated fuel economy ratings than manual transmissions? What type of hypothesis test would you use? Assuming you select \(n_1 =25\) cars with automatic transmissions and \(n_2=21\) cars with manual transmissions, at \(\alpha=0.10\) what would your critical value be?

D. Are allele frequencies different in different sub-populations? You take a sample of two different populations and measure the absence or presence of a specific allele. What are appropriate null and alternative hypotheses to evaluate if the proportion in which the given allele occurs in population A is less than the proportion in which it occurs in population B? What type of hypothesis test would you use? At \(\alpha=0.01\), what would your critical value be?

14.5 Exercises

14.5.1 Difference in Proportions

Exercise 14.1 An experiment is run to see if studying a year of Latin can increase verbal SAT scores. Each student’s first SAT verbal score was compared with their PSAT verbal score, and it was noted whether the score increased by at least 100 points or not. The following table shows the results.

What is your point estimate of the difference in the two groups? Explain it in context. Based on this, what is your guess about whether studying Latin helps improve test scores?
Run a hypothesis test at \(\alpha=0.05\) to evaluate if there is a statistically significant difference in SAT results for studying Latin or not. Be clear about all of the steps.
Comment about why these results may not be generally applicable based on the experimental design.

increase in score	\(<= 100\)	\(>100\)	total
studied Latin	6	14	20
didn’t	11	8	19

Exercise 14.2 Repeat the analysis from the previous problem using the prop.test() function in R. How do your results compare?

Exercise 14.3 Using the data on studying Latin from above, create a 95% confidence interval on the true difference between the proportions. Note that for confidence intervals on the true difference in proportions you should use the following formula for the standard error, with \(z^* = 1.96\) for a 95% confidence interval.

\[ \hat p_1 - \hat p_2 \pm z^* \times \sqrt{\frac{\hat p_1(1-\hat p_1)}{n_1}+\frac{\hat p_2(1-\hat p_2)}{n_2}}\]

What do the results tell you? How are your results consistent or inconsistent with the conclusion from your hypothesis test above?
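
The confidence interval formula above can be sketched as a small R function. The function name `diff_prop_ci` and its arguments are hypothetical, and the counts in the example call are made up rather than taken from the exercise.

```r
# Confidence interval for p1 - p2 using the unpooled standard error.
# x1, x2: counts of "successes"; n1, n2: sample sizes (hypothetical helper).
diff_prop_ci <- function(x1, n1, x2, n2, conf_level = 0.95) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  z_star <- qnorm(1 - (1 - conf_level) / 2)  # about 1.96 for 95%
  estimate <- p1 - p2
  c(estimate = estimate,
    lower = estimate - z_star * se,
    upper = estimate + z_star * se)
}

# Made-up counts, for illustration only.
ci <- diff_prop_ci(x1 = 30, n1 = 100, x2 = 45, n2 = 110)
ci
```

If the interval contains 0, the data are consistent with no difference between the two proportions at that confidence level.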

Exercise 14.4 Consider an experiment for patients who underwent cardiopulmonary resuscitation (CPR) for a heart attack and were subsequently admitted to a hospital. These patients were randomly divided into a treatment group where they received a blood thinner or the control group where they did not receive a blood thinner. The response variable of interest was whether the patients survived for at least 24 hours. Results are shown in the table below.

Perform a hypothesis test (by hand) to evaluate if there is a statistically significant difference at \(\alpha = 0.05\) between survival in the control and treatment groups.

n	Survived	Died	Total
Control	11	39	50
Treatment	14	26	40
Total	25	65	90

Exercise 14.5 Using the CPR data from above, create a 95% confidence interval on the true difference between the proportions.

What do the results tell you? How are your results consistent or inconsistent with your answers above?

Exercise 14.6 A supplier of parts for Boeing has two different factories that produce the same part. They are interested in understanding if the factories yield similar quality (i.e. in terms of the percentage of parts that pass). The following table gives results from a recent production run at each:

Run a hypothesis test at \(\alpha=0.05\) to evaluate if the yield rate (% that pass) between the two factories is the same. What do you conclude?
What is a 90% confidence interval on the true difference between the yields from the two factories?

quality	factory A	factory B
passed	850	920
failed	50	80
total	900	1000

Exercise 14.7 A drone company is considering a new manufacturer for rotor blades. The new manufacturer would be more expensive, but they claim their higher-quality blades are more reliable, with 3% more blades passing inspection than their competitor. The quality control engineer collects a sample of blades, examining 1000 blades from each company, and she finds that 899 blades pass inspection from the current supplier and 958 pass inspection from the prospective supplier. Perform a hypothesis test at the \(\alpha = 0.05\) level to evaluate the new manufacturer’s claim, i.e. whether the difference between the reliability of the current blades and the new blades is statistically different from 3%.

14.5.2 Difference in Means

Exercise 14.8 A city council member claims that male and female police officers wait equal times for promotion in the police department. A women’s spokesperson, however, argues that women wait longer than men. A random sample of men show they waited 8, 6.5, 9, 5.5 and 7 years for promotion, while a random sample of women waited 9.5, 5, 11.5, 8 and 10 years for promotion.

What conclusion should be drawn? Be explicit about your null and alternative hypotheses, point estimate, standard error, test statistic and p-value.
What steps should be done to strengthen your conclusion?

(modified from Barron’s AP Stats p297)

Exercise 14.9 PM\(_{2.5}\) measures the concentration of fine particulate matter in the air. This measurement accounts for very small particles which can bypass the natural defenses in the human respiratory systems and embed themselves in the lungs or migrate into the bloodstream. Repeated exposure can impact lung and cardiovascular function.

For reference, the EPA considers values between 0 and 10 \(\mu g/m^3\) to be under the WHO target, 10.1 to 12 to be “good”, 12.1-35.4 to be “moderate”, 35.5 to 55.4 to be “unhealthy for sensitive groups” and above 55.5 to be “unhealthy” or worse.

In 2019, California had 9 of the top 10 most polluted cities in the US, and one of the worst cities was Walnut Park (part of Los Angeles). The city of Sunnyside, WA (near Yakima) ranked as the city with the highest pollution in Washington state (in terms of PM\(_{2.5}\)) in 2019. The following data show the monthly averages in 2019 for both Walnut Park, CA and Sunnyside, WA.

Is there a statistically significant difference (at \(\alpha=0.05\)) in average air quality in these two cities?
What is a 90% confidence interval on the true average difference in PM\(_{2.5}\) between these cities?

(All data taken from iqair.com)

Walnut_Park <- c(24.5, 12.6, 11, 13, 11.1, 19.1, 20.6, 16.8, 11.5, 16.2, 24.9, 19.9)
Sunnyside <- c(10.5, 14.6, 13.8, 5.3, 7.6, 7.5, 10.3, 10.2, 6.4, 9.2, 22.1, 14.3)
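
For reference, the calling pattern for a difference of means test with `t.test()` is sketched below on made-up data (not the PM\(_{2.5}\) values). R's default is Welch's test, which computes its own degrees of freedom. A hand calculation that uses a more conservative rule such as \(\min(n_1-1, n_2-1)\) would share the same test statistic but could give a slightly different p-value and interval. The vector names are illustrative assumptions.

```r
# Hypothetical measurements from two independent groups.
group_a <- c(12.1, 9.8, 11.4, 10.7, 13.2, 10.1)
group_b <- c(9.5, 8.7, 10.9, 9.1, 8.4, 10.2, 9.9)

# Two-sided test of H0: mu_a = mu_b. R's default is Welch's test
# (var.equal = FALSE), which does not assume equal variances.
res <- t.test(group_a, group_b, alternative = "two.sided", conf.level = 0.90)
res$statistic  # t statistic
res$parameter  # Welch degrees of freedom (generally not a whole number)
res$p.value
res$conf.int   # 90% confidence interval on mu_a - mu_b

# The t statistic by hand: point estimate divided by the standard error.
se <- sqrt(var(group_a) / length(group_a) + var(group_b) / length(group_b))
t_stat <- (mean(group_a) - mean(group_b)) / se
t_stat
```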

Exercise 14.10 Write a custom R script or function that allows you to run a hypothesis test on the difference of means.

14.5.3 Paired Data

Exercise 14.11 Do Amazon books sell for less than those at your local bookstore? The following table shows the average difference in prices and standard deviation of the difference in prices for \(n=28\) books for sale on both Amazon and at Island Books on Mercer Island. \(x_{diff}\) represents the Island Books price minus the Amazon price.

Based on this data, run a hypothesis test at \(\alpha=0.05\) to evaluate if Amazon generally sells books for less than your local bookstore. What do you conclude?

\(n_{diff}\)	\(\bar x_{diff}\)	\(s_{diff}\)
28	2.18	11.45

Exercise 14.12 Forest thinning (removing small trees and shrubs) is generally accepted as a method for reducing wildfire, but what are the other impacts? You are interested in understanding how thinning affects biodiversity. You measure 12 plots of land, each \(1 \times 1\) meter, right before and 1 year after thinning treatments and count the number of plant species present at each time. The following data represent the counts of species before and after treatment.

Use a paired data test to run a hypothesis test at \(\alpha = 0.10\) to evaluate if there is a statistically significant difference in the number of species before and after forest thinning treatments. What do you conclude and why?

before <- c(5, 12, 8, 5, 7, 5, 7, 4, 6, 5, 8, 11)
after <- c(6, 7, 6, 3, 6, 4, 3, 8, 5, 4, 4, 9)
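
A paired data test is a one-sample t-test on the within-unit differences. The sketch below illustrates that equivalence on made-up data (not the species counts); the names `time1` and `time2` are hypothetical.

```r
# Hypothetical paired measurements on the same five units.
time1 <- c(10, 14, 9, 12, 11)
time2 <- c(8, 13, 10, 9, 9)

# Paired test of H0: mu_diff = 0 against a two-sided alternative.
paired <- t.test(time1, time2, paired = TRUE)

# Equivalent one-sample test on the differences.
d <- time1 - time2
one_sample <- t.test(d, mu = 0)

# By hand: t = x_bar_diff / (s_diff / sqrt(n)), with n - 1 degrees of freedom.
n <- length(d)
t_stat <- mean(d) / (sd(d) / sqrt(n))
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)

c(paired = unname(paired$statistic),
  one_sample = unname(one_sample$statistic),
  by_hand = t_stat)
```

All three approaches should give the same test statistic; for a one-sided alternative, set `alternative = "less"` or `alternative = "greater"` in `t.test()` and use a single tail of `pt()` in the hand calculation.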

Exercise 14.13 Write a custom R script or function that allows you to run a hypothesis test on paired data (a) given summary data and (b) using the raw data.
