6.10 Type I and Type II Errors
Why do Type I and Type II errors arise in statistics?
Statistical tests are often conducted on samples, since it is usually impractical for the analyst or researcher to collect data from every customer or subject being studied. Because a sample may not perfectly represent the population, sampling variability can lead to errors; for instance, the dataset may contain too few customers, or the sample may fail to capture the true characteristics of the population.
In addition, researchers commonly use a conventional cutoff as a benchmark, such as a significance level of 0.05, to decide whether to reject a null hypothesis. This cutoff could be too high or too low, depending on the consequences of the matter being tested. Researchers strive to minimize errors by designing robust studies, choosing an appropriate sample size, and accounting for sampling variability.
The two errors that may occur are a Type I error and a Type II error.
A Type I error occurs when we mistakenly reject an actually true null hypothesis. This is also known as a False Positive.
A Type II error occurs when we mistakenly fail to reject a null hypothesis that is actually false. This is also known as a False Negative. Type II errors are more common when the researcher uses a small sample size, and/or when the sample values have a high degree of variability.
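Both error rates can be made concrete with a quick simulation. The sketch below (a minimal illustration assuming Python with NumPy and SciPy, and invented parameter values) repeatedly runs a one-sample t-test, first when the null hypothesis is true and then when it is false, and counts how often each error occurs:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n, trials = 0.05, 30, 2000

# Type I error rate: here the null (true mean = 0) IS true,
# so every rejection is a false positive.
false_pos = sum(
    stats.ttest_1samp(rng.normal(0.0, 1.0, n), 0.0).pvalue < alpha
    for _ in range(trials)
)

# Type II error rate: here the null IS false (true mean = 0.3),
# so every failure to reject is a false negative.
false_neg = sum(
    stats.ttest_1samp(rng.normal(0.3, 1.0, n), 0.0).pvalue >= alpha
    for _ in range(trials)
)

print(f"Type I error rate:  {false_pos / trials:.3f}")  # close to alpha = 0.05
print(f"Type II error rate: {false_neg / trials:.3f}")
```

Note that the Type I error rate hovers near the chosen alpha by construction, while the Type II error rate depends on the true effect size, the noise, and the sample size.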
How can Type I and Type II errors be controlled?
Controlling for Type I and Type II errors is a delicate balancing act. A researcher or analyst must consider these tradeoffs when designing an experiment, as minimizing the risk of one error often increases the risk of the other.
Several factors influence the consideration:
- Consequences of the errors –
Scenario 1:
Suppose an analyst is tasked with evaluating a disease screening test. The consequences of making a Type I error (i.e. concluding that a person has the disease when the person in fact does not) could cause unnecessary stress for the patient. However, the implications of making a Type II error (i.e. concluding that a person does not have the disease when the person in fact does) could delay crucial treatment. Reducing Type II errors could be prioritized so that early intervention can be sought.
Scenario 2:
A company is conducting market research for a new product. The null hypothesis is that the product will not be successful. A Type I error would mean concluding that the product will be a market success when in fact it will not. In this instance, a company may prioritize reducing the chance of a Type I error because of the financial risk, risk to brand reputation, and potential waste of resources. That could lead to mitigation techniques such as using a lower significance level when running statistical tests, and implementing small pilot launches before committing to a full-scale product rollout.
- Sample size –
Larger sample sizes provide more reliable results because they reduce the standard error, making it easier to detect a true effect.
Suppose we were evaluating the click-through rates of a Facebook campaign; all else equal, a sample of 10 days would be less reliable than a sample of 100 days, as changes in the former would be more susceptible to random variation. However, large sample sizes may not always be possible due to practical constraints such as time and data collection costs. Compromising on the number of observations heightens the risk of a Type II error (false negative). Lowering the significance level can reduce the risk of a Type I error (false positive), but may increase the risk of a Type II error, especially when working with limited data.
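The effect of sample size can be illustrated by simulating that campaign under an assumed true lift. The figures below (a baseline CTR of 1.5%, a true campaign CTR of 1.7%, and a daily standard deviation of 0.5) are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, trials = 0.05, 1000
baseline, true_mean, sd = 1.5, 1.7, 0.5  # hypothetical daily CTR figures (%)

def detection_rate(n_days):
    """Share of simulated campaigns in which the real lift is detected."""
    hits = sum(
        stats.ttest_1samp(rng.normal(true_mean, sd, n_days), baseline).pvalue < alpha
        for _ in range(trials)
    )
    return hits / trials

p10, p100 = detection_rate(10), detection_rate(100)
print(f"lift detected with  10 days of data: {p10:.2f}")
print(f"lift detected with 100 days of data: {p100:.2f}")
```

The Type II error rate is one minus each figure: with only 10 days of data, a genuinely effective campaign is usually missed, while with 100 days it is almost always detected.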
Scenario 3:
A business runs a paid social media campaign on Facebook for 7 days. The objective is to evaluate whether the campaign has significantly increased the click-through rate (CTR) compared to historical data. The analyst’s null hypothesis is that the campaign’s true CTR is no higher than the historical CTR.
The figures are as follows:
Historical CTR: 1.5%
Campaign average CTR: 1.7%
Confidence level: 95%
After using a one-sample t-test to evaluate those 7 data points (more on that test later), we fail to reject the null hypothesis because the p-value is higher than 0.05.
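A sketch of that test in Python with SciPy, using hypothetical daily CTR values invented so that their mean matches the 1.7% campaign average above:

```python
from scipy import stats

# Hypothetical daily CTRs (%) for the 7-day campaign; figures are
# illustrative, chosen so the sample mean is 1.7.
daily_ctr = [2.0, 1.4, 1.9, 1.5, 1.9, 1.5, 1.7]
historical_ctr = 1.5

t_stat, p_value = stats.ttest_1samp(daily_ctr, historical_ctr)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
print("reject H0" if p_value < 0.05 else "fail to reject H0 at alpha = 0.05")
```

With these invented figures, the p-value lands between 0.05 and 0.10: above the 0.05 threshold, but low enough that relaxing the confidence level to 90% would flip the conclusion.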
At this point, the analyst might be tempted to lower the confidence level to 90%, as doing so would make the results statistically significant.
But an after-the-fact adjustment to the threshold is bad practice (it’s a form of something called p-hacking). The higher alpha threshold increases the risk of making a Type I error (in this case, claiming success where it does not actually exist). This could have negative implications for a business, as the firm could overinvest in a strategy that does not actually work.
Instead, it is advisable for the analyst to suggest extending the campaign to collect more data and maintain transparency by reporting results with a disclaimer about the sample size and confidence level.