
6.11 Chi-Square Goodness of Fit


In this chapter, we will explore two types of statistical tests in detail.  The first of these is the chi-square goodness of fit test, also known as the one-sample goodness of fit test or Pearson’s chi-square goodness of fit test.

This test is used when we want to know if the observed distribution of a categorical variable is different from our expectations. As with any statistical test, certain assumptions about the data must be met before the chi-square goodness of fit test can be applied:

  • There must be one categorical variable, e.g. gender, profession, or a 5-point Likert scale indicating ‘strongly agree’, ‘agree’, ‘no opinion’, ‘disagree’, or ‘strongly disagree’.
  • Samples are randomly selected. This is true across all inferential statistical tests.
  • The groups within the categorical variable must be mutually exclusive. For instance, you would not be able to apply this test to survey questions that allowed you to ‘select all that apply’.
  • There must be at least five expected responses in each group of the categorical variable.

To generate a chi-square statistic, we compare observed and expected values for each category.  We square the differences between the observed and expected values, and then divide that result by the expected value.
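In symbols, writing O for the observed count and E for the expected count in each category, and summing across all k categories, the chi-square statistic is:

\[
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
\]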

Suppose that 10 years ago, Lobster Land surveyed visitors, as they walked through the park entrance, with a single question:  If you were offered a weeklong, all-inclusive, completely free vacation to your choice of either a) Bermuda, b) Hawaii, or c) Key West, Florida, which one would you choose?  When that data was collected, Lobster Land found that 30 percent of respondents selected Key West, 40 percent chose Hawaii, and 30 percent chose Bermuda.

Now, 10 years later, we want to see whether people’s preferences regarding vacations have meaningfully changed.  To make this assessment, Lobster Land can conduct a new survey, asking the same question to visitors. 

Let’s imagine that we ask this question of the next 1000 guests to enter Lobster Land.  We find that 335 select Key West, 410 select Hawaii, and 255 choose Bermuda.  Starting with the null hypothesis – that consumers’ vacation preferences from among these three options have not changed – let’s organize our data using the table below:

Option      Observed #   Expected #   Observed - Expected (O-E)   (O-E)^2 / E
Key West       335          300                  35                   4.08
Hawaii         410          400                  10                   0.25
Bermuda        255          300                 -45                   6.75

This test’s chi-square statistic is the sum of the three values in the rightmost column of the table:  4.08 + 0.25 + 6.75 = 11.08.

How should we interpret this chi-square statistic?  What does it really mean?

Our null hypothesis here is that nothing has changed – without any reason to expect different preferences, we should expect that among those 1000 guests, 300 will prefer Key West, 400 will prefer Hawaii, and 300 will prefer Bermuda. 

We can run the chi-square test in Python, using the scipy.stats library, as shown below:
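A minimal sketch of that calculation uses scipy.stats.chisquare; the variable names below are illustrative choices rather than anything prescribed by the library:

```python
from scipy.stats import chisquare

# Observed counts from the new survey of 1,000 guests
observed = [335, 410, 255]      # Key West, Hawaii, Bermuda

# Expected counts under the null hypothesis: 30% / 40% / 30% of 1,000
expected = [300, 400, 300]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(stat)      # approximately 11.08
print(p_value)   # approximately 0.00392
```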

Note that this delivers a very tiny p-value.  Based on the chi-square distribution and the number of degrees of freedom here (this is 2, one fewer than the number of unique categories), the p-value associated with this chi-square statistic is 0.00392.  Whether we set our alpha at 0.05, 0.01, or even 0.005, this value is tiny enough to place us in the rejection region – we will reject the null hypothesis here.

To build some intuition around the relationship between chi-square statistics and their corresponding p-values, let’s start by taking a look at a few chi-square distributions.
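One way to draw these distributions is with scipy.stats.chi2 and matplotlib; the particular df values below are simply illustrative choices:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2

x = np.linspace(0, 20, 500)

# Plot the chi-square density for several degrees of freedom
for df in [1, 2, 3, 5, 10]:
    plt.plot(x, chi2.pdf(x, df), label=f"df = {df}")

plt.xlabel("Chi-square value")
plt.ylabel("Density")
plt.legend()
plt.show()
```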

For the chi-square goodness of fit test, the number of degrees of freedom is equal to the total number of outcome categories minus one.  In the vacation destination example above, therefore, there are two degrees of freedom.  Note that as the df values go higher, the chi-square distributions take a slightly more symmetric shape, with a peak occurring further to the right. 

Next, since our example has two degrees of freedom, let’s look at the chi-square distribution for df=2.

Now, let’s ask this question:  What if we could randomly generate values underneath this curve?  Where would most of those values tend to fall?  Technically, the curve goes on forever, but for practical purposes, we’ll think of the visible area under this curve now as “1.”
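To build that intuition, we could simulate random draws from the chi-square distribution with two degrees of freedom and see where most of them land.  A quick sketch (the sample size of 100,000 is an arbitrary choice):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)

# Draw 100,000 random values from a chi-square distribution with df = 2
draws = chi2.rvs(df=2, size=100_000, random_state=rng)

# Most of the draws fall well below 6, and very few exceed 9
print(np.percentile(draws, [50, 90, 95, 99]))   # roughly [1.39, 4.61, 5.99, 9.21]
```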

In a scenario involving two degrees of freedom, how unusual would a chi-square value of 2.5 be?  Let’s graph it, and then let’s find its p-value.
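Here is one way to draw that picture, shading the area under the df = 2 curve from 0 up to 2.5 (a sketch; the plotting details are just one reasonable choice):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2

x = np.linspace(0, 12, 500)
y = chi2.pdf(x, df=2)

plt.plot(x, y)

# Shade the area under the curve from 0 up to the chi-square value of 2.5
mask = x <= 2.5
plt.fill_between(x[mask], y[mask], alpha=0.4)

plt.xlabel("Chi-square value (df = 2)")
plt.ylabel("Density")
plt.show()
```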

The graph immediately above shows us how much area under the curve is “spoken for” by chi-square values from 0 to 2.5.  Visually, we can get a sense here that this is the majority of the area, but to know how much, we will use another function from scipy.stats.
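One way to get these numbers is with chi2.cdf and its complement, chi2.sf, from scipy.stats (a sketch of one reasonable approach):

```python
from scipy.stats import chi2

# Area under the df = 2 curve from 0 up to 2.5
print(chi2.cdf(2.5, df=2))   # approximately 0.7135

# Area to the right of 2.5 (the p-value for a chi-square statistic of 2.5)
print(chi2.sf(2.5, df=2))    # approximately 0.2865
```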

This value of 0.2865 tells us that there is a 28.65% chance that, had we generated a chi-square value randomly, in a scenario involving two degrees of freedom, it would be greater than 2.5. 

Next, let’s take a look at something much bigger.  What if we had a chi-square statistic of 10, with two degrees of freedom?
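Using the same approach as before:

```python
from scipy.stats import chi2

# Probability of a value greater than 10 under a chi-square distribution with df = 2
print(chi2.sf(10, df=2))   # approximately 0.0067
```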

Now, the p-value is less than 0.01.  If we were to randomly generate a value underneath this curve, there is just a 0.67% chance that it would be greater than 10.  At nearly any commonly-used alpha threshold, this would be a significant result.  For our vacation example, which generated a chi-square statistic of 11.08, we can see that the result is quite significant. 

While this test does not deliver any more specific insights about what occurred, we can say from this chi-square test that something has changed regarding vacation preferences.

Other scenarios where the chi-square goodness of fit test can be applied include the following:

  • A company has created 5 new flavors of soda water. You recruit a group of participants to test the new products. Your null hypothesis is that each flavor will be equally popular. After receiving your observed results, you perform a chi-square goodness of fit test to see if the distribution of flavor choices is significantly different from your expectations (a code sketch of this scenario appears after this list). If alpha = 0.05 and the p-value is less than that, you can reject the null hypothesis and conclude that the difference in opinion you have observed is statistically significant. The company can then eliminate the less popular flavors.
  • Lobster Land is giving out free gifts to the first 1,000 people who sign up for their newsletter through lobsterland.net: a Larry the Lobster soft toy, a limited edition Lobster Land souvenir mug, and a Lobster Land t-shirt. People can choose one of the three gifts on offer. Your null hypothesis is that each free gift is equally popular. Your observed results show that the t-shirt is the most popular, followed by the soft toy and then the souvenir mug. The question is: did these gift preferences occur by chance? You perform a chi-square goodness of fit test. Assuming alpha = 0.05 and the p-value is greater than that, you have insufficient evidence to reject the null hypothesis, so you cannot conclude that people prefer one sign-up gift over the others.
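As a sketch of the soda-water scenario from the first bullet above, suppose 250 participants each pick a favorite flavor, so the null hypothesis expects 250 / 5 = 50 choices per flavor. The observed counts below are made-up numbers purely for illustration:

```python
from scipy.stats import chisquare

# Hypothetical observed counts for the 5 flavors (illustrative only)
observed = [70, 55, 50, 45, 30]

# Under the null hypothesis, each of the 5 flavors is equally popular: 250 / 5 = 50 each
expected = [50, 50, 50, 50, 50]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(stat, p_value)

# If p_value < 0.05, reject the null hypothesis that all flavors are equally popular
```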