The Sampling Distribution of a Statistic

 

John K. Adams

 

In this exercise, we will explore the concept of the sampling distribution of a statistic. The sampling distribution of a statistic is a theoretical abstraction that answers the following question: “How variable can we expect the value of a particular statistic to be, from sample to sample, when every sample is drawn from the same population and has size N?” Here we are concerned with the sampling distribution of one particular statistic, the mean. However, the same general principles apply to other statistics.

 

The notion of a sampling distribution is a cornerstone of inferential statistics. Recall that the goal of inferential statistics is to make inferences about a population based on the information provided by a sample. Frequently all the tangible information we have about a population lies in the sample. The sampling distribution enables us to use the sample information in a very powerful way to make statements about the population. The following is a rather formal definition of a sampling distribution.

The sampling distribution of a statistic is a theoretical distribution that represents all possible samples of the same size from the same population. It shows the relationship between the possible values of the statistic and the probability associated with each value.

The definition is obviously theoretical. That is, we could never actually obtain “all possible samples” from an infinite or even a very large population. We just pretend we could. Specifically, we assume that we carry out an infinite number of samplings of size N. For each sample we calculate, and record, the value of the statistic of interest. When we are “through” sampling, we form a population of the resulting values on the statistic of interest. In essence, then, the sampling distribution of a statistic is a population distribution for a particular statistic. Instead of a population of heights or test scores, the sampling distribution represents a population of values on the statistic, one value for each theoretical sample of size N.
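For a very small population, "all possible samples" can actually be listed, which makes the definition concrete. The following is a minimal sketch under stated assumptions: the four-score population and the sample size of 2 are hypothetical values chosen for illustration, and samples are assumed to be drawn with replacement so that every ordered sample is equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Hypothetical toy population (not from the text): four equally likely scores.
population = [2, 4, 6, 8]
N = 2  # sample size

# Every ordered sample of size N, drawn with replacement, is equally likely.
all_samples = list(product(population, repeat=N))

# Record the statistic of interest (the mean) for each possible sample.
sample_means = [Fraction(sum(s), N) for s in all_samples]

# Pair each possible value of the mean with its probability.
counts = Counter(sample_means)
sampling_distribution = {
    m: Fraction(c, len(all_samples)) for m, c in sorted(counts.items())
}

for m, p in sampling_distribution.items():
    print(f"mean = {float(m):.1f}  probability = {p}")
```

The resulting table of means and probabilities is the sampling distribution of the mean for this toy population. With a large or infinite population the listing is impossible, which is why the distribution has to be treated as a theoretical object.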

You are already familiar with the idea of a population. For example, you know that a population can be represented by a distribution. You also know that an area under the distribution curve can represent a certain percentage of cases. Similarly, with a sampling distribution, we use the area under the curve to represent how likely it is that we will observe a particular range of values on a statistic.
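As a small illustration of reading an area under the curve as a probability, the sketch below assumes a sampling distribution of the mean that is normal with mean 100 and standard deviation 5. Both numbers are hypothetical and serve only to show the calculation; it uses `NormalDist` from Python's standard `statistics` module.

```python
from statistics import NormalDist

# Hypothetical values chosen for illustration only (not from the text).
sampling_dist = NormalDist(mu=100, sigma=5)

# The area under the curve between 95 and 105 is the probability that
# a sample mean falls in that range.
p_between = sampling_dist.cdf(105) - sampling_dist.cdf(95)

print(f"P(95 <= M <= 105) = {p_between:.4f}")
```

Because the range covers one standard deviation on either side of the mean, the area comes out to roughly 68 percent.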

There are many kinds of distributions. One very important theoretical distribution is the normal distribution. We often assume (unless there is evidence to the contrary) that a particular population is normally distributed. What we mean is that the population is distributed like the theoretical normal distribution. Doing so greatly simplifies working with the population because much is known about the normal distribution mathematically. It would be very convenient if we could assume that the sampling distribution of the mean approximates the normal distribution. Therefore, we must ask an important question: What form does the sampling distribution of the mean take? And a companion question: Does the form of the population from which we sampled influence the form of the sampling distribution? The answers to these questions will tell us how to use the sampling distribution to make inferences about the population from which we sampled.

 

The answer to both questions comes from a theorem known as the central limit theorem. It is formally stated as follows:

If a population has a mean equal to μ and a variance equal to σ², then the sampling distribution of the mean, M, derived from samples of size N, approaches a normal distribution with mean μ and variance σ²/N as N increases.

So, for all practical purposes, the sampling distribution of the mean is approximately normal when N is reasonably large. Notice that the theorem says nothing about the shape of the population from which we are sampling. The population can be of any shape (as long as its variance is finite) and the sampling distribution of the mean will still be approximately normal, provided N is large enough. (How large is “large enough”? A common rule of thumb is an N of about 30, although a strongly skewed population may require more.) Of course, the central limit theorem is very good news. We already know how to use a normal distribution to represent a population, so now we can use it to represent a population of values on a statistic, the sampling distribution of the mean.
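The theorem can be checked informally by simulation. The sketch below is one way to do so, not a proof: it assumes a strongly skewed parent population (an exponential distribution with mean 1 and variance 1, chosen here only for illustration), draws 5,000 samples of size 30, and compares the mean and variance of the resulting sample means with the values the theorem predicts.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed parent population: exponential with rate 1, which is strongly
# right-skewed and has mean 1 and variance 1.
POP_MEAN = 1.0
POP_VAR = 1.0
N = 30            # sample size
N_SAMPLES = 5000  # number of samples drawn (a finite stand-in for "all")


def sample_mean(n):
    """Draw one sample of size n from the population and return its mean."""
    return statistics.fmean(rng.expovariate(1.0) for _ in range(n))


means = [sample_mean(N) for _ in range(N_SAMPLES)]

mean_of_means = statistics.fmean(means)
var_of_means = statistics.pvariance(means)

print(f"mean of sample means:     {mean_of_means:.3f} (theory: {POP_MEAN})")
print(f"variance of sample means: {var_of_means:.4f} (theory: {POP_VAR / N:.4f})")
```

The simulated values should land close to the population mean and to the population variance divided by N. A histogram of `means` would also look far more symmetric than the skewed population it came from.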

Here is a final theoretical note. As its name suggests, the sampling distribution is closely related to sampling. In fact, whenever you draw a sample from a population and calculate a statistic based on that sample, an abstract sampling distribution comes into existence. In the case of the mean, it is the sampling distribution of the mean.

 

Preliminary Exercises

 

1.  If the variance of the population is 100 and the sample size is 25, what is the variance of the sampling distribution of the mean?

 

 

 

2.  Same as Question 1, but the sample size is 1.

 

 

 

3.  Appropriately complete the following statement using the words “increases” and “decreases.” “In general, as the sample size _____, the variability of the sampling distribution of the mean _____.”

 

 

4.  If the mean of the population is 400, what is the mean of the sampling distribution of the mean?

 

 

5.  The standard deviation of the sampling distribution of the mean is given a special name, “the standard error of the mean.” Find its value for the values given in Question 1.

 

 

 

 

Advanced Exercises

 

As we discussed earlier, we could never actually construct a sampling distribution of the mean. The major limitation is collecting “all possible samples” from an infinite population, clearly an impossibility. We can, however, collect a number of samples of a particular size from a large population and calculate the mean for each sample. The collection of means would represent a small subset of the sampling distribution of the mean. It would be correct to call such a collection of means “a sample drawn from the sampling distribution of the mean.” (Linger over that statement until you understand it.) Like any sample drawn from a particular population, such a sample should reflect the characteristics of the “population,” in this case the sampling distribution of the mean.

A well‑known set of norms of the English language consists of about 50,000 words. Here we consider the length of each word in the norms as our population. Therefore, our population consists of numbers like “3” (the length of the word “dog”), “5” (the length of the word “party”), and so forth. There are as many lengths in our population as there are words in the norms, about 50,000. The mean of our population is 7.39 (the average length of all the words in the norms). The variance of our population is 7.28 (the average squared deviation from the population mean).
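To make the two population values concrete, here is a minimal sketch of how a mean and a variance of this kind are computed. The five-word list is a hypothetical stand-in for the norms, so its results will not match 7.39 or 7.28.

```python
import statistics

# Hypothetical mini word list standing in for the roughly 50,000-word norms.
words = ["dog", "party", "sample", "mean", "variance"]
lengths = [len(w) for w in words]  # [3, 5, 6, 4, 8]

pop_mean = statistics.fmean(lengths)

# Population variance: the average squared deviation from the population
# mean, dividing by the number of values (not by one less, as for a sample).
pop_var = sum((x - pop_mean) ** 2 for x in lengths) / len(lengths)

print(f"population mean     = {pop_mean:.2f}")
print(f"population variance = {pop_var:.2f}")
```

The hand-written variance agrees with `statistics.pvariance(lengths)`, which also divides by the number of values.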

Now let’s sample from our population of lengths. In fact, let’s draw 10 samples, each of size 20 (that is, N = 20). After we draw a given sample, we calculate, and record, the mean of the sample. Now let’s draw another 10 samples, but this time each of size 50 (that is, N = 50). Once again, after we draw a given sample, we calculate and record the mean of the sample.
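The sampling procedure just described can be sketched as follows. Since the actual norms are not reproduced here, the population is a hypothetical list of 50,000 randomly generated lengths. The helper name `draw_sample_means` and the choice to sample without replacement within each sample are also assumptions made for this illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical stand-in population of 50,000 word lengths. The real norms
# are not reproduced here, so these values will not match the text's
# mean (7.39) or variance (7.28).
population = [rng.randint(1, 14) for _ in range(50_000)]


def draw_sample_means(population, sample_size, n_samples=10):
    """Draw n_samples samples of size sample_size and return their means."""
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


means_n20 = draw_sample_means(population, 20)
means_n50 = draw_sample_means(population, 50)

print("N = 20:", [round(m, 2) for m in means_n20])
print("N = 50:", [round(m, 2) for m in means_n50])
```

Running the sketch with different seeds gives a feel for how much a set of 10 means can vary from one run to the next.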

Here are the 10 means we recorded for each N we used. Notice that each “sample” of 10 means is incompletely labeled as “N = _____.”

 

N = _____      N = _____

7.95           7.60
8.55           7.26
6.10           7.38
6.85           6.62
6.25           8.02
6.75           7.34
6.55           7.46
7.30           6.92
6.15           7.20
6.65           7.58

 

 

 

For the following questions, keep in mind that each column above is essentially a sample from the sampling distribution of the mean for the particular N involved (20 or 50). Like any sample, each sample reflects the characteristics of the population from which the sample was drawn. The twist here is that the two “populations” involved are sampling distributions of the mean. Note that the two sampling distributions arose by sampling from the same parent population (word length), but each used a unique sample size.

 

1.  Using the available sample information, estimate the mean of each sampling distribution.

 

2.  By inspection, what are the relative variances of the two sampling distributions?

 

3.  Based on the relative variances, fill in the column labels above (with “20” and “50”). Explain why you labeled them as you did.

 

4.  Compare the estimates you made in Question 1 with the actual mean of the length population (μ = 7.39). Which estimate comes closest to the (parent) population mean? If you had just one sample mean with which to estimate μ, would you rather have it based on N = 20 or N = 50? Why?