SAMPLING DISTRIBUTIONS - Section 4.9, 4.10
In-class Group Activity for demonstrating Sampling Distribution of Sample Mean.
Q1. American males must register at a local post office when they turn 18. In addition to other information, the height of each male is obtained. The national average height for 18-year-old males is 69 inches (5 ft. 9 in.). At the small, local post office, about 5 men register each day. At the large, city post office, about 50 men register each day. At the end of each day, the clerk at each post office computes the average height of the men who registered there that day.
Which of the following predictions would you make regarding the number of days on which the average height for the day was more than 71 inches (5 ft. 11 in.)?
__a. The number of days with average heights over 71 inches would be greater for the small post office than for the large post office.
__b. The number of days with average heights over 71 inches would be greater for the large post office than for the small post office.
__c. There is no basis for predicting which post office would have the greater number of days.
Why:
NOTE: The concept behind this question is about estimating average height using 5 individuals vs. using 50 individuals. Which estimate gives a better result?
How do you solve this problem? The very first step is to understand what we mean by 'better estimate'.
Q2: When you say that the average of 50 individuals should give a better estimate of the true average male adult height, what do you have in mind with regard to 'better'?
Q3: If 'better' means that the average of 50 individuals should be closer to the true average, then the remaining question is how to measure this 'closeness'. Here is one way to do so. Notice that the average height is computed at each Post Office for many days. When you look at these average heights from each Office, which of the following is more likely to happen:
Why:
Q4. Suppose each Post Office collected the average heights for 200 working days. You use Minitab to construct a histogram for each Office. Which of the following shapes is more likely to occur for the average heights:
Why:
NOTE that what you observed in Q4 is the so-called sampling distribution of the sample mean. You should have noticed the following properties:
Q5. In a geology course, students were asked to determine the weight of rock samples. One instructor asked her students to weigh a rock several times on the same scale. This rock is known to weigh exactly 1000 grams. However, the scale is not completely accurate and sometimes it is off in either direction by 25 grams or less. After a lot of practice, one student weighed the rock 20 times, then computed and recorded the average of the 20 weighings. After a lot of practice, a second student weighed the rock 5 times, then computed and recorded the average of the five weighings.
How would you expect the accuracy of the average weight recorded by the first and second student to compare?
__a. The student who weighed the rock 20 times would have a more accurate average.
__b. The student who weighed the rock 5 times would have a more accurate average.
__c. Both averages would be equally accurate.
__d. It is impossible to predict which average would be more accurate.
Why:
More formal discussion about Sampling Distribution of Sample Mean:
One major purpose of statistics is to use sample information to estimate or predict unknown population characteristics, or to use the sample information to help us make decisions. This is what inference is about. The simplest inference is point estimation.
----------------------------------------------------------------------------------------------------
                        Mean         Variance      s.d.
Population Parameter    $\mu$        $\sigma^2$    $\sigma$    (Unknown to us)
----------------------------------------------------------------------------------------------------
Sample Statistic        $\bar{x}$    $s^2$         $s$         (Computed from sample data)
----------------------------------------------------------------------------------------------------
We simply use common sense to do the estimation: we use the sample mean to estimate the population mean, the sample variance to estimate the population variance, and the sample s.d. to estimate the population s.d.
How RELIABLE is a sample statistic as an estimate of the unknown population parameter? This must be quantified so that we will be able to make better decisions.
To understand the reliability of using a sample statistic to estimate the population parameter, we must study the sampling distribution of the sample statistic, especially the distribution of the sample mean.
WHAT IS THE SAMPLING DISTRIBUTION OF SAMPLE MEAN?
HOW TO CONSTRUCT THE SAMPLING DISTRIBUTION OF SAMPLE MEAN?
In general, the population distribution is unknown in real-world situations. We try to characterize the population distribution by using previous experience and so on. This is what random variables and population distributions are all about. We discussed:
Binomial distribution: for those characteristics that can be described as X = # of successes in n identical trials.
Normal distribution: for those characteristics that are continuous and have a bell-shaped distribution.
Other distributions can be described as skewed to the right or skewed to the left.
We learn the properties of population distributions so that when we collect data from the unknown population, we are able to use the sample information to infer the properties of the unknown population.
But the problem is how reliable the sample information is for inferring the unknown population characteristic, such as using the sample mean to estimate the population mean!
We need to study the distribution of the sample mean, and use this distribution to understand the reliability of using the sample mean to estimate the population mean.
When we study a population distribution, we do not need to collect every observation in the population. Rather, using conceptual understanding and previous experience, we are able to describe the population distribution of adult weight, SAT scores, and so on.
We will use similar conceptual thinking to develop the distribution of the sample mean.
What is the sampling distribution of sample mean?
Suppose that we are interested in estimating the average beer price in Michigan.
We will take a random sample, say n dozens of beer, and compute the average beer price, then use the sample average to estimate the unknown population average beer price. But, as we all know, different samples of n dozens will result in different average beer prices. This clearly indicates that the sample average beer price is a random variable. So, conceptually, we can construct a histogram of sample means: if we collect a large number of samples, say 10,000, each of n dozens of beer, and compute the 10,000 sample means, the histogram of these means resembles the distribution of the sample mean price for samples of size n dozens.
This is what we call the sampling distribution of the sample mean.
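The thought experiment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of beer prices is made up (uniform between $8 and $16 per dozen), n = 10 is an arbitrary choice, and the names `simulate_sample_means` and `draw_price` are hypothetical helpers, not part of any course software.

```python
import random
import statistics


def simulate_sample_means(draw_price, n, num_samples, seed=0):
    """Return num_samples sample means, each computed from n random draws."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(num_samples):
        sample = [draw_price(rng) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means


def draw_price(rng):
    # ASSUMPTION: a made-up population, prices uniform between $8 and $16.
    return rng.uniform(8.0, 16.0)


# 10,000 samples, each of n = 10 dozens, as in the thought experiment.
means = simulate_sample_means(draw_price, n=10, num_samples=10_000)
center = statistics.fmean(means)   # center of the sampling distribution
spread = statistics.stdev(means)   # spread of the sampling distribution

print(f"number of sample means: {len(means)}")
print(f"mean of sample means:   {center:.2f}")
print(f"s.d. of sample means:   {spread:.2f}")
```

For this made-up population the individual prices have mean 12, so the sample means should cluster around 12, with a spread noticeably smaller than the spread of individual prices. A histogram of `means` (in Minitab or any other tool) is an empirical picture of the sampling distribution.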
Here is another example:
Consider the example of distance from your home town. We collect a sample of 26 students $\{x_1, x_2, \ldots, x_{26}\}$ and compute the average $\bar{x}$. In general, we would like to make a conclusion such as: based on my sample, the actual average distance from home for all CMU students is estimated to be $\bar{x}$. But how accurate is such a conclusion? That is, how accurately can $\bar{x}$ be used to estimate the actual population mean $\mu$?
To really understand this, the best way is to collect everyone's distance and find the actual population mean $\mu$, then compare it with the sample mean $\bar{x}$. In reality, it is either very difficult or impossible to collect the entire population. However, there is another way to understand how well $\bar{x}$ from 26 students' observations can estimate $\mu$.
Conceptually, if we repeated our random sample of 26 students for a large number of different random samples, we could compute $\{\bar{x}_1, \bar{x}_2, \ldots\}$. Each $\bar{x}_i$ is the average distance from a sample of 26 observations. Then, using a histogram, we can empirically view the distribution of these $\bar{x}$'s. Theoretically, this can be done and the results can be described as follows:
If the distance, call it X, a continuous random variable, follows a normal distribution with mean $\mu$ and standard deviation $\sigma$: $X \sim N(\mu, \sigma)$.
We consider a random sample of size n: $X_1, X_2, \ldots, X_n$.
Then the average distance $\bar{X}$ is:
$\bar{X} = (X_1 + X_2 + \cdots + X_n) / n$
What should the distribution of $\bar{X}$ look like? We will present the result without going into too much detail. We will use these results to solve problems.
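First result:
When the population is normal: $X \sim N(\mu, \sigma)$
The sample mean $\bar{X}$ from a random sample of n observations has the following distribution:
$\bar{X} \sim N(\mu, \sigma/\sqrt{n})$, or in standardized form,
$Z = (\bar{X} - \mu) / (\sigma/\sqrt{n}) \sim N(0, 1)$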
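For readers who want to see where the mean and s.d. of the sample mean come from, here is a short sketch of the calculation. It assumes only that the observations in a random sample are independent and that each has mean $\mu$ and variance $\sigma^2$; it does not show why the shape is normal.

```latex
\begin{align*}
E(\bar{X}) &= \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}\, n\mu = \mu, \\
\operatorname{Var}(\bar{X}) &= \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{Var}(X_i)
  = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}, \\
\operatorname{s.d.}(\bar{X}) &= \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}.
\end{align*}
```

The second line is where independence is used: the variance of a sum equals the sum of the variances only when the observations are independent.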
E.g., suppose the distance from home for CMU students follows N(150, 100), where the second number is the standard deviation.
Take X to be the variable Distance from Home for CMU students: $X \sim N(150, 100)$
Our interest is to understand the distribution of the sample mean of 25 students. The above result says that the distribution of $\bar{X}$ from a random sample of n = 25 students is:
$\bar{X} \sim N(150, 100/\sqrt{25})$,
i.e. $\bar{X} \sim N(150, 20)$, or $Z = (\bar{X} - 150)/20 \sim N(0, 1)$
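As a quick check of this arithmetic, the following sketch uses Python's standard `statistics.NormalDist`. The cutoff of 170 is a hypothetical value chosen only for illustration; it does not come from the example itself.

```python
import math
from statistics import NormalDist

mu, sigma, n = 150.0, 100.0, 25      # population mean, population s.d., sample size

se = sigma / math.sqrt(n)            # s.d. of the sample mean: 100 / 5 = 20
xbar_dist = NormalDist(mu, se)       # distribution of the sample mean

cutoff = 170.0                       # ASSUMPTION: hypothetical cutoff for illustration
z = (cutoff - mu) / se               # standardized value of the cutoff

# Chance that the average of 25 students exceeds the cutoff.
p_mean = 1 - xbar_dist.cdf(cutoff)
# Chance that a single student's distance exceeds the same cutoff.
p_single = 1 - NormalDist(mu, sigma).cdf(cutoff)

print(f"s.d. of sample mean = {se}")
print(f"z = {z}")
print(f"P(sample mean > {cutoff}) = {p_mean:.4f}")
print(f"P(single X > {cutoff})    = {p_single:.4f}")
```

Under these assumptions the cutoff is one standard deviation above the mean for the sample mean (z = 1) but only 0.2 standard deviations above the mean for a single student, so the average of 25 students exceeds 170 far less often (about 0.16) than a single student does (about 0.42).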
It is not difficult to imagine that the shape of the distribution of $\bar{X}$ is also normal when the original population X is normal. However, a further question is:
If the original population X is NOT normal, or its shape is not even known, is there any way we can capture the distribution of $\bar{X}$?
The answer is YES, if the sample size n is large enough. Typically, n > 30 is considered large.
This is what the Central Limit Theorem is all about. It can be described as follows:
Central Limit Theorem
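Population: $X \sim (\mu, \sigma)$ (Note that the shape is unknown or non-normal.)
A random sample of size n is chosen from this population. When n > 30,
$\bar{X}$ is approximately $N(\mu, \sigma/\sqrt{n})$,
or in standardized form:
$Z = (\bar{X} - \mu) / (\sigma/\sqrt{n})$ is approximately $N(0, 1)$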
This says: Regardless of what the original population is, as long as the sample size is large (n > 30), the sample mean $\bar{X}$ approximately follows a normal distribution with mean $\mu$ and s.d. $\sigma/\sqrt{n}$.
For the distance example, suppose that the distribution of the distance does not follow a normal curve; say it is skewed to the right (that is, the majority of students come from within 200 miles, and only a few come from far away). This is a case where the population shape is skewed. In the worst case, the shape is unknown to us. That is, the distance $X \sim (\mu, \sigma)$ has a population mean $\mu$ and s.d. $\sigma$, but the shape is not known. (Notice that $\mu$ and $\sigma$ are unknown in practice. Here, we use $\mu$ and $\sigma$ to denote the population mean and standard deviation for the purpose of studying the population distribution and the distribution of the sample mean.)
Let's take a random sample of 36 students. Then, since $\sqrt{36} = 6$, the distribution of $\bar{X}$ is approximated by
$\bar{X} \sim N(\mu, \sigma/6)$.
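The claim can be checked empirically with a small simulation. The sketch below is only an illustration under an assumption: it takes the right-skewed population to be exponential with mean 150, which is a made-up choice and not a statement about the real distances. For an exponential population the s.d. equals the mean, so here the population s.d. is also 150 and the s.d. of the sample mean should be about 150/6 = 25.

```python
import random
import statistics

rng = random.Random(1)          # fixed seed so the run is repeatable

POP_MEAN = 150.0                # ASSUMPTION: exponential population with mean 150
POP_SD = 150.0                  # for an exponential, s.d. equals the mean
n = 36                          # sample size from the example
num_samples = 10_000            # number of repeated samples

# Draw many samples of size n from the skewed population; keep each sample mean.
xbars = []
for _ in range(num_samples):
    sample = [rng.expovariate(1.0 / POP_MEAN) for _ in range(n)]
    xbars.append(statistics.fmean(sample))

center = statistics.fmean(xbars)
spread = statistics.stdev(xbars)
predicted_sd = POP_SD / n ** 0.5        # sigma / 6

# For a normal curve, about 68% of values fall within one s.d. of the mean.
within_one_sd = sum(abs(x - POP_MEAN) <= predicted_sd for x in xbars) / num_samples

print(f"mean of sample means: {center:.1f} (population mean {POP_MEAN})")
print(f"s.d. of sample means: {spread:.1f} (predicted {predicted_sd:.1f})")
print(f"fraction within one s.d.: {within_one_sd:.3f}")
```

Even though individual distances in this made-up population are strongly skewed, the sample means should center near 150 with s.d. near 25, and roughly 68% of them should fall within one s.d. of the mean, as a normal curve would predict.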