WSU STAT 360
Class Session 7 Summary and Notes October 13, 2000

Estimators and parameters repeated from last class

We have examined four important quantities in our investigations of sampling and probability: two population parameters and the two sample statistics used to estimate them. These are

  1. The population mean (Greek letter mu, written here as m). This represents the arithmetic average of the random variable of interest. Often we do not actually know the value of this parameter in advance, although we have worked many problems by assuming that we do.
  2. The population standard deviation (Greek letter sigma, written here as s). This is the square root of the population variance. Once again, we may not know this value in advance. If the population follows a particular theoretical pdf, such as the normal distribution, then we can calculate it.
  3. The sample mean (denoted by a symbol with a bar above it, or sometimes by a bold symbol such as (x)). This is the arithmetic average of a number of observations of the random variable. It is an unbiased estimator of the population mean. In cases where we do not know the population mean, we can use the sample mean instead and provide confidence limits for how far the true population mean might lie from it. In cases where we know the population mean from other information, we typically assume that the central limit theorem applies and then calculate how likely the observed deviation of the sample mean from the population mean is. This is a test of significance.
  4. The sample standard deviation (usually denoted by the letter 's'). This is the square root of the sample variance. If the sample variance is calculated as the sum of squared deviations from the sample mean divided by the number of observations minus 1 (that is, by n-1), then the sample variance is an unbiased estimator of the population variance. (A short Python example of these last two calculations follows this list.)
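
To make items 3 and 4 concrete, here is a small Python sketch (my own illustration, not part of the lecture; it uses only the standard library, and the data values are invented) that computes the sample mean, the unbiased sample variance, and the sample standard deviation:

    import math

    def sample_mean(xs):
        # Arithmetic average of the observations; estimates the population mean m.
        return sum(xs) / len(xs)

    def sample_variance(xs):
        # Sum of squared deviations from the sample mean divided by (n - 1).
        # Dividing by (n - 1) rather than n makes this an unbiased estimator
        # of the population variance.
        n = len(xs)
        xbar = sample_mean(xs)
        return sum((x - xbar) ** 2 for x in xs) / (n - 1)

    def sample_std(xs):
        # Square root of the sample variance; this is the usual estimator s.
        return math.sqrt(sample_variance(xs))

    data = [4.1, 5.0, 3.8, 4.6, 5.2, 4.4]   # hypothetical observations
    print(sample_mean(data), sample_variance(data), sample_std(data))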

The meaning of confidence intervals

I tried to assemble a cogent argument for confidence intervals off the cuff in class, but it may work better if I repeat it here and try to tie it together more clearly. Let me start with a known "parent" population that has true mean value (m) and true standard deviation (s). Now assume that we take samples of size (n) from this distribution and calculate the sample mean of each.

Obviously, if these samples are independent, then the sample means will adhere to a normal distribution (if the parent distribution is normal) or to a nearly normal distribution if the parent distribution simply satisfies the conditions of the central limit theorem. The mean of this sampling distribution is m once again, but its standard deviation, often called the standard error, is smaller than that of the parent population. It is, in fact, s/SQRT(n).

According to what we know about the normal distribution, 90% of the samples so taken will have a mean value between m-1.645*s/SQRT(n) and m+1.645*s/SQRT(n). So far there is nothing remarkable about this result; it follows naturally from the definition of the pdf. However, let us turn the picture around, so to speak, and examine a new circumstance.
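
As a check on this claim, the following Python sketch (my own illustration, assuming numpy is installed; the parent values m=10, s=2 and the sample size n=25 are made up for the example) draws many samples from a normal parent population, verifies that the standard deviation of the sample means is close to s/SQRT(n), and verifies that roughly 90% of the sample means fall within 1.645 standard errors of m:

    import numpy as np

    m, s, n = 10.0, 2.0, 25        # hypothetical parent mean, std dev, sample size
    reps = 100_000                 # number of samples to draw
    rng = np.random.default_rng(0)

    # Draw `reps` samples of size n and compute each sample mean.
    samples = rng.normal(loc=m, scale=s, size=(reps, n))
    means = samples.mean(axis=1)

    se = s / np.sqrt(n)            # theoretical standard error
    print("std dev of sample means:", means.std(ddof=1), "theory:", se)

    lo, hi = m - 1.645 * se, m + 1.645 * se
    inside = np.mean((means >= lo) & (means <= hi))
    print("fraction of sample means inside m +/- 1.645*s/sqrt(n):", inside)  # about 0.90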

Let us suppose that we do not know m and s at all. Now we are forced to use a mean and standard deviation for the parent population that are estimated from a sample. Obviously the sample mean (x) is an estimator of the population mean, and the sample standard deviation (s) is an estimator of the population standard deviation. We know that the sample mean is not likely to equal the population mean exactly. What we wish to know is, "What is the likely interval in which we will find the true population mean?"

Since we have no evidence otherwise, we will assume that the true population mean is equally likely to be above or below our sample mean. Therefore, we will specify a symmetric interval around our sample mean. Moreover, if we demand absolute certainty regarding the interval for the true population mean, we can only say it is between minus and plus infinity. This is not a useful result. We will have to specify a level of certainty for our estimated interval that carries some small chance of not including the true population mean. Typically, a person will choose a 90%, 95%, or 99% confidence interval, which corresponds, respectively, to a 10%, 5%, or 1% chance of not containing the true population mean simply because of the vagaries of random samples.
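
The multipliers that go with these confidence levels come from the normal cdf. A short Python check (my own addition, assuming scipy is available):

    from scipy.stats import norm

    # For a 100*(1 - alpha)% two-sided interval, the multiplier is the
    # (1 - alpha/2) quantile of the standard normal distribution.
    for level in (0.90, 0.95, 0.99):
        alpha = 1 - level
        z = norm.ppf(1 - alpha / 2)
        print(f"{level:.0%} interval uses z = {z:.3f}")
    # prints roughly 1.645, 1.960, and 2.576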

There is a further detail to settle. The population standard deviation, of which we have only an estimate, describes how individuals drawn from the parent population vary around the mean, but our sample mean is an average of n such individuals and it will vary according to the standard error s/SQRT(n). Therefore, if our sample is large enough for us to use the normal distribution, the 90% confidence interval for the unknown true population mean ranges from x-1.645*s/SQRT(n) to x+1.645*s/SQRT(n). If the sample is too small (n<30 or so), then we should replace the two factors of 1.645 with the 5% and 95% points (inverse cdf values) of the t distribution with (n-1) degrees of freedom.
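
Putting the pieces together, here is a sketch (my own example, assuming numpy and scipy are available; the data values are invented) that computes a 90% confidence interval for the population mean from a single small sample, using the t distribution with (n-1) degrees of freedom:

    import numpy as np
    from scipy.stats import t

    data = np.array([9.3, 10.8, 10.1, 9.7, 11.2, 10.4, 9.9, 10.6])  # hypothetical sample
    n = len(data)
    xbar = data.mean()
    s = data.std(ddof=1)                  # unbiased sample standard deviation
    se = s / np.sqrt(n)                   # estimated standard error of the mean

    # For a small sample, use the 95th percentile of the t distribution
    # with n-1 degrees of freedom in place of 1.645.
    t95 = t.ppf(0.95, df=n - 1)
    lower, upper = xbar - t95 * se, xbar + t95 * se
    print(f"90% confidence interval for the population mean: ({lower:.2f}, {upper:.2f})")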


Link forward to the next set of class notes for Friday, October 20, 2000