Week 12--13. Statistics.

Inferring population parameters from sample statistics; margin of error and level of confidence

Basic ideas this week:

Much of statistics is concerned with the problem of obtaining information about a population from information about a sample. One very vivid application is currently in the news: polls attempt to determine the way a population will vote by examining the voting patterns within a sample.

The idea of generalizing from a sample to a population is not hard to grasp in a loose and informal way, since we do this all the time. After a few visits to a store, for example, we notice that the produce is not fresh. So we assume that the store generally has bad produce. This is a generalization from a sample (the vegetables we have examined) to a population (all the vegetables the store sells). But there are many ways to go wrong or to misunderstand the meaning of the data obtained from a sample.

How do statisticians conceive of the process of drawing a conclusion about a population from a sample? How do they describe the information that is gained from a sample and quantify how informative it is? How much data do we need in order to reach a conclusion that is secure enough to print in a newspaper? Or on which to base medical decisions? These are the questions that we will address this week.

The simplest example arises when one uses a sample to infer a population proportion. We can give a fairly complete account of the mathematical ideas that are used in this situation, based on the binomial distribution. My aim is to enable you to understand the internal mathematical "clockwork" of how the statistical theory works.

Assignment:

Vocabulary:

Concepts:

We will concentrate on estimating population proportions by sampling.

The meaning of margin of error and level of confidence.

The margin of error and level of confidence depend on the sample size (and NOT on population size).
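
One way to check the claim about population size is to compute, for two populations of very different sizes but the same proportion, the exact chance that a random sample lands near the true proportion. The following is a minimal sketch and is not taken from the course text: the function name `prob_sample_within`, the 40% proportion, the sample size of 400, and the 5-point window are all illustrative assumptions. It counts samples drawn without replacement.

```python
from math import comb

def prob_sample_within(pop_size, red_count, sample_size, lo, hi):
    """Exact chance that a random sample (drawn without replacement)
    contains between lo and hi red members, inclusive."""
    favorable = sum(
        comb(red_count, k) * comb(pop_size - red_count, sample_size - k)
        for k in range(lo, hi + 1)
    )
    return favorable / comb(pop_size, sample_size)

# Hypothetical populations, both exactly 40% red; sample size 400.
# "Within 5 percentage points of 40%" means 140 to 180 reds out of 400.
small = prob_sample_within(10_000, 4_000, 400, 140, 180)
large = prob_sample_within(1_000_000, 400_000, 400, 140, 180)
print(f"population of 10,000:    {small:.3f}")
print(f"population of 1,000,000: {large:.3f}")
```

The two printed values should be close to each other even though one population is a hundred times larger than the other. Changing the sample size instead of the population size should move them much more.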

The underlying idea that explains how we can determine the reliability of statistics is the notion of sampling distribution. In order to talk about this, I introduce a new term: by a "p-population", I mean a very large population in which a proportion p of the members has some characteristic of interest, e.g., being a Democrat.
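
The sampling distribution for a p-population can be tabulated directly from the binomial distribution. The sketch below is only an illustration under stated assumptions: the population is taken to be large enough that each draw has chance p of showing the characteristic, and the function name `sampling_distribution`, the value p = 0.4, and the sample size 20 are hypothetical choices.

```python
from math import comb

def sampling_distribution(p, n):
    """Return a list whose k-th entry is the chance that a sample of size n
    from a p-population contains exactly k members with the characteristic."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

# Illustrative values: a 0.4-population and samples of size 20.
dist = sampling_distribution(0.4, 20)

# The count that occurs most often among all possible samples.
most_likely = max(range(len(dist)), key=lambda k: dist[k])

# Chance that a sample is 30% to 50% red, i.e. 6 to 10 reds out of 20.
near = sum(dist[6:11])

print("most likely count:", most_likely)
print(f"chance the sample is within 10 points of 40%: {near:.3f}")
```

The list `dist` is the whole sampling distribution: it says how often each possible sample statistic would turn up if we sampled over and over from the same p-population.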

Skills:

  1. Explain the vocabulary above, and illustrate it with examples.
  2. Explain what it means when a reporter or researcher says that a poll has a margin of error of 3 percentage points (say) at a level of confidence 95% (say).
  3. Use a table to determine the levels of confidence and margins of error that can be obtained with various sample sizes when attempting to determine population proportions.
  4. Use the square root law to estimate the sample size needed to get a given margin of error with better than 95% confidence. (See text, page 350.)
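
As a rough illustration of skill 4, the sketch below assumes that the square root law cited from the text is the common rule of thumb that, at about 95% confidence, the margin of error for a proportion is at most roughly 1/sqrt(n), where n is the sample size. Check page 350 for the exact statement used in the course. The function names and the choice to express margins in percentage points are hypothetical.

```python
from math import ceil, sqrt

def margin_of_error(n):
    """Approximate 95% margin of error (as a fraction) for sample size n,
    assuming the rule of thumb: margin is about 1 / sqrt(n)."""
    return 1 / sqrt(n)

def sample_size_needed(margin_points):
    """Smallest n whose rule-of-thumb margin is at most margin_points
    percentage points. Solving 1/sqrt(n) <= margin_points/100 for n
    gives n >= 10000 / margin_points**2."""
    return ceil(10000 / margin_points**2)

for points in (5, 3, 1):
    n = sample_size_needed(points)
    print(f"margin of {points} points: sample of about {n}")
```

Under this assumption, halving the margin of error requires about four times as many people in the sample, which is why very small margins are expensive to obtain.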

Assessments:

  1. A jar of colored beads may be an analogy for more meaningful situations that you might encounter (e.g., in the news). List some examples and draw the analogy explicitly. For example: someone wants to predict the outcome of an election by means of an exit poll. All the people who voted are analogous to all the beads in the jar. The color of bead is analogous to the vote---e.g. color : bead : : candidate voted for : voter. The people who are questioned in the poll are analogous to the sample.
  2. Suppose a large population is 40% red. Imagine that you have drawn a sample of size 20 from this population. Describe what you think a typical sample might be like.
  3. Suppose that you have drawn a sample of size 20 from a population of unknown proportion red, and that your sample is 40% red. What do you think you can deduce about the population?
  4. A random sample of size 100 from a population of voters is 52% Republican. What do you think the true proportion of Republicans in the population is?
  5. Do you know anything more than just that the true proportion is near 52%?
  6. Imagine a large bin with pieces of paper---or a jar filled with colored beads. Describe what we would do in order to estimate the sampling distribution empirically.
  7. If we draw 1000 samples, each of size 400, from a population that is 30% red, then how many samples will have a statistic of exactly 30% (the population proportion that you decided to work with)? What will the greatest deviation from p be?
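
The bead-drawing procedure in assessments 6 and 7 can also be imitated on a computer, which is a useful check after you have written down your own predictions. The sketch below is a simulation under stated assumptions, not a substitute for the exercise: the function name `simulate_red_counts` and the fixed seed are hypothetical, and each bead is treated as red with chance p independently of the others, which is reasonable only for a very large population.

```python
import random

def simulate_red_counts(p, sample_size, num_samples, seed=0):
    """Draw num_samples samples of the given size from a p-population
    and return the number of red beads found in each sample."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    counts = []
    for _ in range(num_samples):
        reds = sum(1 for _ in range(sample_size) if rng.random() < p)
        counts.append(reds)
    return counts

p, sample_size, num_samples = 0.30, 400, 1000
counts = simulate_red_counts(p, sample_size, num_samples)

# A sample statistic of exactly 30% means exactly 120 reds out of 400.
target = round(p * sample_size)
exact_hits = sum(1 for c in counts if c == target)

# Greatest deviation of any sample proportion from p.
max_dev = max(abs(c - target) for c in counts) / sample_size

print("samples with exactly 30% red:", exact_hits, "out of", num_samples)
print(f"greatest deviation from p: {max_dev:.3f}")
```

Tallying how often each count appears in `counts` gives an empirical estimate of the sampling distribution. A different seed will give somewhat different numbers, which is itself part of the lesson.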