(Surveys and Sampling) Sampling and Generalization
We have seen that surveys are conducted with the goal of creating an accurate picture of the current attitudes, beliefs, or behaviors of a large group of people. In some rare cases it is possible to conduct a census—that is, to measure each person about whom we wish to know. In most cases, however, the group of people that we want to learn about is so large that measuring each person is not practical. Thus, the researcher must test some subset of the entire group of people who could have participated in the research. Sampling refers to the selection of people to participate in a research project, usually with the goal of being able to use these people to make inferences about a larger group of individuals. The entire group of people that the researcher desires to learn about is known as the population, and the smaller group of people who actually participate in the research is known as the sample.
Definition of the Population
The population of interest to the researcher must be defined precisely. For instance, some populations of interest to a survey researcher might be “all citizens of voting age in the United States who plan to vote in the next election,” “all students currently enrolled full time at the University of Chicago,” or “all Hispanic Americans over forty years of age who live within the Baltimore city limits.” In most cases the scientist does not particularly care about the characteristics of the specific people chosen to be in the sample. Rather, the scientist uses the sample to draw inferences about the population as a whole (just as a medical researcher analyzes a blood sample to make inferences about blood that was not sampled).
Whenever samples are used to make inferences about populations, the researcher faces a basic dilemma: he or she will never be able to know exactly what the true characteristics of the population are because not all of the members of the population can be contacted. However, this is not really as big a problem as it might seem if the sample can be assumed to be representative of the population. A representative sample is one that is approximately the same as the population in every important respect. For instance, a representative sample of the population of students at a college or university would contain about the same proportion of men, sophomores, and engineering majors as are in the college itself, as well as being roughly equivalent to the population on every other conceivable characteristic.
Probability Sampling
To make the sample representative of the population, any of several probability sampling techniques may be employed. In probability sampling, procedures are used to ensure that each person in the population has a known chance of being selected to be part of the sample. As a result, the likelihood that the sample is representative of the population is increased, as is the ability to use the sample to draw inferences about the population.
Simple Random Sampling. The most basic probability sample is drawn using simple random sampling. In this case, the goal is to ensure that each person in the population has an equal chance of being selected to be in the sample. To draw a simple random sample, an investigator must first have a complete list (known as a sampling frame) of all of the people in the population. For instance, voting registration lists may be used as a sampling frame, or telephone numbers of all of the households in a given geographic location may be used. The latter list will basically represent the population that lives in that area because almost all U.S. households now have a telephone. Recent advances in survey methodology allow researchers to include cell phone numbers in their sampling frame as well.
Then the investigator randomly selects from the frame a sample of a given number of people. Let’s say you are interested in studying volunteering behavior of the students at your college or university, and you want to collect a random sample of 100 students. You would begin by finding a list of all of the students currently enrolled at the college. Assume that there are 7,000 names on this list, numbered sequentially from 1 to 7,000. Then, as shown in the instructions for using Statistical Table A (in Appendix E), you could use a random number table (or a random number generator on a computer) to produce 100 numbers that fall between 1 and 7,000 and select those 100 students to be in your sample.
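As a rough illustration of the computer-based alternative to a random number table, the following minimal Python sketch draws 100 distinct numbers between 1 and 7,000. The function name simple_random_sample and the seed argument are assumptions made for this example only; they are not part of the textbook's procedure.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Return sample_size distinct ID numbers drawn from 1..population_size."""
    rng = random.Random(seed)  # a seed only makes the draw repeatable
    # Sampling without replacement: no student can be selected twice,
    # and every student on the list has the same chance of selection.
    ids = rng.sample(range(1, population_size + 1), sample_size)
    return sorted(ids)

# 7,000 students on the enrollment list, 100 to be sampled
sample_ids = simple_random_sample(7000, 100, seed=1)
print(len(sample_ids), sample_ids[:5])
```

The students whose positions on the list match the numbers in sample_ids would then make up the sample.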
Systematic Random Sampling. If the list of names on the sampling frame is itself known to be in a random sequence, then a probability sampling procedure known as systematic random sampling can be used. In your case, because you wish to draw a sample of 100 students from a population of 7,000 students, you will want to sample 1 out of every 70 students (100/7,000 = 1/70). To create the systematic sample, you first draw a random number between 1 and 70 and then sample the person on the list with that number. You create the rest of the sample by taking every seventieth person on the list after the initial person. For instance, if the first person sampled was number 32, you would then sample number 102, 172, and so on. You can see that it is easier to use systematic sampling than simple random sampling because only one initial number has to be chosen at random.
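The systematic procedure can be sketched in a few lines of Python. This is only an illustration: the function name systematic_sample and its optional start argument are hypothetical, and the sketch assumes that the population size divides evenly by the sample size, as it does for 7,000 and 100.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Return list positions chosen by taking every k-th person after a random start."""
    interval = population_size // sample_size  # 7,000 // 100 = 70
    if start is None:
        # The only random step: one starting position between 1 and the interval
        start = random.Random(seed).randint(1, interval)
    return [start + k * interval for k in range(sample_size)]

# Reproducing the example in the text, where the first person drawn is number 32
positions = systematic_sample(7000, 100, start=32)
print(positions[:3])  # [32, 102, 172]
```

Whatever starting number is drawn, the last position selected never exceeds 7,000, so the sample always contains exactly 100 students.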
Stratified Sampling. Because in most cases sampling frames include such information about the population as sex, age, ethnicity, and region of residence, and because the variables being measured are frequently expected to differ across these subgroups, it is often useful to draw separate samples from each of these subgroups rather than to sample from the population as a whole. The subgroups are called strata, and the sampling procedure is known as stratified sampling.
To collect a proportionate stratified sample, frames of all of the people within each stratum are first located, and random samples are drawn from within each of the strata in numbers proportional to each stratum's share of the population. For example, if you expected that volunteering rates would be different for students from different majors, you could first make separate lists of the students in each of the majors at your school and then randomly sample from each list. One outcome of this procedure is that the different majors are guaranteed to be represented in the sample in the same proportion that they are represented in the population, a result that might not occur if you had used simple random sampling. Furthermore, it can be shown mathematically that if volunteering behavior does indeed differ among the strata, a stratified sample will provide a more precise estimate of the population characteristics than will a simple random sample (Kish, 1965).
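A proportionate stratified draw might be sketched as follows. The three majors and their enrollment counts are invented for illustration, as are the function name and the student labels. The sketch rounds each stratum's allocation, which works out exactly for these numbers but in general may need adjusting so that the allocations add up to the intended sample size.

```python
import random

def proportionate_stratified_sample(strata, sample_size, seed=None):
    """Draw from each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Allocation proportional to stratum size (rounded to a whole person)
        n = round(sample_size * len(members) / total)
        sample[name] = rng.sample(members, n)
    return sample

# Hypothetical sampling frames: one list of students per major (7,000 in all)
strata = {
    "Engineering": [f"E{i}" for i in range(3500)],
    "Biology": [f"B{i}" for i in range(1400)],
    "Psychology": [f"P{i}" for i in range(2100)],
}
sample = proportionate_stratified_sample(strata, 100, seed=1)
counts = {name: len(ids) for name, ids in sample.items()}
print(counts)  # {'Engineering': 50, 'Biology': 20, 'Psychology': 30}
```

Each major makes up the same share of the sample as it does of the hypothetical student body, which is the guarantee that a simple random sample cannot give.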
Disproportionate stratified sampling is frequently used when the strata differ in size and the researcher is interested in comparing the characteristics of the strata. For instance, in a class of 7,000 students, only 10 or so might be French majors. If a random sample of 100 students were drawn, there might not be any French majors in the sample, or at least there would be too few to allow a researcher to draw meaningful conclusions about them. In this case, the researcher draws a sample in which some strata make up a larger proportion than they do in the population. This procedure is called oversampling and is used to provide large enough samples of the strata of interest to allow analysis. Mathematical formulas are used to determine the optimum size for each of the strata.
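To see why oversampling is needed here, it may help to work through the arithmetic. The sketch below treats "10 or so" as exactly 10 French majors and computes how many would be expected in a simple random sample of 100, and how likely it is that none would appear at all. It then draws a disproportionate sample in which the stratum sizes (10 and 90) are arbitrary illustrative choices, not the result of the optimum-allocation formulas mentioned above. All names in the code are hypothetical.

```python
import math
import random

POPULATION = 7000
FRENCH_MAJORS = 10   # "10 or so" in the text; assumed to be exactly 10 here
SAMPLE_SIZE = 100

# Expected number of French majors in a simple random sample of 100
expected_french = SAMPLE_SIZE * FRENCH_MAJORS / POPULATION

# Probability that a simple random sample contains no French majors:
# (ways to pick all 100 from the 6,990 others) / (ways to pick any 100)
p_no_french = (math.comb(POPULATION - FRENCH_MAJORS, SAMPLE_SIZE)
               / math.comb(POPULATION, SAMPLE_SIZE))

print(f"expected French majors: {expected_french:.2f}")
print(f"chance of none at all:  {p_no_french:.2f}")

def disproportionate_sample(strata, sizes, seed=None):
    """Draw a fixed, researcher-chosen number of people from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(strata[name], sizes[name]) for name in strata}

# Hypothetical frames and illustrative (not optimal) stratum sample sizes
strata = {
    "French": [f"F{i}" for i in range(FRENCH_MAJORS)],
    "Other": [f"O{i}" for i in range(POPULATION - FRENCH_MAJORS)],
}
sizes = {"French": 10, "Other": 90}
oversample = disproportionate_sample(strata, sizes, seed=1)
print({name: len(ids) for name, ids in oversample.items()})
```

Under these assumptions a simple random sample would be expected to contain fewer than one French major, whereas the oversample contains French majors by design.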
Cluster Sampling. Although simple and stratified sampling can be used to create representative samples when there is a complete sampling frame for the population, in some cases there is no such list. For instance, there is no single list of all of the currently matriculated college students in the United States. In these cases an alternative approach known as cluster sampling can be used. The technique is to break the population into a set of smaller groups (called clusters) for which there are sampling frames and then to randomly choose some of the clusters for inclusion in the sample. At this point, every person in the cluster may be sampled, or a random sample of the cluster may be drawn.
Often the clustering is done in stages. For instance, we might first divide the United States into regions (for instance, East, Midwest, South, Southwest, and West). Then we would randomly select states from each region, counties from each state, and colleges or universities from each county. Because there is a sampling frame of the matriculated students at each of the selected colleges, we could draw a random sample from these lists. In addition to allowing a representative sample to be drawn when there is no sampling frame, cluster sampling is convenient. Once we have selected the clusters, we need only contact the students at the selected colleges rather than having to sample from all of the colleges and universities in the United States. In cluster sampling, the selected clusters are used to draw inferences about the nonselected ones. Although this practice loses some precision, cluster sampling is frequently used because of convenience.
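The logic of sampling in stages can be sketched with just two stages: first choose some clusters at random, then draw a random sample within each chosen cluster. The eight colleges and their 200-student rosters below are invented for illustration, and the function name is hypothetical. A design with regions, states, and counties would simply repeat the first step at each level.

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: pick clusters at random. Stage 2: sample people within each."""
    rng = random.Random(seed)
    # Stage 1: only a list of clusters is needed, not a list of every person
    chosen = rng.sample(sorted(clusters), n_clusters)
    # Stage 2: each chosen cluster has its own sampling frame (its roster)
    return {
        name: rng.sample(clusters[name], min(n_per_cluster, len(clusters[name])))
        for name in chosen
    }

# Hypothetical clusters: eight colleges, each with a roster of 200 students
colleges = {
    f"College {c}": [f"{c}-{i}" for i in range(200)] for c in "ABCDEFGH"
}
cluster_sample = two_stage_cluster_sample(
    colleges, n_clusters=3, n_per_cluster=25, seed=1
)
for name, students in cluster_sample.items():
    print(name, len(students))
```

Only the three selected colleges ever need to be contacted, which is the convenience described above; the cost is that the five colleges never selected are represented only indirectly.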
Sampling Bias and Nonprobability Sampling
The advantage of probability sampling methods is that their samples will be representative and thus can be used to draw inferences about the characteristics of the population. Although these procedures sound good in theory, in practice it is difficult to be certain that the sample is truly representative. Representativeness requires that two conditions be met. First, there must be one or more sampling frames that list the entire population of interest, and second, all of the selected individuals must actually be sampled. When either of these conditions is not met, there is the potential for sampling bias. This occurs when the sample is not actually representative of the population because the probability with which members of the population have been selected for participation is not known.
Sampling bias can arise when an accurate sampling frame for the population of interest cannot be obtained. In some cases there is an available sampling frame, but there is no guarantee that it is accurate. The sampling frame may be inaccurate because some members of the population are missing or because it includes some names that are not actually in the population. College student directories, for instance, frequently do not include new students or those who requested that their name not be listed, and these directories may also include students who have transferred or dropped out.
In other cases there simply is no sampling frame. Imagine attempting to obtain a frame that included all of the homeless people in New York City or all of the women in the United States who are currently pregnant with their first child. In cases where probability sampling is impossible because there is no available sampling frame, nonprobability samples must be used. To obtain a sample of homeless individuals, for instance, the researcher will interview individuals on the street or at a homeless shelter. One type of nonprobability sample that can be used when the population of interest is rare or difficult to reach is called snowball sampling. In this procedure one or more individuals from the population are contacted, and these individuals are used to lead the researcher to other population members. Such a technique might be used to locate homeless individuals. Of course, in such cases the potential for sampling bias is high because the people in the sample may be different from the people in the population. Snowball sampling at homeless shelters, for instance, may include a greater proportion of people who stay in shelters and a smaller proportion of people who do not stay in shelters than are in the population. This is a limitation of nonprobability sampling, but one that the researcher must live with because there is no possible probability sampling method that can be used.
Even if a complete sampling frame is available, sampling bias can occur if all members of the random sample cannot be contacted or cannot be convinced to participate in the survey. For instance, people may be on vacation, they may have moved to a different address, or they may not be willing to complete the questionnaire or interview. When a questionnaire is mailed, the response rate may be low. In each of these cases the potential for sampling bias exists because the people who completed the survey may have responded differently than would those who could not be contacted.
Nonprobability samples are also frequently found when college students are used in experimental research. Such samples are called convenience samples because the researcher has sampled whatever individuals were readily available without any attempt to make the sample representative of a population. Although such samples can be used to test research hypotheses, they may not be used to draw inferences about populations. We will discuss the use of convenience samples in experimental research designs more fully in Chapter 13.
Whenever you read a research report, make sure to determine what sampling procedures have been used to select the research participants. In some cases, researchers make statements about populations on the basis of nonprobability samples, which are not likely to be representative of the population they are interested in. For instance, polls in which people are asked to call a 900 number or log on to a website to express their opinions on a given topic may contain sampling bias because people who are in favor of (or opposed to) the issue may have more time or more motivation to do so. Whenever the respondents, rather than the researchers, choose whether to be part of the sample, sampling bias is possible. The important thing is to remain aware of what sampling techniques have been used and to draw your own conclusions accordingly.