Showing posts with label Descriptive Statistics. Show all posts

Sunday 30 June 2013

Summarizing the Sample Data

(Reliability and Validity)  Summarizing the Sample Data

You can well imagine that once a survey has been completed, the collected data (known as the raw data) must be transformed in a way that will allow them to be meaningfully interpreted. The raw data are, by themselves, not very useful for gaining the desired snapshot because they contain too many numbers. For example, if we interview 500 people and ask each of them forty questions, there will be 20,000 responses to examine. In this section we will consider some of the statistical methods used to summarize sample data. Procedures for using computer software programs to conduct statistical analyses are reviewed in Appendix B, and you may want to read this material at this point.

Frequency Distributions

Table 6.1 presents some hypothetical raw data from twenty-five participants on five variables collected in a sort of “minisurvey.” You can see that the table is arranged such that the variables (sex, ethnic background, age, life satisfaction, family income) are in the columns and the participants form the rows. For nominal variables such as sex or ethnicity, the data can be summarized through the use of a frequency distribution. A frequency distribution is a table that indicates how many, and in most cases what percentage, of individuals in the sample fall into each of a set of categories. A frequency distribution of the ethnicity variable from Table 6.1 is shown in Figure 6.1(a). The frequency distribution can be displayed visually in a bar chart, as shown for the ethnic background variable in Figure 6.1(b). The characteristics of the sample are easily seen when summarized through a frequency distribution or a bar chart.
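As a rough illustration of how such a table is built, the short Python sketch below counts how often each category occurs and converts the counts to percentages. The category labels and responses are made up for the example; they are not the ethnicity data from Table 6.1, and the function name is only illustrative.

```python
from collections import Counter

def frequency_distribution(values):
    """Return (category, count, percent) rows, most frequent category first."""
    counts = Counter(values)
    total = len(values)
    return [(cat, n, 100.0 * n / total) for cat, n in counts.most_common()]

# Made-up nominal responses for illustration only (NOT the Table 6.1 data).
responses = ["A", "B", "A", "C", "A", "B", "A", "C", "B", "A"]

for category, count, percent in frequency_distribution(responses):
    print(f"{category:<4}{count:>3}{percent:>7.1f}%")
```

Each printed row corresponds to one row of a frequency table: the category, the number of individuals in it, and the percentage of the sample it represents.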

One approach to summarizing a quantitative variable is to combine adjacent values into a set of categories and then to examine the frequencies of each of the categories. The resulting distribution is known as a grouped frequency distribution. A grouped frequency distribution of the age variable from Table 6.1 is shown in Figure 6.2(a). In this case, the ages have been grouped into five categories (less than 21, 21–30, 31–40, 41–50, and greater than 50).
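The grouping step can be sketched in a few lines of Python. The five categories below are the ones named in the text, but the list of ages is invented for the example and is not the age column of Table 6.1.

```python
GROUP_LABELS = ["less than 21", "21-30", "31-40", "41-50", "greater than 50"]

def age_group(age):
    """Assign a single age to one of the five categories used in the text."""
    if age < 21:
        return "less than 21"
    if age <= 30:
        return "21-30"
    if age <= 40:
        return "31-40"
    if age <= 50:
        return "41-50"
    return "greater than 50"

def grouped_frequencies(ages):
    """Count how many ages fall into each category, keeping empty categories."""
    counts = {label: 0 for label in GROUP_LABELS}
    for age in ages:
        counts[age_group(age)] += 1
    return counts

# Made-up ages for illustration only (NOT the Table 6.1 data).
ages = [18, 19, 23, 25, 31, 33, 38, 45, 47, 63]

for label, count in grouped_frequencies(ages).items():
    print(f"{label:<16}{count}")
```

Plotting these counts as touching bars would give a histogram, and joining them with a line would give a frequency curve.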

The grouped frequency distribution may be displayed visually in the form of a histogram, as shown in Figure 6.2(b). A histogram is slightly different from a bar chart because the bars are drawn so that they touch each other. This indicates that the original variable is quantitative. If the frequencies of the groups are indicated with a line, rather than bars, as shown in Figure 6.2(c), the display is called a frequency curve.
        

             One limitation of grouped frequency distributions is that grouping the values together into categories results in the loss of some information. For instance, it is not possible to tell from the grouped frequency distribution in Figure 6.2(a) exactly how many people in the sample are twenty-three years old. A stem and leaf plot is a method of graphically summarizing the raw data such that the original data values can still be seen. A stem and leaf plot of the age variable from Table 6.1 is shown in Figure 6.3. 
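A minimal text version of a stem and leaf plot can be produced as follows. This sketch assumes non-negative whole-number values, uses the tens digit as the stem and the ones digit as the leaf, and runs on made-up ages rather than the Table 6.1 data.

```python
def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (ones digits).

    Assumes non-negative integers. Stems with no values are kept so that
    gaps in the data remain visible.
    """
    ordered = sorted(values)
    stems = {stem: [] for stem in range(ordered[0] // 10, ordered[-1] // 10 + 1)}
    for value in ordered:
        stems[value // 10].append(value % 10)
    return stems

# Made-up ages for illustration only (NOT the Table 6.1 data).
ages = [18, 19, 23, 23, 25, 31, 33, 38, 45, 47, 63]

for stem, leaves in stem_and_leaf(ages).items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
```

Unlike the grouped frequency distribution, the plot preserves every original value: the row for stem 2 shows two leaves of 3, so exactly two people in this made-up sample are twenty-three years old.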


Descriptive Statistics 

Descriptive statistics are numbers that summarize the pattern of scores observed on a measured variable. This pattern is called the distribution of the variable. Most basically, the distribution can be described in terms of its central tendency—that is, the point in the distribution  around which the data are centered—and its dispersion, or spread. As we will see, central tendency is summarized through the use of descriptive statistics such as the mean, the median, and the mode, and dispersion is summarized through the use of the variance and the standard deviation. Figure 6.4 shows a printout from the IBM Statistical Package for the Social Sciences (IBM SPSS) software of the descriptive statistics for the quantitative variables in Table 6.1.

Measures of Central Tendency. The arithmetic average, or arithmetic mean, is the most commonly used measure of central tendency. It is computed by summing all of the scores on the variable and dividing this sum by the number of participants in the distribution (denoted by the letter N). The sample mean is sometimes denoted with the symbol x̄, read as “X-Bar,” and may also be indicated by the letter M. As you can see in Figure 6.4, in our sample, the mean age of the twenty-five students is 33.52. In this case, the mean provides an accurate index of the central tendency of the age variable because if you look at the stem and leaf plot in Figure 6.3, you can see that most of the ages are centered at about thirty-three.
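The computation of the mean is simple enough to write out directly. The sketch below uses ten made-up ages, not the twenty-five ages in Table 6.1, so its result is not expected to match the value reported in Figure 6.4.

```python
def sample_mean(scores):
    """Sum all of the scores and divide by the number of participants, N."""
    n = len(scores)
    if n == 0:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(scores) / n

# Made-up ages for illustration only (NOT the Table 6.1 data).
ages = [18, 19, 23, 25, 31, 33, 38, 45, 47, 63]

print(sample_mean(ages))  # 342 / 10
```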
          The pattern of scores observed on a measured variable is known as the variable’s distribution. It turns out that most quantitative variables have distributions similar to that shown in Figure 6.5(a). Most of the data are located near the center of the distribution, and the distribution is symmetrical and bell-shaped. Data distributions that are shaped like a bell are known as normal distributions.
          In some cases, however, the data distribution is not symmetrical. This occurs when there are one or more extreme scores (known as outliers) at one end of the distribution. For instance, because there is an outlier in the family income variable in Table 6.1 (a value of $2,800,000), a frequency curve of this variable would look more like that shown in Figure 6.5(b) than that shown in Figure 6.5(a). Distributions that are not symmetrical are said to be skewed. As shown in Figure 6.5(b) and (c), distributions are said to be either positively skewed or negatively skewed, depending on where the outliers fall. 
Because the mean is highly influenced by the presence of outliers, it is not a good measure of central tendency when the distribution is highly skewed. For instance, although it appears from Table 6.1 that the central tendency of the family income variable should be around $40,000, the mean family income is actually $159,920. The single very extreme income has a disproportionate impact on the mean, resulting in a value that does not represent the central tendency well.
          The median is used as an alternative measure of central tendency when distributions are skewed. The median is the score in the center of the distribution, meaning that 50 percent of the scores are greater than the median and 50 percent of the scores are lower than the median. Methods for calculating the median are presented in Appendix B. In our case, the median household income ($43,000) is a much better indication of central tendency than is the mean household income ($159,920). 
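The different sensitivity of the mean and the median to an outlier can be seen in a small experiment using Python's standard `statistics` module. The five ordinary incomes below are made up for the example; only the extreme value of $2,800,000 is taken from the text.

```python
from statistics import mean, median

# Made-up incomes for illustration only (NOT the Table 6.1 data).
incomes = [38_000, 40_000, 43_000, 43_000, 45_000]

# Add the single extreme income mentioned in the text.
with_outlier = incomes + [2_800_000]

print("without outlier:", mean(incomes), median(incomes))
print("with outlier:   ", mean(with_outlier), median(with_outlier))
```

In this made-up sample, adding the one extreme income raises the mean from 41,800 to 501,500, while the median stays at 43,000.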
       A final measure of central tendency, known as the mode, represents the value that occurs most frequently in the distribution. You can see from Table 6.1 that the modal value for the income variable is $43,000 (it occurs four times). In some cases there can be more than one mode. For instance, the age variable has modes at 18, 19, 31, 33, and 45. Although the mode does represent central tendency, it is not frequently used in scientific research. The relationships among the mean, the median, and the mode are described in Figure 6.5. 
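When a variable may have more than one mode, the `multimode` function in Python's standard `statistics` module (available from Python 3.8) returns all of them. The ages below are made up for the example and are not the Table 6.1 data.

```python
from statistics import multimode

# Made-up ages for illustration only (NOT the Table 6.1 data).
ages = [18, 18, 19, 19, 23, 31, 31, 40]

# multimode returns every value tied for the highest frequency,
# in the order in which each was first encountered.
print(multimode(ages))
```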

Measures of Dispersion. In addition to summarizing the central tendency of a distribution, descriptive statistics convey information about how the scores on the variable are spread around the central tendency. Dispersion refers to the extent to which the scores are all tightly clustered around the central tendency or are more spread out away from it.

One simple measure of dispersion is to find the largest (the maximum) and the smallest (the minimum) observed values of the variable and to compute the range of the variable as the maximum observed score minus the minimum observed score. You can check that the range of the age variable is 63 − 18 = 45.
The standard deviation, symbolized as s, is the most commonly used measure of dispersion. As discussed in more detail in Appendix B, computation of the standard deviation begins with the calculation of a mean deviation score for each individual. The mean deviation is the score on the variable minus the mean of the variable. Individuals who score above the mean have positive deviation scores, whereas those who score below the mean have negative deviation scores. The mean deviations are squared and summed to produce a statistic called the sum of squared deviations, or sum of squares. The sum of squares is divided by the sample size (N) to produce a statistic known as the variance, symbolized as s². The square root of the variance is the standard deviation, s. Distributions with a larger standard deviation have more spread. As you can see from Figure 6.4, the standard deviation of the age variable in Table 6.1 is 12.51.
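The steps described above (range, mean deviations, sum of squares, variance, standard deviation) can be traced in a short Python sketch. It follows the text in dividing the sum of squares by N; note that many statistical packages instead divide by N − 1 when estimating from a sample, so their output may differ slightly from this calculation. The scores are made up so that the arithmetic is easy to check by hand.

```python
import math

def dispersion(scores):
    """Compute the range, sum of squares, variance, and standard deviation.

    The variance here is the sum of squares divided by N, as described in
    the text (not by N - 1).
    """
    n = len(scores)
    mean = sum(scores) / n
    deviations = [score - mean for score in scores]   # mean deviation scores
    sum_of_squares = sum(d * d for d in deviations)   # squared and summed
    variance = sum_of_squares / n
    return {
        "range": max(scores) - min(scores),
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "standard_deviation": math.sqrt(variance),
    }

# Made-up scores for illustration only (NOT the Table 6.1 data).
scores = [2, 4, 4, 4, 5, 5, 7, 9]

for name, value in dispersion(scores).items():
    print(f"{name}: {value}")
```

For these scores the mean is 5, the sum of squares is 32, the variance is 32 / 8 = 4, and the standard deviation is 2.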