Reliability
The reliability of a measure refers to the extent to which it is free from random error. One direct way to determine the reliability of a measured variable is to measure it more than once. For instance, you can test the reliability of a bathroom scale by weighing yourself on it twice in a row. If the scale gives the same weight both times (we’ll assume your actual weight hasn’t changed in between), you would say that it is reliable. But if the scale gives different weights each time, you would say that it is unreliable. Just as a bathroom scale is not useful if it is not consistent over time, an unreliable measured variable will not be useful in research.
The next section reviews the different approaches to assessing a measure’s reliability; these are summarized in Table 5.1.
Test-Retest Reliability
Test-retest reliability refers to the extent to which scores on the same measured variable correlate with each other on two different measurements given at two different times. If the test is perfectly reliable, and if the scores on the conceptual variable do not change over the time period, the individuals should receive the exact same score each time, and the correlation between the scores will be r = 1.00. However, if the measured variable contains random error, the two scores will not be as highly correlated. Higher positive correlations between the scores at the two times indicate higher test-retest reliability.
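In practice, the test-retest correlation is simply the Pearson correlation between the scores from the two administrations. As a rough, hypothetical illustration (the data below are invented, and the scipy routine is only one of several ways to compute r), the calculation might look like this:

```python
import numpy as np
from scipy.stats import pearsonr

# Invented scores for the same ten people measured at Time 1 and Time 2.
time1 = np.array([12, 15, 9, 20, 14, 11, 18, 16, 13, 17])
time2 = np.array([13, 14, 10, 19, 15, 10, 18, 17, 12, 18])

r, p = pearsonr(time1, time2)  # Pearson correlation between the two administrations
print(f"Test-retest reliability: r = {r:.2f}")  # values closer to +1.00 indicate higher reliability
```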
Although the test-retest procedure is a direct way to measure reliability, it does have some limits. For one thing, when the procedure is used to assess the reliability of a self-report measure, it can produce reactivity. As you will recall from Chapter 4, reactivity refers to the influence of measurement on the variables being measured. In this case, reactivity is a potential problem because when the same or similar measures are given twice, responses on the second administration may be influenced by the measure having been taken the first time. These problems are known as retesting effects.
Retesting problems may occur, for instance, if people remember how they answered the questions the first time. Some people may believe that the experimenter wants them to express different opinions on the second occasion (or else why are the questions being given twice?). This would obviously reduce the test-retest correlation and thus give an overly low reliability assessment. Or respondents may try to duplicate their previous answers exactly to avoid appearing inconsistent, which would artificially inflate the reliability estimate. Participants may also get bored answering the same questions twice. Although some of these problems can be avoided through the use of a long testing interval (say, over one month) and through the use of appropriate instructions (for instance, instructions to be honest and to answer exactly how one is feeling right now), retesting poses a general problem for the computation of test-retest reliability.
To help avoid some of these problems, researchers sometimes employ a more sophisticated type of test-retest reliability known as equivalent-forms reliability. In this approach two different but equivalent versions of the same measure are given at different times, and the correlation between the scores on the two versions is assessed. Such an approach is particularly useful when there are correct answers to the test that individuals might learn by taking the first test or be able to find out during the time period between the tests. Because students might remember the questions and learn the answers to aptitude tests such as the Graduate Record Exam (GRE) or the Scholastic Aptitude Test (SAT), these tests employ equivalent forms.
Reliability as Internal Consistency
In addition to the problems that can occur when people complete the same measure more than once, test-retest reliability has another limitation: it is informative only for conceptual variables that are expected to be stable over time within an individual. Some variables do meet this requirement. Clearly, if optimism has meaning as a conceptual variable, then people who are optimists on Tuesday should also be optimists on Friday of next week. Conceptual variables such as intelligence, friendliness, assertiveness, and optimism are known as traits, which are personality variables that are not expected to vary (or at most to vary only slowly) within people over time.
Other conceptual variables, such as level of stress, moods, or even preference for classical over rock music, are known as states. States are personality variables that are expected to change within the same person over short periods of time. Because a person’s score on a mood measure administered on Tuesday is not necessarily expected to be related to the same measure administered next Friday, the test-retest approach will not provide an adequate assessment of the reliability of a state variable such as mood.

Because of the problems associated with test-retest and equivalent-forms reliability, another measure of reliability, known as internal consistency, has become the most popular and most accurate way of assessing reliability for both trait and state measures. Internal consistency is assessed using the scores on a single administration of the measure.
You will recall from our discussion in Chapter 4 that most self-report measures contain a number of items. If you think about measurement in terms of reliability, the reason for this practice will become clear. You can imagine that a measure that had only one item might be unreliable because that specific item might have a lot of random error. For instance, respondents might not understand the question the way you expected them to, or they might read it incorrectly. In short, any single item is not likely to be very reliable.
True Score and Random Error. One of the basic principles of reliability is that the more measured variables are combined together, the more reliable the test will be. This is so because, although each measured variable will be influenced in part by random error, some part of each item will also measure the individual’s true score, the part of the scale score that is not random error. Furthermore, because random error is self-canceling, the random error components of each measured variable will not be correlated with each other, whereas the parts of the measured variables that represent the true score will be correlated. As a result, combining many measured variables by summing or averaging them will produce a more reliable estimate of the conceptual variable than any of the individual measured variables will by themselves.
The role of true score and random error can be expressed in the form of two equations that are the basis of reliability. First, an individual’s score on a measure will consist of both true score and random error:
Actual score = True score + Random error
and reliability is the proportion of the actual score that reflects true score (and not random error):
Reliability = True score / Actual score
To take a more specific example, consider for a moment the Rosenberg self-esteem scale that we examined in Table 4.2. This scale has ten items, each designed to assess the conceptual variable of self-esteem in a slightly different way. Although each of the items will have random error, each should also measure the true score of the individual. Thus if we average all ten of the items together to form a single measure, this overall scale score will be a more reliable measure than will any one of the individual questions.
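Although the text presents no code, a small simulation can make these ideas concrete. The sketch below (a hypothetical illustration, not part of the scale itself) generates ten items, each equal to a person's true score plus independent random error, and shows that the averaged scale score correlates more strongly with the true score than a single item does, because the error components tend to cancel when the items are combined:

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_items = 200, 10

true_score = rng.normal(0.0, 1.0, n_people)          # each person's true standing
error = rng.normal(0.0, 1.0, (n_people, n_items))    # independent random error for each item
items = true_score[:, None] + error                  # actual score = true score + random error

single_item = items[:, 0]                            # one item on its own
scale_score = items.mean(axis=1)                     # average of all ten items

print(np.corrcoef(true_score, single_item)[0, 1])    # around .7: a single item is noisy
print(np.corrcoef(true_score, scale_score)[0, 1])    # noticeably higher: the error has partly canceled
```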
Internal consistency refers to the extent to which the scores on the items correlate with each other and thus are all measuring the true score rather than random error. In terms of the Rosenberg scale, a person who answers above average on question 1, indicating that she or he has high self-esteem, should also respond above the average on all of the other questions. Of course, this pattern will not be perfect because each item has some error. However, to the extent that all of the items are measuring true score rather than random error, the average correlation among the items will approach r = 1.00. To the extent that the correlation among the items is less than r = 1.00, it tells us either that there is random error or that the items are not measuring the same thing.
Coefficient Alpha. One way to calculate the internal consistency of a scale is to correlate a person’s score on one half of the items (for instance, the even-numbered items) with her or his score on the other half of the items (the odd-numbered items). This procedure is known as split-half reliability. If the scale is reliable, then the correlation between the two halves will approach r = 1.00, indicating that both halves measure the same thing. However, because split-half reliability uses only some of the available correlations among the items, it is preferable to have a measure that indexes the average correlation among all of the items on the scale. The most common, and the best, index of internal consistency is known as Cronbach’s coefficient alpha, symbolized as α. This measure is an estimate of the average correlation among all of the items on the scale and is numerically equivalent to the average of all possible split-half reliabilities.
Coefficient alpha, because it reflects the underlying correlational structure of the scale, ranges from α = 0.00 (indicating that the measure is entirely error) to α = +1.00 (indicating that the measure has no error). In most cases, statistical computer programs are used to calculate coefficient alpha, but alpha can also be computed by hand according to the formula presented in Appendix D.
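As an informal sketch (not the Appendix D presentation), both split-half reliability and coefficient alpha can be computed from a matrix of item scores with respondents in rows and items in columns; the alpha function below uses the common variance-based computing formula:

```python
import numpy as np

def split_half(items: np.ndarray) -> float:
    """Correlation between the sums of the odd- and even-numbered items."""
    odd = items[:, 0::2].sum(axis=1)
    even = items[:, 1::2].sum(axis=1)
    return np.corrcoef(odd, even)[0, 1]

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an (n_respondents x n_items) matrix of item scores."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
```

With the simulated ten-item data from the earlier sketch, for example, cronbach_alpha(items) should come out near .90, consistent with the argument that combining items reduces random error.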
Item-to-Total Correlations. When a new scale is being developed, its initial reliability may be low. This is because, although the researcher has selected those items that he or she believes will be reliable, some items will turn out to contain random error for reasons that could not be predicted in advance. Thus, one strategy commonly used in the initial development of a scale is to calculate the correlations between the score on each of the individual items and the total scale score excluding the item itself (these correlations are known as the item-to-total correlations). The items that do not correlate highly with the total score can then be deleted from the scale. Because this procedure deletes the items that do not measure the same thing that the scale as a whole does, the result is a shorter scale, but one with higher reliability. However, the approach of throwing out the items that do not correlate highly with the total is used only in the scale development process. Once the final version of the scale is in place, this version should be given again to another sample of participants, and the reliability computed without dropping any items.
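A minimal sketch of this step, under the same assumptions as the earlier code (item scores arranged in a respondents-by-items matrix), correlates each item with the total of the remaining items; items with low values would be the candidates for deletion during scale development:

```python
import numpy as np

def item_total_correlations(items: np.ndarray) -> np.ndarray:
    """Correlation of each item with the scale total excluding that item."""
    total = items.sum(axis=1)
    corrs = np.empty(items.shape[1])
    for i in range(items.shape[1]):
        rest = total - items[:, i]                      # total score with the item itself removed
        corrs[i] = np.corrcoef(items[:, i], rest)[0, 1]
    return corrs
```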
Interrater Reliability
To this point we have discussed reliability primarily in terms of self-report scales. However, reliability is just as important for behavioral measures. It is common practice for a number of judges to rate the same observed behaviors and then to combine their ratings to create a single measured variable. This computation requires the internal consistency approach—just as any single item on a scale is expected to have error, so the ratings of any one judge are more likely to contain error than is the averaged rating across a group of judges. The errors of judges can be caused by many things, including inattention to some of the behaviors, misunderstanding of instructions, or even personal preferences. When the internal consistency of a group of judges is calculated, the resulting reliability is known as interrater reliability.
If the ratings of the judges that are being combined are quantitative variables (for instance, if the coders have each determined the aggressiveness of a group of children on a scale from 1 to 10), then coefficient alpha can be used to evaluate reliability. However, in some cases the variables of interest may be nominal. This would occur, for instance, if the judges have indicated for each child whether he or she was playing “alone,” “cooperatively,” “competitively,” or “aggressively.” In such cases, a statistic known as kappa (κ) is used as the measure of agreement among the judges. Like coefficient alpha, kappa ranges from κ = 0 (indicating that the judges’ ratings are entirely random error) to κ = +1.00 (indicating that the ratings have no error). The formula for computing kappa is presented in Appendix C.
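Appendix C gives the formula for kappa; as an illustration only, the sketch below computes Cohen's kappa for the simplest two-judge case, using invented ratings in the play categories mentioned above (for quantitative ratings, the cronbach_alpha function sketched earlier could be applied with the judges treated as the "items"):

```python
from collections import Counter

def cohen_kappa(ratings1, ratings2):
    """Cohen's kappa: agreement between two judges, corrected for chance agreement."""
    n = len(ratings1)
    observed = sum(a == b for a, b in zip(ratings1, ratings2)) / n
    c1, c2 = Counter(ratings1), Counter(ratings2)
    expected = sum(c1[cat] * c2[cat] for cat in set(c1) | set(c2)) / (n * n)
    return (observed - expected) / (1 - expected)

judge_a = ["alone", "cooperatively", "aggressively", "alone", "competitively"]
judge_b = ["alone", "cooperatively", "competitively", "alone", "competitively"]
print(cohen_kappa(judge_a, judge_b))  # about .72 for these invented ratings
```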