
Sunday, 30 June 2013

Current Research in the Behavioral Sciences: Assessing Americans’ Attitudes Toward Health Care



Because so many opinion polls are now conducted, and many of their results are quickly put online, it is now possible to view the estimated opinions of large populations in almost real time. For instance, as I write these words in July 2009, I can visit the CBS News website and see the results of a number of recent polls regarding the opinions of U.S. citizens about a variety of national issues.
        One poll, reported at http://www.cbsnews.com/htdocs/pdf/jul09b_health_care-AM.pdf, used a random sample of 1,050 adults nationwide in the United States, who were interviewed by telephone on July 24–28, 2009. The phone numbers were dialed from random-digit-dial samples of both standard landline and cell phones. The error due to sampling for results based on the entire sample is plus or minus three percentage points, although the error for subgroups is higher.
            The polls provide a snapshot of the current state of thinking among U.S. citizens about health care reform. Here are some findings:
          In response to the question “Will health care reform happen in 2009?” most Americans see health care reform as likely, although just 16 percent call it “very” likely. Four in 10 think it is not likely this year. 

Very likely 16% 
Somewhat likely 43% 
Not likely 40% 

        However, many Americans don’t see how they would personally benefit from the health care proposals being considered. In response to the question, “Would the current congressional reform proposals help you?” 59 percent say those proposals—as they understand them—would not help them directly. Just under a third say current plans would.

Yes 31%
No 59%

         By a 2 to 1 margin, Americans feel President Obama has better ideas for reforming health care than Congressional Republicans. Views on this are partisan, but independents side with the President.
          The question asked was “Who has better ideas for health care reform?” Here are the results overall, as well as separately for Democrats, Republicans, and Independents:

                              Overall      Democrats      Republicans      Independents
President Obama                 55%            81%             27%              48%
Republicans                     26%            10%             52%              26%

But, as you can see in the responses to the following question, Mr. Obama’s approval rating on handling the overall issue remains under  50 percent, and many still don’t have a view yet:
          “Do you approve or disapprove of President Obama’s health care plans?”
           Approve           46%
           Disapprove       38%
           Don’t know      16%

Summarizing the Sample Data


You can well imagine that once a survey has been completed, the collected data (known as the raw data) must be transformed in a way that will allow them to be meaningfully interpreted. The raw data are, by themselves, not very useful for gaining the desired snapshot because they contain too many numbers. For example, if we interview 500 people and ask each of them forty questions, there will be 20,000 responses to examine. In this section we will consider some of the statistical methods used to summarize sample data. Procedures for using computer software programs to conduct statistical analyses are reviewed in Appendix B, and you may want to read this material at this point.

Frequency Distributions

Table 6.1 presents some hypothetical raw data from twenty-five participants on five variables collected in a sort of “minisurvey.” You can see that the table is arranged such that the variables (sex, ethnic background, age, life satisfaction, family income) are in the columns and the participants form the rows. For nominal variables such as sex or ethnicity, the data can be summarized through the use of a frequency distribution. A frequency distribution is a table that indicates how many, and in most cases what percentage, of individuals in the sample fall into each of a set of categories. A frequency distribution of the ethnicity variable from Table 6.1 is shown in Figure 6.1(a). The frequency distribution can be displayed visually in a bar chart, as shown for the ethnic background variable in Figure 6.1(b). The characteristics of the sample are easily seen when summarized through a frequency distribution or a bar chart.
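Because Table 6.1 itself is not reproduced in this post, the short Python sketch below uses made-up ethnicity codes to show how a frequency distribution for a nominal variable can be tabulated; the categories and counts are illustrative only, not the actual data.

```python
from collections import Counter

# Hypothetical ethnicity codes standing in for the Table 6.1 column
# (the actual data are not reproduced in this post).
ethnicity = ["White", "White", "African American", "Asian", "White",
             "Hispanic", "White", "African American", "Asian", "White"]

counts = Counter(ethnicity)   # frequency of each category
n = len(ethnicity)

print(f"{'Category':<20}{'Frequency':>10}{'Percent':>10}")
for category, freq in counts.most_common():
    print(f"{category:<20}{freq:>10}{freq / n * 100:>9.1f}%")
```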

          One approach to summarizing a quantitative variable is to combine adjacent values into a set of categories and then to examine the frequencies of each of the categories. The resulting distribution is known as a grouped frequency distribution. A grouped frequency distribution of the age variable from Table 6.1 is shown in Figure 6.2(a). In this case, the ages have been grouped into five categories (less than 21, 21–30, 31–40, 41–50, and greater than 50).

The grouped frequency distribution may be displayed visually in the form of a histogram, as shown in Figure 6.2(b). A histogram is slightly different from a bar chart because the bars are drawn so that they touch each other. This indicates that the original variable is quantitative. If the frequencies of the groups are indicated with a line, rather than bars, as shown in Figure 6.2(c), the display is called a frequency curve.
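As a rough illustration, the sketch below groups a hypothetical set of ages into the same five categories used in Figure 6.2(a) and prints a crude text histogram; the ages are invented for the example, not taken from Table 6.1.

```python
# Hypothetical ages standing in for the age column of Table 6.1.
ages = [18, 19, 19, 20, 21, 22, 23, 23, 27, 29, 31, 31, 33,
        33, 35, 38, 41, 42, 45, 45, 47, 52, 54, 58, 63]

# The same five categories used in Figure 6.2(a).
bins = [("less than 21",    lambda a: a < 21),
        ("21-30",           lambda a: 21 <= a <= 30),
        ("31-40",           lambda a: 31 <= a <= 40),
        ("41-50",           lambda a: 41 <= a <= 50),
        ("greater than 50", lambda a: a > 50)]

for label, in_bin in bins:
    freq = sum(1 for a in ages if in_bin(a))
    # One asterisk per observation: a crude text histogram.
    print(f"{label:<16}{freq:>3}  {'*' * freq}")
```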
        

             One limitation of grouped frequency distributions is that grouping the values together into categories results in the loss of some information. For instance, it is not possible to tell from the grouped frequency distribution in Figure 6.2(a) exactly how many people in the sample are twenty-three years old. A stem and leaf plot is a method of graphically summarizing the raw data such that the original data values can still be seen. A stem and leaf plot of the age variable from Table 6.1 is shown in Figure 6.3. 
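Building a stem and leaf plot only requires splitting each value into a tens digit (the stem) and a ones digit (the leaf). Here is a minimal sketch using the same hypothetical ages as above.

```python
from collections import defaultdict

# The same hypothetical ages used in the grouped-frequency sketch above.
ages = [18, 19, 19, 20, 21, 22, 23, 23, 27, 29, 31, 31, 33,
        33, 35, 38, 41, 42, 45, 45, 47, 52, 54, 58, 63]

stems = defaultdict(list)
for a in sorted(ages):
    stems[a // 10].append(a % 10)   # tens digit = stem, ones digit = leaf

for stem in sorted(stems):
    leaves = "".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
```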


Descriptive Statistics 

Descriptive statistics are numbers that summarize the pattern of scores observed on a measured variable. This pattern is called the distribution of the variable. Most basically, the distribution can be described in terms of its central tendency—that is, the point in the distribution  around which the data are centered—and its dispersion, or spread. As we will see, central tendency is summarized through the use of descriptive statistics such as the mean, the median, and the mode, and dispersion is summarized through the use of the variance and the standard deviation. Figure 6.4 shows a printout from the IBM Statistical Package for the Social Sciences (IBM SPSS) software of the descriptive statistics for the quantitative variables in Table 6.1.

Measures of Central Tendency. The arithmetic average, or arithmetic mean, is the most commonly used measure of central tendency. It is computed by summing all of the scores on the variable and dividing this sum by the number of participants in the distribution (denoted by the letter N). The sample mean is sometimes denoted with the symbol x̄, read as “X-bar,” and may also be indicated by the letter M. As you can see in Figure 6.4, in our sample, the mean age of the twenty-five students is 33.52. In this case, the mean provides an accurate index of the central tendency of the age variable because if you look at the stem and leaf plot in Figure 6.3, you can see that most of the ages are centered at about thirty-three.
          The pattern of scores observed on a measured variable is known as the variable’s distribution. It turns out that most quantitative variables have distributions similar to that shown in Figure 6.5(a). Most of the data are located near the center of the distribution, and the distribution is symmetrical and bell-shaped. Data distributions that are shaped like a bell are known as normal distributions.
          In some cases, however, the data distribution is not symmetrical. This occurs when there are one or more extreme scores (known as outliers) at one end of the distribution. For instance, because there is an outlier in the family income variable in Table 6.1 (a value of $2,800,000), a frequency curve of this variable would look more like that shown in Figure 6.5(b) than that shown in Figure 6.5(a). Distributions that are not symmetrical are said to be skewed. As shown in Figure 6.5(b) and (c), distributions are said to be either positively skewed or negatively skewed, depending on where the outliers fall. 
         Because the mean is highly influenced by the presence of outliers, it is not a good measure of central tendency when the distribution is highly skewed. For instance, although it appears from Table 6.1 that the central tendency of the family income variable should be around $40,000, the mean family income is actually $159,920. The single very extreme income has a disproportionate impact on the mean, resulting in a value that does not well represent the central tendency.
          The median is used as an alternative measure of central tendency when distributions are skewed. The median is the score in the center of the distribution, meaning that 50 percent of the scores are greater than the median and 50 percent of the scores are lower than the median. Methods for calculating the median are presented in Appendix B. In our case, the median household income ($43,000) is a much better indication of central tendency than is the mean household income ($159,920). 
       A final measure of central tendency, known as the mode, represents the value that occurs most frequently in the distribution. You can see from Table 6.1 that the modal value for the income variable is $43,000 (it occurs four times). In some cases there can be more than one mode. For instance, the age variable has modes at 18, 19, 31, 33, and 45. Although the mode does represent central tendency, it is not frequently used in scientific research. The relationships among the mean, the median, and the mode are described in Figure 6.5. 
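To see how an outlier pulls these three measures apart, the sketch below computes the mean, median, and mode for a hypothetical income distribution that contains one extreme value, echoing the $2,800,000 outlier discussed in the text; the numbers are invented, not the actual Table 6.1 incomes.

```python
from statistics import mean, median, mode

# Hypothetical family incomes with one extreme outlier.
incomes = [28_000, 31_000, 35_000, 39_000, 43_000, 43_000, 43_000,
           43_000, 47_000, 52_000, 61_000, 2_800_000]

print(f"Mean:   {mean(incomes):>12,.0f}")    # pulled far upward by the outlier
print(f"Median: {median(incomes):>12,.0f}")  # resistant to the outlier
print(f"Mode:   {mode(incomes):>12,.0f}")    # the most frequent value
```

The mean lands far above the income of any typical case, while the median and mode stay near the center of the bulk of the data, which is exactly the pattern described above for the skewed income variable.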

Measures of Dispersion. In addition to summarizing the central tendency of a distribution, descriptive statistics convey information about how the scores on the variable are spread around the central tendency. Dispersion refers to the extent to which the scores are all tightly clustered around the central tendency or, at the other extreme, are spread out widely across the distribution.

          One simple measure of dispersion is to find the largest (the maximum) and the smallest (the minimum) observed values of the variable and to compute the range of the variable as the maximum observed score minus the minimum observed score. You can check that the range of the age variable is 63 − 18 = 45.
        The standard deviation, symbolized as s, is the most commonly used measure of dispersion. As discussed in more detail in Appendix B, computation of the standard deviation begins with the calculation of a mean deviation score for each individual. The mean deviation is the score on the variable minus the mean of the variable. Individuals who score above the mean have positive deviation scores, whereas those who score below the mean have negative deviation scores. The mean deviations are squared and summed to produce a statistic called the sum of squared deviations, or sum of squares. The sum of squares is divided by the sample size (N) to produce a statistic known as the variance, symbolized as s². The square root of the variance is the standard deviation, s. Distributions with a larger standard deviation have more spread. As you can see from Figure 6.4, the standard deviation of the age variable in Table 6.1 is 12.51.
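The steps just described (deviation scores, sum of squares, variance, standard deviation) can be followed directly in code. A minimal sketch, applied to the same hypothetical ages used earlier and dividing the sum of squares by N as the text does:

```python
import math

# The same hypothetical ages used earlier (not the actual Table 6.1 data,
# for which Figure 6.4 reports M = 33.52 and s = 12.51).
ages = [18, 19, 19, 20, 21, 22, 23, 23, 27, 29, 31, 31, 33,
        33, 35, 38, 41, 42, 45, 45, 47, 52, 54, 58, 63]

n = len(ages)
mean_age = sum(ages) / n

value_range = max(ages) - min(ages)               # maximum minus minimum
deviations = [a - mean_age for a in ages]         # score minus the mean
sum_of_squares = sum(d ** 2 for d in deviations)  # sum of squared deviations
variance = sum_of_squares / n                     # divided by N, as in the text
std_dev = math.sqrt(variance)                     # standard deviation

print(f"Range = {value_range}, variance = {variance:.2f}, s = {std_dev:.2f}")
```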

Saturday, 29 June 2013

Current Research in the Behavioral Sciences: The Hillyer-Joynes Kinematics Scale of Locomotion in Rats With Spinal Injuries

Jessica Hillyer and Robin L. Joynes conduct research on animals with injuries to their spinal cords, with the goal of helping learn how organisms, including humans, may be able to improve their physical movements (locomotion) after injury. One difficulty that they noted in their research with rats was that the existing measure of locomotion (the BBB Locomotor Rating Scale (BBB); Basso, Beattie, & Bresnahan, 1995) was not sophisticated enough to provide a clear measure of locomotion skills. They therefore decided to create their own new measure, which they called the Hillyer-Joynes Kinematics Scale of Locomotion (HiJK). Their measure was designed to assess the locomotion abilities of rats walking on treadmills.
        The researchers began by videotaping 137 rats with various degrees of spinal cord injuries as they walked on treadmills. Three different coders then viewed the videotapes for a subset of twenty of the rats. For each of these twenty rats, the coders rated the walking skills of the rats on eight different dimensions: Extension of the Hip, Knee, and Ankle joints; Fluidity of the joint movement; Alternation of the legs during movement; Placement of the feet; Weight support of the movement; and Consistency of walking.
      Once the raters had completed their ratings, the researchers tested for interrater reliability, to see whether the three raters agreed on their coding of each of the dimensions that they had rated. Overall, they found high interrater reliability, generally with r’s over .9. For instance, for the ratings of foot placement, the correlations among the three coders were as follows:

                   Rater 1             Rater 2
Rater 2          .95
Rater 3          .95                 .99

       The researchers then had one of the three raters rate all 137 of the rats on the 8 subscales. On the basis of this rater’s judgments, they computed the overall reliability of the new measure, using each of the eight rated dimensions as an item in the scale. The Cronbach’s alpha for the composite scale, based on 8 items and 137 rats, was α = .86, denoting acceptable reliability.
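To make the interrater analysis concrete, here is a minimal sketch that computes the pairwise correlations among three raters using hypothetical foot-placement ratings (the actual Hillyer-Joynes data are not reproduced here); it assumes Python 3.10 or later for statistics.correlation.

```python
from statistics import correlation  # Pearson r; requires Python 3.10+

# Hypothetical foot-placement ratings by three coders for ten rats.
rater1 = [3, 5, 2, 6, 4, 7, 1, 5, 6, 2]
rater2 = [3, 5, 2, 7, 4, 7, 1, 4, 6, 2]
rater3 = [4, 5, 2, 6, 4, 7, 1, 5, 6, 3]

pairs = [("Rater 1 vs Rater 2", rater1, rater2),
         ("Rater 1 vs Rater 3", rater1, rater3),
         ("Rater 2 vs Rater 3", rater2, rater3)]

for label, x, y in pairs:
    print(f"{label}: r = {correlation(x, y):.2f}")
```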

         Having determined that their new measure was reliable, the researchers next turned to study the validity of the scale. The researchers found that the new measure correlated significantly with scores on the existing measure of locomotion, the BBB Locomotor Rating Scale, suggesting that the new measure was assessing the locomotion of the rats in much the same way as the established scale.

         Finally, the researchers tested for predictive validity by correlating both the BBB and the HiJK with a physiological assessment of the magnitude of each rat’s spinal cord injury. The researchers found that the HiJK was better able to predict the nature of the rats’ injuries than was the BBB, suggesting that the new measure may be a better measure than the old one.

Comparing Reliability and Validity


We have seen that reliability and construct validity are similar in that they are both assessed through examination of the correlations among measured variables. However, they are different in the sense that reliability refers to correlations among different variables that the researcher is planning to combine into the same measure of a single conceptual variable, whereas construct validity refers to correlations of a measure with different measures of other conceptual variables. In this sense, it is appropriate to say that reliability comes before validity because reliability is concerned with creating a measure that is then tested in relationship to other measures. If a measure is not reliable, then its construct validity cannot be determined. Tables 5.1 and 5.2 summarize the various types of reliability and validity that researchers must consider. 
            One important question that we have not yet considered is “How reliable and valid must a scale be in order to be useful?” Researchers do not always agree about the answer, except for the obvious fact that the higher the reliability and the construct validity, the better. One criterion that seems reasonable is that the reliability of a commonly used scale should be at least α = .70. However, many tests have reliabilities well above α = .80.
           In general, it is easier to demonstrate the reliability of a measured variable than it is to demonstrate a variable’s construct validity. This is so in part because demonstrating reliability involves only showing that the measured variables correlate with each other, whereas validity involves showing both convergent and discriminant validity. Also, because the items on a scale are all answered using the same response format and are presented sequentially, and because items that do not correlate highly with the total scale score can be deleted, high reliabilities are usually not difficult to achieve.
           However, the relationships among different measures of the same conceptual variable that serve as the basis for demonstrating convergent validity are generally very low. For instance, the correlations observed by Snyder were only in the range of .40, and such correlations are not unusual. Although correlations of such size may seem low, they are still taken as evidence for convergent validity. 
          One of the greatest difficulties in developing a new scale is to demonstrate its discriminant validity. Although almost any new scale that you can imagine will be at least moderately correlated with at least some other existing scales, to be useful, the new scale must be demonstrably different from existing scales in at least some critical respects. Demonstrating this uniqueness is difficult and will generally require that a number of different studies be conducted.
         Because there are many existing scales in common use within the behavioral sciences, carefully consider whether you really need to develop a new scale for your research project. Before you begin scale development, be sure to determine if a scale assessing the conceptual variable you are interested in, or at least a similar conceptual variable, might already exist. A good source for information about existing scales, in addition to PsycINFO®, is Robinson, Shaver, and Wrightsman (1991). Remember that it is always advantageous to use an existing measure rather than to develop your own— the reliability and validity of such measures are already established, saving you a lot of work.

Improving the Reliability and Validity of Measured Variables


Now that we have considered some of the threats to the validity of measured variables, we can ask how our awareness of these potential threats can help us improve our measures. Most basically, the goal is to be aware of the potential difficulties and to keep them in mind as we design our measures. Because the research process is a social interaction between researcher and participant, we must carefully consider how the participant perceives the research and consider how she or he may react to it. The following are some useful tips for creating valid measures:

1.    Conduct a pilot test. Pilot testing involves trying out a questionnaire or other research on a small group of individuals to get an idea of how they react to it before the final version of the project is created. After collecting the data from the pilot test, you can modify the measures before actually using the scale in research. Pilot testing can help ensure that participants understand the questions as you expect them to and that they cannot guess the purpose of the questionnaire. You can also use pilot testing to create self-report measures. You ask participants in the pilot study to generate thoughts about the conceptual variables of interest. Then you use these thoughts to generate ideas about the types of items that should be asked on a fixed-format scale.

2.      Use multiple measures. As we have seen, the more types of measures are used to assess a conceptual variable, the more information about the variable is gained. For instance, the more items a test has, the more reliable it will be. However, be careful not to make your scale so long that your participants lose interest in taking it! As a general guideline, twenty items are usually sufficient to produce a highly reliable measure.

3.       Ensure variability within your measures. If 95 percent of your participants answer an item with the response 7 (strongly agree) or the response 1 (strongly disagree), the item won’t be worth including because it won’t differentiate the respondents. One way to guarantee variability is to be sure that the average response of your respondents is near the middle of the scale. This means that although most people fall in the middle, some people will fall above and some below the average. Pilot testing enables you to create measures that have variability. 

4.         Write good items. Make sure that your questions are understandable and not ambiguous. This means the questions shouldn’t be too long or too short. Try to avoid ambiguous words. For instance, “Do you regularly feel stress?” is not as good as “How many times per week do you feel stress?” because the term “regularly” is ambiguous. Also watch for “double-barreled” questions such as “Are you happy most of the time, or do you find there to be no reason to be happy?” A person who is happy but does not find any real reason for it would not know how to answer this question. Keep your questions as simple as possible, and be specific. For instance, the question “Do you like your parents?” is vaguer than “Do you like your mother?” and “Do you like your father?”

5.         Attempt to get your respondents to take your questions seriously. In the instructions you give to them, stress that the accuracy of their responses is important and that their responses are critical to the success of the research project. Otherwise carelessness may result. 

6.          Attempt to make your items nonreactive. For instance, asking people to indicate whether they agree with the item “I dislike all Japanese people” is unlikely to produce honest answers, whereas a statement such as “The Japanese are using their economic power to hurt the United States” may elicit a more honest answer because the item is more indirect. Of course, the latter item may not assess exactly what you are hoping to measure, but in some cases tradeoffs may be required. In some cases you may wish to embed items that measure something entirely irrelevant (they are called distracter items) in your scale to disguise what you are really assessing. 

7.       Be certain to consider face and content validity by choosing items that seem “reasonable” and that represent a broad range of questions concerning the topic of interest. If the scale is not content valid, you may be evaluating only a small piece of the total picture you are interested in. 

8.       When possible, use existing measures, rather than creating your own, because the reliability and validity of these measures will already be established.

Construct Validity




Although reliability indicates the extent to which a measure is free from random error, it does not indicate what the measure actually measures. For instance, if we were to measure the speed with which a group of research participants could tie their shoes, we might find that this is a very reliable measure in the sense that it shows a substantial test-retest correlation. However, if the researcher then claimed that this reliable measure was assessing the conceptual variable of intelligence, you would probably not agree.
       Therefore, in addition to being reliable, useful measured variables must also be construct valid. Construct validity refers to the extent to which a measured variable actually measures the conceptual variable (that is, the construct) that it is designed to assess. A measure only has construct validity if it measures what we want it to. There are a number of ways to assess construct validity; these are summarized in Table 5.2.

Face Validity

In some cases we can obtain an initial indication of the likely construct validity
of a measured variable by examining it subjectively. Face validity refers to the extent to which the measured variable appears to be an adequate measure of the conceptual variable. For example, the Rosenberg self-esteem scale in Table 4.2 has face validity because the items (“I feel that I have a number of good qualities;” “I am able to do things as well as other people”) appear to assess what we intuitively mean when we speak of self-esteem. However, if I carefully timed how long it took you and ten other people to tie your shoelaces, and then told you that you had above-average self-esteem because you tied your laces faster than the average of the others did, it would be clear that, although my test might be highly reliable, it did not really measure self-esteem. In this case, the measure is said to lack face validity.
           Even though in some cases face validity can be a useful measure of whether a test actually assesses what it is supposed to, face validity is not always necessary or even desirable in a test. For instance, consider how White college students might answer the following measures of racial prejudice: 

I do not like African Americans:

Strongly disagree 1 2 3 4 5 6 7 Strongly agree

African Americans are inferior to Whites:

Strongly agree 1 2 3 4 5 6 7 Strongly disagree

These items have high face validity (they appear to measure racial prejudice), but they are unlikely to be valid measures because people are unlikely to answer them honestly. Even those who are actually racists might not indicate agreement with these items (particularly if they thought the experimenter could check up on them) because they realize that it is not socially appropriate to do so. In cases where the test is likely to produce reactivity, it can sometimes be the case that tests with low face validity may actually be more valid because the respondents will not know what is being measured and thus will be more likely to answer honestly. In short, not all measures that appear face valid are actually found to have construct validity. 

Content Validity

       One type of validity that is particularly appropriate to ability tests is known as content validity. Content validity concerns the degree to which the measured variable appears to have adequately sampled from the potential domain of questions that might relate to the conceptual variable of interest. For instance, an intelligence test that contained only geometry questions would lack content validity because there are other types of questions that measure intelligence (those concerning verbal skills and knowledge about current affairs, for instance) that were not included. However, this test might nevertheless have content validity as a geometry test because it sampled from many different types of geometry problems. 

Convergent and Discriminant Validity 

         Although face and content validity can and should be used in the initial stages of test development, they are relatively subjective, and thus limited, methods for evaluating the construct validity of measured variables. Ultimately, the determination of the validity of a measure must be made not on the basis of subjective judgments, but on the basis of relevant data. The basic logic of empirically testing the construct validity of a measure is based on the idea that there are multiple operationalizations of the variable: 
          If a given measured variable “x” is really measuring conceptual variable “X,” then it should correlate  
          with other measured variables designed to assess “X,” and it should not correlate with other measured 
          variables designed to assess other conceptually unrelated variables. 

          According to this logic, construct validity has two separate components. Convergent validity refers to the extent to which a measured variable is found to be related to other measured variables designed to measure the same conceptual variable. Discriminant validity refers to the extent to which a measured variable is found to be unrelated to other measured variables designed to assess different conceptual variables. 

Assessment of Construct Validity. Let’s take an example of how convergent and discriminant validity were used to demonstrate the construct validity of a new personality variable known as self-monitoring. Self-monitoring refers to the tendency to pay attention to the events that are occurring around you and to adjust your behavior to “fit in” with the specific situation you are in. High self-monitors are those who habitually make these adjustments, whereas low self-monitors tend to behave the same way in all situations, essentially ignoring the demands of the social setting.
      Social psychologist Mark Snyder (1974) began his development of a self-monitoring scale by constructing forty-one items that he thought would tap into the conceptual variable self-monitoring. These included items designed to directly assess self-monitoring:

         “I guess I put on a show to impress or entertain people.”
          “I would probably make a good actor.”
and items that were to be reverse-scored:
           “I rarely need the advice of my friends to choose movies, books, or
music.”
          “I have trouble changing my behavior to suit different people and
different situations.”

On the basis of the responses of an initial group of college students, Snyder deleted the sixteen items that had the lowest item-to-total correlations. He was left with a twenty-five-item self-monitoring scale that had a test-retest reliability of .83.
          Once he had demonstrated that his scale was reliable, Snyder began to assess its construct validity. First, he demonstrated discriminant validity by showing that the scale did not correlate highly with other existing personality scales that might have been measuring similar conceptual variables. For instance, the self-monitoring scale did not correlate highly with a measure of extraversion (r = +.19), with a measure of responding in a socially acceptable manner (r = −.19), or with an existing measure of achievement anxiety (r = +.14).
           Satisfied that the self-monitoring scale was not the same as existing scales, and thus showed discriminant validity, Snyder then began to assess the test’s convergent validity. Snyder found, for instance, that high self-monitors were more able to accurately communicate an emotional expression when asked to do so (r = .60). And he found that professional actors (who should be very sensitive to social cues) scored higher on the scale and that hospitalized psychiatric patients (who are likely to be unaware of social cues) scored lower on the scale, both in comparison to college students. Taken together, Snyder concluded that the self-monitoring scale was reliable and also possessed both convergent and discriminant validity.
            One of the important aspects of Snyder’s findings is that the convergent validity correlations were not all r = +1.00 and the discriminant validity correlations were not all r = 0.00. Convergent validity and discriminant validity are never all-or-nothing constructs, and thus it is never possible to definitively “prove” the construct validity of a measured variable. In reality, even measured variables that are designed to measure different conceptual variables will often be at least moderately correlated with each other. For instance, self-monitoring relates, at least to some extent, to extraversion because they are related constructs. Yet the fact that the correlation coefficient is relatively low (r = .19) indicates that self-monitoring and extraversion are not identical. Similarly, even measures that assess the same conceptual variable will not, because of random error, be perfectly correlated with each other.

The Nomological Net. Although convergent validity and discriminant validity are frequently assessed through correlation of the scores on one self-report measure (for instance, one Likert scale of anxiety) with scores on another self-report measure (a different anxiety scale), construct validity can also be evaluated using other types of measured variables. For example, when testing a self-report measure of anxiety, a researcher might compare the scores to ratings of anxiety made by trained psychotherapists or to physiological variables such as blood pressure or skin conductance. The relationships among the many different measured variables, both self-report and otherwise, form a complicated pattern, called a nomological net. Only when we look across many studies, using many different measures of the various conceptual variables and relating those measures to other variables, does a complete picture of the construct validity of the measure begin to emerge—the greater the number of predicted relationships tested and confirmed, the greater the support for the construct validity of the measure.

Criterion Validity 

           You will have noticed that when Snyder investigated the construct validity of his self-monitoring scale, he assessed its relationship not only to other self-report measures, but also to behavioral measures such as the individual’s current occupation (for instance, whether he or she was an actor). There are some particular advantages to testing validity through correlation of a scale with behavioral measures rather than with other self-report measures. For one thing, as we have discussed in Chapter 4, behavioral measures may be less subject to reactivity than are self-report measures. When validity is assessed through correlation of a self-report measure with a behavioral measured variable, the behavioral variable is called a criterion variable, and the correlation is an assessment of the self-report measure’s criterion validity.
          Criterion validity is known as predictive validity when it involves attempts to foretell the future. This would occur, for instance, when an industrial psychologist uses a measure of job aptitude to predict how well a prospective employee will perform on a job or when an educational psychologist predicts school performance from SAT or GRE scores. Criterion validity is known as concurrent validity when it involves assessment of the relationship between a self-report and a behavioral measure that are assessed at the same time. In some cases, criterion validity may even involve use of the self-report measure to predict behaviors that have occurred prior to completion of the scale.
           Although the practice of correlating a self-report measure with a behavioral criterion variable can be used to learn about the construct validity of the measured variables, in some applied research settings it is only the ability of the test to predict a specific behavior that is of interest. For instance, an employer who wants to predict whether a person will be an effective manager will be happy to use any self-report measure that is effective in doing so and may not care about what conceptual variable the test measures (for example, does it measure intelligence, social skills, diligence, all three, or something else entirely?). In this case criterion validity involves only the correlation between the variables rather than the use of the variables to make inferences about construct validity.

Reliability



The reliability of a measure refers to the extent to which it is free from random error. One direct way to determine the reliability of a measured variable is to measure it more than once. For instance, you can test the reliability of a bathroom scale by weighing yourself on it twice in a row. If the scale gives the same weight both times (we’ll assume your actual weight hasn’t changed in between), you would say that it is reliable. But if the scale gives different weights each time, you would say that it is unreliable. Just as a bathroom scale is not useful if it is not consistent over time, an unreliable measured variable will not be useful in research.
          The next section reviews the different approaches to assessing a measure’s reliability; these are summarized in Table 5.1. 

Test-Retest Reliability

 Test-retest reliability refers to the extent to which scores on the same measured variable correlate with each other on two different measurements given at two different times. If the test is perfectly reliable, and if the scores on the conceptual variable do not change over the time period, the individuals should receive the exact same score each time, and the correlation between the scores will be r = 1.00. However, if the measured variable contains random error, the two scores will not be as highly correlated. Higher positive correlations between the scores at the two times indicate higher test-retest reliability.
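In practice, test-retest reliability is just the correlation between the two administrations of the measure. A minimal sketch with hypothetical scores (assuming Python 3.10+ for statistics.correlation):

```python
from statistics import correlation  # Pearson r; requires Python 3.10+

# Hypothetical scores for ten people who completed the same scale twice.
time1 = [12, 18, 25, 30, 22, 15, 28, 20, 17, 24]
time2 = [13, 17, 26, 29, 23, 14, 27, 22, 16, 25]

print(f"Test-retest reliability: r = {correlation(time1, time2):.2f}")
```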
          Although the test-retest procedure is a direct way to measure reliability, it does have some limits. For one thing, when the procedure is used to assess the reliability of a self-report measure, it can produce reactivity. As you will recall from Chapter 4, reactivity refers to the influence of measurement on the variables being measured. In this case, reactivity is a potential problem because when the same or similar measures are given twice, responses on the second administration may be influenced by the measure having been taken the first time. These problems are known as retesting effects.
          Retesting problems may occur, for instance, if people remember how they answered the questions the first time. Some people may believe that the experimenter wants them to express different opinions on the second occasion (or else why are the questions being given twice?). This would obviously reduce the test-retest correlation and thus give an overly low reliability assessment. Or respondents may try to duplicate their previous answers exactly to avoid appearing inconsistent, which would unnaturally increase the reliability estimate. Participants may also get bored answering the same questions twice. Although some of these problems can be avoided through the use of a long testing interval (say, over one month) and through the use of appropriate instructions (for instance, instructions to be honest and to answer exactly how one is feeling right now), retesting poses a general problem for the computation of test-retest reliability.
       To help avoid some of these problems, researchers sometimes employ a more sophisticated type of test-retest reliability known as equivalent-forms reliability. In this approach two different but equivalent versions of the same measure are given at different times, and the correlation between the scores on the two versions is assessed. Such an approach is particularly useful when there are correct answers to the test that individuals might learn by taking the first test or be able to find out during the time period between the tests. Because students might remember the questions and learn the answers to aptitude tests such as the Graduate Record Exam (GRE) or the Scholastic Aptitude Test (SAT), these tests employ equivalent forms.

Reliability as Internal Consistency 

In addition to the problems that can occur when people complete the same measure more than once, another problem with test-retest reliability is that some conceptual variables are not expected to be stable over time within an individual. Clearly, if optimism has a meaning as a conceptual variable, then people who are optimists on Tuesday should also be optimists on Friday of next week. Conceptual variables such as intelligence, friendliness, assertiveness, and optimism are known as traits, which are personality variables that are not expected to vary (or at most to vary only slowly) within people over time.
          Other conceptual variables, such as level of stress, moods, or even preference for classical over rock music, are known as states. States are personality variables that are expected to change within the same person over short periods of time. Because a person’s score on a mood measure administered on Tuesday is not necessarily expected to be related to the same measure administered next Friday, the test-retest approach will not provide an adequate  assessment of the reliability of a state variable such as mood. Because of the problems associated with test-retest and equivalent-forms reliability, another measure of reliability, known as internal consistency, has become the most popular and most accurate way of assessing reliability for both trait and state measures. Internal consistency is assessed using the scores on a single administration of the measure. 
          You will recall from our discussion in Chapter 4 that most self-report measures contain a number of items. If you think about measurement in terms of reliability, the reason for this practice will become clear. You can imagine that a measure that had only one item might be unreliable because that specific item might have a lot of random error. For instance, respondents might not understand the question the way you expected them to, or they might read it incorrectly. In short, any single item is not likely to be very reliable. 

True Score and Random Error. One of the basic principles of reliability is that the more measured variables are combined together, the more reliable the test will be. This is so because, although each measured variable will be influenced in part by random error, some part of each item will also measure the true score, or the part of the scale score that is not random error, of the individual on the measure. Furthermore, because random error is self-canceling, the random error components of each measured variable will not be correlated with each other, whereas the parts of the measured variables that represent the true score will be correlated. As a result, when they are combined together by summing or averaging, the use of many measured variables will produce a more reliable estimate of the conceptual variable than will any of the individual measured variables themselves.
        The role of true score and random error can be expressed in the form of two equations that are the basis of reliability. First, an individual’s score on a measure will consist of both true score and random error: 

Actual score = True score + Random error

and reliability is the proportion of the actual score that reflects true score (and not random error).

Reliability = True score / Actual score

       To take a more specific example, consider for a moment the Rosenberg self-esteem scale that we examined in Table 4.2. This scale has ten items, each designed to assess the conceptual variable of self-esteem in a slightly different way. Although each of the items will have random error, each should also measure the true score of the individual. Thus if we average all ten of the items together to form a single measure, this overall scale score will be a more reliable measure than will any one of the individual questions.
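A small simulation makes this point concrete: each simulated item below is a person's true score plus independent random error, and the average of ten such items tracks the true score far more closely than any single item does. The data are generated for illustration (Python 3.10+ assumed for statistics.correlation).

```python
import random
from statistics import correlation  # requires Python 3.10+

random.seed(1)
n_people, n_items = 200, 10

# Each person has a true score; each item adds independent random error.
true_scores = [random.gauss(0, 1) for _ in range(n_people)]
items = [[t + random.gauss(0, 1) for t in true_scores] for _ in range(n_items)]

# Scale score = the average of the ten items for each person.
scale = [sum(items[i][p] for i in range(n_items)) / n_items
         for p in range(n_people)]

print(f"Single item vs. true score:      r = {correlation(items[0], true_scores):.2f}")
print(f"Ten-item average vs. true score: r = {correlation(scale, true_scores):.2f}")
```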
           Internal consistency refers to the extent to which the scores on the items correlate with each other and thus are all measuring the true score rather than random error. In terms of the Rosenberg scale, a person who answers above average on question 1, indicating she or he has high self-esteem, should also respond above the average on all of the other questions. Of course, this pattern will not be perfect because each item has some error. However, to the extent that all of the items are measuring true score, rather than random error, the average correlation among the items will approach r = 1.00. To the extent that the correlation among the items is less than r = 1.00, it tells us either that there is random error or that the items are not measuring the same thing.

Coefficient Alpha. One way to calculate the internal consistency of a scale is to correlate a person’s score on one half of the items (for instance, the even-numbered items) with her or his score on the other half of the items (the odd-numbered items). This procedure is known as split-half reliability. If the scale is reliable, then the correlation between the two halves will approach r = 1.00, indicating that both halves measure the same thing. However, because split-half reliability uses only some of the available correlations among the items, it is preferable to have a measure that indexes the average correlation among all of the items on the scale. The most common, and the best, index of internal consistency is known as Cronbach’s coefficient alpha, symbolized as α. This measure is an estimate of the average correlation among all of the items on the scale and is numerically equivalent to the average of all possible split-half reliabilities.
           Coefficient alpha, because it reflects the underlying correlational structure of the scale, ranges from α = 0.00 (indicating that the measure is entirely error) to α = +1.00 (indicating that the measure has no error). In most cases, statistical computer programs are used to calculate coefficient alpha, but alpha can also be computed by hand according to the formula presented in Appendix D.
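The formula in Appendix D is not reproduced in this post, so the sketch below uses the standard variance-based form of coefficient alpha, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), applied to hypothetical responses.

```python
from statistics import pvariance  # population variance (divides by N)

def cronbach_alpha(item_scores):
    """Coefficient alpha from the standard variance-based formula.
    item_scores: one list per item, each holding every respondent's score."""
    k = len(item_scores)
    n = len(item_scores[0])
    totals = [sum(item[p] for item in item_scores) for p in range(n)]
    item_variance_sum = sum(pvariance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - item_variance_sum / pvariance(totals))

# Hypothetical responses of six people to a four-item scale.
items = [[4, 5, 3, 4, 2, 5],
         [4, 4, 3, 5, 2, 4],
         [5, 5, 2, 4, 1, 5],
         [3, 4, 3, 4, 2, 4]]

print(f"alpha = {cronbach_alpha(items):.2f}")
```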

Item-to-Total Correlations. When a new scale is being developed, its initial reliability may be low. This is because, although the researcher has selected those items that he or she believes will be reliable, some items will turn out to contain random error for reasons that could not be predicted in advance. Thus, one strategy commonly used in the initial development of a scale is to calculate the correlations between the score on each of the individual items and the total scale score excluding the item itself (these correlations are known as the item-to-total correlations). The items that do not correlate highly with the total score can then be deleted from the scale. Because this procedure deletes the items that do not measure the same thing that the scale as a whole does, the result is a shorter scale, but one with higher reliability. However, the approach of throwing out the items that do not correlate highly with the total is used only in the scale development process. Once the final version of the scale is in place, this version should be given again to another sample of participants, and the reliability computed without dropping any items.
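The sketch below illustrates the item-to-total computation for a hypothetical four-item scale in which the fourth item behaves badly: each item is correlated with the total of the remaining items, and a low (here strongly negative) correlation marks that item as a candidate for deletion (Python 3.10+ assumed).

```python
from statistics import correlation  # requires Python 3.10+

# Hypothetical responses of six people to a four-item scale.
items = [[4, 5, 3, 4, 2, 5],
         [4, 4, 3, 5, 2, 4],
         [5, 5, 2, 4, 1, 5],
         [2, 3, 4, 2, 5, 1]]   # an item that does not fit the scale

n_respondents = len(items[0])
for i, item in enumerate(items):
    # Total score excluding the item itself, computed for each respondent.
    rest = [sum(other[p] for j, other in enumerate(items) if j != i)
            for p in range(n_respondents)]
    print(f"Item {i + 1}: item-to-total r = {correlation(item, rest):.2f}")
```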

Interrater Reliability

          To this point we have discussed reliability primarily in terms of self-report scales. However, reliability is just as important for behavioral measures. It is common practice for a number of judges to rate the same observed behaviors and then to combine their ratings to create a single measured variable. This computation requires the internal consistency approach—just as any single item on a scale is expected to have error, so the ratings of any one judge are more likely to contain error than is the averaged rating across a group of judges. The errors of judges can be caused by many things, including inattention to some of the behaviors, misunderstanding of instructions, or even personal preferences. When the internal consistency of a group of judges is calculated, the resulting reliability is known as interrater reliability.
           If the ratings of the judges that are being combined are quantitative variables (for instance, if the coders have each determined the aggressiveness of a group of children on a scale from 1 to 10), then coefficient alpha can be used to evaluate reliability. However, in some cases the variables of interest may be nominal. This would occur, for instance, if the judges have indicated for each child whether he or she was playing “alone,” “cooperatively,” “competitively,” or “aggressively.” In such cases, a statistic known as kappa (κ) is used as the measure of agreement among the judges. Like coefficient alpha, kappa ranges from κ = 0 (indicating that the judges’ ratings are entirely random error) to κ = +1.00 (indicating that the ratings have no error). The formula for computing kappa is presented in Appendix C.
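For the common two-judge case, kappa can be computed as Cohen's kappa: the observed proportion of agreement corrected for the agreement expected by chance from each judge's category frequencies. The sketch below uses hypothetical play-style codes for twelve children; extensions to three or more judges (such as Fleiss' kappa) are not shown.

```python
from collections import Counter

def cohen_kappa(ratings1, ratings2):
    """Cohen's kappa for two judges assigning nominal categories."""
    n = len(ratings1)
    p_observed = sum(a == b for a, b in zip(ratings1, ratings2)) / n
    c1, c2 = Counter(ratings1), Counter(ratings2)
    categories = set(c1) | set(c2)
    p_expected = sum((c1[c] / n) * (c2[c] / n) for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical play-style codes assigned to twelve children by two judges.
judge1 = ["alone", "cooperative", "aggressive", "alone", "competitive",
          "cooperative", "alone", "aggressive", "cooperative", "alone",
          "competitive", "cooperative"]
judge2 = ["alone", "cooperative", "aggressive", "alone", "cooperative",
          "cooperative", "alone", "aggressive", "cooperative", "alone",
          "competitive", "competitive"]

print(f"kappa = {cohen_kappa(judge1, judge2):.2f}")
```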



Friday, 28 June 2013

Random and Systematic Error



The basic difficulty in determining the effectiveness of a measured variable is that the measure will in all likelihood be influenced by other factors besides the conceptual variable of interest. For one thing, the measured variable will certainly contain some chance fluctuations in measurement, known as random error. Sources of random error include misreading or misunderstanding of the questions, and measurement of the individuals on different days or in different places. Random error can also occur if the experimenter misprints the questions or misrecords the answers or if the individual marks the answers incorrectly.
           Although random error influences scores on the measured variable, it does so in a way that is self-canceling. That is, although the experimenter may make some recording errors or the individuals may mark their answers incorrectly, these errors will increase the scores of some people and decrease the scores of other people. The increases and decreases will balance each other and thus cancel each other out.
            In contrast to random error, which is self-canceling, the measured variable may also be influenced by other conceptual variables that are not part of the conceptual variable of interest. These other potential influences constitute systematic error because, whereas random errors tend to cancel out over time, these variables systematically increase or decrease the scores on the measured variable. For instance, individuals with higher self-esteem may score systematically lower on the anxiety measure than those with low self-esteem, and more optimistic individuals may score consistently higher. Also, as we have discussed in Chapter 4, the tendency to self-promote may lead some respondents to answer the items in ways that make them appear less anxious than they really are in order to please the experimenter or to feel better about themselves. In these cases, the measured variable will assess self-esteem, optimism, or the tendency to self-promote in addition to the conceptual variable of interest (anxiety).
           Figure 5.1 summarizes the impact of random and systematic error on a measured variable. Although there is no foolproof way to determine whether measured variables are free from random and systematic error, there are techniques that allow us to get an idea about how well our measured variables “capture” the conceptual variables they are designed to assess rather than being influenced by random and systematic error. As we will see, this is accomplished through examination of the correlations among a set of measured variables.

Reliability and Validity




We have seen in Chapter 4 that there are a wide variety of self-report and behavioral measured variables that scientists can use to assess conceptual variables. And we have seen that because changes in conceptual variables are assumed to cause changes in measured variables, the measured variables are used to make inferences about the conceptual variables. But how do we know whether the measures that we have chosen actually assess the conceptual variables they are designed to measure? This chapter discusses techniques for evaluating the relationship between measured and conceptual variables. 
          In some cases, demonstrating the adequacy of a measure is rather straightforward because there is a clear way to check whether it is measuring what it is supposed to. For instance, when a physiological psychologist investigates  perceptions of the brightness or color of a light source, she or he can compare the participants’ judgments with objective measurements of light intensity and wavelength. Similarly, when we ask people to indicate their sex or their current college grade-point average, we can check up on whether their reports are correct. 
            In many cases within behavioral science, however, assessing the effectiveness of a measured variable is more difficult. For instance, a researcher who has created a new Likert scale designed to measure “anxiety” assumes that an individual’s score on this scale will reflect, at least to some extent, his or her actual level of anxiety. But because the researcher does not know how to measure anxiety in any better way, there is no obvious way to “check” the responses of the individual against any type of factual standard.