(Reliability and Validity) Construct Validity
Although reliability indicates the extent to which a measure is free from random error, it does not indicate what the measure actually measures. For instance, if we were to measure the speed with which a group of research participants could tie their shoes, we might find that this is a very reliable measure in the sense that it shows a substantial test-retest correlation. However, if the researcher then claimed that this reliable measure was assessing the conceptual variable of intelligence, you would probably not agree.
Therefore, in addition to being reliable, useful measured variables must also be construct valid. Construct validity refers to the extent to which a measured variable actually measures the conceptual variable (that is, the construct) that it is designed to assess. A measure only has construct validity if it measures what we want it to. There are a number of ways to assess construct validity; these are summarized in Table 5.2.
Face Validity
In some cases we can obtain an initial indication of the likely construct validity of a measured variable by examining it subjectively. Face validity refers to the extent to which the measured variable appears to be an adequate measure of the conceptual variable. For example, the Rosenberg self-esteem scale in Table 4.2 has face validity because the items (“I feel that I have a number of good qualities;” “I am able to do things as well as other people”) appear to assess what we intuitively mean when we speak of self-esteem. However, if I carefully timed how long it took you and ten other people to tie your shoelaces, and then told you that you had above-average self-esteem because you tied your laces faster than the average of the others did, it would be clear that, although my test might be highly reliable, it did not really measure self-esteem. In this case, the measure is said to lack face validity.
Even though in some cases face validity can be a useful measure of whether a test actually assesses what it is supposed to, face validity is not always necessary or even desirable in a test. For instance, consider how White college students might answer the following measures of racial prejudice:
I do not like African Americans:
Strongly disagree 1 2 3 4 5 6 7 Strongly agree
African Americans are inferior to Whites:
Strongly agree 1 2 3 4 5 6 7 Strongly disagree
These items have high face validity (they appear to measure racial prejudice), but they are unlikely to be valid measures because people are unlikely to answer them honestly. Even respondents who are actually racist might not indicate agreement with these items (particularly if they thought the experimenter could check up on them) because they realize that it is not socially appropriate to do so. In cases where the test is likely to produce reactivity, tests with low face validity may actually be more valid because the respondents will not know what is being measured and thus will be more likely to answer honestly. In short, not all measures that appear face valid are actually found to have construct validity.
Content Validity
One type of validity that is particularly appropriate to ability tests is known as content validity. Content validity concerns the degree to which the measured variable appears to have adequately sampled from the potential domain of questions that might relate to the conceptual variable of interest. For instance, an intelligence test that contained only geometry questions would lack content validity because there are other types of questions that measure intelligence (those concerning verbal skills and knowledge about current affairs, for instance) that were not included. However, this test might nevertheless have content validity as a geometry test because it sampled from many different types of geometry problems.
Convergent and Discriminant Validity
Although face and content validity can and should be used in the initial stages of test development, they are relatively subjective, and thus limited, methods for evaluating the construct validity of measured variables. Ultimately, the determination of the validity of a measure must be made not on the basis of subjective judgments, but on the basis of relevant data. The basic logic of empirically testing the construct validity of a measure is based on the idea that there are multiple operationalizations of the variable:
If a given measured variable “x” is really measuring conceptual variable “X,” then it should correlate
with other measured variables designed to assess “X,” and it should not correlate with other measured
variables designed to assess other conceptually unrelated variables.
According to this logic, construct validity has two separate components. Convergent validity refers to the extent to which a measured variable is found to be related to other measured variables designed to measure the same conceptual variable. Discriminant validity refers to the extent to which a measured variable is found to be unrelated to other measured variables designed to assess different conceptual variables.
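The logic of the two components can be sketched in a few lines of Python. This is only an illustration: the scores below are invented, and the names `pearson_r`, `new_scale`, `established_scale`, and `unrelated_scale` are hypothetical. A new measure shows convergent validity to the extent that it correlates with an established measure of the same conceptual variable, and discriminant validity to the extent that it does not correlate with a measure of an unrelated one.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented scores for eight participants (illustration only, not real data).
new_scale = [12, 18, 9, 22, 15, 7, 20, 11]          # the measure being validated
established_scale = [14, 19, 10, 24, 13, 8, 21, 12]  # same conceptual variable
unrelated_scale = [9, 8, 10, 9, 7, 8, 10, 9]         # unrelated conceptual variable

convergent_r = pearson_r(new_scale, established_scale)
discriminant_r = pearson_r(new_scale, unrelated_scale)

print(f"convergent r   = {convergent_r:.2f}")    # expected to be high
print(f"discriminant r = {discriminant_r:.2f}")  # expected to be near zero
```

With these made-up numbers the first correlation is strong and the second is close to zero, which is the pattern a construct-valid measure should show.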
Assessment of Construct Validity. Let’s consider an example of how convergent and discriminant validity were used to demonstrate the construct validity of a new personality variable known as self-monitoring. Self-monitoring refers to the tendency to pay attention to the events that are occurring around you and to adjust your behavior to “fit in” with the specific situation you are in. High self-monitors are those who habitually make these adjustments, whereas low self-monitors tend to behave the same way in all situations, essentially ignoring the demands of the social setting.
Social psychologist Mark Snyder (1974) began his development of a self-monitoring scale by constructing forty-one items that he thought would tap into the conceptual variable self-monitoring. These included items designed to directly assess self-monitoring:
“I guess I put on a show to impress or entertain people.”
“I would probably make a good actor.”
and items that were to be reverse-scored:
“I rarely need the advice of my friends to choose movies, books, or music.”
“I have trouble changing my behavior to suit different people and different situations.”
On the basis of the responses of an initial group of college students, Snyder deleted the sixteen items that had the lowest item-to-total correlations. He was left with a twenty-five-item self-monitoring scale that had a test-retest reliability of .83.
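The item-selection step can be sketched as follows. This is a minimal illustration under stated assumptions, not Snyder's actual procedure or data: the responses are invented, the function names `item_total_correlations` and `keep_best_items` are hypothetical, and each item is correlated with a total that still includes that item (an uncorrected item-to-total correlation). The sketch also assumes every item and the total show some variation across respondents.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def item_total_correlations(responses):
    """Correlate each item with the total score across all items.

    responses: one row per respondent, one column per item.
    """
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [
        pearson_r([row[i] for row in responses], totals)
        for i in range(n_items)
    ]

def keep_best_items(responses, n_keep):
    """Return the indices of the n_keep items with the highest correlations."""
    correlations = item_total_correlations(responses)
    ranked = sorted(range(len(correlations)),
                    key=lambda i: correlations[i], reverse=True)
    return sorted(ranked[:n_keep])

# Invented responses: six respondents, four items (illustration only).
# The last item is deliberately unrelated to the other three.
responses = [
    [5, 4, 5, 2],
    [4, 5, 4, 4],
    [2, 1, 2, 3],
    [1, 2, 1, 5],
    [3, 3, 4, 1],
    [5, 5, 4, 3],
]

print([round(r, 2) for r in item_total_correlations(responses)])
print(keep_best_items(responses, 3))  # indices of the items retained
```

In this toy data set the fourth item correlates weakly (in fact slightly negatively) with the total, so it is the one dropped, just as Snyder dropped the items that hung together least well with the rest of his scale.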
Once he had demonstrated that his scale was reliable, Snyder began to assess its construct validity. First, he demonstrated discriminant validity by showing that the scale did not correlate highly with other existing personality scales that might have been measuring similar conceptual variables. For instance, the self-monitoring scale did not correlate highly with a measure of extraversion (r = +.19), with a measure of responding in a socially acceptable manner (r = −.19), or with an existing measure of achievement anxiety (r = +.14).
Satisfied that the self-monitoring scale was not the same as existing scales, and thus showed discriminant validity, Snyder then began to assess the test’s convergent validity. Snyder found, for instance, that high self-monitors were better able to communicate an emotional expression accurately when asked to do so (r = .60). He also found that professional actors (who should be very sensitive to social cues) scored higher on the scale and that hospitalized psychiatric patients (who are likely to be unaware of social cues) scored lower on the scale, both in comparison to college students. Taking these findings together, Snyder concluded that the self-monitoring scale was reliable and also possessed both convergent and discriminant validity.
One of the important aspects of Snyder’s findings is that the convergent validity correlations were not all r = +1.00 and the discriminant validity correlations were not all r = 0.00. Convergent validity and discriminant validity are never all-or-nothing constructs, and thus it is never possible to definitively “prove” the construct validity of a measured variable. In reality, even measured variables that are designed to measure different conceptual variables will often be at least moderately correlated with each other. For instance, self-monitoring relates, at least to some extent, to extraversion because they are related constructs. Yet the fact that the correlation coefficient is relatively low (r = .19) indicates that self-monitoring and extraversion are not identical. Similarly, even measures that assess the same conceptual variable will not, because of random error, be perfectly correlated with each other.
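The last point can be made concrete with a small calculation. The sketch below relies on the attenuation relation from classical test theory, which is an assumption brought in for illustration and is not stated in this chapter: the correlation we expect to observe between two measures equals the correlation between the underlying true scores multiplied by the square root of the product of the two reliabilities. The function name `expected_observed_r` is hypothetical, and giving the second measure the same reliability of .83 as Snyder's scale is an invented scenario.

```python
from math import sqrt

def expected_observed_r(true_r, reliability_x, reliability_y):
    """Expected observed correlation under the classical attenuation relation.

    true_r: correlation between the error-free (true) scores.
    reliability_x, reliability_y: reliabilities of the two measures (0 to 1).
    """
    return true_r * sqrt(reliability_x * reliability_y)

# Two hypothetical measures of exactly the same conceptual variable
# (true_r = 1.0), each with a reliability of .83.
ceiling = expected_observed_r(1.0, 0.83, 0.83)
print(round(ceiling, 2))  # 0.83
```

Under these assumptions, even two measures of the identical conceptual variable would be expected to correlate at only about .83, because each contains random error.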
The Nomological Net. Although convergent validity and discriminant validity are frequently assessed through correlation of the scores on one self-report measure (for instance, one Likert scale of anxiety) with scores on another self-report measure (a different anxiety scale), construct validity can also be evaluated using other types of measured variables. For example, when testing a self-report measure of anxiety, a researcher might compare the scores to ratings of anxiety made by trained psychotherapists or to physiological variables such as blood pressure or skin conductance. The relationships among the many different measured variables, both self-report and otherwise, form a complicated pattern, called a nomological net. Only when we look across many studies, using many different measures of the various conceptual variables and relating those measures to other variables, does a complete picture of the construct validity of the measure begin to emerge: the greater the number of predicted relationships tested and confirmed, the greater the support for the construct validity of the measure.
Criterion Validity
You will have noticed that when Snyder investigated the construct validity of his self-monitoring scale, he assessed its relationship not only to other self-report measures, but also to behavioral measures such as the individual’s current occupation (for instance, whether he or she was an actor). There are some particular advantages to testing validity through correlation of a scale with behavioral measures rather than with other self-report measures. For one thing, as we have discussed in Chapter 4, behavioral measures may be less subject to reactivity than are self-report measures. When validity is assessed through correlation of a self-report measure with a behavioral measured variable, the behavioral variable is called a criterion variable, and the correlation is an assessment of the self-report measure’s criterion validity.
Criterion validity is known as predictive validity when it involves attempts to foretell the future. This would occur, for instance, when an industrial psychologist uses a measure of job aptitude to predict how well a prospective employee will perform on a job or when an educational psychologist predicts school performance from SAT or GRE scores. Criterion validity is known as concurrent validity when it involves assessment of the relationship between a self-report and a behavioral measure that are assessed at the same time. In some cases, criterion validity may even involve use of the self-report measure to predict behaviors that have occurred prior to completion of the scale.
Although the practice of correlating a self-report measure with a behavioral criterion variable can be used to learn about the construct validity of the measured variables, in some applied research settings it is only the ability of the test to predict a specific behavior that is of interest. For instance, an employer who wants to predict whether a person will be an effective manager will be happy to use any self-report measure that is effective in doing so and may not care about what conceptual variable the test measures (for example, does it measure intelligence, social skills, diligence, all three, or something else entirely?). In this case criterion validity involves only the correlation between the variables rather than the use of the variables to make inferences about construct validity.