Presentation: Reliability
No measurement is perfect. Take something as simple as measuring your height. If you measure with a cloth tape, it might stretch a little. If you measure with a steel tape, the tape might contract a bit if you measure yourself outside on a cold day. Moreover, people tend to be a tiny bit taller when measured in the morning than in the afternoon (due perhaps to decompression of the spine while sleeping). So can measuring height be useful even if it is not perfectly accurate?

Reliability refers to consistency in measurement. If we are measuring a stable trait, like intelligence, we would like each administration of the assessment to yield the same results. A common analogy for understanding reliability is a dartboard. If you consistently hit the same general spot on the dartboard over and over again, then your throwing is reliable (even if you miss the bull's-eye every time!). However, because tests (and your dart throws) are imperfect, we often do not get exactly the same score each time. Thus, a test has good consistency if the differences (or variance) between administrations are minimized.
As defined classically, reliability is the proportion of the total variance in scores that can be attributed to "true" variance. In other words, a test is reliable if it consistently measures a trait or ability and is not influenced much by random error. Reliability is not an all-or-nothing concept, but rather a matter of degree.
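The classical definition can be written out as a short worked equation. This is a sketch in standard classical test theory notation, which the lesson itself does not use: X is the observed score, T the true score, and E the random error, with the added assumption that error is uncorrelated with the true score. The numbers in the last line are made up for illustration.

```latex
\begin{align*}
X &= T + E \\
\sigma_X^2 &= \sigma_T^2 + \sigma_E^2 \\
\text{reliability} &= \frac{\sigma_T^2}{\sigma_X^2}
                    = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2} \\
\text{e.g. } \sigma_T^2 = 80,\ \sigma_E^2 = 20
  &\;\Rightarrow\; \text{reliability} = \frac{80}{80 + 20} = 0.80
\end{align*}
```

Read this way, a reliability of 0.80 means that 80% of the variance in observed scores reflects true differences among people and 20% reflects random error.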
Test-retest reliability

Several types of reliability are characteristic of a fair test. Test-retest reliability examines the consistency of individual scores across different administrations of a test. For example, Winnie takes a strength test on Monday and then takes the same strength test on Friday. Assuming Winnie hasn't been hitting the weight room to improve her strength, a measure with test-retest reliability should produce similar results across the two administrations. Test-retest reliability only makes sense when we are measuring traits or abilities that are assumed to be relatively stable over time. It would not make sense, for example, if we gave a sixth-grade class a math test at the beginning of the school year and then again at the end of the school year. In this case, we would expect that the students would have learned new math skills, thus improving their scores.
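One common way to put a number on test-retest consistency is to correlate the scores from the two administrations. The sketch below assumes a Pearson correlation is used; the helper `pearson_r` and the Monday and Friday scores are hypothetical and are not taken from the lesson.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around each mean
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(ss_x * ss_y)


# Hypothetical strength scores for five people on two days (made-up data)
monday = [50, 62, 45, 70, 58]
friday = [52, 60, 47, 71, 57]

test_retest_r = pearson_r(monday, friday)
print(f"test-retest correlation: {test_retest_r:.2f}")
```

A value close to 1.0 suggests that people kept roughly the same rank order and spacing across the two days, which is what we would hope for with a stable trait.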
Parallel-forms reliability

Another way to measure reliability is a method called parallel-forms reliability. Many teachers use alternate forms of a test, which are intended to be equivalent. For example, perhaps a student missed an exam. To prevent any unfair advantage on the test, you decide to give the student a different version of the test. This different form is intended to measure the same concepts but uses different questions to do so. Parallel-forms reliability evaluates whether parallel or alternate forms of a test are indeed equivalent. It is calculated by correlating scores on the two forms. Standardized test developers often have alternate forms of tests to prevent students from sharing information about what questions are on a test. These companies are expected to be able to demonstrate that different forms of the "same" test correlate well with each other.
Internal consistency reliability

Internal consistency reliability indicates the degree of consistency in responses across the many items of a single test.
Split-half reliability

A traditional way of demonstrating internal consistency is to administer one test and then split it evenly into two halves, such as the even-numbered and odd-numbered questions. Then, correlate the scores from the two halves of the test. The stronger the correlation (the closer to 1.0), the better the internal consistency. Generally speaking, a correlation of 0.85 or greater would represent good internal consistency.
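The odd/even split described above can be sketched in a few lines. The function names and the small table of right/wrong (1/0) item scores are hypothetical, chosen only to show the mechanics.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def split_half_r(item_scores):
    """Correlate odd-item totals with even-item totals.

    item_scores holds one list per examinee, with one score per item.
    """
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return pearson_r(odd_totals, even_totals)


# Made-up data: six examinees, six items, 1 = correct, 0 = incorrect
item_scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

half_r = split_half_r(item_scores)
print(f"split-half correlation: {half_r:.2f}")
```

Splitting by odd and even items, rather than first half versus second half, keeps fatigue or running out of time near the end of the test from affecting one half more than the other.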
Coefficient (or Cronbach's) alpha

Split-half reliability is one way to measure the internal consistency of a test. A more common way to report this characteristic is a statistic called coefficient alpha, also known as Cronbach's alpha. Coefficient alpha is an estimate of the average of all possible split-half reliabilities of a test. Although it is computed differently from a split-half correlation, the interpretation is the same. This method is particularly useful when only one administration of a test is possible or practical.
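For readers who want to see how alpha is computed, here is a minimal sketch using the usual textbook formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The formula is not given in the lesson itself, and the function name and data below are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Coefficient alpha for a table with one row per examinee, one column per item."""
    k = len(item_scores[0])  # number of items
    if k < 2:
        raise ValueError("alpha needs at least two items")
    columns = list(zip(*item_scores))  # regroup scores item by item
    item_variance_sum = sum(pvariance(col) for col in columns)
    total_variance = pvariance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("total scores must vary for alpha to be defined")
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Made-up data: six examinees, six items, 1 = correct, 0 = incorrect
item_scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

alpha = cronbach_alpha(item_scores)
print(f"coefficient alpha: {alpha:.2f}")
```

Because alpha uses every item at once, it does not depend on how the test happens to be split, which is one reason it is reported more often than a single split-half correlation.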
Inter-rater reliability

Finally, inter-rater reliability refers to the amount of agreement or consistency between two different raters scoring the same test. Tests that require mostly subjective scoring, such as an essay test, are less likely to have high inter-rater reliability than, say, an objective math test. However, teachers and standardized test developers can reduce this problem on subjective tests by establishing a grading rubric beforehand. To demonstrate that an individual's score represents typical performance, it must be shown that it makes no difference which judge, scorer, or rater scored the task. The level of inter-rater reliability is usually established with correlations between raters' scores for a series of people or with a percentage that indicates how often the raters agreed.
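The percentage approach mentioned above is simple enough to sketch directly. The example assumes that "agreement" means two raters gave exactly the same score to the same essay; the function name and the rubric scores are hypothetical.

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of cases on which two raters gave exactly the same score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty score lists of the same length")
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return 100.0 * matches / len(rater_a)


# Made-up rubric scores (1 to 5) from two raters for the same eight essays
rater_a = [4, 3, 5, 2, 4, 3, 1, 5]
rater_b = [4, 3, 4, 2, 4, 2, 1, 5]

agreement = percent_agreement(rater_a, rater_b)
print(f"raters agreed on {agreement:.0f}% of essays")
```

Exact agreement is a strict standard. A correlation between the two raters' scores, the other option the text mentions, would also give credit when raters differ by a point but rank the essays in the same order.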
Principles of Measurement