Presentation: Reliability
Ms. Chan had an interesting experience. At two different professional development courses, the instructor administered the Myers-Briggs Type Indicator to all the participants. When she compared her scores, they were not identical. This got her thinking. Why did her test scores vary? If the tests she gave her students were administered twice, would her students get the same scores? Sometimes Ms. Chan creates a second form of a test to use for students who were absent for the first test. If a student took two different forms of a test, would the scores vary even more? How can a test score be used fairly if it isn't always the same?
No measurement is perfect. Take something as simple as measuring your height. If you measure using a cloth tape, it might stretch a little. If you are measuring with a steel tape, the tape might contract a bit if you measure yourself outside on a cold day. Moreover, people tend to be a tiny bit taller when measured in the morning than in the afternoon (due perhaps to decompression of the spine while sleeping). So can measuring height be useful even if it is not perfectly accurate?

Reliability refers to consistency in measurement. If we are measuring a stable trait, like intelligence, we would like each administration of the assessment to yield the same results. A common analogy for understanding reliability is a dartboard. If you consistently throw a dart in the same general spot on the dartboard over and over again, then your throwing is reliable (even if you miss the bull's-eye every time!). However, because tests (and your dart throws) are imperfect, we often do not get exactly the same score each time. Thus, a test has good consistency if the differences (or variance) between administrations are small.
As defined classically, reliability is the proportion of the total variance that can be attributed to "true" variance. In other words, a test is reliable if it can consistently measure a trait or ability without being influenced much by random error. Reliability is not an all-or-nothing concept, but rather something we consider in terms of degree.
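The classical definition can be written as a ratio. The symbols below are an assumed notation chosen for illustration and do not appear in the lesson: the total (observed) score variance is written as \sigma^2_X, the "true" variance as \sigma^2_T, and the random error variance as \sigma^2_E, with the usual classical assumption that error is unrelated to the true score.

```latex
\[
\text{reliability}
  = \frac{\sigma^2_T}{\sigma^2_X}
  = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}
\]
```

Under that assumption, a hypothetical test with a true variance of 8 and an error variance of 2 would have a reliability of 8 / (8 + 2) = 0.80. The ratio equals 1 only when there is no random error at all, which is why reliability is treated as a matter of degree.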
Test-retest reliability

Several types of reliability are characteristic of a fair test. Test-retest reliability examines the consistency of individual scores across different administrations of a test. For example, Winnie takes a strength test on Monday and then takes the same strength test on Friday. Assuming Winnie hasn't been hitting the weight room to improve her strength, if the measure has test-retest reliability, it should produce similar results across the two administrations. Test-retest reliability only makes sense when we are measuring traits or abilities that are assumed to be relatively stable over time. It would not make sense, for example, if we gave a sixth-grade class a math test at the beginning of the school year and then again at the end of the school year. In this case, we would expect that the students would have learned new math skills, thus improving their scores.
Parallel-forms reliability

Another way to measure reliability is a method called parallel-forms reliability. Many teachers use alternate forms of a test, which are intended to be equivalent. For example, perhaps you had a student miss an exam. To prevent any unfair advantage on the test, you decide to give the student a different version of the test. This different form is intended to measure the same concepts but uses different questions to do so. Parallel-forms reliability evaluates whether parallel or alternate forms of a test are in fact equivalent. It is calculated by correlating scores on the two forms. Standardized test developers often have alternate forms of tests to prevent students from sharing information about what questions are on a test. These companies are expected to be able to demonstrate that different forms of the "same" test correlate well with each other.
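As a rough illustration of correlating two forms, here is a minimal Python sketch. The lesson does not say which correlation coefficient to use, so the Pearson correlation is an assumption here, and the helper name pearson_r and all of the scores are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around each mean.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical data: the same six students took Form A and Form B.
form_a = [78, 85, 62, 90, 71, 88]
form_b = [80, 83, 65, 92, 70, 85]

parallel_forms_r = pearson_r(form_a, form_b)
print(f"Parallel-forms correlation: {parallel_forms_r:.2f}")
```

A value close to 1.0 would suggest that students are ranked in much the same way by both forms. The same calculation could be applied to two administrations of a single test to look at test-retest consistency.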
Internal consistency reliability

Internal consistency reliability is a form of reliability that indicates the degree of stability in responses across the many items of a single test.
Split-half reliability

A traditional way of demonstrating internal consistency is to administer one test and then split the test evenly into two halves, such as the even-numbered and odd-numbered questions. Then correlate the scores from the two equivalent halves of the test. The stronger the correlation (the closer to 1.0), the better the internal consistency. Generally speaking, a correlation of 0.85 or greater would represent good internal consistency.
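The odd/even split can be sketched in a few lines of Python. The response data are hypothetical (1 = correct, 0 = incorrect), the Pearson correlation is an assumed choice of coefficient, and pearson_r is an illustrative helper rather than anything named in the lesson.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical data: one row per student, one column per test item.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 0],
]

# Questions 1, 3, 5, 7 sit at list indices 0, 2, 4, 6 (the odd-numbered half);
# questions 2, 4, 6, 8 sit at indices 1, 3, 5, 7 (the even-numbered half).
odd_half = [sum(row[0::2]) for row in responses]
even_half = [sum(row[1::2]) for row in responses]

split_half_r = pearson_r(odd_half, even_half)
print("Odd-half scores: ", odd_half)
print("Even-half scores:", even_half)
print(f"Split-half correlation: {split_half_r:.2f}")
```

Each student ends up with two half-test scores, and the correlation between those two columns is the split-half estimate that would be compared against the 0.85 guideline.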
Coefficient (or Cronbach's) alpha

Split-half reliability is one way to measure the internal consistency of a test. A more common way to report this characteristic is through a statistical value called coefficient alpha, also known as Cronbach's alpha. Coefficient alpha is an estimate of the average of all possible split-half reliabilities of a test. Although it is computed differently from a split-half correlation, it is interpreted in the same way. This method is particularly useful when only one administration of a test is possible or practical.
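The lesson does not give a formula for coefficient alpha. The sketch below uses the standard textbook form, alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The function name cronbach_alpha and the item scores are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Coefficient alpha for a list of rows (one row of item scores per person)."""
    k = len(responses[0])  # number of items on the test
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item (column) across people.
    item_variances = [pvariance([row[i] for row in responses]) for i in range(k)]
    # Variance of each person's total score.
    total_variance = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: five students, four items each scored 0 to 4.
responses = [
    [4, 3, 4, 4],
    [3, 3, 2, 3],
    [2, 2, 3, 2],
    [1, 0, 1, 1],
    [3, 4, 3, 4],
]

alpha = cronbach_alpha(responses)
print(f"Coefficient alpha: {alpha:.2f}")
```

Only one administration of the test is needed as input, which matches the practical advantage described above. Population variance is used throughout; using sample variance for both the items and the totals would give the same ratio.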
Inter-rater reliability

Finally, inter-rater reliability refers to the amount of agreement or consistency between two different raters scoring the same test. Tests that require mostly subjective scoring, such as an essay test, are less likely to have high inter-rater reliability than, say, an objective math test. However, teachers and standardized test developers can help address the problem of inter-rater reliability on subjective tests by establishing a grading rubric beforehand.

To demonstrate that an individual's score represents typical performance, it must be shown that it makes no difference which judge, scorer, or rater was used to score the task. The level of inter-rater reliability is usually established with correlations between raters' scores for a series of people or with a percentage that indicates how often they agreed.
Inter-rater correlations

Inter-rater correlations compare the scores that two raters awarded one individual on each item. If the correlation is high (0.85 or greater), then the two raters were fairly consistent in how they assessed the individual, and one could conclude that the inter-rater reliability is decent. The greater the correlation (the maximum is 1.0), the stronger the relationship between the two sets of scores. The lower the correlation (the minimum is -1.0), the more inconsistent the two raters were in their scoring of the assessment and the less reliable the scores are. A correlation of -1.0 would indicate perfect disagreement: for example, every time the first rater gave an answer a perfect score, the other rater would give the answer a 0, and vice versa.
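To make the two ends of the scale concrete, the following Python sketch correlates item-by-item scores from two raters. All scores are hypothetical, the Pearson correlation is an assumed choice of coefficient, and pearson_r is an illustrative helper.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical case 1: two raters score five items for one student (0 to 4 each)
# and differ on only one item.
consistent_a = [4, 3, 2, 4, 1]
consistent_b = [4, 3, 3, 4, 1]
consistent_r = pearson_r(consistent_a, consistent_b)

# Hypothetical case 2: perfect disagreement. Whenever rater A awards the
# maximum score of 4, rater B awards 0, and vice versa.
opposite_a = [4, 0, 4, 0]
opposite_b = [0, 4, 0, 4]
opposite_r = pearson_r(opposite_a, opposite_b)

print(f"Mostly consistent raters: r = {consistent_r:.2f}")
print(f"Perfect disagreement:     r = {opposite_r:.2f}")
```

The first pair of raters lands above the 0.85 guideline, while the second pair produces the minimum possible correlation of -1.0.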
Percentage of agreement

Percentage of agreement is easy to understand and is related to reliability (in general, the higher the percentage of agreement, the higher the reliability), but it is not itself a measure of reliability. If percentage of agreement is used as the method for looking at inter-rater reliability, then two people independently score the same set of tests or items and their scores are compared. The percentage of tests or items to which the two raters gave exactly the same score is their percentage of agreement. Depending on the range of score points awarded to each item or task, the percentage of agreement may instead count scores on which the two raters were within one point of each other, as opposed to exact agreement. For example, if a task was rated on a 10-point scale and rater A assigned 8 points to a student and rater B assigned 9 points to the same student, then it is acceptable to count this as an "agreed" score. For items scored on a 0-4 scale, exact agreement of 80% between the two sets of scores is considered reasonable.
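Both counting rules described above (exact agreement and agreement within one point) can be sketched with a single Python function. The function name percent_agreement, its tolerance parameter, and the scores are hypothetical and chosen only for illustration.

```python
def percent_agreement(scores_a, scores_b, tolerance=0):
    """Percentage of items on which two raters agree.

    tolerance=0 counts only exact matches; tolerance=1 also counts
    scores that are within one point of each other.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    agreed = sum(1 for a, b in zip(scores_a, scores_b) if abs(a - b) <= tolerance)
    return 100 * agreed / len(scores_a)


# Hypothetical data: eight tasks rated on a 10-point scale by two raters.
rater_a = [8, 6, 9, 4, 7, 10, 5, 3]
rater_b = [9, 6, 7, 4, 8, 10, 5, 6]

exact = percent_agreement(rater_a, rater_b)
within_one = percent_agreement(rater_a, rater_b, tolerance=1)

print(f"Exact agreement:      {exact:.0f}%")       # 4 of 8 tasks -> 50%
print(f"Within-one agreement: {within_one:.0f}%")  # 6 of 8 tasks -> 75%
```

The first task (8 versus 9) counts as a disagreement under the exact rule but as an agreement under the within-one-point rule, which is why the two percentages differ.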
Classroom teachers can develop inter-rater reliability information for their own scoring rubrics or subjective scoring rules using these methods. Teachers may worry less about a scored classroom assessment having low inter-rater reliability, however, if they develop scoring guides that are as objective as possible. With standardized tests, using multiple raters and checking the agreement levels among them is good practice. The trained experts who do this sort of scoring have usually demonstrated a high level of inter-rater reliability before they are considered ready to score these often high-stakes tests. Even after experts have been trained to score, they are typically re-tested from time to time to verify that they are still scoring at a high level of inter-rater reliability.
EXAMPLE (Case Study)

Mr. Dixon often uses an assignment that asks his students to produce a map of their school building in relation to surrounding elements. He has used a scoring rubric that he made himself to assess the quality of these maps. The rubric takes a holistic approach, which evaluates the product as a whole and not in pieces, and assigns a single score based on a five-point scale. The scale points range from 0 points, which means "minimum assignment requirements were not met," to 4 points, which means "there are many accurately placed and scaled elements." He is interested in checking the reliability of his scoring rubric and has asked another teacher, Ms. Valdez, to help him. Both teachers score the same ten maps. They score these maps without knowing how the other person scored them. They then compare scores for percentage of exact agreement. The results of this inter-rater reliability mini-study are below.

Student's map   Mr. Dixon's score   Ms. Valdez's score   Exact agreement?
A               4                   4                    Yes
B               3                   4                    No
C               3                   4                    No
D               3                   3                    Yes
E               4                   4                    Yes
F               2                   3                    No
G               0                   0                    Yes
H               3                   3                    Yes
I               4                   4                    Yes
J               4                   4                    Yes


Mr. Dixon and Ms. Valdez, while following the same scoring rubric, did not agree all the time on which score a student should receive. Out of ten opportunities to agree, the two scorers agreed seven times. 7/10 = 70%, so Mr. Dixon could describe the reliability of his scoring rubric as a percentage of agreement of 70%. This seemed low to him, so he decided that he would add more precise descriptions of what each score should mean. Even though he will be the only one scoring these assignments, his data suggested that there was some subjectivity in the way scores could be assigned, and he worried that he might assign different scores to work of the same quality.
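The 70% figure can be reproduced with a short Python sketch. The scores are copied from the table in the case study; the variable names are hypothetical, and the check of how far apart the disagreements were is an extra step that is not part of Mr. Dixon's analysis.

```python
# Scores for maps A through J, copied from the case-study table.
maps = list("ABCDEFGHIJ")
dixon = [4, 3, 3, 3, 4, 2, 0, 3, 4, 4]
valdez = [4, 4, 4, 3, 4, 3, 0, 3, 4, 4]

# True where both teachers gave exactly the same score.
agreements = [d == v for d, v in zip(dixon, valdez)]
exact_pct = 100 * sum(agreements) / len(agreements)

# Which maps were scored differently, and by how much at most?
disagreed_maps = [m for m, same in zip(maps, agreements) if not same]
largest_gap = max(abs(d - v) for d, v in zip(dixon, valdez))

print(f"Exact agreement: {sum(agreements)}/{len(agreements)} = {exact_pct:.0f}%")
print("Maps scored differently:", ", ".join(disagreed_maps))
print("Largest difference in points:", largest_gap)
```

In these data, the three disagreements (maps B, C, and F) are each only one point apart, and in each case Ms. Valdez gave the higher score. That pattern is consistent with Mr. Dixon's plan to sharpen the descriptions of adjacent score points.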
Principles of Measurement