Presentation: Fairness

After recess Shaquille went to Mr. Coyne to complain. "I just talked to all my friends and we think the reading test you gave was biased!" "Why do you think that?" Mr. Coyne asked. "The girls scored much higher than the boys," Shaquille replied.

Score Differences Versus Bias

There can be many reasons scores for two groups of people are different. Some are legitimate reflections of existing differences. Others might be due to chance. Still others might be due to bias - a systematic tendency for one group to perform better than another for reasons that are not intended by the definition of what is being measured. This section discusses different types of error that occur in tests, as well as ways to detect biased items.

Sources of Variance

Variance is a fancy statistical term for how much a set of numbers differ from one another. It is defined by how far each number is from the average of the set. If all the numbers are the same, each one equals the average and the variance is 0. If all the numbers are at the extremes (say 10 scores of 0 and 10 scores of 100), then every score is far from the average (which is 50 in this case) and the variance is high.
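
The two cases just described can be checked with a few lines of Python. This is only a minimal sketch: the function name `population_variance` is made up for illustration, and it uses the population form of variance (dividing by the number of scores), which is an assumption about which form is meant.

```python
def population_variance(scores):
    """Average squared distance of each score from the mean of the set."""
    average = sum(scores) / len(scores)
    return sum((score - average) ** 2 for score in scores) / len(scores)

# Twenty identical scores: every score equals the average.
all_same = [70] * 20
# Ten scores of 0 and ten scores of 100: the average is 50.
extremes = [0] * 10 + [100] * 10

print(population_variance(all_same))  # 0.0
print(population_variance(extremes))  # 2500.0 (each score is 50 points from the average)
```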

Test Bias

As we said before, there are lots of different reasons why scores might vary. Perhaps the test scores were for a math test for 5th graders, but all the problems were word problems and were written at the 8th grade reading level. Maybe the 10 students who got 0s got that score not because they did not know math, but because they could not read. Maybe they also would have gotten scores of 100 if they were better readers. Calling something a math test does not mean that it is measuring only math. Maybe this math test is biased against poor readers.

Test bias can hurt students (and teachers) in many ways. One obvious way is that it can lead teachers to make bad instructional decisions for students. A low math score might lead to a student being placed in a remedial math course, but what they need might be additional reading instruction.

Because bias is systematic, it affects not just one child, but groups of children. If there is a correlation between an irrelevant construct that a test is measuring and membership in a racial, ethnic, or gender group, then the bias can negatively impact whole groups of children. Thirty or more years ago far more males than females were interested in and knowledgeable about sports. Many verbal reasoning tests use word analogy problems, and some of those problems used sports terms. Research showed that some of those items were biased against female students: they underestimated the verbal reasoning ability of females.

Similarly, some topics used in reading comprehension passages are, on average, of more or less interest to students in a particular ethnic group. There is research showing that, on average, students do better on reading tests if they find the passages interesting. If most passages on the test were on average of greater interest to White male students, then on average the reading ability of White male students would be overestimated compared to that of female, Asian, Black, and Hispanic students. It is important to remember that while student interests may correlate with race, ethnicity, or gender, students vary in their interests within groups as well as across them. Regardless of the average interest of groups defined by race or ethnicity, individual students will find a passage about Russell Simmons (entrepreneur and founder of the pioneering hip hop record label, Def Jam records) more or less interesting than one about Richard Simmons (flamboyant fitness expert and promoter of weight loss programs).

Many misconceptions exist about what constitutes bias in an examination. The mean or average score difference depicted in the scenario is referred to as "impact" and may or may not represent bias. For example, imagine a test where there is a significant average score difference between White and Asian students. Perhaps the purpose of the examination was to test English language usage and the Asian examinees primarily come from a non-English speaking background. In this situation, we might readily expect group differences. The purpose of this section is to introduce you to the topic of bias and procedures that reflect high-quality test development practices.
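
As a rough illustration of "impact", the sketch below computes the difference between two groups' average scores. The scores are invented toy data and the function name `impact` is hypothetical. The number it produces describes only the size of the gap; it cannot say whether the gap reflects a real difference, chance, or bias.

```python
def impact(scores_a, scores_b):
    """Difference in mean score between two groups (group A minus group B)."""
    mean_a = sum(scores_a) / len(scores_a)
    mean_b = sum(scores_b) / len(scores_b)
    return mean_a - mean_b

# Made-up reading scores for the scenario; not real data.
girls = [82, 90, 76, 88]
boys = [70, 85, 64, 77]

print(impact(girls, boys))  # 10.0
```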

Test Fairness and Test Bias

The issue of fairness has become one of growing public concern since the Civil Rights Act of 1964. A natural extension of that concern has been the attention to evaluating bias within tests used for selection, placement, and classification. Bias in an examination can be any irrelevant influence causing differences in examinee scores, as opposed to differences resulting from true variation in ability. In other words, test bias results in differential validity for different groups. Bias is another example of a systematic error and is a technical concept in that it can be analyzed impartially through statistics. Because bias is closely linked to principles of score validity, quality test construction procedures require that bias be addressed throughout the test design, construction, and implementation stages. Fairness, although usually associated with bias, is not the same thing. Fairness is not a technical concept but a broad concept that is based on philosophies of test use and on social and personal values. Fairness is a particularly controversial issue in our society today as tests (e.g., intelligence tests, the SAT) are often used in selection processes and in conferring privileges. Theoretically, it is possible for both a biased test and a non-biased test to be used fairly or unfairly.

Methods for Identifying Bias

There are two general ways of identifying bias: judgmentally and empirically. Judgmental reviews are conducted throughout the test development process and are concerned with the opinion of individuals representing the relevant subgroups in the population of potential examinees. The term "relevant subgroup" can be interpreted in multiple ways depending on the examinee population; common examples include minority groups, gender groups, individuals with disabilities, or individuals whose primary language is not English. Judges are responsible for examining various components of the testing process such as test blueprints, individual items, or test administration manuals. In the examination of materials, judges are asked to identify such things as stereotypes, unfamiliar content or verbiage, or unequal representation.

Another procedure often used for bias detection is empirical review. Empirical reviews are conducted following test administration. This review allows test developers to ascertain statistically whether individual items perform differently for relevant subgroups; it is known as Differential Item Functioning (DIF) analysis. Procedurally, DIF analyses involve matching two groups, such as males and females or Caucasian and Asian examinees, on the criterion of interest (usually the total test score) and looking for group differences over and above ability. DIF is present when examinees in the two groups have the same ability or total test score but a different probability of responding correctly to a particular test question. A variety of statistical techniques are available for DIF analyses. Two distinct types of DIF can be identified: uniform and non-uniform. Uniform DIF is present when the probability of answering an item correctly is consistently or "uniformly" higher for one group over all levels of ability. Non-uniform DIF is present when the difference between the groups in the probability of answering an item correctly is inconsistent or "non-uniform" over levels of ability; the size of the gap changes, and may even reverse direction, from one ability level to another. Positive identification of DIF is not proof of bias, but indicates that an item may be unfair to a particular subgroup. Upon identification, expert judges representing the group of interest should conduct a logical review of items exhibiting DIF and either revise, remove, or approve the items in question.
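
The matching idea can be sketched in code. The example below uses the Mantel-Haenszel common odds ratio, which is one commonly used DIF statistic; the lesson does not say which technique it has in mind, so treat the choice as an assumption. The function names, the group labels "ref" and "focal", and all of the counts are invented for illustration. Examinees are grouped by total score, and within each score level the odds of answering the item correctly are compared across the two groups. A value near 1 suggests no DIF; a value well above 1 suggests the item favors the reference group among examinees of equal total score.

```python
from collections import defaultdict

def mantel_haenszel_odds_ratio(records):
    """records: iterable of (group, total_score, correct) tuples, where
    group is "ref" or "focal" and correct is 1 or 0 for the studied item.
    Returns the common odds ratio, or None if it is undefined."""
    # Per score level: [ref correct, ref incorrect, focal correct, focal incorrect]
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for group, total_score, correct in records:
        index = (0 if group == "ref" else 2) + (0 if correct else 1)
        strata[total_score][index] += 1

    numerator = denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata.values():
        n = ref_right + ref_wrong + focal_right + focal_wrong
        numerator += ref_right * focal_wrong / n
        denominator += ref_wrong * focal_right / n
    return numerator / denominator if denominator else None

def make_records(group, total_score, n_right, n_wrong):
    """Helper that builds toy examinee records (hypothetical data only)."""
    return [(group, total_score, 1)] * n_right + [(group, total_score, 0)] * n_wrong

# Invented counts for one item at two total-score levels.
toy_data = (
    make_records("ref", 5, 8, 2) + make_records("focal", 5, 5, 5)
    + make_records("ref", 8, 9, 1) + make_records("focal", 8, 7, 3)
)
print(round(mantel_haenszel_odds_ratio(toy_data), 2))  # 3.94
```

A statistic like this summarizes a consistent advantage across score levels, so it is better suited to flagging uniform DIF than non-uniform DIF. As with any DIF result, a flagged item still needs the judgmental review described above.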

Tests for All

The No Child Left Behind Act requires all students to be tested, including those with moderate to severe learning disabilities. This section discusses the idea behind a universally designed assessment, as well as accommodations, modifications, and alternate assessments.

Universal design

A universally designed assessment is one that is accessible to students who might have any of a variety of common disabilities. That is, most people can accurately demonstrate their level of knowledge on such a test without needing a special form of the assessment or administration conditions that are decided upon by a third party. For example, a universally designed computerized assessment might have a feature that allows all examinees to select the text size with which they are most comfortable. The key to a universally designed test is that a separate test or special accommodations are not necessary on an individual basis because the test automatically provides appropriate administration choices for each examinee.

Accommodations

Accommodations are adjustments to a test that are intended not to affect the validity of the test, but to make it accessible to students with disabilities. On a reading test, for example, providing the text in a larger font size is an accommodation that does not affect the validity of the test; it simply makes the test accessible to a visually impaired student. In general, a testing accommodation is any change to the testing conditions that reduces the impact of construct-irrelevant factors (in the example, factors other than reading ability) for an identifiable subgroup of examinees and has no significant impact on the scores of other examinees.

The most common, and in some ways most controversial, accommodation is testing time. Some examinees, such as those with attention deficit disorder or dyslexia, require more time to process written information than do examinees without these conditions. However, research has shown that many other examinees, not all of whom are members of readily identifiable subgroups, benefit from additional testing time.

Other common accommodation approaches include translating test questions into other spoken or sign languages, providing Braille versions, and giving test directions in audio form.

Modifications

Modifications are adjustments to a test that are likely to affect the constructs measured by the test. Tests with modifications are changed more dramatically than tests with accommodations, as described above. Modified tests may be altered in terms of length or, in the case of a multiple-choice test, the number of options. The overall difficulty of the test may also be modified from the original version.

Alternate Assessments
Alternate assessments are assessments intended to measure the same broad constructs as a test designed for the general population, but to allow meaningful scores to be produced for students with severe (usually cognitive) disabilities. For example, the general population might take a reading comprehension test, but an alternate assessment might broaden the construct to receptive communication and use a test, structured observation protocol, or portfolio system to allow children with severe autism or profound intellectual disabilities to demonstrate their receptive communication skills. Such children would not be able to respond meaningfully to the assessment used by the general population, so that test would not provide useful results for them.

Summary

The issue of test fairness is becoming ever more important as we continue to rely on tests for selection, placement, and classification of members of our society. Quality test construction requires that we consider issues of fairness and bias throughout the design, development, and implementation stages. A test must have adequate degrees of both validity and reliability to be fair. In addition, we must take into account both random and systematic sources of error when interpreting examinees' scores. Setting performance standards and cut-scores, as well as using empirical and judgmental methods to screen for biases, are important steps to ensure that the tests we develop are useful and fair for their intended purposes, and appropriate for the target audience of examinees.
Principles of Measurement