Developing Standardized Tests

Mr. Rodriguez routinely sets aside class time so that his students can take standardized tests. It seems that more and more tests are required at the national, state, or local level, and as he is asked to give up more instructional time so that his students can prepare for or take these tests, he wonders more about the validity and precision of these assessments. What are they supposed to do? Where did they come from? How are they made? He would like to know how these tests are developed and whether he can trust the scores they produce.

"Standardized" Tests and Their Purpose

The word "standardized" refers to the method in which a test is developed, administered, and scored. Standardized tests are developed by a team of psychometricians (statisticians who are specifically trained in measurement and test development) and content experts in accordance with a set of strict guidelines (standards) put forth by the professional association most germane to the purpose of the test.

The most widely used set of standards for educational tests is the "Standards for Educational and Psychological Testing," developed jointly by the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME). Standardization also implies that the test is administered and scored following exactly the same instructions and procedures every time, regardless of where, when, or by whom it is given.

The first and most important step in developing a standardized test is to clearly identify its purpose. Is the test going to be used to measure students' level of mastery in a subject, or is it going to be used as a tool for college admissions? What inferences will be made from the scores? What decisions will be made based on those inferences? What is the target population that the test is going to be used with? These are some of the questions test developers consider when setting out to design a standardized test. The overall purpose of a test serves as the driving force that guides every subsequent step in its development. As a result, it is imperative that a great deal of time and effort is spent explicitly defining the purpose of a test before any other step is taken.

Types of Standardized Tests

The two most common types of educational tests are achievement tests and aptitude tests. For either type, a test developer must also decide whether to design the test to facilitate criterion-referenced or norm-referenced interpretations.

Achievement versus Aptitude

One way to categorize standardized tests in education is to group them as being either achievement or aptitude tests. Achievement tests attempt to measure a person's current knowledge or skill level in a given realm. Examples of achievement tests range from professional licensure exams (such as the Uniform Certified Public Accountant Examination for accountants or the Bar Exam for lawyers) to teacher-made English mid-terms aimed at seeing how well students have mastered the course material. Aptitude tests, by contrast, attempt to estimate a person's potential for future learning or performance.

Criterion-referenced versus Norm-referenced

Another way to broadly categorize a standardized test is by the way its scores are interpreted; that is, whether it is a criterion-referenced test or a norm-referenced test. Recall from the lesson on score interpretation that criterion-referenced tests are used to determine whether each student has achieved specific skills or concepts within a content domain. Norm-referenced tests are used to rank each student's performance on an assessment in comparison to a norm group.

To best facilitate criterion-referenced interpretations, a test should be designed to focus on the specific content that all students are intended to master. So a criterion-referenced test designed to measure achievement in 5th grade math would not include topics addressed in the curriculum for 4th or 6th grade math classes.

To best facilitate norm-referenced interpretations, a test must have content easy enough to differentiate among very low achieving students and hard enough to differentiate among very high achieving students. Thus a norm-referenced math test designed for 5th graders might include questions on curriculum topics covered in 3rd through 7th grade.

Building in Validity with Detailed Blueprints

Content Domains

Once the purpose and interpretational basis of a test have been decided upon, the process shifts to defining the content area or areas that relate to that purpose. Defining the appropriate content domain is extremely important, for even the best-constructed test is rendered useless if its content does not match its stated purpose (e.g., a test composed of vocabulary problems that is designed to measure grammar skills).

The content area of a test is commonly referred to as the content domain. The domain is a theoretical universe of content that represents every conceivable piece of knowledge or skill set that directly relates to the purpose of the test. For example, if the purpose of the test is to determine 3rd graders' knowledge of adding single-digit fractions, then the content domain is every possible addition equation that contains single-digit fractions.

Tests serve as a representative sample of the content domain, since it is impractical to design a test that includes every aspect of an entire content domain. There are various methods for defining a content domain. One typical approach is to survey experts in a given content area (e.g., asking 6th grade teachers what type of content should be included on a social studies test to be taken by students in the sixth grade). Other sources useful in helping define a content domain are textbooks, coursework, syllabi, professional associations, and direct observations.

Operationalizing Constructs

Defining a content domain varies in difficulty depending on the complexity of the test's purpose. As previously mentioned, the purpose of the test may be to measure an abstract concept, which psychometricians call a construct. If this is the case, the construct needs to be operationally defined. In other words, the test developers need to translate the construct into observable behaviors that can be measured. For example, suppose test developers at ACME Testing Company set out to develop a test that they can use to measure a person's level of creativity. Since creativity is an abstract characteristic that cannot be directly observed, the developers need to operationally define exactly what it is. Is creativity how artistic a person is? Or is it how well a person can solve a problem in an atypical way? Or is it both? Does it relate to a person's level of intelligence? Maybe. What about their height? Probably not. How a construct is defined directly relates to how the test can measure it. Regardless of how a construct is defined, the definition needs to be supported by a theoretical rationale and, if possible, previous research.

The stronger the rationale that supports a specific definition, the stronger the evidence will be that the test measures what it claims to measure.

As previously mentioned, it is not practical to have a test that includes the entire content domain related to a test's purpose. As a result, a test includes only a sample of items that are believed to adequately represent the domain as a whole. The process of selecting items to use from the overall content domain is known as domain sampling. The guidelines for domain sampling are dictated by test specifications.

Test Specifications

Test specifications provide an overall outline of how the test should be constructed from the content domain(s). They define the test format (e.g., paper-and-pencil, oral exam, computerized); how many forms or versions of the test there will be; whether the test will be timed and, if so, how much time will be given; the total number of items for each section and for the test as a whole; the type(s) of item formats (e.g., multiple-choice, true-false, short-answer); the scoring rules or rubric; and the method of score interpretation (e.g., criterion-referenced or norm-referenced).

In addition, the test specifications typically result in a table of specifications, or "blueprint," that determines the make-up and characteristics of the test items. A simple blueprint outlines the number and proportion of the test items that should be of a particular format and from a particular content area. Determining which format to use and in what proportion relates to the test's purpose and established test writing rules. Knowing which content areas to include and in what proportion is a direct product of how the content domain is defined.

Test blueprint

The table represents a blueprint for a hypothetical second grade math test. Imagine that, in order to create this table, a survey of experts revealed that an average second grade math curriculum covers addition, subtraction, multiplication, and simple measurement quantities. Together these subjects represent the content domain of second grade math. However, the topics are not covered equally. Addition and subtraction together take up approximately 80 percent of the coursework, whereas multiplication and measurement quantities are each covered only 10 percent of the time. Therefore, a 20-point test should represent the content domain by including eight questions on addition (40%), eight questions on subtraction (40%), two questions on multiplication (10%), and two questions on measurement quantities (10%).
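
The arithmetic behind a simple blueprint like this one can be sketched in a few lines of Python. The function name `allocate_items` and the largest-remainder rounding rule are illustrative assumptions, not part of any published test specification; the coverage percentages are the ones from the hypothetical second grade example.

```python
def allocate_items(total_items, coverage_percent):
    """Split total_items across content areas in proportion to coverage.

    coverage_percent maps each content area to a whole-number percentage;
    the percentages are assumed to sum to 100. Integer arithmetic plus
    largest-remainder rounding keeps the counts adding up to total_items.
    """
    # (whole items, remainder) for each area, computed without floats
    quotas = {area: divmod(total_items * pct, 100)
              for area, pct in coverage_percent.items()}
    counts = {area: whole for area, (whole, _) in quotas.items()}
    leftover = total_items - sum(counts.values())
    # Give any leftover items to the areas with the largest remainders.
    by_remainder = sorted(quotas, key=lambda area: quotas[area][1], reverse=True)
    for area in by_remainder[:leftover]:
        counts[area] += 1
    return counts


coverage = {"addition": 40, "subtraction": 40,
            "multiplication": 10, "measurement": 10}
blueprint = allocate_items(20, coverage)
print(blueprint)
# {'addition': 8, 'subtraction': 8, 'multiplication': 2, 'measurement': 2}
```

For test lengths that do not divide evenly (say, 25 items), the rounding rule decides which area receives the extra item; a real blueprint would settle that by expert judgment rather than by arithmetic alone.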

Often a more complex blueprint is constructed by weighting each of the content areas based on the subject's relative importance to the test's purpose (e.g., if the aforementioned second grade math test is being used to select students for an advanced placement math class, then the more difficult multiplication items might be seen as being "worth more" than the other items). Another method is to include a dimension of cognitive abilities, detailing the specific competencies that are being utilized within a content area. Bloom's taxonomy of cognitive operations is frequently used for this purpose. It consists of six levels of cognitive processing: knowledge, comprehension, application, analysis, synthesis, and evaluation.

A test blueprint should provide sufficient detail to guide the development process toward a satisfactory test. If two expert test developers were given the same blueprint, you would expect the tests they developed to have different items but to be so similar that it would not matter which test you used.

Assuring Reliability with Pilot Testing and Item Analysis

Item Development

Following the creation of the test specifications and blueprint, the next step is to write the test questions or items. Items for large standardized tests are created by a group of professional writers who are experts in the content domain being measured and have been specially trained to write test items. These groups are most typically made up of teachers. The test specifications and blueprint serve as a very detailed recipe that guides the item writers' efforts and tells them what types of questions to write, how many, and from what content areas.

Once an initial pool of items is written, one or more professional test item editors review the items, making sure that they are grammatically correct, unambiguous, follow standard and acceptable item-writing rules, and are in line with the test specifications and blueprint. The items also undergo a sensitivity review in order to identify any items that may be potentially inappropriate or biased toward a specific group. (This is discussed in more detail in the lesson, Fairness of Standardized Tests.) In addition, the items are reviewed by an independent panel of experts to make sure that the questions cover content that is appropriate for the given target population. Any questionable items are either revised or replaced.

Item Testing

Large standardized test publishers will conduct preliminary item tryouts before administering a large-scale field test. The preliminary tryouts can include anywhere from a few dozen to several hundred participants. This gives test developers the opportunity to observe the participants' behaviors while taking the test to see if any unforeseen irregularities or problems develop. For example, they may notice that none of the students were able to finish in the allotted time, prompting the test developers to re-examine whether the time limit is appropriate.

The test developers review the results of the preliminary item tryouts and remediate any test flaws that were discovered. After this initial tryout, a large-scale field test is conducted using the items. This is typically done by adding the new items to an existing test that has the same purpose, content domain, and target population. For example, the SAT typically has 10 sections: three on critical reading, three on math, three on writing, and one experimental section. The experimental section contains new items that are being field tested for future versions of the test. The student's overall score on the SAT is not affected by these items. They are only included in order to give the test developers an opportunity to try out the items in a real-world situation.

If the test being developed is new and does not have a current version that it can piggyback on for the field test, then the test developers have to administer a stand-alone draft version of the test. As with the previous method, the test results are for experimental purposes only and carry no consequences for the test takers. The California High School Exit Examination is one example of this approach. The state field-tested the exam by requiring segments of the student population to take it starting in 2002. However, prior to the exam's official roll-out in 2006, students were allowed to graduate even if they did not pass the test, since those administrations were used only to try the items out.

Item Analysis

The field tests provide test developers with real-world data that allow them to perform various statistical analyses on each item to determine its quality. As with all stages of the item review process, the key objective is to identify and then revise or replace any items that are deemed of poor quality. The quality of an item is usually measured on three dimensions: difficulty level, discrimination index, and differential item functioning.

Difficulty Level

The difficulty level refers to the proportion of test takers that answered the item correctly. For example, an item with a difficulty level of 0.90 means nine-tenths or 90 percent of the test takers answered it correctly. Items should be neither too easy nor too hard. As a result, an item difficulty level of 0.50 is generally desired (meaning that half of the test takers answered it correctly and half answered it incorrectly).
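
As a minimal sketch of this calculation, the Python below computes the difficulty level of each item from a table of scored responses. The function name `item_difficulty` and the response data are made up for illustration; each row is one test taker, with 1 for a correct answer and 0 for an incorrect one.

```python
def item_difficulty(responses):
    """Proportion of test takers who answered each item correctly.

    responses is a list of rows, one per test taker; each row holds
    1 (correct) or 0 (incorrect) for every item on the test.
    """
    n_takers = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_takers
            for i in range(n_items)]


# Made-up data: 10 test takers, 3 items.
responses = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
difficulty = item_difficulty(responses)
print(difficulty)  # [0.9, 0.5, 0.1]
```

In this invented sample the first item is very easy (0.90), the third is very hard (0.10), and the second sits at the generally desired 0.50.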

Discrimination Index

The discrimination index measures the tendency of high-performing students to answer an item correctly and low-performing students to answer it incorrectly. An item of good quality is one in which the majority of the high-scoring students answer it correctly and the majority of the low-scoring students get it wrong. This should make intuitive sense, since it would be illogical for the group of lower-level students to answer a question correctly more frequently than the group of higher-level students.

Calculating the discrimination index requires a way to identify which students are in the high-performing group and which are in the low-performing group.

One common way for a classroom teacher to do this is to rank the test takers by their raw score and then divide the sample group in half, classifying the top half of scorers as the high-performance group and the bottom half of scorers as the low-performance group. The discrimination index is then determined by subtracting the difficulty index for the low-performing group from that of the high-performing group. Standardized test developers typically calculate a correlation between each single item and the total test score. If students getting an item correct receive a score of 1 for that item and students who miss the item receive a 0, this allows for a meaningful correlation between each item and the test score. Items that do not correlate positively with the total test score are suspect.
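
Both approaches can be sketched briefly in Python. The function names and the six-person data set are hypothetical. The sketch also assumes that the total score includes the item under study and that every item has at least one correct and one incorrect response (otherwise the correlation is undefined).

```python
from statistics import mean


def discrimination_index(responses, item):
    """Upper-lower index: item difficulty in the top half of scorers
    minus item difficulty in the bottom half."""
    ranked = sorted(responses, key=sum, reverse=True)  # rank by raw score
    half = len(ranked) // 2  # with an odd count, the middle scorer is left out
    high, low = ranked[:half], ranked[-half:]
    return mean(row[item] for row in high) - mean(row[item] for row in low)


def item_total_correlation(responses, item):
    """Pearson correlation between the 0/1 item score and the raw total."""
    x = [row[item] for row in responses]
    y = [sum(row) for row in responses]
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Made-up data: 6 test takers, 3 items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
for item in range(3):
    print(item,
          round(discrimination_index(responses, item), 2),
          round(item_total_correlation(responses, item), 2))
```

In this invented sample, the first item separates the two halves perfectly (index of 1), while the other two items discriminate only weakly.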

Differential Item Functioning

Differential item functioning (DIF) examines the scores of test takers across groups to see if any of the questions were easier or harder for one or more groups. In other words, DIF is a statistical technique used to help determine if any of the items might be biased for or against a particular group. Test developers typically evaluate DIF (pronounced "diff") along ethnic or gender lines. For example, do Hispanic and Caucasian test takers of similar ability level perform the same on an item? (Matching on ability is a key prerequisite: it ensures that the groups being compared are hypothetically similar in every way except for the characteristic under study, i.e., their ethnicity.) If so, then the item has a low level of DIF and is most likely not biased. If performance is significantly different, it may well be that the item is biased and should be revised or removed.
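
The idea of comparing only test takers of similar ability can be illustrated with a deliberately simplified Python sketch. It uses the total raw score as a stand-in for ability and averages the gap in proportion correct across score levels. The function name, the group labels "A" and "B", and the data are all assumptions made for illustration. Operational DIF analyses rely on more formal statistics (for example, the Mantel-Haenszel procedure) and on significance testing, which this sketch omits.

```python
from collections import defaultdict


def matched_difference(records, item):
    """Average gap (group A minus group B) in proportion correct on one
    item, comparing only test takers with the same total raw score.

    records is a list of (group, responses) pairs, where group is "A" or
    "B" and responses is a list of 0/1 item scores. A value near 0 means
    matched test takers in both groups do about equally well on the item.
    """
    # strata[total_score][group] -> list of 0/1 scores on the studied item
    strata = defaultdict(lambda: {"A": [], "B": []})
    for group, responses in records:
        strata[sum(responses)][group].append(responses[item])

    weighted_gap = 0.0
    n_matched = 0
    for scores in strata.values():
        a, b = scores["A"], scores["B"]
        if not a or not b:
            continue  # no matched comparison is possible at this score level
        n = len(a) + len(b)
        weighted_gap += n * (sum(a) / len(a) - sum(b) / len(b))
        n_matched += n
    return weighted_gap / n_matched if n_matched else 0.0


# Made-up data: three pairs of test takers matched on total score.
records = [
    ("A", [1, 1, 0]), ("B", [0, 1, 1]),   # both total 2
    ("A", [1, 0, 0]), ("B", [0, 1, 0]),   # both total 1
    ("A", [1, 1, 1]), ("B", [1, 1, 1]),   # both total 3
]
print(round(matched_difference(records, 0), 2))
```

A large positive or negative value flags an item for closer review; it does not by itself show that the item is biased.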

Test Score Equating

One round of testing had just been completed at Hillview Elementary and the surrounding schools in the district, and scores were starting to roll back in. Curious and somewhat concerned about the performance of the students at Hillview, Mrs. Sullivan helped arrange a meeting among teachers to discuss their results. The group gathered together after school, each bringing the score reports from their classes. The reports contained scores for each individual (linked to a student by an ID number), as well as class statistics such as mean and standard deviation of scores. Almost immediately it became evident that there were unexpected disparities in various aspects of the test across the group of teachers.

The first thing Mrs. Sullivan noticed was that each test report came back with two scores: one that was labeled a "Raw" score, and one a "Scaled" score. While she knew that the raw score represented a student's actual score on a test, she was not sure how that related to the scaled score (which gave a different score value), or why it was even necessary to report the score two different ways. She remembered learning that raw scores are often "scaled", or changed to scaled scores, which basically puts those scores in a new range with a different mean and standard deviation. (The SAT, for example, takes raw scores and scales them so that the distribution has a mean of 500 and a standard deviation of 100. The ACT, a similar test covering similar material, scales results to have a mean of 18 and a standard deviation of 6.) She wasn't sure why this scaling was being done here, though.

It was noted that while most sections of the test (which covered a range of topics from math to reading comprehension and language skills) contained 35 questions or so, the scaled scores ranged anywhere from 40 to 100. Ms. White and Mr. Juarez, both of whom taught 6th grade, noticed something in addition to this. Their reports also contained raw and scaled scores. However, a raw score of 28 on Ms. White's report translated to a scaled score of 90. On Mr. Juarez's report, the same raw score of 28 translated to a scaled score of 93. What was the reason for this? Someone made the observation that the two teachers had been given different forms of the test for use with their classes; this could help explain the differences.

In fact, the incongruence in scaled scores between the reports of Ms. White and Mr. Juarez was due in part to the fact that they were issued different versions of the test, and also due to a process known as equating.

Equating is the process by which raw scores from different tests (or different versions of the same test, or the same test across different grade levels) are translated to a new scale from which direct comparisons can be made across test versions. Equating is a necessary step in score analysis when multiple versions of a test have been used, because it is common for one version of a test to be slightly easier or more difficult than another version. For example, it may be easier for a student to earn a raw score of 28 on one form of a test than on another form.

Equating Test Scores

Here is a sample of the "raw score to scale score" conversions that might have occurred on both teachers' forms:

Raw Score    Scaled Form A (Ms. White)    Scaled Form B (Mr. Juarez)
35           100                          100
34           99                           100
33           97                           99
32           96                           98
31           95                           96
30           93                           95
29           91                           94
28           90                           93
27           89                           91
26           87                           90
25           86                           89


Notice the differences in scales between forms A and B. Most notably, it is possible for a student in Mr. Juarez's class to have earned a raw score of 34 on form B and still receive the maximum scaled score of 100. This is an indication that form B is more difficult than form A, and therefore scoring a 34 (or 35) on form B is considered the equivalent of scoring a 35 on form A. This also explains why the raw score of 28 translates to a lower scaled score for Ms. White's class than for Mr. Juarez's class: because form A is less difficult, a raw score of 28 is only equal to a scaled score of 90, while on form B a raw score of 28 is equal to a scaled score of 93.
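
The sample conversion table can also be treated as a simple lookup, which makes the cross-form comparisons explicit. The Python below is only an illustration: the values are copied from the sample table, while the names `to_scaled` and `equivalent_raw_scores` are hypothetical.

```python
# Raw-to-scaled conversions copied from the sample table for forms A and B.
RAW_SCORES = [35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25]
SCALED = {
    "A": [100, 99, 97, 96, 95, 93, 91, 90, 89, 87, 86],
    "B": [100, 100, 99, 98, 96, 95, 94, 93, 91, 90, 89],
}
CONVERSION = {form: dict(zip(RAW_SCORES, scaled))
              for form, scaled in SCALED.items()}


def to_scaled(form, raw):
    """Look up the scaled score for a raw score on the given form."""
    return CONVERSION[form][raw]


def equivalent_raw_scores(form_from, raw, form_to):
    """Raw scores on form_to that earn the same scaled score as
    `raw` does on form_from (within the range the table covers)."""
    target = to_scaled(form_from, raw)
    return [r for r, s in CONVERSION[form_to].items() if s == target]


print(to_scaled("A", 28), to_scaled("B", 28))   # 90 93
print(equivalent_raw_scores("A", 35, "B"))      # [35, 34]
print(equivalent_raw_scores("B", 28, "A"))      # [30]
```

The last line shows the same point from the other direction: a raw score of 28 on the harder form B is treated as equivalent to a raw score of 30 on form A.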

How are the values used in the equating process defined? Once a test is developed, it is distributed to a sample of test takers; this sample is usually representative of the population for which the test is intended. If multiple versions of the test are available right away, all of these versions will be distributed. As results come back, analysis will take place to determine the relative levels of difficulty of these tests. One form (or version) will be designated the base form, which is the first form whose scores are translated to a new scale. The base form can then serve as the reference form against which other test forms (of varying degrees of difficulty) are scaled.

There are different ways in which score scales can be adjusted. Sometimes, scaling a score is as simple as adding a constant value to each score of a certain form, then adding a different constant value to scores of another form, and so on, across all forms of a test. Other times, tests may use linear equating, which involves specifying the desired mean and standard deviation of the scaled score distribution ahead of time, then using these values and the calculated raw score mean and standard deviation to directly calculate new scaled score values. Yet another type of equating (equipercentile equating) takes into account the percentile ranks of scores on multiple versions of a test, and relates them accordingly.
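
The linear and equipercentile approaches can be sketched as follows. This is a rough illustration under stated assumptions rather than an operational procedure: the function names and score data are invented, the linear version uses the population standard deviation, and the equipercentile version works only with whole-number scores and omits the interpolation and smoothing that real equating studies apply.

```python
from statistics import mean, pstdev


def linear_scale(raw_scores, target_mean, target_sd):
    """Rescale raw scores so the result has the chosen mean and SD."""
    raw_mean = mean(raw_scores)
    raw_sd = pstdev(raw_scores)  # population standard deviation
    return [target_mean + target_sd * (x - raw_mean) / raw_sd
            for x in raw_scores]


def percentile_rank(scores, x):
    """Percent of scores below x, counting half of the scores equal to x."""
    below = sum(s < x for s in scores)
    equal = sum(s == x for s in scores)
    return 100 * (below + 0.5 * equal) / len(scores)


def equipercentile_equivalent(x, form_x_scores, form_y_scores):
    """Whole-number form Y score whose percentile rank is closest to
    the percentile rank of x among the form X scores."""
    target = percentile_rank(form_x_scores, x)
    candidates = range(min(form_y_scores), max(form_y_scores) + 1)
    return min(candidates,
               key=lambda y: abs(percentile_rank(form_y_scores, y) - target))


# Made-up raw scores scaled to a mean of 500 and a standard deviation of 100.
scaled = linear_scale([20, 25, 30, 35, 40], 500, 100)
print([round(s) for s in scaled])

# Made-up score distributions: form Y runs about two points harder than form X.
form_x = [20, 22, 24, 26, 28, 30]
form_y = [18, 20, 22, 24, 26, 28]
print(equipercentile_equivalent(28, form_x, form_y))  # 26
```

In the second example, a raw score of 28 on form X and a raw score of 26 on form Y both sit at the 75th percentile of their own distributions, so they are treated as equivalent.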

Vertical Equating

Equating is used for more than just comparing scores from different test versions, however. This process can be used to compare performance on a single test across grades or age ranges. This process is known as vertical equating.

Imagine that the Reading Comprehension Assessment (RCA) is given to students in grades 3-6 at Hillview Elementary twice each year (once early in the fall and again toward the end of the school year in the spring). After each round of testing, raw scores can be placed on a new scale (through processes similar to those mentioned above) that allows teachers and administrators to see the progress made between grade levels. For instance, depending on the scale chosen, there may be a 100-point difference between scores for 5th graders and scores for 6th graders, signifying an increase in reading comprehension (of 100 points) across those grade levels.

Horizontal Equating

We can also compare students taking different test formats in the same grade through horizontal equating. In one sense, the comparison of scores across different test versions within the same age range (the focus of the first part of this entry) is an example of horizontal equating. This process can also be used to track the progress of students within a certain grade or age group over time. At Hillview, for instance, comparisons can be made between score data from one year to the next year to see if third grade student performance, for example, improved over time.

The equating process is necessary because each administration of the test will involve a different version of that assessment. While the versions may be arranged alike and contain similar items, it is probable that there will be at least slight differences in the way a group responds to the versions; these differences are accounted for through the equating process. Both horizontal and vertical methods of equating are concerned with the trend of growth: the observed or expected rise in test scores over time or across age groups that results from various factors, most notably learning.

Administration Guidelines

The final process in constructing a standardized test is to develop a set of administration guidelines. The guidelines spell out in exact detail how and to whom the test is to be administered. This includes the procedures for verifying a person's identity upon arrival, protocols for setting up the testing environment (e.g., seating students in a random fashion), how much time is allowed for each section of the test and the test as a whole, standardized instructions, what information can and cannot be provided in response to test taker questions, rules regarding breaks, etc. Although this part of the process may seem mundane, it is crucial to ensuring that the test is administered in exactly the same way each time it is given. By doing so, the test is said to be standardized because the administration process is controlled as much as possible so as not to influence the interpretation of the scores.

Summary

The process of constructing a standardized test is very complex, labor intensive, and expensive. Although each test has unique considerations and challenges, the process typically encapsulates several general steps. The first, and arguably the most important, step is to define the test's purpose, followed by its related content domain. Once both have been defined, test developers outline a list of test specifications, detailing the characteristics of the test and its items. Expert item writers then construct an initial pool of test items based on the table of specifications. The pool of items undergoes an extensive review process, including an examination by item editors and a panel of experts. In addition, the items are field-tested with a sample of the test's target population. The results of the field test are statistically analyzed to determine the quality of the items. Any items deemed of poor quality are either revised or removed. Test developers also conduct validity and reliability studies to make sure the test measures what it is supposed to measure and produces consistent scores. Lastly, before a test is published for wide-scale use, a set of administration guidelines is written to ensure that the test is administered in a standardized fashion.