Presentation: Developing Standardized Tests
Mr. Rodriguez routinely sets aside class time so that his students can take standardized tests. It seems that more and more tests are required at the national, state, or local level, and as he is asked to give up more instructional time so that his students can prepare for or take these tests, he wonders more about the validity and precision of these assessments. What are they supposed to do? Where did they come from? How are they made? He would like to know how these tests are developed and whether he can trust the scores they produce.
Achievement versus Aptitude

One way to categorize standardized tests in education is to group them as either achievement or aptitude tests. Achievement tests attempt to measure a person's current knowledge or skill level in a given realm. Examples range from professional licensure exams (such as the Uniform Certified Public Accountant Examination for accountants or the bar exam for lawyers) to teacher-made English midterms aimed at seeing how well students have mastered the course material.
Criterion-referenced versus Norm-referenced

Another way to broadly categorize a standardized test is by the way its scores are interpreted; that is, whether it is a criterion-referenced test or a norm-referenced test. Recall from the lesson on score interpretation that criterion-referenced tests are used to determine whether each student has achieved specific skills or concepts within a content domain. Norm-referenced tests are used to rank each student's performance on an assessment in comparison to a norm group.
Operationalizing Constructs

Defining a content domain varies in difficulty depending on the complexity of the test's purpose. As previously mentioned, the purpose of the test may be to measure an abstract concept, which psychometricians call a construct. If this is the case, the construct needs to be operationally defined. In other words, the test developers need to translate the construct into observable behaviors that can be measured. For example, suppose test developers at ACME Testing Company set out to develop a test to measure a person's level of creativity. Since creativity is an abstract characteristic that cannot be directly observed, the developers need to operationally define exactly what it is. Is creativity how artistic a person is? Or is it how well a person can solve a problem in an atypical way? Or is it both? Does it relate to a person's level of intelligence? Maybe. What about their height? Probably not. How a construct is defined directly determines how the test can measure it. Regardless of how a construct is defined, the definition needs to be supported by a theoretical rationale and, if possible, previous research.
The stronger the rationale that supports a specific definition, the stronger the evidence will be that the test measures what it claims to measure. As previously mentioned, it is not practical to have a test that includes the entire content domain related to a test's purpose. As a result, a test includes only a sample of items that are believed to adequately represent the domain as a whole. The process of selecting items to use from the overall content domain is known as domain sampling. The guidelines for domain sampling are dictated by test specifications.
Test Specifications

Test specifications provide an overall outline of how the test should be constructed from the content domain(s). They define the test format (e.g., paper-and-pencil, oral exam, computerized); how many forms or versions of the test there will be; whether it will be a timed test, and if so, how much time will be given; the total number of items for each section and for the test as a whole; the type(s) of item formats (e.g., multiple-choice, true-false, short-answer); the scoring rules or rubric; and the method of score interpretation (e.g., criterion-referenced or norm-referenced).
In addition, the test specifications typically result in a table of specifications, or "blueprint," that determines the make-up and characteristics of the test items. A simple blueprint outlines the number and proportion of test items that should be of a particular format and from a particular content area. Which formats to use, and in what proportion, depends on the test's purpose and on established test-writing rules. Which content areas to include, and in what proportion, follows directly from how the content domain is defined.
Test blueprint

The table represents a blueprint for a hypothetical second grade math test. Imagine that, in order to create this table, a review by experts revealed that an average second grade math curriculum covers addition, subtraction, multiplication, and simple measurement quantities. Together these subjects represent the content domain of second grade math. However, the topics are not covered equally. Addition and subtraction together take up approximately 80 percent of the coursework, whereas multiplication and measurement quantities are each covered only 10 percent of the time. Therefore, a 20-point test should represent the content domain by including eight questions on addition (40%), eight questions on subtraction (40%), two questions on multiplication (10%), and two questions on measurement quantities (10%).

Often a more complex blueprint is constructed by weighting each of the content areas based on the subject's relative importance to the test's purpose. For example, if the second grade math test described above is being used to select students for an advanced placement math class, then the more difficult multiplication items might be seen as being "worth more" than the other items. Another method is to include a dimension of cognitive abilities, detailing the specific competencies that are being utilized within a content area. Bloom's taxonomy of cognitive operations is frequently used for this purpose. It consists of six levels of cognitive processing: knowledge, comprehension, application, analysis, synthesis, and evaluation.

A test blueprint should provide sufficient detail to guide the development process toward a satisfactory test. If two expert test developers were given the same blueprint, you would expect the tests they developed to have different items but to be so similar that it would be a matter of indifference which test you used.
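The proportional allocation described for the second grade math blueprint can be sketched in a few lines of Python. This is only an illustration, not a procedure taken from the lesson: the function name `allocate_items` and the largest-remainder rounding rule are assumptions made for the example, while the content areas and percentages come from the text.

```python
def allocate_items(total_items, weights):
    """Split total_items across content areas in proportion to their weights.

    Hypothetical helper: rounds down first, then hands any leftover items
    to the areas with the largest fractional remainders.
    """
    total_weight = sum(weights.values())
    exact = {area: total_items * w / total_weight for area, w in weights.items()}
    counts = {area: int(share) for area, share in exact.items()}
    leftover = total_items - sum(counts.values())
    by_remainder = sorted(exact, key=lambda a: exact[a] - counts[a], reverse=True)
    for area in by_remainder[:leftover]:
        counts[area] += 1
    return counts


# Curriculum coverage (in percent) as described in the text.
coverage = {
    "addition": 40,
    "subtraction": 40,
    "multiplication": 10,
    "measurement quantities": 10,
}

blueprint = allocate_items(20, coverage)
for area, n_items in blueprint.items():
    print(f"{area}: {n_items} items ({n_items / 20:.0%})")
```

With a 20-item test this gives the 8/8/2/2 split from the example. The rounding rule only matters when the percentages do not divide the test length evenly.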
The field tests provide test developers with real-world data that allow them to perform various statistical analyses on each item to determine its quality. As with all stages of the item review process, the key objective is to identify and then revise or replace any items that are deemed of poor quality. The quality of an item is usually measured on three dimensions: difficulty level, discrimination index, and differential item functioning.
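As a rough illustration of the first two dimensions, the sketch below uses two conventional classroom-level statistics: item difficulty as the proportion of test takers answering correctly, and a discrimination index as the difference in that proportion between the highest- and lowest-scoring groups. The lesson does not say which formulas the developers use, so the function names, the 27 percent group size, and the field-test data are all assumptions made for the example.

```python
def item_difficulty(item_responses):
    """Proportion of test takers who answered the item correctly (1 = correct, 0 = incorrect)."""
    return sum(item_responses) / len(item_responses)


def discrimination_index(item_responses, total_scores, fraction=0.27):
    """Upper-group minus lower-group proportion correct.

    Groups are the top and bottom `fraction` of test takers ranked by
    total test score (27% is a common convention, assumed here).
    """
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    k = max(1, round(len(order) * fraction))
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_responses[i] for i in upper) / k
    p_lower = sum(item_responses[i] for i in lower) / k
    return p_upper - p_lower


# Made-up field-test data for ten test takers.
item = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]               # responses to one item
totals = [34, 31, 30, 28, 27, 22, 20, 18, 15, 12]   # total test scores

difficulty = item_difficulty(item)
discrimination = discrimination_index(item, totals)
print(f"difficulty = {difficulty:.2f}, discrimination = {discrimination:.2f}")
```

A difficulty near 0 or 1 means almost everyone missed or answered the item correctly. A discrimination index near zero or below suggests the item does not separate stronger from weaker test takers.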
Differential Item Functioning

Differential item functioning (DIF) examines the scores of test takers across groups to see if any of the questions were easier or harder for one or more groups. In other words, DIF is a statistical technique used to help determine if any of the items might be biased for or against a particular group. Test developers typically evaluate DIF (pronounced "diff") along ethnic or gender lines. For example, do Hispanic and Caucasian test takers of similar ability level perform the same on an item? (Matching on ability is a key prerequisite: it ensures that the groups being compared are hypothetically similar in every way except for the characteristic being examined, i.e., their ethnicity.) If so, then the item has a low level of DIF and is most likely not biased. If performance is significantly different, it may well be that the item is biased and should be revised or removed.
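One widely used way to compare groups "of similar ability level" is the Mantel-Haenszel procedure, which pools a 2x2 table (group by correct/incorrect) across ability strata. The lesson does not name the statistic the developers use, so this technique is swapped in for illustration; the function names, group labels, and counts are hypothetical.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Common odds ratio for one item, pooled over ability strata.

    records: iterable of (group, stratum, correct), where group is
    "reference" or "focal". A value near 1 suggests little DIF; a value
    above 1 means the item was easier for the reference group among
    test takers matched on ability.
    """
    cells = defaultdict(lambda: {"ref_right": 0, "ref_wrong": 0,
                                 "foc_right": 0, "foc_wrong": 0})
    for group, stratum, correct in records:
        prefix = "ref" if group == "reference" else "foc"
        suffix = "_right" if correct else "_wrong"
        cells[stratum][prefix + suffix] += 1

    numerator = denominator = 0.0
    for c in cells.values():
        n = sum(c.values())
        numerator += c["ref_right"] * c["foc_wrong"] / n
        denominator += c["ref_wrong"] * c["foc_right"] / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return numerator / denominator


def make_records(group, stratum, n_right, n_wrong):
    """Hypothetical helper that expands counts into individual records."""
    return [(group, stratum, True)] * n_right + [(group, stratum, False)] * n_wrong


# Made-up counts: both groups perform identically within each ability stratum.
fair_item = (make_records("reference", "low", 4, 6) + make_records("focal", "low", 4, 6)
             + make_records("reference", "high", 8, 2) + make_records("focal", "high", 8, 2))

# Made-up counts: the focal group does worse at every ability level.
suspect_item = (make_records("reference", "low", 4, 6) + make_records("focal", "low", 2, 8)
                + make_records("reference", "high", 8, 2) + make_records("focal", "high", 6, 4))

fair_ratio = mantel_haenszel_odds_ratio(fair_item)
suspect_ratio = mantel_haenszel_odds_ratio(suspect_item)
print(f"fair item: {fair_ratio:.2f}, suspect item: {suspect_ratio:.2f}")
```

In practice the strata are usually bands of total test score, and a significance test or effect-size classification is applied before an item is flagged. This sketch stops at the pooled odds ratio.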
One round of testing had just been completed at Hillview Elementary and the surrounding schools in the district, and scores were starting to roll back in. Curious and somewhat concerned about the performance of the students at Hillview, Mrs. Sullivan helped arrange a meeting among teachers to discuss their results. The group gathered after school, each teacher bringing the score reports from their classes. The reports contained scores for each individual (linked to a student by an ID number), as well as class statistics such as the mean and standard deviation of scores. Almost immediately it became evident that there were unexpected disparities in various aspects of the test across the group of teachers.

The first thing Mrs. Sullivan noticed was that each test report came back with two scores: one labeled a "Raw" score and one a "Scaled" score. While she knew that the raw score represented a student's actual score on a test, she was not sure how that related to the scaled score (which gave a different score value), or why it was even necessary to report the score two different ways. She remembered learning that raw scores are often "scaled," or changed to scaled scores, which basically puts those scores in a new range with a different mean and standard deviation. (The SAT, for example, takes raw scores and scales them so that the distribution has a mean of 500 and a standard deviation of 100. The ACT, a similar test covering similar material, scales results to have a mean of 18 and a standard deviation of 6.) She wasn't sure why this scaling was being done here, though. It was noted that while most sections of the test (which covered a range of topics from math to reading comprehension and language skills) contained 35 questions or so, the scaled scores ranged anywhere from 40 to 100.

Ms. White and Mr. Juarez, both of whom taught 6th grade, noticed something in addition to this. Their reports also contained raw and scaled scores. However, a raw score of 28 on Ms. White's report translated to a scaled score of 90. On Mr. Juarez's report, the same raw score of 28 translated to a scaled score of 93. What was the reason for this? Someone made the observation that the two teachers had been given different forms of the test for use with their classes, which could help explain the differences.

In fact, the discrepancy in scaled scores between the reports of Ms. White and Mr. Juarez was due in part to the fact that they were issued different versions of the test, and also to a process known as equating. Equating is the process by which raw scores from different tests (or different versions of the same test, or the same test across different grade levels) are translated to a new scale on which direct comparisons can be made across test versions. Equating is a necessary step in score analysis when multiple versions of a test have been used, because it is common for one version of a test to be slightly easier or more difficult than another. For example, it may be easier for a student to score a "28" on one form of a test than on another form.
There are different ways in which score scales can be adjusted. Sometimes, scaling a score is as simple as adding a constant value to each score on a certain form, then adding a different constant value to scores on another form, and so on, across all forms of a test. Other times, tests may use linear equating, which involves specifying the desired mean and standard deviation of the scaled score distribution ahead of time, then using these values and the calculated raw score mean and standard deviation to directly calculate new scaled score values. Yet another type of equating, equipercentile equating, takes into account the percentile ranks of scores on multiple versions of a test and relates them accordingly.
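The linear approach can be sketched as follows: each raw score is converted to a standard score on its own form and then placed on a scale with the chosen mean and standard deviation. This is a simplified illustration, not the procedure used for the Hillview reports; the function name, the two sets of raw scores, the target scale (mean 80, standard deviation 10), and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev


def linear_scale(raw_scores, target_mean, target_sd):
    """Return a function mapping a raw score on this form to the target scale.

    scaled = target_mean + target_sd * (raw - raw_mean) / raw_sd
    """
    raw_mean = mean(raw_scores)
    raw_sd = pstdev(raw_scores)  # population SD; a sample SD could be used instead

    def to_scaled(raw):
        return target_mean + target_sd * (raw - raw_mean) / raw_sd

    return to_scaled


# Made-up raw scores from two forms; Form B is slightly harder (lower mean).
form_a_raw = [20, 24, 28, 30, 33]
form_b_raw = [18, 22, 26, 28, 31]

scale_a = linear_scale(form_a_raw, target_mean=80, target_sd=10)
scale_b = linear_scale(form_b_raw, target_mean=80, target_sd=10)

print(f"Raw 28 on Form A -> {scale_a(28):.1f}")
print(f"Raw 28 on Form B -> {scale_b(28):.1f}")
```

Because Form B is harder in this made-up data, the same raw score of 28 lands higher on the common scale for Form B than for Form A, which is the kind of difference the two teachers noticed in their reports.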
Vertical Equating

Equating is used for more than just comparing scores from different test versions, however. It can also be used to compare performance on a single test across grades or age ranges, a process known as vertical equating. Imagine that the Reading Comprehension Assessment (RCA) is given to students in grades 3-6 at Hillview Elementary twice each year (once early in the fall and again toward the end of the school year in the spring). After each round of testing, raw scores can be placed on a new scale (through processes similar to those mentioned above) that allows teachers and administrators to see the progress made between grade levels. For instance, depending on the scale chosen, there may be a 100-point difference between scores for 5th graders and scores for 6th graders, signifying an increase in reading comprehension (of 100 points) across those grade levels.
Horizontal Equating

We can also compare students in the same grade who take different test formats through horizontal equating. In one sense, the comparison of scores across different test versions within the same age range (the focus of the first part of this entry) is an example of horizontal equating. This process can also be used to track the progress of students within a certain grade or age group over time. At Hillview, for instance, score data from one year can be compared with data from the next year to see whether third grade student performance improved over time.
The equating process is necessary because each administration of the test will involve a different version of that assessment. While the versions may be arranged alike and contain similar items, it is probable that there will be at least slight differences in the way a group responds to the two versions, and these differences are accounted for through the equating process. Both horizontal and vertical methods of equating are concerned with the trend of growth: the observed or expected rise in test scores over time or across age groups that results from various factors, most notably learning.
The final step in constructing a standardized test is to develop a set of administration guidelines. The guidelines spell out in exact detail how and to whom the test is to be administered. This includes the procedures for verifying a person's identity upon arrival, protocols for setting up the testing environment (e.g., seating students in a random fashion), how much time is allowed for each section of the test and for the test as a whole, standardized instructions, what information can and cannot be provided in response to test taker questions, rules regarding breaks, and so on. Although this part of the process may seem mundane, it is crucial to ensuring that the test is administered in exactly the same way each time it is given. A test administered this way is said to be standardized, because the administration process is controlled as much as possible so that it does not influence the interpretation of the scores.
The process of constructing a standardized test is complex, labor-intensive, and expensive. Although each test has unique considerations and challenges, the process typically follows several general steps. The first, and arguably the most important, step is to define the test's purpose, followed by its related content domain. Once both have been defined, test developers outline a list of test specifications, detailing the characteristics of the test and its items. Expert item writers then construct an initial pool of test items based on the table of specifications. The pool of items undergoes an extensive review process, including an examination by item editors and a panel of experts. In addition, the items are field-tested with a sample of the test's target population. The results of the field test are statistically analyzed to determine the quality of the items. Any items deemed of poor quality are either revised or removed. Test developers also conduct validity and reliability studies to make sure the test measures what it is supposed to measure and produces consistent scores. Lastly, before a test is published for wide-scale use, a set of administration guidelines is written to ensure that the test is administered in a standardized fashion.
Principles of Measurement