|
Accommodations:
Adaptations made to accommodate special needs, such as providing a large font for visually impaired students. Accommodations are intended to level the playing field without giving any advantage to those receiving them.
Adequate Yearly Progress (also AYP):
Under the federal No Child Left Behind Act, AYP represents the annual academic performance targets in reading and math that the State, school districts, and schools must reach to be considered on track for 100% proficiency by school year 2013-14.
Alternate Assessments:
A special form of an assessment that has been constructed to measure the same broad construct, but in a way appropriate for students with extreme disabilities.
AYP:
See Adequate Yearly Progress
Bar Exam:
A generic term for a standardized, criterion-referenced test that prospective lawyers must pass in order to be licensed to practice in a state.
Civil Rights Act of 1964:
The 1964 Civil Rights Act made it illegal in the United States to discriminate on the basis of race, color, religion, national origin, or sex in public establishments and required employers to provide equal employment opportunities.
Coefficient (or Cronbach's) alpha:
An internal consistency measure of a test, equal to the average of all possible split-half reliabilities.
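As a rough illustration, alpha is usually computed from item and total-score variances rather than by averaging every possible split. The sketch below assumes that standard variance-based formula and a small, made-up score matrix (one row per examinee, one column per item); the function name `cronbach_alpha` is hypothetical.

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Variance-based coefficient alpha.

    scores: list of rows, one row per examinee, one column per item.
    """
    k = len(scores[0])                                  # number of items
    item_variances = [pvariance(col) for col in zip(*scores)]
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Made-up data: 4 examinees, 3 items scored 1 (correct) or 0 (incorrect).
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(scores)
print(round(alpha, 2))
```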
Concurrent validity:
A type of criterion validity evidence which demonstrates that scores on one test are related to scores on another test, which could be administered instead.
Construct:
A theorized phenomenon that cannot be directly observed or measured (e.g., intelligence).
Construct validity:
The extent to which test scores reflect the trait intended for measurement and the underlying theory or model of behavior which explains the trait.
Content validity:
The extent to which a test covers all of the relevant conceptual space instead of focusing on one narrow type or dimension of the concept.
Criterion validity:
The extent to which test scores are related to an external standard or benchmark that is known to be a good indicator of the same or similar construct.
Criterion-referenced Test:
A test that evaluates student performance against a predetermined set of objectives or criteria (e.g., bar exam). Contrasts with norm-referenced.
DIF:
See Differential Item Functioning
Differential Item Functioning (also DIF):
A measure of item quality, based on various statistical techniques, that examines whether an individual test item performs differently for relevant subgroups and so may be biased toward or against one or more of them.
Difficulty:
A measure of item quality that examines the proportion of participants who answered an item correctly.
Discrimination:
A measure of item quality that examines the proportion of high-performing participants who answered an item correctly in comparison to the proportion of low-performing participants who answered it correctly.
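Difficulty and discrimination can both be computed from a simple table of right/wrong responses. The sketch below is one possible reading of the two definitions above: the function names, the made-up data, and the default of comparing the top and bottom 27% of examinees (a common convention, not something this glossary specifies) are all assumptions.

```python
def item_difficulty(item_responses):
    """Proportion of examinees who answered the item correctly (1 = correct)."""
    return sum(item_responses) / len(item_responses)

def discrimination_index(item_responses, total_scores, fraction=0.27):
    """Proportion correct in the high-scoring group minus the low-scoring group.

    fraction: share of examinees placed in each group (assumed convention).
    """
    n_group = max(1, int(len(total_scores) * fraction))
    # Rank examinees from highest to lowest total test score.
    order = sorted(range(len(total_scores)),
                   key=lambda i: total_scores[i], reverse=True)
    high, low = order[:n_group], order[-n_group:]
    p_high = sum(item_responses[i] for i in high) / n_group
    p_low = sum(item_responses[i] for i in low) / n_group
    return p_high - p_low

# Made-up data: one item's responses and each examinee's total test score.
item = [1, 1, 1, 0, 1, 0, 0, 0]
totals = [38, 35, 31, 30, 24, 22, 15, 11]

print(item_difficulty(item))                             # half answered correctly
print(discrimination_index(item, totals, fraction=0.5))  # top half vs. bottom half
```

A value near 0 for the discrimination index would suggest the item does not separate high and low performers; a negative value would suggest low performers did better on it.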
Domain:
The overall universe of content or subject matter that relates to the nature and purpose of a test.
Domain Sampling:
The process of selecting a representative set of items from a test's content domain.
Equating:
The process by which raw scores from different tests or different versions of the same test are translated to a new scale so that direct comparisons can be made.
Equipercentile Equating:
A process of equating based on the percentile ranks of scores.
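A minimal sketch of the idea, assuming a simple midpoint definition of percentile rank and a nearest-match lookup: a score on form X is mapped to the form Y score that has the closest percentile rank. Operational equating typically adds smoothing and interpolation, which are omitted here; the function names and data are hypothetical.

```python
def percentile_rank(score, distribution):
    """Percent of scores below `score`, counting half of the ties."""
    below = sum(1 for s in distribution if s < score)
    equal = sum(1 for s in distribution if s == score)
    return 100 * (below + 0.5 * equal) / len(distribution)

def equipercentile_equate(score_x, form_x_scores, form_y_scores):
    """Return the form Y score whose percentile rank is closest to score_x's."""
    target = percentile_rank(score_x, form_x_scores)
    candidates = sorted(set(form_y_scores))
    return min(candidates,
               key=lambda y: abs(percentile_rank(y, form_y_scores) - target))

# Made-up score distributions for two forms of the same test.
form_x = [10, 12, 15, 18, 20]
form_y = [14, 16, 19, 22, 25]

print(equipercentile_equate(15, form_x, form_y))
```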
Error variance:
Variance in test scores due to random or irrelevant sources.
Generalizability:
The extent to which findings (from a test or study) can be generalized (or extended) to natural settings (outside the classroom or lab).
Horizontal Equating:
A process of equating which allows for meaningful comparisons for the same group of students across time.
Inter-rater reliability:
The correlation between two different scorers' scores on the same test. This is a measure of agreement or consistency between different test scorers.
Internal consistency reliability:
The degree of consistency in responses across the many items of a single test or the consistency with which those items measure a single dimension.
Judgmental review:
A method for detecting bias that uses the opinions of individuals representing relevant subgroups of the population of potential examinees to assess the fairness of items on a test.
Linear Equating:
A process of equating which involves specifying the desired mean and standard deviation of the final distribution ahead of time and using those values to directly calculate new scores.
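One way to picture this: each raw score is converted to a standardized distance from the raw mean and then placed on the target scale. The sketch below assumes population standard deviations and a hypothetical target scale with mean 100 and standard deviation 15; the function name and data are illustrative only.

```python
from statistics import mean, pstdev

def linear_equate(scores, target_mean, target_sd):
    """Rescale scores so they have the chosen mean and standard deviation."""
    raw_mean = mean(scores)
    raw_sd = pstdev(scores)
    return [target_mean + target_sd * (s - raw_mean) / raw_sd for s in scores]

# Made-up raw scores (mean 5, standard deviation 2).
raw_scores = [2, 4, 4, 4, 5, 5, 7, 9]
scaled = linear_equate(raw_scores, target_mean=100, target_sd=15)
print(scaled)
```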
Modifications:
In terms of testing, a modified test is one that has been changed from its original form to such an extent that it can no longer be considered an accommodated version of the original test. See alternate assessments.
Norm-referenced Test:
A test that evaluates student performance by comparing students to each other (e.g., ACT). Contrasts with criterion-referenced.
Objective Scoring:
Scoring systems which do not require any expertise or opinion. If a test is computer scorable, it is objectively scored.
Parallel-forms reliability:
The correlation between test scores from two versions (forms) of a test that are presumed to have the same measurement characteristics.
Percent Correct:
A common scoring system which divides the total points received by the total points possible. That proportion is then multiplied by 100 to get a percentage.
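The calculation is simple enough to show directly. The function name and the example numbers are made up.

```python
def percent_correct(points_received, points_possible):
    """Points received as a percentage of points possible."""
    return 100 * points_received / points_possible

# Example: 17 of 20 possible points.
print(percent_correct(17, 20))
```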
Predictive validity:
A type of criterion validity evidence which demonstrates that scores on one test are related to scores on another test, which cannot be administered until sometime in the future.
Psychometrician:
An expert in the statistical analysis and design of tests.
Random error:
Error that occurs by chance and is not consistent or predictable. Contrasts with systemic error.
Rubric:
An organized set of performance criteria, each associated with a range of point values, often used for scoring performance-based assessments, constructed-response items, and other forms of supply items.
Score Interpretation:
The method by which a student's score on a test is evaluated. Criterion-referenced and norm-referenced are the two primary ways of interpreting scores on a test.
Split-half reliability:
An internal consistency measure found by correlating two equal halves of a test (such as correlating the odd-numbered problems with the even-numbered problems).
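The odd/even split described above can be sketched as follows, assuming a Pearson correlation between the two half-test totals. The helper names and the data are made up, and no correction for the shortened test length (such as the Spearman-Brown formula) is applied.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def split_half_reliability(scores):
    """Correlate totals on odd-numbered items with totals on even-numbered items."""
    odd_totals = [sum(row[0::2]) for row in scores]    # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in scores]   # items 2, 4, 6, ...
    return pearson_r(odd_totals, even_totals)

# Made-up data: 4 examinees, 4 items scored 1 (correct) or 0 (incorrect).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(split_half_reliability(scores), 3))
```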
Standard error of measurement:
A statistical index of measurement error which gives an estimate of how much an examinee's score might vary across multiple administrations of the same test. The standard error also indicates how close a person's score probably is to their typical level of performance on that test.
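The glossary does not give a formula, but in classical test theory the standard error of measurement is commonly estimated as the standard deviation of test scores multiplied by the square root of one minus the reliability. The sketch below assumes that formula; the function name and the numbers are hypothetical.

```python
import math

def standard_error_of_measurement(score_sd, reliability):
    """Classical estimate: SD of test scores times sqrt(1 - reliability)."""
    return score_sd * math.sqrt(1 - reliability)

# Hypothetical test: score SD of 10 and a reliability estimate of 0.91.
sem = standard_error_of_measurement(10, 0.91)

# A rough band of one SEM on either side of an observed score of 75.
observed = 75
band = (observed - sem, observed + sem)
print(round(sem, 2), band)
```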
Subjective Scoring:
Scoring systems which require some expertise or experience. If two people following the same scoring key and instructions might disagree on the correct number of points to assign, then the scoring is subjective.
Systemic error:
Error that occurs in a predictable manner for all, or a subgroup, of test takers. Contrasts with random error.
Table of Specifications:
A table or "blueprint" that provides a structured framework for matching test items to the chosen performance domain. It details how a test's items should be constructed, from what content areas, and in what proportions.
Test blueprint:
A detailed, written plan for a test that typically includes descriptions of the test's purpose and target audience; the content or performance areas it will cover; the types of items and number to be written for each content or performance area, their scoring, and other characteristics; the test administration method; and desired psychometric characteristics of the items and the test.
Test Specifications:
An overall outline that details the characteristics of a test. More specifically, it defines a test's format, the total number and proportion of test items from each content domain, the types of item formats to be used, how the items will be scored, the method of interpreting test scores, and the time limit (if applicable) for taking the test.
Test-retest reliability:
The correlation between a student's scores on the same test taken at two different points in time. For example, a student takes an exam during week one of school and then retakes the same exam during week three.
True variance:
Variance in test scores due to actual differences in the measured trait or ability.
Unified view of validity:
The modern view that validity is a single concept that must be established by analyzing a variety of different types of evidence.
Universal design:
A single design (for example, of a test) that can serve a diverse population.
Validity:
The extent to which a test measures what it is intended to measure.
Validity coefficient:
A correlation between scores on one test and on another test believed to measure the same or similar construct.
Variance:
A statistical index that describes how much test scores differ from each other. Variance is composed of both the actual differences among examinees and the random influences which affect test performance.
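A small sketch of both ideas, assuming the population variance formula and the classical test theory view that observed variance is the sum of true variance and error variance. True and error variance cannot be observed directly, so the values passed to the hypothetical `reliability_from_variances` helper are invented for illustration.

```python
from statistics import pvariance

# Made-up test scores for eight examinees.
test_scores = [2, 4, 4, 4, 5, 5, 7, 9]
observed_variance = pvariance(test_scores)   # mean squared deviation from the mean

def reliability_from_variances(true_variance, error_variance):
    """Share of total score variance that reflects real differences in the trait."""
    return true_variance / (true_variance + error_variance)

# Hypothetical split of the observed variance into true and error components.
print(observed_variance, reliability_from_variances(3, 1))
```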
Vertical Equating:
A process of equating which allows for meaningful comparison on a single test across grades or age ranges.
|