Policy: Assessment Validity and Reliability at National Spanish Exams
Updated 12/2022
NSE works hard to ensure that its tests are reliable and valid. NSE recognizes that there are several types of validity evidence, including criterion-, construct-, and content-based evidence. NSE is most concerned with content validity, the degree to which an assessment evaluates all aspects of the topic it is designed to measure (https://statisticsbyjim.com/basics/content-validity/). Validity is also the extent to which inferences, conclusions, and decisions made on the basis of test scores are appropriate and meaningful. Both validity and reliability are needed for a meaningful assessment (https://rlhoover.people.ysu.edu/OAT-OGT/reliability_validity.html).
Reliability refers to the degree to which a test is consistent across time. A test’s validity is determined by how well it samples the range of knowledge, skills, and abilities that students were expected to acquire in the period covered by the exam. Test reliability depends on grading consistency and on discrimination between students of differing performance levels. Well-designed multiple-choice tests are generally more valid and reliable than other test formats, such as essays, because they sample material more broadly, discrimination is easier to determine, and scoring is consistent. (https://www.smu.edu/-/media/Site/Provost/assessment/Resources/MultipleChoices/Improving-Multiple-Choice-QuestionsUNCCH.pdf?la=en&hash=E8167D388358BFCCB9FD7ECFB154770FCEE73FEF#:~:text=Validity%20and%20Reliability,-The%20two%20most&text=Well%2Ddesigned%20multiple%20choice%20tests,scoring%20consistency%20is%20virtually%20guaranteed).
Development, Creation, and Analysis
The National Spanish Exam Writing Committee employs a set of test development procedures carefully focused on assuring that the questions on each level of the exam accurately assess the learner outcomes for that level. The principal objective is to match the content of the test with the assessment domain through specific procedures during the test development process.
- The Director of Test Development hires candidates (professionals in the field of language acquisition) for assessment creation and trains them on NSE policies and procedures.
- The assessment team reviews standards and test needs.
- The team reviews benchmarks and levels under each standard, including ACTFL standards when appropriate.
- The team reviews learner outcomes and/or “can do” statements when applicable, under each benchmark by level.
- Team members write question(s).
- Other team members review and discuss question(s).
- The Director of Test Development confirms the relationship between question(s) and learner outcome(s) and updates assessment.
After the test draft is completed and reviewed, the National Spanish Examination Review Committee rates the content appropriateness of each test through a systematic review. This review relies heavily on human judgment; therefore, the teachers chosen for this committee are experts in standards-based curriculum, instruction, and assessment.
Post Assessment Analysis
After the tests have been administered, the National Spanish Exam analyzes the question details reports (QDRs), which show how a specific sample of students performed on a specific item. A validity panel of trained professionals who have taught across all levels uses the QDRs to confirm that all test items have face and content validity.
Analyzing test answer results is imperative for any test to remain valid and reliable. NSE analyzes test items each year as tests are created. Questions are immediately changed, moved, or removed if any of the following occur:
- Distractor effectiveness: a distractor is chosen by fewer than 5% of students (if the question is otherwise sound, only that distractor is replaced)
- Item difficulty: fewer than 50% of students answer correctly (the question is often moved up to the next level)
- Item difficulty: more than 95% of students answer correctly (the question is often moved down to a lower level)
- The question is outdated or otherwise flawed (it mentions watching a movie on a VCR, for example)
- Discrimination index: lower than 0.20
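The thresholds above can be sketched in code. The following is a minimal, hypothetical illustration (the names and numbers are invented, and the actual NSE analysis tooling is not described in this policy), using one common definition of the discrimination index: upper-group minus lower-group proportion correct.

```python
# Hypothetical sketch of the item-review thresholds listed above.
# Difficulty (p) is the proportion answering correctly; the discrimination
# index here is upper-group minus lower-group proportion correct.

def flag_item(p, discrimination, distractor_shares):
    """Return the review flags an item would trigger under the policy."""
    flags = []
    if p < 0.50:
        flags.append("too hard: consider moving up a level")
    if p > 0.95:
        flags.append("too easy: consider moving down a level")
    if discrimination < 0.20:
        flags.append("low discrimination: revise or remove")
    for choice, share in distractor_shares.items():
        if share < 0.05:
            flags.append(f"weak distractor {choice}: chosen by under 5%")
    return flags

# Example item: 60% answered correctly, discrimination 0.52,
# distractor D chosen by only 3% of students.
flags = flag_item(0.60, 0.52, {"B": 0.25, "C": 0.12, "D": 0.03})
# flags -> ["weak distractor D: chosen by under 5%"]
```

In this example the item passes the difficulty and discrimination checks, and only the rarely chosen distractor D would be replaced.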
The test questions at any given level also span a range of difficulty. The higher the difficulty, the more points a question is worth (1-3 points per question on average). Item difficulty is the number of students who answered an item correctly divided by the total number who answered it. Norm-referenced tests (NRTs) such as the NSE are contest based, with students’ results reported as percentiles; items are designed to have difficulty indexes between 0.4 and 0.6. This is determined, in part, by post-testing item analysis.
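As a minimal worked example of the difficulty index just defined (the counts here are hypothetical):

```python
def difficulty_index(num_correct, num_answered):
    """Item difficulty (p): proportion of students who answered correctly."""
    return num_correct / num_answered

# Hypothetical item: 130 of 260 students answered correctly.
p = difficulty_index(130, 260)       # 0.5
within_target = 0.4 <= p <= 0.6      # True: inside the 0.4-0.6 NRT band
```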
Analyzing test results each year after testing, and revising questions to make them more valid, helps NSE produce a consistently reliable and valid test every year. NSE currently uses the Kuder-Richardson 21 (KR-21) formula to calculate reliability. The average coefficient for a test is 0.90, and the highest possible is 1.0. The coefficients for each 2022 test were between 0.98 and 0.99, showing that the NSE is highly reliable. Results for 2022 are available online and can be found here.
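For reference, the KR-21 formula mentioned above can be computed as follows. The test statistics in the example are hypothetical, and KR-21 assumes dichotomously scored items of roughly equal difficulty.

```python
def kr21(k, mean, variance):
    """Kuder-Richardson 21 reliability coefficient.

    k: number of test items
    mean: mean total score across examinees
    variance: variance of total scores
    Assumes dichotomously scored items of roughly equal difficulty.
    """
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * variance))

# Hypothetical 50-item test with mean score 35 and score variance 90:
r = kr21(50, 35, 90)
# round(r, 2) -> 0.9
```

A narrower score variance or a shorter test would lower the coefficient, which is one reason the yearly item analysis described above matters for keeping reliability high.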