How Reliable Are Test Results?

Hi there! I’m here to share some tidbits from Educational Assessment of Students, which you can purchase via the Pitt State Bookstore here, or via Pearson here. Brookhart and Nitko do wonderful work explaining everything we need to know in order to become teachers. Chapter Four, Reliability of Assessment Results, is summarized below. For those of you at home wondering, “Just how reliable are test results?” Let’s dive in.

Here are some important terms from this chapter so you can prepare yourself:

decision consistency index

domain of achievement

homogeneous tasks

inter-rater reliability

measurement error

parallel forms

percentage of agreement

reliability

reliability coefficient

scorer reliability

speeded assessment

stability coefficient

This entire chapter is devoted to reliability, so before we get going, we need to be sure we understand what that is. Reliability is “the degree to which students’ results remain consistent over replications of an assessment procedure” (Brookhart and Nitko, p. 67). To be honest, I struggled with this chapter, so I started Googling a bit. This video helped me better grasp the concept of reliability, especially as it pertains to validity. Maybe it will help you, too? I initially thought reliability and validity were the same thing. Though they are related, they are not interchangeable.

Validity, on the other hand, “relates to the confidence we have in interpreting students’ assessment results and in using them to make decisions” (Brookhart & Nitko, p.67). I continued to consult Google because my brain kept scrambling the two ideas. The more I read, the more it seemed that reliability is related to consistency, while validity is associated with accuracy. If we go a step further, measurement error is about the inconsistencies in assessment results.

According to Brookhart and Nitko, there are several potential explanations for inconsistencies in test outcomes. Some may be related to the actual content of the assessment, while others could be due to the occasion on which the test was given. Can’t we all empathize with someone who’s had a No Good, Very Bad Day? Students feel the impact of things like a headache, a fight with a friend, or even an upset stomach; these occasions can lead to an unusual performance on a test, thereby creating inconsistency.

The authors go on to describe reliability concerns with various types of assessments and how to address them. For example, they say about objective assessments, “Tests should have enough items that the consistency can show itself” (Brookhart & Nitko, p.70). Essay or project formats, on the other hand, will require rubrics and another set of eyes if/when possible. More than likely, there is just one person grading, so teachers should avoid looking at students’ names and grade one item at a time. This will promote solid and accurate scores. When a student has been out sick, teachers should either use a separate test for makeup work or make sure other students keep the test’s contents under wraps.

The number of items or questions is also critical for the reliability of oral questions and observations. In order to be certain our students have honed a new skill, we should come up with several questions about that skill. According to Brookhart and Nitko, it’s also necessary to provide extra time for pupils to respond to items: “Oral or observed performance should indicate achievement and not lack of time” (p.71). For reliability with self-assessments, teachers should cultivate a class culture that is warm and welcoming, a place where students feel secure enough to share their errors and be vulnerable about weaknesses.

The next part of this chapter addresses several types of reliability coefficients. These reliability coefficients all center around the issues of time, content, and raters (teachers/judges). These are all factors to be considered when teachers need to navigate their assessment results. Scores may be impacted by the day they’re given, and it’s important to determine whether they are consistent over time. There’s also the matter of content; are scores considered consistent if two relatively similar test forms are used? Lastly, what happens to test scores if different teachers are grading them? Are they consistent/equivalent, or not?
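To make the time and rater questions a little more concrete, here is a minimal Python sketch of two of the key terms listed earlier: a stability coefficient and a percentage of agreement. I'm assuming the stability coefficient is computed as an ordinary Pearson correlation between two administrations of the same test, and that agreement is counted as exact matches between two raters. The function names are my own, not the book's.

```python
import math


def stability_coefficient(first_scores, second_scores):
    """Pearson correlation between two administrations of the same test.

    Assumption: each position in the two lists belongs to the same student.
    """
    if len(first_scores) != len(second_scores) or len(first_scores) < 2:
        raise ValueError("need two equally long lists with at least 2 scores")
    mean_first = sum(first_scores) / len(first_scores)
    mean_second = sum(second_scores) / len(second_scores)
    # Sum of cross-products and sums of squared deviations from each mean.
    cross = sum((a - mean_first) * (b - mean_second)
                for a, b in zip(first_scores, second_scores))
    ss_first = sum((a - mean_first) ** 2 for a in first_scores)
    ss_second = sum((b - mean_second) ** 2 for b in second_scores)
    if ss_first == 0 or ss_second == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cross / math.sqrt(ss_first * ss_second)


def percentage_of_agreement(ratings_a, ratings_b):
    """Percent of students who got the exact same rating from both raters."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equally long, non-empty lists of ratings")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100 * matches / len(ratings_a)
```

For example, if two teachers score four essays on a rubric as [3, 2, 4, 1] and [3, 2, 3, 1] (made-up numbers), they agree on three of the four, so the percentage of agreement is 75.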

Later we learn about the standard error of measurement, or SEM, which is estimated by using an equation rather than testing students over and over again. The SEM estimates the typical size of the measurement error in a score, and it is used to mark out the score range in which a student’s true score likely lies. However, Brookhart and Nitko caution readers about using SEM to determine the difference between two students’ scores. Interestingly, there is a possibility of overinterpreting OR underinterpreting score differences, along with the “do-nothing pattern” (Brookhart and Nitko, p.81). To play it safe, the authors suggest simply comparing the data from one test with other information on hand, like classroom performance.
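Since the SEM comes from an equation, a quick sketch may help. I'm assuming the usual classical test theory formula here, SEM = SD × √(1 − reliability), where SD is the standard deviation of the test scores and reliability is a reliability coefficient between 0 and 1. The function names and the sample numbers are mine, purely for illustration.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """SEM = SD * sqrt(1 - reliability), the classical test theory estimate."""
    if score_sd < 0:
        raise ValueError("standard deviation cannot be negative")
    if not 0 <= reliability <= 1:
        raise ValueError("reliability coefficient must be between 0 and 1")
    return score_sd * math.sqrt(1 - reliability)


def score_band(obtained_score, sem, n_sems=1):
    """Range of plus or minus n_sems SEMs around a student's obtained score."""
    return (obtained_score - n_sems * sem, obtained_score + n_sems * sem)


# Hypothetical test: scores have an SD of 10 and a reliability of 0.84.
sem = standard_error_of_measurement(10, 0.84)
low, high = score_band(75, sem)
print(f"SEM is about {sem:.1f}; score band is about {low:.0f} to {high:.0f}")
```

With those made-up numbers the SEM works out to roughly 4 points, so a student who scored 75 has a band of about 71 to 79. Notice that a more reliable test gives a smaller SEM and a narrower band.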

This chapter concludes with nine tips to help teachers enhance the reliability of their test results (Brookhart & Nitko, p.84):

Add questions to the assessment
Expand the scope of the test
Increase objectivity; use a rubric
Have multiple teachers grade; average the results
Combine results from multiple assessments
Give students enough time
Teach students to do their best
Match the assessment level with students’ levels
Differentiate among students
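The first tip, adding questions, can be put into numbers. One common way to do that is the Spearman-Brown formula from classical test theory, which predicts the reliability of a test that is made longer with similar items. I'm bringing it in as an illustration and not quoting it from the page cited above, and it assumes the new questions are comparable to the old ones. The function name is my own.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after multiplying test length by length_factor.

    Assumption: the added items are similar in quality and content to the
    existing ones. length_factor=2 means the test is twice as long.
    """
    if not 0 <= reliability <= 1:
        raise ValueError("reliability coefficient must be between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical quiz with a reliability of 0.60, doubled in length.
print(round(spearman_brown(0.60, 2), 2))
```

In this made-up case, doubling a quiz with a reliability of 0.60 gives a predicted reliability of 0.75, which is the "enough items that the consistency can show itself" idea in action.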

This chapter presents reliability as a critical aspect of assessment results. Remember that reliability and validity are not one and the same! In order to use assessment results appropriately, teachers must first be able to determine how reliable they are. This post gives a brief synopsis of things educators must keep in mind when they make decisions for their students.


Author: Erin Best

She/her. Mom, wife, sister. Writer. Virgo. Dog lover. Bookworm. Kidney Donor. MAT student at Pitt State. Exceptional needs advocate.
