Choose at least two measures that you can use to improve the reliability of any test item.
- What are the factors that you should keep in mind while formulating these measures?
- Examine the parameters against which you should gauge the extent of improvement in the test item.
Reliability of Test Items
The reliability of a test is its ability to produce consistent, trustworthy results. Reliability is determined by the quality of measurement and represents the “consistency” of your results. Measures of reliability indicate how consistent a test is in assessing different groups of learners. Although some degree of variance is expected, large variability in test results can call a test’s reliability into question.
A reliable test is capable of measuring the same learning outcomes in different but equal populations. For example, a medical or surgical exam should be able to measure knowledge of medical or surgical skills in nursing students who have reached a given point in the curriculum.
Reliability is calculated using statistical analyses, such as the Kuder-Richardson formula, the alpha coefficient, and the reliability coefficient. Reliability is reported as a coefficient ranging from 0 to 1.
Several factors have the potential to impact the reliability coefficient in academic and clinical settings. They are:
- Quality of test item
- Level of difficulty of each item: This is the item’s p-value, the proportion of students who answer it correctly. If an item has a p-value of less than 0.30, it is most likely too difficult for the group; items with a p-value greater than 0.85 are most likely too easy. You should strive for p-values between these bounds.
- Homogeneity of the test items: This deals with how the test “looks”; every test item should have a similar appearance. For example, when you write multiple-choice questions, the stems should all be approximately the same length, the answers should be approximately the same length, and the distractors should be approximately the same length so they do not offer clues to possible answers, and they should be listed vertically.
- A point-biserial coefficient, computed for every multiple-choice item, is considered useful because it reflects how well an item is “discriminating.” A high point-biserial coefficient means that students selecting the correct response tend to have higher total scores, and students selecting incorrect responses tend to have lower total scores. Given this, the item can discriminate between low-performing and high-performing students, which is a desirable characteristic of test questions. Very low or negative point-biserial coefficients can help identify items that are flawed (Measured Progress, 2013).
- The index ranges between -1 and +1. It is better to have a score as close as possible to +1. When the score is in the positive range, it indicates that high scoring students correctly answered the question more often than low scoring students.
- Similarly, scores closer to -1 mean the low scoring students answered the question correctly more often than members of the high scoring group.
- A highly discriminating item on an exam has a PBI above 0.40. If an item has a PBI of less than 0.15, it should be revised; if it has a PBI of less than 0.09, it should be rejected. A mean PBI of 0.36 is an acceptable to good value, and such a question should be retained or considered for minor improvement (McDonald, 2014).
- Homogeneity of the group being tested: When the majority of students choose the correct answers, it is difficult to differentiate between those at the high and low ends of achievement. If there is a small spread, you cannot compare individual items to the total test, which yields a reliability coefficient of 0. Because student nurses must meet stringent admission criteria, a certain degree of homogeneity is normal in the student population. Therefore, reliability coefficients for nursing students are generally closer to 0 than to 1.
- Length of test (i.e. the greater the number of items on a test, the greater its reliability)
- Number of students taking the test
- Time allotted for assessment
- Testing conditions
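The item-level factors above (difficulty and discrimination) can be computed directly from a score matrix. The sketch below, using invented 0/1 response data and the thresholds cited above (p-value between 0.30 and 0.85; PBI below 0.15 flags revision), illustrates one way to do it; the function name and report format are my own, not from the text.

```python
# Item-analysis sketch: difficulty (p-value) and point-biserial index (PBI)
# for each item, with flags based on the thresholds described above.
# The response data and thresholds applied here are illustrative.

def item_analysis(responses):
    """responses: one list of 0/1 item scores per student."""
    n_students = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_t = sum(totals) / n_students
    sd_t = (sum((t - mean_t) ** 2 for t in totals) / n_students) ** 0.5
    report = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        p = sum(item) / n_students                      # difficulty (p-value)
        correct = [t for t, x in zip(totals, item) if x == 1]
        # Point-biserial: correlation of the 0/1 item score with total score.
        if correct and len(correct) < n_students and sd_t > 0:
            mean_c = sum(correct) / len(correct)
            pbi = (mean_c - mean_t) / sd_t * (p / (1 - p)) ** 0.5
        else:
            pbi = 0.0                                   # degenerate item
        if p < 0.30:
            flag = "too difficult"
        elif p > 0.85:
            flag = "too easy"
        elif pbi < 0.15:
            flag = "revise (low discrimination)"
        else:
            flag = "ok"
        report.append((round(p, 2), round(pbi, 2), flag))
    return report
```

Run against a small class, the report lists each item’s p-value, PBI, and a flag, so flawed items stand out at a glance.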
In nursing education, a variety of test items are used to assess the attainment of learning objectives. The structural differences of these test items sometimes make it difficult to assess their reliability.
Methods like the Kuder-Richardson, split-half, and alpha coefficient are used to measure the reliability of objective test items. Test-retest and parallel-form reliability testing are rarely practical in the academic setting because of the time required to administer the same exam, or a parallel exam, to the same students on the same nursing content.
Compared to objective testing, it is more difficult to assess the reliability of essay test items because the methods involved are time consuming. In this case, you can use an analytic rubric to determine whether the answer reflects what you are testing. You should look for key facts and information in the answer and check them against the rubric. This will help you determine whether or not the answer contains the required information.
McDonald, M. E. (2014). Guide to assessing learning outcomes (3rd ed.). Burlington, MA: Jones & Bartlett.
Measured Progress™. (2013). Discovering the point biserial. Retrieved from http://www.measuredprogress.org/learningtools-stat…
More on Reliability
Performance-based tests clearly outline the steps that learners are required to follow in order to excel in a particular skill or procedure. As long as they include the steps of the procedure or skill, the test is considered fully reliable: scoring reduces to a “yes” or “no” for each step, so there is no need to formally test the reliability.
In portfolio evaluation, the artifacts used for testing reveal a nurse’s subjective experiences and growth, so those tests cannot be tested for reliability.
When it can be measured, the most frequently used methods for estimating the reliability of test items are the split-half and Kuder-Richardson methods. The following outlines how each method is used and the kinds of test items it applies to:
- Split-half: This method compares one half of the test to the other. It can be used to calculate the reliability of objective test items.
- Kuder-Richardson: This method compares each test item with the others. It is used on test items that have either correct or incorrect answers, such as multiple-choice or true/false items.
- Alpha coefficient: This method calculates the average of all possible split-half scores. It is used on test items that have more than one plausible answer.
The split-half method essentially divides a test into two equal components. Each half should be relatively comparable in terms of the level of difficulty or complexity and should encompass similar areas of inquiry. Next, the test is administered to a group of students. Each half is then scored and the scores are compared. If the learners’ scores on each half of the exam are similar, the test is deemed reliable. Using this method, you can test the reliability of a test without administering it twice on the same group of learners. However, the reliability measure obtained depends heavily on the way the test items are divided.
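The procedure just described — divide the test into two comparable halves, score each half, and compare the halves — can be sketched in a few lines. This is an illustration under my own assumptions: the items are split odd/even, and the comparison is a Pearson correlation of the half scores.

```python
# Split-half reliability sketch: odd items vs. even items, then Pearson's r
# between the two half scores. The odd/even split is one of many possible
# divisions; as noted above, the result depends on how the items are divided.

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def split_half(responses):
    """responses: one list of 0/1 item scores per student."""
    half1 = [sum(row[0::2]) for row in responses]  # items 1, 3, 5, ...
    half2 = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return pearson(half1, half2)
```

A correlation near 1 means the learners’ scores on the two halves track each other closely, which is what the text above takes as evidence of reliability.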
In the Kuder-Richardson method, the learners’ responses to each of the test items are taken into consideration. Then, the level of consistency in these responses is used to measure the reliability of the test. The Kuder-Richardson method, like the alpha coefficient, performs split-half testing on every possible combination of split halves. For a given test, you divide the test items into all possible combinations of halves, determine the split half coefficient of each combination, and then take the average of the coefficients to get the coefficient alpha.
Coefficient alpha is also known as Cronbach’s alpha—which is a statistic that is generally used as a measure of internal consistency or reliability. It estimates the internal consistency (homogeneity) of a measure which is composed of several subparts or items (such as a test).
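For dichotomous (right/wrong) items, the Kuder-Richardson approach reduces to the well-known KR-20 formula, which is equivalent to coefficient alpha in that case. The sketch below uses invented data; the function name is mine.

```python
# KR-20 sketch for dichotomously scored items (0/1). For such items KR-20
# equals coefficient alpha. Data shown in the test are illustrative only.

def kr20(responses):
    """responses: one list of 0/1 item scores per student."""
    n_students = len(responses)
    k = len(responses[0])                          # number of items
    totals = [sum(row) for row in responses]
    mean_t = sum(totals) / n_students
    var_t = sum((t - mean_t) ** 2 for t in totals) / n_students
    pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n_students
        pq += p * (1 - p)                          # item variance p * q
    return (k / (k - 1)) * (1 - pq / var_t)
```

The result is a coefficient between 0 and 1, matching the reporting range described earlier in this section.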
Summary: A test’s reliability determines its reusability and its ability to obtain consistent responses from learners. Measures of reliability indicate how consistently a test assesses different groups of learners. The most frequently used methods of estimating the reliability of test items are the split-half and Kuder-Richardson methods. The validity of a test ascertains whether the test measures what it intends to measure; a test can be reliable without being valid. The point-biserial index is useful because it reflects how well an item discriminates between low-performing and high-performing learners. Test-retest and parallel-form reliability testing are rarely practical in the academic setting because of the time required to administer the same exam, or a parallel exam, to the same students on the same nursing content.
Validity of Test Items
A test’s reliability determines its reusability and its ability to extract consistent responses from the learners. Validity addresses a test’s ability to measure a given learning outcome. The validity of a test ascertains whether the test measures what it intends to measure.
A test can be capable of producing similar responses when administered on the same group of learners, at different times, without actually testing what the nurse educator intended to test. In short, a test can be reliable without being valid.
As a nurse educator, you should be concerned with three types of validity: content, construct, and criterion.
- Content validity: Content validity demonstrates how closely a test item matches the content that was delivered in a course of instruction.
- Construct validity: Construct validity demonstrates how closely the test measures the “construct or idea” it claims to measure. Construct represents qualities or ideas, like creativity or critical thinking.
- Criterion validity: Criterion validity demonstrates how closely the examination scores are related to predetermined outcome criteria. It is a measure of agreement between the results obtained by the test against a known standard or against itself.
More on Validity of Test Items
Content validity: Content validity is used to determine whether the test measures the content that has been taught. You can use a test blueprint to check this. Consider this example: You are developing a test for a surgical nursing course. To achieve good content validity, you should include test questions on various surgical principles, concepts, and procedures that have been presented in class. If you include questions related to obstetrics, the assessment would lack content validity.
Construct validity: If you seek to measure critical-thinking skills, then your test must have validity related to the construct of critical thinking. To develop a test that measures critical thinking, you should seek out evidence that supports the construct.
Let’s take a look at this example: You have to assess your learners’ critical-thinking and decision-making skills in the nursing care of patients experiencing cardiac arrest. To ensure construct validity, the test should identify critical health care situations related to cardiac arrest and ask learners to identify the most effective nursing care decisions.
Construct validity is a bit more complicated to measure and includes methods such as:
- Calculating test items’ intercorrelations: This provides evidence of homogeneity, demonstrating that the test measures a single construct.
- Calculating the correlation of the score with other measures. For example, you can give the same test to two separate sections of a nursing class and receive comparable scores.
- Exploring the learners’ thinking processes to answer questions. By doing this, strength is added to the construct’s definition.
For example, you can use medication calculation tests where learners can demonstrate their skills with numbers and calculations. This can help you see where errors are made.
- Consulting experts in the area of measurement to verify that test items are actually capable of measuring the construct. Let us consider this example. Your area of specialization is geriatrics and you need to lecture your learners on ostomy. You can have an ostomy nurse lecture your class on ostomy. Later when you develop a test on ostomy, you can ask the ostomy nurse to validate the test questions.
Criterion validity: This kind of validity verifies the appropriateness of a test in predicting a learner’s future performance. In other words, the nurse educator seeks evidence to determine how valid a test is for “predicting” a secondary measure of student achievement (also referred to as a criterion).
A good example of this type of validity can be found in the mobility profile tests provided by many educational testing companies. Here, testing products are offered that attempt to predict the students’ likelihood of passing the NCLEX-RN.
A validity coefficient helps determine how well a test fulfills its purpose, or its validity. It demonstrates the strength of a relationship between two sets of scores.
Content validity cannot be calculated whereas construct and criterion validity can be calculated using a validity coefficient.
The Pearson product-moment correlation coefficient is often used to obtain the correlation coefficient, reported as r. The correlation coefficient ranges from -1 to +1, and this score represents both direction and strength. A coefficient of -1 indicates a perfect negative correlation, while a coefficient of 0 means the two scores are not correlated at all. The closer your measure comes to +1, the more valid the test is at predicting the relationship. Keep in mind that a significant correlation does not necessarily mean cause and effect have been established.
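Computing a validity coefficient is a direct application of Pearson’s r: correlate learners’ test scores with a criterion measure (for example, a later licensure-exam score). The data and function name below are invented for illustration.

```python
# Validity-coefficient sketch: Pearson's r between test scores and an
# external criterion measure. Scores used in the test are made up.

def validity_coefficient(test_scores, criterion_scores):
    """Pearson product-moment correlation between two score lists."""
    n = len(test_scores)
    mx = sum(test_scores) / n
    my = sum(criterion_scores) / n
    cov = sum((x - mx) * (y - my)
              for x, y in zip(test_scores, criterion_scores))
    sx = sum((x - mx) ** 2 for x in test_scores) ** 0.5
    sy = sum((y - my) ** 2 for y in criterion_scores) ** 0.5
    return cov / (sx * sy)
```

A value near +1 supports the claim that the test predicts the criterion; as noted above, even a strong correlation does not establish cause and effect.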
Validity coefficients are calculated for criterion and construct validity. Tests with low construct and criterion validity scores will have a poor validity coefficient.
Apart from measuring a test for reliability and validity, you should also check its standard error of measurement. No test score is a perfect reflection of an individual’s abilities, so it is important to calculate how close the obtained score is to the “true score.” This difference is the standard error of measurement: the variance between the score a student actually receives on a test and his or her true score (or knowledge). It is typically calculated using statistical software packages.
Consider this example: You develop a test to assess a learner’s knowledge on nursing interventions required for patients with severe traumatic brain injury. Then, without giving the learner a chance to improve upon the present knowledge of traumatic brain injury, you make the learner respond to it several times and collect the scores. You observe that scores are not uniform. Some scores are relatively higher or lower than the score which reflects his or her actual knowledge of the content.
This variation between the learner’s true score and the higher or lower observed scores reflects the standard error of measurement. The larger the standard error of measurement, the less reliable the test. In about 68% of cases, the score a student obtains will fall within ±1 standard error of the true score; in about 95% of cases, it will fall within ±2 standard errors.
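Although the text defers this calculation to statistical software, a common textbook formula — SEM = SD × √(1 − reliability) — connects the standard error of measurement to the reliability coefficient discussed earlier. Treat the formula and the function below as an illustrative assumption, not a method stated in this section.

```python
# Standard-error-of-measurement sketch using the common formula
# SEM = SD * sqrt(1 - reliability). The score data are illustrative.

def sem(scores, reliability):
    """scores: observed total scores; reliability: coefficient in [0, 1]."""
    n = len(scores)
    mean = sum(scores) / n
    sd = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
    return sd * (1 - reliability) ** 0.5
```

Consistent with the paragraph above, about 68% of observed scores fall within ±1 SEM of the true score and about 95% within ±2 SEM; note also that higher reliability shrinks the SEM.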
Summary: Validity addresses a test’s ability to measure a given learning outcome; the validity of a test determines whether the test measures what it intends to measure. As a nurse educator, you should be concerned with three types of validity: content, construct, and criterion. A validity coefficient measures the extent to which a test is valid, and the Pearson product-moment correlation coefficient is often used to obtain it. Validity coefficients are calculated for criterion and construct validity. Tests with low construct and criterion validity scores will have a poor validity coefficient, and the test will not be valid.