Journal Papers

Applications of Item Response Theory Models to Assess Item Properties and Students’ Abilities in Dichotomous Responses Items


ISSN: 2734-2050

Article Details:

DOI: 10.52417/ojed.v3i1.304 Article Ref. No.: OJED0301001-203

Volume: 3; Issue: 1, Pages: 01-19 (2023)

Accepted Date: 10 January, 2022

© 2021 Adetutu & Lawal.


 A test is a tool meant to measure students’ ability levels and how well they can recall the subject matter, but the items making up a test may be defective, and thereby unable to measure students’ abilities or traits satisfactorily as intended, if proper attention is not paid to item properties such as the difficulty, discrimination, and pseudo-guessing indices of each item. This can be remedied by item analysis and moderation. It is well known that the absence or improper use of item analysis can undermine the integrity of assessment, selection, certification, and placement in our educational institutions. This study focused on both the appropriateness and spread of item properties in assessing the distribution of students’ abilities, and on the adequacy of the information provided by the dichotomous-response items of a compulsory university undergraduate statistics course, which were scored dichotomously and analyzed with Stata/SE 16.0 on Windows 7. To this end, three dichotomous Item Response Theory (IRT) measurement models were used in the context of their potential usefulness in an educational setting, such as determining these item properties. Ability, item discrimination, difficulty, and guessing parameters were quantified as unobservable characteristics from a binary-response test; the discrete item response then becomes an observable outcome variable associated with a student’s ability level through Item Characteristic Curves, each defined by a set of item parameters that model the probability of observing a given item response conditional on a specific ability level. These models were used to assess the three item properties together with students’ abilities, and then to identify defective items that needed to be discarded or moderated, and non-defective items, as the case may be; some of these selected items were discussed in light of the underlying models. Finally, the information provided by these selected items was also discussed.

Keywords: Ability, Difficulty, Discrimination, Guessing, Response, Test


The main goal of testing is to collect information in order to make decisions either about students’ abilities or about the suitability of test items, and different types of information may be needed depending on the kind of decision to be made. Before Item Response Theory (IRT) came the development of Classical Test Theory (CTT), a product of Pearsonian statistics, the intelligence-testing movement of the first four decades of the 20th century, and its attendant controversies (Baker and Kim, 2004). Subsequently, Lord (1968) reformulated the basic constructs of CTT using a modern mathematical-statistical approach in which items and their characteristics played a minor role in the structure of the theory. Earlier, both psychometric theoreticians and practitioners had become dissatisfied over the years with the discontinuity between the roles of items and test scores in CTT. All were of the opinion that a test theory should start with the characteristics of the items composing a test rather than the resultant scores (Brzezinska, 2017).

Two major theories underlying test development are CTT and IRT (Raykov, 2017). The former is concerned with reliability and has substantial limitations: estimates of item parameters are group dependent; a test item may function as easy or difficult depending on the sample; students’ ability estimates are entirely test dependent and change as the occasion changes, resulting in poor or inconsistent measurement; and p and r, which denote the difficulty index and the number of students who answer an item correctly respectively, depend on the sample of students taking the test. The latter (IRT) is somewhat more complicated than CTT: rather than looking at the reliability of the test as a whole, IRT looks at each item that makes up the test (Linden, 2018).


An item is a single question or task on a test or an instrument, and Item Response Theory (IRT) is a theoretical framework organized around the concept of a latent trait. It comprises models and related statistical methods that relate observed responses on an instrument to a student’s level of ability. It focuses specifically on the items that make up the test, compares those items, and then evaluates the extent to which the test measures the student’s ability (Raykov and Marcoulides, 2018). IRT models are widely used today in the study of cognitive and personality abilities, health responses, item-bank development, and computerized adaptive testing (Paek and Cole, 2020). For instance, King and Bond (1996) applied IRT to measure computer anxiety in grade-school children; Mislevy and Wu (1996) used IRT to assess physical functioning in adults with HIV; Boardley et al. (1999) used IRT to measure the degree of public-policy involvement among nutrition professionals; Olukoya et al. (2018) presented a descriptive item analysis of university-wide multiple-choice objective examinations, drawing on the experience of a Nigerian private university; Ng et al. (2016) applied IRT and the Rasch model to develop a new set of speech-recognition test materials in Cantonese Chinese; Adetutu and Lawal (2020) compared frequentist and Bayesian approaches to IRT and found that the Bayesian approach is better at estimating the three item properties along with students’ abilities simultaneously; Zeigenfuse et al. (2020) extended dichotomous IRT models to account for test-taking behaviour on matching tests, which violate the assumption of local independence; and Bonifay and Cai (2017), examining the complexity of IRT models, found that the functional form of an IRT model, not goodness of fit alone, should be considered when choosing the model to use.
However, Suruchi and Rana (2015) identified two uses of item analysis: identifying defective test items, and identifying the areas students have and have not yet mastered. IRT is a potent tool for detecting flaws in items and finding ways to correct them before the items are finally administered; hence, item moderation needs to follow item analysis. Where an item cannot be moderated, it must be discarded and replaced. Ary et al. (2002) asserted that item analysis should make use of statistics that reveal important and relevant information for upgrading the quality and accuracy of multiple-choice items. Therefore, IRT plays a central role in the analysis and study of tests and item scores, in explaining student test performance, and in providing solutions to test-design problems for tests consisting of several items (Baker and Kim, 2004; Baker and Kim, 2017). The principal advantage of IRT over CTT that propelled us to use IRT is its treatment of reliability and measurement error through item information functions, which are computed for each item (Hassan and Miller, 2019).


The unidimensionality assumption of IRT implies homogeneity of the test items in the sense of measuring a single ability (Hambleton and Traub, 1973), with each student’s response pattern consisting of 1’s and 0’s. Under local independence, a given student’s responses to the test items are statistically independent of one another, conditional on ability; the implication is that the items are uncorrelated for students at the same ability level (Lord and Novick, 1968). The monotonicity assumption concerns item response functions, which model the relationship between students’ trait levels, item properties, and the probability of endorsing the item (Rizopoulos, 2006; De Ayala and Santiago, 2016). Finally, the item-invariance assumption implies that the item parameters estimated by an IRT model do not change even when characteristics of the students, such as age, change (Paek and Cole, 2019).
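The local independence assumption can be made concrete with a short sketch. Under a Rasch/1PL model (a minimal illustration only, not the paper’s Stata analysis; the function names and the ability and difficulty values below are hypothetical), the likelihood of a student’s whole response pattern factors into a product of per-item probabilities conditional on ability:

```python
import math

def p_correct(theta, b):
    """Rasch/1PL probability of a correct response: logistic in (theta - b)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def pattern_probability(theta, difficulties, responses):
    """Under local independence, the likelihood of a response pattern
    (1 = correct, 0 = incorrect) is the product of per-item probabilities,
    each conditioned on the same ability theta."""
    prob = 1.0
    for b, u in zip(difficulties, responses):
        p = p_correct(theta, b)
        prob *= p if u == 1 else (1.0 - p)
    return prob

# A hypothetical student of ability 0.5 answering three items of
# increasing difficulty, getting the first two right and the last wrong.
print(pattern_probability(0.5, [-1.0, 0.0, 1.0], [1, 1, 0]))
```

Because the pattern likelihood is a simple product, maximum-likelihood estimation of abilities and item parameters becomes tractable; this factorization is exactly what fails when local independence is violated.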


Every year, university teachers face the challenge of coping with increasing numbers of examination candidates, a challenge that multiple-choice items came to resolve in our educational setting; however, the absence of item analysis in developing these multiple-choice items undermines the integrity of assessment, selection, certification, and placement in our educational institutions. Improper use of item analysis leads to the same fate, while lopsided test items can lead to the wrong award of grades and certificates (Olukoya et al., 2018; Ary et al., 2002). We have seen that hundreds of secondary-school students take university entrance examinations, and their results determine entry into universities and possible alternatives (Eli-Uri and Malas, 2013; Cechova et al., 2014). Hence, the need to maintain the validity of tests using IRT models necessitated this study.


Professional item analysis that makes use of statistics reveals important and relevant information about each item for upgrading the quality and accuracy of multiple-choice items. Its power lies in identifying defective items, as well as the areas students have and have not yet mastered, so that flaws can be corrected before the items are finally administered, thereby preserving the integrity of assessment, selection, certification, and placement in our educational institutions.


  1. Determine the spread and appropriateness of item properties in multiple-choice items
  2. Assess the distribution of students’ abilities and the adequacy of the information provided by the test



The data used to illustrate these (one-, two-, and three-parameter logistic) models were the results of a university semester examination in which a total of 403 students took a compulsory general statistics course during the 2017/2018 academic session. The test consisted of 35 multiple-choice items; each item had 4 options, one of which was correct while the other three were distractors. The same items were administered to all students, and their responses, in terms of the options chosen, were coded into binary form (0 for endorsing any incorrect option and 1 for endorsing the correct option) using Stata/SE 16.0 on Windows 7. Some selected items, given in the supplementary section, are discussed.
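The dichotomous-coding step can be sketched as follows. This is an illustration only: the answer key, responses, and function name below are hypothetical, and the study performed this re-coding in Stata/SE 16.0 rather than Python.

```python
# Hypothetical answer key: the correct option for each of four items.
answer_key = ["A", "C", "B", "D"]

def score_dichotomous(chosen, key):
    """Re-code chosen options into binary scores:
    1 for endorsing the correct option, 0 for any distractor."""
    return [1 if c == k else 0 for c, k in zip(chosen, key)]

# One hypothetical student's chosen options across the four items.
student_responses = ["A", "B", "B", "D"]
print(score_dichotomous(student_responses, answer_key))  # [1, 0, 1, 1]
```

Applied to all 403 students and 35 items, this yields the 403 × 35 binary response matrix on which the IRT models operate.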


Method I: Rasch/One-parameter logistic Model

The first model, employed chiefly to assess how difficult an item is perceived to be by test takers, was proposed in 1966 by Georg Rasch, a Danish mathematician (Rasch, 1966), and is similar to the one-parameter logistic (1PL) model proposed by Birnbaum (1968). This model, positioned in equation (1), describes a test item in terms of only one parameter, the difficulty index. The probability that student 𝑘 with ability 𝜃𝑘 correctly endorses item 𝑔 with difficulty index 𝑏𝑔 is presented in equation (1):
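Equation (1) itself is truncated in this excerpt; the standard Rasch/1PL form, written in the notation defined above (ability 𝜃𝑘, difficulty 𝑏𝑔), is

```latex
P\left(u_{gk} = 1 \mid \theta_k, b_g\right)
  = \frac{\exp\left(\theta_k - b_g\right)}{1 + \exp\left(\theta_k - b_g\right)}
  \qquad (1)
```

where $u_{gk}$ denotes the binary response of student $k$ to item $g$; the probability of a correct response is 0.5 exactly when the student’s ability equals the item’s difficulty.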
