Item Response Theory

Section 1: Why This Matters

When teachers add up test marks, it feels simple and fair: each question carries marks, totals show who has learned the most, and you get a neat list of results. But those totals rest on some shaky assumptions:

  • that every mark is of equal value,
  • that every question measures the same underlying skill or understanding, and
  • that it doesn’t matter whether a test has more easy or hard questions.

In the real world of classrooms, none of this holds true.

Take two students — Mina and Rahim — who both score 45 out of 60. Mina’s marks come mainly from short recall items: definitions, diagrams, and quick factual answers. Rahim, meanwhile, drops points on those but succeeds on the long reasoning questions that require linking ideas and applying concepts. Their totals are the same, yet their attainment is not. Mina has memorised the surface; Rahim has mastered the structure. The total score hides that difference.

This is what Becky Francis (Education Endowment Foundation) calls the illusion of attainment: when we treat test scores as if they reveal deep learning, even though they often measure familiarity, exam technique, or luck with question types. Without looking beneath the total, we risk rewarding rehearsal over reasoning.

That’s why the EBTD Guide to Assessment and this annex exist — to help teachers uncover what their tests really measure. You don’t need complex statistical software or a PhD in psychometrics. Using the staged ChatGPT approach outlined here, you can:

  • sense-check the difficulty and cognitive demand of each question before testing,
  • ensure marks reward deep understanding rather than short recall,
  • review results in ways that show which concepts — not just which students — were strong or weak.

In short, by thinking like Item Response Theory (IRT), teachers can move from adding marks to analysing meaning. The goal isn’t to make testing more complicated; it’s to make it more accurate, fair, and useful — so that every mark tells you something real about learning.

EBTD angle: we are not asking teachers to run full Rasch software. We’re giving them prompts to make better, fairer, more revealing tests with the tools they already have.

Section 2: The Four-Stage ChatGPT Workflow – A Worked Example

This annex gives teachers a practical way to move from adding marks to interpreting learning. Each stage uses ChatGPT as a classroom-ready assistant — helping you sense-check question design, adjust mark balance, and strengthen validity before you ever print a paper.

The example below follows one test from first draft to final version.


Stage 1 – Question-Level Analysis

(Before pupils sit the test)

Teacher prompt (copy/paste):

“I’ve written a test. Please analyse each question for:
• perceived difficulty (easy / moderate / hard)
• question type (recall, understanding, application, analysis, evaluation, creation)
• cognitive demand using Bloom/Anderson taxonomy
• topic or skill focus
• clarity and potential ambiguity
• overall balance of question types and coverage of the syllabus.

Then give me a short written report summarising:
– which questions may be too easy or too hard
– which types dominate the paper
– which topics are under- or over-represented
– any specific rewriting or re-ordering suggestions.

(Paste the full text of your test below, including question numbers and marks if available.)”

Example: Teacher’s First Draft (Before ChatGPT)

Subject: Science – Photosynthesis (initial teacher version)

Q1 Define photosynthesis (1)
Q2 State the word equation for photosynthesis (1)
Q3 List two raw materials needed (1)
Q4 Explain how light affects the rate (2)
Q5 Draw a diagram of a chloroplast (2)
Q6 Describe the role of chlorophyll (1)
Q7 Name the gas released (1)
Q8 Explain why temperature affects the rate (2)
Q9 Describe an experiment to test a leaf for starch (3)
Q10 Why do farmers use greenhouses? (2)

Total = 16 marks.

ChatGPT’s Analysis (Stage 1 Output)

Q | Type | Cognitive Level | Est. Difficulty | Comment
1 | Recall | Remember | Easy | Basic definition only.
2 | Recall | Remember | Easy | Another fact recall – overlaps with Q1.
3 | Recall | Remember | Easy | Too similar to Q1–Q2; low challenge cluster.
4 | Application | Apply | Moderate | First reasoning item. Could expand marks.
5 | Understanding | Understand | Moderate | Visual recall; okay variety.
6 | Recall | Remember | Easy | Short factual; could be merged.
7 | Recall | Remember | Easy | Factual repetition.
8 | Analysis | Analyse | Moderate | Good; keep.
9 | Application | Apply | Harder | Practical design; strong item.
10 | Evaluation | Evaluate | Moderate | Needs more marks to reflect complexity.

Summary:

  • Recall dominates the paper (5 of the 10 questions); higher-order skills are under-represented.
  • Early questions overlap and could be merged.
  • Only one genuine evaluation question (Q10).
  • No data interpretation or extended reasoning.

ChatGPT suggestion:

Combine early recall items, expand reasoning questions, and add interpretation or experiment design items to reach a broader skill range.

Stage 2 – Mark Allocation and Balance Check

Teacher prompt (copy/paste):

“Please analyse the following test for mark-scheme balance.

For each question, tell me:
• the cognitive demand or question type (recall / analysis etc.)
• number of marks
• whether mark weighting seems fair for the effort and level
• which areas or skills are over- or under-represented.

Then provide:
– a short table showing total marks by cognitive level and by topic
– comments on fairness, balance and reliability
– specific mark-allocation changes you recommend (e.g. increase Q7 from 3 to 5 marks, reduce Q2 from 5 to 2).”

ChatGPT’s Balance Report (Stage 2 Output)

Level | Questions | Marks | % of Total | Comment
Recall | 1, 2, 3, 6, 7 | 5 | 31% | Too heavy on basic recall.
Understand / Apply | 4, 5, 8, 9 | 9 | 56% | Good mid-range challenge.
Analyse / Evaluate | 10 | 2 | 13% | Underweighted for higher-order thinking.

Observations:

  • Evaluation is under-represented despite being central to GCSE-style outcomes.
  • Several recall items could share marks or be combined to free space for higher-order questions.
  • The 16-mark paper is short; a range of 20–25 marks would give better discrimination.

Suggested mark changes:

  • Merge Q1–Q3 into one “definition + materials” question (2 marks).
  • Expand Q4 and Q10 to 3–4 marks each to reflect reasoning.
  • Add one or two new “analysis/evaluation” questions (graphs or scenarios).
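
If you want to sanity-check a balance report like this yourself, the same figures can be reproduced in a few lines of Python. This is a minimal sketch, not part of the ChatGPT workflow; the marks come from the draft paper and the grouping follows the Stage 2 table.

```python
# Minimal sketch: mark balance by cognitive level for the 16-mark draft paper.
marks = {"Q1": 1, "Q2": 1, "Q3": 1, "Q4": 2, "Q5": 2,
         "Q6": 1, "Q7": 1, "Q8": 2, "Q9": 3, "Q10": 2}
levels = {"Q1": "Recall", "Q2": "Recall", "Q3": "Recall",
          "Q6": "Recall", "Q7": "Recall",
          "Q4": "Understand/Apply", "Q5": "Understand/Apply",
          "Q8": "Understand/Apply", "Q9": "Understand/Apply",
          "Q10": "Analyse/Evaluate"}

total = sum(marks.values())  # 16 marks in the draft
by_level = {}
for q, m in marks.items():
    by_level[levels[q]] = by_level.get(levels[q], 0) + m

for level, m in by_level.items():
    print(f"{level:17s} {m:2d} marks  {100 * m / total:.1f}% of paper")
# Recall comes out at roughly a third of the marks, Understand/Apply at just
# over half, and Analyse/Evaluate at about an eighth - matching the table above.
```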

Stage 3 – Redesign & Enrichment Suggestions

Teacher prompt (copy/paste):

“Based on the previous analysis, please suggest specific improvements to this test.
For each question, tell me whether to keep, modify, replace, or remove it; what change you suggest (rewording, scaffolding, increasing reasoning depth, adding context, etc.); and why that change would improve validity or reliability.”

ChatGPT’s Redesign Table (Stage 3 Output)

Q | Action | Rationale
1–3 | Merge into one new definition + materials question | Reduces duplication.
4 | Expand to include graph interpretation element | Adds analytical depth.
5 | Keep | Varied modality (diagram).
6–7 | Merge with Q9 (starch experiment and gas evidence) | Integrates theory + practical.
8 | Keep | Solid mid-difficulty item.
9 | Modify to include method control and result prediction | Boosts reasoning.
10 | Extend to “Evaluate use of greenhouses and CO₂ enrichment.” | Increases higher-order thinking.
New items | Add data interpretation, error analysis, and design task | Broadens assessment coverage.

Stage 3 – Outcome: Revised Test Blueprint

Following the prompts above, the teacher produced a 20-question, 63-mark test that balanced factual recall, applied reasoning, data analysis, and evaluation. The full paper is shown below.

Final version (after refinement):

Subject: Science (Photosynthesis) – End-of-unit test

Q1. Define photosynthesis. (2)
Q2. Write the balanced chemical equation for photosynthesis. (2)
Q3. Name two factors that can limit the rate of photosynthesis. (2)
Q4. Explain how light intensity affects the rate of photosynthesis. (3)
Q5. Describe an experiment you could do to investigate how temperature affects photosynthesis. (5)
Q6. A plant kept in the dark for 24 hours shows no starch in its leaves. Explain why. (3)
Q7. Farmers often add CO₂ to greenhouses. Evaluate this practice. (6)
Q8. Analyse the graph showing the relationship between temperature and rate of photosynthesis. (4)
Q9. Explain why chlorophyll is essential for photosynthesis. (2)
Q10. Describe how water reaches the leaf to be used in photosynthesis. (2)
Q11. Suggest why photosynthesis slows down on very hot days. (3)
Q12. Explain the difference between a limiting factor and a necessary condition. (3)
Q13. A student claims that “plants do not respire during the day because they are photosynthesising.” Explain why this is incorrect. (3)
Q14. Interpret the data table showing light intensity at different distances from a lamp. (3)
Q15. Explain how farmers could increase yield using knowledge of photosynthesis. (4)
Q16. Describe how you would test a leaf for starch. (3)
Q17. A plant grown in a nitrogen-poor soil has pale leaves. Explain the link to photosynthesis. (3)
Q18. Explain why variegated leaves are useful for photosynthesis experiments. (2)
Q19. Evaluate the statement: “Photosynthesis is the most important chemical reaction on Earth.” (4)
Q20. Design an investigation to compare the rate of photosynthesis in pondweed under red and blue light. (4)

Reflection: By iterating through these three stages, the teacher has broadened the range of cognitive demand, distributed marks more evenly, increased reliability and validity, and aligned questions more closely with real learning outcomes. In short, the workflow helped them see their test as data about thinking, not just as a score sheet.

Section 3: Stage 4 – Post-Test Item Review (Partial Credit + Rasch-Style View)

From “who scored what” to “what did this test really tell us?”

Why this stage matters: Once students have completed a test, raw totals alone can mislead. Two pupils with identical scores might have reached them in very different ways — and not all questions contribute equally to understanding. This stage helps teachers move from marks to meaning, showing how ChatGPT can perform a quick, research-style item analysis.

It combines simple classroom data with Rasch-inspired reasoning (the model behind that phrase is sketched just after this list) to reveal:

  • which questions truly discriminated between levels of understanding,
  • which items were too easy or too hard, and
  • how to redesign the next assessment to be fairer and more informative.
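
For anyone curious what “Rasch-inspired” means in one line: the simplest Rasch model says that a student’s chance of succeeding on an item depends only on the gap between the student’s ability (θ) and the item’s difficulty (b), both expressed on the same logit scale. This is the standard textbook statement of the model, not something the report below computes exactly:

```latex
% Dichotomous Rasch model: probability that a student of ability \theta
% answers an item of difficulty b correctly.
P(X = 1 \mid \theta, b) = \frac{e^{\theta - b}}{1 + e^{\theta - b}}
```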

Below is a worked example using your 20-question photosynthesis test.

Class dataset (raw input)

Each row shows a student’s score per question; MaxMarks is the total possible for each item.

Student,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q20
MaxMarks,2,2,2,3,5,3,6,4,2,2,3,3,3,3,4,3,3,2,4,4
S01,2,2,2,3,5,3,6,4,2,2,3,3,3,3,4,3,2,1,3,3
S02,1,1,1,2,4,2,5,3,1,1,2,2,2,2,2,1,1,0,2,2
S03,2,2,2,2,5,2,6,3,2,2,2,2,2,2,3,2,2,1,3,3
S04,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
S05,1,1,1,2,3,0,4,2,0,0,0,0,0,0,0,0,0,0,0,0
S06,1,1,1,2,3,0,4,2,0,0,0,0,0,0,0,0,0,0,0,0
S07,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
S08,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
S09,1,1,1,2,3,0,4,2,0,0,0,0,0,0,0,0,0,0,0,0
S10,2,2,2,2,5,2,6,3,2,2,2,2,2,2,3,2,2,1,3,3
S11,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
S12,1,1,1,2,4,2,5,3,1,1,2,2,2,2,2,1,1,0,2,2
S13,1,1,1,2,3,0,4,2,0,0,0,0,0,0,0,0,0,0,0,0
S14,2,2,2,3,5,3,6,4,2,2,3,3,3,3,4,3,3,2,4,4
S15,2,2,2,3,5,3,6,4,2,2,3,3,3,3,4,3,2,1,3,3
S16,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
S17,1,1,1,2,3,0,4,2,0,0,0,0,0,0,0,0,0,0,0,0
S18,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
S19,2,2,2,3,5,3,6,4,2,2,3,3,3,3,4,3,2,1,3,3
S20,0,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
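
A teacher comfortable with a little Python (or a spreadsheet) can reproduce the basic numbers in the report below before, or instead of, handing the data to ChatGPT. A minimal sketch, assuming the table above is saved as a CSV file called results.csv (hypothetical filename) with the MaxMarks row first:

```python
# Minimal sketch: per-student totals and per-item facility from the class CSV.
import csv

with open("results.csv", newline="") as f:
    rows = list(csv.reader(f))

items = rows[0][1:]                           # Q1 ... Q20
max_marks = [int(x) for x in rows[1][1:]]     # the MaxMarks row
scores = {r[0]: [int(x) for x in r[1:]] for r in rows[2:]}

# Per-student totals (out of sum(max_marks) = 63)
totals = {s: sum(row) for s, row in scores.items()}

# Per-item facility: % of all available marks the class actually earned
n_students = len(scores)
facility = {
    q: 100 * sum(row[i] for row in scores.values()) / (m * n_students)
    for i, (q, m) in enumerate(zip(items, max_marks))
}

print(totals["S01"])             # 59
print(round(facility["Q1"], 1))  # 60.0
```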
  

Teacher prompt (copy/paste)

“I have a class test that my students have completed. The table below shows student marks for each question, and the first data row is ‘MaxMarks’ for each question. Some items allow partial credit (0 up to MaxMarks).

Please produce a full assessment analysis report that combines:
• Classical item analysis (% of max marks, flag items that were too easy or too hard)
• Rasch-style partial-credit analysis (estimate item difficulty and student ability on a common scale)

Provide tables for:
– Item vs % of max achieved vs difficulty + interpretation
– Student vs total marks vs ability + interpretation

Include simple ASCII visuals showing which items were easy or hard, and write a short, supportive commentary with teaching and test-design actions.”

ChatGPT’s generated outcome

This is the automated Stage 4 report the teacher would receive.

Stage 4 – Post-Test Item Review (Partial Credit + Rasch-Style View)

1. Fast headline summary
  • Easiest / best accessed items: Q1, Q5, Q3, Q2, Q7 – most students could reach these.
  • Hardest items: Q18, Q17, Q19, Q20, Q16 – these sat above the ability of many students; mainly the top group picked up marks.
  • The test has a helpful mid-band around Q4, Q8, Q9–Q14, but several of those are still harder than your middle students.
  • One student (S14) got full marks (63/63), so the paper had challenge that was still reachable.
  • The bottom group only picked up marks on short/low-demand items — the paper is differentiating, but slightly “top loaded”.
2. Classical Item Analysis (by question)

Percentages below = “of all the marks that could be earned on this question, what % did the class actually earn?”
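For example, Q1 carries 2 marks and 20 students sat the paper, so 40 marks were available; the class earned 24 of them, which is the 60% shown in the first row.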

Item | Max marks | % of max earned | Rasch-style difficulty* | What that means
Q1 | 2 | 60.0% | −0.41 | Good/easy opener – most could define.
Q2 | 2 | 55.0% | −0.20 | Still accessible – equation mostly known.
Q3 | 2 | 57.5% | −0.30 | Similar to Q1–2 – basic knowledge secure.
Q4 | 3 | 51.7% | −0.07 | Nice mid item – half the class could explain light.
Q5 | 5 | 57.0% | −0.28 | Practical/design question surprisingly well answered.
Q6 | 3 | 38.3% | +0.48 | Harder – explanation of no starch in dark not fully secure.
Q7 | 6 | 55.0% | −0.20 | Extended answer but well taught – top/mid groups handled it.
Q8 | 4 | 47.5% | +0.10 | Data/graph analysis – respectable, but some wobble.
Q9 | 2 | 35.0% | +0.62 | “Why chlorophyll matters” – conceptual, weaker in mid group.
Q10 | 2 | 37.5% | +0.51 | Transport to leaf – some gaps in processes.
Q11 | 3 | 33.3% | +0.69 | High temperature effect not secure.
Q12 | 3 | 33.3% | +0.69 | “Limiting factor vs necessary” was tricky.
Q13 | 3 | 33.3% | +0.69 | Respiration misconception persists – needs reteach.
Q14 | 3 | 33.3% | +0.69 | Interpreting the table was harder than expected.
Q15 | 4 | 32.5% | +0.73 | Applying to farming – partial transfer only.
Q16 | 3 | 30.0% | +0.85 | Test-for-starch method not fully remembered.
Q17 | 3 | 25.0% | +1.10 | Linking nitrogen → protein → chlorophyll → photosynthesis was hard.
Q18 | 2 | 17.5% | +1.55 | Variegated leaf purpose mostly unknown – too hard for most.
Q19 | 4 | 28.7% | +0.91 | Big evaluative statement – only stronger writers scored well.
Q20 | 4 | 28.7% | +0.91 | Comparative investigation – again, only top half coped.

*Rasch-style difficulty here is a simple logit-like ordering (negative = easier for this class, positive = harder). It’s not a full calibration, but it gives you a ladder.
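
If you want to see where numbers like −0.41 and +1.55 come from, the values in the table are consistent with a plain logit transform of each item's facility (the % of max earned, as a proportion). A minimal sketch of that transform (not a full Rasch partial-credit calibration):

```python
# Minimal sketch: a logit-like difficulty from each item's facility.
# facility = proportion of available marks the class earned (0-1).
import math

def logit_difficulty(facility: float) -> float:
    """Negative = easier for this class, positive = harder."""
    return math.log((1 - facility) / facility)

print(round(logit_difficulty(0.600), 2))  # Q1:  -0.41
print(round(logit_difficulty(0.175), 2))  # Q18:  1.55
```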

Item-level story:

  • Q1–Q5 are your “green” items.
  • Q6–Q10 are “amber/mid”.
  • Q11–Q20 are “red for this class”, especially Q17–Q18.
3. Student Ability (same dataset)

Max possible marks = 63 (sum of all MaxMarks).

Student | Total (out of 63) | Rasch-style ability | Interpretation
S14 | 63 | +4.60 | Mastery for this test; ceilinged the paper. Use harder extension.
S01 | 59 | +2.69 | Very strong – only slipped on hardest items.
S19 | 59 | +2.69 | Very strong – same pattern as S01.
S15 | 59 | +2.69 | Very strong.
S10 | 50 | +1.35 | Upper-mid: secure on core, partial on hard items.
S03 | 50 | +1.35 | Same as S10.
S12 | 37 | +0.35 | Solid pass-level understanding; missed higher items.
S02 | 37 | +0.35 | Same as S12.
S09 | 14 | −1.25 | Could access only the easiest + a few mid questions.
S05 | 14 | −1.25 | Same pattern.
S13 | 14 | −1.25 | Same pattern.
S06 | 14 | −1.25 | Same pattern.
S17 | 14 | −1.25 | Same pattern.
S07 | 4 | −2.69 | Very limited access – test was too hard.
S08 | 4 | −2.69 | Same.
S16 | 4 | −2.69 | Same.
S11 | 3 | −3.00 | Only picked up on 1–2 isolated facts.
S20 | 3 | −3.00 | Ditto.
S04 | 2 | −3.42 | Test well above current level.
S18 | 1 | −4.13 | Needs foundation reteach.

Student-level story: clear top group (S14 → S10), a broad middle (S12, S02), a lower band relying on recall (S09–S17), and a small group for whom the paper was simply too hard.
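
The ability column uses the same idea in reverse: convert each student's total into a proportion of the 63 available marks and take the logit. Again, this is a minimal sketch rather than a proper calibration; a perfect score gives an infinite logit, so scores at the ceiling or floor need a cap or correction (the finite +4.60 shown for S14 suggests one was applied).

```python
# Minimal sketch: a logit-like ability from each student's total mark.
import math

MAX_TOTAL = 63  # sum of MaxMarks across Q1-Q20

def logit_ability(total: int) -> float:
    """Positive = above the middle of this test, negative = below."""
    p = total / MAX_TOTAL
    return math.log(p / (1 - p))

print(round(logit_ability(59), 2))  # S01:  2.69
print(round(logit_ability(14), 2))  # S09: -1.25
# A perfect score (63/63) would give log(1/0); cap or adjust it before use.
```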

4. Quick visuals – Difficulty at a glance

Item difficulty bands

Easy (lots right): Q1 Q3 Q5 Q2 Q7
Mid (about half right): Q4 Q8 Q6 Q9 Q10
Hard (few right): Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20
Very hard (too hard): Q17 Q18

Person–Item “ladder” (simplified)

Harder items →  Q18   Q17   Q19/Q20   Q16   Q15   Q11–14
        |     |     |     |     |
Students →
  S14
  S01 S19 S15
  S10 S03
  S12 S02
  S09 S05 S13 S06 S17
  S07 S08 S16
  S18 S11 S20 S04
Easier items →  Q1 Q2 Q3 Q5
5. Coaching commentary (for the teacher)

What is this test telling you about your teaching?

  • Core photosynthesis ideas are landing. Q1–Q5 were well answered → direct teaching on the basic process was effective.
  • Conceptual linking is the main weakness. Items that asked pupils to connect photosynthesis to wider plant physiology or conditions (very hot days, nutrient deficiency, variegated leaves, necessary vs limiting) all dropped to the 30% band → this is a curriculum sequencing/depth issue.
  • Application and evaluation stretched them. Q15 and Q19 are the kinds of questions we want, but they were too hard for the middle band. Keep them, but scaffold or split.
  • The test was slightly too difficult for the lower third. Bottom pupils were basically excluded from the latter half. If this happens often, you’ll underestimate their progress.

What is it telling you about the test?

  • The first 7 questions did the job of sorting out who had the basics.
  • The later questions (Q11 onwards) compressed too many hard concepts together — so a wrong answer could be “didn’t know” or “too tired”. That weakens inference.
6. Concrete actions

A. For future teaching / planning

  • Reteach “conditions vs factors” (Q12) with classify-and-check hinge questions.
  • Revisit plant nutrition → chlorophyll → photosynthesis chain (Q17).
  • Do a short practical/demo on variegated leaves (Q18).
  • Model evaluative answers (Q19) with paragraph frames.
  • Give the lowest group a parallel, shorter version sampling the same constructs.

B. For redesigning the test

  • Keep Q1–Q5 as anchors.
  • Split hardest items (Q17, Q18, Q19/20) into 2 steps so mid students can earn marks.
  • Add one mid-difficulty data question between Q10 and Q11 to avoid the difficulty cliff.
  • Tag 2–3 items as “common across papers” for future equating.
  • Flag Q18 as “hard – use only when recently taught”.
7. CPD / department discussion prompts
  • Which of our items sat in the 40–60% “sweet spot”, and how can we write more like them?
  • Did we accidentally assess plant nutrition/experiments more than photosynthesis itself?
  • How can we build smaller concept-check questions into lessons so Q11–Q18 aren’t seen for the first time in the test?
  • What should we do with students like S14 (ceilinged) and S18 (shut out) so both are better served next time?
  • If we run this test again, which 3 items must stay the same so we can compare fairly across terms?