Students generally regard exams with trepidation for good reason: a bad day can easily trip them up, leaving their grades wanting. Concern over the accuracy of high-stakes exams and standardized tests have led the U.S. and other countries to look at alternatives that are less vulnerable to a poor performance on a single day and which, in principle, offer more accurate ways of determining a student's ability in a particular subject.
In 2019, a team based at King's College London in the U.K. used a long-term study of twins to determine how well teachers' assessments fare against exams in predicting overall performance once compulsory education has finished. They found teachers' assessments are as reliable as test scores at every stage, and recommended this approach to grading could replace some, if not all, high-stakes exams.
Amid the chaos caused by lockdowns in the spring of 2020 to try to limit the spread of the COVID-19 virus, it seemed school students in the U.K., as well as those taking the International Baccalaureate who were due to complete their courses late last year, might for once be able to take advantage of assessments and no longer have to fear the exam for which they hadn't crammed by taking advantage of the reliability of classroom assessments. At least, that was how it seemed before the results were published.
Days after receiving their grades, crowds of students protested outside the Department of Education in central London, claiming their results fell so short of expectations, they must be flawed. It was a similar situation for students around the world who were expecting to graduate from the International Baccalaureate course: many students found their results differed markedly from what they thought their body of coursework indicated.
One possible reason for the backlash was simply that teachers, parents, and pupils had overly optimistic opinions of their abilities that, under normal circumstances, would be tempered by exam results. Stripped of the element of chance inherent in testing, the replacement system seemed to need more convincing explanations of why expectations were not met.
"I think it is hard to think of any process that would not suffer problems," said Imperial College London professor of statistics Guy Nason.
One issue that U.K. politicians in particular wanted to address with the exam-replacement scheme devised by the nation's education regulator Ofqual was one of grade inflation, or at least the perception of it. Since 1992, U.K. schools have had to publish their results for compilation into national league tables. Many parents use those tables to help choose where they try to send their children. This was seen as providing a clear incentive for teachers to deliver optimistic assessments.
France, which last year also opted for grading of coursework by teachers instead of independently marked exam papers, saw almost 96% of candidates obtain a pass for its form of the baccalaureate qualification. The pass rate was 88% the previous year. This, in turn, led to universities in the country opening close to 10,000 additional places for the 2020 intake.
For their respective systems, the U.K. and the International Baccalaureate Organization (IBO) decided to address the potential for grade inflation through semi-automated moderation schemes. These were used to fit the distribution of grades awarded in 2020 to the pattern of prior years. In addition to its moderation scheme, the IBO had a sampling of coursework marked by independent examiners, stating this would help "maximize the confidence that every student will receive a fair mark for their coursework." Rather than attempt to organize a mass marking of coursework by independent examiners, the U.K. education authorities opted to rely purely on algorithms to redistribute the grades.
Once they saw their results, IBO students pointed to apparent anomalies in their awarded grades, which often seemed several points below those indicated by either teacher- or examiner-marked coursework. Students taking U.K. courses encountered more extreme examples. It was possible in some cases for a student expected by their school to obtain an A grade to wind up with one as bad as a U (Unsatisfactory), a grade normally reserved for an exam paper so bad it cannot be marked.
"Part of the issue in this case is that the evidence base that regulators in the U.K. relied on in their algorithms was extremely weak," Nason said, adding that the complexity of the system, which was accompanied by several hundred pages of description, makes it hard to identify the root cause of anomalies. The urgency of the situation brought on by the pandemic meant there was no time for testing, and Ofqual relied on statistical analysis that observers believe was flawed.
"It is possible to get a better understanding of uncertainty in these situations, but gathering and quantifying such evidence is expensive, time-consuming, and complex, and would, in particular, not have been possible in the time frame afforded by this year's pandemic situation," Nason said.
One apparent source of anomalies lies in a decision to try to match the distribution of grades from prior years on a school-by-school basis. That contrasts with what happens with conventional exams: test scores for a particular subject are assigned to grade boundaries across the entire student base. In the U.K., teachers were told to rank the students in their classes in order of ability; the rank ordering would determine who was to have their grades adjusted in order to fit the average distribution of prior years.
The problem that emerged was that schools that had traditionally seen large numbers of poor performers with maybe just a few high-fliers in a given year, encountered large downgrades for those who were unlucky in their ranking. Nason said such rankings have been shown to be subject to high degrees of uncertainty and subjectivity.
Confusion as to how appeals would be dealt with further dented confidence in the approach taken by the U.K. regulator. Helen Smith, a post-graduate researcher working on the use of artificial intelligence in decision-making at the University of Bristol's Center for Ethics in Medicine, claims the process presented by Ofqual contravened the U.K. government's own guidance on algorithmic decision-making: that unexpected results, and not just those due to bias, should be subject to appeal.
Ofqual's senior executives later argued in a hearing held by members of Parliament that they expected the appeals process did allow for unexpected results, and not just those who could claim evidence of bias. However, they said they anticipated the complaints to come from schools, rather than from students or parents directly. Daan Kolkman, a research fellow in computational sociology at the Eindhoven University of Technology in the Netherlands, noted, "Although there was a procedure for redress, it was unclear to many how to appeal their grade."
As the complaints built up, a decision was made in the U.K. in the early autumn to reverse the regrading process and use the unmoderated assessments to award grades. This resulted in a large reduction in complaints, as many of the students who risked being denied university places were able to take up those places. The IBO came to a similar decision for its qualifications.
In the wake of those reverses of policy, Ed Humpherson, director-general of regulation at the U.K.'s Office for Statistics Regulation (OSR), announced a review of the procedures used by Ofqual. "The resulting negative backlash, specifically on the role that algorithms played, threatens to undermine public confidence in statistical models more broadly," he said.
A case like this presents opportunities for governments to learn how to create better processes for algorithms that are used to support decisions affecting individuals. Kolkman argued there is, in general, a clear role for an algorithm watchdog that can prevent data from being used in a way that is difficult to justify. For example, though rank ordering was relatively easy to collect, its high level of uncertainty should probably have led to it being replaced by another, though possibly more expensive to collect, statistic.
"It may be tempting for algorithm developers to work with the data at hand, not the data that is most suited for the purpose of their analysis. This does not necessarily present a problem, if the limitations of the analysis are well-enough understood. I feel the problem does not necessarily lie with the algorithms we use, but with our lack of robust quality assurance and clear redress procedures," Kolkman said.
The need to test algorithms before deployment was one tackled before the COVID pandemic hit by Sir David Spiegelhalter, chairman of the University of Cambridge's Winton Center of Risk and Evidence Communications. He would later join the OSR's panel investigating Ofqual's decisions. In a paper published early last year, Spiegelhalter proposed borrowing auditing techniques from the pharmaceutical world that would put big-data algorithms into trials before they could be approved. Kolkman said the pharmaceutical-trials model would likely be too onerous for all systems, while other areas, such as food safety, might provide models for less-impactful algorithms.
Kolkman takes the view that algorithmic analysis should be conducted not just by subject-matter experts, but by laypeople as well, in order to ensure the decisions being made by the systems are as comprehensible as possible. Sometimes the question is bigger than which algorithm provides the best replacement for an existing system that may itself be flawed.
Rebecca Cairns of the Deakin School of Education in Australia points to the work by Kings College London on the value of teacher assessment versus exams. She and others see the post-pandemic environment as an opportunity to take the time to re-examine how the evaluation of students' work should take place, and the kinds of data needed to deliver it. However, it may take more upheaval, such as another year of canceled exams, before that debate begins.
Rimfeld, K., Malanchini, M., Hannigan, L.J., Dale, P.S., Allen, R., Hart, S.A., and Plomin, R.
Teacher assessments during compulsory education are as reliable, stable and heritable as standardized test scores The Journal of Child Psychology and Psychiatry (2019). Volume 60, Issue 12. DOI: 10.1111/jcpp.13070
Algorithmic bias: should students pay the price? AI and Society (2020). DOI: 10.1007/s00146-020-01054-3
The (in)credibility of algorithmic models to non-experts Information, Communication & Society (2020). DOI: 10.1080/1369118X.2020.1761860
Should we trust algorithms? Harvard Data Science Review, 2.1 (2020). DOI: 10.1162/99608f92.cb91a35a
©2021 ACM 0001-0782/21/6
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from firstname.lastname@example.org or fax (212) 869-0481.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2021 ACM, Inc.
No entries found