Excelling at a test often does not translate into excelling at the skills the test purports to measure. This is true not only of humans but also of AI systems, and the more so the greater the claims of the test's significance.
This became evident less than a decade after the introduction of the Winograd Schema Challenge (WSC),3 a test designed to measure an AI system's commonsense reasoning (CSR) ability by having it answer simple questions. An example would be, given the sentence "The sculpture rolled off the shelf because it wasn't anchored," to answer the question "What wasn't anchored?"
There are multiple AI systems2 that achieve human performance on the WSC but are not capable of performing CSR. This would seem to be good reason to retire the WSC to the dustheap of benchmarks which have been conquered for little gain. But Yejin Choi and her colleagues at AI2 have sought to re-engineer the WSC as a more meaningful benchmark of a system's CSR ability. WINOGRANDE is one of a series of groundbreaking papers in which Choi and her team explore new methods of dataset development and adversarial filtering, expressly designed to prevent AI systems from making claims of smashing through benchmarks without making real progress.
Why try to fix the WSC? Why not simply develop a new dataset better suited to measuring CSR ability? The WSC's appeal lies partly in the test's radical simplicity and partly in what success might entail. Levesque proposed that the common task of pronoun resolution (determining which entity a pronoun refers to) could serve as a test of CSR ability and intelligence. For example, consider the question: "Anna did better than Lucy on the test because she had studied so hard. Who studied hard?" Humans easily infer that it is Anna who studied hard: we know that studying hard generally leads to better grades. But a machine without CSR ability likely cannot answer correctly.
Levesque sought to minimize bias in a sentence's structure toward a particular referent by collecting pairs of sentences that were nearly identical. For example, the above sentence could be rewritten as: "Anna did worse than Lucy on the test because she had studied so hard. Who studied hard?" In this case the answer changes: it is Lucy who studied hard. The reasoning is similar, but the substitution of "worse" for "better" leads to a different answer. Such pairs of sentences, named Winograd Schemas, were intended to eliminate the possibility of such structural bias.
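The twin-sentence idea can be made concrete with a small sketch. The Python below is illustrative only: the names SchemaSentence, word_differences, and is_twin_pair are hypothetical and do not come from the WSC or from WINOGRANDE, and the check assumes the two sentences differ in exactly one whitespace-separated word, which is stricter than real schemas require (they may swap a short phrase).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SchemaSentence:
    """One half of a twin pair (hypothetical representation)."""
    text: str
    question: str
    candidates: Tuple[str, str]
    answer: str


def word_differences(a: SchemaSentence, b: SchemaSentence) -> Optional[List[Tuple[str, str]]]:
    """Return the aligned word pairs that differ, or None if the lengths differ."""
    words_a, words_b = a.text.split(), b.text.split()
    if len(words_a) != len(words_b):
        return None
    return [(x, y) for x, y in zip(words_a, words_b) if x != y]


def is_twin_pair(a: SchemaSentence, b: SchemaSentence) -> bool:
    """True when one swapped word flips the answer between the same two candidates."""
    diffs = word_differences(a, b)
    return (
        diffs is not None
        and len(diffs) == 1
        and a.question == b.question
        and a.candidates == b.candidates
        and a.answer != b.answer
    )


# The Anna/Lucy pair from the text.
better = SchemaSentence(
    text="Anna did better than Lucy on the test because she had studied so hard.",
    question="Who studied hard?",
    candidates=("Anna", "Lucy"),
    answer="Anna",
)
worse = SchemaSentence(
    text="Anna did worse than Lucy on the test because she had studied so hard.",
    question="Who studied hard?",
    candidates=("Anna", "Lucy"),
    answer="Lucy",
)
```

Because the two sentences share every word except the swapped one, a system cannot answer both correctly by relying on sentence structure alone; the swapped word has to be understood.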
Achieving near-human performance on Winograd Schemas seemed beyond the capability of AI systems five years ago. But by using deep learning frameworks such as BERT,1 which combine a transformer architecture, statistical natural language processing techniques, and a massive pre-trained language model, AI researchers rapidly developed systems that perform well on the WSC and on other benchmarks such as SuperGLUE,6 while hardly moving the needle on more general AI measures.4
How to fix the WSC to prevent over-estimation of machine performance? WINOGRANDE combines two closely intertwined strategies: generating a large corpus (a drawback of the original WSC was the tiny training corpus released) and filtering out biased examples. The WINOGRANDE corpus was generated by Mechanical Turkers (MTs), who wrote pairs of sentences using anchor words and obeying constraints. Other MTs verified that humans could easily infer the pronoun referents in these sentences. The corpus was then processed with a filtering algorithm that retains only examples that minimize representation bias. Removed pairs include those with dataset-specific polarity bias (for example, advanced rock climbing is more strongly associated with being strong than with being weak). The result is a corpus (~44K examples) on which the best system's accuracy in 2019 was 79.1%, considerably below human level. This outcome, keeping AI systems from reaching human performance levels in the absence of genuine reasoning ability, was the intended goal.
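The filtering step can be sketched in miniature. The code below is a toy illustration, not the paper's algorithm: as far as can be told from the WINOGRANDE paper, its filter scores examples with an ensemble of linear classifiers over precomputed embeddings, whereas this sketch swaps in a simple word-count cue model and repeated cross-validation. All function names, parameters, thresholds, and the sample data are assumptions made for illustration. The idea it shows is the one described above: an example that a shallow model can answer from surface cues alone is treated as biased and removed.

```python
import random
from collections import Counter, defaultdict


def fit_cue_model(train):
    """Count how often each token co-occurs with each label (toy stand-in model)."""
    counts = defaultdict(Counter)
    for tokens, label in train:
        for tok in set(tokens):
            counts[tok][label] += 1
    return counts


def predict(counts, tokens):
    """Vote with P(label | token); return None on no evidence or a tie."""
    votes = Counter()
    for tok in set(tokens):
        seen = counts.get(tok)
        if seen:
            total = sum(seen.values())
            for label, n in seen.items():
                votes[label] += n / total
    ranked = votes.most_common(2)
    if not ranked or (len(ranked) == 2 and ranked[0][1] == ranked[1][1]):
        return None
    return ranked[0][0]


def predictability(examples, n_folds, n_repeats, rng):
    """Fraction of held-out trials in which each example was guessed correctly."""
    correct = [0] * len(examples)
    for _ in range(n_repeats):
        order = list(range(len(examples)))
        rng.shuffle(order)
        for f in range(n_folds):
            held_out = set(order[f::n_folds])  # each example is held out once per repeat
            model = fit_cue_model(
                ex for i, ex in enumerate(examples) if i not in held_out
            )
            for i in held_out:
                tokens, label = examples[i]
                correct[i] += predict(model, tokens) == label
    return [c / n_repeats for c in correct]


def adversarial_filter(examples, n_folds=5, n_repeats=4,
                       threshold=0.75, max_remove=50, seed=0):
    """Repeatedly drop the most predictable examples until none exceed the threshold."""
    rng = random.Random(seed)
    kept = list(examples)
    while kept:
        scores = predictability(kept, n_folds, n_repeats, rng)
        flagged = sorted(
            (i for i, s in enumerate(scores) if s >= threshold),
            key=lambda i: -scores[i],
        )[:max_remove]
        if not flagged:
            break
        drop = set(flagged)
        kept = [ex for i, ex in enumerate(kept) if i not in drop]
    return kept


# Hypothetical data: "strong"/"weak" act as giveaway cues next to "climbing".
biased = [(("climbing", "strong"), 1)] * 20 + [(("climbing", "weak"), 0)] * 20
neutral = [((f"topic{i}", f"trait{i}"), i % 2) for i in range(10)]
filtered = adversarial_filter(biased + neutral)
```

In this toy run the 40 cue-laden examples are always guessed correctly when held out and are removed, while the 10 examples with no shared cue words are kept. A real filter would need a far stronger scoring model, since surface cues in natural text are rarely single words.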
What is the long-term impact? A year later, the Choi team's UNICORN can solve WINOGRANDE problems with an almost human-level 91.28% accuracy, as indicated by the WINOGRANDE leaderboard. AI systems will likely soon solve WINOGRANDE at human level, without necessarily having made real progress on the underlying task of CSR. Arguably, this suggests that solving either the WSC or WINOGRANDE does not demonstrate CSR ability. The contributions of WINOGRANDE, however, go far beyond performance on specific datasets. Importantly, the methodologies introduced in the paper are independent of the WINOGRANDE dataset. Methods used to help MTs generate large-scale corpora can be adapted to create other corpora. The filtering algorithm introduced here can be modified to filter bias and other sources of error more aggressively. These techniques will remain useful whether or not AI systems prematurely achieve human-level performance on any of the multiple corpora that researchers currently target.
To view the accompanying paper, visit doi.acm.org/10.1145/3474381
The Digital Library is published by the Association for Computing Machinery. Copyright © 2021 ACM, Inc.