Pick the best ending to the context.
How to catch dragonflies. Use a long-handled aerial net with a wide opening. Select an aerial net that is 18 inches (46 cm) in diameter or larger. Look for one with a nice long handle.
Announcing HellaSwag, a new dataset for commonsense NLI.
Questions like the above are trivial for humans, who answer with over 95% accuracy, yet current state-of-the-art NLP models built on pretraining struggle, scoring under 48% accuracy. We achieve this via Adversarial Filtering (AF), a data-collection paradigm in which a series of discriminators iteratively selects an adversarial set of machine-generated wrong answers. AF proves to be surprisingly robust. The key insight is to scale up the length and complexity of the dataset examples toward a critical 'Goldilocks' zone in which generated text is ridiculous to humans, yet often misclassified by state-of-the-art models.
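To make the AF loop concrete, here is a minimal Python sketch. The names `train_discriminator`, `generate_endings`, and `model.score` are placeholders supplied by the caller, not the paper's actual code; the sketch only illustrates the iterate-train-and-replace structure described above.

```python
import random

def adversarial_filtering(examples, train_discriminator, generate_endings, num_rounds=20):
    """Minimal sketch of Adversarial Filtering (AF).

    Each example is assumed to be a dict with a 'context', a human-written
    'gold' ending, and a list of machine-generated 'negatives'. The two
    callables are placeholders for any trainable discriminator and any
    text generator (assumptions of this sketch, not the released code).
    """
    for _ in range(num_rounds):
        random.shuffle(examples)
        mid = len(examples) // 2
        train, heldout = examples[:mid], examples[mid:]

        # Re-train a discriminator to tell gold endings from generated ones.
        model = train_discriminator(train)

        # On held-out data, replace "easy" negatives (ones the current
        # discriminator already rejects) with fresh generations that fool it.
        for ex in heldout:
            for i, neg in enumerate(ex['negatives']):
                if model.score(ex['context'], neg) < model.score(ex['context'], ex['gold']):
                    candidates = generate_endings(ex['context'])
                    # Keep the candidate the discriminator finds most plausible.
                    ex['negatives'][i] = max(
                        candidates, key=lambda c: model.score(ex['context'], c))
    return examples
```

Because the discriminator is repeatedly re-trained and swapped out, the negatives that survive are hard for a whole series of models rather than for any single one.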
Our construction of HellaSwag, and its resulting difficulty, sheds light on the inner workings of deep pretrained models. More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve adversarially with the state of the art, presenting ever-harder challenges.
Paper
If the paper inspires you, please cite us:

@inproceedings{zellers2019hellaswag,
  title={HellaSwag: Can a Machine Really Finish Your Sentence?},
  author={Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin},
  booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
  year={2019}
}
HellaSwag Leaderboard
To benchmark approaches to HellaSwag, we have a leaderboard for the test set. Submission is easy: just follow the instructions on our GitHub page.
As noted in the paper, the test and validation sets contain both in-domain and zero-shot categories, corresponding to whether the activity or how-to category (e.g. 'Riding a bike') appears in the training set. The dataset includes examples from both ActivityNet and WikiHow. Overall accuracy is averaged over the entire test set.
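As a rough illustration of how the leaderboard columns below relate, here is a small Python sketch that buckets accuracy by split type and by source. The field names ('split_type', 'source_id', 'label') follow the released HellaSwag jsonl files, but treat them as assumptions if your copy differs; since test labels are not public, a sketch like this only applies to the validation set.

```python
import json
from collections import defaultdict

def leaderboard_accuracies(gold_jsonl_path, predicted_labels):
    """Compute overall / in-domain / zero-shot / ActivityNet / WikiHow accuracy.

    `predicted_labels` holds one predicted ending index per example, in file order.
    Assumed jsonl fields: 'split_type' ('indomain' or 'zeroshot'),
    'source_id' (prefixed with 'activitynet' or 'wikihow'), and 'label'.
    """
    correct, total = defaultdict(int), defaultdict(int)
    with open(gold_jsonl_path) as f:
        for line, pred in zip(f, predicted_labels):
            ex = json.loads(line)
            source = 'activitynet' if ex['source_id'].startswith('activitynet') else 'wikihow'
            for bucket in ('overall', ex['split_type'], source):
                total[bucket] += 1
                correct[bucket] += int(pred == int(ex['label']))
    return {bucket: correct[bucket] / total[bucket] for bucket in total}
```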
| Rank | Date | Model | Submitted by | Overall accuracy | In-domain category accuracy | Zero-shot category accuracy | ActivityNet accuracy | WikiHow accuracy |
|---|---|---|---|---|---|---|---|---|
| | | Human Performance | University of Washington (Zellers et al. '19) | 95.6 | 95.6 | 95.7 | 94.0 | 96.5 |
| 🥇 1 | March 8, 2023 | GPT4 base 10-shot | OpenAI | 95.3 | 94.8 | 95.7 | 90.1 | 98.0 |
| 2 | March 23, 2020 | ALUM | MSR (https://github.com/namisan/mt-dnn) | 85.6 | 86.5 | 84.6 | 77.1 | 90.1 |
| 3 | July 25, 2019 | RoBERTa | Facebook AI | 85.2 | 87.3 | 83.1 | 74.6 | 90.9 |
| 4 | February 7, 2020 | G-DAug-inf | Anonymous | 83.7 | 85.6 | 81.8 | 73.0 | 89.6 |
| 5 | January 19, 2020 | HighOrderGN + RoBERTa | USC MOWGLI/INK Lab | 82.2 | 84.3 | 80.2 | 71.5 | 88.1 |
| 6 | July 25, 2019 | Grover-Mega | University of Washington (https://rowanzellers.com/grover) | 75.4 | 79.1 | 71.7 | 64.8 | 81.2 |
| 7 | July 25, 2019 | Grover-Large | University of Washington (https://rowanzellers.com/grover) | 57.2 | 60.7 | 53.6 | 53.3 | 59.2 |
| 8 | May 19, 2019 | BERT-Large | Google AI Language (experiment by Rowan) | 47.3 | 49.7 | 45.0 | 51.7 | 45.0 |
| 9 | May 19, 2019 | GPT | OpenAI (experiment by Rowan) | 41.7 | 44.0 | 39.3 | 43.8 | 40.5 |
| 10 | May 19, 2019 | BERT-Base | Google AI Language (experiment by Rowan) | 40.5 | 42.8 | 38.3 | 45.7 | 37.7 |
| 11 | May 19, 2019 | LSTM+BERT | Baseline (experiment by Rowan) | 36.2 | 38.2 | 34.1 | 40.5 | 33.8 |
| 12 | October 19, 2022 | Rainier (UQA T5-Large + Knowledge) | University of Washington (https://arxiv.org/abs/2210.03078) | 34.8 | 34.7 | 34.9 | 36.1 | 34.1 |
| 13 | October 19, 2022 | Baseline UQA T5-Large | University of Washington | 34.3 | 34.1 | 34.5 | 35.8 | 33.5 |
| 14 | May 19, 2019 | ESIM-ELMo | Allen Institute for AI (experiment by Rowan) | 33.3 | 34.2 | 32.3 | 36.6 | 31.5 |
| 15 | May 19, 2019 | LSTM+GloVe | Baseline (experiment by Rowan) | 31.7 | 32.9 | 30.4 | 33.8 | 30.5 |
| 16 | May 19, 2019 | FastText | Facebook AI Research (experiment by Rowan) | 31.6 | 32.9 | 30.2 | 28.4 | 33.3 |
| 17 | May 19, 2019 | LSTM+ELMo | Baseline (experiment by Rowan) | 31.4 | 32.8 | 30.0 | 33.3 | 30.4 |
| 18 | December 31, 2019 | Abductive Reasoning for Unsupervised QA - BERT | University of South Florida and Oklahoma State University | 30.4 | 30.6 | 30.2 | 35.0 | 28.0 |
| 19 | December 31, 2019 | Abductive Reasoning for Unsupervised QA - GPT2 | University of South Florida and Oklahoma State University | 29.5 | 28.5 | 30.5 | 32.6 | 27.9 |
| 20 | April 16, 2024 | RM2 | grimtin10 | 26.8 | 27.6 | 26.1 | 27.9 | 26.2 |
| 21 | December 31, 2019 | Abductive Reasoning for Unsupervised QA - GPT | University of South Florida and Oklahoma State University | 25.8 | 25.9 | 25.7 | 25.8 | 25.8 |
| | | Random Performance | | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
Contact
Questions about the dataset, or want to get in touch? Contact Rowan Zellers via the contact page, or open a pull request or issue on GitHub.