HellaSwag: Can a Machine Really Finish Your Sentence? (ACL 2019)

Paper » Code » Dataset »

Pick the best ending to the context.

How to catch dragonflies. Use a long-handled aerial net with a wide opening. Select an aerial net that is 18 inches (46 cm) in diameter or larger. Look for one with a nice long handle.

Announcing HellaSwag, a new dataset for commonsense NLI.

Questions like the one above are trivial for humans (over 95% accuracy), yet current state-of-the-art NLP models built on pretraining struggle (under 48% accuracy). We achieve this difficulty via Adversarial Filtering (AF), a data-collection paradigm in which a series of discriminators iteratively selects an adversarial set of machine-generated wrong answers. AF proves to be surprisingly robust. The key insight is to scale up the length and complexity of the dataset examples toward a critical 'Goldilocks' zone in which generated text is ridiculous to humans, yet often misclassified by state-of-the-art models.
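To make the filtering idea concrete, here is a minimal Python sketch of an AF-style loop. It is not the exact pipeline from the paper: the generator, discriminator, and scoring interface (`generate_candidates`, `train_discriminator`, `model.score`) are hypothetical stand-ins.

```python
import random

def adversarial_filtering(contexts, real_endings, generate_candidates,
                          train_discriminator, num_rounds=20, num_distractors=3):
    """AF-style loop: repeatedly retrain a discriminator and replace the wrong
    endings it finds easy with freshly generated, harder candidates.

    contexts[i] pairs with real_endings[i]; generate_candidates(ctx, k) returns k
    machine-written endings; train_discriminator(examples) returns an object with
    a .score(context, ending) method. All three are hypothetical stand-ins."""
    # Start every example with machine-written distractors.
    distractors = [generate_candidates(ctx, num_distractors) for ctx in contexts]

    for _ in range(num_rounds):
        # Train a fresh discriminator on a random half of the current dataset.
        train_idx = set(random.sample(range(len(contexts)), len(contexts) // 2))
        model = train_discriminator(
            [(contexts[i], real_endings[i], distractors[i]) for i in train_idx])

        # On the held-out half, swap out distractors the discriminator finds easy.
        for i in range(len(contexts)):
            if i in train_idx:
                continue
            for j, fake in enumerate(distractors[i]):
                # If the model already prefers the real ending over this fake,
                # the fake is too easy: replace it with a new candidate.
                if model.score(contexts[i], real_endings[i]) > model.score(contexts[i], fake):
                    distractors[i][j] = generate_candidates(contexts[i], 1)[0]

    # Each example keeps its real ending plus the surviving adversarial distractors.
    return list(zip(contexts, real_endings, distractors))
```

Over many rounds, only wrong endings that fool the current discriminators survive, which is what makes the final dataset hard for models of that family.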

Our construction of HellaSwag, and its resulting difficulty, sheds light on the inner workings of deep pretrained models. More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve adversarially with the state of the art, so as to present ever-harder challenges.

Paper

If the paper inspires you, please cite us:
@inproceedings{zellers2019hellaswag,
    title={HellaSwag: Can a Machine Really Finish Your Sentence?},
    author={Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin},
    booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
    year={2019}
}

HellaSwag Leaderboard

To benchmark approaches to HellaSwag, we have a leaderboard for the test set. Submission is easy! Just follow the instructions on our GitHub page.

As noted in the paper, the test and validation sets contain both in-domain and zero-shot categories, depending on whether the activity or how-to category (e.g., 'Riding a bike') appears in the training set. The dataset includes examples from both ActivityNet and WikiHow. Overall accuracy is averaged over the entire test set.
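For reference, here is a minimal sketch of how these per-category accuracies can be computed on the validation set (test labels are withheld, hence the leaderboard). This is not the official evaluation script; the file name and prediction format are illustrative, though the JSONL fields `label`, `split_type`, and `source_id` follow the released data files.

```python
import json
from collections import defaultdict

def score_predictions(gold_path, predictions):
    """predictions: one predicted ending index per example, aligned with gold_path."""
    correct, total = defaultdict(int), defaultdict(int)
    with open(gold_path) as f:
        for pred, line in zip(predictions, f):
            ex = json.loads(line)
            buckets = [
                "overall",
                # split_type marks whether the example's category is seen in training.
                "in-domain" if ex["split_type"] == "indomain" else "zero-shot",
                # source_id records whether the context comes from ActivityNet or WikiHow.
                "activitynet" if ex["source_id"].startswith("activitynet") else "wikihow",
            ]
            for bucket in buckets:
                total[bucket] += 1
                correct[bucket] += int(pred == ex["label"])
    return {bucket: 100.0 * correct[bucket] / total[bucket] for bucket in total}

# e.g. score_predictions("hellaswag_val.jsonl", my_predictions)
# -> accuracies for overall, in-domain, zero-shot, activitynet, and wikihow
```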

All columns report accuracy (%).

| Rank | Model | Organization | Date | Overall | In-domain | Zero-shot | ActivityNet | WikiHow |
|------|-------|--------------|------|---------|-----------|-----------|-------------|---------|
| - | Human Performance (Zellers et al. '19) | University of Washington | - | 95.6 | 95.6 | 95.7 | 94.0 | 96.5 |
| 1 🥇 | GPT4 base 10-shot | OpenAI | March 8, 2023 | 95.3 | 94.8 | 95.7 | 90.1 | 98.0 |
| 2 | ALUM (https://github.com/namisan/mt-dnn) | MSR | March 23, 2020 | 85.6 | 86.5 | 84.6 | 77.1 | 90.1 |
| 3 | RoBERTa | Facebook AI | July 25, 2019 | 85.2 | 87.3 | 83.1 | 74.6 | 90.9 |
| 4 | G-DAug-inf | Anonymous | February 7, 2020 | 83.7 | 85.6 | 81.8 | 73.0 | 89.6 |
| 5 | HighOrderGN + RoBERTa | USC MOWGLI/INK Lab | January 19, 2020 | 82.2 | 84.3 | 80.2 | 71.5 | 88.1 |
| 6 | Grover-Mega (https://rowanzellers.com/grover) | University of Washington | July 25, 2019 | 75.4 | 79.1 | 71.7 | 64.8 | 81.2 |
| 7 | Grover-Large (https://rowanzellers.com/grover) | University of Washington | July 25, 2019 | 57.2 | 60.7 | 53.6 | 53.3 | 59.2 |
| 8 | BERT-Large | Google AI Language (experiment by Rowan) | May 19, 2019 | 47.3 | 49.7 | 45.0 | 51.7 | 45.0 |
| 9 | GPT | OpenAI (experiment by Rowan) | May 19, 2019 | 41.7 | 44.0 | 39.3 | 43.8 | 40.5 |
| 10 | BERT-Base | Google AI Language (experiment by Rowan) | May 19, 2019 | 40.5 | 42.8 | 38.3 | 45.7 | 37.7 |
| 11 | LSTM+BERT | Baseline (experiment by Rowan) | May 19, 2019 | 36.2 | 38.2 | 34.1 | 40.5 | 33.8 |
| 12 | Rainier UQA T5-Large + Knowledge (https://arxiv.org/abs/2210.03078) | University of Washington | October 19, 2022 | 34.8 | 34.7 | 34.9 | 36.1 | 34.1 |
| 13 | Baseline UQA T5-Large | University of Washington | October 19, 2022 | 34.3 | 34.1 | 34.5 | 35.8 | 33.5 |
| 14 | ESIM-ELMo | Allen Institute for AI (experiment by Rowan) | May 19, 2019 | 33.3 | 34.2 | 32.3 | 36.6 | 31.5 |
| 15 | LSTM+GLoVe | Baseline (experiment by Rowan) | May 19, 2019 | 31.7 | 32.9 | 30.4 | 33.8 | 30.5 |
| 16 | FastText | Facebook AI Research (experiment by Rowan) | May 19, 2019 | 31.6 | 32.9 | 30.2 | 28.4 | 33.3 |
| 17 | LSTM+ELMo | Baseline (experiment by Rowan) | May 19, 2019 | 31.4 | 32.8 | 30.0 | 33.3 | 30.4 |
| 18 | Abductive Reasoning for Unsupervised QA - BERT | University of South Florida and Oklahoma State University | December 31, 2019 | 30.4 | 30.6 | 30.2 | 35.0 | 28.0 |
| 19 | Abductive Reasoning for Unsupervised QA - GPT2 | University of South Florida and Oklahoma State University | December 31, 2019 | 29.5 | 28.5 | 30.5 | 32.6 | 27.9 |
| 20 | Abductive Reasoning for Unsupervised QA - GPT | University of South Florida and Oklahoma State University | December 31, 2019 | 25.8 | 25.9 | 25.7 | 25.8 | 25.8 |
| - | Random Performance | - | - | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |

Contact

Questions about the dataset, or want to get in touch? Contact Rowan Zellers via my contact page, or open a pull request or issue on GitHub.