HellaSwag: Can a Machine Really Finish Your Sentence? (ACL 2019)

Paper » Code » Dataset »

Pick the best ending to the context.

How to catch dragonflies. Use a long-handled aerial net with a wide opening. Select an aerial net that is 18 inches (46 cm) in diameter or larger. Look for one with a nice long handle.

Announcing HellaSwag, a new dataset for commonsense NLI.

Questions like the one above are trivial for humans (over 95% accuracy), yet current state-of-the-art NLP models built on pretraining struggle (under 48% accuracy). We achieve this difficulty via Adversarial Filtering (AF), a data-collection paradigm in which a series of discriminators iteratively selects an adversarial set of machine-generated wrong answers. AF proves to be surprisingly robust. The key insight is to scale up the length and complexity of the dataset examples toward a critical 'Goldilocks' zone in which generated text is ridiculous to humans, yet often misclassified by state-of-the-art models.
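To make the filtering idea concrete, here is a minimal Python sketch of an AF-style loop. It is not the exact pipeline from the paper: the generator, discriminator, and scoring interface (`generate_candidates`, `train_discriminator`, `model.score`) are hypothetical stand-ins.

```python
import random

def adversarial_filtering(contexts, real_endings, generate_candidates,
                          train_discriminator, num_rounds=20, num_distractors=3):
    """AF-style loop: repeatedly retrain a discriminator and replace the wrong
    endings it finds easy with freshly generated, harder candidates.

    contexts[i] pairs with real_endings[i]; generate_candidates(ctx, k) returns k
    machine-written endings; train_discriminator(examples) returns an object with
    a .score(context, ending) method. All three are hypothetical stand-ins."""
    # Start every example with machine-written distractors.
    distractors = [generate_candidates(ctx, num_distractors) for ctx in contexts]

    for _ in range(num_rounds):
        # Train a fresh discriminator on a random half of the current dataset.
        train_idx = set(random.sample(range(len(contexts)), len(contexts) // 2))
        model = train_discriminator(
            [(contexts[i], real_endings[i], distractors[i]) for i in train_idx])

        # On the held-out half, swap out distractors the discriminator finds easy.
        for i in range(len(contexts)):
            if i in train_idx:
                continue
            for j, fake in enumerate(distractors[i]):
                # If the model already prefers the real ending over this fake,
                # the fake is too easy: replace it with a new candidate.
                if model.score(contexts[i], real_endings[i]) > model.score(contexts[i], fake):
                    distractors[i][j] = generate_candidates(contexts[i], 1)[0]

    # Each example keeps its real ending plus the surviving adversarial distractors.
    return list(zip(contexts, real_endings, distractors))
```

Over many rounds, only wrong endings that fool the current discriminators survive, which is what makes the final dataset hard for models of that family.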

Our construction of HellaSwag, and its resulting difficulty, sheds light on the inner workings of deep pretrained models. More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve adversarially with the state of the art, so as to present ever-harder challenges.

Paper

If the paper inspires you, please cite us:
@inproceedings{zellers2019hellaswag,
    title={HellaSwag: Can a Machine Really Finish Your Sentence?},
    author={Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin},
    booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
    year={2019}
}

HellaSwag Leaderboard

To benchmark approaches to HellaSwag, we have a leaderboard for the test set. Submission is easy! Just follow the instructions on our GitHub page.

As noted in the paper, the test and validation sets contain both in-domain and zero-shot categories, depending on whether the activity or how-to category (e.g., 'Riding a bike') appears in the training set. The dataset includes examples from both ActivityNet and WikiHow. Overall accuracy is averaged over the entire test set.
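For reference, here is a minimal sketch of how these per-category accuracies can be computed on the validation set (test labels are withheld, hence the leaderboard). This is not the official evaluation script; the file name and prediction format are illustrative, though the JSONL fields `label`, `split_type`, and `source_id` follow the released data files.

```python
import json
from collections import defaultdict

def score_predictions(gold_path, predictions):
    """predictions: one predicted ending index per example, aligned with gold_path."""
    correct, total = defaultdict(int), defaultdict(int)
    with open(gold_path) as f:
        for pred, line in zip(predictions, f):
            ex = json.loads(line)
            buckets = [
                "overall",
                # split_type marks whether the example's category is seen in training.
                "in-domain" if ex["split_type"] == "indomain" else "zero-shot",
                # source_id records whether the context comes from ActivityNet or WikiHow.
                "activitynet" if ex["source_id"].startswith("activitynet") else "wikihow",
            ]
            for bucket in buckets:
                total[bucket] += 1
                correct[bucket] += int(pred == ex["label"])
    return {bucket: 100.0 * correct[bucket] / total[bucket] for bucket in total}

# e.g. score_predictions("hellaswag_val.jsonl", my_predictions)
# -> accuracies for overall, in-domain, zero-shot, activitynet, and wikihow
```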

All columns report accuracy (%).

| Rank | Model | Organization | Date | Overall | In-domain | Zero-shot | ActivityNet | WikiHow |
|------|-------|--------------|------|---------|-----------|-----------|-------------|---------|
| - | Human Performance (Zellers et al. '19) | University of Washington | - | 95.6 | 95.6 | 95.7 | 94.0 | 96.5 |
| 1 🥇 | GPT4 base 10-shot | OpenAI | March 8, 2023 | 95.3 | 94.8 | 95.7 | 90.1 | 98.0 |
| 2 | ALUM (https://github.com/namisan/mt-dnn) | MSR | March 23, 2020 | 85.6 | 86.5 | 84.6 | 77.1 | 90.1 |
| 3 | RoBERTa | Facebook AI | July 25, 2019 | 85.2 | 87.3 | 83.1 | 74.6 | 90.9 |
| 4 | G-DAug-inf | Anonymous | February 7, 2020 | 83.7 | 85.6 | 81.8 | 73.0 | 89.6 |
| 5 | HighOrderGN + RoBERTa | USC MOWGLI/INK Lab | January 19, 2020 | 82.2 | 84.3 | 80.2 | 71.5 | 88.1 |
| 6 | Grover-Mega (https://rowanzellers.com/grover) | University of Washington | July 25, 2019 | 75.4 | 79.1 | 71.7 | 64.8 | 81.2 |
| 7 | Grover-Large (https://rowanzellers.com/grover) | University of Washington | July 25, 2019 | 57.2 | 60.7 | 53.6 | 53.3 | 59.2 |
| 8 | BERT-Large | Google AI Language (experiment by Rowan) | May 19, 2019 | 47.3 | 49.7 | 45.0 | 51.7 | 45.0 |
| 9 | GPT | OpenAI (experiment by Rowan) | May 19, 2019 | 41.7 | 44.0 | 39.3 | 43.8 | 40.5 |
| 10 | BERT-Base | Google AI Language (experiment by Rowan) | May 19, 2019 | 40.5 | 42.8 | 38.3 | 45.7 | 37.7 |
| 11 | LSTM+BERT | Baseline (experiment by Rowan) | May 19, 2019 | 36.2 | 38.2 | 34.1 | 40.5 | 33.8 |
| 12 | Rainier UQA T5-Large + Knowledge (https://arxiv.org/abs/2210.03078) | University of Washington | October 19, 2022 | 34.8 | 34.7 | 34.9 | 36.1 | 34.1 |
| 13 | Baseline UQA T5-Large | University of Washington | October 19, 2022 | 34.3 | 34.1 | 34.5 | 35.8 | 33.5 |
| 14 | ESIM-ELMo | Allen Institute for AI (experiment by Rowan) | May 19, 2019 | 33.3 | 34.2 | 32.3 | 36.6 | 31.5 |
| 15 | LSTM+GLoVe | Baseline (experiment by Rowan) | May 19, 2019 | 31.7 | 32.9 | 30.4 | 33.8 | 30.5 |
| 16 | FastText | Facebook AI Research (experiment by Rowan) | May 19, 2019 | 31.6 | 32.9 | 30.2 | 28.4 | 33.3 |
| 17 | LSTM+ELMo | Baseline (experiment by Rowan) | May 19, 2019 | 31.4 | 32.8 | 30.0 | 33.3 | 30.4 |
| 18 | Abductive Reasoning for Unsupervised QA - BERT | University of South Florida and Oklahoma State University | December 31, 2019 | 30.4 | 30.6 | 30.2 | 35.0 | 28.0 |
| 19 | Abductive Reasoning for Unsupervised QA - GPT2 | University of South Florida and Oklahoma State University | December 31, 2019 | 29.5 | 28.5 | 30.5 | 32.6 | 27.9 |
| 20 | Abductive Reasoning for Unsupervised QA - GPT | University of South Florida and Oklahoma State University | December 31, 2019 | 25.8 | 25.9 | 25.7 | 25.8 | 25.8 |
| - | Random Performance | - | - | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |

Contact

Questions about the dataset, or want to get in touch? Contact Rowan Zellers via my contact page, or open a pull request or issue on GitHub.