HellaSwag: Can a Machine Really Finish Your Sentence? (ACL 2019)


Pick the best ending to the context.

How to catch dragonflies. Use a long-handled aerial net with a wide opening. Select an aerial net that is 18 inches (46 cm) in diameter or larger. Look for one with a nice long handle.

Announcing HellaSwag, a new dataset for commonsense NLI.

Questions like the one above are trivial for humans, who answer with over 95% accuracy, but state-of-the-art pretrained NLP models struggle, scoring under 48%. We achieve this via Adversarial Filtering (AF), a data collection paradigm in which a series of discriminators iteratively selects an adversarial set of machine-generated wrong answers. AF proves surprisingly robust. The key insight is to scale up the length and complexity of the dataset examples toward a critical 'Goldilocks' zone in which generated text is ridiculous to humans, yet often misclassified by state-of-the-art models.
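As a rough illustration, the AF loop can be sketched as below. This is a simplified sketch, not the authors' implementation: the example structure, the `train_classifier` callable, and the per-example candidate pools are all assumptions made for the sake of the example.

```python
def adversarial_filtering(examples, candidate_pool, train_classifier,
                          n_wrong=3, n_iters=5):
    """Iteratively swap out machine-generated wrong endings that a freshly
    trained discriminator finds easy to reject, keeping the ones it
    misclassifies as real.

    examples:        list of dicts with 'ctx', 'gold', and 'wrong' endings
                     (illustrative schema, not the released file format)
    candidate_pool:  per-example lists of spare generated endings
    train_classifier: callable that retrains a discriminator on the current
                     adversarial set and returns score(ctx, ending) -> float,
                     where higher means "judged more likely to be real"
    """
    for _ in range(n_iters):
        score = train_classifier(examples)  # retrain on the current set
        for ex, pool in zip(examples, candidate_pool):
            # Rank all available wrong endings by how "real" the current
            # discriminator thinks they are; keep the hardest n_wrong.
            ranked = sorted(ex["wrong"] + pool,
                            key=lambda e: score(ex["ctx"], e),
                            reverse=True)
            ex["wrong"], remainder = ranked[:n_wrong], ranked[n_wrong:]
            pool[:] = remainder  # easy endings go back to the pool
    return examples
```

In the paper, the discriminator role is filled by strong pretrained models (BERT-Large), which is why the surviving wrong answers remain hard even for other state-of-the-art models.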

Our construction of HellaSwag, and its resulting difficulty, sheds light on the inner workings of deep pretrained models. More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve adversarially with the state of the art, presenting ever-harder challenges.

Paper

If the paper inspires you, please cite us:
@inproceedings{zellers2019hellaswag,
    title={HellaSwag: Can a Machine Really Finish Your Sentence?},
    author={Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin},
    booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
    year={2019}
}

HellaSwag Leaderboard

To benchmark approaches to HellaSwag, we host a leaderboard for the test set. Submission is easy! Just follow the instructions on our GitHub page.

As noted in the paper, the test and validation sets contain both in-domain and zero-shot categories, corresponding to whether the activity or how-to category (e.g., 'Riding a bike') appears in the training set. The dataset includes examples from both ActivityNet and WikiHow. Overall accuracy is averaged over the entire test set.
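Given per-example predictions, the per-split accuracies reported below can be computed along these lines. This is a sketch, not the official evaluation script; the `label` and `split_type` field names follow the released JSON files, but verify them against the actual data.

```python
def split_accuracies(examples, predictions):
    """Compute overall / in-domain / zero-shot accuracy (in percent).

    examples:    list of dicts with 'label' (gold ending index) and
                 'split_type' ('indomain' or 'zeroshot')
    predictions: predicted ending index for each example, same order
    """
    def acc(pairs):
        if not pairs:
            return 0.0
        return 100.0 * sum(pred == ex["label"] for ex, pred in pairs) / len(pairs)

    pairs = list(zip(examples, predictions))
    return {
        "overall": acc(pairs),
        "indomain": acc([p for p in pairs if p[0]["split_type"] == "indomain"]),
        "zeroshot": acc([p for p in pairs if p[0]["split_type"] == "zeroshot"]),
    }
```

The overall figure is a plain average over all test examples, not a mean of the two split accuracies, so the splits' relative sizes matter.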

| Rank | Model | Affiliation | Date | Overall accuracy | In-domain accuracy | Zero-shot accuracy | ActivityNet accuracy | WikiHow accuracy |
|------|-------|-------------|------|------------------|--------------------|--------------------|----------------------|------------------|
| – | Human Performance | University of Washington (Zellers et al. '19) | – | 95.6 | 95.6 | 95.7 | 94.0 | 96.5 |
| 🥇 1 | RoBERTa | Facebook AI | July 25, 2019 | 85.2 | 87.3 | 83.1 | 74.6 | 90.9 |
| 2 | Grover-Mega | University of Washington (https://rowanzellers.com/grover) | May 29, 2019 | 75.4 | 79.1 | 71.7 | 64.8 | 81.2 |
| 3 | Grover-Large | University of Washington (https://rowanzellers.com/grover) | May 29, 2019 | 57.2 | 60.7 | 53.6 | 53.3 | 59.2 |
| 4 | BERT-Large | Google AI Language (experiment by Rowan) | May 19, 2019 | 47.3 | 49.7 | 45.0 | 51.7 | 45.0 |
| 5 | GPT | OpenAI (experiment by Rowan) | May 19, 2019 | 41.7 | 44.0 | 39.3 | 43.8 | 40.5 |
| 6 | BERT-Base | Google AI Language (experiment by Rowan) | May 19, 2019 | 40.5 | 42.8 | 38.3 | 45.7 | 37.7 |
| 7 | LSTM+BERT | Baseline (experiment by Rowan) | May 19, 2019 | 36.2 | 38.2 | 34.1 | 40.5 | 33.8 |
| 8 | ESIM-ELMo | Allen Institute for AI (experiment by Rowan) | May 19, 2019 | 33.3 | 34.2 | 32.3 | 36.6 | 31.5 |
| 9 | LSTM+GLoVe | Baseline (experiment by Rowan) | May 19, 2019 | 31.7 | 32.9 | 30.4 | 33.8 | 30.5 |
| 10 | FastText | Facebook AI Research (experiment by Rowan) | May 19, 2019 | 31.6 | 32.9 | 30.2 | 28.4 | 33.3 |
| 11 | LSTM+ELMo | Baseline (experiment by Rowan) | May 19, 2019 | 31.4 | 32.8 | 30.0 | 33.3 | 30.4 |
| – | Random Performance | – | – | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |

Contact

Questions about the dataset, or want to get in touch? Contact Rowan Zellers on Twitter, open a pull request on GitHub, or send an email.