TuringAdvice: A Generative and Dynamic Evaluation of Language Use (NAACL 2021)
Paper (arxiv) » Code (github) »

We propose TuringAdvice, a new challenge task and dataset for language understanding models. Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language. Our evaluation framework tests a fundamental aspect of human language understanding: our ability to use language to resolve open-ended situations by communicating with each other.
Empirical results show that today's models struggle at TuringAdvice, even multibillion-parameter models finetuned on 600k in-domain training examples. The best model, a finetuned T5, writes advice that is at least as helpful as human-written advice in only 14% of cases; a much larger non-finetunable GPT-3 model does even worse at 4%. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.
TuringAdvice is an ongoing effort. For announcements, subscribe to our Google group » and follow Rowan on Twitter »
Leaderboard
We present a dynamic leaderboard and dataset for TuringAdvice. The dataset, RedditAdvice, is dynamic because at each evaluation it pulls situations posted to Reddit's advice communities within the previous two weeks. Thus, machines must tackle the same task as humans: providing advice for recently written situations. There's no historic test set to overfit to.
We rank models by the human preference rate: the frequency (%) of their advice being chosen over the community-endorsed advice from Reddit (the advice with the most upvotes), in a head to head comparison. Human performance is at 50% by definition, while machine performance tends to be much worse.
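The human preference rate described above can be sketched as a simple fraction of head-to-head wins. This is an illustrative computation, not the project's evaluation code; the function name and judgment format are assumptions:

```python
def preference_rate(judgments):
    """Percentage of head-to-head comparisons in which raters preferred
    the model's advice over the top-voted Reddit advice.

    `judgments` is a list of booleans, one per comparison:
    True means the model's advice was chosen.
    """
    if not judgments:
        return 0.0
    return 100.0 * sum(judgments) / len(judgments)

# For example, a model preferred in 9 of 100 comparisons scores 9.0,
# while a human answer compared against itself would score 50.0 on average.
print(preference_rate([True] * 9 + [False] * 91))  # 9.0
```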
| Rank (updated Feb 12, 2020) | Model | Human Preference Rate (%) | Parameter count |
|---|---|---|---|
| — | Human Performance | 50.0 | — |
| 🥇 (Feb 12, 2020) | T5-11B, Google (experiment by Rowan) | 9.0 | 11 billion |
| 2 (Feb 12, 2020) | T5-3B, Google (experiment by Rowan) | 6.0 | 3 billion |
| 3 (Feb 12, 2020) | Grover-Mega, University of Washington (https://rowanzellers.com/grover) | 4.0 | 1.5 billion |
| 4 (Feb 12, 2020) | Grover-Large, University of Washington (https://rowanzellers.com/grover) | 3.5 | 355 million |
| 5 (Feb 12, 2020) | TF-IDF retrieval | 2.0 | — |
Get started
Our dynamic dataset, RedditAdvice, is complemented by a large static dataset (RedditAdvice2019) for training models, with 600k training examples and 16k validation examples. You can train on RedditAdvice2019 or on other data of your choosing. Download it at the link below.
Submitting to our leaderboard is easy. You'll need to set up a simple web API - no Docker required! The instructions are on the GitHub page; just follow the link below.
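As a rough illustration, a submission server might look like the stdlib-only sketch below. The endpoint, request/response fields, and `generate_advice` placeholder are all assumptions for exposition; the actual API schema is specified in the GitHub instructions:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def generate_advice(situation: str) -> str:
    """Placeholder: swap in your model's generation call here."""
    return "It might help to talk this over with someone you trust."

class AdviceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON payload containing the advice-seeking situation.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        advice = generate_advice(payload.get("situation", ""))
        # Respond with the model-written advice as JSON.
        body = json.dumps({"advice": advice}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), AdviceHandler).serve_forever()
```

The key design point is that the leaderboard calls your server with fresh situations at evaluation time, so your model generates advice on data it has never seen.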
Paper: Evaluating Machines by their Real-World Language Use
If our paper inspires you, please cite us:

@inproceedings{zellers-etal-2021-turingadvice,
    title = "{T}uring{A}dvice: A Generative and Dynamic Evaluation of Language Use",
    author = "Zellers, Rowan and Holtzman, Ari and Clark, Elizabeth and Qin, Lianhui and Farhadi, Ali and Choi, Yejin",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.naacl-main.386",
    pages = "4856--4880",
}