TuringAdvice: A Generative and Dynamic Evaluation of Language Use (NAACL 2021)
Paper (arxiv) » Code (github) »

We propose TuringAdvice, a new challenge task and dataset for language understanding models. Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language. Our evaluation framework tests a fundamental aspect of human language understanding: our ability to use language to resolve open-ended situations by communicating with each other.
Empirical results show that today's models struggle at TuringAdvice, even multibillion-parameter models finetuned on 600k in-domain training examples. The best model, a finetuned T5, writes advice that is at least as helpful as human-written advice in only 14% of cases; a much larger non-finetunable GPT-3 model does even worse at 4%. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.
TuringAdvice is an ongoing effort. For announcements, subscribe to our Google group » and follow Rowan on Twitter »
Leaderboard
We present a dynamic leaderboard and dataset for TuringAdvice. The dataset, RedditAdvice, is dynamic because at each evaluation it pulls situations posted to Reddit's advice communities within the previous two weeks. Thus, machines must tackle the same task as humans: providing advice for recently written situations. There's no historic test set to overfit to.
We rank models by the human preference rate: the frequency (%) of their advice being chosen over the community-endorsed advice from Reddit (the advice with the most upvotes), in a head to head comparison. Human performance is at 50% by definition, while machine performance tends to be much worse.
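The human preference rate described above can be sketched as a simple fraction of head-to-head wins. This is an illustrative computation, not the project's evaluation code; the function name and judgment format are assumptions:

```python
def preference_rate(judgments):
    """Percentage of head-to-head comparisons in which raters preferred
    the model's advice over the top-voted Reddit advice.

    `judgments` is a list of booleans, one per comparison:
    True means the model's advice was chosen.
    """
    if not judgments:
        return 0.0
    return 100.0 * sum(judgments) / len(judgments)

# For example, a model preferred in 9 of 100 comparisons scores 9.0,
# while a human answer compared against itself would score 50.0 on average.
print(preference_rate([True] * 9 + [False] * 91))  # 9.0
```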
| Rank (updated Feb 12, 2020) | Model | Human Preference Rate (%) | Parameter count |
|---|---|---|---|
| — | Human Performance | 50.0 | — |
| 🥇 (Feb 12, 2020) | T5-11B, Google (experiment by Rowan) | 9.0 | 11 billion |
| 2 (Feb 12, 2020) | T5-3B, Google (experiment by Rowan) | 6.0 | 3 billion |
| 3 (Feb 12, 2020) | Grover-Mega, University of Washington (https://rowanzellers.com/grover) | 4.0 | 1.5 billion |
| 4 (Feb 12, 2020) | Grover-Large, University of Washington (https://rowanzellers.com/grover) | 3.5 | 355 million |
| 5 (Feb 12, 2020) | TF-IDF retrieval | 2.0 | — |
Get started
Our dynamic dataset, RedditAdvice, is complemented by a large static dataset (RedditAdvice2019) for training models, with 600k training examples and 16k validation examples. You can train on RedditAdvice2019 or on other data of your choosing. Download it at the link below.
Submitting to our leaderboard is easy. You'll need to set up a simple web API - no Docker required! The instructions are on the GitHub page; just follow the link below.
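As a rough illustration, a submission server might look like the stdlib-only sketch below. The endpoint, request/response fields, and `generate_advice` placeholder are all assumptions for exposition; the actual API schema is specified in the GitHub instructions:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def generate_advice(situation: str) -> str:
    """Placeholder: swap in your model's generation call here."""
    return "It might help to talk this over with someone you trust."

class AdviceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON payload containing the advice-seeking situation.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        advice = generate_advice(payload.get("situation", ""))
        # Respond with the model-written advice as JSON.
        body = json.dumps({"advice": advice}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), AdviceHandler).serve_forever()
```

The key design point is that the leaderboard calls your server with fresh situations at evaluation time, so your model generates advice on data it has never seen.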
Paper: Evaluating Machines by their Real-World Language Use
If our paper inspires you, please cite us:

@inproceedings{zellers-etal-2021-turingadvice,
    title = "{T}uring{A}dvice: A Generative and Dynamic Evaluation of Language Use",
    author = "Zellers, Rowan and Holtzman, Ari and Clark, Elizabeth and Qin, Lianhui and Farhadi, Ali and Choi, Yejin",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.naacl-main.386",
    pages = "4856--4880",
}