Evaluating Machines by their Real-World Language Use

Paper (arxiv) » Code (github) »

There is a fundamental gap between how humans understand and use language - in open-ended, real-world situations - and today's NLP benchmarks for language understanding. To narrow this gap, we propose to evaluate machines by their success at real-world language use.

One example of real-world language use is when we give advice to others. We thus present TuringAdvice as a new challenge task for language understanding systems. A machine must generate advice in response to a situation written in natural language. To pass the challenge, the machine's advice must be at least as helpful (to the situation-writer) as human-written advice, requiring a deep understanding of the situation.

TuringAdvice is an ongoing effort. For announcements, subscribe to our Google group » and follow Rowan on Twitter »


We present a demo of one model for our challenge: Grover-Mega, with 1.5B parameters. While the advice it generates is usually on-topic, it's rarely helpful (4%), suggesting promising avenues for future work.

Fill out the context, or pick one of the presets from Reddit. Then, press "Generate Advice"!


We present a dynamic leaderboard and dataset for TuringAdvice. The dataset, RedditAdvice, is dynamic because each evaluation pulls data posted to Reddit's advice communities over the preceding two weeks. Thus, machines must tackle the same task as humans: providing advice for recently-written situations. There's no historic test set to overfit to.
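As a rough illustration of the two-week window, the sketch below filters a list of posts by creation time. The record layout (`id`, `created_at`) and the function name `recent_posts` are hypothetical; the actual RedditAdvice pipeline may select and filter posts differently.

```python
from datetime import datetime, timedelta, timezone


def recent_posts(posts, now, window=timedelta(days=14)):
    """Keep posts created within `window` before `now` (inclusive)."""
    cutoff = now - window
    return [p for p in posts if cutoff <= p["created_at"] <= now]


# Made-up posts for illustration only.
now = datetime(2020, 2, 12, tzinfo=timezone.utc)
posts = [
    {"id": "a", "created_at": datetime(2020, 2, 10, tzinfo=timezone.utc)},
    {"id": "b", "created_at": datetime(2020, 1, 1, tzinfo=timezone.utc)},
    {"id": "c", "created_at": datetime(2020, 1, 30, tzinfo=timezone.utc)},
]
evaluation_pool = recent_posts(posts, now)
print([p["id"] for p in evaluation_pool])  # ['a', 'c']
```

Because the pool is rebuilt at each evaluation, a post that was eligible in one round drops out of the next once it is more than two weeks old.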

We rank models by the human preference rate: the frequency (%) with which their advice is chosen over the community-endorsed advice from Reddit (the advice with the most upvotes) in a head-to-head comparison. Human performance is at 50% by definition, while machine performance tends to be much worse.
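The human preference rate can be sketched as a simple proportion. The function name and the boolean encoding of each rating are assumptions made for illustration, and the ratings below are invented, not leaderboard data.

```python
def human_preference_rate(judgments):
    """Percentage of head-to-head comparisons won by the model.

    `judgments` is an iterable of booleans: True when the rater preferred
    the model's advice over the top-upvoted Reddit advice.
    """
    judgments = list(judgments)
    if not judgments:
        raise ValueError("need at least one comparison")
    return 100.0 * sum(judgments) / len(judgments)


# Made-up ratings for illustration: 2 wins out of 50 comparisons.
ratings = [True] * 2 + [False] * 48
print(human_preference_rate(ratings))  # 4.0
```

A rate of 50% would mean raters pick the model's advice as often as the community-endorsed advice.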

Leaderboard (updated Feb 12, 2020), listed in rank order:

Date          Model / submitter              Human Preference Rate (%)   Parameter count
-             Human Performance              50.0                        -
Feb 12, 2020  Google (experiment by Rowan)   9.0                         11 billion
Feb 12, 2020  Google (experiment by Rowan)   6.0                         3 billion
Feb 12, 2020  University of Washington       4.0                         1.5 billion
Feb 12, 2020  University of Washington       3.5                         355 million
Feb 12, 2020  TF-IDF retrieval               2.0                         -

Get started

Alongside the dynamic RedditAdvice dataset, we provide a very large static dataset, RedditAdvice2019, for training models. It contains 600k training examples and 16k validation examples. You can train on RedditAdvice2019, or you can use other data. Download it at the link below.

Submitting to our leaderboard is easy. You'll need to set up a simple web API - no Docker required! The instructions are also on the GitHub page; just follow the link below.
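To give a feel for what a simple web API could look like, here is a minimal sketch using only Python's standard library. The JSON field names (`situation`, `advice`), the port, and the placeholder `generate_advice` function are all assumptions for illustration; the actual request and response format is defined in the instructions on the GitHub page.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def generate_advice(situation: str) -> str:
    # Placeholder: a real submission would call a trained model here.
    return "It might help to talk this over with someone you trust."


def handle_payload(raw: bytes):
    """Turn a JSON request body into (status_code, JSON response body)."""
    try:
        payload = json.loads(raw)
        situation = payload["situation"]  # hypothetical field name
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected JSON with a 'situation' field"}
        return 400, json.dumps(error).encode()
    return 200, json.dumps({"advice": generate_advice(situation)}).encode()


class AdviceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes the client declared.
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_payload(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000):
    """Block forever, answering POST requests on the given port."""
    HTTPServer(("", port), AdviceHandler).serve_forever()
```

Calling `serve()` starts the server. Keeping the request logic in `handle_payload` means it can be checked without opening a socket.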

Download RedditAdvice2019 » Submit to the leaderboard »

Paper: Evaluating Machines by their Real-World Language Use

If our paper inspires you, please cite us:
    title={Evaluating Machines by their Real-World Language Use},
    author={Rowan Zellers and Ari Holtzman and Elizabeth Clark and Lianhui Qin and Ali Farhadi and Yejin Choi},
    journal={arXiv preprint},


This work was done by a team of researchers at the University of Washington, specifically in the Paul G. Allen School of Computer Science and Engineering. Some of us are also affiliated with the Allen Institute for AI (AI2).

Have questions about advice, want to submit to the leaderboard, or just want to get in touch? Reach me through my contact page.