Rowan Zellers

I'm a final-year PhD candidate in Computer Science and Engineering at the University of Washington, and I work part-time at the Allen Institute for Artificial Intelligence. I work with Yejin Choi and Ali Farhadi. My pronouns are he/his.

I perform interdisciplinary research in natural language processing (NLP), computer vision, and machine learning, together with insights from developmental psychology. I study grounding AI, enabling machines to understand what language, images, and videos mean in the context of the external world.

You can follow me on Twitter at @rown.

Research Themes

Grounding physical dynamics through symbolic structure ⚛️

As humans, our language use is informed by physical knowledge. We think of many objects in a structured and symbolic way, with actions transforming these objects. How can we represent and learn physical dynamics - and transfer this knowledge to models for language and beyond?

Grounding events through multimodal script knowledge 🚗

Consider the sentence "A man is pumping his car up, so he can take off the tire." We can imagine what this might look like, or predict what happens next. How, and through which modalities, can we learn and apply this multimodal script knowledge?

Adversarial evaluation of grounding 💯

Neural models excel at "gaming" benchmarks, using ample training data to answer correctly but for questionable reasons. How do we develop methodology, tasks, and datasets that evaluate the true challenge of grounding AI models? How do we apply this methodology to analyze risks posed by AI?

Recent invited talks


Also see my Semantic Scholar and Google Scholar pages.

MERLOT Reserve: Multimodal Neural Script Knowledge through Vision and Language and Sound

Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, Yejin Choi

CVPR 2022 Oral

website paper Press (VentureBeat)

We introduce MERLOT Reserve, which learns about YouTube videos through all their modalities (audio, vision, and text). Learning from audio helps broadly -- even on single-image tasks like VCR. Our model learns state-of-the-art representations that also transfer well to video-based tasks in a zero-shot setting.

The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning

Jack Hessel, Jena D Hwang, Jae Sung Park, Rowan Zellers, Chandra Bhagavatula, Anna Rohrbach, Kate Saenko, Yejin Choi

arxiv 2022


We introduce a new dataset named Sherlock for studying Visual Abductive Reasoning: inferring likely context about the world beyond an image, given a localized clue. We operationalize this task through a ranking-based evaluation, where we find headroom for today's vision-and-language models.

Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer

Yanpeng Zhao, Jack Hessel, Youngjae Yu, Ximing Lu, Rowan Zellers, Yejin Choi

NAACL 2022


The internet has lots of images paired with literal descriptions, but this pairing is less common for audio. We study using images as a "pivot" to learn strong audio-text representations, without paired data. In this zero-shot setting, our model obtains strong performance on a variety of sound description tasks.

NeuroLogic A* esque Decoding: Constrained Text Generation with Lookahead Heuristics

Ximing Lu, Sean Welleck, Peter West, Liwei Jiang, Jungo Kasai, Daniel Khashabi, Ronan Le Bras, Lianhui Qin, Youngjae Yu, Rowan Zellers, Noah A. Smith, Yejin Choi

NAACL 2022


We introduce a new constrained decoding strategy for text. The idea is to use lookahead heuristics to guide text towards satisfying provided constraints. It outperforms competitive baselines, particularly on tasks with complex constraints.

MERLOT: Multimodal Neural Script Knowledge Models

Rowan Zellers🍷, Ximing Lu🍷, Jack Hessel🍷, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, Yejin Choi (🍷: equal contribution)

NeurIPS 2021 Oral (top 1% of submissions)

website paper Press (VentureBeat) Press (The Batch)

We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech -- in an entirely label-free, self-supervised manner. Our model not only learns to match images to temporally corresponding words, but also to contextualize what is happening globally over time. It shows a strong temporal understanding of grounded events, particularly in ordering visual stories. When finetuned it gets SOTA on 12 datasets requiring visual (and temporal) reasoning.

MAUVE: Human-Machine Divergence Curves for Evaluating Open-Ended Text Generation

Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, Zaid Harchaoui

NeurIPS 2021 Outstanding paper (top 0.1%)

website paper

We introduce MAUVE, a new metric for evaluating open-ended text generation. MAUVE simultaneously measures 1) errors due to assigning high probability to unnatural language, and 2) errors due to assigning low probability to true language. Our evaluation corresponds with human judgments, and reveals that decoding algorithms like Nucleus Sampling score well in practice.

PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World

Rowan Zellers, Ari Holtzman, Matthew Peters, Roozbeh Mottaghi, Aniruddha Kembhavi, Ali Farhadi, Yejin Choi

ACL 2021

website paper

We introduce a model named PIGLeT that learns physical commonsense understanding by interacting with the world, and uses this knowledge to ground language. This strategy works much better than modeling everything solely through linguistic form. When forecasting "what happens next" given an English sentence, our base-sized PIGLeT model outperforms text-to-text models that are 100x larger.

TuringAdvice: A Generative and Dynamic Evaluation of Language Use

Rowan Zellers, Ari Holtzman, Elizabeth Clark, Lianhui Qin, Ali Farhadi, Yejin Choi

NAACL 2021

website paper Press (VentureBeat)

Big transformers do well on NLP benchmarks, but there's a gap between today's benchmarks and how humans use language. Our TuringAdvice narrows this gap, evaluating machines dynamically by their real-world language use. Rather than bubbling the 'right answer' on a multiple choice exam, our benchmark requires models to write helpful language in response to a real-life situation, revealing key flaws with today's models.

NeuroLogic Decoding: Unsupervised Neural Text Generation with Predicate Logic Constraints

Ximing Lu, Peter West, Rowan Zellers, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi

NAACL 2021


Constraints like "generate a recipe using the following ingredients" are tricky for language models to follow, even after finetuning. We introduce NeuroLogic Decoding, an algorithm that allows for the generation of high-quality text while satisfying constraints, allowing us to avoid the finetuning step completely.

Edited Media Understanding Frames: Reasoning About the Intent and Implications of Visual Misinformation

Jeff Da, Maxwell Forbes, Rowan Zellers, Anthony Zheng, Jena D Hwang, Antoine Bosselut, Yejin Choi

ACL 2021

website paper

Many images online are edited in some way, but it is the intent that separates harmful edits like deepfakes from innocuous edits like an enhanced vacation photo. We introduce a new task and dataset for reasoning about why an image was edited, and a new model for the task.

Probing Contextual Language Models for Common Ground with Visual Representations

Gabriel Ilharco, Rowan Zellers, Ali Farhadi, Hannaneh Hajishirzi

NAACL 2021


Today's best language understanding models are trained on massive amounts of purely textual data. We show that their representations have structural similarity to models trained on visual data, though this cross-modal linkage is far from perfect.

Adversarial Filters of Dataset Biases

Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew Peters, Ashish Sabharwal, Yejin Choi

ICML 2020


Today's models achieve superhuman performance on benchmarks like ImageNet and Stanford Natural Language Inference (SNLI), but it is unclear whether they solved the underlying task, or rather overfitted to dataset biases. We study AFLite, an extension of Adversarial Filtering, to remove dataset biases. Filtering datasets like ImageNet makes them much harder for machines, while human performance remains high.

PIQA: Reasoning about Physical Commonsense in Natural Language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, Yejin Choi

AAAI 2020 Oral (top 3%).

website paper

To apply eyeshadow without a brush, should I use a cotton swab or a toothpick? We introduce a benchmark for evaluating the ability of machines to reason about physical situations through natural language. Today's pretrained language models struggle, showing room for future research.

Defending Against Neural Fake News

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, Yejin Choi

NeurIPS 2019

website paper
Press: TechCrunch UW Allen School News New Scientist GeekWire AdWeek The New York Times The Washington Post AI2 Blog

Can adversaries use state-of-the-art NLP models to generate "neural fake news"? We investigate the threat of machine-written propaganda that mimics the style of real news, through a model named Grover. Given a headline, Grover can generate the rest of the article. The best defense against Grover turns out to be Grover itself, demonstrating the importance of public release of strong generators.

HellaSwag: Can a Machine Really Finish Your Sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, Yejin Choi

ACL 2019

website paper

We show that commonsense inference still proves difficult for even state-of-the-art models, by presenting HellaSwag, a new challenge dataset. Though its questions are trivial for humans, even deep pretrained models (like BERT) struggle. The key insight is to scale up the length and complexity of the dataset examples towards a critical zone wherein generated text is ridiculous to humans, yet often misclassified by state-of-the-art models.

From Recognition to Cognition: Visual Commonsense Reasoning

Rowan Zellers, Yonatan Bisk, Ali Farhadi, Yejin Choi

CVPR 2019 Oral (top 5%).

website paper

We formulate the new task of Visual Commonsense Reasoning, where a model must not only answer challenging visual questions expressed in natural language, but also provide a rationale explaining why its answer is true. We introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes.

SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference

Rowan Zellers, Yonatan Bisk, Roy Schwartz, Yejin Choi

EMNLP 2018

website paper Press (The New York Times)

We release a new NLI dataset, SWAG, with 113k challenging multiple choice questions about grounded situations. To build this dataset, we present Adversarial Filtering (AF), which allows for data collection at scale while minimizing annotation artifacts.

Neural Motifs: Scene Graph Parsing with Global Context

Rowan Zellers, Mark Yatskar, Sam Thomson, Yejin Choi

CVPR 2018

website paper

We study scene graph generation: building a graph where the nodes are objects and the edges are pairwise relationships. The visual world has many repeating structures (motifs). We built a model to capture them, which improves significantly over the prior state-of-the-art.

Zero-Shot Activity Recognition with Verb Attribute Induction

Rowan Zellers, Yejin Choi

EMNLP 2017

website paper

We investigate zero-shot multimodal learning where the topics of classification are verbs, not objects. We crowdsource verb attributes and build a model to learn them from unlabeled text and dictionary definitions. We use these verb attributes to recognize actions in images.

Multimodal Sentiment Intensity Analysis in Videos: Facial Gestures and Verbal Messages

Amir Zadeh, Rowan Zellers, Eli Pincus, Louis-Philippe Morency

IEEE Intelligent Systems 2016


We provide a study of sentiment analysis applied to video data, not just text. We present a model that exploits the dynamics between gestures and verbal messages.



My email address is rowanz at

Note:

  • If you have a question about getting code to work (for a paper I published), opening an issue on my GitHub is usually better than email. Please try to provide sufficient detail so I can help you out!
  • I often serve as a reviewer for open-access NLP, CV, and ML conferences and journals. I try to avoid reviewing for closed-access journals, or for conferences/journals that I haven't heard of, sorry!