Rowan Zellers

I'm a fifth-year graduate student at the University of Washington in Computer Science and Engineering, and I work part-time at the Allen Institute for Artificial Intelligence. I work with Yejin Choi and Ali Farhadi. My pronouns are he/his.

My research spans natural language processing, computer vision, and artificial intelligence. I'm excited about commonsense language understanding. As humans, our language is rooted in the complex world around us -- we use it to learn new concepts, share information, and collaborate with others. I'm interested in bridging the gap between what existing machine-learning approaches can do, and this humanlike level of language understanding grounded in the world. I'm also interested in exploring the social impacts of these technologies.

You can follow me on Twitter at @rown.


For more, visit my Google Scholar and Semantic Scholar pages.

MERLOT: Multimodal Neural Script Knowledge Models
Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, Yejin Choi. Preprint. [paper] [project page]

We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech -- in an entirely label-free, self-supervised manner. Our model not only learns to match images to temporally corresponding words, but also to contextualize what is happening globally over time. When finetuned, it achieves state-of-the-art results on 12 datasets requiring visual (and temporal) reasoning.

PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World
Rowan Zellers, Ari Holtzman, Matthew Peters, Roozbeh Mottaghi, Aniruddha Kembhavi, Ali Farhadi, Yejin Choi. ACL 2021. [paper] [project page]

We introduce a model named PIGLeT that learns physical commonsense understanding by interacting with the world, and uses this knowledge to ground language. This strategy works much better than modeling everything solely through linguistic form. When forecasting "what happens next" given an English sentence, our base-sized PIGLeT model outperforms text-to-text models that are 100x larger.

Evaluating Machines by their Real-World Language Use
Rowan Zellers, Ari Holtzman, Elizabeth Clark, Lianhui Qin, Ali Farhadi, Yejin Choi. NAACL 2021. [paper] [project page]

Press: VentureBeat

Big transformers do well on NLP benchmarks, but there's a gap between today's benchmarks and how humans use language. We narrow this gap, instead evaluating machines dynamically by their real-world language use. Rather than bubbling the right answers on multiple choice tests, the idea is to write text that's helpful to people in need. Today's models show room for improvement in this setting.

NeuroLogic Decoding: (Un)supervised Neural Text Generation with Predicate Logic Constraints
Ximing Lu, Peter West, Rowan Zellers, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi. NAACL 2021. [paper]

Constraints like "generate a recipe using the following ingredients" are tricky for language models to follow, even after finetuning. We introduce NeuroLogic Decoding, an algorithm that generates high-quality text while satisfying constraints, which lets us avoid the finetuning step completely.

Probing Text Models for Common Ground with Visual Representations
Gabriel Ilharco, Rowan Zellers, Ali Farhadi, Hannaneh Hajishirzi. NAACL 2021. [paper]

Today's best language understanding models are trained on massive amounts of purely textual data. We show that their representations have structural similarity to models trained on visual data, though this cross-modal linkage is far from perfect.

Edited Media Understanding: Reasoning About Implications of Manipulated Images
Jeff Da, Maxwell Forbes, Rowan Zellers, Anthony Zheng, Jena D. Hwang, Antoine Bosselut, Yejin Choi. arXiv 2020. [paper]

Many images online are edited in some way, but it is the intent that separates harmful edits like deepfakes from innocuous edits like an enhanced vacation photo. We introduce a new task and dataset for reasoning about why an image was edited, and a new model for the task.

Adversarial Filters of Dataset Biases
Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew Peters, Ashish Sabharwal, Yejin Choi. ICML 2020. [paper]

Today's models achieve superhuman performance on benchmarks like ImageNet and Stanford Natural Language Inference (SNLI), but it is unclear whether they solved the underlying task, or rather overfitted to dataset biases. We study AFLite, an extension of Adversarial Filtering, to remove dataset biases. Filtering datasets like ImageNet makes them much harder for machines, while human performance remains high.

PIQA: Reasoning about Physical Commonsense in Natural Language
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, Yejin Choi. AAAI 2020. [paper] [leaderboard]

To apply eyeshadow without a brush, should I use a cotton swab or a toothpick? We introduce a benchmark for evaluating the ability of machines to reason about physical situations through natural language. Today's pretrained language models struggle, showing room for future research.

Defending Against Neural Fake News
Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, Yejin Choi. NeurIPS 2019 [paper] [project page] [blog post]

Press: TechCrunch, Paul G. Allen School News, New Scientist, GeekWire, AdWeek, New York Times, Washington Post

Can adversaries use state-of-the-art NLP models to generate "neural fake news"? We investigate the threat of machine-written propaganda that mimics the style of real news, through a model named Grover. Given a headline like 'Link Found Between Vaccines and Autism,' Grover can generate the rest of the article. The best defense against Grover turns out to be Grover itself, demonstrating the importance of public release of strong generators.

HellaSwag: Can a Machine Really Finish Your Sentence?
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, Yejin Choi. ACL 2019 [paper] [project page]

We show that commonsense inference still proves difficult for even state-of-the-art models, by presenting HellaSwag, a new challenge dataset. Though its questions are trivial for humans, even deep pretrained models (like BERT) struggle. The key insight is to scale up the length and complexity of the dataset examples towards a critical zone wherein generated text is ridiculous to humans, yet often misclassified by state-of-the-art models.

From Recognition to Cognition: Visual Commonsense Reasoning
Rowan Zellers, Yonatan Bisk, Ali Farhadi, Yejin Choi. CVPR 2019 oral [paper] [project page]

We formulate the new task of Visual Commonsense Reasoning, where a model must not only answer challenging visual questions expressed in natural language, but also provide a rationale explaining why its answer is true. We introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes.

SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference
Rowan Zellers, Yonatan Bisk, Roy Schwartz, Yejin Choi. EMNLP 2018 [paper] [project page] [code]

Press: New York Times

We release a new NLI dataset, SWAG, with 113k challenging multiple choice questions about grounded situations. To build this dataset, we present Adversarial Filtering (AF), which allows for data collection at scale while minimizing annotation artifacts.

Neural Motifs: Scene Graph Parsing with Global Context
Rowan Zellers, Mark Yatskar, Sam Thomson, Yejin Choi. CVPR 2018 [paper] [project page] [code]

We study scene graph generation: building a graph where the nodes are objects and the edges are pairwise relationships. The visual world has many repeating structures (motifs). We build a model to capture them, which improves significantly over the prior state-of-the-art.

Zero-Shot Activity Recognition with Verb Attribute Induction
Rowan Zellers, Yejin Choi. EMNLP 2017 [paper] [code, data, BibTeX] [poster]

We investigate zero-shot multimodal learning where the topics of classification are verbs, not objects. We crowdsource verb attributes and build a model to learn them from unlabeled text and dictionary definitions. We then use these verb attributes to recognize actions in images.

Multimodal Sentiment Intensity Analysis in Videos: Facial Gestures and Verbal Messages
Amir Zadeh, Rowan Zellers, Eli Pincus, Louis-Philippe Morency. IEEE Intelligent Systems 2016 [paper]

We provide a study of sentiment analysis applied to video data, not just text. We present a model that exploits the dynamics between gestures and verbal messages.

Nucleotide Interdependency in Transcription Factor Binding Sites in the Drosophila Genome
Jacqueline Dresch, Rowan Zellers, Daniel Bork, Robert Drewell. Gene Regulation and Systems Bio 2016. [paper]

MARZ: an algorithm to combinatorially analyze gapped n-mer models of transcription factor binding
Rowan Zellers, Robert Drewell, Jacqueline Dresch. BMC Bioinformatics 2015. [paper+code]

We model the specificity with which regulatory proteins bind to DNA sequences during embryonic development. We use this model to study binding sites for 15 distinct regulatory proteins in the Drosophila (fruit fly) genome.


I help advise several talented undergrads on research:

  • Ximing (Gloria) Lu
  • Aaron Johnston
  • Nathaniel Wichman
  • Jeff Da



My email address is rowanz at

Note:

  • If you have a question about getting code to work (for a paper I published), opening up an issue on my Github is usually better than email. Please try to provide sufficient detail so I can help you out!
  • I often serve as a reviewer for open-access NLP, CV, and ML conferences and journals. I try to avoid reviewing for closed-access journals or for conferences/journals that I haven't heard of, sorry!