Defining Comprehension: A Template of Understanding for Machine Reading of Narratives

A critical analysis of MRC task design, proposing a systematic Template of Understanding for narrative comprehension and evaluating current model limitations.

1. Introduction & Core Thesis

The paper "To Test Machine Comprehension, Start by Defining Comprehension" presents a fundamental critique of the prevailing paradigm in Machine Reading Comprehension (MRC) research. The authors, Dunietz et al., argue that the field's obsession with creating incrementally "harder" question-answering tasks is misguided and unsystematic. They posit that without first defining what constitutes comprehension for a given text type, MRC benchmarks are haphazard and fail to ensure models build robust, useful internal representations of text meaning.

The core contribution is the introduction of a Template of Understanding (ToU)—a structured, content-first specification of the minimal knowledge a system should extract from a narrative text. This shifts the focus from how to test (via difficult questions) to what to test (systematic content coverage).

2. Analysis of Existing MRC Dataset Designs

The paper reviews common MRC dataset construction methodologies, highlighting their inherent flaws from a systematic evaluation standpoint.

2.1 The "Difficulty-First" Paradigm

Most contemporary MRC tasks (e.g., SQuAD 2.0, HotpotQA, DROP) are built by having annotators read a passage and formulate questions deemed challenging, often focusing on reasoning types like multi-hop, commonsense, or numerical inference. The authors liken this to "trying to become a professional sprinter by glancing around the gym and adopting any exercises that look hard." The training is scattershot and lacks a coherent roadmap toward genuine comprehension.

2.2 Shortcomings of Ad-Hoc Question Generation

This approach leads to datasets with uneven and incomplete coverage of a passage's semantic content. High performance on such benchmarks does not guarantee a system has constructed a coherent mental model of the text. It may instead excel at surface pattern matching or exploiting dataset-specific biases, a phenomenon well-documented in studies of NLI and QA datasets.

3. The Proposed Framework: Template of Understanding

The authors advocate for a foundational shift: first define the target of comprehension, then derive tests for it.

3.1 Why Narratives?

Narratives (short stories) are proposed as an ideal testbed because they are a fundamental and complex text type with clear real-world applications (e.g., understanding legal depositions, patient histories, news reports). They require modeling events, characters, goals, causal/temporal relations, and mental states.

3.2 Components of the Narrative ToU

Inspired by cognitive science models of reading comprehension (e.g., Kintsch's Construction-Integration model), the proposed ToU for a narrative specifies the minimal elements a system's internal representation should contain:

  • Entities & Coreference: Track all characters, objects, and locations, resolving repeated references to the same entity.
  • Events & States: Identify all actions and descriptive states.
  • Temporal Structure: Order events and states on a timeline.
  • Causal Relations: Identify cause-effect links between events/states.
  • Intentionality & Mental States: Infer characters' goals, beliefs, and emotions.
  • Thematic & Global Structure: Understand the overall point, moral, or outcome.

3.3 Operationalizing the ToU

The ToU is not just a theory; it's a blueprint for dataset creation. For each component, task designers can systematically generate questions (e.g., "What caused X?", "What was Y's goal when she did Z?") that probe whether the model has built that part of the representation. This ensures comprehensive and balanced coverage.
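
To make this concrete, the sketch below (not taken from the paper) shows how hypothetical per-component question templates could be instantiated from passage annotations; the component names, template strings, and the `generate_probes` helper are all illustrative assumptions.

```python
# A sketch (not from the paper) of ToU-driven question generation: every
# component of the template gets at least one probe for each annotated slot.
# Component names and template strings are illustrative assumptions.

TOU_QUESTION_TEMPLATES = {
    "entities":    ["Who or what is {entity}?"],
    "events":      ["What did {agent} do?", "What happened to {entity}?"],
    "temporal":    ["What happened immediately before {event}?",
                    "What happened immediately after {event}?"],
    "causal":      ["What caused {event}?", "What did {event} lead to?"],
    "intentional": ["Why did {agent} {event}?", "What was {agent}'s goal?"],
    "thematic":    ["What is the overall outcome of the story?"],
}

def generate_probes(annotations: dict) -> list[tuple[str, str]]:
    """Instantiate questions for every ToU component present in `annotations`.

    `annotations` maps a component name to a list of slot dicts, e.g.
    {"causal": [{"event": "the fast boot"}]}.
    """
    probes = []
    for component, templates in TOU_QUESTION_TEMPLATES.items():
        for slots in annotations.get(component, []):
            for template in templates:
                try:
                    probes.append((component, template.format(**slots)))
                except KeyError:  # template needs a slot that was not annotated
                    continue
    return probes
```

Coverage can then be reported per ToU component rather than as a single aggregate score, which is precisely the balance the authors argue ad-hoc question collection fails to provide.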

4. Experimental Evidence & Model Performance

The paper includes a pilot experiment to validate their critique.

4.1 Pilot Task Design

A small-scale dataset was created based on the ToU for simple narratives. Questions were systematically generated to probe each component of the template.

4.2 Results & Key Findings

State-of-the-art models (like BERT) performed poorly on this systematic test, despite excelling on standard "difficult" benchmarks. The models particularly struggled with questions requiring causal reasoning and inference of mental states, precisely the elements that are often under-sampled in ad-hoc QA collection. This pilot strongly suggests that current models lack the robust, structured understanding the ToU demands.

Pilot Experiment Snapshot

Finding: Models failed systematically on causal & intentional reasoning probes.

Implication: High scores on SQuAD-style tasks do not equate to narrative understanding as defined by the ToU.

5. Technical Deep Dive & Mathematical Formalism

The ToU can be formalized. Let a narrative $N$ be a sequence of sentences $\{s_1, s_2, ..., s_n\}$. The comprehension model $M$ should construct a representation $R(N)$ that is a structured graph:

$R(N) = (E, V, T, C, I)$

Where:

  • $E$: Set of entities (nodes).
  • $V$: Set of events/states (nodes).
  • $T \subseteq V \times V$: Temporal relations (edges).
  • $C \subseteq V \times V$: Causal relations (edges).
  • $I \subseteq E \times V$: Intentional relations (e.g., Agent(Entity, Event)).

The goal of an MRC system is to infer $R(N)$ from $N$. Each QA pair $(q, a)$ corresponds to a probe function $f_q$ such that $f_q(R(N)) = a$ whenever $R(N)$ is correct. The ToU specifies the minimal structure $R(N)$ must contain for narrative texts.
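
As an illustration only, the following sketch translates this formalism into plain Python data structures; the class names, fields, and the `probe_cause` helper are expository assumptions, not an implementation described in the paper.

```python
# A sketch translating R(N) = (E, V, T, C, I) into plain Python data
# structures; class and field names are illustrative assumptions, not an
# implementation from the paper.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Entity:
    name: str          # node in E

@dataclass(frozen=True)
class Event:
    description: str   # node in V (an event or a state)

@dataclass
class NarrativeRepresentation:
    entities: set[Entity] = field(default_factory=set)                   # E
    events: set[Event] = field(default_factory=set)                      # V
    temporal: set[tuple[Event, Event]] = field(default_factory=set)      # T ⊆ V × V
    causal: set[tuple[Event, Event]] = field(default_factory=set)        # C ⊆ V × V
    intentional: set[tuple[Entity, Event]] = field(default_factory=set)  # I ⊆ E × V

def probe_cause(r: NarrativeRepresentation, effect: Event) -> list[Event]:
    """A probe f_q(R(N)): return the events recorded as causes of `effect`."""
    return [cause for (cause, eff) in r.causal if eff == effect]
```

Under this view, a QA probe becomes a deterministic query over the graph, as in `probe_cause`, rather than a free-form string match against the passage.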

6. Analytical Framework: A Case Study Example

Narrative: "Anna was frustrated with her slow computer. She saved her work, shut down the machine, and went to the store to buy a new solid-state drive. After installing it, her computer booted up in seconds, and she smiled."

ToU-Based Analysis:

  • Entities: Anna, computer, work, store, SSD.
  • Events/States: was frustrated, saved work, shut down, went, bought, installed, booted up, smiled.
  • Temporal: [frustrated] -> [saved] -> [shut down] -> [went] -> [bought] -> [installed] -> [booted] -> [smiled].
  • Causal: Slow computer caused frustration. Frustration caused goal to upgrade. Buying & installing SSD caused fast boot. Fast boot caused smile (satisfaction).
  • Intentional: Anna's goal: improve computer speed. Her plan: buy and install an SSD. Her belief: SSD will make computer faster.
  • Thematic: Problem-solving through technology upgrade leads to satisfaction.

A ToU-compliant QA set would contain questions probing each of these elements systematically, not just a random "hard" question like "Where did Anna go after shutting down her computer?"
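
For illustration, here is a minimal, self-contained sketch of such a structured representation for the Anna story; the string labels (e.g., `slow_computer`, `goal_upgrade`) and the `answer_cause` probe are hypothetical choices, not annotations from the paper.

```python
# A self-contained sketch of a ToU-style representation for the Anna story,
# using plain strings and tuples; labels are illustrative, not gold annotations.
anna_story = {
    "entities": {"Anna", "computer", "work", "store", "SSD"},
    "events": ["frustrated", "saved", "shut_down", "went", "bought",
               "installed", "booted", "smiled"],
    # Temporal: each event precedes the next one in the list.
    "temporal": [("frustrated", "saved"), ("saved", "shut_down"),
                 ("shut_down", "went"), ("went", "bought"),
                 ("bought", "installed"), ("installed", "booted"),
                 ("booted", "smiled")],
    # Causal: (cause, effect) pairs, slightly simplified from the analysis above.
    "causal": [("slow_computer", "frustrated"), ("frustrated", "goal_upgrade"),
               ("installed", "booted"), ("booted", "smiled")],
    # Intentional: (agent, goal or planned action) pairs.
    "intentional": [("Anna", "goal_upgrade"), ("Anna", "bought"),
                    ("Anna", "installed")],
}

def answer_cause(rep: dict, effect: str) -> list[str]:
    """Probe: 'What caused <effect>?' answered from the structured representation."""
    return [c for (c, e) in rep["causal"] if e == effect]

assert answer_cause(anna_story, "smiled") == ["booted"]
```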

7. Critical Analysis & Expert Commentary

Core Insight: Dunietz et al. have struck at the heart of a methodological rot in AI evaluation. The field's benchmark-driven progress, reminiscent of the "Clever Hans" effect, has prioritized narrow performance gains over foundational understanding. Their ToU is a direct challenge to the community: stop chasing leaderboard points and start defining what success actually means. This aligns with growing skepticism from researchers such as McCoy, Pavlick, and Linzen, who have shown that models often solve tasks via superficial heuristics rather than deep reasoning.

Logical Flow: The argument is impeccably structured: (1) Diagnose the problem (unsystematic, difficulty-focused evaluation), (2) Propose a principled solution (content-first ToU), (3) Provide a concrete instantiation (for narratives), (4) Offer empirical validation (pilot study showing SOTA model failure). This mirrors the rigorous approach of seminal papers that defined new paradigms, such as the CycleGAN paper's clear formulation of unpaired image translation objectives.

Strengths & Flaws: The paper's strength is its conceptual clarity and actionable critique. The ToU framework is transferable to other text genres (scientific articles, legal documents). However, its main flaw is the limited scale of the pilot experiment. A full-scale ToU-based benchmark is needed to truly pressure-test models. Furthermore, the ToU itself, while structured, may still be incomplete—does it fully capture social reasoning or complex counterfactuals? It's a necessary first step, not a final theory.

Actionable Insights: For researchers: Build the next generation of benchmarks using a ToU-like methodology. For engineers: Be deeply skeptical of claims that models "comprehend" text based on existing benchmarks. Evaluate models internally against systematic, application-specific templates. For funders: Prioritize research that defines and measures genuine understanding over marginal improvements on flawed tasks. The path forward is to adopt a more theory-driven, cognitive science-informed approach to AI evaluation, moving beyond the "laundry list of hard problems" mentality.

8. Future Applications & Research Directions

  • Benchmark Development: Creation of large-scale, publicly available MRC datasets built explicitly from ToUs for narratives, news, and scientific abstracts.
  • Model Architecture: Designing neural architectures that explicitly build and manipulate structured representations (like the $R(N)$ graph) rather than relying solely on implicit embeddings. This points towards neuro-symbolic hybrids.
  • Evaluation Diagnostics: Using ToU-based probes as fine-grained diagnostic tools to understand specific weaknesses in existing models (e.g., "Model X fails on causal reasoning but is good at entity tracking").
  • Cross-Modal Understanding: Extending the ToU concept to multimodal comprehension (e.g., understanding video narratives or illustrated stories).
  • Real-World Deployment: Direct application in domains where structured understanding is critical: automated tutoring systems that assess story comprehension, AI legal assistants that parse case narratives, or clinical AI that interprets patient history narratives.

9. References

  1. Dunietz, J., Burnham, G., Bharadwaj, A., Rambow, O., Chu-Carroll, J., & Ferrucci, D. (2020). To Test Machine Comprehension, Start by Defining Comprehension. arXiv preprint arXiv:2005.01525.
  2. Kintsch, W. (1988). The role of knowledge in discourse comprehension: A construction-integration model. Psychological Review, 95(2), 163.
  3. Chen, D., Fisch, A., Weston, J., & Bordes, A. (2017). Reading Wikipedia to Answer Open-Domain Questions. Proceedings of ACL.
  4. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT.
  5. Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. Proceedings of ICCV. (Cited as an example of clear objective formulation).
  6. McCoy, R. T., Pavlick, E., & Linzen, T. (2019). Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. Proceedings of ACL.