
Adversarial Examples for Evaluating Reading Comprehension Systems

Analysis of adversarial evaluation methods for NLP, focusing on the SQuAD dataset. Explores how automatically generated distracting sentences expose model weaknesses.

1. Introduction & Overview

This paper, "Adversarial Examples for Evaluating Reading Comprehension Systems" by Jia & Liang (2017), presents a critical examination of the true language understanding capabilities of state-of-the-art models on the Stanford Question Answering Dataset (SQuAD). The authors argue that standard accuracy metrics (e.g., F1 score) paint an overly optimistic picture, as models may exploit superficial statistical patterns rather than develop genuine comprehension. To address this, they propose an adversarial evaluation scheme that tests model robustness by inserting automatically generated, distracting sentences into the input paragraphs. These sentences are designed to fool models without changing the correct answer for a human reader.

Key Performance Drop

Average F1 Score: 75% → 36% (with grammatical adversarial sentences)

Further Drop: → ~7% average accuracy (with ungrammatical word sequences, measured on four models)

2. Core Methodology

2.1 Adversarial Evaluation Paradigm

Moving beyond average-case test-set evaluation, the paper adopts an adversarial framework inspired by work in computer vision (e.g., Szegedy et al., 2014). Unlike pixel-level image perturbations, however, small edits to text usually change its meaning, so the authors add new text rather than perturb existing words. Their key target is model overstability: the tendency to latch onto any sentence containing keywords from the question, rather than identifying the one that logically answers it. The adversary's goal is to generate a distractor sentence $S_{adv}$ that maximizes the probability of an incorrect prediction, $\Pr(\hat{y} \neq y \mid P + S_{adv}, Q)$, where $P + S_{adv}$ denotes the paragraph with the distractor appended, while ensuring a human would still answer correctly.
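
A minimal sketch of this worst-case evaluation, assuming a hypothetical `model(context, question)` callable that returns an answer string and a pre-built list of candidate distractor sentences:

```python
def adversarial_exact_match(model, paragraph, question, gold_answer, distractors):
    """Worst-case exact match: score 1.0 only if the model answers correctly
    on the original paragraph and on every distractor-augmented variant."""
    contexts = [paragraph] + [paragraph + " " + d for d in distractors]
    for context in contexts:
        if model(context, question) != gold_answer:
            return 0.0  # the adversary succeeds if any single sentence flips the answer
    return 1.0
```

The published evaluation uses token-level F1 rather than exact match, but the key point is the same: the score is taken over the worst-case perturbation the adversary can find, not the average case.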

2.2 Distractor Sentence Generation

The process involves two main phases:

  1. Rule-based Generation: Create a "raw" distractor sentence related to the question topic but not answering it. For the Figure 1 example, given the question about "the quarterback who was 38," the generated distractor is "Quarterback Jeff Dean had jersey number 37," which exploits lexical overlap with the question ("quarterback," a number) while changing the entity and the number (a code sketch of this step follows the list).
  2. Crowdsourced Grammatical Correction: The raw, potentially ungrammatical sentences are polished by human workers to ensure they are fluent, isolating the test to semantic understanding rather than syntax tolerance.
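
A minimal sketch of the rule-based first phase, tailored to the Figure 1 question. The swap tables and the single question-to-statement rule are hypothetical stand-ins for the paper's combination of antonym/nearest-neighbor substitutions, typed fake answers, and hand-written rules over the question's parse:

```python
# Hypothetical swap tables for illustration only.
NEARBY_SWAPS = {
    "38": "37",                               # perturb numbers
    "Super Bowl XXXIII": "Champ Bowl XXXIV",  # perturb named entities
}
FAKE_ANSWER = "Jeff Dean"  # fake answer of the same type (person) as the real one

def raw_distractor(question: str) -> str:
    """Turn a question into a related-but-wrong declarative sentence."""
    mutated = question
    for original, substitute in NEARBY_SWAPS.items():
        mutated = mutated.replace(original, substitute)
    # Crude question-to-statement rewrite covering only this question shape.
    mutated = mutated.replace(
        "What is the name of the quarterback who was",
        f"Quarterback {FAKE_ANSWER} had jersey number",
    )
    return mutated.rstrip("?") + "."

print(raw_distractor(
    "What is the name of the quarterback who was 38 in Super Bowl XXXIII?"))
# -> Quarterback Jeff Dean had jersey number 37 in Champ Bowl XXXIV.
```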

3. Experimental Results & Analysis

3.1 Performance Drop with Grammatical Distractors

The primary experiment evaluated 16 published models on SQuAD. The addition of a single, grammatically correct adversarial sentence caused the average F1 score to plummet from 75% to 36%. This dramatic drop demonstrates that high performance on standard benchmarks is not synonymous with robust language understanding. Models were easily distracted by semantically related but irrelevant information.
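
For reference, a minimal sketch of the token-overlap F1 that underlies these numbers (whitespace tokenization only; the official SQuAD script additionally lowercases and strips punctuation and articles):

```python
from collections import Counter

def squad_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer span."""
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

squad_f1("Jeff Dean", "John Elway")               # 0.0: fully distracted model
squad_f1("John Elway", "quarterback John Elway")  # 0.8: partial overlap
```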

3.2 Impact of Ungrammatical Sequences

In a more extreme test, the adversary was allowed to add ungrammatical sequences of words (e.g., "Quarterback jersey 37 Dean Jeff had"). On a subset of four models, this caused the average accuracy to fall to approximately 7%. This result highlights a severe weakness: many models rely heavily on local word matching and surface-level patterns, failing completely when those patterns are broken, even nonsensically.
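
A minimal sketch in the spirit of this ungrammatical attack, assuming a hypothetical `answer_prob(context, question)` hook that returns the probability the model assigns to the correct answer; the paper's actual search is more elaborate (for instance, its candidate pool also draws words from the question):

```python
import random

def greedy_word_sequence(answer_prob, paragraph, question, vocab,
                         length=10, rounds=3, pool_size=20, seed=0):
    """Greedily build an ungrammatical word sequence that, appended to the
    paragraph, minimizes the model's probability of the correct answer."""
    rng = random.Random(seed)
    words = [rng.choice(vocab) for _ in range(length)]  # random starting sequence
    for _ in range(rounds):
        for i in range(length):
            best_word = words[i]
            best_score = answer_prob(paragraph + " " + " ".join(words), question)
            for cand in rng.sample(vocab, min(pool_size, len(vocab))):
                trial = words[:i] + [cand] + words[i + 1:]
                score = answer_prob(paragraph + " " + " ".join(trial), question)
                if score < best_score:  # keep the most damaging word at this position
                    best_word, best_score = cand, score
            words[i] = best_word
    return " ".join(words)
```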

Figure 1 Analysis (Conceptual)

The provided example illustrates the attack. The original paragraph about Peyton Manning and John Elway is appended with the adversarial sentence about "Jeff Dean." A model like BiDAF, which initially correctly predicted "John Elway," changes its answer to the distracting entity "Jeff Dean" because it appears in a sentence containing the question's keywords ("quarterback," a number). A human reader effortlessly ignores this irrelevant addition.

4. Technical Framework & Case Study

Analysis Framework Example: To deconstruct a model's vulnerability, one can apply a simple diagnostic framework:

  1. Input Perturbation: Identify the question's key entities (e.g., "quarterback," "38," "Super Bowl XXXIII").
  2. Distractor Construction: Generate a candidate sentence that includes these entities but alters the relationship (e.g., changes the number, uses a different named entity).
  3. Model Interrogation: Use attention visualization or gradient-based saliency maps (similar to techniques in Simonyan et al., 2014 for CNNs) to see if the model's focus shifts from the evidentiary sentence to the distractor.
  4. Robustness Score: Define a metric $R = 1 - \frac{\Pr(\hat{y}_{adv} \neq y_{true})}{\Pr(\hat{y}_{orig} \neq y_{true})}$, where the probabilities are error rates with and without the distractor; a lower (more negative) score indicates higher vulnerability to this specific adversarial pattern (a sketch of this computation follows the framework).
This framework helps pinpoint whether a model fails due to lexical bias, lack of coreference resolution, or poor relational reasoning.
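
A minimal sketch of step 4, assuming a hypothetical `model(context, question)` answer-string interface and a list of paired original/adversarial examples, with the probabilities in the formula estimated as error rates over the set:

```python
def robustness_score(model, examples):
    """R = 1 - err_adv / err_orig over paired examples: 0 when the adversary
    leaves the error rate unchanged, negative when it adds errors
    (lower = more vulnerable to this distractor pattern)."""
    orig_errors = adv_errors = 0
    for ex in examples:  # ex: {"paragraph", "adv_paragraph", "question", "gold"}
        if model(ex["paragraph"], ex["question"]) != ex["gold"]:
            orig_errors += 1
        if model(ex["adv_paragraph"], ex["question"]) != ex["gold"]:
            adv_errors += 1
    if orig_errors == 0:  # ratio undefined when the clean error rate is zero
        return 0.0 if adv_errors == 0 else float("-inf")
    return 1 - adv_errors / orig_errors
```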

5. Critical Analysis & Expert Insights

Core Insight: The paper delivers a brutal truth: the NLP community was, in 2017, largely building and celebrating pattern matchers, not comprehenders. The near-human F1 scores on SQuAD were a mirage, shattered by a simple, rule-based adversary. This work is the NLP equivalent of revealing that a self-driving car performing perfectly on a sunny test track fails catastrophically at the first sight of a graffiti-marked stop sign.

Logical Flow: The argument is impeccably structured. It starts by challenging the adequacy of existing metrics (Introduction), proposes a concrete adversarial method as a solution (Methodology), provides devastating empirical evidence (Experiments), and concludes by redefining the goalpost for "success" in reading comprehension. The use of both grammatical and ungrammatical attacks cleanly separates failures in semantic understanding from failures in syntactic robustness.

Strengths & Flaws: Its greatest strength is its simplicity and potency—the attack is easy to understand and execute, yet its effects are dramatic. It successfully shifted the research agenda towards robustness. However, a flaw is that the distractor generation, while effective, is somewhat heuristic and task-specific. It doesn't provide a general, gradient-based adversarial attack method for text like Papernot et al. (2016) did for discrete domains, which limited its immediate adoption for adversarial training. Furthermore, it primarily exposes one type of weakness (overstability to lexical distractors), not necessarily all facets of misunderstanding.

Actionable Insights: For practitioners and researchers, this paper mandates a paradigm shift: benchmark performance is necessary but insufficient. Any model claiming comprehension must be stress-tested against adversarial evaluation. The actionable takeaway is to integrate adversarial filtering into the development pipeline—automatically generating or collecting perturbed examples to train and validate models. It also argues for evaluation metrics that incorporate robustness scores alongside accuracy. Ignoring this paper's warning means risking deployment of brittle systems that will fail in unpredictable, and potentially costly, ways when faced with natural but confusing language in real-world applications.

6. Future Directions & Applications

The paper catalyzed several key research directions:

  • Adversarial Training: Using generated adversarial examples as additional training data to improve model robustness, a technique now standard in robust ML (a minimal augmentation sketch follows this list).
  • Robust Benchmarks: The creation of adversarial evaluation resources, from the Adversarial SQuAD dataset itself to later tooling and platforms such as Robustness Gym and Dynabench, all of which focus on model failures.
  • Interpretability & Analysis: Driving the development of better model introspection tools to understand why models are distracted, leading to more architecturally robust designs (e.g., models with better reasoning modules).
  • Broader Applications: The principle extends beyond QA to any NLP task where superficial cues can be exploited—sentiment analysis (adding contradictory clauses), machine translation (inserting ambiguous phrases), and dialogue systems. It underscores the need for stress testing AI systems before deployment in critical areas like legal document review, medical information retrieval, or educational tools.
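
To make the first direction concrete, here is a minimal sketch of adversarial data augmentation, assuming training examples stored as dictionaries and a distractor generator in the style of the `raw_distractor` sketch above (both hypothetical):

```python
def augment_with_distractors(train_examples, make_distractor):
    """Pair each clean training example with a copy whose paragraph has an
    answer-preserving distractor sentence appended, so the model learns to
    ignore such sentences."""
    augmented = []
    for ex in train_examples:
        augmented.append(ex)  # keep the clean example
        adv = dict(ex)
        adv["paragraph"] = ex["paragraph"] + " " + make_distractor(ex["question"])
        augmented.append(adv)
    return augmented
```

Jia & Liang note that retraining on examples from a single adversary mainly confers robustness to that particular adversary, so augmentation of this kind mitigates rather than solves the underlying problem.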

7. References

  1. Jia, R., & Liang, P. (2017). Adversarial Examples for Evaluating Reading Comprehension Systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 2021–2031).
  2. Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
  3. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2014). Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR).
  4. Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR).
  5. Papernot, N., McDaniel, P., Swami, A., & Harang, R. (2016). Crafting adversarial input sequences for recurrent neural networks. In MILCOM 2016.
  6. Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Deep inside convolutional networks: Visualising image classification models and saliency maps. In Workshop at International Conference on Learning Representations (ICLR).