1. Introduction
This paper investigates the syntactic biases learned by Recurrent Neural Network (RNN) Language Models, specifically focusing on the phenomenon of relative clause (RC) attachment ambiguity. The central hypothesis is that the architectural biases of RNNs (e.g., recency bias) serendipitously align with the predominant human parsing preference in English (LOW attachment), but not with the contrasting preference found in Spanish (HIGH attachment). This creates an illusion of human-like syntactic competence in English models that does not generalize cross-linguistically, challenging the assumption that necessary linguistic biases are present in the training data.
2. Methodology & Experimental Design
2.1. Relative Clause Attachment Ambiguity
The study probes models using sentences with ambiguous RC attachments, such as: "Andrew had dinner yesterday with the nephew of the teacher that was divorced." Two interpretations are possible: attachment to the higher noun phrase ("nephew" - HIGH) or the lower noun phrase ("teacher" - LOW). While both are grammatically valid, English speakers show a reliable LOW-attachment bias, whereas Spanish speakers show a HIGH-attachment bias.
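One common way to turn such ambiguity into a measurable contrast is to disambiguate with number agreement, so that the RC verb can only agree with one of the two nouns. The sketch below is a hypothetical stimulus builder in that spirit; the function name and materials are illustrative, not the paper's actual item set.

```python
def make_stimuli(np1_sg, np1_pl, np2_sg, np2_pl):
    """Build a minimal pair where singular 'was' forces one attachment site.

    Hypothetical design: 'was' can only agree with the singular noun, so
    pluralizing the other noun forces the attachment reading.
    """
    frame = "Andrew had dinner yesterday with the {a} of the {b} that was divorced."
    forced_high = frame.format(a=np1_sg, b=np2_pl)  # 'was' agrees with np1 -> HIGH
    forced_low = frame.format(a=np1_pl, b=np2_sg)   # 'was' agrees with np2 -> LOW
    return forced_high, forced_low

high, low = make_stimuli("nephew", "nephews", "teacher", "teachers")
```

A model's relative preference can then be read off from which member of the pair it assigns higher probability.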
2.2. Model Architecture & Training
Standard RNN-based language models (e.g., LSTMs or GRUs) were trained on large corpora of English and Spanish text. The training objective is to minimize the negative log-likelihood of the next word given the previous context: $L(\theta) = -\sum_{t=1}^{T} \log P(w_t \mid w_{<t})$.
2.3. Evaluation Metrics
Model preference is quantified by comparing the conditional probability the model assigns to the sentence continuation under each interpretation (HIGH vs. LOW). The bias score is calculated as the log-probability difference: $\text{Bias} = \log P(\text{LOW}) - \log P(\text{HIGH})$.
3. Results & Analysis
3.1. English Model Performance
RNN LMs trained on English text consistently exhibited a significant LOW-attachment bias, mirroring the well-documented human preference. This suggests the model's internal representations align with human syntactic processing for this phenomenon in English.
3.2. Spanish Model Performance
In stark contrast, RNN LMs trained on Spanish text failed to exhibit the human-like HIGH-attachment bias. Instead, they often showed a weak or even reversed (LOW) bias, indicating a failure to capture the typologically common syntactic preference present in the Spanish data.
3.3. Cross-Linguistic Comparison
The divergence in model performance between English and Spanish strongly suggests that the apparent success in English is due not to learning abstract syntactic rules from data, but to an overlap between the RNN's inherent recency bias (favoring attachment to the most recent noun) and the English LOW-attachment preference. This architectural bias works against learning the HIGH-attachment preference required for Spanish.
4. Technical Details & Mathematical Framework
The core of the language model is the sequential prediction of word $w_t$ given its context. For an RNN, the hidden state $h_t$ is updated as $h_t = f(W_{hh}h_{t-1} + W_{xh}x_t + b_h)$, where $f$ is a non-linear activation (e.g., tanh or an LSTM cell). The probability distribution over the vocabulary is $P(w_t \mid w_{<t}) = \mathrm{softmax}(W_{hy}h_t + b_y)$.
5. Analysis Framework: A Non-Code Case Study
Case: Evaluating an RNN LM's understanding of RC attachment in the sentence: "The journalist interviewed the assistant of the senator who was controversial."
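The scoring procedure for a case like this follows directly from the bias score defined above: sum the per-word conditional log-probabilities for each disambiguated variant and take the difference. A minimal sketch follows; `logprob` is assumed to wrap a trained LM, and `toy_logprob` is a stand-in scorer, not the paper's model.

```python
import math

def sentence_logprob(sentence, logprob):
    """Sum conditional log-probs log P(w_t | w_<t) over the sentence.

    `logprob(context_tokens, word)` is assumed to wrap a trained LM.
    """
    tokens = sentence.split()
    return sum(logprob(tokens[:i], tokens[i]) for i in range(len(tokens)))

def attachment_bias(low_variant, high_variant, logprob):
    """Bias = log P(LOW) - log P(HIGH); positive => LOW-attachment preference."""
    return sentence_logprob(low_variant, logprob) - sentence_logprob(high_variant, logprob)

# Stand-in scorer for demonstration: uniform over a pretend 10,000-word vocabulary.
def toy_logprob(context, word):
    return math.log(1.0 / 10_000)
```

Under the uniform stand-in scorer, equal-length variants receive identical scores and the bias is exactly zero; a real LM would break that tie one way or the other.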
6. Core Insight & Analyst's Perspective
Core Insight: This paper delivers a crucial reality check for the NLP community. It demonstrates that what looks like "learning syntax" in an LM can often be a mirage: a fortunate coincidence between a model's architectural shortcomings (such as recency bias) and the statistical patterns of a specific language (English). The failure to replicate the result in Spanish exposes the fragility of this "learning." As highlighted in the seminal work on evaluating syntactic knowledge in LMs by Linzen et al. (2016), we must be wary of attributing human-like linguistic competence to models on the basis of narrow, language-specific successes.
Logical Flow: The argument is elegantly constructed. It starts with a known human linguistic contrast (English LOW vs. Spanish HIGH bias), trains standard models on both languages, and finds a performance asymmetry. The authors then logically connect this asymmetry to a known, non-linguistic property of RNNs (recency bias), providing a parsimonious explanation that does not require positing abstract rule learning. This flow effectively undermines the assumption that the training signal alone contains sufficient information for learning deep syntax.
Strengths & Flaws: The major strength is the clever use of cross-linguistic variation as a controlled experiment to disentangle data-driven learning from architectural bias. This is a powerful methodological contribution. However, the analysis is somewhat limited by its focus on a single, albeit important, syntactic phenomenon. It leaves open the question of how widespread this issue is: are other apparent syntactic competencies in English LMs similarly illusory? Furthermore, the study uses older RNN architectures; testing with modern Transformer-based models, which have different inductive biases (such as attention), is a critical next step, as suggested by the evolution from models like GPT-2 to GPT-3.
Actionable Insights: For researchers and engineers, this paper mandates a shift in evaluation strategy.
First, cross-linguistic evaluation must become a standard stress test for any claim about a model's linguistic capabilities, moving beyond Anglo-centric benchmark suites. Second, we need more probes that separate architectural bias from genuine learning, perhaps by designing adversarial datasets within a single language. Third, for those building production systems for non-English languages, this is a stark warning: off-the-shelf architectures may embed syntactic biases that are alien to the target language, potentially degrading performance on complex parsing tasks. The path forward involves either designing more linguistically informed model architectures or developing training objectives that explicitly penalize these unwanted inductive biases, moving beyond simple next-word prediction.
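The first recommendation, a cross-linguistic stress test, can be sketched as a small harness that averages the attachment bias over minimal pairs in each language. All names below are hypothetical; in practice each scorer would wrap an LM trained on that language's corpus.

```python
def mean_attachment_bias(pairs, score):
    """Average bias over minimal pairs.

    pairs: list of (low_sentence, high_sentence) tuples.
    score: sentence -> log P(sentence) under some LM (hypothetical wrapper).
    Positive mean => LOW-attachment preference; negative => HIGH.
    """
    diffs = [score(low) - score(high) for low, high in pairs]
    return sum(diffs) / len(diffs)

def cross_linguistic_report(suites, scores):
    """Per-language mean bias.

    suites: {language: [(low, high), ...]}
    scores: {language: scorer for that language's model}
    """
    return {lang: mean_attachment_bias(pairs, scores[lang])
            for lang, pairs in suites.items()}
```

A human-like result would show a positive mean for the English suite and a negative mean for the Spanish one; the paper's finding is that standard RNN LMs produce a positive (or near-zero) mean for both.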