1. Introduction
This paper investigates the syntactic biases learned by Recurrent Neural Network (RNN) Language Models, specifically focusing on the phenomenon of relative clause (RC) attachment ambiguity. The central hypothesis is that the architectural biases of RNNs (e.g., recency bias) serendipitously align with the predominant human parsing preference in English (LOW attachment), but not with the contrasting preference found in Spanish (HIGH attachment). This creates an illusion of human-like syntactic competence in English models that does not generalize cross-linguistically, challenging the assumption that necessary linguistic biases are present in the training data.
2. Methodology & Experimental Design
2.1. Relative Clause Attachment Ambiguity
The study probes models using sentences with ambiguous RC attachments, such as: "Andrew had dinner yesterday with the nephew of the teacher that was divorced." Two interpretations are possible: attachment to the higher noun phrase ("nephew" - HIGH) or the lower noun phrase ("teacher" - LOW). While both are grammatically valid, English speakers show a reliable LOW-attachment bias, whereas Spanish speakers show a HIGH-attachment bias.
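One common way to turn such ambiguity into a measurable contrast is to disambiguate with number agreement, so that the RC verb can only agree with one of the two nouns. The sketch below is a hypothetical stimulus builder in that spirit; the function name and materials are illustrative, not the paper's actual item set.

```python
def make_stimuli(np1_sg, np1_pl, np2_sg, np2_pl):
    """Build a minimal pair where singular 'was' forces one attachment site.

    Hypothetical design: 'was' can only agree with the singular noun, so
    pluralizing the other noun forces the attachment reading.
    """
    frame = "Andrew had dinner yesterday with the {a} of the {b} that was divorced."
    forced_high = frame.format(a=np1_sg, b=np2_pl)  # 'was' agrees with np1 -> HIGH
    forced_low = frame.format(a=np1_pl, b=np2_sg)   # 'was' agrees with np2 -> LOW
    return forced_high, forced_low

high, low = make_stimuli("nephew", "nephews", "teacher", "teachers")
```

A model's relative preference can then be read off from which member of the pair it assigns higher probability.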
2.2. Model Architecture & Training
Standard RNN-based language models (e.g., LSTMs or GRUs) were trained on large corpora of English and Spanish text. The training objective is to minimize the negative log-likelihood of the next word given the previous context: $L(\theta) = -\sum_{t=1}^{T} \log P(w_t \mid w_{<t})$.
2.3. Evaluation Metrics
Model preference is quantified by comparing the conditional probability the model assigns to the sentence continuation under each interpretation (HIGH vs. LOW). The bias score is calculated as the log-probability difference: $\text{Bias} = \log P(\text{LOW}) - \log P(\text{HIGH})$.
3. Results & Analysis
3.1. English Model Performance
RNN LMs trained on English text consistently exhibited a significant LOW-attachment bias, mirroring the well-documented human preference. This suggests the model's internal representations align with human syntactic processing for this phenomenon in English.
3.2. Spanish Model Performance
In stark contrast, RNN LMs trained on Spanish text failed to exhibit the human-like HIGH-attachment bias. Instead, they often showed a weak or even reversed (LOW) bias, indicating a failure to capture the typologically common syntactic preference present in the Spanish data.
3.3. Cross-Linguistic Comparison
The divergence in model performance between English and Spanish strongly suggests that the apparent success in English is due not to learning abstract syntactic rules from data, but to an overlap between the RNN's inherent recency bias (favoring attachment to the most recent noun) and the English LOW-attachment preference. This architectural bias works against learning the HIGH-attachment preference required for Spanish.
4. Technical Details & Mathematical Framework
The core of the language model is the sequential prediction of word $w_t$ given its context. For an RNN, the hidden state $h_t$ is updated as $h_t = f(W_{hh}h_{t-1} + W_{xh}x_t + b_h)$, where $f$ is a non-linear activation (e.g., tanh or an LSTM cell). The probability distribution over the vocabulary is $P(w_t \mid w_{<t}) = \mathrm{softmax}(W_{hy}h_t + b_y)$.
5. Analysis Framework: A Non-Code Case Study
Case: Evaluating an RNN LM's understanding of RC attachment in the sentence: "The journalist interviewed the assistant of the senator who was controversial."
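The scoring procedure for a case like this follows directly from the bias score defined above: sum the per-word conditional log-probabilities for each disambiguated variant and take the difference. A minimal sketch follows; `logprob` is assumed to wrap a trained LM, and `toy_logprob` is a stand-in scorer, not the paper's model.

```python
import math

def sentence_logprob(sentence, logprob):
    """Sum conditional log-probs log P(w_t | w_<t) over the sentence.

    `logprob(context_tokens, word)` is assumed to wrap a trained LM.
    """
    tokens = sentence.split()
    return sum(logprob(tokens[:i], tokens[i]) for i in range(len(tokens)))

def attachment_bias(low_variant, high_variant, logprob):
    """Bias = log P(LOW) - log P(HIGH); positive => LOW-attachment preference."""
    return sentence_logprob(low_variant, logprob) - sentence_logprob(high_variant, logprob)

# Stand-in scorer for demonstration: uniform over a pretend 10,000-word vocabulary.
def toy_logprob(context, word):
    return math.log(1.0 / 10_000)
```

Under the uniform stand-in scorer, equal-length variants receive identical scores and the bias is exactly zero; a real LM would break that tie one way or the other.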
6. Core Insight & Analyst's Perspective
Core Insight: This paper delivers a crucial reality check for the NLP community. It demonstrates that what looks like "learning syntax" in an LM can often be a mirage: a fortunate coincidence between a model's architectural shortcomings (such as recency bias) and the statistical patterns of a specific language (English). The failure to replicate the result in Spanish exposes the fragility of this "learning." As highlighted in the seminal work on evaluating syntactic knowledge in LMs by Linzen et al. (2016), we must be wary of attributing human-like linguistic competence to models on the basis of narrow, language-specific successes.
Logical Flow: The argument is elegantly constructed. It starts with a known human linguistic contrast (English LOW vs. Spanish HIGH bias), trains standard models on both languages, and finds a performance asymmetry. The authors then logically connect this asymmetry to a known, non-linguistic property of RNNs (recency bias), providing a parsimonious explanation that does not require positing abstract rule learning. This flow effectively undermines the assumption that the training signal alone contains sufficient information for learning deep syntax.
Strengths & Flaws: The major strength is the clever use of cross-linguistic variation as a controlled experiment to disentangle data-driven learning from architectural bias. This is a powerful methodological contribution. However, the analysis is somewhat limited by its focus on a single, albeit important, syntactic phenomenon. It leaves open the question of how widespread this issue is: are other apparent syntactic competencies in English LMs similarly illusory? Furthermore, the study uses older RNN architectures; testing with modern Transformer-based models, which have different inductive biases (such as attention), is a critical next step, as suggested by the evolution from models like GPT-2 to GPT-3.
Actionable Insights: For researchers and engineers, this paper mandates a shift in evaluation strategy.
First, cross-linguistic evaluation must become a standard stress test for any claim about a model's linguistic capabilities, moving beyond Anglo-centric benchmark suites. Second, we need more probes that separate architectural bias from genuine learning, perhaps by designing adversarial datasets within a single language. Third, for those building production systems for non-English languages, this is a stark warning: off-the-shelf architectures may embed syntactic biases that are alien to the target language, potentially degrading performance on complex parsing tasks. The path forward involves either designing more linguistically informed model architectures or developing training objectives that explicitly penalize these unwanted inductive biases, moving beyond simple next-word prediction.
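The first recommendation, a cross-linguistic stress test, can be sketched as a small harness that averages the attachment bias over minimal pairs in each language. All names below are hypothetical; in practice each scorer would wrap an LM trained on that language's corpus.

```python
def mean_attachment_bias(pairs, score):
    """Average bias over minimal pairs.

    pairs: list of (low_sentence, high_sentence) tuples.
    score: sentence -> log P(sentence) under some LM (hypothetical wrapper).
    Positive mean => LOW-attachment preference; negative => HIGH.
    """
    diffs = [score(low) - score(high) for low, high in pairs]
    return sum(diffs) / len(diffs)

def cross_linguistic_report(suites, scores):
    """Per-language mean bias.

    suites: {language: [(low, high), ...]}
    scores: {language: scorer for that language's model}
    """
    return {lang: mean_attachment_bias(pairs, scores[lang])
            for lang, pairs in suites.items()}
```

A human-like result would show a positive mean for the English suite and a negative mean for the Spanish one; the paper's finding is that standard RNN LMs produce a positive (or near-zero) mean for both.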