1 Introduction
The paper critically examines the growing trend of using neural language models (LMs) as proxies for theories of human language acquisition. While LMs have achieved remarkable success on various NLP tasks, their relevance as cognitive models is questioned because the scale and nature of their training data differ fundamentally from what children receive during language learning.
The authors argue that popular syntactic evaluation benchmarks (e.g., BLiMP, SyntaxGym) may lack the structural diversity and psychological validity needed to assess whether LMs acquire language in a human-like way. They advocate for using more rigorous, linguistically-curated datasets like the LI-Adger dataset, which contains gradient acceptability judgments from native speakers.
1.1 Implications for Language Acquisition?
This section highlights the stark data disparity: models like BERT are trained on billions of tokens, while a child receives only about 10 million words per year. Recent work attempts to bridge this gap by training models on child-directed speech (CDS) at a more human-like scale (e.g., 5M tokens). The central question is whether models trained on such "ablated" input can still perform well on behavioral benchmarks and thus serve as valid cognitive models.
2 Core Insight: The Benchmarking Mirage
The paper's core thesis is a direct challenge to the NLP community's complacency. Impressive performance on templated, synthetic benchmarks like BLiMP creates an illusion of grammatical competence. The authors expose this as a methodological artifact. When LMs are tested on the LI-Adger dataset—a carefully constructed set of minimal pairs designed by theoretical linguists to probe specific syntactic principles—their predictions diverge significantly from human judgments. This isn't just a performance gap; it's evidence of a fundamental representational mismatch. LMs may be learning surface statistical patterns that coincidentally align with simple syntactic templates, not the abstract, hierarchical structures that underpin human grammar.
3 Logical Flow: From Data Disparity to Methodological Critique
The argument proceeds with surgical precision. First, it establishes the undeniable data-scale chasm between LM training and child acquisition, framing the "small-scale training" research as a necessary but insufficient corrective. Second, it demonstrates that even on this leveled playing field (small data), LMs can be matched by simpler baselines, questioning their added cognitive value. The logical pivot is the critique of benchmark design: templated tasks lack the "structural diversity" of real linguistic inquiry. The final, damning evidence comes from the LI-Adger test, where LM performance flatly contradicts human linguistic intuition. The flow is: problem statement (data mismatch) -> attempted solution (small-scale training) -> exposure of deeper problem (flawed evaluation) -> conclusive counter-evidence.
4 Strengths & Flaws: A Critical Dissection
Strengths: The paper's greatest strength is its methodological rigor and interdisciplinary grounding. It doesn't just criticize; it offers a superior alternative (LI-Adger). By tying evaluation to core theoretical linguistics and psycholinguistics, it raises the bar for what constitutes evidence of "human-like" knowledge. The focus on data scale is also prescient, aligning with broader trends in efficient ML.
Flaws & Omissions: The analysis, while sharp, potentially overstates the failure. Does divergence on LI-Adger invalidate all parallels between LM learning and acquisition? Perhaps not. The paper could engage more with what LMs do get right and why. Furthermore, it leans heavily on syntactic knowledge; a fuller cognitive model must also account for semantic, pragmatic, and social learning aspects. The call for "more realistic data" is valid but underspecified—how do we model the multimodal, interactive, and error-filled nature of child-directed input?
5 Actionable Insights: A Path Forward
For researchers, the mandate is clear: abandon the comfort of easy benchmarks. Integrate resources from theoretical linguistics (like the LI-Adger paradigm) and developmental psychology into evaluation suites. Prioritize the creation of "cognitive benchmarks" that test for the hallmarks of human language learning: generalization from sparse data, robustness to noise, and adherence to abstract grammatical principles. For model developers, the goal should shift from maximizing benchmark scores to designing architectures and training regimes that are data-efficient and can learn from human-like input (e.g., incorporating curriculum learning or active learning mechanisms inspired by development). The ultimate insight: building a true cognitive model is a different—and harder—problem than building a performant NLP system.
6 Original Analysis: The Cognitive Chasm in Language Modeling
This paper by Vázquez Martínez et al. delivers a necessary and sobering critique in an era often dazzled by scale. It correctly identifies a fundamental tension: while modern LMs, especially large language models (LLMs), exhibit impressive surface-level linguistic competence, their path to that competence is astronomically different from a child's. The authors' focus on benchmark insufficiency is particularly astute. It echoes concerns in other AI domains where benchmark performance fails to translate to robust, generalizable intelligence. For instance, in computer vision, models that excel on ImageNet can be fooled by simple adversarial perturbations, revealing a lack of true visual understanding—a phenomenon detailed in research from institutions like MIT and Google Brain. Similarly, the paper shows that LMs' success on BLiMP may be a similar kind of "Clever Hans" effect, where models exploit statistical regularities in the benchmark construction rather than learning the underlying syntactic rule.
The advocacy for the LI-Adger dataset is the paper's most significant contribution. By grounding evaluation in minimal pairs and gradient acceptability judgments—the gold standard in theoretical syntax—it forces models to demonstrate knowledge of grammaticality, not just likelihood. The finding that LMs fail here is telling. It suggests that the probability distributions learned from vast text corpora ($P(w_n | w_{1:n-1})$) do not necessarily converge on the categorical or gradient judgments that characterize human grammatical knowledge. This aligns with the arguments of linguists like Noam Chomsky, who have long contended that statistical learning from surface forms is insufficient to explain the poverty of the stimulus and the abstract nature of syntactic rules.
However, the paper's conclusion shouldn't be that LMs are irrelevant to cognitive science. Instead, it reframes the challenge. The future lies in "cognitive architecture-informed" modeling. This might involve incorporating inductive biases inspired by linguistic theory (e.g., a predisposition for hierarchical structure), as seen in some neuro-symbolic approaches, or designing training objectives that go beyond next-word prediction. The work of researchers like Brenden Lake and Marco Baroni on few-shot learning and compositionality points in this direction. The path forward is not to discard LMs but to rigorously test them against the right cognitive benchmarks and iteratively redesign them based on the failures, much like the cycle of theory and experiment in other sciences.
7 Technical Details & Mathematical Framework
The core evaluation method discussed is using a language model's output probabilities to predict human acceptability judgments. For a sentence $S = w_1, w_2, ..., w_n$, a standard autoregressive LM assigns a probability: $$P_{LM}(S) = \prod_{i=1}^{n} P(w_i | w_1, ..., w_{i-1}; \theta)$$ where $\theta$ are the model parameters. The mean per-token surprisal (length-normalized negative log-likelihood) is often used as a proxy for (un)acceptability: $$\text{Surprisal}(S) = -\frac{1}{n} \sum_{i=1}^{n} \log P(w_i | w_1, ..., w_{i-1}; \theta)$$ The hypothesis is that higher probability (lower surprisal) should correlate with higher human acceptability ratings. The paper's critical finding is that this correlation breaks down on the LI-Adger dataset, indicating a disconnect between the LM's probability-based "grammaticality" metric and human judgment.
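As a concrete illustration, the sketch below computes mean per-token surprisal with an off-the-shelf autoregressive LM via the Hugging Face transformers library. GPT-2 is used purely as a stand-in scorer; it is not one of the models evaluated in the paper, and the sentences are illustrative.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# GPT-2 is an illustrative stand-in, not one of the paper's evaluated models.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def mean_surprisal(sentence: str) -> float:
    """Average per-token surprisal (in nats) of a sentence under the LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean next-token
        # cross-entropy over the sequence, i.e. mean surprisal in nats.
        return model(ids, labels=ids).loss.item()

# Lower surprisal is taken as a proxy for higher acceptability.
print(mean_surprisal("The key to the cabinets is on the table."))
print(mean_surprisal("The key to the cabinets are on the table."))
```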
The paper also references models trained on child-directed speech. The key technical challenge here is learning from very small datasets ($\approx 5\times10^6$ tokens) compared to standard LM corpora ($>10^9$ tokens). This requires efficient architectures and training techniques to avoid overfitting and to extract generalizable patterns from sparse data.
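One way to guard against overfitting at this scale is simply to shrink model capacity. The configuration below is a minimal sketch of what a scaled-down masked LM might look like; the hyperparameters are hypothetical illustrations, not BabyBERTa's published settings.

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Hypothetical small-scale configuration; these values are illustrative
# and are not the hyperparameters reported for BabyBERTa.
config = RobertaConfig(
    vocab_size=8192,              # small subword vocabulary for a ~5M-token corpus
    hidden_size=256,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=1024,
    max_position_embeddings=130,  # supports sequences up to 128 tokens
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")  # vs. ~125M for RoBERTa-base
```

Reducing capacity is only one lever; heavier regularization, shorter training schedules, and a tokenizer sized to the corpus are others commonly used in small-data regimes.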
8 Experimental Results & Chart Analysis
The paper presents a key result in Figure 1, which compares the performance of different LMs (BabyBERTa, AO-CHILDES, AO-NEWSELA, Wikipedia-1) on the LI-Adger dataset against a baseline of human performance.
Chart Interpretation: The vertical line representing human performance acts as a benchmark. The chart likely shows the correlation coefficient (e.g., Spearman's $\rho$) between model surprisal and human acceptability ratings for each LM. The critical finding is that all LM bars fall significantly short of the human benchmark line. This visually demonstrates the paper's central claim: even models specifically trained on child-like data (BabyBERTa, AO-CHILDES) fail to match human judgments on this syntactically nuanced dataset. The performance gap indicates that current LM training objectives do not lead to the acquisition of human-like grammatical knowledge, as measured by this rigorous test.
9 Analysis Framework: The LI-Adger Case Study
Framework: Evaluating LMs as Cognitive Models via Minimal Pair Acceptability.
Objective: To determine if an LM's internal probability distribution aligns with human grammatical intuition for structurally contrastive sentences.
Procedure:
- Stimulus Selection: Use a dataset like LI-Adger, which consists of minimal pairs (e.g., "Who do you think saw John?" vs. "*Who do you think that saw John?") where one variant is grammatical and the other is less acceptable or ungrammatical, based on a specific syntactic principle (e.g., the "that-trace" filter, which bans an overt complementizer immediately before a subject extraction site).
- Model Query: For each sentence $S$ in a minimal pair, compute the model's average token surprisal: $\text{Surprisal}(S) = -\frac{1}{|S|} \sum_{i} \log P(w_i | w_1, ..., w_{i-1}; \theta)$.
- Prediction Generation: The model "prefers" the sentence with lower surprisal. For a minimal pair (A, B), if $\text{Surprisal}(A) < \text{Surprisal}(B)$, the model predicts A is more acceptable.
- Comparison to Human Data: Compare the model's preference pattern across hundreds of such minimal pairs to the aggregated acceptability judgments from human participants. Calculate a correlation coefficient (e.g., Spearman's $\rho$) between model surprisal and human rating scores.
- Interpretation: A high, significant positive correlation would suggest the LM's knowledge aligns with human syntactic judgment. A low or non-significant correlation (as found in the paper) indicates a divergence.
Non-Code Example: Consider testing knowledge of subject-verb agreement across a distracting clause: "The key to the cabinets is/*are on the table." Humans robustly rate "is" as correct. An LM that has learned the abstract agreement rule (subject 'key' -> verb 'is') should assign higher probability to the correct sentence. An LM relying on local n-gram statistics might be misled by the proximity of "cabinets" and prefer "are." Applying the framework above to many such pairs reveals the nature of the LM's acquired knowledge; a code sketch of the full procedure follows below.
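The framework can be sketched end-to-end in a few lines. The stimuli and human ratings below are invented for illustration; the actual LI-Adger items and judgments are not reproduced here, and GPT-2 again serves only as a stand-in scorer.

```python
import torch
from scipy.stats import spearmanr
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def mean_surprisal(sentence: str) -> float:
    """Average per-token surprisal (nats) of a sentence under the LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

# Invented stimuli with made-up acceptability ratings on a 1-7 scale;
# the real LI-Adger items and judgments are not reproduced here.
stimuli = [
    ("Who do you think saw John?", 6.8),
    ("Who do you think that saw John?", 2.1),             # that-trace violation
    ("The key to the cabinets is on the table.", 6.5),
    ("The key to the cabinets are on the table.", 2.9),    # agreement attraction
]

surprisals = [mean_surprisal(s) for s, _ in stimuli]
ratings = [r for _, r in stimuli]

# Lower surprisal should track higher acceptability, so correlate
# negated surprisal with the human ratings.
rho, p = spearmanr([-s for s in surprisals], ratings)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")

# Forced-choice preference within the first minimal pair.
preferred = stimuli[0][0] if surprisals[0] < surprisals[1] else stimuli[1][0]
print(f"Model prefers: {preferred!r}")
```

A high positive correlation over many such items would indicate alignment with human judgments; the paper's point is that this alignment fails to materialize on the LI-Adger materials.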
10 Future Applications & Research Directions
1. Development of "Cognitive Benchmarks": A major direction is the creation of standardized, multi-faceted evaluation suites that go beyond syntax to include semantics, pragmatics, and language acquisition milestones (e.g., vocabulary spurt, overgeneralization errors). These benchmarks should be co-designed by computational linguists, developmental psychologists, and cognitive scientists.
2. Architectures with Linguistic Inductive Biases: Future models may incorporate explicit structural priors. For example, architectures that inherently build hierarchical representations or enforce syntactic constraints during generation, moving closer to the principles-and-parameters framework in linguistics.
3. Interactive and Multimodal Training: To better simulate child learning, models could be trained not on static text but on interactive, multimodal data streams (vision + speech + text) within a grounded environment, as explored in embodied AI research.
4. Data-Efficient and Curriculum Learning: Developing training algorithms that succeed with orders-of-magnitude less data, perhaps by implementing curriculum learning strategies that mirror the progression of complexity in child-directed speech.
5. Bridging to Neurolinguistics: Comparing the internal representations and processing dynamics of LMs with neural data from humans (e.g., fMRI, EEG) during language tasks, as pioneered by the work of researchers at MIT's McGovern Institute, could provide a new level of validation for cognitive models.
11 References
- Linzen, T., & Baroni, M. (2021). Syntactic structure from deep learning. Annual Review of Linguistics.
- Warstadt, A., et al. (2020). BLiMP: The Benchmark of Linguistic Minimal Pairs for English. Transactions of the Association for Computational Linguistics.
- Huebner, P. A., et al. (2021). BabyBERTa: Learning More Grammar With Small-Scale Child-Directed Language. Proceedings of CoNLL.
- Chomsky, N. (1965). Aspects of the Theory of Syntax. MIT Press.
- Lake, B. M., & Baroni, M. (2023). Human-like systematic generalization through a meta-learning neural network. Nature.
- Hewitt, J., & Manning, C. D. (2019). A Structural Probe for Finding Syntax in Word Representations. Proceedings of NAACL.
- Warstadt, A., & Bowman, S. R. (2022). What Artificial Neural Networks Can Tell Us About Human Language Acquisition. Algebraic Structures in Natural Language.
- Fenson, L., et al. (1994). Variability in early communicative development. Monographs of the Society for Research in Child Development.