1. Introduction & Overview
This analysis is based on the research paper "Indications that character language models learn English morpho-syntactic units and regularities" by Kementchedjhieva and Lopez (2018). The central question addressed is whether character-level Recurrent Neural Networks (RNNs), specifically LSTMs, move beyond merely memorizing surface character patterns to learning abstract linguistic structures like morphemes and syntactic categories.
While prior work (e.g., Chung et al., 2016; Kim et al., 2016) claimed such models possess morphological awareness, this paper provides direct empirical evidence through systematic probing experiments. The authors instrument a character LSTM language model trained on English Wikipedia text to investigate its internal representations and generalization capabilities.
Core Thesis:
The paper argues that a character-level language model can, under certain conditions (e.g., when morphemes largely overlap with words), learn to identify higher-order linguistic units (morphemes, words) and capture some of their underlying properties and combinatorial regularities.
2. Language Modeling & Architecture
The model under investigation is a 'wordless' character-level RNN with Long Short-Term Memory (LSTM) units, following the architecture popularized by Karpathy (2015). Input is a continuous stream of characters, including spaces treated as regular tokens, with no explicit word segmentation.
2.1 Model Formulation
The model operates as follows at each timestep $t$:
- Character Embedding: Input character $c_t$ is converted to a dense vector: $\mathbf{x}_{c_t} = E^T \mathbf{v}_{c_t}$, where $E \in \mathbb{R}^{|V| \times d}$ is the embedding matrix, $|V|$ is character vocabulary size, $d$ is embedding dimension, and $\mathbf{v}_{c_t}$ is a one-hot vector.
- Hidden State Update: The LSTM updates its hidden state: $\mathbf{h}_t = \text{LSTM}(\mathbf{x}_{c_t}, \mathbf{h}_{t-1})$.
- Output Probability: A linear layer followed by a softmax predicts the next character: $p(c_{t+1} = c \mid \mathbf{h}_t) = \text{softmax}(\mathbf{W}_o \mathbf{h}_t + \mathbf{b}_o)_c$ for each $c \in V$, where the subscript $c$ selects the component of the softmax output corresponding to that character.
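The formulation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: all weights are random placeholders and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, H = 50, 16, 32    # character vocab size, embedding dim, hidden dim

E   = rng.normal(size=(V, d))             # embedding matrix E in R^{|V| x d}
W_x = rng.normal(size=(4 * H, d)) * 0.1   # input-to-gate weights (i, f, g, o stacked)
W_h = rng.normal(size=(4 * H, H)) * 0.1   # hidden-to-gate weights
b   = np.zeros(4 * H)
W_o = rng.normal(size=(V, H)) * 0.1       # output projection
b_o = np.zeros(V)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(char_idx, h_prev, c_prev):
    """One timestep: embed the character, update the LSTM state, predict the next char."""
    x = E[char_idx]                        # x_{c_t} = E^T v_{c_t}, via one-hot lookup
    gates = W_x @ x + W_h @ h_prev + b
    i, f, g, o = np.split(gates, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    c_new = f * c_prev + i * np.tanh(g)    # cell state update
    h_new = o * np.tanh(c_new)             # hidden state h_t
    logits = W_o @ h_new + b_o
    p = np.exp(logits - logits.max())
    p /= p.sum()                           # softmax over the next character
    return h_new, c_new, p

h_t, c_t, p = lstm_step(3, np.zeros(H), np.zeros(H))
```

In a real model the weights would be trained end-to-end by backpropagation through time; here the step only demonstrates the shapes and data flow.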
2.2 Training Details
The model was trained on the first 7 million character tokens from the English Wikipedia, presented as a continuous stream. This setup forces the model to infer word and morphological boundaries from distributional patterns alone.
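This "continuous stream" setup can be sketched as follows; the corpus text is a placeholder, but the key point carries over: spaces enter the vocabulary as ordinary tokens, and training pairs are just (current character, next character).

```python
# Turn raw text into a continuous stream of character ids with no word
# segmentation; spaces are ordinary vocabulary items.
text = "the cat sat on the mat"              # placeholder for the Wikipedia stream
vocab = sorted(set(text))                    # character vocabulary, space included
char_to_id = {ch: i for i, ch in enumerate(vocab)}

stream  = [char_to_id[ch] for ch in text]    # one unbroken stream of character ids
inputs  = stream[:-1]                        # model sees c_1 .. c_{T-1}
targets = stream[1:]                         # and must predict c_2 .. c_T
```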
3. Core Findings & Evidence
The authors employ several probing techniques to uncover what the model has learned.
3.1 Productive Morphological Processes
The model demonstrates an ability to apply English morphological rules productively. For instance, when prompted with a novel stem, it can generate plausible inflected or derived forms, suggesting it has abstracted morphemic units (e.g., recognizing "-ed" as a past tense suffix) rather than just memorizing whole words.
3.2 The "Boundary Unit" Discovery
A critical finding is the identification of a specific hidden unit within the LSTM that consistently exhibits high activation at word boundaries (spaces). This unit effectively acts as a learned word segmenter. Crucially, its activation pattern extends to morpheme boundaries within words (e.g., at the junction of "un" and "happy"), providing a mechanistic explanation for how the model identifies sub-word units.
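One simple way such a unit can be located is to record hidden states over a character sequence and rank individual dimensions by how well their activation tracks a binary is-boundary label. The sketch below uses synthetic stand-ins for real LSTM activations (with a correlated unit injected at a known index so the search has something to find); it illustrates the search procedure, not the paper's exact analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
T, H = 200, 32                             # timesteps, hidden size
states = rng.normal(size=(T, H))           # stand-in for recorded h_t vectors
boundary = rng.integers(0, 2, size=T)      # stand-in for is-space labels

# Inject a boundary-tracking unit at index 7 so the search can find it.
states[:, 7] = boundary + 0.1 * rng.normal(size=T)

# Rank units by absolute correlation between activation and the boundary label.
corrs = [abs(np.corrcoef(states[:, j], boundary)[0, 1]) for j in range(H)]
best = int(np.argmax(corrs))
```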
3.3 Learning Morpheme Boundaries
Experiments suggest the model learns morpheme boundaries by extrapolating from the more frequent and clear signal of word boundaries. The statistical regularity of spaces provides a scaffold for discovering internal morphological structure.
3.4 Encoding Syntactic Information (POS)
Probing classifiers trained on the model's hidden states can accurately predict a word's part-of-speech (POS) tag. This indicates that the character-level model encodes not just morphological but also syntactic information about the words it processes, likely inferred from sequential context.
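A diagnostic probe of this kind is just a linear classifier trained on frozen hidden states. The sketch below trains a logistic-regression probe with plain gradient descent on synthetic states that weakly encode a binary label (a stand-in for a noun-vs-verb POS distinction); in the paper's setup the states would come from the trained character LM.

```python
import numpy as np

rng = np.random.default_rng(2)
N, H = 400, 16
y = rng.integers(0, 2, size=N)                  # stand-in POS labels (0/1)
X = rng.normal(size=(N, H)) + y[:, None] * 1.5  # states weakly encode the label

w, b = np.zeros(H), 0.0
for _ in range(300):                            # gradient descent on logistic loss
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y) / N)
    b -= 0.5 * (p - y).mean()

acc = (((X @ w + b) > 0) == (y == 1)).mean()    # probe accuracy on training data
```

High probe accuracy is evidence that the label is linearly decodable from the states; it does not by itself show the model *uses* that information, which is why the paper pairs probing with behavioral tests.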
4. Key Experiment: Selectional Restrictions
The most compelling evidence comes from testing the model's knowledge of selectional restrictions of English derivational morphemes. This task sits at the morphology-syntax interface. For example, the suffix "-ity" typically attaches to adjectives to form nouns ("active" → "activity"), not to verbs ("*runity").
The authors test the model by comparing the probability it assigns to a correct derivation (e.g., completing "active" with "-ity") versus an incorrect one (e.g., completing "run" with "-ity"). The model shows a strong preference for linguistically valid combinations, demonstrating it has learned these abstract constraints.
Experimental Result Highlight:
The character LM successfully distinguished between licit and illicit morpheme combinations with high accuracy, confirming it captures morpho-syntactic regularities beyond surface form.
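The licit-vs-illicit comparison amounts to scoring each candidate suffix by summing per-character log-probabilities after the stem. The sketch below uses a hand-written toy table of conditional distributions in place of a trained LM (the `toy_lm` dictionary and its probabilities are invented for illustration); a real implementation would run the LSTM over the context and read off the softmax at each step.

```python
import math

# Toy stand-in for the trained character LM: after "activ" the next characters
# of "-ity" are likely; after "run" they are not. Probabilities are invented.
toy_lm = {
    "activ":   {"i": 0.6, "e": 0.3},
    "activi":  {"t": 0.8},
    "activit": {"y": 0.7},
    "run":     {"n": 0.5, " ": 0.4, "i": 0.01},
    "runi":    {"t": 0.05},
    "runit":   {"y": 0.05},
}

def suffix_logprob(stem, suffix):
    """Sum log p(next char | context) over the characters of the suffix."""
    lp, ctx = 0.0, stem
    for ch in suffix:
        lp += math.log(toy_lm[ctx].get(ch, 1e-6))
        ctx += ch
    return lp

licit   = suffix_logprob("activ", "ity")   # adjective stem: valid host for -ity
illicit = suffix_logprob("run", "ity")     # verb stem: *runity
```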
5. Technical Details & Mathematical Formulation
The core learning mechanism is the LSTM's ability to compress sequential history into a state vector $\mathbf{h}_t$. The probability of the next character is given by: $$p(c_{t+1} | c_{1:t}) = \text{softmax}(\mathbf{W}_o \mathbf{h}_t + \mathbf{b}_o)$$ where $\mathbf{h}_t = f_{\text{LSTM}}(\mathbf{x}_{c_t}, \mathbf{h}_{t-1})$. The model's "understanding" of morphology and syntax is implicitly encoded in the parameters of the LSTM ($\mathbf{W}_f, \mathbf{W}_i, \mathbf{W}_o, \mathbf{W}_c$, etc.) and the projection matrices, which are optimized to minimize cross-entropy loss on character prediction.
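The cross-entropy objective at a single timestep is simply the negative log-probability the softmax assigns to the character that actually comes next, averaged over the stream during training. A small numeric illustration (the logits are arbitrary):

```python
import numpy as np

logits = np.array([2.0, 0.5, -1.0])   # scores W_o h_t + b_o for a 3-char vocab
p = np.exp(logits - logits.max())
p /= p.sum()                          # softmax distribution over next characters

true_idx = 0                          # index of the actual next character
loss = -np.log(p[true_idx])           # per-timestep cross-entropy
```

Minimizing this loss over millions of characters is the only training signal; everything the probes later uncover (boundaries, POS, selectional restrictions) is a byproduct of this objective.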
The probing experiments involve training simple classifiers (e.g., logistic regression) on frozen hidden state representations $\mathbf{h}_t$ to predict external linguistic labels (e.g., "is this a word boundary?"), revealing what information is linearly encoded in those states.
6. Results & Interpretation
The results collectively paint a convincing picture:
- Boundary Detection: Existence of a dedicated "boundary unit" provides a clear, interpretable mechanism for unit discovery.
- Productive Generalization: The model applies rules to novel items, ruling out pure memorization.
- Syntactic Awareness: POS information is encoded, enabling syntax-sensitive operations.
- Morpho-Syntactic Integration: Success on selectional restriction tasks shows the model integrates morphological and syntactic knowledge.
Limitation Noted: The authors acknowledge the model sometimes makes incorrect generalizations, indicating its learned abstractions are imperfect approximations of human linguistic competence.
7. Analysis Framework & Case Example
Framework: The paper employs a multi-pronged probing framework:
1. Generative Probing: Test productive use (e.g., novel word completion).
2. Diagnostic Classifier Probing: Train auxiliary models on hidden states to predict linguistic features.
3. Unit Analysis: Manually inspect activation patterns of individual neurons.
Case Example - Probing for "-ity": To test knowledge of the suffix "-ity", the framework would:
1. Extract the hidden state $\mathbf{h}$ after processing the stem (e.g., "active").
2. Use a diagnostic classifier on $\mathbf{h}$ to predict whether the next morpheme is a noun-forming suffix.
3. Compare the model's probability $p(\text{'ity'} | \text{'active'})$ vs. $p(\text{'ity'} | \text{'run'})$.
4. Analyze the activation of the "boundary unit" at the end of the stem to see whether it signals a morpheme boundary suitable for derivation.
8. Analyst's Perspective: Core Insight & Critique
Core Insight: This paper delivers a masterclass in model interrogation. It moves beyond performance metrics to ask *what* is learned and *how*. The finding of a "boundary neuron" is particularly elegant—it's a rare instance of clear, mechanistic interpretability in a deep network. The work convincingly argues that character LSTMs are not mere pattern matchers but can induce abstract linguistic categories from distributional signals, supporting claims made in earlier applied work like the fully character-level machine translation systems of Lee et al. (2016).
Logical Flow: The argument is tightly constructed: from observing productive generalization (the "what") to discovering the boundary unit (a potential "how"), then validating it explains morpheme learning, and finally testing a complex, integrated capability (selectional restrictions). This stepwise validation is robust.
Strengths & Flaws: Strengths: Methodological rigor in probing; compelling, interpretable evidence (the boundary unit); tackling a fundamental question in NLP interpretability. Flaws: The scope is limited to English, a language with relatively simple morphology and near-perfect alignment between spaces and word boundaries. The conclusion's caveat—"when morphemes overlap extensively with the words of a language"—is crucial. This likely breaks down for agglutinative languages (e.g., Turkish, Finnish) or scriptio continua languages. The model's "abstraction" may be heavily scaffolded by orthographic conventions, a point less emphasized. As noted in resources like the ACL Anthology on morphological modeling, the challenge varies dramatically cross-linguistically.
Actionable Insights: For practitioners: 1) Character-level models *can* capture linguistic structure, validating their use in low-resource or morphologically rich settings—but verify for your language. 2) The probing framework is a blueprint for auditing model capabilities. For researchers: The paper sets a benchmark for interpretability work. Future directions must stress-test these findings across typologically diverse languages and in modern Transformer-based character models (e.g., ByT5). The field must ask if the impressive results here are a product of English's peculiarities or a general capacity of sequence models.
In essence, Kementchedjhieva and Lopez provide strong evidence for emergent linguistic abstraction in character LSTMs, but they also implicitly map the boundaries of that abstraction. It's a foundational piece that pushes the community from intuition to evidence.
9. Future Applications & Research Directions
- Low-Resource & Morphologically Rich Languages: Character/subword models that learn morphology intrinsically could reduce dependency on costly morphological analyzers for languages like Arabic or Turkish.
- Improved Model Interpretability: Techniques for identifying "functional neurons" like the boundary unit can be generalized to understand how models represent other linguistic features (tense, negation, semantic roles).
- Bridging Symbolic and Sub-Symbolic AI: Understanding how neural models learn discrete, rule-like patterns (e.g., selectional restrictions) can inform hybrid AI architectures.
- Robustness Testing: Applying this probing methodology to state-of-the-art large language models (LLMs) to see if they develop similar or more sophisticated linguistic representations.
- Cross-Linguistic Generalization: A major open direction is to test if these findings hold in languages with different morphological systems and orthographies, moving beyond the Indo-European bias.
10. References
- Kementchedjhieva, Y., & Lopez, A. (2018). Indications that character language models learn English morpho-syntactic units and regularities. arXiv preprint arXiv:1809.00066.
- Chung, J., Cho, K., & Bengio, Y. (2016). A character-level decoder without explicit segmentation for neural machine translation. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
- Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2016). Character-aware neural language models. Proceedings of the AAAI Conference on Artificial Intelligence.
- Karpathy, A. (2015). The unreasonable effectiveness of recurrent neural networks. Andrej Karpathy blog.
- Lee, J., Cho, K., & Hofmann, T. (2016). Fully character-level neural machine translation without explicit segmentation. arXiv preprint arXiv:1610.03017.
- Sutskever, I., Martens, J., & Hinton, G. E. (2011). Generating text with recurrent neural networks. Proceedings of the 28th International Conference on Machine Learning.
- Association for Computational Linguistics (ACL) Anthology. A digital archive of research papers in computational linguistics and NLP. Retrieved from https://aclanthology.org/