Indications that Character Language Models Learn English Morpho-Syntactic Units and Regularities

Analysis of how character-level language models learn abstract morphological regularities, word boundaries, and syntactic properties without explicit supervision.


1.1 Introduction

Character-level language models (LMs) have demonstrated remarkable capabilities in open-vocabulary generation, enabling applications in speech recognition and machine translation. These models achieve success through parameter sharing across frequent, rare, and unseen words, leading to claims about their ability to learn morphosyntactic properties. However, these claims have been largely intuitive rather than empirically supported. This research investigates what character LMs actually learn about morphology and how they learn it, focusing on English language processing.

1.2 Language Modeling

The study employs a 'wordless' character RNN with LSTM units: the input is not segmented into words, and spaces are treated as regular characters. This design enables analysis below the word level, since the model can be fed partial words and asked to complete them.

1.2.1 Model Formulation

At each timestep $t$, character $c_t$ is projected into embedding space: $x_{c_t} = E^T v_{c_t}$, where $E \in \mathbb{R}^{|V| \times d}$ is the character embedding matrix, $|V|$ is character vocabulary size, $d$ is embedding dimension, and $v_{c_t}$ is a one-hot vector.

The hidden state is computed as: $h_t = \text{LSTM}(x_{c_t}; h_{t-1})$

The probability distribution over the next character is: $p(c_{t+1} = c \mid h_t) = \text{softmax}(W_o h_t + b_o)_c$ for all $c \in V$, where the subscript selects the softmax component corresponding to character $c$.
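
A minimal sketch of this formulation in PyTorch (the module and variable names, e.g. CharLM, are illustrative; the 512-dimensional sizes follow Section 6.2, and everything else is an assumption rather than the paper's exact configuration):

```python
import torch
import torch.nn as nn

class CharLM(nn.Module):
    """'Wordless' character LM: spaces are ordinary characters in the vocabulary."""

    def __init__(self, vocab_size: int, emb_dim: int = 512, hidden_dim: int = 512):
        super().__init__()
        # E in R^{|V| x d}: one embedding row per character, including the space
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        # W_o, b_o: map the hidden state to scores over the next character
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, char_ids: torch.Tensor, state=None):
        # char_ids: (batch, time) integer character indices
        x = self.embedding(char_ids)     # x_{c_t} = E^T v_{c_t}
        h, state = self.lstm(x, state)   # h_t = LSTM(x_{c_t}; h_{t-1})
        logits = self.output(h)          # W_o h_t + b_o
        return logits, state             # softmax over the last dim gives p(c_{t+1} | h_t)
```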

1.2.2 Training Details

The model was trained on the first 7 million character tokens of English text data, optimizing a cross-entropy loss with standard backpropagation through time.
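
The sketch below illustrates such a training setup, reusing the hypothetical CharLM module above. Sampling fixed-length windows is a common truncated form of backpropagation through time; the batch size, window length, learning rate, and step count are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def train(model, corpus_ids, seq_len=100, batch_size=32, lr=1e-3, steps=1000):
    # corpus_ids: 1-D tensor of character indices for the training text
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        # Sample random character windows; the target is each window shifted by one.
        starts = torch.randint(0, len(corpus_ids) - seq_len - 1, (batch_size,)).tolist()
        inputs = torch.stack([corpus_ids[s:s + seq_len] for s in starts])
        targets = torch.stack([corpus_ids[s + 1:s + seq_len + 1] for s in starts])
        logits, _ = model(inputs)
        # Cross-entropy at every timestep = negative log-likelihood of the next character.
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```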

2.1 Productive Morphological Processes

When generating text, the LM applies English morphological processes productively in novel contexts. This surprising finding suggests the model can identify relevant morphemes for these processes, demonstrating abstract morphological learning beyond surface patterns.
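
One way to elicit such generations is to sample from the model one character at a time. Below is a minimal sampling sketch, assuming the hypothetical CharLM module from Section 1.2; the temperature, seed text, and helper mappings (char_to_id, id_to_char) are illustrative.

```python
import torch

@torch.no_grad()
def sample(model, char_to_id, id_to_char, seed: str, length: int = 200, temperature: float = 1.0):
    # Prime the model with the seed, then draw one character at a time.
    ids = torch.tensor([[char_to_id[c] for c in seed]])
    out = list(seed)
    logits, state = model(ids, None)
    for _ in range(length):
        probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
        next_id = torch.multinomial(probs, 1)
        out.append(id_to_char[next_id.item()])
        logits, state = model(next_id.view(1, 1), state)
    return "".join(out)
```

Novel but well-formed derivations in such samples, for example an unseen stem combined with a familiar suffix, are the kind of evidence for productivity discussed above.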

2.2 Boundary Detection Unit

Analysis of the LM's hidden units reveals a specific unit that activates at morpheme and word boundaries. This boundary detection mechanism appears crucial for the model's ability to identify linguistic units and their properties.

3.1 Learning Morpheme Boundaries

The LM learns morpheme boundaries through extrapolation from word boundaries. This bottom-up learning approach enables the model to develop hierarchical representations of linguistic structure without explicit supervision.

3.2 Part-of-Speech Encoding

Beyond morphology, the LM encodes syntactic information about words, including their part-of-speech categories. This dual encoding of morphological and syntactic properties enables more sophisticated linguistic processing.
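
A claim of this kind is commonly tested with a diagnostic (probing) classifier: freeze the LM, collect the hidden state at the end of each word, and fit a simple linear classifier to predict the word's part of speech. The sketch below shows this general technique, assuming the hypothetical CharLM module from Section 1.2; it is not necessarily the paper's exact procedure, and pos_tags is a hypothetical list with one gold tag per word.

```python
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

@torch.no_grad()
def word_final_states(model, char_ids, space_id):
    # Hidden state at each space character, i.e. just after a word has been read.
    h, _ = model.lstm(model.embedding(char_ids.unsqueeze(0)))
    return h[0, char_ids == space_id].numpy()

def probe_pos(states, pos_tags):
    # pos_tags: one gold POS tag per word, aligned with the extracted states.
    # Above-chance accuracy on held-out words suggests POS is linearly decodable.
    X_tr, X_te, y_tr, y_te = train_test_split(states, pos_tags, test_size=0.2)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)
```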

4.1 Selectional Restrictions

The LM captures the syntactic selectional restrictions of English derivational morphemes, demonstrating awareness at the morphology-syntax interface. However, the model makes some incorrect generalizations, indicating limitations in its learning.

4.2 Experimental Results

The experiments demonstrate that the character LM can:

  1. Identify higher-order linguistic units (morphemes and words)
  2. Learn underlying linguistic properties and regularities of these units
  3. Apply morphological processes productively in novel contexts
  4. Encode both morphological and syntactic information

5. Core Insight & Analysis

Core Insight

Character-level language models aren't just memorizing character sequences—they're developing genuine linguistic abstractions. The most significant finding here is the emergence of a dedicated "boundary detection unit" that essentially performs unsupervised morphological segmentation. This isn't trivial pattern recognition; it's the model constructing a theory of word structure from raw character data.

Logical Flow

The research progression is methodical and convincing: 1) Observe productive morphological behavior, 2) Probe the network to find explanatory mechanisms, 3) Validate through boundary detection experiments, 4) Test higher-order syntactic-morphological integration. This mirrors the approach in landmark papers like the original Transformer paper (Vaswani et al., 2017), where architectural choices were validated through systematic ablation and analysis.

Strengths & Flaws

Strengths: The boundary unit discovery is genuinely novel and has implications for how we understand neural network linguistic representations. The experimental design is elegant in its simplicity—using completion tasks to test morphological productivity. The connection to selectional restrictions shows the model isn't just learning morphology in isolation.

Flaws: The English focus limits generalizability to morphologically richer languages. The 7M character training corpus is relatively small by modern standards—we need to see if these findings scale to billion-token corpora. The "incorrect generalizations" mentioned but not detailed represent a missed opportunity for deeper error analysis.

Actionable Insights

For practitioners: This research suggests character-level models deserve reconsideration for morphologically complex languages, especially low-resource scenarios. The boundary detection mechanism could be explicitly engineered rather than emergent—imagine initializing a dedicated boundary unit. For researchers: This work connects to broader questions about linguistic abstraction in neural networks, similar to investigations in vision models like CycleGAN (Zhu et al., 2017) that probe what representations emerge during unsupervised learning. The next step should be comparative studies across languages with different morphological systems, perhaps using resources like UniMorph (Kirov et al., 2018).

The most compelling implication is that character models might offer a path toward more human-like language acquisition—learning morphology from distributional patterns rather than explicit segmentation rules. This aligns with psycholinguistic theories of morphological processing and suggests neural networks can develop linguistically plausible representations without symbolic supervision.

6. Technical Details

6.1 Mathematical Formulation

The character embedding process can be formalized as:

$\mathbf{x}_t = \mathbf{E}^\top \mathbf{v}_{c_t}$

where $\mathbf{E} \in \mathbb{R}^{|V| \times d}$ is the embedding matrix, $\mathbf{v}_{c_t}$ is the one-hot vector for character $c_t$, and $d$ is the embedding dimension.

The LSTM update equations follow the standard formulation:

$\mathbf{f}_t = \sigma(\mathbf{W}_f [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)$

$\mathbf{i}_t = \sigma(\mathbf{W}_i [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i)$

$\tilde{\mathbf{C}}_t = \tanh(\mathbf{W}_C [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_C)$

$\mathbf{C}_t = \mathbf{f}_t \odot \mathbf{C}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{C}}_t$

$\mathbf{o}_t = \sigma(\mathbf{W}_o [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o)$

$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{C}_t)$
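
For concreteness, these equations translate directly into code. The single-timestep sketch below follows the concatenated $[\mathbf{h}_{t-1}, \mathbf{x}_t]$ convention used above; it is meant to make the notation explicit, not to serve as an efficient implementation.

```python
import torch

def lstm_cell(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    hx = torch.cat([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = torch.sigmoid(W_f @ hx + b_f)      # forget gate
    i_t = torch.sigmoid(W_i @ hx + b_i)      # input gate
    c_tilde = torch.tanh(W_C @ hx + b_C)     # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde       # new cell state
    o_t = torch.sigmoid(W_o @ hx + b_o)      # output gate
    h_t = o_t * torch.tanh(c_t)              # new hidden state
    return h_t, c_t
```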

6.2 Experimental Setup

The model uses 512-dimensional LSTM hidden states and character embeddings trained on 7M characters. Evaluation involves both quantitative metrics (perplexity, accuracy) and qualitative analysis of generated text and unit activations.
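
Character-level perplexity is the exponentiated average negative log-likelihood per character on held-out text. A minimal evaluation sketch, again assuming the hypothetical CharLM module from Section 1.2:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def char_perplexity(model, char_ids):
    # char_ids: 1-D tensor of held-out character indices
    inputs, targets = char_ids[:-1].unsqueeze(0), char_ids[1:].unsqueeze(0)
    logits, _ = model(inputs)
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return torch.exp(nll).item()   # perplexity = exp(mean NLL per character)
```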

7. Analysis Framework Example

7.1 Probing Methodology

The research employs several probing techniques to investigate what the model learns:

  1. Completion Tasks: Feed partial words (e.g., "unhapp") and analyze the probabilities assigned to possible completions ("-y" vs "-ily"); see the scoring sketch after this list
  2. Boundary Analysis: Monitor specific hidden unit activations around space characters and morpheme boundaries
  3. Selectional Restriction Tests: Present stems with derivational morphemes and evaluate grammaticality judgments
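
Items 1 and 3 both reduce to scoring candidate continuations under the LM: feed the context, then sum the log-probabilities of the continuation characters. A minimal scoring sketch, assuming the hypothetical CharLM module from Section 1.2 and a char_to_id mapping; the example strings are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def continuation_log_prob(model, char_to_id, context: str, continuation: str) -> float:
    ids = torch.tensor([[char_to_id[c] for c in context + continuation]])
    log_probs = F.log_softmax(model(ids)[0], dim=-1)
    total = 0.0
    # Sum log p(character | preceding characters) over the continuation only.
    for pos in range(len(context), len(context) + len(continuation)):
        total += log_probs[0, pos - 1, ids[0, pos]].item()
    return total

# Completion probe (item 1): does the model prefer "unhappy" or "unhappily" here?
# continuation_log_prob(model, char_to_id, "she was unhapp", "y")
# continuation_log_prob(model, char_to_id, "she was unhapp", "ily")
```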

7.2 Case Study: Boundary Unit Analysis

When processing the word "unhappiness," the boundary detection unit shows peak activation at the boundaries between its morphemes: after the prefix "un-", after the stem "happi", and at the space marking the end of the word.

This pattern suggests the unit learns to segment at both word and morpheme boundaries through exposure to similar patterns in training data.
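
One way to examine such a unit is to record a single hidden dimension's activation after every character and align the peaks with known boundaries. A minimal monitoring sketch, assuming the hypothetical CharLM module from Section 1.2; unit_index would have to be identified empirically and is not given here.

```python
import torch

@torch.no_grad()
def unit_activations(model, char_to_id, text: str, unit_index: int):
    ids = torch.tensor([[char_to_id[c] for c in text]])
    h, _ = model.lstm(model.embedding(ids))        # (1, len(text), hidden_dim)
    acts = h[0, :, unit_index]
    # Pair each character with the unit's activation after reading it,
    # so activation peaks can be aligned with morpheme and word boundaries.
    return list(zip(text, acts.tolist()))
```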

8. Future Applications & Directions

8.1 Immediate Applications

8.2 Research Directions

8.3 Long-Term Implications

This research suggests character-level models might provide a more cognitively plausible approach to language learning, potentially leading to:

  1. More data-efficient language models
  2. Better handling of novel words and morphological creativity
  3. Improved interpretability through linguistically meaningful representations
  4. Bridges between computational linguistics and psycholinguistics

9. References

  1. Kementchedjhieva, Y., & Lopez, A. (2018). Indications that character language models learn English morpho-syntactic units and regularities. arXiv preprint arXiv:1809.00066.
  2. Sutskever, I., Martens, J., & Hinton, G. E. (2011). Generating text with recurrent neural networks. Proceedings of the 28th International Conference on Machine Learning.
  3. Chung, J., Cho, K., & Bengio, Y. (2016). A character-level decoder without explicit segmentation for neural machine translation. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
  4. Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2016). Character-aware neural language models. Proceedings of the AAAI Conference on Artificial Intelligence.
  5. Vaswani, A., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems.
  6. Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. Proceedings of the IEEE International Conference on Computer Vision.
  7. Kirov, C., et al. (2018). UniMorph 2.0: Universal Morphology. Proceedings of the Eleventh International Conference on Language Resources and Evaluation.
  8. Karpathy, A. (2015). The unreasonable effectiveness of recurrent neural networks. Andrej Karpathy blog.