1.1 Introduction
Character-level language models (LMs) have demonstrated remarkable capabilities in open-vocabulary generation, enabling applications in speech recognition and machine translation. These models achieve success through parameter sharing across frequent, rare, and unseen words, leading to claims about their ability to learn morphosyntactic properties. However, these claims have been largely intuitive rather than empirically supported. This research investigates what character LMs actually learn about morphology and how they learn it, focusing on English language processing.
1.2 Language Modeling
The study employs a 'wordless' character RNN with LSTM units, in which the input is not segmented into words and spaces are treated as ordinary characters. Because the model imposes no word boundaries, it can be fed partial words and queried for completions, which enables analysis at the level of morphemes.
1.2.1 Model Formulation
At each timestep $t$, character $c_t$ is projected into embedding space: $x_{c_t} = E^T v_{c_t}$, where $E \in \mathbb{R}^{|V| \times d}$ is the character embedding matrix, $|V|$ is character vocabulary size, $d$ is embedding dimension, and $v_{c_t}$ is a one-hot vector.
The hidden state is computed as: $h_t = \text{LSTM}(x_{c_t}; h_{t-1})$
The probability distribution over next characters is: $p(c_{t+1} = c \mid h_t) = \text{softmax}(W_{\text{out}} h_t + b_{\text{out}})_c$ for all $c \in V$, where the subscript selects the softmax entry corresponding to character $c$.
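The formulation above maps directly onto a few lines of code. The following is a minimal PyTorch sketch of such a model; the class and parameter names and the dimensions are illustrative assumptions, not details from the paper.

```python
# Minimal character LM sketch: embedding -> LSTM -> softmax over next character.
import torch
import torch.nn as nn

class CharLM(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int, hidden_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # rows of E; E^T v_c == E[c]
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)   # W_out h_t + b_out

    def forward(self, char_ids, state=None):
        x = self.embed(char_ids)        # (batch, time, emb_dim)
        h, state = self.lstm(x, state)  # h_t at every timestep
        logits = self.proj(h)           # softmax(logits) gives p(c_{t+1} | h_t)
        return logits, state
```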
1.2.2 Training Details
The model was trained on the first 7 million character tokens of the English text data, using backpropagation through time to minimize the cross-entropy of next-character prediction.
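As a rough illustration of this regime, here is a hypothetical truncated-BPTT training loop over a stream of character ids, reusing the CharLM sketch above; the optimizer, learning rate, and chunk length are assumptions, not the paper's settings.

```python
# Hypothetical training loop: next-character cross-entropy with truncated BPTT.
# `corpus_ids` is assumed to be a 1-D LongTensor of character ids.
import torch
import torch.nn.functional as F

def train_epoch(model, corpus_ids, seq_len=100, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    state = None
    for start in range(0, corpus_ids.numel() - seq_len - 1, seq_len):
        chunk = corpus_ids[start : start + seq_len + 1]
        inputs = chunk[:-1].unsqueeze(0)   # characters c_1 .. c_T
        targets = chunk[1:].unsqueeze(0)   # shifted by one: c_2 .. c_{T+1}
        logits, state = model(inputs, state)
        state = tuple(s.detach() for s in state)  # truncate gradients at chunk edge
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
```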
2.1 Productive Morphological Processes
When generating text, the LM applies English morphological processes productively in novel contexts. This surprising finding suggests the model can identify relevant morphemes for these processes, demonstrating abstract morphological learning beyond surface patterns.
2.2 Boundary Detection Unit
Analysis of the LM's hidden units reveals a specific unit that activates at morpheme and word boundaries. This boundary detection mechanism appears crucial for the model's ability to identify linguistic units and their properties.
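One simple way to hunt for such a unit, sketched below using the CharLM interface from Section 1.2.1, is to correlate each hidden unit's activation with a 0/1 boundary indicator over held-out text; the authors' exact selection procedure may differ, so treat this as an illustrative probe.

```python
# Illustrative probe: rank hidden units by (approximate) Pearson correlation
# between their activations and a binary boundary annotation.
import torch

@torch.no_grad()
def find_boundary_unit(model, char_ids, boundary_labels):
    """char_ids: (T,) LongTensor; boundary_labels: (T,) 0/1 FloatTensor."""
    h, _ = model.lstm(model.embed(char_ids.unsqueeze(0)))   # (1, T, hidden_dim)
    acts = h.squeeze(0)                                     # (T, hidden_dim)
    a = (acts - acts.mean(0)) / (acts.std(0) + 1e-8)        # z-score each unit
    b = (boundary_labels - boundary_labels.mean()) / boundary_labels.std()
    corr = (a * b.unsqueeze(1)).mean(0)                     # correlation per unit
    return corr.abs().argmax().item(), corr
```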
3.1 Learning Morpheme Boundaries
The LM learns morpheme boundaries through extrapolation from word boundaries. This bottom-up learning approach enables the model to develop hierarchical representations of linguistic structure without explicit supervision.
3.2 Part-of-Speech Encoding
Beyond morphology, the LM encodes syntactic information about words, including their part-of-speech categories. This dual encoding of morphological and syntactic properties enables more sophisticated linguistic processing.
4.1 Selectional Restrictions
The LM captures the syntactic selectional restrictions of English derivational morphemes, demonstrating awareness at the morphology-syntax interface. However, the model makes some incorrect generalizations, indicating limitations in its learning.
4.2 Experimental Results
The experiments demonstrate that the character LM can:
- Identify higher-order linguistic units (morphemes and words)
- Learn underlying linguistic properties and regularities of these units
- Apply morphological processes productively in novel contexts
- Encode both morphological and syntactic information
5. Core Insight & Analysis
Core Insight
Character-level language models aren't just memorizing character sequences; they are developing genuine linguistic abstractions. The most significant finding here is the emergence of a dedicated "boundary detection unit" that essentially performs unsupervised morphological segmentation. This isn't trivial pattern recognition; it's the model constructing a theory of word structure from raw character data.
Logical Flow
The research progression is methodical and convincing: 1) Observe productive morphological behavior, 2) Probe the network to find explanatory mechanisms, 3) Validate through boundary detection experiments, 4) Test higher-order syntactic-morphological integration. This mirrors the validation style of landmark papers such as the original Transformer paper (Vaswani et al., 2017), where design choices were justified through systematic ablation rather than intuition alone.
Strengths & Flaws
Strengths: The boundary unit discovery is genuinely novel and has implications for how we understand neural network linguistic representations. The experimental design is elegant in its simplicity: completion tasks are used to test morphological productivity. The connection to selectional restrictions shows the model isn't just learning morphology in isolation.
Flaws: The English focus limits generalizability to morphologically richer languages. The 7M-character training corpus is relatively small by modern standards; we need to see whether these findings scale to billion-token corpora. The "incorrect generalizations" mentioned but not detailed represent a missed opportunity for deeper error analysis.
Actionable Insights
For practitioners: This research suggests character-level models deserve reconsideration for morphologically complex languages, especially in low-resource scenarios. The boundary detection mechanism could be explicitly engineered rather than left to emerge; imagine initializing a dedicated boundary unit. For researchers: This work connects to broader questions about which abstractions emerge during unsupervised learning, questions also asked of vision models such as CycleGAN (Zhu et al., 2017). The next step should be comparative studies across languages with different morphological systems, perhaps using resources like UniMorph (Kirov et al., 2018).
The most compelling implication is that character models might offer a path toward more human-like language acquisition—learning morphology from distributional patterns rather than explicit segmentation rules. This aligns with psycholinguistic theories of morphological processing and suggests neural networks can develop linguistically plausible representations without symbolic supervision.
6. Technical Details
6.1 Mathematical Formulation
The character embedding process can be formalized as:
$\mathbf{x}_t = \mathbf{E}^\top \mathbf{v}_{c_t}$
where $\mathbf{E} \in \mathbb{R}^{|V| \times d}$ is the embedding matrix, $\mathbf{v}_{c_t}$ is the one-hot vector for character $c_t$, and $d$ is the embedding dimension.
The LSTM update equations follow the standard formulation:
$\mathbf{f}_t = \sigma(\mathbf{W}_f [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)$
$\mathbf{i}_t = \sigma(\mathbf{W}_i [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i)$
$\tilde{\mathbf{C}}_t = \tanh(\mathbf{W}_C [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_C)$
$\mathbf{C}_t = \mathbf{f}_t \odot \mathbf{C}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{C}}_t$
$\mathbf{o}_t = \sigma(\mathbf{W}_o [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o)$
$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{C}_t)$
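These equations transcribe almost line-for-line into code. The sketch below applies one timestep with per-gate weight matrices over the concatenated $[\mathbf{h}_{t-1}, \mathbf{x}_t]$; it restates the math above for clarity and is not the authors' implementation.

```python
# One LSTM timestep, written to mirror the update equations above.
import torch

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    z = torch.cat([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = torch.sigmoid(W_f @ z + b_f)    # forget gate
    i_t = torch.sigmoid(W_i @ z + b_i)    # input gate
    C_tilde = torch.tanh(W_C @ z + b_C)   # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde    # cell state update
    o_t = torch.sigmoid(W_o @ z + b_o)    # output gate
    h_t = o_t * torch.tanh(C_t)           # new hidden state
    return h_t, C_t
```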
6.2 Experimental Setup
The model uses 512-dimensional LSTM hidden states and character embeddings trained on 7M characters. Evaluation involves both quantitative metrics (perplexity, accuracy) and qualitative analysis of generated text and unit activations.
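Perplexity follows directly from the mean next-character cross-entropy; the helper below is a small sketch (reusing the assumed CharLM interface from Section 1.2.1) that also reports bits per character.

```python
# Held-out evaluation sketch: perplexity and bits-per-character from mean NLL.
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def evaluate(model, char_ids):
    logits, _ = model(char_ids[:-1].unsqueeze(0))
    nll = F.cross_entropy(logits.squeeze(0), char_ids[1:])  # mean nats per char
    return math.exp(nll.item()), nll.item() / math.log(2)   # (perplexity, bpc)
```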
7. Analysis Framework Example
7.1 Probing Methodology
The research employs several probing techniques to investigate what the model learns:
- Completion Tasks: Feed partial words (e.g., "unhapp") and analyze the probabilities assigned to possible completions ("-y" vs. "-ily"); a concrete sketch of this probe follows the list
- Boundary Analysis: Monitor specific hidden unit activations around space characters and morpheme boundaries
- Selectional Restriction Tests: Present stems with derivational morphemes and evaluate grammaticality judgments
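The completion-task probe is easy to make concrete: score each candidate suffix by chaining the model's next-character probabilities after the prefix. The helper below is a sketch; `stoi` (a character-to-id map) and the CharLM interface are assumptions for illustration, not taken from the paper.

```python
# Score a suffix given a prefix by summing next-character log-probabilities.
import torch
import torch.nn.functional as F

@torch.no_grad()
def suffix_logprob(model, stoi, prefix: str, suffix: str) -> float:
    ids = torch.tensor([[stoi[c] for c in prefix + suffix]])
    logits, _ = model(ids[:, :-1])
    logp = F.log_softmax(logits, dim=-1).squeeze(0)  # (T-1, |V|)
    total = 0.0
    for t in range(len(prefix) - 1, ids.size(1) - 1):  # positions predicting suffix chars
        total += logp[t, ids[0, t + 1]].item()
    return total

# e.g. compare suffix_logprob(model, stoi, "unhapp", "y")
#      with    suffix_logprob(model, stoi, "unhapp", "ily")
```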
7.2 Case Study: Boundary Unit Analysis
When processing the word "unhappiness," the boundary detection unit shows peak activation at:
- Position 0 (word beginning)
- After "un-" (prefix boundary)
- After "happy" (stem boundary)
- After "-ness" (word ending)
This pattern suggests the unit learns to segment at both word and morpheme boundaries through exposure to similar patterns in training data.
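One could make this trace concrete by printing the candidate unit's activation character by character; the sketch below assumes the CharLM interface and a unit index found as in Section 2.2.

```python
# Print a single hidden unit's activation over each character of a word.
import torch

@torch.no_grad()
def trace_unit(model, stoi, word: str, unit: int):
    ids = torch.tensor([[stoi[c] for c in word]])
    h, _ = model.lstm(model.embed(ids))          # (1, T, hidden_dim)
    for ch, act in zip(word, h[0, :, unit].tolist()):
        print(f"{ch}: {act:+.3f}")               # peaks expected near boundaries

# trace_unit(model, stoi, "unhappiness", boundary_unit_index)
```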
8. Future Applications & Directions
8.1 Immediate Applications
- Low-Resource Languages: Character models could outperform word-based models for languages with rich morphology and limited training data
- Morphological Analyzers: The emergent boundary detection could bootstrap unsupervised morphological segmentation systems
- Educational Tools: Models that learn morphology naturally could help teach language structure
8.2 Research Directions
- Cross-Linguistic Studies: Test whether findings generalize to agglutinative (Turkish) or fusional (Russian) languages
- Scale Effects: Investigate how morphological learning changes with model size and training data quantity
- Architectural Innovations: Design models with explicit morphological components informed by these findings
- Multimodal Integration: Combine character-level linguistic learning with visual or auditory inputs
8.3 Long-Term Implications
This research suggests character-level models might provide a more cognitively plausible approach to language learning, potentially leading to:
- More data-efficient language models
- Better handling of novel words and morphological creativity
- Improved interpretability through linguistically meaningful representations
- Bridges between computational linguistics and psycholinguistics
9. References
- Kementchedjhieva, Y., & Lopez, A. (2018). Indications that character language models learn English morpho-syntactic units and regularities. arXiv preprint arXiv:1809.00066.
- Sutskever, I., Martens, J., & Hinton, G. E. (2011). Generating text with recurrent neural networks. Proceedings of the 28th International Conference on Machine Learning.
- Chung, J., Cho, K., & Bengio, Y. (2016). A character-level decoder without explicit segmentation for neural machine translation. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
- Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2016). Character-aware neural language models. Proceedings of the AAAI Conference on Artificial Intelligence.
- Vaswani, A., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems.
- Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. Proceedings of the IEEE International Conference on Computer Vision.
- Kirov, C., et al. (2018). UniMorph 2.0: Universal Morphology. Proceedings of the Eleventh International Conference on Language Resources and Evaluation.
- Karpathy, A. (2015). The unreasonable effectiveness of recurrent neural networks. Andrej Karpathy blog.