1. Introduction
Scaling laws for Large Language Models (LLMs) have traditionally focused on model parameters and training data size, largely overlooking vocabulary size as a critical scaling dimension. This paper investigates the impact of vocabulary size on LLM performance and proposes methods to determine the compute-optimal vocabulary size for given training budgets.
The research demonstrates that current LLMs such as Llama2-70B use suboptimal vocabulary sizes (a 32K vocabulary versus a predicted optimum of roughly 216K), highlighting significant efficiency gaps in current practice.
Key Figures
- Model range: 33M to 3B parameters trained
- Training data: up to 500B characters processed
- Vocabulary gap: Llama2-70B's vocabulary is underestimated by roughly 7x
2. Methodology
2.1 Normalized Loss Formulation
To ensure fair comparison across models with varying vocabulary sizes, the authors introduce a normalized loss function that accounts for differences in tokenization efficiency. The normalization prevents the choice of vocabulary from artificially skewing loss comparisons, since per-token losses computed under different tokenizers are not directly comparable.
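As a concrete illustration of what such a normalization must achieve, one standard option is to measure loss per character rather than per token, since the character count of a text does not depend on the tokenizer. The sketch below assumes this per-character form (the paper's exact normalization may differ), and all numbers are illustrative.

```python
import math

def bits_per_character(total_token_nll_nats: float, num_characters: int) -> float:
    """Convert a summed per-token negative log-likelihood (in nats) into
    bits per character. Character counts do not depend on the tokenizer,
    so this quantity is comparable across vocabulary sizes."""
    return total_token_nll_nats / (num_characters * math.log(2))

# Illustrative numbers (not from the paper): the same 1,000-character text
# tokenized by two models with different vocabularies.
small_vocab_bpc = bits_per_character(total_token_nll_nats=300 * 2.1, num_characters=1000)
large_vocab_bpc = bits_per_character(total_token_nll_nats=220 * 2.7, num_characters=1000)
print(f"small vocab: {small_vocab_bpc:.3f} bits/char")
print(f"large vocab: {large_vocab_bpc:.3f} bits/char")
```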
2.2 Three Prediction Approaches
The paper proposes three complementary methods for predicting optimal vocabulary size:
2.2.1 IsoFLOPs Analysis
Training models with identical computational budgets but different vocabulary sizes to identify the minimum loss point for each budget level.
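A minimal sketch of the selection step under this approach, assuming a sweep has already produced (FLOPs budget, vocabulary size, normalized loss) records; all values below are illustrative placeholders, not the paper's measurements.

```python
from collections import defaultdict

# Each record: (flops_budget, vocab_size, normalized_loss).
# All values are illustrative placeholders, not measurements from the paper.
runs = [
    (1e19, 16_384, 0.980), (1e19, 32_768, 0.962), (1e19, 65_536, 0.971),
    (1e20, 16_384, 0.910), (1e20, 32_768, 0.884), (1e20, 65_536, 0.879),
    (1e20, 131_072, 0.893),
]

by_budget = defaultdict(list)
for flops, vocab, loss in runs:
    by_budget[flops].append((vocab, loss))

# Within each IsoFLOPs group, the vocabulary with the lowest normalized
# loss is the empirical optimum for that compute budget.
for flops, points in sorted(by_budget.items()):
    best_vocab, best_loss = min(points, key=lambda p: p[1])
    print(f"{flops:.0e} FLOPs -> optimal vocab ~{best_vocab:,} (loss {best_loss})")
```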
2.2.2 Derivative Estimation
Estimating the derivative of the loss with respect to vocabulary size and locating the point where it vanishes, which marks the optimum.
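A sketch of one way to implement this, assuming loss measurements on a grid of vocabulary sizes at a fixed budget: estimate the derivative by finite differences and interpolate its zero crossing. The grid and losses below are made up for illustration.

```python
import numpy as np

# Illustrative (vocab_size, normalized_loss) measurements at one fixed
# FLOPs budget; not the paper's data.
vocab = np.array([8_192, 16_384, 32_768, 65_536, 131_072], dtype=float)
loss = np.array([0.950, 0.900, 0.880, 0.885, 0.900])

# Estimate dL/d(log V) by finite differences; the optimum is where the
# derivative crosses zero (loss stops falling and starts rising).
log_v = np.log(vocab)
d_loss = np.gradient(loss, log_v)

# Linearly interpolate the zero crossing between adjacent grid points.
crossings = np.where(np.diff(np.sign(d_loss)) != 0)[0]
if crossings.size:
    i = crossings[0]
    t = -d_loss[i] / (d_loss[i + 1] - d_loss[i])
    v_opt = np.exp(log_v[i] + t * (log_v[i + 1] - log_v[i]))
    print(f"estimated optimal vocabulary: ~{v_opt:,.0f} tokens")
```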
2.2.3 Parametric Fit
Fitting power-law relationships between model parameters, vocabulary size, and loss to derive predictive formulas.
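A sketch of such a fit using scipy.optimize.curve_fit on synthetic data generated from hand-picked constants (a real fit would use the measured sweep); the functional form follows the equation given later in Section 4.1, with the data exponent written as delta.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_form(X, E, A, B, C, alpha, beta, delta):
    """L = E + A/N_nv^alpha + B/N_v^beta + C/D^delta (see Section 4.1)."""
    n_nv, n_v, d = X
    return E + A / n_nv**alpha + B / n_v**beta + C / d**delta

# Synthetic sweep generated from hand-picked constants so the example is
# self-contained; a real fit would use the measured normalized losses.
rng = np.random.default_rng(0)
n_nv = rng.uniform(3e7, 3e9, size=64)
n_v = rng.uniform(8e6, 3e8, size=64)
d = rng.uniform(5e9, 5e11, size=64)
true = dict(E=0.6, A=8.0, B=2.0, C=30.0, alpha=0.30, beta=0.25, delta=0.35)
losses = loss_form((n_nv, n_v, d), **true) + rng.normal(0.0, 1e-3, size=64)

# Initialized at the generating constants so the demo converges quickly.
fit, _ = curve_fit(loss_form, (n_nv, n_v, d), losses,
                   p0=list(true.values()), maxfev=20_000)
print(dict(zip(["E", "A", "B", "C", "alpha", "beta", "delta"], np.round(fit, 3))))
```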
3. Experimental Results
3.1 Model Training Setup
Models ranging from 33M to 3B parameters were trained on up to 500B characters with various vocabulary configurations. The training spanned different FLOPs budgets to establish comprehensive scaling relationships.
3.2 Optimal Vocabulary Findings
The research reveals a power-law relationship: $N_v^{opt} \propto N_{nv}^\gamma$ where $\gamma < 1$, indicating that optimal vocabulary parameters should scale more slowly than non-vocabulary parameters. This contradicts the common practice of reusing a fixed vocabulary size across model scales.
Figure 1: Vocabulary Scaling Relationship
The visualization shows empirical results aligning with theoretical predictions, with larger circles indicating higher loss values. The plot demonstrates clear optimal vocabulary sizes for different model scales, forming a distinct power-law curve.
3.3 Downstream Performance Validation
Empirical validation with 3B-parameter models shows consistent improvements when using the predicted optimal vocabulary sizes. On ARC-Challenge, increasing the vocabulary from 32K to 43K improved accuracy from 29.1 to 32.0 under an identical 2.3e21 FLOPs budget.
Key Insights
- Vocabulary size significantly impacts LLM scaling efficiency
- Optimal vocabulary scales with compute budget and model size
- Current LLMs generally use suboptimal vocabulary sizes
- Joint consideration of tokenization and model scaling is essential
4. Technical Analysis & Framework
4.1 Mathematical Formulation
The core mathematical relationship discovered is expressed as:
$L(N_{nv}, N_v, D) = E + \frac{A}{N_{nv}^\alpha} + \frac{B}{N_v^\beta} + \frac{C}{D^\delta}$
where $L$ is the normalized loss, $N_{nv}$ is the number of non-vocabulary parameters, $N_v$ is the number of vocabulary parameters, $D$ is the training data size, and $E, A, B, C, \alpha, \beta, \delta$ are fitted constants (the data exponent is written $\delta$ here to avoid clashing with the $\gamma$ of Section 3.2).
Because the $B/N_v^{\beta}$ term alone always decreases as $N_v$ grows, the optimum is defined under a fixed compute budget, where enlarging the vocabulary trades off against the training data (and non-vocabulary parameters) the budget can afford; along that constraint the optimal vocabulary parameters satisfy $\frac{\partial L}{\partial N_v} = 0$.
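To see how this condition produces a finite optimum, here is a sketch of the constrained first-order condition, assuming the common approximation $\text{FLOPs} \approx 6\,(N_{nv} + N_v)\,D$ (an assumption of this sketch, not stated above) with the budget $F$ and $N_{nv}$ held fixed, so that $D = F / \big(6\,(N_{nv} + N_v)\big)$:

$$
\frac{\partial L}{\partial N_v}
= -\frac{\beta B}{N_v^{\beta+1}} + \frac{\delta C}{D^{\delta}\,(N_{nv} + N_v)} = 0
\quad\Longrightarrow\quad
\frac{\beta B}{N_v^{\beta+1}} = \frac{\delta C}{D^{\delta}\,(N_{nv} + N_v)}.
$$

The left-hand side is the marginal benefit of additional vocabulary parameters and the right-hand side the marginal cost of the training data they displace; balancing the two (with $N_{nv}$ and $D$ themselves chosen compute-optimally) is what yields the sublinear relationship $N_v^{opt} \propto N_{nv}^{\gamma}$, $\gamma < 1$, of Section 3.2.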
4.2 Analysis Framework Example
Case Study: Determining Optimal Vocabulary for a 10B Parameter Model
Given: Training budget = 1e23 FLOPs, Target domain = general language understanding
Framework Application:
- Estimate non-vocabulary parameters: $N_{nv} = 9.5\text{B}$ (95% of total)
- Apply power-law: $N_v^{opt} \propto N_{nv}^{0.7}$ (from empirical fit)
- Convert the resulting vocabulary parameters into a vocabulary size (divide by the embedding dimension): $N_v^{opt} \approx 150\text{K}$ tokens
- Validate with IsoFLOPs analysis for given budget
- Adjust for domain-specific token distribution
This framework provides a systematic approach to vocabulary sizing that current model developers often overlook.
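A small sketch of the arithmetic behind the case study. The proportionality constant k and the embedding dimension below are illustrative placeholders (the summary above does not give them), chosen so the output lands near the ~150K figure quoted; a real application would plug in the constants fitted from the IsoFLOPs sweep.

```python
def optimal_vocab_size(n_nv: float, k: float, gamma: float, embed_dim: int) -> int:
    """Apply N_v_opt = k * N_nv**gamma to get optimal vocabulary *parameters*,
    then divide by the embedding dimension to express it as a vocabulary size."""
    n_v_params = k * n_nv**gamma
    return round(n_v_params / embed_dim)

# Case-study inputs: a 10B-parameter model with ~95% non-vocabulary parameters.
n_nv = 9.5e9
# k and embed_dim are hypothetical placeholders (not fitted values from the
# paper), chosen so the result lands near the ~150K figure quoted above.
vocab = optimal_vocab_size(n_nv, k=64.0, gamma=0.7, embed_dim=4096)
print(f"suggested vocabulary size: ~{vocab:,} tokens")
```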
5. Industry Analyst Perspective
5.1 Core Insight
The industry has been fundamentally misguided in treating vocabulary size as a static hyperparameter. This paper exposes a critical blind spot: we've been optimizing LLMs with one hand tied behind our backs. The finding that Llama2-70B's vocabulary should be 7x larger isn't just an academic curiosity—it represents billions of dollars in wasted compute and suboptimal model performance across the entire AI ecosystem. This oversight is reminiscent of early neural network research that underestimated the importance of activation functions, as documented in the seminal work by Glorot and Bengio (2010) on understanding the difficulty of training deep feedforward neural networks.
5.2 Logical Flow
The paper's argument progresses with surgical precision: First, they establish that vocabulary matters (contrary to prevailing scaling law assumptions). Second, they demonstrate it matters systematically through power laws. Third, they provide practical tools for optimization. The logical chain is airtight—from problem identification through methodological innovation to empirical validation. This is how rigorous research should be conducted, unlike the trend of publishing incremental improvements without fundamental insights.
5.3 Strengths & Flaws
Strengths: The triple-methodology approach (IsoFLOPs, derivatives, parametric fits) provides robust validation. The scale of experimentation (33M to 3B parameters) is impressive and convincing. The practical implications are immediately actionable for any organization training LLMs.
Flaws: The study focuses primarily on English text—multilingual implications remain unexplored. The computational cost of their methodology may be prohibitive for smaller research groups. They don't address how vocabulary optimization interacts with other architectural choices like attention mechanisms, an area where the Transformer architecture paper (Vaswani et al., 2017) established foundational principles that still dominate the field.
5.4 Actionable Insights
Every AI lab training LLMs should immediately: 1) Re-evaluate their vocabulary sizing strategy, 2) Implement the IsoFLOPs analysis for current projects, 3) Consider vocabulary size as a first-class scaling dimension alongside parameters and data. For hardware companies like NVIDIA and AMD, this research suggests new optimization opportunities in memory architecture for larger embedding tables. The 7x vocabulary gap for Llama2-70B implies that current hardware is fundamentally mismatched to optimal model configurations.
6. Future Applications & Directions
Immediate Applications:
- Redesign of vocabulary strategies for next-generation LLMs (GPT-5, Gemini 2.0, etc.)
- Hardware optimization for larger embedding tables
- Improved efficiency in model serving and inference
Research Directions:
- Multilingual vocabulary optimization across diverse languages
- Dynamic vocabulary sizing during training
- Integration with mixture-of-experts architectures
- Vocabulary optimization for domain-specific models
- Cross-modal vocabulary considerations for multimodal models
The principles established in this work could extend beyond language models to other sequence models in bioinformatics, code generation, and time series analysis, similar to how convolutional neural network principles from computer vision (as in the AlexNet paper by Krizhevsky et al., 2012) transferred to other domains.
7. References
- Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models.
- Brown, T., et al. (2020). Language Models are Few-Shot Learners.
- Touvron, H., et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models.
- Vaswani, A., et al. (2017). Attention Is All You Need.
- Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks.
- Krizhevsky, A., et al. (2012). ImageNet Classification with Deep Convolutional Neural Networks.
- Gemma Team, et al. (2024). Gemma: Open Models Based on Gemini Research and Technology.
- Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models.