1. Introduction
Scaling laws for Large Language Models (LLMs) have traditionally focused on model parameters and training data size, largely overlooking vocabulary size as a critical scaling dimension. This paper investigates the impact of vocabulary size on LLM performance and proposes methods to determine the compute-optimal vocabulary size for given training budgets.
The research demonstrates that current LLMs such as Llama2-70B use suboptimal vocabulary sizes (a 32K vocabulary versus a predicted optimum of roughly 216K), highlighting significant efficiency gaps in current practice.
Key Figures
- Model range: 33M to 3B parameters trained
- Training data: up to 500B characters processed
- Vocabulary gap: Llama2-70B's vocabulary is underestimated by roughly 7x
2. Methodology
2.1 Normalized Loss Formulation
To ensure fair comparison across models with varying vocabulary sizes, the authors introduce a normalized loss function that accounts for differences in tokenization efficiency. The normalization prevents the choice of vocabulary from artificially skewing loss comparisons, since per-token losses computed under different tokenizers are not directly comparable.
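As a concrete illustration of what such a normalization must achieve, one standard option is to measure loss per character rather than per token, since the character count of a text does not depend on the tokenizer. The sketch below assumes this per-character form (the paper's exact normalization may differ), and all numbers are illustrative.

```python
import math

def bits_per_character(total_token_nll_nats: float, num_characters: int) -> float:
    """Convert a summed per-token negative log-likelihood (in nats) into
    bits per character. Character counts do not depend on the tokenizer,
    so this quantity is comparable across vocabulary sizes."""
    return total_token_nll_nats / (num_characters * math.log(2))

# Illustrative numbers (not from the paper): the same 1,000-character text
# tokenized by two models with different vocabularies.
small_vocab_bpc = bits_per_character(total_token_nll_nats=300 * 2.1, num_characters=1000)
large_vocab_bpc = bits_per_character(total_token_nll_nats=220 * 2.7, num_characters=1000)
print(f"small vocab: {small_vocab_bpc:.3f} bits/char")
print(f"large vocab: {large_vocab_bpc:.3f} bits/char")
```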
2.2 Three Prediction Approaches
The paper proposes three complementary methods for predicting optimal vocabulary size:
2.2.1 IsoFLOPs Analysis
Training models with identical computational budgets but different vocabulary sizes to identify the minimum loss point for each budget level.
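A minimal sketch of the selection step under this approach, assuming a sweep has already produced (FLOPs budget, vocabulary size, normalized loss) records; all values below are illustrative placeholders, not the paper's measurements.

```python
from collections import defaultdict

# Each record: (flops_budget, vocab_size, normalized_loss).
# All values are illustrative placeholders, not measurements from the paper.
runs = [
    (1e19, 16_384, 0.980), (1e19, 32_768, 0.962), (1e19, 65_536, 0.971),
    (1e20, 16_384, 0.910), (1e20, 32_768, 0.884), (1e20, 65_536, 0.879),
    (1e20, 131_072, 0.893),
]

by_budget = defaultdict(list)
for flops, vocab, loss in runs:
    by_budget[flops].append((vocab, loss))

# Within each IsoFLOPs group, the vocabulary with the lowest normalized
# loss is the empirical optimum for that compute budget.
for flops, points in sorted(by_budget.items()):
    best_vocab, best_loss = min(points, key=lambda p: p[1])
    print(f"{flops:.0e} FLOPs -> optimal vocab ~{best_vocab:,} (loss {best_loss})")
```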
2.2.2 Derivative Estimation
Estimating the derivative of the loss with respect to vocabulary size and locating the point where it vanishes, which marks the optimum.
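A sketch of one way to implement this, assuming loss measurements on a grid of vocabulary sizes at a fixed budget: estimate the derivative by finite differences and interpolate its zero crossing. The grid and losses below are made up for illustration.

```python
import numpy as np

# Illustrative (vocab_size, normalized_loss) measurements at one fixed
# FLOPs budget; not the paper's data.
vocab = np.array([8_192, 16_384, 32_768, 65_536, 131_072], dtype=float)
loss = np.array([0.950, 0.900, 0.880, 0.885, 0.900])

# Estimate dL/d(log V) by finite differences; the optimum is where the
# derivative crosses zero (loss stops falling and starts rising).
log_v = np.log(vocab)
d_loss = np.gradient(loss, log_v)

# Linearly interpolate the zero crossing between adjacent grid points.
crossings = np.where(np.diff(np.sign(d_loss)) != 0)[0]
if crossings.size:
    i = crossings[0]
    t = -d_loss[i] / (d_loss[i + 1] - d_loss[i])
    v_opt = np.exp(log_v[i] + t * (log_v[i + 1] - log_v[i]))
    print(f"estimated optimal vocabulary: ~{v_opt:,.0f} tokens")
```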
2.2.3 Parametric Fit
Fitting power-law relationships between model parameters, vocabulary size, and loss to derive predictive formulas.
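A sketch of such a fit using scipy.optimize.curve_fit on synthetic data generated from hand-picked constants (a real fit would use the measured sweep); the functional form follows the equation given later in Section 4.1, with the data exponent written as delta.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_form(X, E, A, B, C, alpha, beta, delta):
    """L = E + A/N_nv^alpha + B/N_v^beta + C/D^delta (see Section 4.1)."""
    n_nv, n_v, d = X
    return E + A / n_nv**alpha + B / n_v**beta + C / d**delta

# Synthetic sweep generated from hand-picked constants so the example is
# self-contained; a real fit would use the measured normalized losses.
rng = np.random.default_rng(0)
n_nv = rng.uniform(3e7, 3e9, size=64)
n_v = rng.uniform(8e6, 3e8, size=64)
d = rng.uniform(5e9, 5e11, size=64)
true = dict(E=0.6, A=8.0, B=2.0, C=30.0, alpha=0.30, beta=0.25, delta=0.35)
losses = loss_form((n_nv, n_v, d), **true) + rng.normal(0.0, 1e-3, size=64)

# Initialized at the generating constants so the demo converges quickly.
fit, _ = curve_fit(loss_form, (n_nv, n_v, d), losses,
                   p0=list(true.values()), maxfev=20_000)
print(dict(zip(["E", "A", "B", "C", "alpha", "beta", "delta"], np.round(fit, 3))))
```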
3. Experimental Results
3.1 Model Training Setup
Models ranging from 33M to 3B parameters were trained on up to 500B characters with various vocabulary configurations. The training spanned different FLOPs budgets to establish comprehensive scaling relationships.
3.2 Optimal Vocabulary Findings
The research reveals a power-law relationship: $N_v^{opt} \propto N_{nv}^\gamma$ where $\gamma < 1$, indicating that optimal vocabulary parameters should scale more slowly than non-vocabulary parameters. This contradicts the common practice of reusing a fixed vocabulary size across model scales.
Figure 1: Vocabulary Scaling Relationship
The visualization shows empirical results aligning with theoretical predictions, with larger circles indicating higher loss values. The plot demonstrates clear optimal vocabulary sizes for different model scales, forming a distinct power-law curve.
3.3 Downstream Performance Validation
Empirical validation with 3B-parameter models shows consistent improvements when using the predicted optimal vocabulary sizes. On ARC-Challenge, increasing the vocabulary from 32K to 43K improved accuracy from 29.1 to 32.0 under an identical 2.3e21 FLOPs budget.
Key Insights
- Vocabulary size significantly impacts LLM scaling efficiency
- Optimal vocabulary scales with compute budget and model size
- Current LLMs generally use suboptimal vocabulary sizes
- Joint consideration of tokenization and model scaling is essential
4. Technical Analysis & Framework
4.1 Mathematical Formulation
The core mathematical relationship discovered is expressed as:
$L(N_{nv}, N_v, D) = E + \frac{A}{N_{nv}^\alpha} + \frac{B}{N_v^\beta} + \frac{C}{D^\delta}$
where $L$ is the normalized loss, $N_{nv}$ is the number of non-vocabulary parameters, $N_v$ is the number of vocabulary parameters, $D$ is the training data size, and $E, A, B, C, \alpha, \beta, \delta$ are fitted constants (the data exponent is written $\delta$ here to avoid clashing with the $\gamma$ of Section 3.2).
Because the $B/N_v^{\beta}$ term alone always decreases as $N_v$ grows, the optimum is defined under a fixed compute budget, where enlarging the vocabulary trades off against the training data (and non-vocabulary parameters) the budget can afford; along that constraint the optimal vocabulary parameters satisfy $\frac{\partial L}{\partial N_v} = 0$.
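To see how this condition produces a finite optimum, here is a sketch of the constrained first-order condition, assuming the common approximation $\text{FLOPs} \approx 6\,(N_{nv} + N_v)\,D$ (an assumption of this sketch, not stated above) with the budget $F$ and $N_{nv}$ held fixed, so that $D = F / \big(6\,(N_{nv} + N_v)\big)$:

$$
\frac{\partial L}{\partial N_v}
= -\frac{\beta B}{N_v^{\beta+1}} + \frac{\delta C}{D^{\delta}\,(N_{nv} + N_v)} = 0
\quad\Longrightarrow\quad
\frac{\beta B}{N_v^{\beta+1}} = \frac{\delta C}{D^{\delta}\,(N_{nv} + N_v)}.
$$

The left-hand side is the marginal benefit of additional vocabulary parameters and the right-hand side the marginal cost of the training data they displace; balancing the two (with $N_{nv}$ and $D$ themselves chosen compute-optimally) is what yields the sublinear relationship $N_v^{opt} \propto N_{nv}^{\gamma}$, $\gamma < 1$, of Section 3.2.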
4.2 Analysis Framework Example
Case Study: Determining Optimal Vocabulary for a 10B Parameter Model
Given: Training budget = 1e23 FLOPs, Target domain = general language understanding
Framework Application:
- Estimate non-vocabulary parameters: $N_{nv} = 9.5\text{B}$ (95% of total)
- Apply power-law: $N_v^{opt} \propto N_{nv}^{0.7}$ (from empirical fit)
- Convert the resulting vocabulary parameters into a vocabulary size (divide by the embedding dimension): $N_v^{opt} \approx 150\text{K}$ tokens
- Validate with IsoFLOPs analysis for given budget
- Adjust for domain-specific token distribution
This framework provides a systematic approach to vocabulary sizing that current model developers often overlook.
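A small sketch of the arithmetic behind the case study. The proportionality constant k and the embedding dimension below are illustrative placeholders (the summary above does not give them), chosen so the output lands near the ~150K figure quoted; a real application would plug in the constants fitted from the IsoFLOPs sweep.

```python
def optimal_vocab_size(n_nv: float, k: float, gamma: float, embed_dim: int) -> int:
    """Apply N_v_opt = k * N_nv**gamma to get optimal vocabulary *parameters*,
    then divide by the embedding dimension to express it as a vocabulary size."""
    n_v_params = k * n_nv**gamma
    return round(n_v_params / embed_dim)

# Case-study inputs: a 10B-parameter model with ~95% non-vocabulary parameters.
n_nv = 9.5e9
# k and embed_dim are hypothetical placeholders (not fitted values from the
# paper), chosen so the result lands near the ~150K figure quoted above.
vocab = optimal_vocab_size(n_nv, k=64.0, gamma=0.7, embed_dim=4096)
print(f"suggested vocabulary size: ~{vocab:,} tokens")
```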
5. Industry Analyst Perspective
5.1 Core Insight
The industry has been fundamentally misguided in treating vocabulary size as a static hyperparameter. This paper exposes a critical blind spot: we've been optimizing LLMs with one hand tied behind our backs. The finding that Llama2-70B's vocabulary should be 7x larger isn't just an academic curiosity—it represents billions of dollars in wasted compute and suboptimal model performance across the entire AI ecosystem. This oversight is reminiscent of early neural network research that underestimated the importance of activation functions, as documented in the seminal work by Glorot and Bengio (2010) on understanding the difficulty of training deep feedforward neural networks.
5.2 Logical Flow
The paper's argument progresses with surgical precision: First, they establish that vocabulary matters (contrary to prevailing scaling law assumptions). Second, they demonstrate it matters systematically through power laws. Third, they provide practical tools for optimization. The logical chain is airtight—from problem identification through methodological innovation to empirical validation. This is how rigorous research should be conducted, unlike the trend of publishing incremental improvements without fundamental insights.
5.3 Strengths & Flaws
Strengths: The triple-methodology approach (IsoFLOPs, derivatives, parametric fits) provides robust validation. The scale of experimentation (33M to 3B parameters) is impressive and convincing. The practical implications are immediately actionable for any organization training LLMs.
Flaws: The study focuses primarily on English text—multilingual implications remain unexplored. The computational cost of their methodology may be prohibitive for smaller research groups. They don't address how vocabulary optimization interacts with other architectural choices like attention mechanisms, an area where the Transformer architecture paper (Vaswani et al., 2017) established foundational principles that still dominate the field.
5.4 Actionable Insights
Every AI lab training LLMs should immediately: 1) Re-evaluate their vocabulary sizing strategy, 2) Implement the IsoFLOPs analysis for current projects, 3) Consider vocabulary size as a first-class scaling dimension alongside parameters and data. For hardware companies like NVIDIA and AMD, this research suggests new optimization opportunities in memory architecture for larger embedding tables. The 7x vocabulary gap for Llama2-70B implies that current hardware is fundamentally mismatched to optimal model configurations.
6. Future Applications & Directions
Immediate Applications:
- Redesign of vocabulary strategies for next-generation LLMs (GPT-5, Gemini 2.0, etc.)
- Hardware optimization for larger embedding tables
- Improved efficiency in model serving and inference
Research Directions:
- Multilingual vocabulary optimization across diverse languages
- Dynamic vocabulary sizing during training
- Integration with mixture-of-experts architectures
- Vocabulary optimization for domain-specific models
- Cross-modal vocabulary considerations for multimodal models
The principles established in this work could extend beyond language models to other sequence models in bioinformatics, code generation, and time series analysis, similar to how convolutional neural network principles from computer vision (as in the AlexNet paper by Krizhevsky et al., 2012) transferred to other domains.
7. References
- Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models.
- Brown, T., et al. (2020). Language Models are Few-Shot Learners.
- Touvron, H., et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models.
- Vaswani, A., et al. (2017). Attention Is All You Need.
- Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks.
- Krizhevsky, A., et al. (2012). ImageNet Classification with Deep Convolutional Neural Networks.
- Gemma Team, et al. (2024). Gemma: Open Models Based on Gemini Research and Technology.
- Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models.