SLABERT: Modeling Second Language Acquisition with BERT

A novel framework using BERT to model cross-linguistic transfer effects in second language acquisition, with a focus on negative transfer and language family distance.

1. Introduction

Second language acquisition (SLA) research has extensively studied cross-linguistic transfer: the influence of the linguistic structure of a speaker's native language (L1) on the acquisition of a foreign language (L2). Such transfer can be positive (facilitating acquisition) or negative (impeding acquisition). This paper introduces SLABERT, a framework that models sequential second language acquisition using BERT and examines both positive and negative transfer effects.

2. Related Work

While cross-lingual transfer has received considerable attention in NLP research, most work concentrates on practical implications like tokenizer optimization. The TILT approach (Papadimitriou and Jurafsky, 2020) focuses on positive transfer with divergent training sets. SLABERT extends this by modeling sequential transfer relationships that arise in human SLA.

3. Methodology

3.1 Dataset Construction

The MAO-CHILDES (Multilingual Age Ordered CHILDES) dataset covers 5 typologically diverse languages: German, French, Polish, Indonesian, and Japanese. It uses Child-Directed Speech (CDS) to build L1 training sets that are naturalistic, ecologically valid, and tailored to language acquisition.

3.2 Model Architecture

SLABERT uses a Transformer-based architecture with BERT as the backbone. The model is pre-trained on L1 CDS data and then fine-tuned on L2 English data, mimicking sequential SLA.

3.3 Training Procedure

Training proceeds in two stages: first, pre-training on L1 CDS data; second, fine-tuning on L2 English data. This TILT-based cross-lingual transfer learning setup is used to examine how native CDS affects the acquisition of English as an L2.
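The ordering of the two stages can be sketched as follows. This is a minimal illustration, not the authors' training code: `train_sequentially`, `count_tokens`, and the example sentences are hypothetical, and the `update` callback merely stands in for a real masked-language-model gradient step.

```python
from typing import Callable, Iterable, List, Tuple


def train_sequentially(
    update: Callable[[str, str], None],
    l1_batches: Iterable[str],
    l2_batches: Iterable[str],
) -> List[Tuple[str, str]]:
    """Run stage 1 (L1) to completion, then stage 2 (L2), with the same update function."""
    history: List[Tuple[str, str]] = []
    for stage, batches in (("L1-pretrain", l1_batches), ("L2-finetune", l2_batches)):
        for batch in batches:
            update(stage, batch)  # a real system would take a gradient step here
            history.append((stage, batch))
    return history


# Hypothetical stand-in for a training step: it only counts tokens seen per stage.
tokens_seen = {"L1-pretrain": 0, "L2-finetune": 0}


def count_tokens(stage: str, batch: str) -> None:
    tokens_seen[stage] += len(batch.split())


# Made-up example utterances, not taken from MAO-CHILDES.
history = train_sequentially(
    count_tokens,
    l1_batches=["wo ist der Ball", "da ist er"],
    l2_batches=["where is the ball"],
)
```

The point of the sketch is only that every L1 batch is consumed before any L2 batch, so whatever the model learned from L1 is already in place when English training begins.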

4. Experiments and Results

4.1 BLiMP Evaluation

Models are tested on the BLiMP grammar test suite. Results show that the L1 can either facilitate or interfere with L2 learning. Greater language family distance between the L1 and English predicts more negative transfer, consistent with human SLA.
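BLiMP is built from minimal pairs: a grammatical sentence and a minimally different ungrammatical one, with the model counted as correct when it scores the grammatical sentence higher. A rough sketch of that accuracy computation is given here; `minimal_pair_accuracy` and `toy_score` are hypothetical names, and the toy scorer is a placeholder for a real language-model sentence score.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (grammatical, ungrammatical) pairs where the grammatical one scores higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one minimal pair is required")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


def toy_score(sentence: str) -> float:
    """Placeholder scorer: penalises one known agreement error, nothing else."""
    return -1.0 if "cats is" in sentence else 0.0


# Illustrative pairs, not drawn from BLiMP itself.
pairs = [
    ("The cats are asleep.", "The cats is asleep."),  # toy scorer gets this right
    ("The dog barks.", "The dog bark."),              # toy scorer ties, counted as wrong
]
accuracy = minimal_pair_accuracy(pairs, toy_score)
```

Ties are counted as errors in this sketch because the comparison is strict; whether a given evaluation script treats ties the same way is an assumption.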

4.2 Language Family Distance Analysis

Table 1 shows the performance of SLABERT models on BLiMP across different L1 languages. German (closer to English) shows higher accuracy than Japanese (more distant).

Table 1: BLiMP accuracy by L1

L1 Language | BLiMP Accuracy (%)
German      | 78.5
French      | 74.2
Polish      | 71.8
Indonesian  | 68.3
Japanese    | 65.1
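To make the size of the effect concrete, the Table 1 figures can be expressed as accuracy gaps relative to the best-performing L1. The snippet below is simple arithmetic on the reported numbers; the variable names are illustrative.

```python
# BLiMP accuracy (%) per L1, copied from Table 1.
blimp_accuracy = {
    "German": 78.5,
    "French": 74.2,
    "Polish": 71.8,
    "Indonesian": 68.3,
    "Japanese": 65.1,
}

# L1 with the highest English BLiMP accuracy.
best_l1 = max(blimp_accuracy, key=blimp_accuracy.get)

# Gap in percentage points between the best L1 and each L1.
gaps = {
    l1: round(blimp_accuracy[best_l1] - acc, 1)
    for l1, acc in blimp_accuracy.items()
}

for l1, gap in gaps.items():
    print(f"{l1:<11} {gap:>5.1f} points below {best_l1}")
```

The gap grows from 4.3 points for French to 13.4 points for Japanese, which matches the ordering described in the text.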

5. Core Insight, Logical Flow, Strengths & Flaws, Actionable Insights

Core Insight: SLABERT demonstrates that negative transfer in SLA is not only a human phenomenon: it can also be modeled and measured in language models (LMs), with language family distance as a key predictor.

Logical Flow: The paper moves from SLA theory to dataset construction (MAO-CHILDES), to model training, to evaluation on BLiMP, and finally to analysis of transfer effects. The flow is coherent but could be tighter in connecting NLP metrics to SLA theory.

Strengths & Flaws: Strengths include the novel use of CDS data and the focus on negative transfer, which is underexplored. Flaws include limited language coverage (only 5 languages) and lack of comparison with human learner data.

Actionable Insights: Researchers should extend this to more languages and incorporate human learner benchmarks. Practitioners can use SLABERT to design better cross-lingual NLP systems that account for negative transfer.

6. Original Analysis

SLABERT represents a significant step toward bridging computational linguistics and second language acquisition research. By modeling negative transfer, it addresses a gap in NLP, where most work focuses on positive transfer. The use of Child-Directed Speech is particularly innovative, as it provides ecologically valid training data that mirrors natural language acquisition. However, the study's reliance on BLiMP as the sole evaluation benchmark means it may miss other aspects of SLA, such as pragmatic or discourse-level transfer. Future work should incorporate more comprehensive benchmarks and compare against human learner data to validate the model's predictions. The paper's finding that conversational speech data facilitates acquisition more than scripted speech data aligns with research on the importance of interactive input in SLA (e.g., Long, 1996). This suggests that SLABERT could be used to optimize language learning materials by prioritizing conversational data.

7. Technical Details

The model uses a Transformer architecture with 12 layers, 768 hidden dimensions, and 12 attention heads. It is trained with the masked language modeling objective: the cross-entropy loss, i.e. the negative log-likelihood of the masked tokens given the rest of the sequence, is minimized: $\mathcal{L} = -\sum_{i \in \text{masked}} \log P(x_i \mid x_{\setminus i})$.
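The loss above can be computed directly from per-position logits. The following is a dependency-free sketch under simplifying assumptions (a two-word vocabulary, hand-written logits, a summed rather than averaged loss); `masked_lm_loss` is a hypothetical helper, not part of any library.

```python
import math
from typing import Sequence


def masked_lm_loss(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    masked_positions: Sequence[int],
) -> float:
    """Sum of -log P(x_i | context) over masked positions, with P given by a softmax over logits[i]."""
    total = 0.0
    for i in masked_positions:
        row = logits[i]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        log_prob = row[targets[i]] - log_z
        total -= log_prob
    return total


# Three positions, vocabulary of size 2; only positions 0 and 1 are masked.
logits = [[0.0, 0.0], [2.0, 0.0], [0.0, 0.0]]
targets = [0, 0, 1]
loss = masked_lm_loss(logits, targets, masked_positions=[0, 1])
```

Position 2 is unmasked and therefore contributes nothing, mirroring the restriction of the sum to $i \in \text{masked}$. Frameworks commonly average over masked tokens instead of summing; that only rescales the loss.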

8. Case Study: Cross-Linguistic Transfer Example

Consider a German L1 speaker learning English. German has flexible word order, while English is more rigid. SLABERT trained on German CDS shows higher accuracy on English word order tasks (e.g., subject-verb-object) compared to Japanese-trained models, reflecting positive transfer. However, German-trained models show lower accuracy on English article usage (since German has gendered articles), reflecting negative transfer.

9. Future Directions

Future work should extend SLABERT to more languages, incorporate multimodal data (e.g., visual context), and develop interactive learning scenarios. The framework could also be applied to study language attrition and multilingualism. Additionally, integrating insights from cognitive science could improve the model's psychological plausibility.

10. References