Select Language

MENmBERT: Transfer Learning for Malaysian English NLP

Research on transfer learning from English PLMs to Malaysian English for improved Named Entity Recognition and Relation Extraction performance in low-resource settings.
learn-en.org | PDF Size: 0.2 MB
Rating: 4.5/5
Your Rating
You have already rated this document
PDF Document Cover - MENmBERT: Transfer Learning for Malaysian English NLP

Table of Contents

26.27%

Improvement in RE Performance

14,320

News Articles in MEN Corpus

6,061

Annotated Entities

1. Introduction

Malaysian English represents a unique linguistic challenge in NLP - a low-resource creole language that incorporates elements from Malay, Chinese, and Tamil languages alongside Standard English. This research addresses the critical performance gap in Named Entity Recognition (NER) and Relation Extraction (RE) tasks when applying standard pre-trained language models to Malaysian English text.

The morphosyntactic adaptations, semantic features, and code-switching patterns characteristic of Malaysian English cause significant performance degradation in existing state-of-the-art models. Our work introduces MENmBERT and MENBERT, specifically tailored language models that bridge this gap through strategic transfer learning approaches.

2. Background and Related Work

The adaptation of pre-trained language models to domain-specific or language-specific corpora has demonstrated significant improvements across various NLP tasks. Research by Martin et al. (2020) and Antoun et al. (2021) has shown that further pre-training on specialized corpora enhances model performance in targeted linguistic contexts.

Malaysian English presents unique challenges due to its creole nature, featuring loanwords, compound words, and derivations from multiple source languages. The code-switching phenomenon, where speakers mix English and Malay within single utterances, creates additional complexity for standard NLP models.

3. Methodology

3.1 Pre-training Approach

MENmBERT leverages transfer learning from English PLMs through continued pre-training on the Malaysian English News (MEN) Corpus. The pre-training objective follows the masked language modeling approach:

$$L_{MLM} = -\mathbb{E}_{x \sim D} \sum_{i=1}^{n} \log P(x_i | x_{\\backslash i})$$

where $x$ represents the input sequence, $D$ is the MEN Corpus distribution, and $x_{\\backslash i}$ denotes the sequence with the $i$-th token masked.

3.2 Fine-tuning Strategy

The models were fine-tuned on the MEN-Dataset containing 200 news articles with 6,061 annotated entities and 4,095 relation instances. The fine-tuning process employed task-specific layers for NER and RE, with cross-entropy loss optimization:

$$L_{NER} = -\sum_{i=1}^{N} \sum_{j=1}^{T} y_{ij} \log(\hat{y}_{ij})$$

where $N$ is the number of sequences, $T$ is sequence length, $y_{ij}$ is the true label, and $\hat{y}_{ij}$ is the predicted probability.

4. Experimental Results

4.1 NER Performance

MENmBERT achieved a 1.52% overall improvement in NER performance compared to bert-base-multilingual-cased. While the overall improvement appears modest, detailed analysis reveals significant improvements across specific entity labels, particularly for Malaysian-specific entities and code-switched expressions.

Figure 1: NER performance comparison showing MENmBERT outperforming baseline models on Malaysian-specific entity types, with particularly strong performance on location and organization entities unique to Malaysian context.

4.2 RE Performance

The most dramatic improvement was observed in Relation Extraction, where MENmBERT achieved a 26.27% performance gain. This substantial improvement demonstrates the model's enhanced capability to understand semantic relationships in Malaysian English context.

Key Insights

  • Language-specific pre-training significantly improves performance on low-resource dialects
  • Code-switching patterns require specialized model architectures
  • Transfer learning from high-resource to low-resource languages shows promising results
  • Geographically-focused corpora enhance model performance for regional language variants

5. Analysis Framework

Industry Analyst Perspective

Core Insight

This research fundamentally challenges the one-size-fits-all approach to multilingual NLP. The 26.27% RE performance leap isn't just an incremental improvement - it's a damning indictment of how mainstream models fail marginalized language variants. Malaysian English isn't a niche case; it's the canary in the coal mine for hundreds of under-served linguistic communities.

Logical Flow

The methodology follows a brutally efficient three-step demolition of conventional wisdom: identify the performance gap (standard models fail spectacularly), deploy targeted transfer learning (MENmBERT architecture), and validate through rigorous benchmarking. The approach mirrors successful domain adaptation strategies seen in medical NLP (Lee et al., 2019) but applies them to linguistic diversity preservation.

Strengths & Flaws

Strengths: The 14,320-article corpus represents serious data curation effort. The dual-model approach (MENmBERT and MENBERT) shows methodological sophistication. The RE performance jump is undeniable.

Flaws: The modest 1.52% NER improvement raises eyebrows - either the evaluation metrics are flawed or the approach has fundamental limitations. The paper dances around this discrepancy without satisfactory explanation. The model's dependency on news domain data limits generalizability.

Actionable Insights

For enterprises operating in Southeast Asia: immediate adoption consideration. For researchers: replicate this approach for Singapore English, Indian English variants. For model developers: this proves that "multilingual" in practice means "dominant languages only" - time for a paradigm shift.

Analysis Framework Example

Case Study: Entity Recognition in Code-Switched Text

Input: "I'm going to the pasar malam in Kuala Lumpur then meeting Encik Ahmad at KLCC"

Standard BERT Output: [ORG] pasar malam, [LOC] Kuala Lumpur, [MISC] Encik Ahmad, [MISC] KLCC

MENmBERT Output: [EVENT] pasar malam, [CITY] Kuala Lumpur, [PERSON] Encik Ahmad, [LANDMARK] KLCC

This demonstrates MENmBERT's superior understanding of Malaysian cultural context and entity types.

6. Future Applications

The success of MENmBERT opens several promising directions for future research and application:

  • Cross-lingual Transfer: Applying similar approaches to other English variants (Singapore English, Indian English)
  • Multi-modal Integration: Combining text with audio data for improved code-switching detection
  • Real-time Applications: Deployment in customer service chatbots for Malaysian markets
  • Educational Technology: Language learning tools tailored to Malaysian English speakers
  • Legal and Government Applications: Document processing for Malaysian legal and administrative texts

The approach demonstrates scalability to other low-resource language variants and creole languages worldwide.

7. References

  1. Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
  2. Liu, Y., et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach.
  3. Conneau, A., et al. (2020). Unsupervised Cross-lingual Representation Learning at Scale.
  4. Lan, Z., et al. (2020). ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.
  5. Martin, L., et al. (2020). CamemBERT: a Tasty French Language Model.
  6. Antoun, W., et al. (2021). AraBERT: Transformer-based Model for Arabic Language Understanding.
  7. Chanthran, M., et al. (2024). Malaysian English News Dataset for NLP Tasks.
  8. Lee, J., et al. (2019). BioBERT: a pre-trained biomedical language representation model.