
Solving ESL Sentence Completion Questions via Pre-trained Neural Language Models

A research paper proposing a neural framework using pre-trained language models to automatically solve English as a Second Language (ESL) sentence completion questions, with experiments on a real-world K-12 dataset.


1. Introduction

Sentence Completion (SC) questions are a fundamental tool in assessing English as a Second Language (ESL) proficiency. They present a sentence with one or more blanks and a set of candidate words/phrases, testing a learner's grasp of grammar, syntax, and semantics. Automatically solving these questions has significant value for intelligent tutoring systems: it enables instant feedback, question-quality evaluation, and practice-material generation.

Traditional approaches, such as n-gram language models, struggle with the nuanced challenges of real-world ESL questions: highly confusing distractors crafted by professionals, deep linguistic knowledge requirements, and variable numbers of blanks/tokens. This paper proposes a neural framework leveraging large-scale pre-trained language models to address these challenges effectively.

2. Our Approach

The core of the proposed framework is adapting pre-trained sequence-to-sequence models, specifically Transformer-based architectures, for the SC task.

2.1 Problem Formulation

An SC question is defined as a tuple $(q, O)$, where $q$ is the sentence with $k$ blanks denoted by a special `[MASK]` token, and $O = \{o_1, o_2, ..., o_m\}$ is the set of $m$ candidate options (each option may fill one or multiple blanks). The goal is to select the option $o^* \in O$ that makes the completed sentence most plausible.
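To make the formulation concrete, here is a minimal Python sketch of the $(q, O)$ tuple. The class, its field names, and the semicolon convention for multi-blank options are illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import List

MASK = "[MASK]"  # placeholder token used for each blank


@dataclass
class SCQuestion:
    """One ESL sentence completion question (q, O)."""
    sentence: str       # sentence q containing one or more [MASK] blanks
    options: List[str]  # candidate options o_1..o_m; each may fill several blanks
    answer_index: int   # index of the correct option (known at training time)

    def fill(self, option: str) -> str:
        """Complete the sentence with one candidate option; multi-blank options
        are assumed to list their parts separated by ';', one part per blank."""
        completed = self.sentence
        for part in option.split(";"):
            completed = completed.replace(MASK, part.strip(), 1)
        return completed


# Example question from Section 5 of this summary:
q = SCQuestion(
    sentence=f"She {MASK} to the store yesterday and bought some milk.",
    options=["go", "goes", "went", "going"],
    answer_index=2,
)
print(q.fill("went"))
```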

2.2 Model Architecture

The model is based on a pre-trained encoder-decoder architecture (e.g., BART or T5). The input is the masked sentence $q$. For each candidate option $o_i$, the model generates a completed sentence by replacing the `[MASK]` tokens. The model scores each completion based on its generation probability or a fine-tuned classifier head. The score $S(o_i | q)$ can be derived from the negative log-likelihood of generating the completed sequence:

$S(o_i | q) = -\sum_{t=1}^{T} \log P(w_t | w_{<t}, q)$

where $w_t$ are the tokens of the completed sentence and $w_{<t}$ denotes the tokens preceding position $t$. The option with the lowest score (i.e., lowest perplexity) is selected.
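The paper does not publish implementation details, so the following is only a sketch of this scoring step using the Hugging Face `transformers` library with a BART checkpoint. The checkpoint choice, the literal `[MASK]` handling (BART's own mask token is `<mask>`; T5 uses sentinel tokens), and the length scaling are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Checkpoint choice is an assumption; the paper only names BART/T5-style models.
MODEL_NAME = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME).eval()


def nll_score(masked_sentence: str, completed_sentence: str) -> float:
    """Summed negative log-likelihood of the completed sentence given the
    masked sentence; lower means a more plausible completion."""
    inputs = tokenizer(masked_sentence, return_tensors="pt")
    labels = tokenizer(completed_sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # The returned loss is the mean token-level cross-entropy over the
        # labels; multiply by the label length to approximate the summed NLL.
        loss = model(**inputs, labels=labels).loss
    return loss.item() * labels.shape[1]


def answer(masked_sentence: str, options):
    """Pick the option whose completion has the lowest NLL."""
    # In practice the model's own mask token would be substituted for "[MASK]";
    # literal string replacement keeps the sketch simple.
    completions = [masked_sentence.replace("[MASK]", o) for o in options]
    scores = [nll_score(masked_sentence, c) for c in completions]
    return options[scores.index(min(scores))]


print(answer("She [MASK] to the store yesterday and bought some milk.",
             ["go", "goes", "went", "going"]))
```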

2.3 Training Strategy

The model is fine-tuned on a dataset of SC questions using a denoising autoencoder objective initially, followed by task-specific fine-tuning. The loss function typically combines a masked language modeling loss and a sequence classification loss to optimize for both sentence fluency and correct option discrimination.
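The exact objective is not spelled out in the paper; the sketch below shows one plausible way to combine a language modeling (fluency) term with an option-classification term. The interpolation weight `alpha` and the cross-entropy formulation are assumptions.

```python
import torch
import torch.nn.functional as F


def combined_loss(lm_loss: torch.Tensor,
                  option_logits: torch.Tensor,
                  gold_option: torch.Tensor,
                  alpha: float = 0.5) -> torch.Tensor:
    """Weighted sum of a language modeling (fluency) loss and an
    option-discrimination loss.

    lm_loss       : scalar LM loss for the correct completion
    option_logits : (batch, num_options) score per candidate option
    gold_option   : (batch,) index of the correct option
    alpha         : interpolation weight between the two terms (assumed value)
    """
    cls_loss = F.cross_entropy(option_logits, gold_option)
    return alpha * lm_loss + (1.0 - alpha) * cls_loss


# Toy usage with random tensors (4 questions, 4 options each):
print(combined_loss(torch.tensor(2.3), torch.randn(4, 4), torch.tensor([2, 0, 1, 3])))
```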

3. Experiments & Results

3.1 Dataset

Experiments were conducted on a real-world K-12 ESL SC question dataset collected from an online education platform. The dataset contains thousands of questions with high-quality, professionally designed distractors, covering various grammar and vocabulary points.

Dataset Statistics

  • Source: Real-world K-12 Online Education Platform
  • Question Count: Several thousand
  • Blanks per Question: 1 or more
  • Options per Blank: 3 to 5
  • Focus: Grammar, Syntax, Semantics

3.2 Baselines

The proposed model was compared against several strong baselines:

  • N-gram LM: Traditional statistical language model.
  • Blank LM [2]: An iterative language model for blank filling.
  • BERT (Masked LM): Using BERT's masked token prediction probabilities directly (see the sketch after this list).
  • Fine-tuned BERT (Classifier): BERT with a classification layer on the `[CLS]` token.
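For reference, here is a minimal sketch of the BERT masked-LM baseline for single-blank questions whose options are single WordPiece tokens. The checkpoint and the single-token restriction are assumptions; multi-token options are exactly where this baseline breaks down, as the results below indicate.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # checkpoint assumed
bert = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()


def masked_lm_answer(sentence: str, options):
    """Pick the option with the highest probability at the [MASK] position."""
    inputs = tok(sentence, return_tensors="pt")
    # Position of the [MASK] token in the input sequence.
    mask_pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        probs = bert(**inputs).logits[0, mask_pos].softmax(dim=-1)
    # Only valid when each option maps to a single WordPiece token.
    option_ids = [tok.convert_tokens_to_ids(o) for o in options]
    best = torch.stack([probs[i] for i in option_ids]).argmax()
    return options[int(best)]


print(masked_lm_answer("She [MASK] to the store yesterday and bought some milk.",
                       ["go", "goes", "went", "going"]))
```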

3.3 Main Results

The proposed pre-trained sequence-to-sequence model significantly outperformed all baseline methods in prediction accuracy on the held-out test set. The key advantage stemmed from its ability to model the entire sentence coherence after insertion, rather than just local context, effectively handling multi-blank questions and phrasal options.

Key Insights from Results

  • Pre-trained models (BERT, proposed) vastly outperform traditional n-gram LMs.
  • The sequence-to-sequence generation approach outperforms masked LM and classification approaches, especially for multi-token options.
  • The model demonstrates robustness against professionally crafted, confusing distractors.

3.4 Precision-Recall Analysis

The paper presents a precision-recall trade-off analysis, crucial for real-world deployment. By adjusting the score threshold for accepting an answer, the system can be tuned for high-precision (conservative, only answering when very sure) or high-recall (attempting more questions) modes. This flexibility is vital for adaptive learning systems where confidence estimation matters.
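A sketch of this thresholding idea follows. The margin-based confidence measure and the threshold values are our own illustrative choices; the paper itself only describes tuning a score threshold.

```python
from typing import Dict, Optional


def answer_or_abstain(option_scores: Dict[str, float],
                      threshold: float) -> Optional[str]:
    """Return the best option only when the confidence margin clears the threshold.

    option_scores maps each option to its NLL (lower is better); the margin
    between the best and second-best score serves as a simple confidence proxy.
    """
    ranked = sorted(option_scores.items(), key=lambda kv: kv[1])
    (best, s1), (_, s2) = ranked[0], ranked[1]
    return best if (s2 - s1) >= threshold else None  # None = abstain


scores = {"go": 41.2, "goes": 38.7, "went": 31.5, "going": 44.0}
print(answer_or_abstain(scores, threshold=5.0))    # answers "went"
print(answer_or_abstain(scores, threshold=10.0))   # abstains (None): margin too small
```

Raising the threshold trades recall for precision: fewer questions are attempted, but the answers that are given are more reliable, which suits formal assessment; lowering it suits exploratory practice.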

4. Technical Analysis & Insights

Core Insight: This paper isn't about a novel architecture; it's a masterclass in pragmatic AI engineering. The authors correctly identify that the brute force of modern pre-trained LMs, specifically sequence-to-sequence models like BART or T5, is the most effective tool for the messy, constrained, yet semantically rich problem of ESL sentence completion. The real innovation is in the framing and fine-tuning strategy for a niche educational domain.

Logical Flow: The logic is compellingly straightforward: 1) ESL SC questions are hard due to expert-level distractors and complex constraints. 2) Pre-trained LMs have vast world and linguistic knowledge. 3) Therefore, fine-tune a powerful, general-purpose LM (a seq2seq model) on domain-specific data to solve the task. The experimental results validate this pipeline decisively, showing the seq2seq approach's superiority over pure masked LMs (like BERT) which struggle with multi-token coherence.

Strengths & Flaws: The major strength is the direct application of state-of-the-art NLP to a real, impactful educational problem with rigorous evaluation. The use of a real K-12 dataset adds immense credibility, as noted in educational data mining literature (e.g., work from the International Educational Data Mining Society). However, the paper's flaw is a common one in applied AI: opacity in the "how." While it mentions fine-tuning a denoising autoencoder, details on the exact loss functions, hyperparameters, and data augmentation techniques for generating `[MASK]`ed training samples are sparse. This makes replication difficult. Furthermore, it doesn't deeply analyze why the model fails on certain questions—a crucial step for educational diagnostic systems. Contrast this with the interpretability efforts in models like CycleGAN, where attention maps or feature visualizations are used to explain outcomes.

Actionable Insights: For EdTech companies, the takeaway is clear: stop building custom rule-based or simple statistical systems for language assessment. The ROI lies in leveraging and carefully fine-tuning foundation models. The precision-recall analysis provides a blueprint for product integration: build a dual-mode system where high-precision mode aids formal assessment, and high-recall mode drives exploratory practice. The next step, as seen in advanced tutoring systems research (e.g., Carnegie Learning's platforms), is to extend this from "answer scoring" to "distractor analysis" and "personalized hint generation," using the model's confidence scores and internal representations to diagnose specific student misconceptions.

5. Analysis Framework Example

Scenario: Analyzing why a model might fail on a particular SC question.

Question: "She _____ to the store yesterday and bought some milk."
Options: (A) go (B) goes (C) went (D) going

Framework Application:

  1. Input Representation: Model receives: "She [MASK] to the store yesterday and bought some milk."
  2. Option Scoring: For each option, the model generates/completes the sentence and computes a score.
    • Score("went") = -log P("She went to the store...") // Should be lowest (best).
    • Score("goes") = -log P("She goes to the store yesterday...") // Higher due to tense mismatch.
  3. Failure Diagnosis: If the model incorrectly chooses "goes," we investigate:
    • Data Bias: Was "goes" overly frequent in the training data in similar contexts?
    • Context Window: Did the model fail to give enough weight to the temporal cue "yesterday"?
    • Distractor Strength: Is "goes" a particularly strong distractor because it's grammatically correct for the subject "She" in a vacuum?
  4. Remediation: Augment training data with more examples emphasizing temporal adverb-verb agreement, or adjust the fine-tuning objective to penalize tense inconsistencies more heavily.
This structured analysis moves beyond simple accuracy metrics to actionable model improvement.
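As a small sketch of step 3, the helper below turns per-option scores into a distractor report. The scores are hypothetical and only illustrate how a small gap flags a strong distractor such as "goes".

```python
def diagnose(option_scores, gold):
    """Print each option's NLL and its gap to the gold answer, flagging
    distractors the model finds nearly as plausible as the correct option."""
    gold_score = option_scores[gold]
    for option, score in sorted(option_scores.items(), key=lambda kv: kv[1]):
        gap = score - gold_score
        flag = "  <-- strong distractor" if option != gold and gap < 2.0 else ""
        marker = "*" if option == gold else " "
        print(f"{marker} {option:8s} NLL={score:6.2f}  gap={gap:+6.2f}{flag}")


# Hypothetical scores for the example question (illustrative values only):
diagnose({"go": 41.2, "goes": 33.1, "went": 31.5, "going": 44.0}, gold="went")
```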

6. Future Applications & Directions

  • Personalized Learning Paths: Using model confidence and error patterns to identify a student's specific grammatical weaknesses and recommend targeted exercises.
  • Automatic Question Generation: Reversing the model to generate novel, high-quality SC questions with plausible distractors by masking words in authentic sentences and using the model to propose alternatives, similar to methods explored in arXiv:2005.05909.
  • Multimodal Integration: Combining text-based models with speech recognition to assess spoken sentence completion, providing holistic language proficiency evaluation.
  • Explainable AI for Education (XAI-Ed): Developing techniques to make the model's "reasoning" transparent—e.g., highlighting which words in the sentence were key to rejecting a distractor—to build trust and provide deeper feedback.
  • Cross-lingual Transfer: Applying the framework to SC questions for other languages, leveraging multilingual pre-trained models like mT5 or mBART.

7. References

  1. Zweig, G., et al. (2012). SAT Sentence Completion. Microsoft Research Tech Report.
  2. Shen, L., et al. (2015). Blank Language Model. EMNLP.
  3. Donahue, J., et al. (2020). Pre-training with Masked Text. NeurIPS.
  4. Liu, Y., et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.
  5. Lewis, M., et al. (2020). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. ACL.
  6. Raffel, C., et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR.
  7. Koedinger, K.R., et al. (2012). The Knowledge-Learning-Instruction Framework: Bridging the Science-Practice Chasm to Enhance Robust Student Learning. Cognitive Science.
  8. Zhu, J.Y., et al. (2017). Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. ICCV. (Cited as an example of interpretability efforts).
  9. International Educational Data Mining Society (IEDMS). Resources on Real-world Educational Datasets. https://educationaldatamining.org/