Solving ESL Sentence Completion Questions via Pre-trained Neural Language Models

A research paper proposing a neural framework using pre-trained language models to automatically solve English as a Second Language (ESL) sentence completion questions, with experiments on a real-world K-12 dataset.

1. Introduction

Sentence Completion (SC) questions are a fundamental tool in assessing English as a Second Language (ESL) proficiency. They present a sentence with one or more blanks and a set of candidate words or phrases. Automating the solution to these questions offers significant benefits for language learners (instant feedback), educators (question quality evaluation), and the development of intelligent tutoring systems.

Previous computational approaches, such as n-gram language models or specialized blank LMs, face challenges in real-world educational settings: highly confusing distractors crafted by professionals, the need for deep linguistic knowledge (grammar, syntax, semantics), and the variable number of blanks and tokens per blank.

This work proposes a neural framework leveraging large-scale pre-trained language models to address these challenges, demonstrating superior performance on a real-world K-12 ESL dataset.

2. Our Approach

2.1 Problem Formulation

An SC question is defined as a tuple $(q, O)$, where $q$ is the sentence with $m$ blanks denoted by `[MASK]` tokens, and $O = \{o_1, o_2, ..., o_n\}$ is the set of $n$ candidate options (typically 3-5). Each option $o_i$ is a sequence of tokens intended to fill all blanks collectively. The goal is to select the option $o^* \in O$ that makes the completed sentence most plausible.
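
To make the formulation concrete, here is a minimal Python sketch of the data structures involved (our illustration; the `SCQuestion` and `fill_blanks` names are not from the paper):

```python
from dataclasses import dataclass
from typing import List

MASK = "[MASK]"  # placeholder for a blank, as in the formulation above

@dataclass
class SCQuestion:
    """A sentence completion question (q, O)."""
    q: str                     # sentence with m blanks marked by [MASK]
    options: List[List[str]]   # each option supplies one filler per blank;
                               # a filler may itself contain several tokens

def fill_blanks(q: str, option: List[str]) -> str:
    """Insert an option's fillers into the blanks of q, left to right."""
    parts = q.split(MASK)
    assert len(parts) == len(option) + 1, "option must fill every blank"
    return "".join(p + w for p, w in zip(parts, option)) + parts[-1]

# A two-blank question whose options fill both blanks collectively.
question = SCQuestion(
    q="He [MASK] to the store [MASK].",
    options=[["go", "today"], ["went", "yesterday"], ["goes", "tomorrow"]],
)
print(fill_blanks(question.q, question.options[1]))
# -> He went to the store yesterday.
```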

2.2 Model Architecture

The core of the approach is a sequence-to-sequence model based on the Transformer architecture, pre-trained using a denoising autoencoder objective (e.g., BART or T5). The model is fine-tuned for the SC task. For a given question $q$ and an option $o_i$, the model is tasked with reconstructing the original, fully-formed sentence.

The input to the encoder is the corrupted sequence (the question with blanks). The decoder is conditioned on this encoding and must generate the original sentence. The option $o_i$ is inserted into the blanks of $q$ to create the target sequence for the decoder. Each option is then scored by the negative log-likelihood of generating its target sequence given the input.

2.3 Training and Inference

During training, the model learns to reconstruct sentences from their masked versions. For inference, given a question $q$ and its options $O$, the model computes a score $s_i$ for each option $o_i$: $$s_i = -\sum_{t=1}^{T} \log P(w_t \mid w_{<t}, q),$$ where $w_1, \dots, w_T$ is the target sequence obtained by inserting $o_i$ into the blanks of $q$. The option with the lowest score, i.e., the highest likelihood, is selected: $o^* = \arg\min_{o_i \in O} s_i$.
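
The scoring step can be sketched with the Hugging Face `transformers` library. This is our illustration under stated assumptions (a BART checkpoint and simple lowest-score selection); the paper's exact model, checkpoint, and preprocessing may differ:

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
model.eval()

def score_option(question: str, completed: str) -> float:
    """Total negative log-likelihood s_i of the completed sentence
    given the masked question (lower is better)."""
    # BART's mask token is "<mask>"; map the document's [MASK] onto it.
    src = tok(question.replace("[MASK]", tok.mask_token), return_tensors="pt")
    tgt = tok(completed, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(input_ids=src.input_ids,
                    attention_mask=src.attention_mask,
                    labels=tgt)
    # out.loss is the mean per-token cross-entropy; scale by length
    # to obtain the summed NLL from the formula above.
    return out.loss.item() * tgt.size(1)

q = "He [MASK] to the store yesterday."
options = ["go", "went", "goes"]
scores = [score_option(q, q.replace("[MASK]", o)) for o in options]
print(options[scores.index(min(scores))])  # expected: "went"
```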

3. Experiments & Results

3.1 Dataset

A real-world dataset collected from an online K-12 education platform was used. It contains thousands of SC questions created by English teaching professionals for Chinese ESL learners. The dataset features questions with 1-3 blanks and high-quality, semantically similar distractors.

Dataset Statistics

  - Source: real-world K-12 online education platform
  - Questions: several thousand
  - Blanks per question: 1 to 3
  - Options per question: 3 to 5

3.2 Baselines

The proposed model was compared against several strong baselines:

  1. N-gram Language Model (LM): A traditional statistical model trained on a large corpus.
  2. Blank LM [Shen et al.]: A specialized iterative language model for filling blanks.
  3. Masked LM (e.g., BERT): Using a pre-trained masked language model to score the probability of the option tokens in the blank positions (a minimal scoring sketch follows this list).
  4. Sequence-to-Sequence LM (non-pretrained): A standard Transformer model trained from scratch on the SC task.
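
For contrast with the sequence-to-sequence approach, here is a minimal sketch of the Masked LM baseline from item 3, again using Hugging Face `transformers` (our illustration; it assumes each blank is filled by a single WordPiece token):

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")
mlm.eval()

def mlm_option_score(question: str, fillers: list) -> float:
    """Sum of log-probabilities of the filler tokens at the blank
    positions. Each blank is scored independently, so dependencies
    between multiple blanks are ignored."""
    enc = tok(question, return_tensors="pt")  # BERT's mask token is [MASK]
    mask_pos = (enc.input_ids[0] == tok.mask_token_id).nonzero(as_tuple=True)[0]
    assert len(mask_pos) == len(fillers), "one single-token filler per blank"
    with torch.no_grad():
        log_probs = torch.log_softmax(mlm(**enc).logits[0], dim=-1)
    ids = tok.convert_tokens_to_ids(fillers)
    return sum(log_probs[p, i].item() for p, i in zip(mask_pos, ids))

print(mlm_option_score("He [MASK] to the store yesterday.", ["went"]))
```

Because each blank is scored in isolation, this baseline cannot capture inter-dependencies between blanks, the weakness discussed in Section 4 that motivates the sequence-to-sequence formulation.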

3.3 Main Results

The proposed pre-trained sequence-to-sequence model significantly outperformed all baseline models in terms of prediction accuracy on the held-out test set. The key advantage stems from its pre-training on massive text corpora, which imbues it with deep linguistic knowledge and world knowledge crucial for disambiguating subtle distractors. The sequence-to-sequence formulation also naturally handles multiple blanks and multi-token options.

3.4 Precision-Recall Analysis

The paper conducted a precision-recall trade-off analysis to discuss practical deployment. By adjusting the score threshold for accepting an answer, the system can be tuned for high precision (providing feedback only when very confident, minimizing errors) or high recall (attempting to answer more questions, potentially with more mistakes). This is critical for real-life educational applications where the cost of incorrect feedback is high.
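
A simple way to realize this trade-off is to answer only when the model's confidence margin between the top two options exceeds a threshold. The sketch below is our illustration with hypothetical numbers; here "recall" means the fraction of questions the system attempts, which may differ from the paper's exact definition:

```python
import numpy as np

def precision_recall_at_thresholds(margins, correct, thresholds):
    """margins: confidence margin per question (top-1 minus top-2
    option probability); correct: whether the top option was right."""
    margins, correct = np.asarray(margins), np.asarray(correct, dtype=float)
    rows = []
    for th in thresholds:
        answered = margins >= th
        recall = answered.mean()  # fraction of questions attempted
        precision = correct[answered].mean() if answered.any() else float("nan")
        rows.append((th, precision, recall))
    return rows

# Hypothetical data: raising the threshold trades recall for precision.
for th, p, r in precision_recall_at_thresholds(
        margins=[0.9, 0.4, 0.05, 0.7, 0.02],
        correct=[1, 1, 0, 1, 0],
        thresholds=[0.0, 0.1, 0.5]):
    print(f"threshold={th:.1f}  precision={p:.2f}  recall={r:.2f}")
```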

4. Key Insights & Analysis

Core Insight: The paper's fundamental breakthrough isn't just applying a pre-trained model to a new task; it's recognizing that the sequence-to-sequence denoising objective is a near-perfect proxy for the cognitive process behind solving SC questions. The model isn't just picking a word; it's mentally "completing" the sentence and checking for coherence—a process mirrored by reconstructing the full sentence from a masked version. This is a more elegant and powerful approach than simply using a Masked LM to score individual tokens, which fails to capture inter-dependencies between multiple blanks.

Logical Flow: The argument is compellingly simple: 1) Real-world ESL questions are hard due to expert-crafted distractors and complex linguistic constraints. 2) Traditional and even early neural methods lack the nuanced understanding to tackle this. 3) Large-scale pre-trained LMs, specifically those trained with a denoising objective (like BART or T5), have this nuanced understanding. 4) Therefore, framing SC as a sequence reconstruction task using these models should yield state-of-the-art results. The experiments robustly validate this flow.

Strengths & Flaws: The major strength is the conceptual elegance and empirical success of the method. The use of a real-world K-12 dataset, not a cleaned academic corpus, adds tremendous practical credibility. The precision-recall analysis shows thoughtful consideration for deployment. The primary flaw, common to many AI-in-education papers, is the black-box nature of the solution. It doesn't provide explainable feedback: a student gets "D is correct" but not "because 'must' indicates logical certainty in the first clause, and 'can't' is the correct negation in the second clause based on the evidence 'hates black color'." As noted in the 2022 review "Explainable AI for Education" (XAIED), this lack of interpretability limits direct pedagogical utility. Furthermore, the model's performance is inherently tied to its pre-training data, which may contain biases or lack coverage of certain ESL error patterns.

Actionable Insights: For EdTech companies, this research is a ready-made blueprint. The first step is to fine-tune a model like T5 or BART on proprietary question banks. However, the real competitive edge won't come from mere accuracy but from explainability. The next iteration should integrate techniques from interpretable AI, perhaps using attention weights to highlight the parts of the sentence most relevant to the chosen answer or generating natural language justifications. Second, this technology's prime application is not in high-stakes testing but in practice and formative assessment. Integrating it into adaptive learning platforms to generate a virtually unlimited stream of personalized practice questions (by masking words in authentic texts) is a logical and high-value direction, moving from a solver to a generator, as hinted at in the introduction.

5. Technical Details

The model leverages the Transformer architecture's encoder-decoder framework. The pre-training objective is crucial. For a model like BART, it is trained by corrupting text with an arbitrary noising function (e.g., token masking, sentence permutation, document rotation) and then learning to reconstruct the original text. This makes it ideal for the SC task, which is a controlled form of text corruption and reconstruction.
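
A toy illustration of such a noising function (our sketch; BART's actual text-infilling noiser replaces whole spans, with lengths drawn from a Poisson distribution, by a single mask token):

```python
import random

def mask_tokens(sentence: str, mask: str = "<mask>",
                p: float = 0.3, seed: int = 0) -> str:
    """Toy noiser: independently replace each whitespace token with
    the mask token with probability p."""
    rng = random.Random(seed)
    return " ".join(mask if rng.random() < p else t
                    for t in sentence.split())

original = "He went to the store yesterday ."
corrupted = mask_tokens(original)
print(corrupted)  # e.g. "He went to <mask> store yesterday ."
# Pre-training pair: corrupted -> original. The SC task mirrors this:
# the blanks are the corruption, the completed sentence is the target.
```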

The fine-tuning objective is to minimize the cross-entropy loss between the decoder's output distribution and the target sequence (the sentence completed with the correct option). For a batch of $N$ examples, the loss function is: $$\mathcal{L} = -\frac{1}{N} \sum_{j=1}^{N} \sum_{t=1}^{T_j} \log P\left(w_t^{(j)} \mid w_{<t}^{(j)}, q^{(j)}\right),$$ where $T_j$ is the length of the $j$-th target sequence and $q^{(j)}$ is the corresponding masked question.
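
In practice, this is the standard sequence-to-sequence training loop; a minimal sketch with Hugging Face `transformers` follows (our illustration, assuming a BART checkpoint and a toy dataset; the library computes $\mathcal{L}$ internally from the `labels` argument):

```python
import torch
from torch.utils.data import DataLoader
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def collate(batch):
    """batch: list of (masked_question, completed_sentence) pairs."""
    src = tok([q.replace("[MASK]", tok.mask_token) for q, _ in batch],
              return_tensors="pt", padding=True, truncation=True)
    tgt = tok([c for _, c in batch],
              return_tensors="pt", padding=True, truncation=True)
    # Padding positions are set to -100 so they are ignored by the loss.
    labels = tgt.input_ids.masked_fill(tgt.input_ids == tok.pad_token_id, -100)
    return src, labels

pairs = [("He [MASK] to the store yesterday.",
          "He went to the store yesterday.")]  # toy training data
loader = DataLoader(pairs, batch_size=8, shuffle=True, collate_fn=collate)

model.train()
for src, labels in loader:
    # The returned loss is the batched cross-entropy L defined above.
    loss = model(input_ids=src.input_ids,
                 attention_mask=src.attention_mask,
                 labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```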

6. Analysis Framework Example

Scenario: Evaluating a candidate model for an SC task.

Framework Application:

  1. Task Decomposition: Break down the SC question: Identify the number of blanks, the part-of-speech or syntactic role required for each, and the semantic relationship between the sentence clues and the correct answer.
  2. Model Scoring: For each option, use the model to compute the sequence score $s_i$. For example, for the question "He _ to the store yesterday," with options {go, went, goes}, the model would assign the highest likelihood (lowest $s_i$) to the sequence "He went to the store yesterday" due to correct past-tense agreement with "yesterday."
  3. Error Analysis: If the model fails, analyze the failure mode. Did it choose "go"? This suggests a weakness in grammatical tense understanding. Did it choose "goes"? This suggests a weakness in subject-verb agreement. This analysis guides further data collection or model adjustment.
  4. Distractor Strength Assessment: Use the model's score distribution across options. A high probability for the correct answer and very low probabilities for the distractors indicate an easy question. If two options receive similar, high probabilities, the question contains a high-quality, confusing distractor, which is valuable for diagnostic assessment (a scoring sketch follows this framework).
This framework moves beyond simple accuracy to a diagnostic understanding of both student and model capabilities.
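
The distractor-strength step can be sketched as follows (our illustration with hypothetical scores; it converts the NLL scores $s_i$ into a probability distribution over options and inspects the top-two margin):

```python
import math

def option_probabilities(scores):
    """Softmax over negated NLL scores s_i, shifted by the minimum
    score for numerical stability."""
    lo = min(scores)
    weights = [math.exp(-(s - lo)) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]

def distractor_report(options, scores, easy_margin=0.5):
    probs = option_probabilities(scores)
    ranked = sorted(zip(options, probs), key=lambda x: -x[1])
    (best, p1), (runner_up, p2) = ranked[0], ranked[1]
    margin = p1 - p2
    verdict = "easy question" if margin > easy_margin else "confusing distractor"
    return best, runner_up, round(margin, 3), verdict

# Hypothetical scores: "went" wins by a wide margin -> easy question.
# Two near-equal scores would instead flag a strong distractor.
print(distractor_report(["go", "went", "goes"], [7.2, 2.1, 6.8]))
```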

7. Future Applications & Directions

  1. Explainable AI (XAI) Integration: The most critical direction is evolving from a "black-box" solver to an "explainable tutor." Future models should generate rationales, highlight key sentence evidence, or even identify the specific grammar rule being tested.
  2. Personalized Distractor Generation: The model can be used to generate plausible but incorrect distractors tailored to a student's common error patterns, creating hyper-personalized practice.
  3. Automated Question Generation (AQG): Reverse the process. Given a text, the model can identify key words to mask and generate plausible distractors, automatically creating new SC questions for practice banks, scaling content creation massively.
  4. Multimodal Extension: For younger learners or specific contexts, SC questions may involve images. Future work could involve multimodal pre-trained models (like VL-T5) to solve or generate questions combining text and visual clues.
  5. Cross-lingual Transfer: Applying the framework to other languages by leveraging multilingual pre-trained models (like mT5), aiding ESL learners whose first language is not Chinese.

8. References

  1. Liu, Q., Liu, T., Zhao, J., et al. (2021). Solving ESL Sentence Completion Questions via Pre-trained Neural Language Models. arXiv:2107.07122.
  2. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT.
  3. Lewis, M., Liu, Y., Goyal, N., et al. (2020). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. Proceedings of ACL.
  4. Shen, T., Quach, V., Barzilay, R., & Jaakkola, T. (2020). Blank Language Models. Proceedings of EMNLP.
  5. Zweig, G., & Burges, C. J. (2012). A Challenge Set for Advancing Language Modeling. Proceedings of the NAACL-HLT Workshop.
  6. Holstein, K., McLaren, B. M., & Aleven, V. (2022). Explainable AI for Education (XAIED). In The Handbook of Artificial Intelligence in Education.
  7. Raffel, C., Shazeer, N., Roberts, A., et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research.