Performance Comparison of ChatGPT, Bing Chat, and Bard on VNHSGE English Dataset

A comparative analysis of the performance of OpenAI ChatGPT, Microsoft Bing Chat, and Google Bard on the Vietnamese High School Graduation Examination English dataset.

1. Introduction

This paper presents a performance comparison of three prominent large language models (LLMs)—OpenAI's ChatGPT (GPT-3.5), Microsoft's Bing Chat, and Google's Bard—on the Vietnamese High School Graduation Examination (VNHSGE) English dataset. The study aims to evaluate their capabilities in the specific context of Vietnamese high school English education, particularly as ChatGPT is not officially available in Vietnam. The research addresses three key questions regarding model performance, comparison to human students, and the potential applications of LLMs in this educational setting.

2. Related Works

The paper situates itself within the broader context of AI integration in education, highlighting the transformative potential of LLMs like BERT and GPT architectures.

2.1 Large Language Models

LLMs, powered by transformer architectures, have demonstrated significant potential in educational applications, including personalized learning, content development, and language translation. Their human-like conversational abilities make them suitable for virtual assistants and online learning support systems.

3. Methodology

The core methodology involves administering the VNHSGE English dataset to the three LLMs. The dataset likely consists of standardized test questions assessing English language proficiency at the high school level. Performance is measured by the accuracy of the models' responses compared to the official answer key.
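This scoring procedure can be sketched as a straight comparison against the answer key. A minimal sketch, assuming multiple-choice items keyed by letter; the function name and item labels are hypothetical, not from the paper:

```python
# Hypothetical sketch of scoring a model's answers against the official key.

def score_accuracy(model_answers: dict[str, str], answer_key: dict[str, str]) -> float:
    """Return the percentage of items where the model's choice matches the key."""
    correct = sum(
        1 for item_id, key in answer_key.items()
        if model_answers.get(item_id, "").strip().upper() == key.upper()
    )
    return 100.0 * correct / len(answer_key)

# Toy 5-item example
key = {"Q1": "A", "Q2": "C", "Q3": "B", "Q4": "D", "Q5": "A"}
answers = {"Q1": "A", "Q2": "C", "Q3": "B", "Q4": "A", "Q5": "A"}
print(score_accuracy(answers, key))  # 80.0
```

In practice, free-form model responses would first need to be mapped to an option letter, a normalization step the sketch omits.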

4. Experimental Results

  • Microsoft Bing Chat: 92.4% accuracy on the VNHSGE English dataset
  • Google Bard: 86.0% accuracy on the VNHSGE English dataset
  • ChatGPT (GPT-3.5): 79.2% accuracy on the VNHSGE English dataset

Key Findings:

  • Performance Ranking: Microsoft Bing Chat (92.4%) outperformed both Google Bard (86.0%) and OpenAI ChatGPT (79.2%).
  • Practical Implication: Bing Chat and Bard are presented as viable alternatives to ChatGPT for English education in Vietnam, where ChatGPT access is restricted.
  • Human Comparison: All three LLMs surpassed the average performance of Vietnamese high school students on the same English proficiency test, indicating their potential as superior knowledge resources or tutoring aids.

Chart Description: A bar chart would effectively visualize this performance hierarchy, with the y-axis representing accuracy (%) and the x-axis listing the three LLMs. Bing Chat's bar would be the tallest, followed by Bard, then ChatGPT. A separate benchmark line could indicate the average Vietnamese student score for direct comparison.
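As an illustration, that hierarchy can be rendered directly from the reported scores. A plain-text sketch; only the percentages come from the paper, and the layout is our own (the student-average benchmark line is omitted because this summary does not report the exact figure):

```python
# Text rendering of the described bar chart; scores are from the paper.
scores = {"Bing Chat": 92.4, "Bard": 86.0, "ChatGPT (GPT-3.5)": 79.2}

# Sort descending by accuracy and draw one '#' per two percentage points.
ranked = sorted(scores.items(), key=lambda kv: -kv[1])
for name, acc in ranked:
    bar = "#" * round(acc / 2)
    print(f"{name:<20}{bar} {acc:.1f}%")
```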

5. Discussion

The results demonstrate the significant potential of commercially available LLMs as tools for English language education. The superior performance of Bing Chat may be attributed to its integration with a search engine, providing access to more current or context-specific information. The fact that all models outperformed human students highlights a paradigm shift, where AI can serve not just as an assistant but as a high-competency reference point, potentially personalizing instruction and providing instant, accurate feedback.

6. Original Analysis & Expert Commentary

Core Insight: This paper isn't just a benchmark; it's a market signal. In a region (Vietnam) where the flagship model (ChatGPT) is gated, the research proactively identifies and validates functional alternatives (Bing Chat, Bard), revealing a pragmatic, application-first approach to AI adoption in education. The finding that all LLMs outstrip average student performance isn't merely an academic point—it's a disruptive force, suggesting AI's role may evolve from a supplemental tool to a primary didactic agent or benchmark.

Logical Flow & Strengths: The methodology is straightforward and impactful: use a nationally recognized, high-stakes exam as the evaluation metric. This provides immediate, relatable credibility for educators and policymakers. The focus on accessibility ("what's actually available") over theoretical superiority is a major strength, making the research immediately actionable. It aligns with trends noted by institutions like the Stanford Institute for Human-Centered AI, which emphasize evaluating AI in real-world, constrained contexts.

Flaws & Critical Gaps: The analysis is surface-level. It reports scores but offers little on the nature of errors. Did models fail on grammar, reading comprehension, or cultural nuance? This black-box evaluation mirrors a limitation in the field itself. Furthermore, comparing to an "average" student score is statistically shallow. A more robust analysis, akin to the item-response theory used in psychometrics, could map model proficiency to specific skill levels on the test. The paper also completely sidesteps the critical issue of how to integrate these tools. Simply having a high-scoring AI doesn't translate to effective pedagogy, a challenge extensively documented in the International Journal of Artificial Intelligence in Education.

Actionable Insights: For educators in similar restricted-access markets, this paper is a playbook:

  1. Benchmark locally: Don't rely on global hype; test available tools against your specific curriculum.
  2. Look beyond the leader: Competitive models may offer sufficient, or contextually better, performance.
  3. Focus on the "how": The next urgent research phase must shift from whether LLMs work to how to deploy them responsibly: designing prompts that encourage critical thinking over answer retrieval, creating frameworks for AI-augmented assessment, and addressing equity in access.

The real victory won't be a higher AI test score, but improved human learning outcomes.

7. Technical Details & Mathematical Framework

While the paper does not delve into model architectures, the performance can be conceptualized through the lens of probability and task accuracy. The core evaluation metric is accuracy ($Acc$), defined as the ratio of correctly answered items to the total number of items ($N$).

$Acc = \frac{\text{Number of Correct Responses}}{N} \times 100\%$

For a more nuanced understanding, one could model an LLM's performance on a multiple-choice test item as a probability distribution over possible answers. Let the model's probability of selecting the correct answer $c$ from a set of options $O$ be $P_M(c | q, \theta)$, where $q$ is the question and $\theta$ represents the model's parameters and any retrieved context (particularly relevant for Bing Chat's search augmentation). The final score is an aggregation of these probabilities across all items. The performance gap between models suggests significant differences in their internal representations $\theta$ or their retrieval-augmentation mechanisms $R(q)$ for generating $P_M$.

$P_{\text{BingChat}}(c|q) \approx P(c|q, \theta_{\text{Bing}}, R_{\text{Web}}(q))$

$P_{\text{ChatGPT}}(c|q) \approx P(c|q, \theta_{\text{GPT-3.5}})$
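The aggregation described above can be sketched as an expected score: averaging each item's probability of selecting the correct answer, scaled to a percentage. The per-item probabilities below are hypothetical, chosen only to illustrate the calculation:

```python
# Sketch of aggregating per-item correct-answer probabilities P_M(c|q)
# into an expected accuracy. The probability values are hypothetical.

def expected_accuracy(p_correct: list[float]) -> float:
    """Expected test accuracy (%) given each item's P(correct)."""
    return 100.0 * sum(p_correct) / len(p_correct)

# Hypothetical probabilities for a 5-item test
print(expected_accuracy([0.875, 0.75, 1.0, 0.9375, 0.5]))  # 81.25
```

Under this view, the gap between Bing Chat and ChatGPT corresponds to systematically higher per-item probabilities, plausibly driven by the retrieval term $R_{\text{Web}}(q)$.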

8. Analysis Framework: A Non-Code Case Study

Scenario: An English department head in Hanoi wants to evaluate AI tools for supporting Grade 12 students.

Framework Application:

  1. Define Local Objective: Improve student performance on the grammar and reading comprehension sections of the VNHSGE.
  2. Tool Identification & Access Check: List available tools: Bing Chat (accessible), Google Bard (accessible), ChatGPT (requires VPN, not officially supported). Prioritize the first two based on this paper's findings.
  3. Granular Benchmarking: Don't just use full past papers. Create a focused diagnostic test:
    • Subset A: 20 grammar questions (tense, prepositions).
    • Subset B: 20 reading comprehension questions.
    • Administer subsets A & B to Bing Chat and Bard. Record not just accuracy, but also the reasoning provided in their answers.
  4. Error Analysis & Mapping: Categorize errors made by each AI. For example: "Bing Chat failed on 3/5 subjunctive mood questions; Bard gave concise but sometimes incomplete reasoning for inference questions."
  5. Integration Design: Based on the analysis: Use Bing Chat for grammar drill explanations due to higher accuracy. Use Bard's responses as "sample answers" for reading comprehension, but design a student worksheet that asks: "Compare Bard's summary to your own. What did it miss?" This promotes critical evaluation rather than passive acceptance.

This framework moves beyond "which AI is better" to "how can we use each AI's strengths strategically within our pedagogical constraints."
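Step 4's error mapping can be sketched as a simple tally of wrong answers by skill category per tool. A minimal sketch; the records below are hypothetical, mirroring the examples in the case study rather than real measurements:

```python
# Hypothetical error-mapping tally for step 4 of the framework above.
from collections import Counter

# (tool, skill category, answered correctly?)
records = [
    ("Bing Chat", "subjunctive", False),
    ("Bing Chat", "subjunctive", False),
    ("Bing Chat", "prepositions", True),
    ("Bard", "inference", False),
    ("Bard", "inference", True),
]

errors: dict[str, Counter] = {}
for tool, category, correct in records:
    if not correct:
        errors.setdefault(tool, Counter())[category] += 1

print(errors["Bing Chat"]["subjunctive"])  # 2
```

A real deployment would also log the models' stated reasoning alongside each record, as step 3 recommends.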

9. Future Applications & Research Directions

Immediate Applications:

  • Personalized Tutoring Systems: Deploying Bing Chat or Bard as the backbone for AI tutors that provide practice and explanation on demand, tailored to the VNHSGE syllabus.
  • Automated Material Generation: Using these LLMs to create practice questions, sample essays, and simplified explanations of complex texts aligned with the national curriculum.
  • Teacher Support Tool: Assisting teachers in grading, providing feedback on student writing, and generating lesson plan ideas.

Critical Research Directions:

  • Prompt Engineering for Pedagogy: Systematic research into designing prompts that force LLMs to explain reasoning, identify student misconceptions, or scaffold learning rather than just give answers.
  • Longitudinal Impact Studies: Does using an LLM tutor actually improve student learning outcomes and exam scores over a semester or year? Controlled studies are needed.
  • Multimodal Evaluation: Future high-stakes exams may include oral components. Evaluating LLMs' speech recognition and generation capabilities in an educational context is the next frontier.
  • Equity and Access: Research into mitigating the risk of widening the digital divide—ensuring benefits reach students in under-resourced schools without reliable internet or devices.
  • Cultural & Contextual Adaptation: Fine-tuning or developing retrieval mechanisms that allow global LLMs to better understand and reference local Vietnamese educational materials, history, and culture.

10. References

  1. Dao, X. Q. (2023). Performance Comparison of Large Language Models on VNHSGE English Dataset: OpenAI ChatGPT, Microsoft Bing Chat, and Google Bard. arXiv preprint arXiv:2307.02288v3.
  2. OpenAI. (2023). ChatGPT: Optimizing Language Models for Dialogue. OpenAI Blog.
  3. Kasneci, E., et al. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, 102274.
  4. Kung, T. H., et al. (2023). Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health, 2(2), e0000198.
  5. Stanford Institute for Human-Centered Artificial Intelligence (HAI). (2023). The AI Index 2023 Annual Report. Stanford University.
  6. International Society for Artificial Intelligence in Education (IAIED). International Journal of Artificial Intelligence in Education.
  7. Thorp, H. H. (2023). ChatGPT is fun, but not an author. Science, 379(6630), 313.