Can AI pass school exams? K12Vista puts top models to the test

DATE POSTED: June 9, 2025

New research sheds light on the capabilities of multimodal large language models (MLLMs) in handling complex K-12 educational material. This article will summarize the findings of the research paper titled “K12Vista: Exploring the Boundaries of MLLMs in K-12 Education” by Chong Li, Chenglin Zhu, Tao Zhang, Mingan Lin, Zenan Zhou, and Jian Xie from Baichuan Inc. and Peking University. The study introduces a comprehensive benchmark, K12Vista, and associated tools to evaluate MLLMs’ understanding and reasoning skills across various subjects and question types relevant to primary and secondary education.

Researchers introduce a new benchmark for evaluating AI in K-12 education

A foundation of 21st-century skills is understanding the science concepts taught from kindergarten through 12th grade (K-12). Mastering these concepts requires more than memorization: students must apply logical thinking, solve problems step by step, and draw on specialized knowledge in different subjects. These skills are essential for navigating real-world challenges, such as writing code, analyzing data, and planning business ventures. Furthermore, K-12 education uses different types of evaluations, like multiple-choice, fill-in-the-blank, and open-ended questions, to assess a student's knowledge comprehensively. How well MLLMs perform on K-12 material is therefore a meaningful indicator of their general understanding and reasoning abilities.

However, current methods for testing MLLMs on K-12 material have limitations. Some evaluations focus on a single subject, lack sufficient data, use only one type of question, or check only the final answer instead of evaluating the model's reasoning process. As a result, they fail to accurately measure the full capabilities of these models. To tackle these issues, the researchers developed K12Vista, a multimodal benchmark designed to provide a more thorough evaluation of MLLMs within the context of K-12 education.

The K12Vista benchmark offers a more complete assessment

The K12Vista benchmark includes 33,000 questions across five subjects: mathematics, physics, chemistry, biology, and geography. The questions cover topics from primary to high school levels and include multiple-choice, fill-in-the-blank, and free-response formats. Each question contains extra information, such as the grade level, type of question, specific concepts covered, and the difficulty level. This additional information allows users to examine the model’s performance in detail, considering different subjects, grade levels, and question types. Importantly, K12Vista goes beyond just checking the final answer. It introduces a step-by-step evaluation method using a custom-built process evaluation model called K12-PEM. This model identifies the key reasoning steps in the MLLM’s response, assesses the accuracy of each intermediate step and answer, classifies and analyzes any errors made, and calculates a total score for the response. Therefore, this approach reveals insights into the quality of the model’s reasoning, unlike superficial assessments that only look at the final answer.
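
To make the shape of such a step-level evaluation concrete, here is a minimal Python sketch of what a K12-PEM-style output record might look like. The class and field names are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class StepJudgment:
    """One reasoning step of an MLLM response, as judged by the process evaluator."""
    step_text: str
    is_correct: bool
    error_type: str | None = None   # e.g. "calculation error"; None when the step is correct
    explanation: str = ""           # evaluator's rationale for the judgment

@dataclass
class ProcessEvaluation:
    """Hypothetical container for the step-by-step evaluation of one model response."""
    question_id: str
    steps: list[StepJudgment] = field(default_factory=list)
    final_answer_correct: bool = False
```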

Furthermore, the researchers created a massive dataset called K12-PEM-800K, featuring step-by-step evaluations of MLLMs' reasoning, to facilitate the development of the K12-PEM model. They also developed K12-PEBench, a high-quality, human-annotated benchmark for evaluating the process-evaluation abilities of MLLMs.


Why is evaluating the reasoning process so important?

A model's reasoning process is a critical factor in overall performance. Some models arrive at the correct answer through flawed reasoning, and a superficial correct/incorrect evaluation would not reveal these flaws. Other models may appear to improve their answers through Chain-of-Thought (CoT) reasoning, in which they mimic step-by-step human thinking to solve complex problems. Evaluating the quality of that reasoning is therefore essential, yet it has not been fully studied.

Researchers tested several advanced MLLMs on K12Vista. Results indicate that models with enhanced reasoning capabilities, such as Gemini2-Thinking, demonstrate better performance. Additional analysis exposed the MLLMs' reasoning gaps, providing essential information for the development of next-generation models.

How researchers built the K12Vista dataset and evaluation tools

The K12Vista project had three stages. The researchers sourced questions from offline school exams rather than textbooks or online question banks to reduce the risk of data contamination, gathering material from data providers over a six-month period.

Collecting and refining the K12Vista data

First, questions were extracted from the original PDF documents and automatically converted into LaTeX files using OCR software (Mathpix) to retrieve the text; the LaTeX files were then converted into JSONL format. The team resized images to standardized dimensions, while all mathematical and scientific formulas remained in native LaTeX notation to maintain accuracy. Through these steps, the researchers created a large-scale question bank of approximately 300,000 questions covering the entire K-12 education spectrum, multiple disciplines, diverse knowledge points, and varied question formats. This question bank served as the metadataset for K12Vista.
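
As a rough illustration of this extraction step, the sketch below converts OCR'd LaTeX files into JSONL records and resizes the associated images. The directory layout, the 768x768 target size, and the record fields are assumptions for illustration; the paper does not publish its pipeline code.

```python
import json
from pathlib import Path

from PIL import Image  # pip install pillow

TARGET_SIZE = (768, 768)  # assumed standardized image dimensions

def build_question_records(latex_dir: Path, image_dir: Path, out_path: Path) -> None:
    """Turn per-question OCR output (.tex files) into JSONL records with resized images."""
    with out_path.open("w", encoding="utf-8") as out:
        for tex_file in sorted(latex_dir.glob("*.tex")):
            latex_text = tex_file.read_text(encoding="utf-8")
            image_file = image_dir / f"{tex_file.stem}.png"
            if image_file.exists():
                # Resize the figure to standardized dimensions; formulas stay as LaTeX text.
                Image.open(image_file).resize(TARGET_SIZE).save(image_file)
            record = {
                "id": tex_file.stem,
                "question_latex": latex_text,  # native LaTeX notation preserved
                "image": image_file.name if image_file.exists() else None,
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
```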

Second, they applied rule-based filters to ensure quality, removing blurry images and those with resolutions below a predefined threshold. The team also developed a specialized prompt framework based on the Qwen-72B-Instruct model to validate the structural integrity of the metadataset. Finally, entries with JSON parsing errors, such as missing answer fields, garbled question text, or incomplete metadata, were removed. Following these steps, the researchers obtained approximately 160,000 valid questions.
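
A minimal filtering pass along these lines could look as follows; the resolution threshold and the required metadata fields are illustrative assumptions rather than the study's actual rules.

```python
import json

MIN_RESOLUTION = 224  # assumed minimum width/height in pixels
REQUIRED_FIELDS = ("question", "answer", "subject", "grade")  # assumed metadata fields

def is_valid_entry(jsonl_line: str, image_size: tuple[int, int] | None) -> bool:
    """Keep an entry only if it parses as JSON, carries all metadata, and its image is sharp enough."""
    try:
        record = json.loads(jsonl_line)
    except json.JSONDecodeError:
        return False  # drop entries with JSON parsing errors
    if any(not record.get(name) for name in REQUIRED_FIELDS):
        return False  # drop entries with missing or empty fields
    if image_size is not None and min(image_size) < MIN_RESOLUTION:
        return False  # drop images below the resolution threshold
    return True
```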

Third, to further optimize data quality, the team systematically enhanced the dataset by:

  • Filtering out low-challenge questions to refine difficulty gradients.
  • Excluding questions solvable by text-only inputs to ensure strict multimodal reasoning dependency.

Subsequently, the researchers clustered questions based on their manually annotated knowledge points, identifying 17,000 core knowledge units. The team then adopted a stratified sampling strategy, ensuring a minimum sample size of 1,000 questions for each discipline-grade-question-type combination to maintain balanced coverage. At the same time, they sampled uniformly across the core knowledge points, requiring at least one representative instance of each key knowledge point within the evaluation subsets.
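
The sketch below shows a simplified version of such a stratified sampling pass. The key names and the backfill step for knowledge-point coverage are assumptions made for illustration; the actual procedure in the paper is more involved.

```python
import random
from collections import defaultdict

def stratified_sample(questions: list[dict], per_stratum: int = 1000, seed: int = 0) -> list[dict]:
    """Simplified stratified sampling over (subject, grade, question_type) strata.

    Each question dict is assumed to carry 'subject', 'grade', 'question_type',
    and 'knowledge_point' keys.
    """
    rng = random.Random(seed)
    strata: dict[tuple, list[dict]] = defaultdict(list)
    for q in questions:
        strata[(q["subject"], q["grade"], q["question_type"])].append(q)

    sampled: list[dict] = []
    covered_points: set[str] = set()
    for group in strata.values():
        rng.shuffle(group)
        picked = group[:per_stratum]       # cap each stratum at the target size
        sampled.extend(picked)
        covered_points.update(q["knowledge_point"] for q in picked)

    # Backfill so every core knowledge point appears at least once.
    for q in questions:
        if q["knowledge_point"] not in covered_points:
            sampled.append(q)
            covered_points.add(q["knowledge_point"])
    return sampled
```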

K12Vista contains three types of questions:

  • Multiple-choice questions: Each question provides four options with only one correct answer.
  • Fill-in-the-blank: The model fills in the blanks with the correct answer to complete the sentence or article.
  • Free-response questions: The model uses its knowledge, understanding, and thinking skills to respond in writing to the questions posed.

Ensuring data quality through manual validation

To ensure the quality of the K12Vista benchmark, the team implemented a careful manual verification process. They used Qwen2.5-VL-72B to reconstruct the raw, unstructured reference solutions into logically clear, structured reasoning steps, providing a high-quality standard of solutions. A validation team of ten senior undergraduate students then meticulously reviewed each data item, conducting multidimensional verification of the question content, images, and reconstructed reference solutions to correct logical fallacies or scientific inaccuracies and to ensure uniformly standardized step-by-step solutions. The result was high-quality benchmark data for process evaluation.

K12-PEM-800K dataset and K12-PEBench assessment

To enable reliable evaluation of CoT reasoning processes, the researchers first analyzed common errors. The team collected CoT solutions from various MLLMs for each question in K12Vista. A comprehensive analysis of the errors MLLMs make during CoT reasoning helped define nine step-wise error categories, including image cognition error, question misunderstanding, lack of relevant knowledge, knowledge application error, logical reasoning error, hallucination error, calculation error, and incomplete answer error.
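
The categories named above map naturally onto a small enumeration; the sketch below lists only the eight categories mentioned in this article.

```python
from enum import Enum

class StepErrorType(str, Enum):
    """Step-wise error categories named in the article (eight of the nine are listed here)."""
    IMAGE_COGNITION = "image cognition error"
    QUESTION_MISUNDERSTANDING = "question misunderstanding"
    LACK_OF_KNOWLEDGE = "lack of relevant knowledge"
    KNOWLEDGE_APPLICATION = "knowledge application error"
    LOGICAL_REASONING = "logical reasoning error"
    HALLUCINATION = "hallucination error"
    CALCULATION = "calculation error"
    INCOMPLETE_ANSWER = "incomplete answer error"
```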

CoT outputs were collected for the K12Vista data from 40 MLLMs, including GPT-4o, the QwenVL series, the InternVL2.5 series, and the LLaVA-OneVision series. These outputs were then decomposed with GPT-4o into structured, step-wise reasoning paths. The resulting paths were passed to a panel of judge models, including GPT-4o, Gemini2-Thinking, Qwen2.5-VL-72B, and InternVL2.5-78B-MPO, which conducted a step-level evaluation: each model judged the correctness of every reasoning step and labeled its error type, with final labels determined by majority vote. GPT-4o then generated step-specific explanations, producing a triple tag (correctness, error type, explanation) for each step. During this process, the team filtered out samples with malformed formats or unreasonable explanations. K12-PEM-800K emerged from this thorough process.
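
A hedged sketch of the majority-vote step might look like this, where each judge contributes a (correctness, error type) pair per reasoning step; the tie-breaking rule is an arbitrary choice for illustration.

```python
from collections import Counter

def majority_vote(judgments: list[tuple[bool, str | None]]) -> tuple[bool, str | None]:
    """Aggregate one reasoning step's judgments from several judge models.

    Each judgment is assumed to be (is_correct, error_type or None); ties are
    resolved as 'incorrect' here, which is an illustrative choice.
    """
    correct_votes = sum(1 for is_correct, _ in judgments if is_correct)
    if correct_votes > len(judgments) / 2:
        return True, None
    # Pick the most frequently assigned error type among the judges that flagged an error.
    error_types = [etype for ok, etype in judgments if not ok and etype]
    most_common = Counter(error_types).most_common(1)
    return False, most_common[0][0] if most_common else None
```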

The team selected from the collected CoT evaluation samples to create K12-PEBench, a high-quality set of over 3,000 samples with rich reasoning content. A separate team of undergraduate students then performed a second round of manual annotation on these data, judging the correctness of each reasoning step and identifying errors, ensuring that K12-PEBench provides reliable feedback.

How AI models were evaluated using K12Vista

The researchers developed two main ways to evaluate the AI models: the direct inference evaluation and the CoT reasoning step-by-step evaluation.

Direct inference evaluation

In this type of evaluation, the model produces answers directly, without intermediate reasoning steps. Afterwards, Qwen2.5-VL-72B extracts the final answer from the model's output, which is then compared against the reference answer to determine correctness and compute the score.
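
In code, this amounts to a simple extract-and-compare step, sketched below; the `extract_answer` callable stands in for the Qwen2.5-VL-72B extraction step, and the exact-match comparison is a simplification of whatever matching rule the study actually uses.

```python
from typing import Callable

def score_direct_answer(model_output: str, reference_answer: str,
                        extract_answer: Callable[[str], str]) -> float:
    """Extract the final answer from a direct-inference response and compare it to the reference.

    `extract_answer` is a placeholder for the Qwen2.5-VL-72B extraction step.
    """
    predicted = extract_answer(model_output).strip().lower()
    reference = reference_answer.strip().lower()
    return 1.0 if predicted == reference else 0.0
```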

CoT reasoning step-by-step evaluation

In the step-by-step evaluation mode, the evaluation model scores the evaluated MLLM's entire CoT output given the problem, its ground-truth answer, and the reference solution.
For each CoT output, the evaluator separates the reasoning process and the answers to any sub-problems into discrete steps. Finally, each sub-problem answer and reasoning step is weighted equally, so the final score reflects the quality of both the answer and the reasoning process.
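
Read literally, the equal-weighting rule can be sketched as below; this is an illustrative reading of the description above, not the paper's exact scoring formula.

```python
def cot_score(step_correct: list[bool], sub_answer_correct: list[bool]) -> float:
    """Weight every reasoning step and sub-problem answer equally (illustrative scoring sketch)."""
    units = step_correct + sub_answer_correct
    return sum(units) / len(units) if units else 0.0

# Example: 3 of 4 reasoning steps correct, 1 of 2 sub-answers correct -> (3 + 1) / 6 ≈ 0.67
print(cot_score([True, True, True, False], [True, False]))
```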

Findings: How well do current AI models perform on K12Vista?

The researchers evaluated a range of AI models, including closed-source models like GPT-4o and Gemini2-Thinking, and open-source models like Qwen2.5-VL and InternVL2.5. To evaluate text-only LLMs, the team generated captions for the question images and concatenated the captions with the question text as the LLM inputs. Closed-source models were evaluated via official APIs, while open-source models were assessed using vLLM on NVIDIA H200 GPUs with default vLLM parameters.
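
The caption-plus-question setup for text-only models might look roughly like the following vLLM sketch. The model name, prompt template, and sampling settings are assumptions for illustration; they are not the exact configuration used in the study.

```python
from vllm import LLM, SamplingParams  # pip install vllm

# Illustrative open-source text-only model; closed-source models go through their official APIs.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
params = SamplingParams(max_tokens=512, temperature=0.0)

def answer_with_caption(caption: str, question: str) -> str:
    """Concatenate an image caption with the question text, as described for text-only LLMs."""
    prompt = f"Image description: {caption}\n\nQuestion: {question}\nAnswer:"
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text
```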

The AI models' results varied across the different test settings. Gemini2-Thinking consistently achieved the top accuracy, showcasing its strength in multimodal comprehension and reasoning, while the other tested models showed weaker reasoning performance. Across all MLLMs, accuracy consistently decreased as the material's grade level increased. Moreover, visual information plays a critical role in producing correct responses on K12Vista.

Across the multiple-choice, fill-in-the-blank, and free-response question types, all models scored below 60%. One contributing factor to the lower scores is the more complex knowledge integration and generation needed for fill-in-the-blank questions. Similarly, performance gaps widened on free-response questions because of the reasoning they require.

Comparing performance across K-12 subjects, Chemistry, Biology, and Geography questions yielded relatively better results, as they rely more on factual and rule-based reasoning. Mathematics and Physics questions involve more abstract concepts and require quantitative and logical analysis, making them considerably more challenging.

Analysis of step-wise errors revealed significant variations across subjects, question types, and grade levels. Geography questions showed a higher proportion of comprehension errors involving images and the surrounding text. Mathematics and Physics questions were more complex to both comprehend and solve, while fill-in-the-blank questions often contained multiple sub-questions. Finally, logical reasoning errors became more prominent at lower grade levels.
