Advancements in large language models (LLMs) have revolutionized various fields. The paper "ChemAU: Harness the Reasoning of LLMs in Chemical Research with Adaptive Uncertainty Estimation" is a preprint authored by Xinyi Liu, Lipeng Ma, Yixuan Li, Weidong Yang, Qingyuan Zhou, Jiayi Song, and Shuhao Li of Fudan University, together with Ben Fei of The Chinese University of Hong Kong. The researchers investigate how to improve the performance of LLMs, which are typically strong in areas like mathematics and coding, when they are applied to complex chemistry problems that require specialized knowledge and reasoning.
LLMs excel in natural language understanding and reasoning, making them useful in diverse applications, including natural language processing, computer vision, and even legal and medical domains. Chain-of-thought reasoning and self-reflection techniques have further enhanced their inferential capabilities, demonstrating their potential in scientific fields. Specifically, LLMs have been used in chemistry for tasks like molecular property prediction and experimental protocol design.
Existing approaches to improve chemistry-focused LLMs

Current strategies for leveraging LLMs in chemistry fall into two primary categories. The first involves pre-training and fine-tuning domain-specific models. These models are initially pre-trained on chemical data (e.g., SMILES or SELFIES molecular representations) to learn relevant features. They are then fine-tuned on task-specific datasets for goals such as predicting chemical toxicity or drug solubility. However, these models often have limited scale compared to general-purpose LLMs because they rely on smaller, high-quality datasets curated by experts. This limits their scalability and adaptability, as they are optimized for narrow tasks and specific input-output formats.
The second approach focuses on instruction-tuning general LLMs with chemistry knowledge. This retains the model's general abilities while enhancing its chemistry expertise. These models support diverse inputs, requirements, and dialogue capabilities, unlike smaller, specialized models. However, their large parameter sizes and the need for extensive domain-specific training data greatly increase computational costs and resource demands, limiting broader applicability. Retrieval-Augmented Generation (RAG) is another approach, integrating knowledge through information retrieval. However, retrieved information can lack coherence and accuracy, introducing noise and degrading performance on tasks that require precision and context.
ChemAU: A new framework

To overcome these limitations, the researchers introduce ChemAU, a framework that combines the reasoning of general LLMs with the knowledge of chemistry-specific models. Building on recent advancements in uncertainty estimation, ChemAU incorporates a step-by-step mechanism to evaluate the LLM's confidence at each reasoning stage. When confidence is low, the specialized model is activated to supplement domain-specific knowledge, ensuring greater accuracy and reliability in solving chemistry problems.
Prior research in uncertainty estimation in LLMs utilizes token-level probability, self-verbalization, and semantic-similarity methods, often relying on predefined thresholds to classify responses. However, in chemistry, the logit values of chemistry-specific tokens tend to increase as reasoning progresses. This is likely because LLMs, trained on general text, struggle to represent domain-specific symbols effectively. The model gradually treats these tokens as “active vocabulary,” artificially increasing their probabilities, which undermines the accuracy of existing uncertainty estimation, especially in areas like chemistry with specialized terminology.
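To make the token-level probability approach concrete, the sketch below computes a common sequence-level uncertainty score, the mean negative log-probability of the generated tokens. The log-probability values are hypothetical, chosen only to illustrate the failure mode described above: if chemistry tokens' probabilities inflate as reasoning progresses, later steps look spuriously confident.

```python
def sequence_uncertainty(token_logprobs):
    """Mean negative log-probability of a generated span.

    A standard token-level uncertainty score: higher values mean the
    model was less confident in the tokens it produced. If a domain
    token's probability is artificially inflated simply because it has
    entered the model's "active vocabulary," this score drops even
    when the content of the step is wrong.
    """
    if not token_logprobs:
        return 0.0
    return -sum(token_logprobs) / len(token_logprobs)

# Hypothetical log-probabilities for two reasoning steps over the same
# kind of chemistry tokens (e.g., SMILES fragments).
early_step = [-2.1, -1.8, -2.5]  # first encounter: genuinely uncertain
late_step = [-0.4, -0.3, -0.5]   # after repetition: inflated confidence

print(sequence_uncertainty(early_step))  # high uncertainty
print(sequence_uncertainty(late_step))   # misleadingly low uncertainty
```

A fixed threshold applied to this raw score would stop flagging the later step, which is exactly the problem the authors identify.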
To address this, ChemAU uses a step-wise uncertainty estimation method that adjusts uncertainty values depending on where each reasoning step falls within the entire chain. This enhances the model’s ability to identify where domain-specific expertise is needed, leading to better collaboration between general and specialized models.
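This summary does not give the exact adjustment function ChemAU uses, but one plausible minimal form of position-dependent correction is sketched below: a step's raw uncertainty is up-weighted according to how late it falls in the chain, counteracting the artificial confidence gain on domain tokens. The linear weighting here is an assumption for illustration, not the paper's formula.

```python
def adjusted_uncertainty(raw_uncertainty, step_index, total_steps):
    """Scale a step's raw uncertainty by its position in the chain.

    Assumption: later steps receive a larger multiplier, because the
    raw score tends to shrink there as domain tokens become falsely
    "familiar." The linear form is illustrative only.
    """
    position = (step_index + 1) / total_steps  # in (0, 1]
    return raw_uncertainty * (1.0 + position)  # later steps weighted more

# The same raw score triggers more strongly at the end of a 4-step chain.
print(adjusted_uncertainty(1.0, 0, 4))  # first step: 1.25
print(adjusted_uncertainty(1.0, 3, 4))  # last step: 2.0
```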
Figure 2 in the original preprint provides an overview of the ChemAU framework. It uses a general LLM to produce a reasoning chain for a given chemistry question, then assesses the uncertainty of each step sequentially. High uncertainty, often related to unfamiliar chemistry tokens, triggers the specialized chemistry model to evaluate the accuracy of the step and provide relevant knowledge. That knowledge is incorporated to guide subsequent steps, leading to more accurate output.
How ChemAU works in practice

The ChemAU framework involves three main components: a general LLM that generates the step-by-step reasoning chain, a step-wise uncertainty estimator that scores each step, and a specialized chemistry model that supplies domain knowledge when a step's uncertainty is high.
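The control flow described above can be sketched as a simple loop. Everything here is a stand-in: `general_llm`, `chem_model`, and `uncertainty` are hypothetical callables representing the three components, and the threshold value is illustrative, not taken from the paper.

```python
def chemau_pipeline(question, general_llm, chem_model, uncertainty,
                    threshold=1.0):
    """Minimal sketch of the ChemAU loop, under stated assumptions.

    general_llm(context) -> list of reasoning-step strings
    chem_model(question, step) -> domain-knowledge string
    uncertainty(step, index, total) -> float score for one step
    """
    context = question
    steps = general_llm(context)  # initial reasoning chain
    i = 0
    while i < len(steps):
        if uncertainty(steps[i], i, len(steps)) > threshold:
            # Low confidence: ask the chemistry model to check the step
            # and supply knowledge, then regenerate from this point on.
            knowledge = chem_model(question, steps[i])
            context = context + "\n" + knowledge
            steps = steps[:i] + general_llm(context)
        i += 1
    return steps
```

The loop advances even after a regeneration, so a step that stays uncertain does not stall the pipeline; the final chain reflects whatever knowledge was injected along the way.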
The researchers used Qwen2.5-7B-Instruct, LLaMA3-8B-Instruct, and DeepSeek-R1-Distill-Qwen-14B to evaluate ChemAU across the GPQA, MMLU-Pro, and SuperGPQA datasets. Experimental results showed that the framework enhances LLM performance on chemistry problems.
Key findings and contributions

The research also surveys related work. Because of the substantial computational requirements of LLMs, the community has moved away from the standard uncertainty estimation techniques used for smaller learned models, instead developing approximate techniques based on the LLM architecture. Applications of LLMs in chemistry have also grown.
Experimental evaluation

The researchers conducted experiments using three open-source LLMs with different parameter sizes, along with a self-constructed chemistry domain model. All models used chain-of-thought prompting for step-by-step reasoning, with consistent prompt templates across the experiments. Answer accuracy was the primary metric.
The scientists examined the performance of general LLMs without enhancements, evaluated the chemistry model answering questions directly, tested retrieval-augmented generation (RAG) by retrieving chemistry knowledge and feeding it to the LLM, and compared ChemAU against these baselines.
The ChemAU method improved the general LLM's accuracy on chemistry problems. On the MMLU-Pro dataset, using LLaMA-3 as the general model, ChemAU achieved 53.56% accuracy, a 26.12% improvement over the baseline. This even surpassed the performance of a larger 14B-parameter model. The researchers also compared two uncertainty estimation methods, and the experiments showed better performance with the step-wise method they proposed.
Ablation studies to confirm effectiveness

To confirm whether the chemistry domain model effectively compensates for knowledge gaps in the general LLM, the scientists conducted an ablation experiment in which the chemistry domain model was removed from the framework. This caused a significant performance decline, validating that general LLMs struggle on their own to correct domain-specific reasoning errors: they often have knowledge gaps or conceptual misunderstandings that cannot be resolved through simple re-reasoning. This highlights the need to integrate domain-specific knowledge through specialized models.
Another ablation study examined the need for fine-grained uncertainty estimation and knowledge injection during the reasoning process. Evaluating the entire reasoning chain as a single unit can prevent the system from pinpointing critical knowledge gaps in individual steps, compromising both uncertainty detection and knowledge supplementation. Fine-grained detection at each step enabled more precise identification of knowledge gaps and allowed knowledge to be injected exactly where it was needed.
Copyright Central Coast Communications, Inc. All Rights Reserved.