
Research: Why even smart LLMs fail at chemistry

DATE POSTED: June 9, 2025

Advancements in large language models (LLMs) have revolutionized various fields. The preprint “ChemAU: Harness the Reasoning of LLMs in Chemical Research with Adaptive Uncertainty Estimation” was authored by Xinyi Liu, Lipeng Ma, Yixuan Li, Weidong Yang, Qingyuan Zhou, Jiayi Song, and Shuhao Li of Fudan University, together with Ben Fei of The Chinese University of Hong Kong. The researchers investigate how to improve the performance of LLMs, which are typically strong in areas like mathematics and coding, when applied to complex chemistry problems that require specialized knowledge and reasoning.

LLMs excel in natural language understanding and reasoning, making them useful in diverse applications, including natural language processing, computer vision, and even legal and medical domains. Chain-of-thought reasoning and self-reflection techniques have further enhanced their inferential capabilities, demonstrating their potential in scientific fields. Specifically, LLMs have been used in chemistry for tasks like molecular property prediction and experimental protocol design.

Existing approaches to improve chemistry-focused LLMs

Current strategies for leveraging LLMs in chemistry fall into two primary categories. The first involves pre-training and fine-tuning domain-specific models. These models are initially pre-trained on chemical data (e.g., SMILES or SELFIES molecular representations) to learn relevant features. They are then fine-tuned on task-specific datasets for goals like chemical toxicity or drug solubility prediction. However, these models often have limited scale compared to general-purpose LLMs because they rely on smaller, high-quality datasets curated by experts. This limits their scalability and adaptability, as they are optimized for narrow tasks and specific input-output formats.
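
To make the molecular representations above concrete, here is a brief illustration of parsing SMILES strings with the open-source RDKit library; the example molecules are our own, not drawn from the paper:

```python
# Illustrative only: the example molecules are ours, not from the paper.
from rdkit import Chem

smiles_examples = {
    "ethanol": "CCO",
    "benzene": "c1ccccc1",
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
}

for name, smi in smiles_examples.items():
    mol = Chem.MolFromSmiles(smi)  # returns None for invalid SMILES
    n_heavy = mol.GetNumAtoms() if mol is not None else 0
    print(f"{name}: {smi} -> {n_heavy} heavy atoms")
```

A domain-specific model pre-trained on such strings learns chemistry-relevant features that a general LLM's tokenizer tends to fragment poorly.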

The second approach focuses on instruction-tuning general LLMs with chemistry knowledge. This retains the model’s general abilities while enhancing its chemistry expertise, and such models support diverse inputs, requirements, and dialogue capabilities, unlike smaller, specialized models. However, the large parameter counts and the need for extensive domain-specific training data greatly increase computational costs and resource demands, limiting broader applicability. Retrieval-Augmented Generation (RAG) is another approach, integrating knowledge through information retrieval. However, retrieved information can lack coherence and accuracy, introducing noise and degrading performance on tasks that require precision and context.

ChemAU: A new framework

To overcome these limitations, the researchers introduce ChemAU, a framework that combines the reasoning of general LLMs with the knowledge of chemistry-specific models. Building on recent advancements in uncertainty estimation, ChemAU incorporates a step-by-step mechanism to evaluate the LLM’s confidence at each reasoning stage. When the confidence is low, the specialized model is activated to supplement domain-specific knowledge, ensuring greater accuracy and reliability in solving chemistry problems.

Prior research in uncertainty estimation in LLMs utilizes token-level probability, self-verbalization, and semantic-similarity methods, often relying on predefined thresholds to classify responses. However, in chemistry, the logit values of chemistry-specific tokens tend to increase as reasoning progresses. This is likely because LLMs, trained on general text, struggle to represent domain-specific symbols effectively. The model gradually treats these tokens as “active vocabulary,” artificially increasing their probabilities, which undermines the accuracy of existing uncertainty estimation, especially in areas like chemistry with specialized terminology.
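
A common token-level formulation scores a reasoning step by the mean negative log-probability of its tokens. The sketch below is a minimal version under our own assumptions; the paper's exact estimator may differ:

```python
def step_uncertainty(token_logprobs: list[float]) -> float:
    """Mean negative log-probability of a step's tokens.

    Higher values indicate lower model confidence while generating
    the step. The per-token log-probabilities would come from the
    LLM (e.g., an API's logprobs output); averaging is our choice
    of aggregation, not necessarily the paper's.
    """
    if not token_logprobs:
        return 0.0
    return -sum(token_logprobs) / len(token_logprobs)

# A confident step vs. a hesitant one:
print(step_uncertainty([-0.05, -0.10, -0.02]))  # ~0.06
print(step_uncertainty([-1.80, -2.30, -0.90]))  # ~1.67
```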

To address this, ChemAU uses a step-wise uncertainty estimation method that adjusts uncertainty values depending on where each reasoning step falls within the entire chain. This enhances the model’s ability to identify where domain-specific expertise is needed, leading to better collaboration between general and specialized models.
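
The exact adjustment ChemAU applies is defined in the paper; the following is only an illustrative sketch of the idea, using an assumed linear scaling so that later steps, whose token probabilities tend to be artificially inflated, are treated with more suspicion:

```python
def adjusted_uncertainty(raw_uncertainty: float,
                         step_index: int,
                         total_steps: int,
                         alpha: float = 0.5) -> float:
    """Position-aware uncertainty adjustment (illustrative only).

    Because logits of chemistry tokens tend to inflate as the
    reasoning chain progresses, later steps have their raw
    uncertainty scaled up so that spuriously confident late steps
    can still trigger the specialized model. The linear form and
    `alpha` are our assumptions, not the paper's formula.
    """
    position = step_index / max(total_steps - 1, 1)  # 0.0 at start, 1.0 at end
    return raw_uncertainty * (1.0 + alpha * position)
```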

Figure 2 in the original preprint provides an overview of the ChemAU framework. It uses a general LLM to produce a reasoning chain for a given chemistry question, then assesses the uncertainty of each step sequentially. High uncertainty, often related to unfamiliar chemistry tokens, triggers the specialized chemistry model to evaluate the accuracy of the step and provide relevant knowledge. That knowledge is incorporated to guide subsequent steps, leading to more accurate output.

How ChemAU works in practice

The ChemAU framework involves three main components:

  • Adaptive uncertainty estimation: This dynamically assigns uncertainty values to each reasoning step based on its position in the reasoning chain. It addresses the shortcoming of traditional methods that rely on fixed thresholds, which are ill-suited to chemistry problems where the reliability of steps changes throughout the reasoning process.
  • Extraction and supplementation of chemistry knowledge: Decomposes each potentially erroneous step into atomic units of knowledge. A specialized chemistry model (Qwen2.5-1.5B-Instruct fine-tuned on a chemistry knowledge dataset) evaluates the accuracy of each knowledge point and provides precise and comprehensive knowledge when inaccuracies are detected. This targets knowledge deficiencies in the LLM during reasoning.
  • Adjustment of reasoning steps: Recognizes that successive steps in the reasoning chain are interconnected. A process that detects the specific step(s) where external knowledge is needed therefore improves outcomes; the sketch after this list ties the three components together.
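
The control loop below is a minimal sketch of how these components might fit together, in the spirit of the paper's Figure 2. It reuses the step_uncertainty and adjusted_uncertainty helpers sketched above; the threshold and the generate_reasoning_chain, verify_and_supplement, continue_chain, and final_answer interfaces are all our assumptions, not the paper's API:

```python
UNCERTAINTY_THRESHOLD = 1.0  # assumed constant; ChemAU adapts this per step

def solve_with_chemau(question, general_llm, chemistry_model):
    """Sketch of a ChemAU-style control loop (hypothetical interfaces)."""
    steps = general_llm.generate_reasoning_chain(question)
    verified = []
    i = 0
    while i < len(steps):
        u = adjusted_uncertainty(
            step_uncertainty(steps[i].token_logprobs), i, len(steps))
        if u > UNCERTAINTY_THRESHOLD:
            # Decompose the suspect step into atomic knowledge points;
            # the chemistry model checks each one and supplies
            # corrections where it finds errors.
            knowledge = chemistry_model.verify_and_supplement(steps[i].text)
            # Regenerate the remainder of the chain with the supplemented
            # knowledge in context; the regenerated step replaces the
            # suspect one.
            steps = verified + general_llm.continue_chain(
                question, verified, extra_knowledge=knowledge)
        verified.append(steps[i])
        i += 1
    return general_llm.final_answer(question, verified)
```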

The researchers used Qwen2.5-7B-Instruct, LLaMA3-8B-Instruct, and DeepSeek-R1-Distill-Qwen-14B to evaluate ChemAU across the GPQA, MMLU-Pro, and SuperGPQA datasets. Experimental results showed that the framework enhances LLM performance on chemistry problems.

Key findings and contributions
  • They propose ChemAU, which combines the reasoning capabilities of general LLMs with the domain knowledge of chemistry-specific models. The researchers believe this is the first framework to introduce a model-collaboration strategy for chemistry reasoning tasks.
  • They identified the pattern of inflated logit values for domain tokens and introduced a method that dynamically adjusts uncertainty values to perform stepwise uncertainty detection.
  • Experiments demonstrated an improvement in LLM performance on chemistry reasoning tasks through ChemAU, advancing domain-specific applications.

The paper also surveys related work. Because of the substantial computational requirements of LLMs, the community has moved away from the standard uncertainty estimation techniques used for conventional learned models, developing instead approximate techniques tailored to LLM architectures. Applications of LLMs in chemistry have likewise grown.

Experimental evaluation

The researchers conducted experiments using three open-source LLMs with different parameter sizes, along with a self-constructed chemistry domain model. All models used chain-of-thought prompting for step-by-step reasoning, with consistent prompt templates applied across the experiments. Answer accuracy was the primary metric.
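
The paper's exact templates are not reproduced here; the snippet below only illustrates the general shape of a chain-of-thought prompt for multiple-choice chemistry questions, with wording that is our own:

```python
# Illustrative chain-of-thought template; the wording is our
# assumption, not the template used in the paper.
COT_TEMPLATE = """You are solving a chemistry question.
Think step by step, numbering each reasoning step, then give
the final answer on its own line as "Answer: <choice>".

Question: {question}
Options: {options}
"""

prompt = COT_TEMPLATE.format(
    question="Which of the following molecules is aromatic?",
    options="A) cyclohexane  B) benzene  C) ethanol  D) methane",
)
print(prompt)
```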

The scientists examined the performance of general LLMs without enhancement, evaluated the chemistry model answering questions directly, tested retrieval-augmented generation (RAG) by retrieving chemistry knowledge and feeding it to the LLM, and compared ChemAU against these baselines.

The ChemAU method improved the general LLM’s accuracy on chemistry problems. On the MMLU-Pro dataset, using LLaMA-3 as the general model, ChemAU achieved 53.56% accuracy, a 26.12% improvement that even surpassed the performance of a larger 14B-parameter model. The researchers also compared two uncertainty estimation methods, and the experiments showed better performance with the method they proposed.

Ablation studies to confirm effectiveness

To confirm whether the chemistry domain model effectively compensates for knowledge gaps in the general LLM, the scientists conducted an ablation experiment and removed the chemistry domain model from the framework. This showed a significant performance decline, validating that general LLMs struggle on their own to correct domain-specific reasoning errors. They often have knowledge gaps or conceptual misunderstandings that can’t be solved through simple re-reasoning. This highlights the need for integrating domain-specific knowledge through specialized models.

Another ablation study examined the need for fine-grained uncertainty estimation and knowledge injection during the reasoning process. Evaluating the entire reasoning chain as a single unit can prevent the system from pinpointing critical knowledge gaps in individual steps, compromising both uncertainty detection and knowledge supplementation. Fine-grained detection at each step enabled more precise identification of knowledge gaps and allowed knowledge to be injected exactly where it was needed.
