News

Shanghai Jiao Tong University Unveils Large Language Model for Chemical Synthesis

Date: 2025-07-28

A joint research team from the Institute of Artificial Intelligence and the Center for Transformative Molecular Science at Shanghai Jiao Tong University (SJTU) has achieved a major original breakthrough in the field of AI for Chemistry. Their work introduces a new chemical synthesis large language model (LLM) named Chemma, which significantly accelerates the entire process of organic chemistry synthesis.

The research, titled "Large language models to accelerate organic chemistry synthesis," was published online in the journal Nature Machine Intelligence on July 1, 2025. The study highlights the potential of general-purpose large AI models to empower organic chemical synthesis.

Paper information:

Yu Zhang, Yang Han, Shuai Chen, Ruijie Yu, Xin Zhao, Xianbin Liu, Kaipeng Zeng, Mengdi Yu, Jidong Tian, Feng Zhu*, Xiaokang Yang*, Yaohui Jin*, and Yanyan Xu*, Large language models to accelerate organic chemistry synthesis. Nature Machine Intelligence, 1-13, 2025.

Model online website:

https://ai4chem.sjtu.edu.cn

Research Background:

Chemical synthesis is a fundamental method for creating transformative molecules, with a major impact across fields such as life sciences, materials, and energy. Despite significant advances in chemical instrumentation over the past few decades, chemists facing a vast reaction space and complex molecular structures must still repeatedly consult the literature, design protocols, and conduct wet-lab experiments. Traditional AI methods, such as those based on Density Functional Theory (DFT) calculations or Bayesian optimization, have made progress on specific tasks but have distinct limitations. They typically rely heavily on expert knowledge for feature engineering and molecular parameterization. Most can only optimize within a "closed reaction space" pre-set by experts (e.g., a fixed library of ligands or solvents), so unknown, better-performing options outside that space may be missed. In recent years, large language models (LLMs) such as GPT have demonstrated powerful general capabilities, but their application in chemistry is still in its early stages. Their chemical expertise is limited, which makes it difficult for them to autonomously explore and optimize previously unreported reactions.

Chemma's Core Goal

To overcome these challenges, the researchers posed a core question: Can we build an LLM deeply integrated with chemical knowledge that both understands chemical structures and rules from SMILES strings and reaction data, as a human chemist does, and retains the strong generative capabilities of an LLM, thereby enabling genuine exploration and discovery in an open reaction space?

The research thus proposes and designs Chemma. This model, one of the achievements of the Magnolia Chemical Large Model, is designed to be a generative AI assistant that can interact with chemists, assist in experimental decision-making, and ultimately accelerate the process of organic synthesis. Chemma's key capabilities include: (1) learning molecular representations and understanding chemical structures from SMILES sequences; (2) learning the complex relationships between reactants, products, and conditions through pre-training on massive reaction data, much as a chemist does; and (3) using its generative ability to design novel molecules (e.g., recommending new ligands), which breaks through the constraints of pre-set conditions and guides the exploration of new reactions.
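To make the idea of reading reactions as text concrete, the following minimal sketch splits a reaction SMILES string of the conventional form "reactants>agents>products" into its parts and wraps the product in a text prompt. It is an illustration only: the function names and the prompt wording are hypothetical, and the article does not state the input format Chemma actually uses.

```python
def parse_reaction_smiles(rxn: str) -> dict:
    """Split 'reactants>agents>products' into lists of molecule SMILES.

    Molecules within each part are separated by '.', as in the usual
    reaction SMILES convention. Empty parts give empty lists.
    """
    parts = rxn.split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators")
    reactants, agents, products = (
        [s for s in part.split(".") if s] for part in parts
    )
    return {"reactants": reactants, "agents": agents, "products": products}


def retrosynthesis_prompt(product: str) -> str:
    """Hypothetical prompt wording; NOT the format used by Chemma."""
    return f"Product: {product}\nReactants:"


# Esterification of acetic acid with ethanol to give ethyl acetate
# (the water by-product is omitted, as is common in reaction datasets).
rxn = "CC(=O)O.OCC>>CC(=O)OCC"
parsed = parse_reaction_smiles(rxn)
prompt = retrosynthesis_prompt(parsed["products"][0])
```

Here `parsed["reactants"]` holds the two starting materials and `prompt` is the kind of plain-text query a language model could be asked to complete with reactant SMILES.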

Figure 1. Functions and applications of Chemma.

The team validated Chemma's performance across multiple chemical benchmark tasks. On the USPTO-50k dataset, Chemma achieved a Top-1 accuracy of 72.2% in the single-step retrosynthesis task, well above the best previously reported Top-1 accuracy of 57.7%. In multi-step synthesis testing, Chemma designed reasonable reaction steps, which were subsequently validated by experts. For yield prediction and selectivity prediction tasks, the latter covering regioselectivity and enantioselectivity, Chemma achieved an R² of 0.88 on high-throughput experimentation (HTE) data without the need for DFT features. In the ligand and catalyst recommendation task, Chemma could identify the optimal ligand under pre-set conditions: in the majority of test combinations, its recommended ligands gave a higher median yield, with an overall accuracy of 93.7%. Supported by the Center for Transformative Molecular Science, Chemma can design and generate online more than 20 types of catalysts, over 10 types of reagents, and various additives for specific reactions, which supports experimental optimization and improves the efficiency of chemical experiments.
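For readers unfamiliar with the two metrics quoted above, the sketch below shows how Top-1 accuracy and R² are commonly computed. The data are toy values chosen for illustration, not results from the paper, and the function names are assumptions. In practice, retrosynthesis predictions are usually compared after SMILES canonicalization, which this sketch skips.

```python
def top1_accuracy(ranked_predictions, references):
    """Fraction of cases whose highest-ranked prediction equals the reference.

    Assumes predictions and references are already in the same canonical
    string form, so plain string equality is a valid comparison.
    """
    hits = sum(
        1
        for ranked, ref in zip(ranked_predictions, references)
        if ranked and ranked[0] == ref
    )
    return hits / len(references)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data: placeholder labels stand in for reactant sets, and the
# yields are invented percentages.
ranked = [["A", "B"], ["C"], ["D", "E"], ["F"]]
truth = ["A", "X", "D", "F"]
acc = top1_accuracy(ranked, truth)

measured = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
r2 = r_squared(measured, predicted)
```

With these toy inputs, three of the four top-ranked predictions match, and the predicted yields track the measured ones closely, so R² is near 1.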

Figure 2. Performance evaluation of Chemma’s capabilities for different organic synthesis tasks using both open benchmark and HTE datasets.

The model is capable not only of reaction prediction but also of exploring, designing, and optimizing reactions in unknown reaction space. With the strong support of Associate Professor Feng Zhu from the Center for Transformative Molecular Science, the team conducted wet-lab validation. For a previously unreported N-heterocyclic cross-coupling reaction, the researchers integrated Chemma into an active learning framework to explore suitable ligands and solvents. In this "human-machine collaboration" cycle, the first round of attempts failed; the experimental results were then fed back to Chemma for online fine-tuning, and in the second round it recommended the highly efficient ligand PAd3. Ultimately, a 67% isolated yield was achieved with only 15 experiments. This task demonstrates Chemma's potential to assist in exploring unknown reaction conditions within an open reaction space.
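The recommend, test, and feed-back cycle described above can be sketched as a simple loop with an experiment budget. Everything in this sketch is a stand-in: `toy_recommend` is a hand-written heuristic in place of the fine-tuned model, `TOY_YIELDS` is an invented lookup table in place of wet-lab experiments, and the ligand and solvent labels are placeholders. It is not the team's implementation.

```python
from itertools import product

# Invented yields (%) for placeholder ligand/solvent pairs.
TOY_YIELDS = {
    ("L1", "S1"): 0.0, ("L1", "S2"): 5.0, ("L1", "S3"): 3.0,
    ("L2", "S1"): 10.0, ("L2", "S2"): 20.0, ("L2", "S3"): 15.0,
    ("L3", "S1"): 40.0, ("L3", "S2"): 70.0, ("L3", "S3"): 55.0,
}


def toy_recommend(pool, history, n, prior=50.0):
    """Rank untried conditions by mean observed yield of their ligand and
    solvent; unseen values get an optimistic prior to encourage exploration."""
    def mean_yield(index, value):
        ys = [y for cond, y in history if cond[index] == value]
        return sum(ys) / len(ys) if ys else prior

    ranked = sorted(
        pool,
        key=lambda c: mean_yield(0, c[0]) + mean_yield(1, c[1]),
        reverse=True,
    )
    return ranked[:n]


def active_learning(candidates, recommend, run_experiment,
                    budget=15, batch_size=3, target=60.0):
    """Propose a batch, run it, feed results back, and repeat until the
    target yield is reached or the experiment budget is spent."""
    history = []  # list of (condition, yield) pairs
    while len(history) < budget:
        tried = {cond for cond, _ in history}
        pool = [c for c in candidates if c not in tried]
        batch = recommend(pool, history, min(batch_size, budget - len(history)))
        if not batch:
            break
        for cond in batch:
            history.append((cond, run_experiment(cond)))
        if max(y for _, y in history) >= target:
            break
    best = max(history, key=lambda rec: rec[1], default=None)
    return best, history


candidates = list(product(["L1", "L2", "L3"], ["S1", "S2", "S3"]))
best, history = active_learning(candidates, toy_recommend, TOY_YIELDS.get)
```

In this toy run the first batch (all with ligand L1) gives poor yields, the feedback steers the second batch toward untried ligands, and the loop stops once a condition exceeds the target. In the workflow the article describes, the feedback step is online fine-tuning of the model rather than a re-scored heuristic.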

Figure 3. Illustrations of using Chemma to explore reaction spaces.

Research Significance 

In this study, the research team treated chemical reactions as a natural language task and had the model learn their structure and rules. The model performed strongly across multiple organic chemistry tasks and collaborated well with human chemists. In particular, its accurate prediction of yield and selectivity without DFT calculations and its ability to optimize reactions in an open space demonstrate the applicability of language models to chemical synthesis.

Author information

First Author: Yu Zhang, Ph.D. student at the Institute of Artificial Intelligence, Shanghai Jiao Tong University.

Key Wet-Lab Contributors: Yang Han and Shuai Chen, Ph.D. students at the Center for Transformative Molecular Science.

Corresponding Authors: Associate Professor Yanyan Xu, Professor Yaohui Jin, and Professor Xiaokang Yang from the School of Computer Science, along with Associate Professor Feng Zhu from the Center for Transformative Molecular Science, Shanghai Jiao Tong University.

Guidance and Support: Academician Ding Kuiling provided valuable advice and guidance for this research.

Team Profile 

The AI for Science team at the School of Artificial Intelligence, Shanghai Jiao Tong University, is led by Professor Xiaokang Yang, Professor Yaohui Jin, and Associate Professor Yanyan Xu. The team comprises over ten post-doctoral fellows and master's and Ph.D. students. The lab addresses fundamental challenges in complex system optimization by developing AI-driven solutions. Since 2023, the team has placed particular emphasis on AI for chemistry. A key milestone has been the development of BAI-Chem, a suite of large models covering chemical synthesis, molecular design, and mass spectrometry. Currently, the team is collaborating with chemists to build a fully autonomous chemical laboratory, which combines embodied AI and large models to achieve completely automated chemical synthesis. Building on this foundation, the team's goal is to tackle unsolved problems in organic synthesis by developing advanced AI methodologies.