Zihan Zhao1, Bo Chen1,2, Jinbiao Li1,2, Da Ma1, Lu Chen1,2*, Kai Yu1,2*, Xin Chen2*
1Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
2Suzhou Laboratory, Suzhou, China
*E-mail: chenlusz@sjtu.edu.cn; kai.yu@sjtu.edu.cn; mail.xinchen@gmail.com
The rapid development of AI tools is expected to offer unprecedented assistance to research in chemistry and materials science. However, neither existing task-specific models nor emerging general-purpose large language models (LLMs) can cover the wide range of data modalities and task categories in these fields. The specialized language and knowledge of the disciplines, including various forms of molecular representations and spectroscopic methods, hinder the performance of general-domain LLMs.
We first developed a 13B LLM trained on 34B tokens from chemical literature, textbooks, and instructions. The resulting model, ChemDFM1, can store, understand, and reason over chemical knowledge while retaining general language comprehension capabilities. In our quantitative evaluation, ChemDFM surpasses GPT-4 on most chemical tasks despite the significant difference in model size. In an extensive third-party benchmark2, ChemDFM significantly outperforms most representative open-source LLMs.
We further developed a multi-modal LLM for chemistry and materials science: ChemDFM-X. It is trained on diverse multimodal data, including SMILES strings, molecular graphs (encoded via GNNs), mass spectra, and IR spectra, which together form a large domain-specific training corpus of 7.6M samples. ChemDFM-X is evaluated in extensive experiments on various cross-modality tasks. The results demonstrate the great potential of ChemDFM-X in inter-modal knowledge comprehension.
This study illustrates the potential of LLMs as co-scientists across chemistry and materials science tasks. A few examples of using ChemDFM-X to assist materials research will be demonstrated.
Keywords: Multi-modality, Large Language Model, Spectroscopy, Materials Science
References
1. Zhao, Z.H. et al. "ChemDFM: Dialogue Foundation Model for Chemistry." https://arxiv.org/abs/2401.14818
2. Feng, K.H. et al. "SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models." https://arxiv.org/abs/2406.09098
Dr. Runhai Ouyang (DCTMD2024@163.com)