Unexpected Failure and Success in Data-Driven Materials Science

Kangming Li, Jason Hattrick-Simpers 

Acceleration Consortium, University of Toronto, 27 King’s College Cir, Toronto, ON, Canada

High-throughput computation and experiments combined with data-driven methods promise to revolutionize materials science. Central to this paradigm is machine learning (ML) for autonomous discovery, in place of traditional approaches that rely on trial and error or intuition. However, biases in the use of ML have attracted little attention. These biases can make ML less effective or even counterproductive, thereby decelerating materials discovery. This talk features four examples [1-4] from our recent studies: unexpected failure modes related to robustness and data redundancy, as well as unexpected success in prediction tasks considered challenging.

First, we show that model performance on community benchmarks does not reflect true generalization in materials discovery. Using the Materials Project database as a case study, we reveal that ML models can achieve excellent performance when benchmarked within an earlier database version, yet these pretrained models degrade severely on new materials from the latest version. In the second example, on data redundancy across large materials datasets, we find that up to 95% of the data can be removed without impacting model performance, highlighting the inefficiency of existing data acquisition practices. Next, we expose biases in how the generalization capability of ML models is interpreted. With our recently curated dataset for high-entropy materials, we demonstrate that ML models trained on simpler structures can generalize well to more complex disordered, higher-order alloys, thereby unlocking new strategies for exploring the high-entropy materials space. Through a comprehensive investigation across large materials datasets, we further reveal that existing ML models can generalize well beyond the chemical or structural groups of the training set; the application domains of ML models may therefore be broader than our intuitions suggest. In addition, we show that scaling up dataset size has marginal or even adverse effects on out-of-domain generalization, contrary to conventional scaling wisdom. These results call for a rethinking of the usual criteria for materials classification and of neural scaling strategies.
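The sketches below illustrate the first two experiments. They are minimal illustrations, not the published pipelines: the snapshot file mp_snapshot.csv, the columns first_db_version and e_form, and the feat_ feature prefix are all hypothetical placeholders, and a random forest stands in for whichever model is being benchmarked.

First, a time-split evaluation in the spirit of [1]: train on entries present in an earlier database release, then test on entries added later, rather than on a random hold-out.

    # Time-split evaluation sketch (all file/column names are hypothetical).
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error

    df = pd.read_csv("mp_snapshot.csv")  # hypothetical local snapshot
    feature_cols = [c for c in df.columns if c.startswith("feat_")]

    # Split by the release in which each entry first appeared
    # (zero-padded "YYYY.MM" tags compare correctly as strings).
    old = df[df["first_db_version"] <= "2018.11"]
    new = df[df["first_db_version"] > "2018.11"]

    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(old[feature_cols], old["e_form"])

    # Error on materials added after the training snapshot; comparing this
    # against a random hold-out within `old` exposes the degradation.
    mae_new = mean_absolute_error(new["e_form"], model.predict(new[feature_cols]))
    print(f"MAE on newly added materials: {mae_new:.3f} eV/atom")

Second, a crude redundancy probe in the spirit of [2]: retrain on progressively smaller random subsets of the training data and track the test error. Random pruning is only a baseline; [2] uses more informative selection criteria.

    # Data-pruning sketch, reusing old/new/feature_cols from above.
    import numpy as np

    rng = np.random.default_rng(0)
    for keep in (1.0, 0.5, 0.2, 0.05):
        idx = rng.choice(len(old), size=int(keep * len(old)), replace=False)
        sub = old.iloc[idx]
        m = RandomForestRegressor(n_estimators=200, random_state=0)
        m.fit(sub[feature_cols], sub["e_form"])
        mae = mean_absolute_error(new["e_form"], m.predict(new[feature_cols]))
        print(f"keep={keep:4.0%}  MAE={mae:.3f} eV/atom")

If the dataset is as redundant as [2] reports, the error curve stays nearly flat down to a small fraction of the training set.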

Keywords: Machine Learning, Out-of-Distribution Generalization, DFT Calculations, High-Entropy Materials.

References:

[1] K. Li et al., “A critical examination of robustness and generalizability of machine learning prediction of materials properties”, npj Computational Materials 9, 55 (2023).

[2] K. Li et al., “Exploiting redundancy in large materials datasets for efficient machine learning with less data”, Nature Communications 14, 7283 (2023).

[3] K. Li et al., “Efficient first principles-based modeling via machine learning: from simple representations to high entropy materials”, Journal of Materials Chemistry A 12, 12412 (2024).

[4] K. Li et al., “Probing out-of-distribution generalization in machine learning for materials”, arXiv:2406.06489.

Important Dates

Online registration starts & first-round announcement: March 28, 2024
Abstract submission starts: May 1, 2024
Early bird registration closes & second-round announcement: July 1, 2024
Abstract submission closes: September 25, 2024
Workshop: October 9-13, 2024

Contact

Dr. Runhai Ouyang (DCTMD2024@163.com)
