Impact of Data Bias on Machine Learning for Crystal Compound Synthesizability Predictions

The overall framework of the synthesizability likelihood prediction model. (a) Data collection for the two distinct data sets used in this study. For data set 1 (mixed-source data), crystal samples for the synthesizable class are obtained from the Crystallography Open Database (COD), and crystal samples for the unsynthesizable class are generated using the CSPD. For data set 2 (single-source data), crystal samples for both classes are collected from the Materials Project database. (b) The crystal information files (CIFs) are converted into color-coded voxel images, which are used as the inputs for the convolutional encoder. (c) The convolutional encoder is followed by a multi-layer perceptron (MLP) binary classifier and trained on labeled data; the combined model is referred to as the CNN classifier.
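Step (b) of the caption can be illustrated with a minimal sketch. It assumes atoms are given as fractional coordinates plus atomic numbers already parsed from a CIF, and it uses a single channel holding the atomic number. The function name `voxelize`, the grid size, and the one-channel encoding are illustrative assumptions; the study's actual color coding and resolution are not described here.

```python
def voxelize(frac_coords, atomic_numbers, grid=8):
    """Map atoms onto a grid x grid x grid voxel image (nested lists).

    Hypothetical helper: each occupied voxel stores the atomic number
    of the atom placed in it; empty voxels store 0.
    """
    image = [[[0 for _ in range(grid)] for _ in range(grid)] for _ in range(grid)]
    for (x, y, z), number in zip(frac_coords, atomic_numbers):
        # Wrap fractional coordinates into the unit cell, then pick the voxel.
        # The trailing "% grid" guards against floating-point round-up to 1.0.
        i, j, k = (int((c % 1.0) * grid) % grid for c in (x, y, z))
        # If two atoms share a voxel, keep the heavier one (an arbitrary choice).
        image[i][j][k] = max(image[i][j][k], number)
    return image


# Two-atom toy cell: one atom at the origin, one at the body center.
toy_image = voxelize([(0.0, 0.0, 0.0), (0.5, 0.5, 0.5)], [11, 17], grid=4)
```

A real pipeline would typically use several channels and a finer grid, and would feed the resulting array to the convolutional encoder.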

Machine learning models are susceptible to being misled by biases in training data that emphasize incidental correlations over the intended learning task. This study demonstrates the impact of data bias on the performance of a machine learning model designed to predict the likelihood of synthesizability of crystal compounds. The model performs binary classification on labeled crystal samples.

The study shows how the model’s learning and prediction behavior differs when the same architecture is trained on distinct data. Two data sets are used for illustration: a mixed-source data set that integrates experimental and computational crystal samples, and a single-source data set consisting of data exclusively from one computational database. Simple procedures are presented to detect data bias and to evaluate its effect on the model’s performance and generalization.
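One generic way to probe for this kind of bias, offered as a sketch and not necessarily the procedure used in the study, is to ask how well a single trivial descriptor separates the two classes. The function `best_threshold_accuracy` and the descriptor values below are hypothetical; the idea is that if a quantity unrelated to synthesizability (for example, a bookkeeping property that differs between data sources) classifies the samples almost perfectly, the labels are likely confounded with the data source.

```python
def best_threshold_accuracy(values, labels):
    """Best accuracy of a one-feature threshold rule for binary labels (0/1)."""
    n = len(values)
    # Majority-class baseline: always predict the more common label.
    best = max(sum(labels), n - sum(labels)) / n
    for t in sorted(set(values)):
        # Rule: predict class 1 when value >= t; also score the flipped rule.
        correct = sum((v >= t) == bool(y) for v, y in zip(values, labels))
        best = max(best, correct / n, (n - correct) / n)
    return best


# Made-up descriptor values: the two classes occupy disjoint ranges,
# so a single threshold separates them completely.
biased_score = best_threshold_accuracy(
    [2, 4, 6, 8, 40, 60, 80, 100], [1, 1, 1, 1, 0, 0, 0, 0]
)
# Made-up descriptor values with interleaved classes.
mixed_score = best_threshold_accuracy([1, 2, 3, 4], [0, 1, 0, 1])
```

A score close to 1.0 on such a descriptor is a warning sign that a more complex classifier may be learning the source of a sample instead of its synthesizability.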

This study reveals how inconsistent, unbalanced data can propagate bias, undermining real-world applicability even for advanced machine learning techniques.

Designing Materials to Revolutionize and Engineer our Future (DMREF)