
Automatic Spectral Identification Using Deep Metric Learning with 1D RegNet and AdaCos

Introduction of powerful deep metric learning for much more accurate and robust automatic spectral analysis to identify the substance from…

Photo by National Cancer Institute on Unsplash

Background & Motivation

Television shows like CSI, Law and Order, and Forensic Files have introduced forensic science to popular audiences and dramatised how important scientific evidence is for solving crimes. A tiny fragment left at the scene can identify a murderer by breaking their alibi. Identifying substances by spectral analysis plays a vital role in forensic science. Material identification is equally important for failure analysis in manufacturing and for materials research. Spectra such as X-Ray Diffraction (XRD) for inorganic materials and Fourier-Transform Infrared Spectroscopy (FT-IR) for organic materials are typically used to identify the constituents. Their unique spectral features, the so-called "spectral fingerprints", let us identify a material by comparing the measured spectrum with those in a database. Identification accuracy depends mainly on the search algorithm of the spectral database and on how clean the spectrum is, which in turn depends on how carefully the sample was collected. However, an insufficient amount of sample or contamination by other substances can make the spectrum noisy and hard to identify. Specifically, most database search algorithms rely on the peak positions of the input spectrum, yet peak shifts are common for tiny samples, which degrades identifiability. Without the aid of human experts, search results are not trustworthy enough to serve as analysis results. A more robust identification algorithm would make the search results themselves more reliable. Deep learning, as a more accurate identification model, could therefore help automate forensic analysis.

The code is explained below and is also available on GitHub!

GitHub – ma921/XRDidentifier: Pytorch implementation of XRD spectral identification from COD…

Machine Learning Strategies

But how can we exploit deep learning for this noisy sample identification problem? The first idea that came to my mind was the convolutional neural network (CNN), a remarkably successful deep learning classifier for object recognition. But wait, there are four differences from the usual image classification task.

  1. Input spectra are 1-dimensional tensors, not 2D like images.
  2. The goal of the task is identification rather than classification. That is, the number of classes to be identified can be over 10,000, as in face recognition tasks.
  3. Raw spectra contain their own kinds of noise, different from the noise types found in images.
  4. The spectral database holds a single spectrum for each material, with no variations or noisy samples.

Let’s look into each of these problems. The numbers below correspond to the list above.

1. 1D-CNN

First, the difference in dimension matters for transfer learning. In image classification tasks, we can use pre-trained models published on GitHub, which reduces the amount of data we have to prepare. Fortunately, there are several 1D-CNN models on GitHub, but none of them is pre-trained on spectra. Of the models I tried, 1D-RegNet worked the best. RegNet is a state-of-the-art CNN family that Facebook researchers obtained by designing network design spaces, an approach closely related to Neural Architecture Search (NAS) (paper). The original version of RegNet is only for 2D images. But, thanks to Georgia Tech researchers, a 1D version of RegNet is available on GitHub. The link to the repository is here: https://github.com/hsd1503/resnet1d
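To make the 1D versus 2D distinction concrete, here is a minimal sketch of what a single 1D convolution filter does to a spectrum: the kernel slides along one axis only (the diffraction angle) instead of across rows and columns. The function name `conv1d_valid` and the kernel weights are illustrative assumptions, not code from the linked repository. In a real 1D-CNN the weights are learned and many such filters are stacked.

```python
def conv1d_valid(signal, kernel):
    """Slide `kernel` along `signal` (cross-correlation, no padding)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# Toy spectrum: one sharp peak on a flat background.
spectrum = [0.0, 0.0, 1.0, 4.0, 1.0, 0.0, 0.0]

# Hand-picked kernel (negative second difference) that responds to peaks.
# A trained network would learn its own weights instead.
peak_kernel = [-1.0, 2.0, -1.0]

response = conv1d_valid(spectrum, peak_kernel)

# The strongest response sits where the kernel is centred on the peak.
peak_index = response.index(max(response)) + len(peak_kernel) // 2
```

Because no padding is used, the output is shorter than the input by `len(kernel) - 1` points, which is why the kernel's half-width is added back to locate the peak in the original spectrum.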

2. Deep Metric Learning

Second, we should treat this spectral identification problem as few-shot metric learning, as in face recognition. Generally speaking, classification tasks require over 100 samples per class. In our setting, only 1 or 2 spectra are available per class, which makes this few-shot learning. In addition, classification tasks generally handle fewer than 100 classes; beyond that, accuracy drops. With over 5,000 classes, we usually switch the problem from learning criteria that separate the classes to learning a metric that measures the similarity between samples. This is so-called metric learning. It works very well for face recognition and, better still, all we have to do to adopt it is replace the loss function. CosFace and ArcFace are pioneering loss functions for deep metric learning with CNNs. These have since evolved into AdaCos, which automatically tunes the hyperparameters of cosine-based losses such as CosFace and works very well without additional tuning. The AdaCos implementation is available here: https://github.com/4uiiurz1/pytorch-adacos/blob/master/metrics.py
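The idea of "replacing the loss function" can be sketched in a few lines. Cosine-based losses compare the L2-normalised embedding with one L2-normalised weight vector per class, multiply the cosine similarities by a scale s, and feed the result to an ordinary softmax cross-entropy. The sketch below uses the fixed scale proposed in the AdaCos paper, s = sqrt(2) * log(C - 1) for C classes. It is a plain-Python illustration under these assumptions, not the linked PyTorch implementation, and it leaves out the dynamic variant that updates s during training. All function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_logits(embedding, class_weights):
    """Cosine similarity between one embedding and every class weight vector."""
    e = l2_normalize(embedding)
    return [sum(a * b for a, b in zip(e, l2_normalize(w))) for w in class_weights]


def fixed_adacos_scale(num_classes):
    """Fixed AdaCos scale: sqrt(2) * log(C - 1). Needs more than 2 classes."""
    return math.sqrt(2.0) * math.log(num_classes - 1)


def scaled_cosine_cross_entropy(embedding, class_weights, label):
    """Softmax cross-entropy on scaled cosine logits for a single sample."""
    s = fixed_adacos_scale(len(class_weights))
    logits = [s * c for c in cosine_logits(embedding, class_weights)]
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]
```

At inference time the class weights are no longer needed: identification reduces to ranking the database spectra by the cosine similarity of their embeddings to the embedding of the measured spectrum.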

3. Bayesian Denoising Preprocess

Third, we should consider how to remove the noise that is specific to spectra. Noise in spectra falls broadly into two categories: white noise and background noise. Background noise comes from the signals of sample holders and plates, or from insufficiently crystallised samples. Removing it used to be done by hand until BEADS appeared. Thanks to the power of Bayesian sparsity analysis, we can automatically remove both noise types with few hyperparameters. The repository is here: https://github.com/skotaro/pybeads
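To illustrate the two noise categories, the sketch below treats a measured spectrum as peaks + slowly varying background + white noise, and removes the latter two with a deliberately crude stand-in: a moving average against white noise and a rolling minimum as the background estimate. This is not BEADS and does not use pybeads. It only shows what "denoising both noise types" means. The function names and window sizes are assumptions chosen for illustration.

```python
def moving_average(y, window):
    """Smooth white noise by averaging over a centred window."""
    half = window // 2
    out = []
    for i in range(len(y)):
        lo, hi = max(0, i - half), min(len(y), i + half + 1)
        out.append(sum(y[lo:hi]) / (hi - lo))
    return out


def rolling_minimum(y, window):
    """Crude background estimate: the lowest value in a wide centred window."""
    half = window // 2
    return [min(y[max(0, i - half): i + half + 1]) for i in range(len(y))]


def crude_denoise(y, smooth_window=3, baseline_window=21):
    """Smooth first, then subtract the estimated background."""
    smoothed = moving_average(y, smooth_window)
    baseline = rolling_minimum(smoothed, baseline_window)
    return [s - b for s, b in zip(smoothed, baseline)]


# Toy spectrum: constant background of 5 with one peak at index 25.
raw = [5.0] * 50
raw[25] = 15.0
clean = crude_denoise(raw)
```

The baseline window must be wider than the widest peak, otherwise the rolling minimum follows the peak and subtracts it. Tuning choices like this one are what a sparsity-based method such as BEADS is meant to reduce.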

4. Physics-informed Data Augmentation

Even in a few-shot learning setting, 1 or 2 training samples per class are far too few. This is because the spectrum of one and the same material comes in several variants. The variants and their causes are as follows:

  1. Peak scaling: peak intensities deviate from the standard spectrum in the database because of differences in crystallographic orientation in powder samples.
  2. Peak elimination: some peaks of the standard spectrum in the database vanish because of the experimental apparatus settings.
  3. Pattern shifting: the whole pattern shifts to lower or higher angles relative to the standard spectrum in the database because of strain in the sample or the experimental conditions.
  4. Peak splitting: new peaks appear relative to the standard spectrum in the database because of crystal structure disorder.

As such, 1 or 2 samples cannot convey these spectral changes to the deep learning model, so we should apply data augmentation. Thanks to MIT researchers, data augmentation code for variants 1–3 is available here: https://github.com/PV-Lab/autoXRD
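Variants 1 to 3 can be sketched on a peak list, that is, a list of (2θ position, intensity) pairs. The code below is a simplified illustration of the idea, not the autoXRD implementation. The function names, the scaling range, the drop probability and the maximum shift are all assumed values that would need to be matched to the real instrument.

```python
import random


def scale_peaks(peaks, rng, low=0.5, high=1.5):
    """Peak scaling: multiply each peak intensity by its own random factor."""
    return [(pos, inten * rng.uniform(low, high)) for pos, inten in peaks]


def eliminate_peaks(peaks, rng, drop_prob=0.2):
    """Peak elimination: drop each peak with probability `drop_prob`."""
    kept = [p for p in peaks if rng.random() >= drop_prob]
    # Never return an empty pattern: fall back to the strongest peak.
    return kept or [max(peaks, key=lambda p: p[1])]


def shift_pattern(peaks, rng, max_shift=0.5):
    """Pattern shifting: move every peak by the same random offset in 2-theta."""
    delta = rng.uniform(-max_shift, max_shift)
    return [(pos + delta, inten) for pos, inten in peaks]


def augment(peaks, seed=None):
    """Apply scaling, elimination and shifting in sequence."""
    rng = random.Random(seed)
    return shift_pattern(eliminate_peaks(scale_peaks(peaks, rng), rng), rng)


# Hypothetical reference pattern with three peaks.
reference = [(20.0, 100.0), (35.5, 40.0), (50.2, 10.0)]
variants = [augment(reference, seed=s) for s in range(10)]
```

Passing a seed makes each variant reproducible, so one database spectrum can be expanded into a fixed set of training samples.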

With regard to peak splitting, we can use the pymatgen library to create disordered crystal structures and their X-ray diffraction spectra.


Problem Setting

Let’s set up a concrete problem to make the above strategy tangible. Say we want to identify an unknown crystalline powder sample. We can measure its X-Ray Diffraction (XRD) pattern as the fingerprint spectrum. All we know about the sample is that it must contain lithium. An XRD spectral database can be built from the Crystallography Open Database (COD) by converting each Crystal Information File (CIF) into an XRD spectrum with pymatgen. According to COD, 8,172 lithium compounds have been reported so far. Checking 8,172 spectra against the target sample is infeasible for humans, so we narrow the candidates down to 5. The accuracy of the single likeliest candidate doesn’t matter, but the 5 candidates must contain the true material. Let’s say this top-5 accuracy should be above 95%. We can then formulate the problem as follows:

Using the spectral identification technique, how can we identify the unknown sample out of 8,172 candidates with over 95% top-5 accuracy?
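Top-5 accuracy, the target metric here, counts a prediction as correct when the true material appears anywhere among the five highest-scoring candidates. A minimal sketch follows, assuming `scores` holds one list of per-class similarity scores per query spectrum. The function name and the toy data are illustrative.

```python
def top_k_accuracy(scores, true_labels, k=5):
    """Fraction of queries whose true label is among the k best-scoring classes."""
    hits = 0
    for row, label in zip(scores, true_labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Toy example: 3 query spectra scored against 3 candidate materials.
scores = [
    [0.1, 0.9, 0.3],
    [0.8, 0.1, 0.05],
    [0.2, 0.3, 0.5],
]
true_labels = [1, 1, 0]

top1 = top_k_accuracy(scores, true_labels, k=1)
top2 = top_k_accuracy(scores, true_labels, k=2)
```

In the problem above, each row would have 8,172 entries and k would be 5.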


The Code

Python Library Requirements:

  • pymatgen
  • pytorch
  • scikit-learn

Environment: Python 3.7.3, PyTorch 1.4.0

Dataset Construction

Download CIF from COD

We will use the COD to construct the dataset. First, click Accessing COD Data > Search page in the left-hand menu. Then enter "Li" in the "1 to 8 elements" search box and click send. On the search result page, you will see the text "Result: there are 8172 entries in the selection". Click "list of CIF URLs" to download "COD-selection.txt". This gives you the list of 8,172 CIF URLs. Next, the following Python code downloads all the CIF files from COD in bulk.
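A minimal sketch of such a bulk download using only the standard library is given below. It assumes COD-selection.txt contains one CIF URL per line and that each URL ends in a file name such as 1000041.cif. The function names and the output directory are hypothetical, and this is not necessarily how the author's script works.

```python
import os
import urllib.request


def read_url_list(path):
    """Read one URL per line, skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def cif_filename(url):
    """Use the last path component of the URL as the local file name."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def download_all(urls, out_dir, fetch=urllib.request.urlretrieve):
    """Download every URL into out_dir and return the list of failures."""
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for url in urls:
        target = os.path.join(out_dir, cif_filename(url))
        if os.path.exists(target):
            continue  # already downloaded, so the script can be resumed
        try:
            fetch(url, target)
        except OSError as err:  # urllib's URLError and HTTPError derive from OSError
            failed.append((url, str(err)))
    return failed


# Example call (commented out so that nothing is downloaded on import):
# failed = download_all(read_url_list("COD-selection.txt"), "cif")
```

Skipping files that already exist lets an interrupted run continue where it stopped, and the returned list shows which entries need a retry. With several thousand files, adding a short pause between requests is a reasonable courtesy to the server.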

Convert CIF to XRD spectra

As a first step, you need to check whether the CIF files are readable by pymatgen. Unfortunately, some of them are not. The following read_CIF module parses each CIF file into pymatgen-readable structure data and saves it as a pickle file with an arbitrary name.

Next, we can convert the pymatgen structure data to XRD spectra using the following XRDspectra module.

Thus, after running the above code in order, we obtain an XRD spectral database named XRD_epoch5.pkl. Crystal names such as LiTaO3 are converted into integer labels. The correspondence between each integer label and the human-interpretable crystal name is summarised in material_labels.csv.
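One step in this conversion deserves a closer look. pymatgen's XRDCalculator.get_pattern returns discrete peaks (positions in 2θ and intensities), whereas a 1D-CNN needs a fixed-length vector. A common way to bridge the two is to broaden each peak with a Gaussian on a regular 2θ grid. The sketch below shows that idea under assumed settings (0 to 120 degrees, 1,200 points, a peak width of 0.2 degrees). The function name is hypothetical, and the author's XRDspectra module may do this differently.

```python
import math


def peaks_to_spectrum(peaks, two_theta_min=0.0, two_theta_max=120.0,
                      n_points=1200, fwhm=0.2):
    """Turn (2-theta, intensity) peaks into a fixed-length, max-normalised vector."""
    # Convert full width at half maximum to the Gaussian standard deviation.
    sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    step = (two_theta_max - two_theta_min) / (n_points - 1)
    grid = [two_theta_min + i * step for i in range(n_points)]

    spectrum = [0.0] * n_points
    for pos, inten in peaks:
        for i, x in enumerate(grid):
            spectrum[i] += inten * math.exp(-0.5 * ((x - pos) / sigma) ** 2)

    top = max(spectrum)
    return [y / top for y in spectrum] if top > 0 else spectrum


# Hypothetical pattern with two peaks at 30 and 60 degrees.
spec = peaks_to_spectrum([(30.0, 100.0), (60.0, 50.0)])
```

Normalising by the maximum keeps every database entry on the same intensity scale, so the network sees relative rather than absolute peak heights.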

Deep Metric Learning Model (1D-RegNet + AdaCos)

First, download net1d.py from this repository. Then, execute the following code to train the deep metric learning model.

After training, you can visualise the learning curve (learning_curve.csv) with the following code.

Learning curve created by the author

As we can see, at epoch 100 the validation top-5 accuracy reached 97.3%, which meets our target. The test accuracy was identical (97.3%), so overfitting to the validation dataset is not a concern.

This code is also available on GitHub, linked here. Play as if you were a forensic scientist!

If you like this article, please follow me on Medium, GitHub, and Twitter!

Twitter: https://twitter.com/masaki_adachi GitHub: https://github.com/ma921

