WK. Mini-symposium: Machine Learning
Wednesday, 2022-06-22, 02:30 PM
Noyes Laboratory 217
SESSION CHAIR: Daniel R. Nascimento (The University of Memphis, Memphis, TN)
|
|
|
WK01 |
Invited Mini-Symposium Talk |
30 min |
02:30 PM - 03:00 PM |
P5834: INTERPRETABLE DEEP LEARNING FOR MOLECULES AND MATERIALS |
ANDREW WHITE, Chemical Engineering, University of Rochester, Rochester, NY, USA; |
IDEALS Archive (Abstract PDF / Presentation File) |
DOI: https://dx.doi.org/10.15278/isms.2022.WK01 |
Deep learning has begun a renaissance in chemistry and materials. We can devise and fit models to predict molecular properties in a few hours and deploy them in a web browser. We can create novel generative models that were previously PhD theses in an afternoon. In my group, we’re exploring deep learning in soft materials and molecules. We are focused on two major problems: interpretability and data scarcity. Now that we can make deep learning models to predict any molecular property ad nauseam, what can we learn? I will discuss our recent efforts on interpreting deep learning models through symbolic regression and counterfactuals. Data scarcity is a common problem in chemistry: how can we learn new properties without significant expense of experiments? One method is the judicious choice of experiments, which can be done with active learning. Another approach is self-supervised learning and constraining symmetries, which both try to exploit structure in data. I will cover recent progress in these areas. Finally, one consequence of the state of deep learning is that you can just make cool things in chemistry with minimal effort. I’ll review a few fun projects, including making molecules by banging on the keyboard, doing math with emojis, and doing molecular dynamics with ImageNet-derived potentials.
|
|
WK02 |
Contributed Talk |
15 min |
03:06 PM - 03:21 PM |
P5966: SUPERVISED LEARNING FOR SELECTIVE MULTI-SPECIES QUANTIFICATION FROM NOISY INFRARED SPECTROSCOPY DATA |
EMAD AL IBRAHIM, AAMIR FAROOQ, Clean Combustion Research Center, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia; |
IDEALS Archive (Abstract PDF / Presentation File) |
DOI: https://dx.doi.org/10.15278/isms.2022.WK02 |
A supervised learning approach is implemented to extract information from noisy vibrational spectroscopy data. Our method tackles two of the main problems in any commercial sensing application: sensitivity and selectivity. First, an encoder takes in noisy spectra of complex mixtures and learns reduced representations referred to as embeddings. The learned embeddings are then used in the decoder to filter out noise and unwanted species. The embeddings are simultaneously used as input to a regression network that predicts concentrations and baseline shift. The model was applied to gas sensing using Fourier-transform infrared (FTIR) spectroscopy data. We focus on identifying common volatile organic compounds (VOCs) in a realistic scenario. The multitask nature of the model gives better results than single-task denoising followed by regression, and than classical techniques like non-negative linear regression. The denoising capability was also compared to other denoising methods like Savitzky-Golay filters (SVG) and wavelet transformations (WT).
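For readers unfamiliar with the Savitzky-Golay baseline named in this abstract, the following is a minimal sketch of the classical filter, not the authors' code. It assumes a fixed 5-point window with a quadratic/cubic fit, whose well-known convolution weights are (-3, 12, 17, 12, -3)/35. The function name `savgol5` and the choice to leave the two points at each edge unsmoothed are illustrative assumptions.

```python
# Minimal Savitzky-Golay smoothing sketch (assumed: 5-point window,
# quadratic/cubic polynomial fit). Not the method used in the talk.
SG_COEFFS = (-3.0, 12.0, 17.0, 12.0, -3.0)
SG_NORM = 35.0


def savgol5(signal):
    """Return a smoothed copy of `signal` (a list of numbers).

    Interior points are replaced by the value of a least-squares
    polynomial fitted to the 5 surrounding samples. The first and last
    two points are copied unchanged (a simplifying assumption).
    """
    smoothed = list(signal)
    for i in range(2, len(signal) - 2):
        window = signal[i - 2:i + 3]
        smoothed[i] = sum(c * v for c, v in zip(SG_COEFFS, window)) / SG_NORM
    return smoothed


# A smooth parabola passes through unchanged, while an isolated
# one-sample spike is spread out and reduced in height.
parabola = [float(x * x) for x in range(10)]
spike = [0.0, 0.0, 0.0, 35.0, 0.0, 0.0, 0.0]
print(savgol5(parabola) == parabola)
print(savgol5(spike))
```

Because the filter preserves low-order polynomial shapes, it suppresses point-to-point noise with less peak distortion than a plain moving average, which is why it is a common baseline for spectral denoising.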
|
|
WK03 |
Contributed Talk |
15 min |
03:24 PM - 03:39 PM |
P6305: COMPUTATIONAL OPTIMAL TRANSPORT FOR MOLECULAR SPECTRA |
NATHAN A. SEIFERT, Department of Chemistry, University of New Haven, West Haven, CT, USA; KIRILL PROZUMENT, MICHAEL J. DAVIS, Chemical Sciences and Engineering Division, Argonne National Laboratory, Lemont, IL, USA; |
IDEALS Archive (Abstract PDF / Presentation File) |
DOI: https://dx.doi.org/10.15278/isms.2022.WK03 |
The use of computational optimal transport for the comparison of molecular spectra is presented. Computational optimal transport provides a comparator, the transport distance, which can be used in machine learning applications and for the comparison of theoretical and experimental spectra. Unlike many other comparators, the transport distance encodes line positions and intensities. It can be used to compare two discrete spectra, a discrete spectrum and a continuous spectrum, as well as two continuous spectra. Because the transport distance reflects the movement of density from one spectrum to another, the two spectra being compared do not have to have the same number of lines or features and need not closely match up in frequency space.
Several well-chosen examples will be shown to demonstrate how computational optimal transport is used and its overall utility. In addition, it is used to make quantitative comparisons between theoretical and experimental spectra including a rotational spectrum of 1-hexanal and an electronic absorption spectrum of SO2.
This work was supported by the U.S. Department of Energy, Office of Basic Energy Sciences, Division of Chemical Sciences, Geosciences, and Biosciences operating under Contract Number DE-AC02-06CH11357.
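To illustrate how a transport distance can compare two discrete spectra with different numbers of lines, here is a minimal sketch of the one-dimensional Wasserstein-1 distance. It is not the authors' implementation: it assumes each spectrum is a list of (position, intensity) sticks, that intensities are normalized to unit total, and that the cost of moving intensity is the absolute frequency shift. The function name `transport_distance` is hypothetical.

```python
def transport_distance(lines_a, lines_b):
    """1-D Wasserstein-1 distance between two stick spectra.

    Each spectrum is a list of (position, intensity) pairs with
    non-negative intensities. Intensities are normalized to sum to 1
    (an assumption of this sketch), so the result has the units of
    the position axis.
    """
    def normalize(lines):
        total = sum(intensity for _, intensity in lines)
        if total <= 0:
            raise ValueError("spectrum must have positive total intensity")
        return [(pos, intensity / total) for pos, intensity in lines]

    # Signed "mass" events: spectrum A adds intensity, spectrum B removes it.
    events = normalize(lines_a) + [(p, -w) for p, w in normalize(lines_b)]
    events.sort()

    distance = 0.0
    cdf_gap = 0.0            # running difference of the two cumulative sums
    prev_pos = events[0][0]
    for pos, weight in events:
        # In 1-D, the optimal cost is the area between the two CDFs.
        distance += abs(cdf_gap) * (pos - prev_pos)
        cdf_gap += weight
        prev_pos = pos
    return distance


# One line shifted by 2 units costs 2; two half-weight lines at 0 and 4
# merged into a single line at 2 also cost 2.
print(transport_distance([(10.0, 1.0)], [(12.0, 1.0)]))
print(transport_distance([(0.0, 1.0), (4.0, 1.0)], [(2.0, 3.0)]))
```

The second call shows the property highlighted in the abstract: the two spectra have different numbers of lines and no matching positions, yet the distance is still well defined and reflects both positions and intensities.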
|
|
|
|
|
03:42 PM |
INTERMISSION |
|
|
WK04 |
Contributed Talk |
15 min |
04:21 PM - 04:36 PM |
P6077: INVERSE INFRARED SPECTROSCOPY WITH BAYESIAN METHODS |
JEZRIELLE R. ANNIS, DANIEL P. TABOR, Department of Chemistry, Texas A&M University, College Station, TX, USA; |
IDEALS Archive (Abstract PDF / Presentation File) |
DOI: https://dx.doi.org/10.15278/isms.2022.WK04 |
While calculating theoretical harmonic IR spectra is straightforward for most molecules, nearly all methods for incorporating anharmonic effects add a substantial computational footprint. For large systems such as molecular clusters, the brute force assignment of experimental spectra by computationally iterating over all candidate structures is infeasible at the anharmonic level. However, developments in machine learning methods have provided an alternative route to evaluating the anharmonicities of new molecules and larger clusters, which is the subject of this talk. We demonstrate that Bayesian optimization enables real-time spectral evaluation of a range of anharmonic values applied to a calculated Hamiltonian. The Bayesian optimization algorithm can be applied to explore anharmonic value ranges to minimize the integrated difference between the calculated and theoretical spectra. Further, this same computational framework can be adapted to assign spectra that originate from multiple isomers or cluster sizes.
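The kind of objective an optimizer would minimize here can be sketched as follows. This is a heavily simplified illustration under stated assumptions, not the authors' framework: a single frequency scaling factor stands in for the anharmonic parameters, sticks are broadened with Lorentzians, and a coarse scan over candidate values stands in for the Bayesian optimizer. All names, line positions, and the 0.96 factor are hypothetical.

```python
def lorentzian_spectrum(grid, lines, width=5.0):
    """Broaden (frequency, intensity) sticks with Lorentzians of half-width `width`."""
    return [
        sum(inten * width ** 2 / ((x - freq) ** 2 + width ** 2)
            for freq, inten in lines)
        for x in grid
    ]


def integrated_difference(grid, spec_a, spec_b):
    """Trapezoid-rule integral of |spec_a - spec_b| over `grid`."""
    diffs = [abs(a - b) for a, b in zip(spec_a, spec_b)]
    return sum(
        0.5 * (diffs[k] + diffs[k + 1]) * (grid[k + 1] - grid[k])
        for k in range(len(grid) - 1)
    )


def objective(scale, grid, harmonic_lines, reference_spectrum):
    """Integrated difference after scaling every harmonic frequency by `scale`."""
    scaled = [(scale * freq, inten) for freq, inten in harmonic_lines]
    return integrated_difference(
        grid, lorentzian_spectrum(grid, scaled), reference_spectrum
    )


# Hypothetical harmonic lines and a reference built with a scale of 0.96.
grid = [float(x) for x in range(2800, 3801, 2)]
harmonic = [(3100.0, 1.0), (3700.0, 0.6)]
reference = lorentzian_spectrum(grid, [(0.96 * f, i) for f, i in harmonic])

# Plain scan over candidates; a Bayesian optimizer would choose which
# values to evaluate instead of trying every one.
candidates = [0.90 + 0.01 * k for k in range(11)]
best = min(candidates, key=lambda s: objective(s, grid, harmonic, reference))
print(round(best, 2))
```

A Bayesian optimizer would replace the exhaustive scan with a surrogate model that proposes the next parameter value to try, which matters when each spectrum evaluation is expensive or the parameter space has many dimensions.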
|
|
WK05 |
Contributed Talk |
15 min |
04:39 PM - 04:54 PM |
P5854: SEQUENCE-TO-SEQUENCE LEARNING FOR MOLECULAR STRUCTURE DERIVATION FROM INFRARED SPECTRA |
ETHAN FRENCH, ZHOU LIN, Department of Chemistry, University of Massachusetts, Amherst, MA, USA; |
IDEALS Archive (Abstract PDF / Presentation File) |
DOI: https://dx.doi.org/10.15278/isms.2022.WK05 |
Fully identifying unknown molecules via infrared spectroscopy can be a challenging task for even the most experienced researchers. Current data-driven computational methods usually identify unknown spectra by matching them against databases of known spectra. However, this method can be problematic for novel complex molecules given the relative lack of information. Deep learning provides a potential solution to this problem. Sequence-to-sequence learning has had great success in a wide range of areas such as language translation and speech recognition (Sutskever et al., 2014). In this work, an unsupervised sequence-to-sequence model was extended to chemical systems and used to derive complete molecular structures from infrared spectra. The model was trained on the infrared spectra of small organic molecules containing C, H, O, N, and F atoms. These molecules were represented using SELFIES, an improved version of the SMILES string molecular fingerprint descriptor (Krenn et al., 2020). Our model is able to achieve state-of-the-art results in successfully identifying a wide variety of molecules from their infrared spectra.
Ilya Sutskever, Oriol Vinyals, Quoc V. Le, "Sequence to Sequence Learning with Neural Networks", NeurIPS, 2014, 27.
Mario Krenn, Florian Häse, AkshatKumar Nigam, Pascal Friederich, and Alan Aspuru-Guzik, "Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation", Mach. Learn.: Sci. Technol. 2020, 1, 045024.
|
|
WK06 |
Contributed Talk |
15 min |
04:57 PM - 05:12 PM |
P6168: GAS-PHASE INFRARED SPECTRA ANALYSIS VIA DEEP NEURAL NETWORKS |
ABIGAIL A ENDERS, NICOLE NORTH, HEATHER C. ALLEN, Department of Chemistry and Biochemistry, The Ohio State University, Columbus, OH, USA; |
IDEALS Archive (Abstract PDF / Presentation File) |
DOI: https://dx.doi.org/10.15278/isms.2022.WK06 |
Infrared spectroscopy provides unique molecular vibrational information that is molecule and environment specific. Spectral responses, as images rather than array-based data, were used to train a deep neural network to develop analytical methods capable of large-scale information processing. We label spectra based on the present and absent functional groups, but the model must determine the frequencies, peak shape, and variability of each molecular response to identify functional groups. The resultant machine learning models significantly reduce the time required for traditional infrared spectral analysis, and the functional group assignments are found to be more accurate than those of expert chemists. Application of machine learning methods to spectroscopic data is made approachable by a straightforward model system that is generalizable, broad, and well-performing on thousands of gas-phase infrared spectra from the NIST spectral database. Future improvement will involve more specific and applied models, such as the investigation of field samples for environmental contaminants or component identification, with increased solvent complexity to continue developing a broad range of models. To the best of our knowledge, this is the first presentation of a generalizable machine learning model for infrared analysis because it is capable of analyzing a diverse spectral database.
|
|
WK07 |
Contributed Talk |
15 min |
05:15 PM - 05:30 PM |
P6088: COMPARISON OF EXPERIMENTAL AND SIMULATED RAMAN SPECTRA THROUGH REVERSE SELF MODELING CURVE RESOLUTION FOR REGRESSION-BASED MACHINE LEARNING |
NICOLE NORTH, ABIGAIL A ENDERS, HEATHER C. ALLEN, Department of Chemistry and Biochemistry, The Ohio State University, Columbus, OH, USA; |
IDEALS Archive (Abstract PDF / Presentation File) |
DOI: https://dx.doi.org/10.15278/isms.2022.WK07 |
Raman spectroscopy utilizes inelastic scattering to provide information about the vibrational environment of bonds within molecules. The intensity of vibrations can be cautiously used to determine relative concentrations of compounds. Machine learning methods are used to find patterns within datasets and commonly work better with large datasets, such as those from Raman spectral acquisitions; however, the generation of these large datasets is time-consuming and the data manipulation is cumbersome. This work seeks to circumvent these issues by determining if experimental data can be fortified or substituted with simulated data. To initiate this process, Raman spectra were collected on a variety of different molecules in solutions made of 2-3 chemical species. These spectra were then used to simulate data with the same concentrations using Reverse Self Modeling Curve Resolution (RSMCR). The experimental and the RSMCR simulated data were used to train regression-type machine learning models. Models were then validated using previously withheld experimental Raman spectral data to determine how well each dataset worked as a generalizable basis for the regression model. We found that the RSMCR simulated data closely represent the experimental Raman spectral data, with an error of approximately 2% in the relative intensity.
|
|