The multitask deep neural network (MT-DNN) and graph convolutional neural network (GCNN) methods play an important role in the accuracy boost.
2.Materials
2.1.Dataset Overview
PubChem is a large-scale chemical database of bioactive molecules with drug-like properties.
PubChem's European counterpart, ChEMBL, is another database housing small-molecule datasets for machine learning.
Some additional well-curated databases include the Aquasol database for aqueous solubility and Tox21 for toxicity.
2.2. Descriptor Set Overview
2D molecular descriptors are the most popular for traditional ADMET modeling. For our ADMET modeling, we used cLogP (BioByte Corp., Claremont, CA), Kier connectivity, shape, and E-state indices, a subset of MOE descriptors (Chemical Computing Group Inc., 2004, http://www.chemcomp.com), and a set of ADMET keys, which are structural features.
Some of the descriptors, such as the Kier shape indices, contain implicit 3D information. Explicit 3D molecular descriptors were not routinely used, both to avoid biasing the analysis with predicted conformational effects and to keep descriptor calculation fast enough for rapid prediction.
In the deep learning approach, a molecular graph convolutional neural network was applied to transform molecular structures into embeddings.
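As a rough illustration of how a graph convolution turns a structure into an embedding, the sketch below performs one round of neighbor aggregation on a small molecular graph and then sums the atom vectors into a single molecule-level vector. It is a deliberately simplified toy: the learned weight matrices and nonlinearities of a real GCNN are omitted, and the atom features, the bond list, and the function names are assumptions made for this example rather than the network used in this work.

```python
def aggregate_neighbors(features, bonds):
    """One aggregation round: each atom adds its bonded neighbors' features to its own.

    Simplification: a trained GCNN would also apply learned weights and a
    nonlinearity at this step; both are left out here.
    """
    updated = [list(f) for f in features]
    for i, j in bonds:
        for k in range(len(features[i])):
            updated[i][k] += features[j][k]
            updated[j][k] += features[i][k]
    return updated


def sum_readout(features):
    """Sum the atom vectors into one fixed-length molecule-level vector."""
    return [sum(column) for column in zip(*features)]


# Illustrative input: heavy atoms of ethanol (C-C-O) with one-hot
# [is_carbon, is_oxygen] features. Both choices are assumptions for the example.
atom_features = [[1, 0], [1, 0], [0, 1]]
bonds = [(0, 1), (1, 2)]

hidden = aggregate_neighbors(atom_features, bonds)
embedding = sum_readout(hidden)
```

After one round, each atom vector reflects its immediate neighborhood, and the summed vector has the same length regardless of the number of atoms, which is what allows it to be passed to a downstream prediction layer.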
2.3.Machine Learning Algorithms
Cubist is a prediction-oriented regression algorithm developed by Quinlan. Its advantage over other traditional statistical algorithms is that it can handle large datasets with highly nonlinear relationships.
Spotfire and JMP can perform data analysis and visualization.
Commonly used software packages for calculating descriptors include Dragon, RDKit, Daylight, ACD/Labs, Molecular Operating Environment (MOE), Schrödinger, and Pipeline Pilot.
The graph convolutional descriptors can be computed by several deep learning–based ADMET prediction packages, including DeepChem, Chemprop, and Chemi-Net.
The scikit-learn (sklearn) package in Python and the caret package in R are used for applying traditional machine learning algorithms. TensorFlow, Keras, and PyTorch are commonly used DL frameworks.
Pipeline Pilot is used for data pipelining and for automating the entire ADMET training and inference process.
2.5.Computer Hardware
On-premises hardware, cloud computing, or a hybrid of the two can be used for the machine learning tasks.
For traditional machine learning tasks, an HP Z series workstation with at least a 4-core CPU, 16 GB of RAM, and a 1 TB hard drive, or a comparable M5 instance on Amazon Web Services (AWS), meets the requirements. A preferred setup has an 8-core CPU, 64 GB of RAM, and a 4 TB SSD.
For deep learning tasks with large datasets, GPUs are preferred for training. Suitable GPUs include the NVIDIA GeForce RTX 2080 Ti, Quadro RTX 6000, Titan RTX, and Tesla V100. On AWS, P2 or P3 instances are suitable for GPU training.
3.Methods
3.1.Training and Test Set Preparation
To resemble real-time prediction situations, the training and test sets were split temporally, with the newer compounds selected as the test set.
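The temporal split can be sketched as follows. This is a minimal example that assumes each record carries a date (for instance, a registration or measurement date) and that a single cutoff date separates older from newer compounds. The record fields, the cutoff, and the function name are illustrative assumptions, not details given above.

```python
from datetime import date


def temporal_split(records, cutoff):
    """Split records into (train, test) by date.

    records: list of dicts with a 'date' key holding a datetime.date.
    cutoff:  records dated on or after the cutoff go to the test set,
             so the test set always contains the newer compounds.
    """
    train = [r for r in records if r["date"] < cutoff]
    test = [r for r in records if r["date"] >= cutoff]
    return train, test


# Hypothetical records; identifiers, dates, and values are illustrative only.
records = [
    {"id": "cmpd-1", "date": date(2018, 3, 1), "value": -3.2},
    {"id": "cmpd-2", "date": date(2019, 7, 15), "value": -4.1},
    {"id": "cmpd-3", "date": date(2020, 1, 10), "value": -2.7},
    {"id": "cmpd-4", "date": date(2020, 6, 5), "value": -5.0},
]

train, test = temporal_split(records, cutoff=date(2020, 1, 1))
```

Unlike a random split, this arrangement never lets the model see compounds made after those it is tested on, which mirrors how the model is used prospectively.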
3.2.Model Training with Machine Learning and Performance Evaluation
The test set was used solely for testing purposes to avoid bias in the training procedure.
3.3.Model Deployment and Automation
3.4.Performance Monitoring
During each model update run, we retrieve the molecules whose data were newly measured since the last training run. We then use the model from that last run to predict the ADMET activity of these compounds. In this way, we make sure that the new molecules were not present in the training set of the model used for this evaluation.
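A minimal sketch of this monitoring step is given below. It assumes that the previous model is available as a callable, that the identifiers of its training compounds are known, and that the error is summarized as a root-mean-square error (RMSE). The choice of RMSE, the record fields, and all names are assumptions for illustration. The explicit filter on training identifiers is an extra safeguard for compounds that were re-measured.

```python
import math


def monitor_rmse(predict, new_records, training_ids):
    """RMSE of the previous model on newly measured compounds.

    Compounds whose ID is in training_ids are dropped first, so the
    evaluation covers only molecules the previous model has not seen.
    Returns (rmse, number_of_compounds_evaluated); rmse is None if no
    unseen compounds remain.
    """
    unseen = [r for r in new_records if r["id"] not in training_ids]
    if not unseen:
        return None, 0
    squared_errors = [(predict(r) - r["measured"]) ** 2 for r in unseen]
    return math.sqrt(sum(squared_errors) / len(squared_errors)), len(unseen)


def previous_model(record):
    """Hypothetical stand-in for the model from the last training run."""
    return record["descriptor"] * 2.0


training_ids = {"cmpd-1", "cmpd-2"}
new_records = [
    {"id": "cmpd-2", "descriptor": 1.0, "measured": 9.0},  # re-measured, excluded
    {"id": "cmpd-5", "descriptor": 1.0, "measured": 2.5},
    {"id": "cmpd-6", "descriptor": 2.0, "measured": 3.5},
]

rmse, n_evaluated = monitor_rmse(previous_model, new_records, training_ids)
```

Tracking this value from one update run to the next gives a prospective view of model performance that a retrospective test set cannot provide.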
3.5.Additional Tips When Training ADMET Models
There are several important factors which need to be considered when building in silico ADMET models.
One of the first considerations is the understanding of the ADMET property to be analyzed and how the research team intends to use this property to make design decisions.
Next, the variability of the experimental data should be examined. Since in silico modeling is intended to simulate an experimental assay, the models can only be as good as the data on which they are trained.
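One simple way to examine experimental variability, assuming the assay contains replicate measurements for at least some compounds, is to compute the standard deviation of the replicates per compound. The sketch below does this with the Python standard library. The data layout and the function name are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, stdev


def replicate_variability(measurements):
    """Sample standard deviation of replicate measurements per compound.

    measurements: iterable of (compound_id, value) pairs.
    Compounds with a single measurement are skipped, because a standard
    deviation cannot be computed from one value.
    """
    by_compound = defaultdict(list)
    for compound_id, value in measurements:
        by_compound[compound_id].append(value)
    return {
        compound_id: stdev(values)
        for compound_id, values in by_compound.items()
        if len(values) >= 2
    }


# Hypothetical replicate data; identifiers and values are illustrative only.
measurements = [
    ("cmpd-1", 1.0), ("cmpd-1", 3.0),
    ("cmpd-2", 2.0), ("cmpd-2", 2.0), ("cmpd-2", 2.0),
    ("cmpd-3", 4.2),
]

per_compound_sd = replicate_variability(measurements)
typical_sd = mean(list(per_compound_sd.values()))
```

A typical replicate standard deviation obtained in this way gives a rough sense of the prediction error below which further model improvement is unlikely to be meaningful.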
Following this, the machine learning method(s) to be used to analyze the structure–activity relationship (SAR) should be examined in the context of the structural diversity, SAR linearity, and size of the dataset to be analyzed.
For small datasets, especially for a congeneric series of compounds, simple multilinear regression analysis or partial least squares can be sufficient.
For large and structurally diverse datasets with nonlinear SAR, more sophisticated methods such as RF, ANN, Cubist, or advanced deep learning methods can be more practical.
The next aspect to be considered is the available molecular descriptor set, as accuracy, interpretability, reproducibility, and speed need to be evaluated.
Finally, the application domain or prediction confidence needs to be examined if the model is meant to be applied for prospective property predictions.
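One common way to assess whether a query compound falls within a model's domain is to measure its similarity to the nearest training compound. The sketch below assumes that fingerprints are represented as sets of on-bit indices and uses the Tanimoto coefficient with an arbitrary threshold of 0.5. The threshold, the fingerprint representation, and the function names are assumptions for illustration and are not the confidence method used in this work.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest_training_similarity(query_fp, training_fps):
    """Highest Tanimoto similarity between the query and any training compound."""
    return max(tanimoto(query_fp, fp) for fp in training_fps)


def in_domain(query_fp, training_fps, threshold=0.5):
    """Flag a query as in-domain if it is similar enough to a training compound.

    The default threshold is an arbitrary illustrative value.
    """
    return nearest_training_similarity(query_fp, training_fps) >= threshold


# Hypothetical fingerprints; the bit indices are illustrative only.
training_fps = [{1, 2, 3, 4}, {2, 3, 5, 8}]
similar_query = {1, 2, 3, 9}
novel_query = {2, 10, 11, 12}

similar_score = nearest_training_similarity(similar_query, training_fps)
novel_score = nearest_training_similarity(novel_query, training_fps)
```

Predictions for compounds that fall below the threshold can then be reported with lower confidence or flagged for experimental measurement.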
4.Notes
To incorporate DL-based ADMET prediction seamlessly with our existing ADMET prediction service, we had to stay with the same Pipeline Pilot platform.
5.Summary
We describe the development and implementation of ADMET prediction methods. These methods are widely used in the pharmaceutical industry.