Chapter 20: Machine Learning for In Silico ADMET Prediction

Reading notes on "Artificial Intelligence in Drug Design"
1.Introduction
- Multi-task deep learning network (MT-DNN) and graph convolutional neural network (GCNN) methods play an important role in boosting prediction accuracy.
2.Materials

2.1.Dataset Overview
- PubChem is a large-scale chemical database of bioactive molecules with drug-like properties.
- ChEMBL, PubChem's European counterpart, is another database of small-molecule data suitable for machine learning.
- Additional well-curated databases include the Aquasol database for aqueous solubility and Tox21 for toxicity.
2.2. Descriptor Set Overview
- 2D molecular descriptors are the most popular choice for traditional ADMET modeling. The descriptors used for our ADMET modeling include cLogP (BioByte Corp., Claremont, CA), Kier connectivity, shape, and E-state indices, a subset of MOE descriptors (Chemical Computing Group Inc., 2004, http://www.chemcomp.com), and a set of ADMET keys, which are structural features.
- Some descriptors, such as the Kier shape indices, contain implicit 3D information. Explicit 3D molecular descriptors were not routinely used, both to avoid biasing the analysis with predicted conformational effects and to keep descriptor calculation fast enough for rapid prediction.
- In the deep learning approach, a molecular graph convolutional neural network was applied to transform molecular structures into embeddings.
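As a rough illustration of how a graph convolution turns a structure into an embedding, the sketch below runs one round of neighbor aggregation over a toy molecular graph and then sums the atom vectors into a single molecule-level vector. The atom features, the graph encoding, and the function names are assumptions made for illustration. Real GCNNs such as Chemi-Net use learned weights and much richer atom and bond features, so this is not their architecture.

```python
# Toy graph convolution with no learned weights; purely illustrative.

def graph_conv_step(atom_features, bonds):
    """One round of message passing: each atom adds its neighbors' features to its own."""
    neighbors = {i: [] for i in range(len(atom_features))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    updated = []
    for i, feats in enumerate(atom_features):
        aggregated = list(feats)
        for j in neighbors[i]:
            aggregated = [x + y for x, y in zip(aggregated, atom_features[j])]
        updated.append(aggregated)
    return updated


def readout(atom_features):
    """Sum-pool the atom vectors into one fixed-length molecule embedding."""
    return [sum(column) for column in zip(*atom_features)]


# Ethanol heavy atoms C-C-O with hypothetical features [is_carbon, is_oxygen].
ethanol_atoms = [[1, 0], [1, 0], [0, 1]]
ethanol_bonds = [(0, 1), (1, 2)]

hidden = graph_conv_step(ethanol_atoms, ethanol_bonds)
embedding = readout(hidden)
```

The embedding has the same length for any molecule, which is what lets a downstream regression layer consume structures of different sizes.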
2.3.Machine Learning Algorithms
- Cubist is a prediction-oriented regression algorithm developed by Quinlan. Compared with other traditional statistical algorithms, its advantage is that it can handle large datasets with highly nonlinear relationships.
- A deep learning algorithm for ADMET prediction is described in detail in "Chemi-Net: a molecular graph convolutional network for accurate drug property prediction".
2.4.Software
- Python and R are often used for data processing.
- Spotfire and JMP can perform data analysis and visualization.
- Commonly used software for calculating descriptors includes Dragon, RDKit, Daylight, ACD/Labs, Molecular Operating Environment (MOE), Schrödinger, and Pipeline Pilot.
- Graph convolutional descriptors can be computed by several deep learning-based ADMET prediction packages, including DeepChem, Chemprop, and Chemi-Net.
- The scikit-learn (sklearn) package in Python and the caret package in R are used for applying traditional machine learning algorithms. TensorFlow, Keras, and PyTorch are commonly used deep learning frameworks.
- Pipeline Pilot is used for data pipelining and for automating the whole ADMET training and inference process.
2.5.Computer Hardware
- The machine learning tasks can be run on on-premise hardware, in the cloud, or on a hybrid of the two.
  - For traditional machine learning tasks, an HP Z series workstation with at least a 4-core CPU, 16 GB RAM, and a 1 TB hard drive, or a comparable M5 instance on Amazon Web Services (AWS), meets the requirement. The preferred setup is an 8-core CPU with 64 GB RAM and a 4 TB SSD.
  - For deep learning tasks with large datasets, GPUs are preferred for the training process. Some preferred GPUs include Nvidia GeForce RTX 2080Ti, Quadro RTX 6000, Titan RTX, or Tesla V100. On AWS, the P2 or P3 instance is suitable for the GPU training tasks.
3.Methods

3.1.Training and Test Set Preparation
- To resemble real-world prediction scenarios, the data were split temporally into a training set and a test set, with the newer compounds selected as the test set.
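A temporal split can be sketched in a few lines of Python. The record layout (`compound_id`, `measured_on`, `value`), the compound IDs, and the cutoff date below are hypothetical; the notes do not say how the cutoff was chosen.

```python
from datetime import date


def temporal_split(records, cutoff):
    """Return (train, test): measured before the cutoff -> train, on or after it -> test."""
    train = [r for r in records if r["measured_on"] < cutoff]
    test = [r for r in records if r["measured_on"] >= cutoff]
    return train, test


# Hypothetical assay records; field names are assumptions for illustration.
records = [
    {"compound_id": "CPD-001", "measured_on": date(2018, 6, 1), "value": 1.2},
    {"compound_id": "CPD-002", "measured_on": date(2019, 3, 15), "value": 0.8},
    {"compound_id": "CPD-003", "measured_on": date(2020, 1, 10), "value": 2.1},
    {"compound_id": "CPD-004", "measured_on": date(2020, 9, 30), "value": 1.7},
]

train, test = temporal_split(records, cutoff=date(2020, 1, 1))
train_ids = [r["compound_id"] for r in train]
test_ids = [r["compound_id"] for r in test]
```

Unlike a random split, this keeps every test compound strictly newer than the training data, which mimics how a deployed model is asked about molecules that did not exist when it was trained.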
3.2.Model Training with Machine Learning and Performance Evaluation
- The test set was used solely for testing purposes to avoid bias in the training procedure.
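The notes do not state which metrics were used for evaluation. As an assumption, the sketch below computes two common regression metrics, RMSE and the coefficient of determination (R²), on held-out test-set predictions; the values are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up test-set values; the model never saw these during training.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

test_rmse = rmse(y_true, y_pred)
test_r2 = r_squared(y_true, y_pred)
```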
3.3.Model Deployment and Automation

3.4.Performance Monitoring
- During each model update run, we retrieve the molecules with data measured since the last training run and use the model from that last run to predict their ADMET activity. This ensures that the molecules used for the evaluation were not in the training set of the model being evaluated.
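The monitoring step can be sketched as follows. The data layout, the compound IDs, and the placeholder `previous_model` function are assumptions for illustration; in practice `predict` would be the model saved from the last training run, and the error metric (mean absolute error here) is also an assumed choice.

```python
def prospective_errors(new_measurements, training_ids, predict):
    """Score the previous model on compounds measured since the last training run.

    new_measurements: dict of compound_id -> (structure, measured_value)
    training_ids: set of compound ids the previous model was trained on
    predict: callable mapping a structure to a predicted value
    """
    errors = {}
    for compound_id, (structure, measured) in new_measurements.items():
        if compound_id in training_ids:
            continue  # skip anything the previous model has already seen
        errors[compound_id] = predict(structure) - measured
    return errors


def mean_absolute_error(errors):
    return sum(abs(e) for e in errors.values()) / len(errors)


def previous_model(smiles):
    # Placeholder standing in for the model from the last training run.
    return 0.1 * len(smiles)


new_measurements = {
    "CPD-101": ("CCO", 0.5),
    "CPD-102": ("CCCCO", 0.3),
    "CPD-050": ("CC", 0.2),  # re-measured compound that was already in training
}
training_ids = {"CPD-050"}

errors = prospective_errors(new_measurements, training_ids, previous_model)
mae = mean_absolute_error(errors)
```

Tracking this error across successive update runs gives a running estimate of how the deployed model performs on genuinely new chemistry.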
3.5.Additional Tips When Training ADMET Models
- There are several important factors which need to be considered when building in silico ADMET models.
  - One of the first considerations is the understanding of the ADMET property to be analyzed and how the research team intends to use this property to make design decisions.
  - Next, the variability of the experimental data should be examined. Since in silico modeling is intended to simulate an experimental assay, the models are only as good as the data on which they are trained.
  - Following this, the machine learning method(s) to be used to analyze the structure–activity relationship (SAR) should be examined in the context of the structural diversity, SAR linearity, and size of the dataset to be analyzed.
    For small datasets, especially for a congeneric series of compounds, simple multilinear regression analysis or partial least squares can be sufficient.
    For large and structurally diverse datasets with nonlinear SAR, more sophisticated methods such as RF, ANN, Cubist, or advanced deep learning methods can be more practical.
  - The next aspect to be considered is the available molecular descriptor set, as accuracy, interpretability, reproducibility, and speed need to be evaluated.
  - Finally, the application domain or prediction confidence needs to be examined if the model is meant to be applied for prospective property predictions.
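One simple way to examine the application domain, shown here as an assumption rather than the method used in the chapter, is to check how similar a query compound is to its nearest neighbor in the training set. The sketch below represents fingerprints as sets of on-bit indices and uses Tanimoto similarity; the fingerprints and the 0.4 threshold are hypothetical and would need tuning per model.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def nearest_training_similarity(query_fp, training_fps):
    """Similarity of the query to its most similar training compound."""
    return max(tanimoto(query_fp, fp) for fp in training_fps)


def in_domain(query_fp, training_fps, threshold=0.4):
    """Flag a prediction as in-domain if a similar enough training compound exists.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    return nearest_training_similarity(query_fp, training_fps) >= threshold


# Hypothetical fingerprints as sets of on-bit indices.
training_fps = [{1, 2, 3, 4}, {2, 3, 5}]
close_query = {1, 2, 3}
far_query = {7, 8, 9}

close_similarity = nearest_training_similarity(close_query, training_fps)
far_similarity = nearest_training_similarity(far_query, training_fps)
```

Predictions for compounds like `far_query`, which share no features with the training set, would be reported with low confidence or withheld.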
4.Notes
- To incorporate DL-based ADMET prediction seamlessly with our existing ADMET prediction service, we had to stay with the same Pipeline Pilot platform.
5.Summary
- We describe the development and implementation of ADMET prediction methods. These methods are widely used in the pharmaceutical industry.
Original article: https://blog.csdn.net/weixin_52812620/article/details/127034005
