• linux集群上的SCHRODINGER 分子对接工作流程



    前言

    glide作为广为人知的分子对接软件,提供了非常方便的各类型对接工作,本篇文章旨在介绍如何方便的在无图形化界面的liinux集群中运行整个分子对接流程


    一、总体工作流简介

    在这里插入图片描述
    总体流程大致可以输入,预处理和分子对接三个部分
    • 输入为蛋白文件和小分子2D结构数据库。
    • 输入文件经过ligprepgrid_gen_from_pv.py进行预处理, 其中ligprep是将小分子2D结构转3D。grid_gen_from_pv.py是从蛋白复合物中的已知配体生成蛋白结结合位点网格。
    • 分子对接依赖于GSVrun程序进行,GSVrun包括多种常用的对接模式,以及对接过程中的可调参数。详见第4节。
    • 输出文件包括每个对接阶段的小分子3D构象输出以及得分。

    二、ligprep参数介绍

    Bash
    usage: ligprep [options] (-ismi|-icsv|-imae|-isd) infile (-osd|-omae) outfile
    
    General:
      -h, -help             Print brief help message.
      -long_help            Print exhaustive help message.
      -inp <filename>       Read arguments from the supplied input file.
      -sif_docs             List supported simplified input file format (SIF)
                            keywords.
    
    Ionization:
      -epik                 Use Epik for ionization and tautomerization
                            (Recommended, overrides -i).
      -emb, -epik_metal_binding
                            Run Epik with the metal_binding option so that states
                            appropriate for interactions with metal ions in
                            protein binding pockets are also generated.
      -i {0,1,2}            Ionization treatment: 0 - do not neutralize or ionize,
                            1 - neutralize only, 2 - neutralize and ionize
                            (Default: 1). Note that -epik option overrides -i and
                            always implies ionization and tautomerization.
      -ph <number>          Effective/target pH.
      -pht <number>         pH tolerance for generated structures.
    
    Stereoisomers:
      -ac                   Do not respect existing chirality properties and do
                            not respect chiralities from the input geometry.
                            Generate stereoisomers for all chiral centers up to
                            the number permitted (specified using the -s option).
                            This is equivalent to "Generate all combinations" in
                            the Ligand Preparation user interface. Default
                            behavior is to respect only explicitly indicated
                            chiralities.
      -g                    Respect chiralities from input geometry when
                            generating stereoisomers.
      -s #                  Generate up to this many (#) stereoisomers per input
                            structure. (Default: 32).
    
    Force-field based geometry optimization:
      -bff {14,16}          Force-field to be used for the final geometry
                            optimization. Default: 14 (OPLS_2005). For S-OPLS
                            specify 16.
    
    Standard and job control options:
      -NJOBS NJOBS          Divide the overall job into NJOBS subjobs.
      -NSTRUCTS NSTRUCTS    Divide the overall job into subjobs with no more than
                            NSTRUCTS structures.
      -HOST <hostname>      Run job remotely on the indicated host entry.
      -WAIT                 Do not return a prompt until the job completes.
      -LOCAL                Do not use temporary directory for job files. Keep
                            files in the current directory.
    
    Typical usage:
      Sample protonation states and stereoisomers, optimize geometry
      using OPLS force-field, and store results in Maestro format:
    
        ligprep <input> -omae outfile.maegz -epik
    
      where <input> is one of:
        -ismi infile.smi
        -icsv infile.csv
        -isd  infile.sd
        -imae infile.mae
    
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49
    • 50
    • 51
    • 52
    • 53
    • 54
    • 55
    • 56
    • 57
    • 58
    • 59
    • 60
    • 61
    • 62
    • 63
    • 64

    一般情况下默认参数即可,在slurm集群上采用如下代码提交任务(ligprep的cpu利用率不高, 一般建议Njobs设置为cpu核数的2~3倍):

    srun -c 4 -J ${name}_ligprep /home/xiaxichen/hyd/Schrodinger-soft/ligprep -ismi ${name}.smi -omae ${name}.maegz -epik -NJOBS 12 -WAIT 
    
    • 1

    三、grid_gen_from_pv.py 参数介绍

    SCHRODINGER本身其实就提供了很多python脚本以实现各种功能
    脚本路径:path/to/schrodinger/mmshare-v5.4/python/script
    脚本执行方式 $SCHRODINGER/run XXX.py [Options]
    脚本查询:https://www.schrodinger.com/scriptcenter

    Bash
    Options:
      -v, -version          Show the program version and exit.
      -h, -help             Show this help message and exit.
      -o OUT_BASE, -outbase OUT_BASE
                            Base name for gridgen input to be produced.
      -a, -allprint         Report all levels of status message.
    
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8

    可以看到参数非常少,连基本的网格大小都不可以设置。所以建议还是在本地创建好grid文件在传到集群上为宜,首先是所需时间其实也非常短,其次就是能够加深对蛋白结构的理解和更多的网格设置选项。


    四、GVSrun参数介绍

    Bash
    Warning: The GVSrun need a $compound_library environmental variable, or you can use -d to specify a library.
    
    Perform Virtual Screening Workflow.
    
    Usage: GVSrun [OPTION] <parameter>
    
    Input parameter:
      -i        Gird file input.
            Use a file name (Multiple files are wrapped in "", and split by ' ') or regular expression to represent your input Grid file, default is *.zip.
      -D    Which databases do you want to screen? The Database basic path is <>.
            These database are Supported:
                Custom_DB-ab-site-Fast  GVSrun  GVXrun_help.txt  plmd  README.md  Scripts  XDock  
      -d    Provide your own database path, the compounds files are recommended as maegz or SDF file format.
      -R    Optional, reference Ligands correlated with the grid or use for RMSD calculation. 
                As *.mae, *.maegz (for query structure) or *.phypo (for pharmacophore hypothesis).
      -m    Running Mode: a serial combination of different computing tasks.
                Show all mode and task in this option using "-O".
        @ Available Running Mode: <Fast>
            Fast: Fast Virtual Screening.
            Normal: Filter The Drug-like compounds and dock screening.
            Reference: Virtual Screening with reference ligand restrain.
            Prep_Normal: Virtual Screening for un-prepared compounds database.
            *Normal_MMGBSA: Virtual Screening and MMGBSA re-scoring.
            *Cov_Screening: Virtual Screening to discover covalent durg.
            *Induce_Fit_Screening: Induce fit screening.
            *QM_Screening: Virtual Screening and QMMM re-docking.
            Local: Local docking and screening were carried out using the input ligand structure.
            Shape_Screening: perform screening based on reference ligand shape.
            Advance: dock screening and clustering.
            Advance_Shape_Screening: perform screening based on reference ligand shape.
        @ or Custom Running Mode, e.g. -m "EDL+R+HTVS+CD"
    
    Control parameters:
      -F    Force Fields for docking stage (SP/XP/MMGBSA...), OPLS_2005, OPLS3e or OPLS4. <OPLS4>
      -f    Force Fields for other stage (HTVS/LigPrep...), OPLS_2005, OPLS3e or OPLS4. <OPLS_2005>
      -W    Define a Max/Min Molecular weight for MW module, such as "100:400" <100:400>
      -T    Set a Job Name. Default is "Grid_name-Database_name-Run_Mode".
      -C    Aattach residue number on receptor, required in Covalent Docking.
                e.g. "cys:A:1425", the A is chain name and 145 is the atom number(Heavy atom). 
                The cys is residue name for A:1425, Supported residues: cys, ser, lys.
      -q    Define a "DFT:Basis_Set" to QM/QMMM, default is "B3LYP-D3(BJ):6-311G**".
                Other QM setting: "B3LYP-D3M(BJ):6-311G+**","M06‑2X:def2-tzvpp(-g)","wB97M‑V:cc-pVTZ-pp"
      -p    set a PH:PHT for ligPrep, default is 7.0:2.0, means 7.0±2.0. <7.0:2.0>
      -s    Set a SMARTs Expression for compounds filter at first step. such as [B]([O])[O].
      -E    Shape Screening Options. <10000:mmod:rapid:1200>
                The paramter means: "keep_num:atomtypes:shape_sample_method:max_confs";
                NOTE: element and qsar are alternative atomtypes; thorough are alternative shape_sample_method.
    
    OUTPUT parameters:
      -a    The number of Output compounds per Screening Task. <5%>
                e.g. 10% means reatin top 10% compounds.
                    10000 means reatin top 10000 compounds.
      -b    The number of Output compounds after Standard docking Task. <4000>
      -c    The number of conformations generated by each ligand in the docking task. <1>
      -e    The number of candidates to IFT/CD/MMGBSA/QMMM. <500>
    
    Job control:
      -H        Host Name of your Queue, defult is CPU.
      -N    The max number of subjobs. <100>
      -G    The number of Glide subjobs. <90>
      -P    The number of Prime subjobs. <10>
      -L    The number of LigPrep subjobs. <30>
      -A    The number of phase subjobs. <10>
      -Q    The number of Qsite subjobs. <10>
      -K    The number of QIKPROP subjobs. <10>
      -M    The number of MACROMODEL subjobs. <10>
      -S        Your Schrodinger path. </home/xiaxichen/hyd/Schrodinger-soft>
    
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49
    • 50
    • 51
    • 52
    • 53
    • 54
    • 55
    • 56
    • 57
    • 58
    • 59
    • 60
    • 61
    • 62
    • 63
    • 64
    • 65
    • 66
    • 67
    • 68
    • 69

    GVSrun通过环境变量compound_library来获取系统中化合物数据库的位置;如果不存在该变量需要利用-d参数给定化合物数据库路径(纯SDF或MAEGZ格式的多个文件,不可套叠文件夹)除此以外,脚本需要知道安装好的薛定谔路径:

    • ~/.bashrc中指定好**$SCHRODINGER**环境变量,这也是正常安装薛定谔已经完成的操作;
    • 也可以修改脚本,直接更改默认的SCHRODINGER路径(脚本第22行修改);
    • 可以使用各种参数进行组合进行所需分子对接的定制,在这里重点介绍Running mode参数(其余参数如上所示就不加赘述了),GVSrun一共拥有12种Running mode分别对应这常见的12种分子对接方案,他们分别是:

    在这里插入图片描述

    • 以及如何制定新的Running_model以Advance为例),每个Running_model其实就是个各项任务的组合,对于新的Running_model制定所需要添加的内容有三个部分:方法注释,运行逻辑, 任务运行代码:
    #===========================
    #添加方法注释
    #===========================
    @Available Running Mode:
         Advance: dock screening.
        "HTVS_Normal+SP_ExtensionA+SP_Enhanced" #任务组合,代码93行
    
    @Screening: Reduce the number of compounds rough docking. Related to -a option.
        HTVS_Normal: The standard HTVS screening.
    
    @Standard_Docking: Standard docking was performed to rank the compounds. Related to -b and -c option.
        SP_ExtensionA: Reward intramolecular ligand hydrogen bonds, and accept halogens as H-bond acceptors.
        SP_Enhanced: Increase the depth of sampling for larger grid or More accurate poses. 
    
    #===========================
    #添加运行逻辑
    #===========================
    Parse_Running_Mode(){
        # Parse_Running_Mode Running_Mode
        Running_Mode=$1
        if [ $Running_Mode == "Normal" ]; then
            Pipeline="RDL+HTVS_Normal+QIKPROP+R5R+SP_ExtensionA"
        elif [ $Running_Mode == "Prep_Normal" ]; then
            Pipeline="No_Dup+RDL+IONIZE+HTVS_Normal+QIKPROP+R5R+SP_ExtensionA"
        elif [ $Running_Mode == "Normal_MMGBSA" ]; then
            Pipeline="RDL+HTVS_Normal+QIKPROP+R5R+SP_ExtensionA+MMGBSA_EN"
        elif [ $Running_Mode == "Reference" ]; then
            Pipeline="HTVS_REF+SP_REF+QIKPROP+R5R"
        elif [ $Running_Mode == "Induce_Fit_Screening" ]; then
            Pipeline="IFT_pre+IFT"
        elif [ $Running_Mode == "Cov_Screening" ]; then
            Pipeline="R+RDL+HTVS_Normal+SP_ExtensionA+CD"
        elif [ $Running_Mode == "Fast" ]; then
            Pipeline="HTVS_Normal+SP_Normal"
        elif [ $Running_Mode == "QM_Screening" ]; then
            Pipeline="RDL+HTVS_Normal+QIKPROP+R5R+SP_ExtensionA+QM_redock+RMSD"
        elif [ $Running_Mode == "Shape_Screening" ];then
            Pipeline="PhaseShape+HTVS_local+SP_local"
        elif [ $Running_Mode == "Local" ];then
            Pipeline="HTVS_local+SP_local+MMGBSA_OPT"
        elif [ $Running_Mode == "Advance" ];then
            Pipeline="HTVS_Normal+SP_ExtensionA+SP_Enhanced"
        elif [ $Running_Mode == "Advance_Shape_Screening" ];then
            Pipeline="HTVS_Shape+SP_Shape"
        else 
            Pipeline=$Running_Mode
            Running_Mode="User_Defined"
        fi
    }
    
    #===========================
    #添加任务运行代码
    #===========================
    HTVS_Normal(){
        KEEP_NUM=`parse_num_or_percentage ${HTVS_out_num}`
    cat<<HTVS >> ${Inp_Name}
    [STAGE:PRE_DOCK_HTVS]
        STAGECLASS   gencodes.RecombineStage
        INPUTS   ${Ligand_name},
        OUTPUTS   HTVS_RECOMBINE_OUT,
        NUMOUT   njobs
        OUTFORMAT   maegz
        MIN_SUBJOB_STS   4000
        MAX_SUBJOB_STS   40000
        GENCODES   YES
        OUTCOMPOUNDFIELD   s_vsw_compound_code
        OUTVARIANTFIELD   s_vsw_variant
        UNIQUEFIELD   NONE
    [STAGE:DOCK_HTVS]
        STAGECLASS   glide.DockingStage
        INPUTS   HTVS_RECOMBINE_OUT, GRID
        OUTPUTS   HTVS_OUT,
        RECOMBINE   NO
        PRECISION   HTVS
        UNIQUEFIELD   s_vsw_compound_code
        ${KEEP_NUM}
    
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49
    • 50
    • 51
    • 52
    • 53
    • 54
    • 55
    • 56
    • 57
    • 58
    • 59
    • 60
    • 61
    • 62
    • 63
    • 64
    • 65
    • 66
    • 67
    • 68
    • 69
    • 70
    • 71
    • 72
    • 73
    • 74
    • 75
    • 76
    • 77

    五、相关文件

    grid_gen_from_pv:
    链接:https://pan.baidu.com/s/1ymh7e2eesWmU5yMFgxK6mA?pwd=6666
    GVSrun:
    https://pan.baidu.com/s/1wTLhDBruY5F0iA-ndx944A?pwd=6666

    六、时间预估

    流程名速度1万个分子所需时长
    ligprep1.5 秒/分子4.1 小时/1万分子
    glide_HTVS0.5 秒/分子1.5 小时/1万分子
    glide_SP9 秒/分子25 小时/1万分子
    glide_XP100 秒/分子280 小时/1万分子

    六、 特别感谢

    感谢上海科技大学的王林写的GVSrun脚本并将其开源,用起来非常方便。附上他的github主页:
    https://github.com/Wang-Lin-boop

  • 相关阅读:
    UGUI学习笔记(七)自己实现圆形图片组件
    Numpy入门[18]——数组读写
    链表大总结(王道加红皮书)
    Golang Copy()方法学习
    用CSS实现对(√)和错(×)
    新手使用vue引入cesium
    数据屏蔽:静态与动态
    【一起来学C++】————(10)STL之string容器
    大数据:Sqoop 简介与安装
    成都瀚网科技有限公司:抖店怎么开通直播?
  • 原文地址:https://blog.csdn.net/recher_He1107/article/details/126537558