• Installing Foldseek on Linux and extracting 3Di tokens and 3Di embeddings from protein PDB structures


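    0. Overview:

    Foldseek is a tool developed by Martin Steinegger (author of MMseqs2/Linclust) at Seoul National University in South Korea for quickly searching large protein structure databases for proteins with similar structures. It can compute the structural similarity between two proteins, align protein structures, and, combined with MMseqs2/Linclust, cluster proteins by structure.

    The goal of this post is to use Foldseek to convert a protein's PDB structure into a 3Di sequence written in the 3Di alphabet, and at the same time obtain the 3Di embedding matrix for the protein sequence.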

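    1. Downloading and installing Foldseek:

    Following the installation instructions on GitHub (https://github.com/steineggerlab/foldseek), first determine the CPU architecture of the Linux machine, then run the matching download and install command from the list below.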

    # Linux AVX2 build (check using: cat /proc/cpuinfo | grep avx2)
    wget https://mmseqs.com/foldseek/foldseek-linux-avx2.tar.gz; tar xvzf foldseek-linux-avx2.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH
    
    # Linux SSE2 build (check using: cat /proc/cpuinfo | grep sse2)
    wget https://mmseqs.com/foldseek/foldseek-linux-sse2.tar.gz; tar xvzf foldseek-linux-sse2.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH
    
    # Linux ARM64 build
    wget https://mmseqs.com/foldseek/foldseek-linux-arm64.tar.gz; tar xvzf foldseek-linux-arm64.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH
    
    # MacOS
    wget https://mmseqs.com/foldseek/foldseek-osx-universal.tar.gz; tar xvzf foldseek-osx-universal.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH
    
    # Conda installer (Linux and macOS)
    conda install -c conda-forge -c bioconda foldseek
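
    The architecture check mentioned in the comments above (looking for avx2 or sse2 in /proc/cpuinfo) can also be scripted. The following is a minimal sketch rather than an official installer: `pick_foldseek_archive` is a hypothetical helper name, the archive names are copied from the wget commands above, and matching ARM64 machines by the strings `aarch64` or `arm64` is an assumption about how the machine type is reported.

```python
def pick_foldseek_archive(cpuinfo_text, machine):
    """Pick a Linux archive name from /proc/cpuinfo text and the machine type.

    Hypothetical helper; the archive names come from the download commands above.
    """
    # Assumption: ARM64 Linux machines report themselves as aarch64 or arm64.
    if machine.lower() in ("aarch64", "arm64"):
        return "foldseek-linux-arm64.tar.gz"
    # CPU flags appear as whitespace-separated words in /proc/cpuinfo.
    flags = set(cpuinfo_text.lower().split())
    if "avx2" in flags:
        return "foldseek-linux-avx2.tar.gz"
    if "sse2" in flags:
        return "foldseek-linux-sse2.tar.gz"
    raise RuntimeError("no matching Foldseek build for this CPU")
```

    On a Linux machine, the function can be called with the contents of /proc/cpuinfo and the value of `platform.machine()`; the returned name is then appended to https://mmseqs.com/foldseek/ as in the commands above.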
    

    2. Converting a PDB file to 3Di with Foldseek

    From the directory that contains the foldseek binary, run:
    Command: ./foldseek structureto3didescriptor --help

    usage: foldseek structureto3didescriptor  ...   [options]
     By Martin Steinegger 
    options: misc:                         
     --mask-bfactor-threshold FLOAT mask residues for seeding if b-factor < thr [0,100] [0.000]
     --file-include STR             Include file names based on this regex [.*]
     --file-exclude STR             Exclude file names based on this regex [^$]
    common:                       
     --threads INT                  Number of CPU-cores used (all by default) [40]
     -v INT                         Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]
    expert:                       
     --chain-name-mode INT          Add chain to name:
                                    0: auto
                                    1: always add
                                     [0]
     --write-mapping INT            write _mapping file containing mapping from internal id to taxonomic identifier [0]
     --coord-store-mode INT         Coordinate storage mode: 
                                    1: C-alpha as float
                                    2: C-alpha as difference (uint16_t) [2]
     --write-lookup INT             write .lookup file containing mapping from internal id, fasta id and file number [1]
     --tar-include STR              Include file names based on this regex [.*]
     --tar-exclude STR              Exclude file names based on this regex [^$]
    
    examples:
     Convert PDB/mmCIF/tar[.gz] files to a db
    
    references:
     - van Kempen, M., Kim, S.S., Tumescheit, C., Mirdita, M., Lee, J., Gilchrist, C.L.M., Söding, J., and Steinegger, M. Fast and accurate protein structure search with Foldseek. Nature Biotechnology, doi:10.1038/s41587-023-01773-0 (2023)
    

    As the help output above shows, a single protein PDB file can be converted to 3Di with the command:
    ./foldseek structureto3didescriptor prot.pdb res_prot.3di --threads 1 (converts prot.pdb to res_prot.3di using one thread)
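
    When many structures need converting, the same command can be driven from Python. The sketch below uses only the subcommand and the `--threads` option shown above; the function names and the default `./foldseek` path are illustrative assumptions, not part of Foldseek.

```python
import subprocess


def build_3di_command(pdb_path, out_path, threads=1, foldseek_bin="./foldseek"):
    """Assemble the structureto3didescriptor call shown above as an argument list."""
    return [
        foldseek_bin,
        "structureto3didescriptor",
        str(pdb_path),
        str(out_path),
        "--threads",
        str(threads),
    ]


def convert_to_3di(pdb_path, out_path, threads=1, foldseek_bin="./foldseek"):
    """Run Foldseek; raises CalledProcessError if it exits with a non-zero status."""
    command = build_3di_command(pdb_path, out_path, threads, foldseek_bin)
    subprocess.run(command, check=True)
```

    Passing the arguments as a list avoids shell quoting problems with unusual file names, and `check=True` makes a failed conversion stop the script instead of passing silently.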

    Example result:
    The human protein A1IGU5.pdb is converted to A1IGU5.3di; part of the result is shown below:
    [Figure: partial contents of A1IGU5.3di]

    3. Extracting the 3Di token and 3Di embedding from the 3Di output

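    import numpy as np


    def deal3DiRes(threeDifile):
        """Return the 3Di token string and the n x 10 embedding matrix of the first record."""
        with open(threeDifile) as inF:
            for line in inF:
                fields = line.strip().split("\t")

                # 3Di token: the second-to-last tab-separated column
                token_3di = fields[-2].strip()

                # 3Di embedding: the last column, a comma-separated list of floats
                matrix_3di = np.array(fields[-1].strip().split(","), dtype=float)
                # Each amino acid is represented by a vector of length 10,
                # so the flat list is reshaped into an n x 10 matrix.
                matrix_3di_reshape = matrix_3di.reshape(-1, 10)

                # Only the first record in the file is used.
                return token_3di, matrix_3di_reshape

        # Reached only when the file contains no lines at all.
        raise ValueError(f"no records found in {threeDifile}")


    if __name__ == "__main__":
        res = deal3DiRes(threeDifile="../VirusHumanProt3DiFiles/Human3Di/A1IGU5.3di")
        print(res[0])  # 3Di sequence
        print(res[1])  # 3Di matrix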
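
    The function above reads only the first line of the file. If a .3di file holds more than one record (an assumption, since the post only shows a single protein), a variant that keeps every line may be more convenient. The sketch below is dependency-free and returns plain Python lists; `parse_3di_records` is a hypothetical name, and it relies on the same column layout as the code above (token in the second-to-last column, embedding in the last).

```python
def parse_3di_records(lines, width=10):
    """Parse every record; each one becomes (token, rows) with rows of `width` floats."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        token = fields[-2].strip()
        values = [float(x) for x in fields[-1].strip().split(",")]
        if len(values) % width != 0:
            raise ValueError("embedding length is not a multiple of %d" % width)
        # Cut the flat list into consecutive rows of `width` values, one per residue.
        rows = [values[i:i + width] for i in range(0, len(values), width)]
        records.append((token, rows))
    return records
```

    An open file handle can be passed directly as `lines`, and `np.asarray(rows)` turns each list of rows into the same n x 10 matrix that `reshape(-1, 10)` produces.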
    

    The result for A1IGU5.3di is as follows:

    DDDDDDDDDPDPPPPVVVVVVVLLVVLLVQLVVLLVVVLVVLVVLLVLLCCVVPQLLVLVVVDDPVLSCLLCPVSVLVSVLSVVLSVQLVVLVVPSVCNLLSNLVSLVVCLVSLLVRLLRLLLSLVVNVVSLVVQVVVVVSVVSQQVSQCVSPVVCPPVRSVSSSCSSVVVLVCPLVSLVSSLVSDDCPDPSNVSSVVSSVSSVVSNVSSVLSSLLSVLLVVFLPPDPDDPVVVVVPDDPVVVVLVVQLVVLVVCCVVVVDPADDDPLVVVLVVLLVVLLVVLVVQLVVLVVVLVVLVVVLVDQPLPDDPVPPDAPVPLVSVLSVCCSVPLSVVLSVLCCVQQNVLSVVLNSVSSSVVRLVVVLSSLSSLQVVQVVCCVVPVDDDPVSVVSNVSNVSSVVSSSVVSVVSSVVSVVSVVVSVVSVVVSVVVSVVVSVVSVVVSLVPHPCSPPDPVRVVVVVVVVVVVVVVVVVVVVVVVLVPDDFPDLDDDDVPCPVQVVVVCVVANLQFKKFQRAFDDDDDDQDDGDHGGQIWGFPACADPVRHNQWTWIDSSPDIGIDGPVRIDRRDDDPPVVNVCVVVPDDDDDDDDDDDDDDDDDDDDDPPFKKFFCDWDDDDDPQADTHHHRAIKDFPACAPPVRHNQWTFIDGPNDTHIDGSVRMDTDDDDDPDDDDDDD
    [[ 2.629e-316  1.156e-316  2.629e-316 ...  1.482e-323 -1.661e+001
       2.872e+013]
     [ 7.838e-001  6.043e-001  7.838e-001 ...  3.854e+000  1.000e+000
       6.931e-001]
     [ 6.043e-001  1.280e-001  6.043e-001 ...  3.703e+000  1.000e+000
       6.931e-001]
     ...
     [ 3.958e-001  3.127e-001 -1.000e+000 ...  3.815e+000 -1.000e+000
      -6.931e-001]
     [ 3.945e-001  3.958e-001 -1.000e+000 ...  3.853e+000 -1.000e+000
      -6.931e-001]
     [ 0.000e+000  0.000e+000  0.000e+000 ...  0.000e+000  0.000e+000
       0.000e+000]]
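
    Because every amino acid contributes one 3Di letter and one 10-value row, the length of the token string should equal the number of rows in the matrix. The first row printed above also contains magnitudes such as 2.629e-316, which are unlikely to be meaningful features; the post does not explain them, so flagging such rows as suspicious is an assumption. A minimal check along those lines, with the hypothetical name `check_3di_result` and an arbitrary threshold `tiny`:

```python
def check_3di_result(token, rows, width=10, tiny=1e-300):
    """Return indices of rows holding non-zero values with magnitude below `tiny`.

    Raises ValueError if the matrix shape does not match the token length.
    """
    if len(rows) != len(token):
        raise ValueError("row count %d != token length %d" % (len(rows), len(token)))
    suspicious = []
    for index, row in enumerate(rows):
        if len(row) != width:
            raise ValueError("row %d has %d values, expected %d" % (index, len(row), width))
        # Assumption: non-zero values this close to zero are not real features.
        if any(value != 0.0 and abs(value) < tiny for value in row):
            suspicious.append(index)
    return suspicious
```

    The function accepts either nested lists or the NumPy matrix returned by `deal3DiRes`, since it only iterates over rows and values.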
    

    References:

    [1]. van Kempen M, Kim S, Tumescheit C, Mirdita M, Lee J, Gilchrist C, Söding J, and Steinegger M. Fast and accurate protein structure search with Foldseek. Nature Biotechnology, doi:10.1038/s41587-023-01773-0 (2023)
    [2]. Barrio-Hernandez I, Yeo J, Jänes J, Mirdita M, Gilchrist LMC, Wein T, Varadi M, Velankar S, Beltrao P and Steinegger M. Clustering predicted structures at the scale of the known protein universe. Nature, doi:10.1038/s41586-023-06510-w (2023)
    [3]. https://github.com/steineggerlab/foldseek

  • Original article: https://blog.csdn.net/weixin_44065416/article/details/134535506