• Batch-downloading protein .pdb files by UniProt ID / PDB ID


    1. Batch-downloading protein .pdb files by UniProt ID

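    A single UniProt ID may map to several NCBI IDs, but AlphaFold provides one predicted structure file per UniProt ID, so the .pdb files can be fetched in batch with the following code: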

    import requests

    # Certificate verification stays enabled (the requests default). Passing
    # verify=False would trigger InsecureRequestWarning, which does not stop the
    # download but is unpleasant to look at and would then have to be silenced
    # with urllib3.disable_warnings().
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0'}

    # Version suffix of the AlphaFold DB model files. 'v1' was current when this
    # post was written; later database releases use a higher number (e.g. 'v4'),
    # so update it if every download fails.
    model_version = 'v1'


    def read_file(file_name):
        """Collect the IDs from the header lines ('>ID') of the input file."""
        pro_swissProt = []
        with open(file_name, 'r') as fp:
            for line in fp:
                if line.startswith('>'):  # header lines begin with '>'
                    pro_swissProt.append(line[1:].strip())
        return pro_swissProt


    file1 = '/AD/all1.csv'
    ID = read_file(file1)
    not_exist_list = []
    for j, i in enumerate(ID, start=1):
        print(j)
        print(i)
        url = 'https://alphafold.ebi.ac.uk/files/AF-' + i + '-F1-model_' + model_version + '.pdb'
        print(url)
        r = requests.get(url, headers=headers, timeout=30)
        # A missing entry comes back as an error document (its body starts with
        # '<?') instead of a structure, so record the ID and do not save it.
        if r.status_code != 200 or r.text.startswith('<?'):
            print(i + ' has no .pdb file.')
            not_exist_list.append(i)
            continue
        with open('/AD/Information/PDB/' + i + '.pdb', 'w') as files:
            for lines in r.text.splitlines():
                files.write(lines)
                files.write('\n')

    # IDs whose .pdb file was not found; look these up by hand on the website,
    # since some may simply have been missed.
    print(not_exist_list)
    print(len(not_exist_list))
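The URL construction and the "was a structure actually returned?" check can be pulled out into small helpers that are easy to try without touching the network. This is only a sketch: the names `alphafold_url` and `looks_like_pdb` are made up for illustration, and treating "contains at least one ATOM record" as the sign of a usable file is an assumption, not something the AlphaFold site documents.

```python
def alphafold_url(uniprot_id, model_version='v1'):
    """Build the download URL used in the listing above (hypothetical helper)."""
    return ('https://alphafold.ebi.ac.uk/files/AF-' + uniprot_id
            + '-F1-model_' + model_version + '.pdb')


def looks_like_pdb(text):
    """Return True if any line is an ATOM coordinate record.

    Assumption: a usable structure file has at least one ATOM line, while an
    error document returned for a missing ID has none.
    """
    return any(line.startswith('ATOM') for line in text.splitlines())


# Offline illustration with hand-written strings (not real server output).
good_body = 'HEADER\nATOM      1  N   MET A   1\nEND\n'
bad_body = "<?xml version='1.0' encoding='UTF-8'?><Error></Error>"
print(alphafold_url('Q8BH75'))
print(looks_like_pdb(good_body), looks_like_pdb(bad_body))
```

In the download loop, `looks_like_pdb(r.text)` could be used before writing the file, so that an error page is never saved under a `.pdb` name.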

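    file1 is laid out as follows (despite the .csv extension it is FASTA-style: a '>' header line carrying the UniProt ID, followed by the sequence):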

    >Q8BH75
    MGYDVTRFQGDVDEDLICPICSGVLEEPVQAPHCEHAFCNACITQWFSQQQTCPVDRSVVTVAHLRPVPRIMRNMLSKLQIACDNAVFGCSAVVRLDNLMSHLSDCEHNPKRPVTCEQGCGLEMPKDELPNHNCIKHLRSVVQQQQSRIAELEKTSAEHKHQLAEQKRDIQLLKAYMRAIRSVNPNLQNLEETIEYNEILEWVNSLQPARVTRWGGMISTPDAVLQAVIKRSLVESGCPASIVNELIENAHERSWPQGLATLETRQMNRRYYENYVAKRIPGKQAVVVMACENQHMGDDMVQEPGLVMIFAHGVEEI
    >P06727
    MFLKAVVLTLALVAVAGARAEVSADQVATVMWDYFSQLSNNAKEAVEHLQKSELTQQLNALFQDKLGEVNTYAGDLQKKLVPFATELHERLAKDSEKLKEEIGKELEELRARLLPHANEVSQKIGDNLRELQQRLEPYADQLRTQVSTQAEQLRRQLTPYAQRMERVLRENADSLQASLRPHADELKAKIDQNVEELKGRLTPYADEFKVKIDQTVEELRRSLAPYAQDTQEKLNHQLEGLTFQMKKNAEELKARISASAEELRQRLAPLAEDVRGNLRGNTEGLQKSLAELGGHLDQQVEEFRRRVEPYGENFNKALVQQMEQLRQKLGPHAGDVEGHLSFLEKDLRDKVNSFFSTFKEKESQDKTLSLPELEQQQEQQQEQQQEQVQMLAPLES
    >Q60770
    MAPPVSERGLKSVVWRKIKTAVFDDCRKEGEWKIMLLDEFTTKLLSSCCKMTDLLEEGITVIENIYKNREPVRQMKALYFISPTPKSVDCFLRDFGSKSEKKYKAAYIYFTDFCPDSLFNKIKASCSKSIRRCKEINISFIPQESQVYTLDVPDAFYYCYSPDPSNASRKEVVMEAMAEQIVTVCATLDENPGVRYKSKPLDNASKLAQLVEKKLEDYYKIDEKGLIKGKTQSQLLIIDRGFDPVSTVLHELTFQAMAYDLLPIENDTYKYKTDGKEKEAVLEEDDDLWVRVRHRHIAVVLEEIPKLMKEISSTKKATEGKTSLSALTQLMKKMPHFRKQISKQVVHLNLAEDCMNKFKLNIEKLCKTEQDLALGTDAEGQRVKDSMLVLLPVLLNKNHDNCDKIRAVLLYIFGINGTTEENLDRLIHNVKIEDDSDMIRNWSHLGVPIVPPSQQAKPLRKDRSAEETFQLSRWTPFIKDIMEDAIDNRLDSKEWPYCSRCPAVWNGSGAVSARQKPRTNYLELDRKNGSRLIIFVIGGITYSEMRCAYEVSQAHKSCEVIIGSTHILTPRKLLDDIKMLNKSKDKVSFKDE
    >P70452
    MRDRTHELRQGDNISDDEDEVRVALVVHSGAARLGSPDDEFFQKVQTIRQTMAKLESKVRELEKQQVTILATPLPEESMKQGLQNLREEIKQLGREVRAQLKAIEPQKEEADENYNSVNTRMKKTQHGVLSQQFVELINKCNSMQSEYREKNVERIRRQLKITNAGMVSDEELEQMLDSGQSEVFVSNILKDTQVTRQALNEISARHSEIQQLERSIRELHEIFTFLATEVEMQGEMINRIEKNILSSADYVERGQEHVKIALENQKKARKKKVMIAICVSVTVLILAVIIGITITVG
    >P63044
    MSATAATVPPAAPAGEGGPPAPPPNLTSNRRLQQTQAQVDEVVDIMRVNVDKVLERDQKLSELDDRADALQAGASQFETSAAKLKRKYWWKNLKMMIILGVICAIILIIIIVYFST
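Since the input is FASTA-style, the IDs can also be read together with their sequences, which helps when checking a downloaded structure against the sequence it was requested for. The following is a minimal standard-library sketch, not the parser used above: `parse_fasta` is a hypothetical name, and the sequences in `example` are shortened fragments used purely for illustration.

```python
def parse_fasta(lines):
    """Return {ID: sequence} from FASTA-style lines (hypothetical helper).

    The ID is the first whitespace-separated token after '>'; sequence lines
    that follow a header are concatenated, so wrapped sequences also work.
    """
    records = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            header = line[1:].split()
            current_id = header[0] if header else None
            if current_id is not None:
                records[current_id] = ''
        elif current_id is not None:
            records[current_id] += line
    return records


# Shortened, illustrative input; a real file would be passed as open(path).
example = [
    '>P63044\n',
    'MSATAATVPPAAPAGEGG\n',
    'PPAPPPNLTSNRRLQ\n',
    '>Q8BH75\n',
    'MGYDVTRFQGDV\n',
]
records = parse_fasta(example)
for uniprot_id, sequence in records.items():
    print(uniprot_id, len(sequence))
```

Passing an open file object works as well, because the function only iterates over lines.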

    2. Downloading .pdb files from RCSB by PDB ID

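    Replace the URL in the first code listing with the line below (the input file should then list PDB IDs instead of UniProt IDs):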

    url = 'http://www.rcsb.org/pdb/files/'+i+'.pdb'
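If a list mixes both kinds of identifiers, the URL could be chosen per ID rather than by editing the script. The sketch below rests on a heuristic that should be checked against your own data: classic PDB IDs are four characters beginning with a digit, and anything else is treated as a UniProt ID. The function name `structure_url` and the sample ID `1ABC` are placeholders for illustration; the two URL templates are the ones used in this post.

```python
import re

# Heuristic assumption: a classic PDB ID is a digit followed by three
# alphanumeric characters; everything else is treated as a UniProt ID.
PDB_ID_PATTERN = re.compile(r'[0-9][A-Za-z0-9]{3}')


def structure_url(identifier, model_version='v1'):
    """Pick the RCSB or AlphaFold URL template for one ID (hypothetical helper)."""
    if PDB_ID_PATTERN.fullmatch(identifier):
        return 'http://www.rcsb.org/pdb/files/' + identifier + '.pdb'
    return ('https://alphafold.ebi.ac.uk/files/AF-' + identifier
            + '-F1-model_' + model_version + '.pdb')


# '1ABC' is a placeholder PDB-style ID, not a reference to a specific entry.
for identifier in ['1ABC', 'P63044']:
    print(identifier, structure_url(identifier))
```

The returned URL can be passed to `requests.get` exactly as in the first listing.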

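    PS: I have recently been learning how to process structures with DSSP but have made no progress so far, and nobody I know has a Linux installation package or a tutorial for it.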


  • Original article: https://blog.csdn.net/Daisy4/article/details/126088485