• python运行hhblits二进制命令的包装器类


    hhblits 是 HMM-HMM(Hidden Markov Model to Hidden Markov Model)比对方法的一部分,也是 HMMER 软件套件中的工具之一。与 hhsearch 类似,hhblits 也用于进行高效的蛋白质序列比对,特别擅长于检测远缘同源性。

    hh-suite 源码 https://github.com/soedinglab/hh-suite

    以下是 hhblits 的一些主要特点和用途:

    1. HMM-HMM比对: hhblits 使用隐藏马尔可夫模型(HMM)来表示蛋白质家族的模型,从而能够更全面地考虑多序列信息。这使得它相对于传统的序列-序列比对方法更能捕捉蛋白质家族的模式和结构。

    2. 迭代比对: hhblits 支持迭代比对的策略,通过在每一轮中更新模型,逐渐提高比对的灵敏度。这种策略对于发现相对远缘同源蛋白质非常有用。

    3. 远缘同源性检测: 与其他 HMM-HMM比对方法类似,hhblits 被设计用于检测远缘同源蛋白质,即那些在序列上相对较远但在结构和功能上相似的蛋白质。

    4. 数据库搜索: 用户可以使用 hhblits 在大型蛋白质数据库中搜索与给定蛋白质序列相似的蛋白质。

    5. 灵敏度和特异性: hhblits 的设计目标是在保持高灵敏度的同时,降低假阳性的比对,以提高比对的特异性。

    1. import glob
    2. import os
    3. import subprocess
    4. from typing import Any, List, Mapping, Optional, Sequence
    5. from absl import logging
    6. import contextlib
    7. import shutil
    8. import tempfile
    9. import time
    10. _HHBLITS_DEFAULT_P = 20
    11. _HHBLITS_DEFAULT_Z = 500
    12. @contextlib.contextmanager
    13. def tmpdir_manager(base_dir: Optional[str] = None):
    14. """Context manager that deletes a temporary directory on exit."""
    15. tmpdir = tempfile.mkdtemp(dir=base_dir)
    16. try:
    17. yield tmpdir
    18. finally:
    19. shutil.rmtree(tmpdir, ignore_errors=True)
    20. @contextlib.contextmanager
    21. def timing(msg: str):
    22. logging.info('Started %s', msg)
    23. tic = time.time()
    24. yield
    25. toc = time.time()
    26. logging.info('Finished %s in %.3f seconds', msg, toc - tic)
    27. class HHBlits:
    28. """Python wrapper of the HHblits binary."""
    29. def __init__(self,
    30. *,
    31. binary_path: str,
    32. databases: Sequence[str],
    33. n_cpu: int = 4,
    34. n_iter: int = 3,
    35. e_value: float = 0.001,
    36. maxseq: int = 1_000_000,
    37. realign_max: int = 100_000,
    38. maxfilt: int = 100_000,
    39. min_prefilter_hits: int = 1000,
    40. all_seqs: bool = False,
    41. alt: Optional[int] = None,
    42. p: int = _HHBLITS_DEFAULT_P,
    43. z: int = _HHBLITS_DEFAULT_Z):
    44. """Initializes the Python HHblits wrapper.
    45. Args:
    46. binary_path: The path to the HHblits executable.
    47. databases: A sequence of HHblits database paths. This should be the
    48. common prefix for the database files (i.e. up to but not including
    49. _hhm.ffindex etc.)
    50. n_cpu: The number of CPUs to give HHblits.
    51. n_iter: The number of HHblits iterations.
    52. e_value: The E-value, see HHblits docs for more details.
    53. maxseq: The maximum number of rows in an input alignment. Note that this
    54. parameter is only supported in HHBlits version 3.1 and higher.
    55. realign_max: Max number of HMM-HMM hits to realign. HHblits default: 500.
    56. maxfilt: Max number of hits allowed to pass the 2nd prefilter.
    57. HHblits default: 20000.
    58. min_prefilter_hits: Min number of hits to pass prefilter.
    59. HHblits default: 100.
    60. all_seqs: Return all sequences in the MSA / Do not filter the result MSA.
    61. HHblits default: False.
    62. alt: Show up to this many alternative alignments.
    63. p: Minimum Prob for a hit to be included in the output hhr file.
    64. HHblits default: 20.
    65. z: Hard cap on number of hits reported in the hhr file.
    66. HHblits default: 500. NB: The relevant HHblits flag is -Z not -z.
    67. Raises:
    68. RuntimeError: If HHblits binary not found within the path.
    69. """
    70. self.binary_path = binary_path
    71. self.databases = databases
    72. self.n_cpu = n_cpu
    73. self.n_iter = n_iter
    74. self.e_value = e_value
    75. self.maxseq = maxseq
    76. self.realign_max = realign_max
    77. self.maxfilt = maxfilt
    78. self.min_prefilter_hits = min_prefilter_hits
    79. self.all_seqs = all_seqs
    80. self.alt = alt
    81. self.p = p
    82. self.z = z
    83. def query(self, input_fasta_path: str) -> List[Mapping[str, Any]]:
    84. """Queries the database using HHblits."""
    85. with tmpdir_manager() as query_tmp_dir:
    86. a3m_path = os.path.join(query_tmp_dir, 'output.a3m')
    87. db_cmd = []
    88. for db_path in self.databases:
    89. db_cmd.append('-d')
    90. db_cmd.append(db_path)
    91. cmd = [
    92. self.binary_path,
    93. '-i', input_fasta_path,
    94. '-cpu', str(self.n_cpu),
    95. '-oa3m', a3m_path,
    96. '-o', '/dev/null',
    97. '-n', str(self.n_iter),
    98. '-e', str(self.e_value),
    99. '-maxseq', str(self.maxseq),
    100. '-realign_max', str(self.realign_max),
    101. '-maxfilt', str(self.maxfilt),
    102. '-min_prefilter_hits', str(self.min_prefilter_hits)]
    103. if self.all_seqs:
    104. cmd += ['-all']
    105. if self.alt:
    106. cmd += ['-alt', str(self.alt)]
    107. if self.p != _HHBLITS_DEFAULT_P:
    108. cmd += ['-p', str(self.p)]
    109. if self.z != _HHBLITS_DEFAULT_Z:
    110. cmd += ['-Z', str(self.z)]
    111. cmd += db_cmd
    112. print("cmd:",' '.join(cmd))
    113. logging.info('Launching subprocess "%s"', ' '.join(cmd))
    114. process = subprocess.Popen(
    115. cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    116. with timing('HHblits query'):
    117. stdout, stderr = process.communicate()
    118. retcode = process.wait()
    119. if retcode:
    120. # Logs have a 15k character limit, so log HHblits error line by line.
    121. logging.error('HHblits failed. HHblits stderr begin:')
    122. for error_line in stderr.decode('utf-8').splitlines():
    123. if error_line.strip():
    124. logging.error(error_line.strip())
    125. logging.error('HHblits stderr end')
    126. raise RuntimeError('HHblits failed\nstdout:\n%s\n\nstderr:\n%s\n' % (
    127. stdout.decode('utf-8'), stderr[:500_000].decode('utf-8')))
    128. with open(a3m_path) as f:
    129. a3m = f.read()
    130. raw_output = dict(
    131. a3m=a3m,
    132. output=stdout,
    133. stderr=stderr,
    134. n_iter=self.n_iter,
    135. e_value=self.e_value)
    136. return [raw_output]
    137. if __name__ == "__main__":
    138. hhblits_binary_path = "/home/zheng/software/hhsuite-3.3.0-SSE2-Linux/bin/hhblits"
    139. input_fasta_path = "/home/zheng/test/Q94K49.fasta"
    140. #hhsuite数据库下载地址:https://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/
    141. scop70_database_path = "/home/zheng/database/scop70_1.75_hhsuite3/scop70_1.75"
    142. pdb70_database_path = "/home/zheng/database/pdb70_from_mmcif_latest/pdb70"
    143. ## 单一数据库
    144. #hhblits_runner = HHBlits(binary_path = hhblits_binary_path,
    145. # databases = [scop70_database_path])
    146. ## 多个数据库
    147. database_lst = [scop70_database_path, pdb70_database_path]
    148. hhblits_runner = HHBlits(binary_path = hhblits_binary_path,
    149. databases = database_lst)
    150. result = hhblits_runner.query(input_fasta_path)
    151. print(len(result))
    152. print(result[0]['a3m'])
    153. #print(result)

  • 相关阅读:
    如何骚操作通过科目一斩获高分,python带你了解~
    【接口测试】HTTP接口详细验证清单
    One bite of Stream(9)
    Node学习笔记之MySQL基本使用
    MyBatis框架使用过程中出现的问题以及解决方案
    Nexus-3.41.1安装
    Java进阶API第四章
    ICMP重定向(ICMP redirect)实验分析
    数据结构与算法-图
    深度观察2024中国系统架构师大会(SACC)
  • 原文地址:https://blog.csdn.net/qq_27390023/article/details/134526554