• 第四次考核 Jimmy 学徒考核 Linux安装软件 rnaseq上游分析-2 ascp kingfisher数据下载ena


     1

    第四次考核 Jimmy 学徒考核 Linux安装软件 rnaseq上游分析_YoungLeelight的博客-CSDN博客添加命令 export PATH=$PATH:/path/to/bwa-0.7.12/bin # Add bwa to your PATH by editing ~/.bashrc file (or .bash_profile or .profile file) 在全新服务器配置转录组测序数据处理环境 1.2 配置rna转录组环境1.3 下载并且整理数据库文件数据下载也是老规矩,利用prefetch下载SRA数据,准备进行转录组数据分析。在GSE数据集对应网页,点击SRA_Rhttps://blog.csdn.net/qq_52813185/article/details/128172862?csdn_share_tail=%7B%22type%22%3A%22blog%22%2C%22rType%22%3A%22article%22%2C%22rId%22%3A%22128172862%22%2C%22source%22%3A%22qq_52813185%22%7D

    01-rna-seq从头开始 卖萌哥 Linux生信技能树Linux安装软件 Linux实战RNASEQ上游_YoungLeelight的博客-CSDN博客

    1.首先构建项目所需目录,这样比较哦清晰。共分为四个目录,

    2.原始数据上传测序数据
    或者自己通过kingfisher下载sra原始数据或者 fastq.gz数据

    下载公共测序数据的另一种姿势(kingfisher) - 简书

    生物信息常见文件的格式以及查看方式 | Public Library of Bioinformatics  

    高速快速下载基因组ref文件

    查看文件下载是否完整

    md5值检查文件完整性,因为原始文件 太大,需要检查数据完成性    md5值给每个文件一个独特的id,根据id是否相等来检查文件完整性cd  01raw_data/
     

    1.  md5sum *gz>md5.txt     #给每个 gz文件都生成md5值,,并且输入到 md5.txt 
    2.      ls
    3.  cat md5.txt          #查看内容
    4.  md5sum -c md5.txt   #比较文件自带的md5值和自己生成的md5值是否相等。若相等则文件相等。  -c参数,check一下
    5. md5sum --help

    查看fastq文件内容

      1. # zcat查看gzip压缩的文件
      2. # head -n 8 显示前8行文件内容(前8行代表2条序列)
      3. zcat filename.fq.gz | head -n 8

    3 查看原始数据的质量情况 质控

    1. conda activate rna
    2. #当前目录下
    3. fastqc S*gz
    4. ls -lh
    5. multiqc ./

    fastq.gz为原始的测序数据

    fastqc.zip 为fastqc质控时候产生的数据 

    似乎质量不行啊

    通过前面这些步骤,已经初步判定数据是否合格。如果不合格,那如何修改?或者把不合格的数据扔掉?trimmgalore质控

    4 trim_galore去接头(并行处理) - 简书

    trimmgalore批量质控  双端数据

    2 分离_1和_2文件

    1. (wes) pc@lab-pc:/project/raw_fq$ ls|grep _1.fastq.gz>gz1
    2. (wes) pc@lab-pc:/project/raw_fq$ ls|grep _2.fastq.gz>gz2
    3. (wes) pc@lab-pc:/project/raw_fq$ paste gz1 gz2>config
    4. (wes) pc@lab-pc:/project/raw_fq$ cat config|head

     然后写一个脚本 用于 trimmgalore批量质控  双端数据 同时处理两个参数

    $一定不要忘记加上 

    1. (rna) tree119 21:48:01 ~/pipeline/download/rnaseq/01_rawdata
    2. $ cat trim.sh
    3. outdir=~/pipeline/download/rnaseq/02cleandata/
    4. cat config |while read id
    5. do
    6. arr=${id}
    7. fq1=${arr[0]}
    8. fq2=${arr[1]}
    9. nohup trim_galore -q 25 --phred33 --length 36 --stringency 3 --paired -o $outdir $fq1 $fq2 &
    10. done

    运行即可

    bash trim.sh 

    结果输入到了outdir文件夹下 (这里指的是02cleandata文件夹)

    得到的质控过滤后的文件:

    $ ls -lh  *fq.gz|cut -d" " -f 5-

     

    选取运行过程中其中一对结果展示如下

    1. RUN STATISTICS FOR INPUT FILE: SRR11618782_1.fastq.gz
    2. =============================================
    3. 343183 sequences processed in total
    4. The length threshold of paired-end sequences gets evaluated later on (in the validation step)
    5. Writing report to '/home/st8/ssd2/tree119/pipeline/download/rnaseq/01_rawdata/outdir/SRR11618782_2.fastq.gz_trimming_report.txt'
    6. SUMMARISING RUN PARAMETERS
    7. ==========================
    8. Input filename: SRR11618782_2.fastq.gz
    9. Trimming mode: paired-end
    10. Trim Galore version: 0.6.7
    11. Cutadapt version: 1.18
    12. Number of cores used for trimming: 1
    13. Quality Phred score cutoff: 25
    14. Quality encoding type selected: ASCII+33
    15. Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; auto-detected)
    16. Maximum trimming error rate: 0.1 (default)
    17. Minimum required adapter overlap (stringency): 3 bp
    18. Minimum required sequence length for both reads before a sequence pair gets removed: 36 bp
    19. Output file(s) will be GZIP compressed
    20. Cutadapt seems to be reasonably up-to-date. Setting -j -j 1
    21. Writing final adapter and quality trimmed output to SRR11618782_2_trimmed.fq.gz
    22. >>> Now performing quality (cutoff '-q 25') and adapter trimming in a single pass for the adapter sequence: 'CTGTCTCTTATA' from file SRR11618782_2.fastq.gz <<<
    23. This is cutadapt 1.18 with Python 3.7.15
    24. Command line parameters: -j 1 -e 0.1 -q 25 -O 3 -a CTGTCTCTTATA SRR11618610_2.fastq.gz
    25. Processing reads on 1 core in single-end mode ...
    26. Finished in 2.45 s (18 us/read; 3.36 M reads/minute).
    27. === Summary ===
    28. Total reads processed: 137,124
    29. Reads with adapters: 4,643 (3.4%)
    30. Reads written (passing filters): 137,124 (100.0%)
    31. Total basepairs processed: 13,712,400 bp
    32. Quality-trimmed: 305,580 bp (2.2%)
    33. Total written (filtered): 13,368,511 bp (97.5%)
    34. === Adapter 1 ===
    35. Sequence: CTGTCTCTTATA; Type: regular 3'; Length: 12; Trimmed: 4643 times.
    36. No. of allowed errors:
    37. 0-9 bp: 0; 10-12 bp: 1
    38. Bases preceding removed adapters:
    39. A: 18.6%
    40. C: 33.4%
    41. G: 24.8%
    42. T: 23.1%
    43. none/other: 0.0%
    44. Overview of removed sequences
    45. length count expect max.err error counts
    46. 3 3150 2142.6 0 3150
    47. 4 577 535.6 0 577
    48. 5 174 133.9 0 174
    49. 6 39 33.5 0 39
    50. 7 12 8.4 0 12
    51. 8 15 2.1 0 15
    52. 9 15 0.5 0 13 2
    53. 10 29 0.1 1 13 16
    54. 11 3 0.0 1 2 1
    55. 12 10 0.0 1 8 2
    56. 13 11 0.0 1 11
    57. 14 13 0.0 1 11 2
    58. 15 16 0.0 1 12 4
    59. 16 15 0.0 1 10 5
    60. 17 8 0.0 1 6 2
    61. 18 6 0.0 1 5 1
    62. 19 12 0.0 1 11 1
    63. 20 10 0.0 1 6 4
    64. 21 12 0.0 1 11 1
    65. 22 9 0.0 1 8 1
    66. 23 16 0.0 1 13 3
    67. 24 8 0.0 1 6 2
    68. 25 10 0.0 1 9 1
    69. 26 7 0.0 1 6 1
    70. 27 16 0.0 1 14 2
    71. 28 5 0.0 1 4 1
    72. 29 11 0.0 1 9 2
    73. 30 26 0.0 1 19 7
    74. 31 32 0.0 1 29 3
    75. 32 10 0.0 1 9 1
    76. 33 22 0.0 1 13 9
    77. 34 11 0.0 1 11
    78. 35 17 0.0 1 9 8
    79. 36 7 0.0 1 3 4
    80. 37 3 0.0 1 3
    81. 38 20 0.0 1 19 1
    82. 39 14 0.0 1 13 1
    83. 40 7 0.0 1 4 3
    84. 41 22 0.0 1 20 2
    85. 42 39 0.0 1 37 2
    86. 43 5 0.0 1 4 1
    87. 44 15 0.0 1 15
    88. 45 7 0.0 1 5 2
    89. 46 12 0.0 1 9 3
    90. 47 1 0.0 1 0 1
    91. 48 8 0.0 1 7 1
    92. 49 6 0.0 1 6
    93. 50 2 0.0 1 1 1
    94. 51 3 0.0 1 3
    95. 52 3 0.0 1 0 3
    96. 53 5 0.0 1 3 2
    97. 54 2 0.0 1 0 2
    98. 55 1 0.0 1 1
    99. 57 8 0.0 1 5 3
    100. 58 8 0.0 1 7 1
    101. 59 1 0.0 1 1
    102. 60 11 0.0 1 8 3
    103. 61 26 0.0 1 24 2
    104. 62 5 0.0 1 4 1
    105. 63 6 0.0 1 3 3
    106. 64 8 0.0 1 7 1
    107. 65 1 0.0 1 0 1
    108. 67 3 0.0 1 1 2
    109. 68 3 0.0 1 3
    110. 69 1 0.0 1 0 1
    111. 70 1 0.0 1 0 1
    112. 72 3 0.0 1 0 3
    113. 73 1 0.0 1 0 1
    114. 74 3 0.0 1 0 3
    115. 77 14 0.0 1 2 12
    116. 78 1 0.0 1 1
    117. 79 5 0.0 1 0 5
    118. 81 5 0.0 1 0 5
    119. 84 6 0.0 1 1 5
    120. 86 2 0.0 1 0 2
    121. 88 2 0.0 1 0 2
    122. 90 3 0.0 1 0 3
    123. 91 2 0.0 1 1 1
    124. 92 2 0.0 1 1 1
    125. 93 1 0.0 1 0 1
    126. 95 1 0.0 1 0 1
    127. 98 1 0.0 1 0 1
    128. RUN STATISTICS FOR INPUT FILE: SRR11618610_2.fastq.gz
    129. =============================================
    130. 137124 sequences processed in total
    131. The length threshold of paired-end sequences gets evaluated later on (in the validation step)
    132. Validate paired-end files SRR11618610_1_trimmed.fq.gz and SRR11618610_2_trimmed.fq.gz
    133. file_1: SRR11618610_1_trimmed.fq.gz, file_2: SRR11618610_2_trimmed.fq.gz
    134. >>>>> Now validing the length of the 2 paired-end infiles: SRR11618610_1_trimmed.fq.gz and SRR11618610_2_trimmed.fq.gz <<<<<
    135. Writing validated paired-end Read 1 reads to SRR11618610_1_val_1.fq.gz
    136. Writing validated paired-end Read 2 reads to SRR11618610_2_val_2.fq.gz
    137. Total number of sequences analysed: 137124
    138. Number of sequence pairs removed because at least one read was shorter than the length cutoff (36 bp): 2597 (1.89%)
    139. Deleting both intermediate output files SRR11618610_1_trimmed.fq.gz and SRR11618610_2_trimmed.fq.gz
    140. ====================================================================================================
    141. This is cutadapt 1.18 with Python 3.7.15
    142. Command line parameters: -j 1 -e 0.1 -q 25 -O 3 -a CTGTCTCTTATA SRR11618782_2.fastq.gz
    143. Processing reads on 1 core in single-end mode ...
    144. Finished in 4.63 s (13 us/read; 4.44 M reads/minute).
    145. === Summary ===
    146. Total reads processed: 343,183
    147. Reads with adapters: 31,276 (9.1%)
    148. Reads written (passing filters): 343,183 (100.0%)
    149. Total basepairs processed: 34,318,300 bp
    150. Quality-trimmed: 2,815,627 bp (8.2%)
    151. Total written (filtered): 30,713,243 bp (89.5%)
    152. === Adapter 1 ===
    153. Sequence: CTGTCTCTTATA; Type: regular 3'; Length: 12; Trimmed: 31276 times.
    154. No. of allowed errors:
    155. 0-9 bp: 0; 10-12 bp: 1
    156. Bases preceding removed adapters:
    157. A: 17.2%
    158. C: 45.5%
    159. G: 19.7%
    160. T: 17.6%
    161. none/other: 0.0%
    162. Overview of removed sequences
    163. length count expect max.err error counts
    164. 3 6021 5362.2 0 6021
    165. 4 1325 1340.6 0 1325
    166. 5 542 335.1 0 542
    167. 6 372 83.8 0 372
    168. 7 358 20.9 0 358
    169. 8 348 5.2 0 348
    170. 9 291 1.3 0 283 8
    171. 10 386 0.3 1 281 105
    172. 11 384 0.1 1 305 79
    173. 12 490 0.0 1 387 103
    174. 13 388 0.0 1 331 57
    175. 14 512 0.0 1 406 106
    176. 15 371 0.0 1 323 48
    177. 16 392 0.0 1 307 85
    178. 17 449 0.0 1 370 79
    179. 18 354 0.0 1 272 82
    180. 19 517 0.0 1 422 95
    181. 20 440 0.0 1 371 69
    182. 21 380 0.0 1 294 86
    183. 22 310 0.0 1 249 61
    184. 23 220 0.0 1 199 21
    185. 24 291 0.0 1 255 36
    186. 25 604 0.0 1 510 94
    187. 26 370 0.0 1 306 64
    188. 27 489 0.0 1 411 78
    189. 28 314 0.0 1 265 49
    190. 29 600 0.0 1 491 109
    191. 30 432 0.0 1 382 50
    192. 31 1108 0.0 1 954 154
    193. 32 361 0.0 1 306 55
    194. 33 501 0.0 1 418 83
    195. 34 509 0.0 1 421 88
    196. 35 548 0.0 1 472 76
    197. 36 411 0.0 1 346 65
    198. 37 665 0.0 1 507 158
    199. 38 1168 0.0 1 1053 115
    200. 39 2363 0.0 1 2233 130
    201. 40 276 0.0 1 235 41
    202. 41 537 0.0 1 506 31
    203. 42 278 0.0 1 249 29
    204. 43 452 0.0 1 428 24
    205. 44 368 0.0 1 344 24
    206. 45 60 0.0 1 53 7
    207. 46 74 0.0 1 65 9
    208. 47 129 0.0 1 122 7
    209. 48 225 0.0 1 200 25
    210. 49 366 0.0 1 346 20
    211. 50 163 0.0 1 146 17
    212. 51 43 0.0 1 40 3
    213. 52 26 0.0 1 20 6
    214. 53 79 0.0 1 63 16
    215. 54 14 0.0 1 12 2
    216. 55 80 0.0 1 61 19
    217. 56 35 0.0 1 22 13
    218. 57 152 0.0 1 146 6
    219. 58 184 0.0 1 179 5
    220. 59 121 0.0 1 114 7
    221. 60 242 0.0 1 227 15
    222. 61 796 0.0 1 756 40
    223. 62 146 0.0 1 134 12
    224. 63 64 0.0 1 62 2
    225. 64 146 0.0 1 134 12
    226. 65 66 0.0 1 59 7
    227. 66 20 0.0 1 18 2
    228. 67 104 0.0 1 85 19
    229. 68 65 0.0 1 58 7
    230. 69 16 0.0 1 11 5
    231. 70 11 0.0 1 10 1
    232. 71 11 0.0 1 9 2
    233. 72 19 0.0 1 13 6
    234. 73 17 0.0 1 14 3
    235. 74 40 0.0 1 27 13
    236. 75 21 0.0 1 14 7
    237. 76 9 0.0 1 8 1
    238. 77 18 0.0 1 15 3
    239. 78 13 0.0 1 10 3
    240. 79 16 0.0 1 10 6
    241. 80 11 0.0 1 9 2
    242. 81 9 0.0 1 3 6
    243. 82 14 0.0 1 14
    244. 83 13 0.0 1 12 1
    245. 84 17 0.0 1 15 2
    246. 85 33 0.0 1 30 3
    247. 86 17 0.0 1 16 1
    248. 87 20 0.0 1 20
    249. 88 5 0.0 1 5
    250. 89 13 0.0 1 13
    251. 90 6 0.0 1 5 1
    252. 91 7 0.0 1 6 1
    253. 92 13 0.0 1 2 11
    254. 95 1 0.0 1 0 1
    255. 96 1 0.0 1 0 1
    256. 97 7 0.0 1 1 6
    257. 100 3 0.0 1 0 3


    5.比对 align索引文件和质控好的测序数据进行比对

    下载参考基因组 下载已经建立好索引的基因组

    在全新服务器配置转录组测序数据处理环境

    查看下载的基因组是否正确  查看下载内容时候有误

    md5sum检查完整性 查看gtf文件是否完整

    案例一

    1. md5sum ./* >md5sum.txt
    2. md5sum -c md5sum.txt

    结果如下,说明文件是完整的 

    $ md5sum -c md5sum.txt
    ./CM000663.1: OK
    ./CM000664.1: OK
    ./CM000665.1: OK
    ./CM000666.1: OK
    ./CM000667.1: OK
    ./CM000668.1: OK
    ./CM000669.1: OK
    ./CM000670.1: OK
    ./CM000671.1: OK
    ./CM000672.1: OK
    ./CM000673.1: OK
    ./CM000674.1: OK
    ./CM000675.1: OK
    ./CM000676.1: OK
    ./CM000677.1: OK
    ./CM000678.1: OK
    ./CM000679.1: OK
    ./CM000680.1: OK
    ./CM000681.1: OK
    ./CM000682.1: OK
    ./CM000683.1: OK
    ./CM000684.1: OK
    ./CM000685.1: OK
    ./CM000686.1: OK
    ./GL000191.1: OK
    ./GL000192.1: OK
    ./GL000193.1: OK
    ./GL000194.1: OK
    ./GL000195.1: OK
    ./GL000196.1: OK
    ./GL000197.1: OK
    ./GL000198.1: OK
    ./GL000199.1: OK
    ./GL000200.1: OK
    ./GL000201.1: OK
    ./GL000202.1: OK
    ./GL000203.1: OK
    ./GL000204.1: OK
    ./GL000205.1: OK
    ./GL000206.1: OK
    ./GL000207.1: OK
    ./GL000208.1: OK
    ./GL000209.1: OK
    ./GL000210.1: OK
    ./GL000211.1: OK
    ./GL000212.1: OK
    ./GL000213.1: OK
    ./GL000214.1: OK
    ./GL000215.1: OK
    ./GL000216.1: OK
    ./GL000217.1: OK
    ./GL000218.1: OK
    ./GL000219.1: OK
    ./GL000220.1: OK
    ./GL000221.1: OK
    ./GL000222.1: OK
    ./GL000223.1: OK
    ./GL000224.1: OK
    ./GL000225.1: OK
    ./GL000226.1: OK
    ./GL000227.1: OK
    ./GL000228.1: OK
    ./GL000229.1: OK
    ./GL000230.1: OK
    ./GL000231.1: OK
    ./GL000232.1: OK
    ./GL000233.1: OK
    ./GL000234.1: OK
    ./GL000235.1: OK
    ./GL000236.1: OK
    ./GL000237.1: OK
    ./GL000238.1: OK
    ./GL000239.1: OK
    ./GL000240.1: OK
    ./GL000241.1: OK
    ./GL000242.1: OK
    ./GL000243.1: OK
    ./GL000244.1: OK
    ./GL000245.1: OK
    ./GL000246.1: OK
    ./GL000247.1: OK
    ./GL000248.1: OK
    ./GL000249.1: OK
    ./NC_012920.1: OK
    ./SRR11832836.sra: OK

    1. md5sum *.gz >md5.txt
    2. cat md5.txt
    3. md5sum -c md5.txt

    下载索引文件

    Hisat2官网上人类基因组索引的下载 - 简书

    wget -c 

    在全新服务器配置转录组测序数据处理环境

    对基因组文件解压缩 去掉gz后缀 

    ls *gz |xargs gunzip

    RNA-seq(5):序列比对:Hisat2 - 简书

    对索引文件解压缩 

    1. $ cd reference/index
    2. $ wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg19.tar.gz
    3. $ wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/mm10.tar.gz
    4. # 解压得到两个目录,hg19和mm10
    5. $ tar -zxvf *.tar.gz

    得到8个索引

    基因组注释文件(GFF,GTF)下载的五种方法_白墨石的博客-CSDN博客_基因组注释文件

    hg38的gtf文件下载

    genecode中参考基因组的名字  更新很快!

     新建脚本来运行  制备文件准备工作

    1. 489 ls |grep _1.fq.gz
    2. 490 ls |grep _1.fq.gz >gz1
    3. 491 ls |grep _2.fq.gz >gz2
    4. 492 paste gz1 gz2 >file_for_align
    5. 493 cat file_for_align

     

    新建脚本来运行  失败

    1. vim align.sh
    2. cat file_for_align |while read id
    3. do
    4. arr=${id}
    5. fq1=${arr[0]}
    6. fq2=${arr[1]}
    7. nohup sh -c " hisat2 -p 2 -x ~/pipeline/download/rnaseq/00ref/index_file/grch38/genome -1 ${fq1} -2 ${fq2} 2>${id%%_*}.log | samtools sort -@ 2 -o ${id%%_*}.bam " &
    8. done

     联合多人的代码之后,成功!

    1. ls |grep .fq.gz|cut -d "_" -f 1|
    2. while read id; do nohup sh -c " hisat2 -p 2
    3. -x ~/pipeline/download/rnaseq/00ref/index_file/grch38/genome
    4. -1 ${id}_1_val_1.fq.gz -2 ${id}_1_val_1.fq.gz
    5. 2>${id%%_*}.log |
    6. samtools sort -@ 2 -o ${id%%_*}.bam " & done

     

    RNA-seq(5):序列比对:Hisat2 - 简书 别人的代码

    1. #启动miniconda3环境(hisat2所在的环境)
    2. $ source ~/miniconda3/bin/activate
    3. #进入data目录
    4. $cd /mnt/f/rna_seq/aligned
    5. (base) kelly@DESKTOP-MRA1M1F:/mnt/f/rna_seq/aligned$
    6. # 小鼠和人是分开各自比对自己的index
    7. # 人的比对
    8. $ for ((i=56;i<=58;i++));do hisat2 -t -x /mnt/f/rna_seq/data/reference/index/hg19/genome -1 /mnt/f/rna_seq/data/SRR35899${i}.sra_1.fastq.gz -2 /mnt/f/rna_seq/data/SRR35899${i}.sra_2.fastq.gz -S SRR35899${i}.sam ;done
    9. # 小鼠比对
    10. $ for ((i=59;i<=62;i++));do hisat2 -t -x /mnt/f/rna_seq/data/reference/index/mm10/genome -1 /mnt/f/rna_seq/data/SRR35899${i}.sra_1.fastq.gz -2 /mnt/f/rna_seq/data/SRR35899${i}.sra_2.fastq.gz -S SRR35899${i}.sam; done
    11. #结果一共得到7个sam文件

    genome指的是这里的genome.x.ht2的basename 

    4.定量 featurecounts

    featurecounts的使用说明 - 简书

    1. gtf=~/pipeline/download/rnaseq/00ref/genom_file/gencode.v41.annotation.gtf
    2. nohup featureCounts -T 5 -p -t exon -g gene_id -a $gtf -o all.id.txt *bam 1>counts.id.log 2>&1 &

    reads计数原理及featureCounts统计counts后的cpm和tpm计算 - 简书

  • 相关阅读:
    编程2016 1
    vue2+echarts:echarts在dialog弹框中不显示的解决方案
    HCIA数据通信——交换机(Vlan间的通信与安全)
    第一章三层交换应用
    【题解】自创题目题解
    GPIO相关介绍
    继承-重写
    Mysql集群及高可用-Mysql高可用MHA9
    共读《redis设计与实现》-单机(一)
    Halcon 光度立体 缺陷检测
  • 原文地址:https://blog.csdn.net/qq_52813185/article/details/128177352