1
01-rna-seq从头开始 卖萌哥 Linux生信技能树Linux安装软件 Linux实战RNASEQ上游_YoungLeelight的博客-CSDN博客
下载公共测序数据的另一种姿势(kingfisher) - 简书
生物信息常见文件的格式以及查看方式 | Public Library of Bioinformatics
高速快速下载基因组ref文件
查看文件下载是否完整
md5值检查文件完整性,因为原始文件 太大,需要检查数据完成性 md5值给每个文件一个独特的id,根据id是否相等来检查文件完整性cd 01raw_data/
- md5sum *gz>md5.txt #给每个 gz文件都生成md5值,,并且输入到 md5.txt
- ls
- cat md5.txt #查看内容
-
- md5sum -c md5.txt #比较文件自带的md5值和自己生成的md5值是否相等。若相等则文件相等。 -c参数,check一下
- md5sum --help
- # zcat查看gzip压缩的文件
-
- # head -n 8 显示前8行文件内容(前8行代表2条序列)
-
-
- zcat filename.fq.gz | head -n 8
- conda activate rna
- #当前目录下
- fastqc S*gz
- ls -lh
-
-
-
- multiqc ./
fastq.gz为原始的测序数据
fastqc.zip 为fastqc质控时候产生的数据
似乎质量不行啊
- (wes) pc@lab-pc:/project/raw_fq$ ls|grep _1.fastq.gz>gz1
- (wes) pc@lab-pc:/project/raw_fq$ ls|grep _2.fastq.gz>gz2
- (wes) pc@lab-pc:/project/raw_fq$ paste gz1 gz2>config
- (wes) pc@lab-pc:/project/raw_fq$ cat config|head
$一定不要忘记加上
- (rna) tree119 21:48:01 ~/pipeline/download/rnaseq/01_rawdata
- $ cat trim.sh
- outdir=~/pipeline/download/rnaseq/02cleandata/
-
-
- cat config |while read id
- do
- arr=${id}
- fq1=${arr[0]}
- fq2=${arr[1]}
-
- nohup trim_galore -q 25 --phred33 --length 36 --stringency 3 --paired -o $outdir $fq1 $fq2 &
- done
运行即可
bash trim.sh
得到的质控过滤后的文件:
$ ls -lh *fq.gz|cut -d" " -f 5-
- RUN STATISTICS FOR INPUT FILE: SRR11618782_1.fastq.gz
- =============================================
- 343183 sequences processed in total
- The length threshold of paired-end sequences gets evaluated later on (in the validation step)
-
- Writing report to '/home/st8/ssd2/tree119/pipeline/download/rnaseq/01_rawdata/outdir/SRR11618782_2.fastq.gz_trimming_report.txt'
-
- SUMMARISING RUN PARAMETERS
- ==========================
- Input filename: SRR11618782_2.fastq.gz
- Trimming mode: paired-end
- Trim Galore version: 0.6.7
- Cutadapt version: 1.18
- Number of cores used for trimming: 1
- Quality Phred score cutoff: 25
- Quality encoding type selected: ASCII+33
- Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; auto-detected)
- Maximum trimming error rate: 0.1 (default)
- Minimum required adapter overlap (stringency): 3 bp
- Minimum required sequence length for both reads before a sequence pair gets removed: 36 bp
- Output file(s) will be GZIP compressed
-
- Cutadapt seems to be reasonably up-to-date. Setting -j -j 1
- Writing final adapter and quality trimmed output to SRR11618782_2_trimmed.fq.gz
-
-
- >>> Now performing quality (cutoff '-q 25') and adapter trimming in a single pass for the adapter sequence: 'CTGTCTCTTATA' from file SRR11618782_2.fastq.gz <<<
- This is cutadapt 1.18 with Python 3.7.15
- Command line parameters: -j 1 -e 0.1 -q 25 -O 3 -a CTGTCTCTTATA SRR11618610_2.fastq.gz
- Processing reads on 1 core in single-end mode ...
- Finished in 2.45 s (18 us/read; 3.36 M reads/minute).
-
- === Summary ===
-
- Total reads processed: 137,124
- Reads with adapters: 4,643 (3.4%)
- Reads written (passing filters): 137,124 (100.0%)
-
- Total basepairs processed: 13,712,400 bp
- Quality-trimmed: 305,580 bp (2.2%)
- Total written (filtered): 13,368,511 bp (97.5%)
-
- === Adapter 1 ===
-
- Sequence: CTGTCTCTTATA; Type: regular 3'; Length: 12; Trimmed: 4643 times.
- No. of allowed errors:
- 0-9 bp: 0; 10-12 bp: 1
- Bases preceding removed adapters:
- A: 18.6%
- C: 33.4%
- G: 24.8%
- T: 23.1%
- none/other: 0.0%
- Overview of removed sequences
- length count expect max.err error counts
- 3 3150 2142.6 0 3150
- 4 577 535.6 0 577
- 5 174 133.9 0 174
- 6 39 33.5 0 39
- 7 12 8.4 0 12
- 8 15 2.1 0 15
- 9 15 0.5 0 13 2
- 10 29 0.1 1 13 16
- 11 3 0.0 1 2 1
- 12 10 0.0 1 8 2
- 13 11 0.0 1 11
- 14 13 0.0 1 11 2
- 15 16 0.0 1 12 4
- 16 15 0.0 1 10 5
- 17 8 0.0 1 6 2
- 18 6 0.0 1 5 1
- 19 12 0.0 1 11 1
- 20 10 0.0 1 6 4
- 21 12 0.0 1 11 1
- 22 9 0.0 1 8 1
- 23 16 0.0 1 13 3
- 24 8 0.0 1 6 2
- 25 10 0.0 1 9 1
- 26 7 0.0 1 6 1
- 27 16 0.0 1 14 2
- 28 5 0.0 1 4 1
- 29 11 0.0 1 9 2
- 30 26 0.0 1 19 7
- 31 32 0.0 1 29 3
- 32 10 0.0 1 9 1
- 33 22 0.0 1 13 9
- 34 11 0.0 1 11
- 35 17 0.0 1 9 8
- 36 7 0.0 1 3 4
- 37 3 0.0 1 3
- 38 20 0.0 1 19 1
- 39 14 0.0 1 13 1
- 40 7 0.0 1 4 3
- 41 22 0.0 1 20 2
- 42 39 0.0 1 37 2
- 43 5 0.0 1 4 1
- 44 15 0.0 1 15
- 45 7 0.0 1 5 2
- 46 12 0.0 1 9 3
- 47 1 0.0 1 0 1
- 48 8 0.0 1 7 1
- 49 6 0.0 1 6
- 50 2 0.0 1 1 1
- 51 3 0.0 1 3
- 52 3 0.0 1 0 3
- 53 5 0.0 1 3 2
- 54 2 0.0 1 0 2
- 55 1 0.0 1 1
- 57 8 0.0 1 5 3
- 58 8 0.0 1 7 1
- 59 1 0.0 1 1
- 60 11 0.0 1 8 3
- 61 26 0.0 1 24 2
- 62 5 0.0 1 4 1
- 63 6 0.0 1 3 3
- 64 8 0.0 1 7 1
- 65 1 0.0 1 0 1
- 67 3 0.0 1 1 2
- 68 3 0.0 1 3
- 69 1 0.0 1 0 1
- 70 1 0.0 1 0 1
- 72 3 0.0 1 0 3
- 73 1 0.0 1 0 1
- 74 3 0.0 1 0 3
- 77 14 0.0 1 2 12
- 78 1 0.0 1 1
- 79 5 0.0 1 0 5
- 81 5 0.0 1 0 5
- 84 6 0.0 1 1 5
- 86 2 0.0 1 0 2
- 88 2 0.0 1 0 2
- 90 3 0.0 1 0 3
- 91 2 0.0 1 1 1
- 92 2 0.0 1 1 1
- 93 1 0.0 1 0 1
- 95 1 0.0 1 0 1
- 98 1 0.0 1 0 1
- RUN STATISTICS FOR INPUT FILE: SRR11618610_2.fastq.gz
- =============================================
- 137124 sequences processed in total
- The length threshold of paired-end sequences gets evaluated later on (in the validation step)
- Validate paired-end files SRR11618610_1_trimmed.fq.gz and SRR11618610_2_trimmed.fq.gz
- file_1: SRR11618610_1_trimmed.fq.gz, file_2: SRR11618610_2_trimmed.fq.gz
- >>>>> Now validing the length of the 2 paired-end infiles: SRR11618610_1_trimmed.fq.gz and SRR11618610_2_trimmed.fq.gz <<<<<
- Writing validated paired-end Read 1 reads to SRR11618610_1_val_1.fq.gz
- Writing validated paired-end Read 2 reads to SRR11618610_2_val_2.fq.gz
- Total number of sequences analysed: 137124
- Number of sequence pairs removed because at least one read was shorter than the length cutoff (36 bp): 2597 (1.89%)
- Deleting both intermediate output files SRR11618610_1_trimmed.fq.gz and SRR11618610_2_trimmed.fq.gz
- ====================================================================================================
- This is cutadapt 1.18 with Python 3.7.15
- Command line parameters: -j 1 -e 0.1 -q 25 -O 3 -a CTGTCTCTTATA SRR11618782_2.fastq.gz
- Processing reads on 1 core in single-end mode ...
- Finished in 4.63 s (13 us/read; 4.44 M reads/minute).
- === Summary ===
- Total reads processed: 343,183
- Reads with adapters: 31,276 (9.1%)
- Reads written (passing filters): 343,183 (100.0%)
- Total basepairs processed: 34,318,300 bp
- Quality-trimmed: 2,815,627 bp (8.2%)
- Total written (filtered): 30,713,243 bp (89.5%)
- === Adapter 1 ===
- Sequence: CTGTCTCTTATA; Type: regular 3'; Length: 12; Trimmed: 31276 times.
-
- No. of allowed errors:
- 0-9 bp: 0; 10-12 bp: 1
-
- Bases preceding removed adapters:
- A: 17.2%
- C: 45.5%
- G: 19.7%
- T: 17.6%
- none/other: 0.0%
-
- Overview of removed sequences
- length count expect max.err error counts
- 3 6021 5362.2 0 6021
- 4 1325 1340.6 0 1325
- 5 542 335.1 0 542
- 6 372 83.8 0 372
- 7 358 20.9 0 358
- 8 348 5.2 0 348
- 9 291 1.3 0 283 8
- 10 386 0.3 1 281 105
- 11 384 0.1 1 305 79
- 12 490 0.0 1 387 103
- 13 388 0.0 1 331 57
- 14 512 0.0 1 406 106
- 15 371 0.0 1 323 48
- 16 392 0.0 1 307 85
- 17 449 0.0 1 370 79
- 18 354 0.0 1 272 82
- 19 517 0.0 1 422 95
- 20 440 0.0 1 371 69
- 21 380 0.0 1 294 86
- 22 310 0.0 1 249 61
- 23 220 0.0 1 199 21
- 24 291 0.0 1 255 36
- 25 604 0.0 1 510 94
- 26 370 0.0 1 306 64
- 27 489 0.0 1 411 78
- 28 314 0.0 1 265 49
- 29 600 0.0 1 491 109
- 30 432 0.0 1 382 50
- 31 1108 0.0 1 954 154
- 32 361 0.0 1 306 55
- 33 501 0.0 1 418 83
- 34 509 0.0 1 421 88
- 35 548 0.0 1 472 76
- 36 411 0.0 1 346 65
- 37 665 0.0 1 507 158
- 38 1168 0.0 1 1053 115
- 39 2363 0.0 1 2233 130
- 40 276 0.0 1 235 41
- 41 537 0.0 1 506 31
- 42 278 0.0 1 249 29
- 43 452 0.0 1 428 24
- 44 368 0.0 1 344 24
- 45 60 0.0 1 53 7
- 46 74 0.0 1 65 9
- 47 129 0.0 1 122 7
- 48 225 0.0 1 200 25
- 49 366 0.0 1 346 20
- 50 163 0.0 1 146 17
- 51 43 0.0 1 40 3
- 52 26 0.0 1 20 6
- 53 79 0.0 1 63 16
- 54 14 0.0 1 12 2
- 55 80 0.0 1 61 19
- 56 35 0.0 1 22 13
- 57 152 0.0 1 146 6
- 58 184 0.0 1 179 5
- 59 121 0.0 1 114 7
- 60 242 0.0 1 227 15
- 61 796 0.0 1 756 40
- 62 146 0.0 1 134 12
- 63 64 0.0 1 62 2
- 64 146 0.0 1 134 12
- 65 66 0.0 1 59 7
- 66 20 0.0 1 18 2
- 67 104 0.0 1 85 19
- 68 65 0.0 1 58 7
- 69 16 0.0 1 11 5
- 70 11 0.0 1 10 1
- 71 11 0.0 1 9 2
- 72 19 0.0 1 13 6
- 73 17 0.0 1 14 3
- 74 40 0.0 1 27 13
- 75 21 0.0 1 14 7
- 76 9 0.0 1 8 1
- 77 18 0.0 1 15 3
- 78 13 0.0 1 10 3
- 79 16 0.0 1 10 6
- 80 11 0.0 1 9 2
- 81 9 0.0 1 3 6
- 82 14 0.0 1 14
- 83 13 0.0 1 12 1
- 84 17 0.0 1 15 2
- 85 33 0.0 1 30 3
- 86 17 0.0 1 16 1
- 87 20 0.0 1 20
- 88 5 0.0 1 5
- 89 13 0.0 1 13
- 90 6 0.0 1 5 1
- 91 7 0.0 1 6 1
- 92 13 0.0 1 2 11
- 95 1 0.0 1 0 1
- 96 1 0.0 1 0 1
- 97 7 0.0 1 1 6
- 100 3 0.0 1 0 3
-
下载参考基因组 下载已经建立好索引的基因组
查看下载的基因组是否正确 查看下载内容时候有误
案例一
- md5sum ./* >md5sum.txt
-
- md5sum -c md5sum.txt
结果如下,说明文件是完整的
$ md5sum -c md5sum.txt
./CM000663.1: OK
./CM000664.1: OK
./CM000665.1: OK
./CM000666.1: OK
./CM000667.1: OK
./CM000668.1: OK
./CM000669.1: OK
./CM000670.1: OK
./CM000671.1: OK
./CM000672.1: OK
./CM000673.1: OK
./CM000674.1: OK
./CM000675.1: OK
./CM000676.1: OK
./CM000677.1: OK
./CM000678.1: OK
./CM000679.1: OK
./CM000680.1: OK
./CM000681.1: OK
./CM000682.1: OK
./CM000683.1: OK
./CM000684.1: OK
./CM000685.1: OK
./CM000686.1: OK
./GL000191.1: OK
./GL000192.1: OK
./GL000193.1: OK
./GL000194.1: OK
./GL000195.1: OK
./GL000196.1: OK
./GL000197.1: OK
./GL000198.1: OK
./GL000199.1: OK
./GL000200.1: OK
./GL000201.1: OK
./GL000202.1: OK
./GL000203.1: OK
./GL000204.1: OK
./GL000205.1: OK
./GL000206.1: OK
./GL000207.1: OK
./GL000208.1: OK
./GL000209.1: OK
./GL000210.1: OK
./GL000211.1: OK
./GL000212.1: OK
./GL000213.1: OK
./GL000214.1: OK
./GL000215.1: OK
./GL000216.1: OK
./GL000217.1: OK
./GL000218.1: OK
./GL000219.1: OK
./GL000220.1: OK
./GL000221.1: OK
./GL000222.1: OK
./GL000223.1: OK
./GL000224.1: OK
./GL000225.1: OK
./GL000226.1: OK
./GL000227.1: OK
./GL000228.1: OK
./GL000229.1: OK
./GL000230.1: OK
./GL000231.1: OK
./GL000232.1: OK
./GL000233.1: OK
./GL000234.1: OK
./GL000235.1: OK
./GL000236.1: OK
./GL000237.1: OK
./GL000238.1: OK
./GL000239.1: OK
./GL000240.1: OK
./GL000241.1: OK
./GL000242.1: OK
./GL000243.1: OK
./GL000244.1: OK
./GL000245.1: OK
./GL000246.1: OK
./GL000247.1: OK
./GL000248.1: OK
./GL000249.1: OK
./NC_012920.1: OK
./SRR11832836.sra: OK
- md5sum *.gz >md5.txt
- cat md5.txt
- md5sum -c md5.txt
wget -c
ls *gz |xargs gunzip
- $ cd reference/index
- $ wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg19.tar.gz
- $ wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/mm10.tar.gz
- # 解压得到两个目录,hg19和mm10
- $ tar -zxvf *.tar.gz
得到8个索引
基因组注释文件(GFF,GTF)下载的五种方法_白墨石的博客-CSDN博客_基因组注释文件
hg38的gtf文件下载
genecode中参考基因组的名字 更新很快!
新建脚本来运行 制备文件准备工作
- 489 ls |grep _1.fq.gz
- 490 ls |grep _1.fq.gz >gz1
- 491 ls |grep _2.fq.gz >gz2
- 492 paste gz1 gz2 >file_for_align
- 493 cat file_for_align
- vim align.sh
-
-
- cat file_for_align |while read id
- do
- arr=${id}
- fq1=${arr[0]}
- fq2=${arr[1]}
-
-
- nohup sh -c " hisat2 -p 2 -x ~/pipeline/download/rnaseq/00ref/index_file/grch38/genome -1 ${fq1} -2 ${fq2} 2>${id%%_*}.log | samtools sort -@ 2 -o ${id%%_*}.bam " &
-
- done
- ls |grep .fq.gz|cut -d "_" -f 1|
- while read id; do nohup sh -c " hisat2 -p 2
- -x ~/pipeline/download/rnaseq/00ref/index_file/grch38/genome
- -1 ${id}_1_val_1.fq.gz -2 ${id}_1_val_1.fq.gz
- 2>${id%%_*}.log |
- samtools sort -@ 2 -o ${id%%_*}.bam " & done
RNA-seq(5):序列比对:Hisat2 - 简书 别人的代码
- #启动miniconda3环境(hisat2所在的环境)
- $ source ~/miniconda3/bin/activate
- #进入data目录
- $cd /mnt/f/rna_seq/aligned
- (base) kelly@DESKTOP-MRA1M1F:/mnt/f/rna_seq/aligned$
-
- # 小鼠和人是分开各自比对自己的index
- # 人的比对
- $ for ((i=56;i<=58;i++));do hisat2 -t -x /mnt/f/rna_seq/data/reference/index/hg19/genome -1 /mnt/f/rna_seq/data/SRR35899${i}.sra_1.fastq.gz -2 /mnt/f/rna_seq/data/SRR35899${i}.sra_2.fastq.gz -S SRR35899${i}.sam ;done
- # 小鼠比对
- $ for ((i=59;i<=62;i++));do hisat2 -t -x /mnt/f/rna_seq/data/reference/index/mm10/genome -1 /mnt/f/rna_seq/data/SRR35899${i}.sra_1.fastq.gz -2 /mnt/f/rna_seq/data/SRR35899${i}.sra_2.fastq.gz -S SRR35899${i}.sam; done
- #结果一共得到7个sam文件
- gtf=~/pipeline/download/rnaseq/00ref/genom_file/gencode.v41.annotation.gtf
-
- nohup featureCounts -T 5 -p -t exon -g gene_id -a $gtf -o all.id.txt *bam 1>counts.id.log 2>&1 &