ncbi-datasets-cli-高效便捷下载NCBI数据

文章目录

简介
安装
```datasets download```下载基因组/基因序列
```datasets summary```下载元数据
```dataformat```将json转换成表格格式
通过json文件解析其他字段
问题

简介

NCBI Datasets 可以轻松从 NCBI 数据库中收集数据。使用命令行界面（CLI）工具或 NCBI Datasets 网页界面查找和下载基因和基因组的序列、注释和元数据。如下是可用的工具：
在这里插入图片描述

安装

使用conda安装Datasets CLI tools, datasetsand dataformat:

# 注意不是datasets而是ncbi-datasets-cli
$ conda install -c conda-forge ncbi-datasets-cli
(base) [yut@io02 ~]$ datasets --version
datasets version: 15.25.0
1
2
3
4

`datasets download`下载基因组/基因序列

datasets从 NCBI 下载所有生命领域的生物序列数据，dataformat将前者下载的数据包中的元数据从 JSON Lines 格式转换为其他格式。

使用datasets下载人类参考基因组 GRCh38 的基因组数据包：

$ datasets download genome taxon human --reference --filename human-reference.zip

1
2

使用 dataformat从下载的人类参考基因组 GRCh38 数据包中提取选定的元数据字段：

$ dataformat tsv genome --package human-reference.zip --fields organism-name,assminfo-name,accession,assminfo-submitter
Organism name	Assembly Name	Assembly Accession	Assembly Submitter
Homo sapiens	GRCh38.p14	GCF_000001405.40	Genome Reference Consortium
1
2
3

按照GCA list文件编号下载

(base) [yut@io02 02_Glacier_new_taxa]$ head 3.gca
GCF_020042285.1
GCF_020783315.1
GCF_024343615.1
(base) [yut@io02 02_Glacier_new_taxa]$ time datasets download genome accession --inputfile 3.gca --include gff3,rna,cds,protein,genome,seq-report --filename  3genome.zip
# --inputfile:输入GCA号的list，每行一个
# --filename：输出zip包名称，默认ncbi-dataset.zip

New version of client (15.27.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
Collecting 3 genome records [================================================] 100% 3/3
Downloading: 3genome.zip    10.4MB valid zip archive
Validating package files [================================================] 100% 18/18

real    0m8.208s
user    0m0.652s
sys     0m0.234s
(base) [yut@io02 02_Glacier_new_taxa]$ ls
3genome.zip  download.log 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18

下载大基因组

下载大量基因组，首先下载压缩包，然后分三步访问数据。

1.下载人基因组压缩包

datasets download genome accession GCF_000001405.40 --dehydrated --filename human_GRCh38_dataset.zip
1

2.解压

unzip human_GRCh38_dataset.zip -d my_human_dataset
1

3.转换格式

datasets rehydrate --directory my_human_dataset/
1

genome完整参数

(base) [yut@io02 ~]$ datasets download genome  --help

Download a genome data package. Genome data packages may include genome, transcript and protein sequences, annotation and one or more data reports. Data packages are downloaded as a zip archive.

The default genome data package includes the following files:
  * <accession>_<assembly_name>_genomic.fna (genomic sequences)
  * assembly_data_report.jsonl (data report with genome assembly and annotation metadata)
  * dataset_catalog.json (a list of files and file types included in the data package)

Usage
  datasets download genome [flags]
  datasets download genome [command]

Sample Commands
  datasets download genome accession GCF_000001405.40 --chromosomes X,Y --include genome,gff3,rna
  datasets download genome taxon "bos taurus" --dehydrated
  datasets download genome taxon human --assembly-level chromosome,complete --dehydrated
  datasets download genome taxon mouse --search C57BL/6J --search "Broad Institute" --dehydrated

Available Commands
  accession   Download a genome data package by Assembly or BioProject accession
  taxon       Download a genome data package by taxon (NCBI Taxonomy ID, scientific or common name at any tax rank)

Flags
      --annotated                Limit to annotated genomes
      --assembly-level string    Limit to genomes at one or more assembly levels (comma-separated):
                                   * chromosome
                                   * complete
                                   * contig
                                   * scaffold
                                    (default "[]")
      --assembly-source string   Limit to 'RefSeq' (GCF_) or 'GenBank' (GCA_) genomes (default "all")
      --chromosomes strings      Limit to a specified, comma-delimited list of chromosomes, or 'all' for all chromosomes
      --dehydrated               Download a dehydrated zip archive including the data report and locations of data files (use the rehydrate command to retrieve data files).
      --exclude-atypical         Exclude atypical assemblies
      --mag string               Limit to metagenome assembled genomes (only) or remove them from the results (exclude) (default "all")
      --preview                  Show information about the requested data package
      --reference                Limit to reference genomes
      --released-after string    Limit to genomes released on or after a specified date (MM/DD/YYYY)
      --released-before string   Limit to genomes released on or before a specified date (MM/DD/YYYY)
      --search strings           Limit results to genomes with specified text in the searchable fields:
                                 species and infraspecies, assembly name and submitter.
                                 To search multiple strings, use the flag multiple times.


Global Flags
      --api-key string    Specify an NCBI API key
      --debug             Emit debugging info
      --filename string   Specify a custom file name for the downloaded data package (default "ncbi_dataset.zip")
      --help              Print detailed help about a datasets command
      --no-progressbar    Hide progress bar
      --version           Print version of datasets

Use datasets download genome <command> --help for detailed help about a command.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55

gene参数

(base) [yut@io02 ~]$ datasets download gene --help

Download a gene data package.  Gene data packages include gene, transcript and protein sequences and one or more data reports. Data packages are downloaded as a zip archive.

The default gene data package for NM, NR, NP, XM, XR, XP and YP accessions:
  * rna.fna (transcript sequences)
  * protein.faa (protein sequences)
  * data_report.jsonl (data report with gene metadata)
  * dataset_catalog.json (a list of files and file types included in the data package)

Usage
  datasets download gene [flags]
  datasets download gene [command]

Sample Commands
  datasets download gene gene-id 672
  datasets download gene symbol brca1 --taxon mouse
  datasets download gene accession NP_000483.3
  datasets download gene gene-id 2778 --fasta-filter NC_000020.11,NM_001077490.3,NP_001070958.1

Available Commands
  gene-id     Download a gene data package by NCBI Gene ID
  symbol      Download a gene data package by gene symbol
  accession   Download a gene data package by RefSeq nucleotide or protein accession
  taxon       Download a gene data package by taxon (NCBI Taxonomy ID, scientific or common name at any tax rank)

Flags
      --fasta-filter strings       Limit protein and RNA sequence files to the specified RefSeq nucleotide and protein accessions
      --fasta-filter-file string   Limit protein and RNA sequence files to the specified RefSeq nucleotide and protein accessions included in the specified file
      --preview                    Show information about the requested data package


Global Flags
      --api-key string    Specify an NCBI API key
      --debug             Emit debugging info
      --filename string   Specify a custom file name for the downloaded data package (default "ncbi_dataset.zip")
      --help              Print detailed help about a datasets command
      --no-progressbar    Hide progress bar
      --version           Print version of datasets

Use datasets download gene <command> --help for detailed help about a command.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42

`datasets summary`下载元数据

(base) [yut@io02 ~]$ datasets summary --help

Print a data report containing gene, genome or virus metadata in JSON format.

Usage
  datasets summary [flags]
  datasets summary [command]

Sample Commands
  datasets summary genome accession GCF_000001405.40
  datasets summary genome taxon "mus musculus"
  datasets summary gene gene-id 672
  datasets summary gene symbol brca1 --taxon mouse
  datasets summary gene accession NP_000483.3
  datasets summary virus genome accession NC_045512.2
  datasets summary virus genome taxon sars-cov-2 --host dog

Available Commands
  gene        Print a summary of a gene dataset
  genome      Print a data report containing genome metadata
  virus       Print a data report containing virus genome metadata

Global Flags
      --api-key string   Specify an NCBI API key
      --debug            Emit debugging info
      --help             Print detailed help about a datasets command
      --version          Print version of datasets

Use datasets summary <command> --help for detailed help about a command.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30

实例

(base) [yut@io02 ~]$ datasets summary genome accession GCF_000001405.40
New version of client (15.27.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
{"reports": [{"accession":"GCF_000001405.40","annotation_info":{"busco":{"busco_lineage":"primates_odb10","busco_ver":"4.1.4","complete":0.99187225,"duplicated":0.007256894,"fragmented":0.0015239477,"missing":0.0066037737,"single_copy":0.9846154,"total_count":"13780"},"method":"Best-placed RefSeq; Gnomon; RefSeqFE; cmsearch; tRNAscan-SE","name":"GCF_000001405.40-RS_2023_10","pipeline":"NCBI eukaryotic genome annotation pipeline","provider":"NCBI RefSeq","release_date":"2023-10-02","report_url":"https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/GCF_000001405.40-RS_2023_10.html","software_version":"10.2","stats":{"gene_counts":{"non_coding":22158,"other":413,"protein_coding":20080,"pseudogene":17001,"total":59652}},"status":"Updated annotation"},"assembly_info":{"assembly_level":"Chromosome","assembly_name":"GRCh38.p14","assembly_status":"current","assembly_type":"haploid-with-alt-loci","bioproject_accession":"PRJNA31257","bioproject_lineage":[{"bioprojects":[{"accession":"PRJNA31257","title":"The Human Genome Project, currently maintained by the Genome Reference Consortium (GRC)"}]}],"blast_url":"https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch\u0026PROG_DEF=blastn\u0026BLAST_SPEC=GDH_GCF_000001405.40","description":"Genome Reference Consortium Human Build 38 patch release 14 (GRCh38.p14)","paired_assembly":{"accession":"GCA_000001405.29","only_genbank":"4 unlocalized and unplaced scaffolds.","status":"current"},"refseq_category":"reference genome","release_date":"2022-02-03","submitter":"Genome Reference Consortium","synonym":"hg38"},"assembly_stats":{"contig_l50":18,"contig_n50":57879411,"gaps_between_scaffolds_count":349,"gc_count":"1374283647","gc_percent":41,"number_of_component_sequences":35611,"number_of_contigs":996,"number_of_organelles":1,"number_of_scaffolds":470,"scaffold_l50":16,"scaffold_n50":67794873,"total_number_of_chromosomes":24,"total_sequence_length":"3099441038","total_ungapped_length":"2948318359"},"current_accession":"GCF_000001405.40","organelle_info":[{"description":"Mitochondrion","submitter":"Genome Reference Consortium","total_seq_length":"16569"}],"organism":{"common_name":"human","organism_name":"Homo sapiens","tax_id":9606},"paired_accession":"GCA_000001405.29","source_database":"SOURCE_DATABASE_REFSEQ"}],"total_count": 1}

(base) [yut@io02 ~]$ datasets summary gene gene-id 672
New version of client (15.27.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
{"reports": [{"gene":{"annotations":[{"annotation_name":"GCF_000001405.40-RS_2023_10","annotation_release_date":"2023-10-02","assembly_accession":"GCF_000001405.40","assembly_name":"GRCh38.p14","genomic_locations":[{"genomic_accession_version":"NC_000017.11","genomic_range":{"begin":"43044295","end":"43170327","orientation":"minus"},"sequence_name":"17"}]},{"annotation_name":"GCF_009914755.1-RS_2023_10","annotation_release_date":"2023-10-02","assembly_accession":"GCF_009914755.1","assembly_name":"T2T-CHM13v2.0","genomic_locations":[{"genomic_accession_version":"NC_060941.1","genomic_range":{"begin":"43902857","end":"44029084","orientation":"minus"},"sequence_name":"17"}]}],"chromosomes":["17"],"common_name":"human","description":"BRCA1 DNA repair associated","ensembl_gene_ids":["ENSG00000012048"],"gene_groups":[{"id":"672","method":"NCBI Ortholog"}],"gene_id":"672","nomenclature_authority":{"authority":"HGNC","identifier":"HGNC:1100"},"omim_ids":["113705"],"orientation":"minus","protein_count":368,"reference_standards":[{"gene_range":{"accession_version":"NG_005905.2","range":[{"begin":"92501","end":"173689","orientation":"plus"}]},"type":"REFSEQ_GENE"}],"swiss_prot_accessions":["P38398"],"symbol":"BRCA1","synonyms":["IRIS","PSCP","BRCAI","BRCC1","FANCS","PNCA4","RNF53","BROVCA1","PPP1R53"],"tax_id":"9606","taxname":"Homo sapiens","transcript_count":368,"transcript_type_counts":[{"count":368,"type":"PROTEIN_CODING"}],"type":"PROTEIN_CODING"},"query":["672"]}],"total_count": 1}
1
2
3
4
5
6
7

下载结果为json格式

`dataformat`将json转换成表格格式

(base) [yut@io02 ~]$ dataformat tsv

Convert data to TSV format.

Refer to NCBI's [download and install](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/) documentation for information about getting started with the command-line tools.

Usage
  dataformat tsv [command]

Report Commands
  genome             Convert Genome Assembly Data Report into TSV format
  genome-seq         Convert Genome Assembly Sequence Report into TSV format
  gene               Convert Gene Report into TSV format
  gene-product       Convert Gene Product Report into TSV format
  virus-genome       Convert Virus Data Report into TSV format
  virus-annotation   Convert Virus Annotation Report into TSV format
  microbigge         Convert MicroBIGG-E Data Report into TSV format
  prok-gene          Convert Prokaryote Gene Report into TSV format
  prok-gene-location Convert Prokaryote Gene Location Report into TSV format
  genome-annotations Convert Genome Annotation Report into TSV format

Flags
      --elide-header   Do not output header
  -h, --help           help for tsv



Global Flags
      --force   Force dataformat to run without type check prompt

Use dataformat tsv <command> --help for detailed help about a command.

(base) [yut@io02 ~]$ dataformat tsv gene
Error: --inputfile and/or --packagefile must be specified, or data can be read from standard input
Usage
  dataformat tsv gene [flags]

Examples
  dataformat tsv gene --inputfile gene_package/ncbi_dataset/data/data_report.jsonl
  dataformat tsv gene --package genes.zip

Flags
      --fields strings     Comma-separated list of fields (default annotation-assembly-accession,annotation-assembly-name,annotation-genomic-range-accession,annotation-genomic-range-exon-order,annotation-genomic-range-exon-orientation,annotation-genomic-range-exon-start,annotation-genomic-range-exon-stop,annotation-genomic-range-range-order,annotation-genomic-range-range-orientation,annotation-genomic-range-range-start,annotation-genomic-range-range-stop,annotation-genomic-range-seq-name,annotation-release-date,annotation-release-name,chromosomes,common-name,description,ensembl-geneids,gene-id,gene-type,genomic-region-gene-range-accession,genomic-region-gene-range-range-order,genomic-region-gene-range-range-orientation,genomic-region-gene-range-range-start,genomic-region-gene-range-range-stop,genomic-region-genomic-region-type,group-id,group-method,name-authority,name-id,omim-ids,orientation,protein-count,ref-standard-gene-range-accession,ref-standard-gene-range-range-order,ref-standard-gene-range-range-orientation,ref-standard-gene-range-range-start,ref-standard-gene-range-range-stop,ref-standard-genomic-region-type,replaced-gene-id,rna-type,swissprot-accessions,symbol,synonyms,tax-id,tax-name,transcript-count)
                               - annotation-assembly-accession
                               - annotation-assembly-name
                               - annotation-genomic-range-accession
                               - annotation-genomic-range-exon-order
                               - annotation-genomic-range-exon-orientation
                               - annotation-genomic-range-exon-start
                               - annotation-genomic-range-exon-stop
                               - annotation-genomic-range-range-order
                               - annotation-genomic-range-range-orientation
                               - annotation-genomic-range-range-start
                               - annotation-genomic-range-range-stop
                               - annotation-genomic-range-seq-name
                               - annotation-release-date
                               - annotation-release-name
                               - chromosomes
                               - common-name
                               - description
                               - ensembl-geneids
                               - gene-id
                               - gene-type
                               - genomic-region-gene-range-accession
                               - genomic-region-gene-range-range-order
                               - genomic-region-gene-range-range-orientation
                               - genomic-region-gene-range-range-start
                               - genomic-region-gene-range-range-stop
                               - genomic-region-genomic-region-type
                               - group-id
                               - group-method
                               - name-authority
                               - name-id
                               - omim-ids
                               - orientation
                               - protein-count
                               - ref-standard-gene-range-accession
                               - ref-standard-gene-range-range-order
                               - ref-standard-gene-range-range-orientation
                               - ref-standard-gene-range-range-start
                               - ref-standard-gene-range-range-stop
                               - ref-standard-genomic-region-type
                               - replaced-gene-id
                               - rna-type
                               - swissprot-accessions
                               - symbol
                               - synonyms
                               - tax-id
                               - tax-name
                               - transcript-count
  -h, --help               help for gene
      --inputfile string   Input file (default "ncbi_dataset/data/data_report.jsonl")
      --package string     Data package (zip archive), inputfile parameter is relative to the root path inside the archive



Global Flags
      --elide-header   Do not output header
      --force          Force dataformat to run without type check prompt

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100

实例

(base) [yut@io02 ~]$ datasets summary gene gene-id 672  --as-json-lines |dataformat tsv gene
New version of client (15.27.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
Annotation Assembly Accession   Annotation Assembly Name        Annotation Genomic Range Accession      Annotation Genomic Range Exons Order    Annotation Genomic Range Exons Orientation      Annotation Genomic Range Exons Start    Annotation Genomic Range Exons Stop     Annotation Genomic Range Order        Annotation Genomic Range Orientation    Annotation Genomic Range Start  Annotation Genomic Range Stop   Annotation Genomic Range Seq Name       Annotation Release Date Annotation Release Name Chromosomes     Common Name     Description     Ensembl GeneIDs NCBI GeneID   Gene Type       Genomic Region Gene Range Sequence Accession    Genomic Region Gene Range Order Genomic Region Gene Range Orientation   Genomic Region Gene Range Start Genomic Region Gene Range Stop  Genomic Region Genomic Region Type      Gene Group Identifier   Gene Group Method     Nomenclature Authority  Nomenclature ID OMIM IDs        Orientation     Proteins        Reference Standard Gene Range Sequence Accession        Reference Standard Gene Range Order     Reference Standard Gene Range Orientation       Reference Standard Gene Range Start     Reference Standard Gene Range Stop    Reference Standard Genomic Region Type  Replaced NCBI GeneID    RNA Type        SwissProt Accessions    Symbol  Synonyms        Taxonomic ID    Taxonomic Name  Transcripts
GCF_000001405.40        GRCh38.p14      NC_000017.11                                            minus   43044295        43170327        17      2023-10-02      GCF_000001405.40-RS_2023_10     17      human   BRCA1 DNA repair associated     ENSG00000012048 672     PROTEIN_CODING       672      NCBI Ortholog   HGNC    HGNC:1100       113705  minus   368     NG_005905.2             plus    92501   173689  REFSEQ_GENE                     P38398  BRCA1   IRIS,PSCP,BRCAI,BRCC1,FANCS,PNCA4,RNF53,BROVCA1,PPP1R53 9606    Homo sapiens    368
GCF_009914755.1 T2T-CHM13v2.0   NC_060941.1                                             minus   43902857        44029084        17      2023-10-02      GCF_009914755.1-RS_2023_10      17      human   BRCA1 DNA repair associated     ENSG00000012048 672     PROTEIN_CODING               672      NCBI Ortholog   HGNC    HGNC:1100       113705  minus   368     NG_005905.2             plus    92501   173689  REFSEQ_GENE                     P38398  BRCA1   IRIS,PSCP,BRCAI,BRCC1,FANCS,PNCA4,RNF53,BROVCA1,PPP1R53 9606    Homo sapiens    368

(base) [yut@io02 ~]$ datasets summary gene gene-id 672  --as-json-lines |dataformat tsv gene --fields gene-id,gene-type,symbol
New version of client (15.27.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
NCBI GeneID     Gene Type       Symbol
672     PROTEIN_CODING  BRCA1

# --as-json-lines必须加上
# --fields指定需要的字段，多个空格隔开
1
2
3
4
5
6
7
8
9
10
11
12
13

通过json文件解析其他字段

某些字段无法通过dataformat提取出来，可先保存成json文件，然后通过下面脚本解析：

(base) [yut@node01 ~]$ cat dataset.json
{"accession":"GCA_013141435.1","annotation_info":{"method":"Best-placed reference protein set; GeneMarkS-2+","name":"NCBI Prokaryotic Genome Annotation Pipeline (PGAP)","pipeline":"NCBI Prokaryotic Genome Annotation Pipeline (PGAP)","provider":"NCBI","release_date":"2020-05-14","software_version":"4.11","stats":{"gene_counts":{"non_coding":27,"protein_coding":2566,"pseudogene":17,"total":2610}}},"assembly_info":{"assembly_level":"Contig","assembly_method":"MetaSPAdes v. 3.10.1","assembly_name":"ASM1314143v1","assembly_status":"current","assembly_type":"haploid","bioproject_accession":"PRJNA622654","bioproject_lineage":[{"bioprojects":[{"accession":"PRJNA622654","title":"Metagenomic profiling of ammonia and methane-oxidizing microorganisms in a Dutch drinking water treatment plant"}]}],"biosample":{"accession":"SAMN14539096","attributes":[{"name":"isolation_source","value":"Primary rapid sand filter"},{"name":"collection_date","value":"not applicable"},{"name":"geo_loc_name","value":"Netherlands"},{"name":"lat_lon","value":"not applicable"},{"name":"isolate","value":"P-RSF-IL-07"},{"name":"depth","value":"not applicable"},{"name":"env_broad_scale","value":"drinking water treatment plant"},{"name":"env_local_scale","value":"Primary rapid sand filter"},{"name":"env_medium","value":"not applicable"},{"name":"metagenomic","value":"1"},{"name":"environmental-sample","value":"1"},{"name":"sample_type","value":"metagenomic assembly"},{"name":"metagenome-source","value":"drinking water metagenome"},{"name":"derived_from","value":"This BioSample is a metagenomic assembly obtained from the drinking water metagenome BioSample:SAMN14524263, SAMN14524264, SAMN14524265, SAMN14524266"}],"bioprojects":[{"accession":"PRJNA622654"}],"description":{"comment":"Keywords: GSC:MIxS;MIMAG:6.0","organism":{"organism_name":"Ferruginibacter sp.","tax_id":1940288},"title":"MIMAG Metagenome-assembled Genome sample from Ferruginibacter sp."},"last_updated":"2020-05-19T00:50:12.857","models":["MIMAG.water"],"owner":{"contacts":[{}],"name":"Radboud University"},"package":"MIMAG.water.6.0","publication_date":"2020-05-19T00:50:12.857","sample_ids":[{"label":"Sample name","value":"Ferruginibacter sp. P-RSF-IL-07"}],"status":{"status":"live","when":"2020-05-19T00:50:12.857"},"submission_date":"2020-04-04T13:17:04.950"},"comments":"The annotation was added by the NCBI Prokaryotic Genome Annotation Pipeline (PGAP). Information about PGAP can be found here: https://www.ncbi.nlm.nih.gov/genome/annotation_prok/","genome_notes":["derived from metagenome"],"release_date":"2020-05-21","sequencing_tech":"Illumina MiSeq","submitter":"Radboud University"},"assembly_stats":{"contig_l50":10,"contig_n50":104094,"gc_count":"978119","gc_percent":32,"genome_coverage":"270.4x","number_of_component_sequences":43,"number_of_contigs":43,"total_sequence_length":"3056910","total_ungapped_length":"3056910"},"average_nucleotide_identity":{"best_ani_match":{"ani":79.65,"assembly":"GCA_003426875.1","assembly_coverage":0.01,"category":"type","organism_name":"Lutibacter oceani","type_assembly_coverage":0.01},"category":"category_na","comment":"na","match_status":"low_coverage","submitted_organism":"Ferruginibacter sp.","submitted_species":"Ferruginibacter sp.","taxonomy_check_status":"Inconclusive"},"current_accession":"GCA_013141435.1","organism":{"infraspecific_names":{"isolate":"P-RSF-IL-07"},"organism_name":"Ferruginibacter sp.","tax_id":1940288},"source_database":"SOURCE_DATABASE_GENBANK","wgs_info":{"master_wgs_url":"https://www.ncbi.nlm.nih.gov/nuccore/JABFQZ000000000.1","wgs_contigs_url":"https://www.ncbi.nlm.nih.gov/Traces/wgs/JABFQZ01","wgs_project_accession":"JABFQZ01"}}

(base) [yut@node01 ~]$ Parse_dataset_genome_json_metadata.py  *json
Save result in output.csv
(base) [yut@node01 ~]$ cat output.csv
Accession,Geo Location Name,Latitude and Longitude,Collection date,Env broad scale,Env local scale,Env medium,Sample type
GCA_013141435.1,Netherlands,not applicable,not applicable,drinking water treatment plant,Primary rapid sand filter,not applicable,metagenomic assembly
(base) [yut@node01 ~]$ cat ~/Software/Important_scripts/Parse_dataset_genome_json_metadata.py
#!/usr/bin/env python
import argparse
import json
import pandas as pd

# 创建参数解析器
parser = argparse.ArgumentParser(description='Parse JSON data')
parser.add_argument('json_file', help='Path to the JSON file')

# 解析参数
args = parser.parse_args()

# 读取JSON文件
with open(args.json_file, 'r') as file:
    json_str = file.read()

# 解析JSON
data = json.loads(json_str)

# 获取env_broad_scale字段的值
# 获取所需字段的值
accession = data["accession"]
geo_loc_name = data["assembly_info"]["biosample"]["attributes"][2]["value"]
lat_lon = data["assembly_info"]["biosample"]["attributes"][3]["value"]
collection_date = data['assembly_info']['biosample']['attributes'][1]['value']
env_broad_scale = data["assembly_info"]["biosample"]["attributes"][6]["value"]
env_local_scale = data['assembly_info']['biosample']['attributes'][7]['value']
env_medium = data['assembly_info']['biosample']['attributes'][8]['value']
sample_type = data['assembly_info']['biosample']['attributes'][11]['value']

# output
# 创建DataFrame
df = pd.DataFrame({
    'Accession': [accession],
    'Geo Location Name': [geo_loc_name],
    'Latitude and Longitude': [lat_lon],
    'Collection date' : [collection_date],
    'Env broad scale' : [env_broad_scale],
    'Env local scale'  : [env_local_scale],
    'Env medium' : [env_medium],
    'Sample type' : [sample_type]
})

# 将DataFrame保存为CSV文件
df.to_csv('output.csv', index=False)
print('Save result in output.csv ')

# run
$ (base) [yut@node01 ~]$ Parse_dataset_genome_json_metadata.py  dataset.json
Save result in output.csv
# output
(base) [yut@node01 ~]$ cat output.csv
Accession,Geo Location Name,Latitude and Longitude,Collection date,Env broad scale,Env local scale,Env medium,Sample type
GCA_013141435.1,Netherlands,not applicable,not applicable,drinking water treatment plant,Primary rapid sand filter,not applicable,metagenomic assembly

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64

问题

Error: Internal error (invalid zip archive)并且没有输出文件

(base) [yut@io02 02_Glacier_new_taxa]$ time datasets download genome accession --inputfile GTDB_R214_63_Ferruginibacter_genus.GCA --include gff3,rna,cds,protein,genome,seq-report
New version of client (15.27.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
Collecting 63 genome records [================================================] 100% 63/63
Downloading: ncbi_dataset.zip    146MB done
Validating package files [===========>------------------------------------]  28% 70/252
Error: Internal error (invalid zip archive). Please try again
1
2
3
4
5
6

上述问题可能是输入编号既包括GCA又包括GCF编号，解决办法是将两者分开下载，或者等到Validating package files停掉命令

相关阅读:
深度学习调参笔记
 Annoy vs Milvus：哪个向量数据库更适合您的AI应用？知其然知其所以然
 嵌入式常见接口协议总结
 定时器的使用
 294_C++_报警状态bit与(&)上通道bit，然后检测置位的通道，得到对应置位通道的告警信息，适用于多通道告警，组成string字符串发送
 安装cx_Oracle
高云FPGA系列教程（11）：MultiButton按键驱动模块移植
 【JavaEE进阶】Spring统一功能处理:拦截器的使用
 redis 未授权访问漏洞详解
 深入学习Synchronized各种使用方法
原文地址：https://blog.csdn.net/qq_42491125/article/details/134232393