
Public bins and MAGs uploader

Python script to upload bins and MAGs in FASTA format to ENA (European Nucleotide Archive). The script generates the XMLs and manifests necessary for submission with webin-cli.

It takes as input one tsv (tab-separated values) table in the following format:

genome_name genome_path accessions assembly_software binning_software binning_parameters stats_generation_software completeness contamination genome_coverage metagenome co-assembly broad_environment local_environment environmental_medium rRNA_presence taxonomy_lineage
ERR4647712_crispatus path/to/ERR4647712.fa.gz ERR4647712 megahit_v1.2.9 MGnify-genomes-generation-pipeline_v1.0.0 default CheckM2_v1.0.1 100 0.38 14.2 chicken gut metagenome False chicken gut mucosa True d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus;s__Lactobacillus crispatus

With columns indicating:

  • genome_name: genome id (unique string identifier)
  • accessions: run(s) or assembly(ies) the genome was generated from (DRR/ERR/SRRxxxxxx for runs, DRZ/ERZ/SRZxxxxxx for assemblies). If the genome was generated by a co-assembly of multiple runs, separate them with a comma.
  • assembly_software: assemblerName_vX.X
  • binning_software: binnerName_vX.X
  • binning_parameters: binning parameters
  • stats_generation_software: software_vX.X
  • completeness: float
  • contamination: float
  • rRNA_presence: True/False, whether the 5S, 16S, and 23S rRNA genes and at least 18 tRNA genes have been detected in the genome
  • taxonomy_lineage: full taxonomy lineage, either as tax ids (integers) or as names (strings). Format: x;y;z;...
  • metagenome: the metagenome taxon needs to be listed in ENA's taxonomy tree (you might need to press "Tax tree - Show" in the right-most section of the page)
  • co-assembly: True/False, whether the genome was generated from a co-assembly. N.B. the script only supports co-assemblies generated from the same project.
  • genome_coverage : genome coverage against raw reads
  • genome_path: path to genome to upload (already compressed)
  • broad_environment: string (see explanation below)
  • local_environment: string (see explanation below)
  • environmental_medium: string (see explanation below)

According to the ENA checklists' guidelines, 'broad_environment' describes the broad ecological context of a sample (desert, taiga, coral reef, ...), 'local_environment' is more local (lake, harbour, cliff, ...), and 'environmental_medium' is either the material displaced by the sample or the material in which the sample was embedded prior to the sampling event (air, soil, water, ...). For host-associated metagenomic samples, the three variables can be defined similarly to the following example for the chicken gut metagenome: "chicken digestive system", "digestive tube", "caecum". More information can be found in checklists ERC000050 (bins) and ERC000047 (MAGs) under the field names "broad-scale environmental context", "local environmental context", and "environmental medium".

Another example can be found here
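If the table is saved to a file (metadata.tsv is just a hypothetical name used in the snippets below), a quick way to verify that the header contains all the expected columns is to print them one per line:

# print the header columns of the metadata table, one per line
head -1 metadata.tsv | tr '\t' '\n'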

Warnings

Raw-read runs from which genomes were generated should already be available in one of the INSDC databases (ENA at EMBL-EBI, GenBank at NCBI, or DDBJ); hence at least one DRR|ERR|SRR accession should be available for every genome to be uploaded. Assembly accessions (ERZ|SRZ|DRZ) are also supported.

If uploading third-party (TPA) genomes, you will need to contact ENA support before using the script. They will provide instructions on how to correctly register a TPA project to which your genomes can be submitted. If both TPA and non-TPA genomes need to be uploaded, please divide them into two batches and use the --tpa flag only with the TPA genomes.

Files to be uploaded will need to be compressed (e.g. already in .gz format).
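For instance, a plain FASTA file can be compressed in place with gzip before its path is listed in the genome_path column (the path below is the one from the example table):

# compress the FASTA file; gzip replaces it with path/to/ERR4647712.fa.gz
gzip path/to/ERR4647712.fa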

No more than 5000 genomes can be submitted at the same time.
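Assuming the table is saved as metadata.tsv (a hypothetical name), the batch size can be checked by counting its non-header lines:

# count genomes listed in the metadata table (header excluded)
tail -n +2 metadata.tsv | wc -l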

Register samples and generate pre-upload files

The script needs Python, pandas, requests, and ena-webin-cli to run. We provide a YAML file (requirements.yml) for the generation of a conda environment:

# Create environment and install requirements
conda env create -f requirements.yml
conda activate genome_uploader
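Since the package is also distributed on PyPI, it should alternatively be installable with pip; note that ena-webin-cli is not a Python dependency and still needs to be installed separately (e.g. from bioconda):

# alternative installation from PyPI (ena-webin-cli must be installed separately)
pip install genome_uploader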

You can generate pre-upload files with:

python genome_upload.py -u UPLOAD_STUDY --genome_info METADATA_FILE (--mags | --bins) --webin WEBIN_ID --password PASSWORD --centre_name CENTRE_NAME [--out] [--force] [--live] [--tpa]

where

  • -u UPLOAD_STUDY: accession of the study the genomes will be uploaded to on ENA (format: ERPxxxxxx or PRJEBxxxxxx)
  • --genome_info METADATA_FILE: genomes metadata file in tsv format
  • -m, --mags / -b, --bins: select bin or MAG upload. If in doubt, check how ENA defines bins and MAGs
  • --out: output folder (default: working directory)
  • --force: forces reset of sample xmls generation
  • --live: registers genomes on ENA's live server. Omitting this option allows you to validate samples beforehand (the upload command will then need the -test option for the test submission to work)
  • --webin WEBIN_ID: webin id (format: Webin-XXXXX)
  • --password PASSWORD: webin password
  • --centre_name CENTRE_NAME: name of the centre uploading genomes
  • --tpa: use this flag if uploading third-party (TPA) generated genomes

It is recommended to validate your genomes in test mode (i.e. without --live in the registration step and with -test during the upload) before attempting the final submission. Launching the registration in test mode adds a timestamp to the genome name, so the test process can be executed multiple times.
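As an illustration, a test-mode registration (no --live flag) could look like the following, where the study accession, metadata file name, Webin credentials, and centre name are all placeholders:

# test-mode registration: omitting --live validates samples without registering them on the live server
python genome_upload.py -u PRJEB00000 --genome_info metadata.tsv --bins \
  --webin Webin-XXXXX --password 'YYY' --centre_name 'MY_CENTRE'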

Sample XMLs won't be regenerated automatically if a previous XML already exists. If any metadata or value in the tsv table changes, use --force to allow XML regeneration.

Produced files:

The script produces the following files and folders:

bin_upload/MAG_upload
├── manifests
│    └── ...
├── ENA_backup.json                 # backup file to prevent re-download of metadata from ENA. Regeneration can be forced with --force
├── genome_samples.xml              # xml generated to register samples on ENA before the upload
├── registered_bins/MAGs.tsv        # list of genomes registered on ENA in live mode - needed for manifest generation
├── registered_bins/MAGs_test.tsv   # list of genomes registered on ENA in test mode - needed for manifest generation
└── submission.xml                  # xml used for genome registration on ENA

Upload genomes

Once manifest files are generated, it is necessary to use ENA's webin-cli resource to upload genomes.

To test your submission (i.e. if you registered your samples without the --live option in genome_upload.py), add the -test argument.

A live execution example within this repo is the following:

ena-webin-cli \
  -context=genome \
  -manifest=ERR123456_bin.1.manifest \
  -userName="Webin-XXX" \
  -password="YYY" \
  -submit
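For a test submission (i.e. samples registered without --live), the same command would also carry the -test flag:

ena-webin-cli \
  -context=genome \
  -manifest=ERR123456_bin.1.manifest \
  -userName="Webin-XXX" \
  -password="YYY" \
  -test \
  -submit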

More information on ENA's webin-cli can be found here.

