
Python scripts to upload primary metagenome and metatranscriptome assemblies to ENA on a per-study basis. The scripts generate the XML files needed to register a new study and create the manifests necessary for submission with webin-cli.


ENA Metagenome Assembly uploader

Upload of metagenome and metatranscriptome assemblies to the European Nucleotide Archive (ENA)

Pre-requisites:

  • CSV metadata file, one per study. See tests/fixtures/test_metadata for an example.
  • Compressed assembly fasta files in the locations defined in the metadata file
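As a rough sketch of what such a metadata file could contain, the snippet below writes one row with Python's csv module. The header names (Runs, Coverage, Assembler, Version, Filepath, Sample) and every value are assumptions for illustration, derived from the fields listed for assembly_manifest; check tests/fixtures/test_metadata for the exact layout the package expects.

```python
import csv
import io

# Hypothetical column names; confirm them against tests/fixtures/test_metadata.
FIELDS = ["Runs", "Coverage", "Assembler", "Version", "Filepath", "Sample"]


def metadata_csv_text(rows):
    """Serialise assembly metadata rows (dicts keyed by FIELDS) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative values only: accessions, coverage and paths are made up.
example_rows = [
    {
        "Runs": "SRR1234,SRR5678",  # the csv module quotes this field because it contains a comma
        "Coverage": "25.4",
        "Assembler": "metaspades",
        "Version": "3.15.3",
        "Filepath": "assemblies/coassembly.fasta.gz",
        "Sample": "SAMEA0000000",
    }
]

csv_text = metadata_csv_text(example_rows)
print(csv_text)
```

Writing the file through a CSV library rather than by hand avoids forgetting the quotes around a multi-run field.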

Set the following environment variables with your Webin account details:

ENA_WEBIN

export ENA_WEBIN=Webin-0000

ENA_WEBIN_PASSWORD

export ENA_WEBIN_PASSWORD=password
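Before running any of the steps from a Python workflow, it can help to confirm that both variables are present. The following is a minimal sketch; the helper name missing_webin_vars is hypothetical and not part of assembly_uploader.

```python
import os

REQUIRED_VARS = ("ENA_WEBIN", "ENA_WEBIN_PASSWORD")


def missing_webin_vars(environ=os.environ):
    """Return the names of required Webin variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_webin_vars()
if missing:
    print("Set these variables before running the uploader: " + ", ".join(missing))
```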

Installation

Install the package:

pip install assembly-uploader

Usage

From the command line

Register study and generate pre-upload files

If you already have a registered study accession for your assembly files, skip to Step 3.

Step 1: generate XML files for a new assembly study submission

This step will generate a folder <STUDY>_upload containing a project XML and a submission XML:

study_xmls
  --study STUDY         raw reads study ID
  --library LIBRARY     metagenome or metatranscriptome
  --center CENTER       center for upload e.g. EMG
  --hold HOLD           hold date (private) if it should be different from the provided study in format dd-mm-yyyy. Will inherit the release date of the raw read study if not
                        provided.
  --tpa                 use this flag if the study is a third party assembly. Default False
  --publication PUBLICATION
                        pubmed ID for connected publication if available
  --private             use flag if your data is private
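If you drive this step from another script, one way to avoid date-format mistakes is to build the argument list programmatically. The sketch below only assembles the command from the options listed above; the helper study_xmls_command is hypothetical, and the study ID, center and hold date are example values.

```python
from datetime import date


def study_xmls_command(study, library, center, hold=None, tpa=False, private=False):
    """Build the argument list for the study_xmls command described above."""
    command = ["study_xmls", "--study", study, "--library", library, "--center", center]
    if hold is not None:
        # --hold expects the date as dd-mm-yyyy
        command += ["--hold", hold.strftime("%d-%m-%Y")]
    if tpa:
        command.append("--tpa")
    if private:
        command.append("--private")
    return command


command = study_xmls_command("SRP272267", "metagenome", "EMG", hold=date(2030, 1, 31), tpa=True)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run once the package is installed.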

Step 2: submit the new assembly study to ENA

This step will submit the XML to ENA and generate a new assembly study accession identifier. Make sure to write down the newly generated study accession identifier!

[!NOTE]

We recommend first submitting the study to ENA's TEST server using the --test argument. If no errors occur, re-run the command without the --test argument for a live submission.

submit_study
  --study STUDY         raw reads study ID
  --directory PATH      directory containing study XML
  --test                run test submission only
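The test-first recommendation can also be expressed in Python. The sketch below takes the submit function as a parameter (for example assembly_uploader.submit_study.submit_study, which accepts is_test and directory) and assumes that a failed submission raises an exception, which is not confirmed here. The wrapper submit_test_then_live is hypothetical.

```python
def submit_test_then_live(submit, study, directory):
    """Submit to the TEST server first; submit live only if that raised no error.

    `submit` is any callable with the signature submit(study, is_test=..., directory=...).
    Assumption: a failed submission raises an exception, so the live call is never reached.
    """
    submit(study, is_test=True, directory=directory)
    return submit(study, is_test=False, directory=directory)


# Possible usage, with the package installed and Webin credentials set:
# from pathlib import Path
# from assembly_uploader.submit_study import submit_study
# accession = submit_test_then_live(submit_study, "SRP272267", Path("SRP272267_upload"))
```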

Step 3: make a manifest file for each assembly

[!IMPORTANT] Please read carefully before creating manifest files for co-assemblies:

  1. Co-assemblies cannot be generated from a mix of private and public runs - all runs used in a co-assembly must have the same privacy status (all private or all public).
  2. If your co-assembly was assembled from runs generated from multiple biological samples, you must first register a co-assembly sample (see ENA FAQ on co-assemblies) and then specify it in the Sample column of your metadata CSV file.
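The first rule can be checked before any manifest is written. The sketch below assumes you have already looked up the privacy status of each run yourself; the function coassembly_privacy and the mapping it takes are hypothetical and not part of assembly_uploader.

```python
def coassembly_privacy(run_is_private):
    """Return 'private' or 'public' when all runs agree; raise ValueError otherwise.

    `run_is_private` maps run accession -> True (private) or False (public).
    """
    if not run_is_private:
        raise ValueError("no runs given")
    statuses = set(run_is_private.values())
    if len(statuses) > 1:
        private_runs = sorted(run for run, private in run_is_private.items() if private)
        raise ValueError(
            "co-assembly mixes private and public runs; private runs: " + ", ".join(private_runs)
        )
    return "private" if statuses.pop() else "public"


# Example accessions are placeholders.
print(coassembly_privacy({"SRR1234": False, "SRR5678": False}))
```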

This step will generate manifest files in the folder <STUDY>_upload for runs specified in the metadata file:

assembly_manifest
  --study STUDY         raw reads study ID
  --data DATA           metadata CSV - runs (comma-separated and in quotes, example: "SRR1234,SRR5678"), coverage, assembler, version, filepath and optionally sample
  --assembly_study ASSEMBLY_STUDY
                        pre-existing study ID to submit to if available. Must exist in the webin account
  --force               overwrite all existing manifests
  --private             use flag if your data is private
  --tpa                 use this flag if the study is a third party assembly. Default False
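To illustrate why the runs of a co-assembly must be quoted, the sketch below parses a small in-memory CSV with Python's csv module: the quoted "SRR1234,SRR5678" stays a single field and is then split into individual runs. The header names and values are assumptions for illustration, and this is not necessarily how assembly_manifest reads the file.

```python
import csv
import io

# Illustrative metadata; the header names are assumed, not taken from the package.
EXAMPLE_CSV = (
    "Runs,Coverage,Assembler,Version,Filepath\n"
    '"SRR1234,SRR5678",25.4,metaspades,3.15.3,coassembly.fasta.gz\n'
    "SRR9999,12.0,megahit,1.2.9,single.fasta.gz\n"
)


def runs_per_assembly(csv_text, runs_column="Runs"):
    """Return one list of run accessions per metadata row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [[run.strip() for run in row[runs_column].split(",")] for row in reader]


run_lists = runs_per_assembly(EXAMPLE_CSV)
print(run_lists)
```

Without the quotes, the second run would be read as the coverage value and every later column would shift by one.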

Step 4: upload assemblies

Once manifest files are generated, it is necessary to use ENA's webin-cli resource to upload the metagenome assemblies. More information on ENA's webin-cli can be found in the ENA docs.

We recommend using a pre-installed webin_cli_handler script.

[!NOTE]

First, validate your submission with the --mode validate.
Second, upload to ENA's TEST server using the --test flag (make sure you have submitted the study to ENA's TEST server in Step 2).

Run live execution:

webin_cli_handler \
  --manifest *.manifest \
  --context genome \
  --mode submit \
  [--test]

If you do not have ena-webin-cli installed, add the --download-webin-cli flag and the tool will be downloaded automatically. ena-webin-cli requires a recent Java version; see its official repository for details.
If you want to use a local webin-cli .jar file, provide its path with --webin-cli-jar.
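The validate, test and live stages described in the note above can be generated per manifest rather than typed by hand. This is a minimal sketch that only builds the argument lists from the flags shown in this section; the helper names and the manifest path are hypothetical.

```python
def handler_command(manifest, mode, test=False, context="genome"):
    """Build one webin_cli_handler invocation for a single manifest file."""
    command = [
        "webin_cli_handler",
        "--manifest", str(manifest),
        "--context", context,
        "--mode", mode,
    ]
    if test:
        command.append("--test")
    return command


def staged_commands(manifest):
    """Return the three invocations in the recommended order."""
    return [
        handler_command(manifest, "validate"),           # 1. validate only
        handler_command(manifest, "submit", test=True),  # 2. submit to the TEST server
        handler_command(manifest, "submit"),             # 3. live submission
    ]


# Placeholder manifest path for illustration.
for command in staged_commands("SRP272267_upload/example.manifest"):
    print(" ".join(command))
```

To cover every manifest in the upload folder, the same function could be called for each path returned by pathlib.Path("SRP272267_upload").glob("*.manifest").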

Other options:

webin_cli_handler

  -h, --help            show this help message and exit
  -m, --manifest MANIFEST
                        Manifest text file containing file and metadata fields
  -c, --context {genome,transcriptome,sequence,polysample,reads,taxrefset}
                        Submission type: genome, transcriptome, sequence, polysample, reads, taxrefset
  --mode {submit,validate}
                        submit or validate
  --test                Specify to use test server instead of live
  --workdir WORKDIR     Path to working directory
  --download-webin-cli  Specify if you do not have ena-webin-cli installed
  --download-webin-cli-directory DOWNLOAD_WEBIN_CLI_DIRECTORY
                        Path to save webin-cli into
  --download-webin-cli-version DOWNLOAD_WEBIN_CLI_VERSION
                        Version of ena-webin-cli to download, default: latest
  --webin-cli-jar WEBIN_CLI_JAR
                        Path to pre-downloaded webin-cli.jar file to execute
  --retries RETRIES     Number of retry attempts (default: 3)
  --retry-delay RETRY_DELAY
                        Initial retry delay in seconds (default: 5)
  --java-heap-size-initial JAVA_HEAP_SIZE_INITIAL
                        Java initial heap size in GB (default: 10)
  --java-heap-size-max JAVA_HEAP_SIZE_MAX
                        Java maximum heap size in GB (default: 10)

Optional step 5: publicly releasing a private study

release_study
  --study STUDY         study ID (e.g. of the assembly study)
  --test                run test submission only

From a Python script

assembly_uploader can also be used as a Python library, so that you can integrate the steps into another Python workflow or tool.

from pathlib import Path

from assembly_uploader.study_xmls import StudyXMLGenerator, METAGENOME
from assembly_uploader.submit_study import submit_study
from assembly_uploader.assembly_manifest import AssemblyManifestGenerator

# Generate new assembly study XML files
StudyXMLGenerator(
    study="SRP272267",
    center_name="EMG",
    library=METAGENOME,
    tpa=True,
    output_dir=Path("my-study"),
).write()

# Submit new assembly study to ENA
new_study_accession = submit_study("SRP272267", is_test=True, directory=Path("my-study"))
print(f"My assembly study has the accession {new_study_accession}")

# Create manifest files for the assemblies to be uploaded
# This assumes you have a CSV file detailing the assemblies with their assembler and coverage metadata
# see tests/fixtures/test_metadata for an example
AssemblyManifestGenerator(
    study="SRP272267",
    assembly_study=new_study_accession,
    assemblies_csv=Path("/path/to/my/assemblies.csv"),
    output_dir=Path("my-study"),
).write()

The ENA submission requires webin-cli, so follow Step 4 above. (You could still call this from Python, e.g. with subprocess.Popen.)
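A minimal sketch of the subprocess approach follows. It uses subprocess.run rather than Popen for brevity, and runs a stand-in Python command so that it works without webin-cli installed; the helper run_and_report and the manifest path in the comment are hypothetical.

```python
import subprocess
import sys


def run_and_report(command):
    """Run a command and return (succeeded, combined stdout and stderr)."""
    completed = subprocess.run(command, capture_output=True, text=True, check=False)
    return completed.returncode == 0, completed.stdout + completed.stderr


# Stand-in command so the sketch runs anywhere. For a real validation, replace it with e.g.
# ["webin_cli_handler", "--manifest", "my-study/example.manifest",
#  "--context", "genome", "--mode", "validate"]
ok, output = run_and_report([sys.executable, "-c", "print('validated')"])
print(ok, output.strip())
```

Passing the command as a list avoids shell quoting issues with manifest paths that contain spaces.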

Finally, you can also publicly release a private/embargoed/held study:

from assembly_uploader.release_study import release_study
release_study("SRP272267")

Development setup

Prerequisites: a functioning conda or pixi installation.

To install the assembly uploader codebase in "editable" mode:

conda env create -f requirements.yml
conda activate assemblyuploader
pip install -e '.[dev,test]'
pre-commit install

Testing

pytest
