Command Line Interface to upload data to the European Nucleotide Archive

ENA upload tool

This command line tool (CLI) allows easy submission of experimental data and the respective metadata to the European Nucleotide Archive (ENA) using tabular files or one of the Excel spreadsheets that can be found on this template repo. The supported metadata includes study, sample, run and experiment info, so you can use the tool for programmatic submission of everything ENA needs without logging in to the Webin interface. This also includes client-side validation using ENA checklists and releasing the ENA objects. The tool is also available as a Galaxy tool: you can add it to your own Galaxy instance or use one of the existing Galaxy instances, such as usegalaxy.eu.

Overview

The metadata should be provided in separate tables corresponding to the following ENA objects:

STUDY
SAMPLE
EXPERIMENT
RUN

The program can perform the following actions:

add: add an object to the archive
modify: modify an object in the archive
cancel: cancel a private object and its dependent objects
release: release a private object immediately to the public

After a successful submission, new tsv tables will be generated with the ENA accession numbers filled in along with a submission receipt.

Tool dependencies

Python 3.5+ with the following packages:
- Genshi
- lxml
- pandas
- requests

Installation

pip install ena-upload-cli

Usage

Minimal:  ena-upload-cli --action {add,modify,cancel,release} --center CENTER_NAME --secret SECRET

All supported arguments:

  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --action {add,modify,cancel,release}
                         add: add an object to the archive
                         modify: modify an object in the archive
                         cancel: cancel a private object and its dependent objects
                         release: release a private object immediately to public
  --study STUDY         table of STUDY object
  --sample SAMPLE       table of SAMPLE object
  --experiment EXPERIMENT
                        table of EXPERIMENT object
  --run RUN             table of RUN object
  --data [FILE ...]     data for submission
  --center CENTER_NAME  specific to your Webin account
  --checklist CHECKLIST
                        specify the sample checklist with following pattern: ERC0000XX, Default: ERC000011
  --xlsx XLSX           Excel table with metadata
  --tool TOOL_NAME      specify the name of the tool this submission is done with. Default: ena-upload-cli
  --tool_version TOOL_VERSION
                        specify the version of the tool this submission is done with
  --no_data_upload      indicate if no upload should be performed and you would like to submit a RUN object (e.g. if the
                        upload was done separately).
  --draft               indicate if no submission should be performed
  --secret SECRET       .secret.yml file containing the password and Webin ID of your ENA account
  -d, --dev             flag to use the dev/sandbox endpoint of ENA

Mandatory arguments: --action, --center and --secret.
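
When the tool is driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only the flags listed above; the helper name build_command and its parameters are hypothetical and not part of ena-upload-cli.

```python
import shlex

ACTIONS = {"add", "modify", "cancel", "release"}
TABLE_TYPES = {"study", "sample", "experiment", "run"}


def build_command(action, center, secret, tables=None, data=None, dev=True):
    """Assemble an ena-upload-cli argument list (hypothetical helper)."""
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    # The three mandatory arguments come first.
    cmd = ["ena-upload-cli", "--action", action, "--center", center, "--secret", secret]
    # tables maps an object type (study, sample, experiment, run) to a TSV path.
    for name, path in (tables or {}).items():
        if name not in TABLE_TYPES:
            raise ValueError(f"unknown table type: {name!r}")
        cmd += [f"--{name}", path]
    if data:
        cmd += ["--data", *data]
    if dev:
        cmd.append("--dev")  # target the sandbox instead of production
    return cmd


cmd = build_command(
    "add",
    "your_center_name",
    ".secret.yml",
    tables={"study": "example_tables/ENA_template_studies.tsv"},
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the tool is installed. Defaulting dev to True is a deliberate choice in this sketch so that a forgotten argument leads to a test submission rather than a real one.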

ENA Webin

A Webin account can be created here if you don't have one already. The Webin ID is the full username, which looks like: Webin-XXXXX. Visit Webin online to check on your submissions, or dev Webin to check on test submissions.

The .secret.yml file

To avoid exposing your credentials through the terminal history, it is recommended to use a .secret.yml file that stores your credentials under the password and username keywords. An example is given in the root of this directory.
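
As an illustration, such a file could be created from Python with owner-only permissions. This is a sketch under assumptions: the key names username and password are inferred from the description above, the helper write_secret_file is hypothetical, and the example file shipped with the project remains the reference for the exact layout.

```python
import json
import os
import tempfile


def write_secret_file(path, username, password):
    """Write a two-key YAML file readable only by the current user (hypothetical helper)."""
    # O_EXCL refuses to overwrite an existing file; 0o600 is owner-only on POSIX systems.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        # json.dumps yields a double-quoted string, which is also a valid YAML scalar,
        # so characters such as ':' or '#' in a password do not break the file.
        handle.write(f"username: {json.dumps(username)}\n")
        handle.write(f"password: {json.dumps(password)}\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        target = os.path.join(folder, ".secret.yml")
        write_secret_file(target, "Webin-XXXXX", "not-a-real-password")
        with open(target, encoding="utf-8") as handle:
            print(handle.read())
```

Keep the resulting file out of version control, for example by listing it in .gitignore.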

ENA sample checklists

You can specify an ENA sample checklist using the --checklist parameter. By default, the ENA default sample checklist (ERC000011) is used, which supports the minimum information required for a sample. The supported checklists are listed on the ENA website, which also describes which Field Names you have to use in the header of your sample tsv table. The Field Names are automatically mapped in the generated XML if the correct --checklist parameter is given.
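
A quick shape check on the checklist ID can catch typos before a submission is attempted. The sketch below assumes that the ERC0000XX pattern from the help text means "ERC0000" followed by two digits; it does not verify that ENA actually defines the checklist, and is_valid_checklist is a hypothetical helper.

```python
import re

# Assumption: "ERC0000XX" means the literal prefix ERC0000 plus exactly two digits.
CHECKLIST_PATTERN = re.compile(r"ERC0000\d{2}")


def is_valid_checklist(value):
    """Return True if value has the documented ERC0000XX shape."""
    return CHECKLIST_PATTERN.fullmatch(value) is not None


for candidate in ("ERC000011", "ERC000033", "ERC33"):
    print(candidate, is_valid_checklist(candidate))
```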

Fixed sample columns

The command line tool will automatically fetch the correct scientific name based on the taxon ID or fetch the taxon ID based on the scientific name. Both can be given and no overwrite will be done.

Mandatory: alias, title, sample_description and either scientific_name or taxon_id (preferred)
Optional: common_name

alias	title	taxon_id	scientific_name	common_name	sample_description
sample_alias_4	sample_title_2	2697049	Severe acute respiratory syndrome coronavirus 2	covid-19	sample_description_1
sample_alias_5	sample_title_3	2697049	Severe acute respiratory syndrome coronavirus 2	covid-19	sample_description_2
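
The column rules above can be checked locally before submitting. The following is a minimal sketch using only the standard library; check_sample_table is a hypothetical helper and does not reproduce the tool's own validation against ENA checklists.

```python
import csv
import io

MANDATORY = ("alias", "title", "sample_description")
TAXON_COLUMNS = ("taxon_id", "scientific_name")


def check_sample_table(handle):
    """Return a list of problems found in a tab-separated sample table."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    problems = [f"missing column: {name}" for name in MANDATORY if name not in header]
    present_taxon = [name for name in TAXON_COLUMNS if name in header]
    if not present_taxon:
        problems.append("missing column: taxon_id or scientific_name")
    # Data rows start on line 2 because line 1 holds the header.
    for line_no, row in enumerate(reader, start=2):
        for name in MANDATORY:
            if name in header and not (row[name] or "").strip():
                problems.append(f"line {line_no}: empty {name}")
        if present_taxon and not any((row[name] or "").strip() for name in present_taxon):
            problems.append(f"line {line_no}: needs taxon_id or scientific_name")
    return problems


EXAMPLE = (
    "alias\ttitle\ttaxon_id\tscientific_name\tcommon_name\tsample_description\n"
    "sample_alias_4\tsample_title_2\t2697049\t"
    "Severe acute respiratory syndrome coronavirus 2\tcovid-19\tsample_description_1\n"
)
print(check_sample_table(io.StringIO(EXAMPLE)))
```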

Viral submissions

If you want to submit viral samples, you can use the ENA virus pathogen checklist by passing ERC000033 to the --checklist parameter. Check out our viral example command as a demonstration. Please consult the ENA virus pathogen checklist on the ENA website to find out which values are allowed in the restricted text and text choice fields.

ENA study, experiment and run tables

Here we list all the possible columns you can have in your study, experiment or run table, along with their cardinality and whether a controlled vocabulary (CV) applies. Currently we refer to the ENA Webin to discover which values are allowed when a controlled vocabulary is used, but this will change in the future.

Study tsv table

Name of column	Cardinality	Documentation	CV
alias	mandatory	Submitter designated name for the object. The name must be unique within the submission account.
title	mandatory	Title of the study as would be used in a publication.
study_type	mandatory	The STUDY_TYPE presents a controlled vocabulary for expressing the overall purpose of the study.	yes
study_abstract	mandatory	Briefly describes the goals, purpose, and scope of the Study. This need not be listed if it can be inherited from a referenced publication.
center_project_name	optional	Submitter defined project name. This field is intended for backward tracking of the study record to the submitter's LIMS.
study_description	optional	More extensive free-form description of the study.
pubmed_id	optional	Link to publication related to this study.

Experiment tsv table

Name of column	Cardinality	Documentation	CV
alias	mandatory	Submitter designated name for the object. The name must be unique within the submission account.
title	mandatory	Short text that can be used to call out experiment records in searches or in displays.
study_alias	mandatory	Identifies the parent study.
sample_alias	mandatory	Pick a sample to associate this experiment with. The sample may be an individual or a pool, depending on how it is specified.
design_description	mandatory	Goal and setup of the individual library, including how the library was constructed.
spot_descriptor	optional	The SPOT_DESCRIPTOR specifies how to decode the individual reads of interest from the monolithic spot sequence. The spot descriptor contains aspects of the experimental design, platform, and processing information. There will be two methods of specification: one will be an index into a table of typical decodings, the other being an exact specification. This construct is needed for loading data and for interpreting the loaded runs. It can be omitted if the loader can infer read layout (from multiple input files or from one input file).
library_name	mandatory	The submitter's name for this library.
library_layout	mandatory	LIBRARY_LAYOUT specifies whether to expect single, paired, or other configuration of reads. In the case of paired reads, information about the relative distance and orientation is specified.	yes
insert_size	mandatory	Relative distance.
library_strategy	mandatory	Sequencing technique intended for this library	yes
library_source	mandatory	The LIBRARY_SOURCE specifies the type of source material that is being sequenced.	yes
library_selection	mandatory	Method used to enrich the target in the sequence library preparation	yes
platform	mandatory	The PLATFORM record selects which sequencing platform and platform-specific runtime parameters. This will be determined by the Center.	yes
library_construction_protocol	optional	Free form text describing the protocol by which the sequencing library was constructed.

Run tsv table

Name of column	Cardinality	Documentation	CV
alias	mandatory	Submitter designated name for the object. The name must be unique within the submission account.
experiment_alias	mandatory	Identifies the parent experiment.
file_name	mandatory	The name or relative pathname of a run data file.
file_type	mandatory	The run data file model.	yes
file_checksum	optional	Checksum of uncompressed file. If not given, the checksum will be calculated based on the data files specified in the --data option
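
If you prefer to fill in file_checksum yourself, an MD5 sum can be computed with the standard library. This is a sketch, not the tool's own routine: md5_of_file is a hypothetical helper, and it hashes the bytes of the file exactly as stored on disk, which is an assumption about what the column expects.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks matters here because sequencing files are often several gigabytes in size.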

Dev instance

By default the submission is sent to the following ENA URL: https://www.ebi.ac.uk/ena/submit/drop-box/submit/?auth=ENA

Use the --dev flag if you want to do a test submission against the sandbox dev instance of ENA: https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/?auth=ENA. A TEST submission will be discarded within 24 hours.
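
The relation between the --dev flag and the two endpoints can be summarised in a few lines. This sketch only restates the URLs given above; the function name submission_url is hypothetical and does not claim to mirror the tool's internals.

```python
PRODUCTION_URL = "https://www.ebi.ac.uk/ena/submit/drop-box/submit/?auth=ENA"
DEV_URL = "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/?auth=ENA"


def submission_url(dev=False):
    """Return the sandbox endpoint when dev is True, the production endpoint otherwise."""
    return DEV_URL if dev else PRODUCTION_URL


print(submission_url(dev=True))
```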

Submitting a selection of rows to ENA

Optionally you can add a status column to every table that contains the action you want to apply during this submission. If you choose to add only the first 2 samples to ENA, specify --action add as parameter in the command and put the value add in the status column of the rows you want to submit, as demonstrated below. The same holds for the actions modify, release and cancel.

Example with modify as seen in the example sample modify table

alias	status	title	taxon_id	sample_description
sample_alias_4	modify	sample_title_1	2697049	sample_description_1
sample_alias_5		sample_title_2	2697049	sample_description_2

IMPORTANT: if the status column is given but not filled in, or filled in with a different action from the one in the --action parameter, no rows will be submitted! Either leave out the column or add the correct action to every row.
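
The selection rule can be previewed locally to see which rows a given --action would pick up. The following sketch implements the rule as described in this section; rows_to_submit is a hypothetical helper, not the tool's own code.

```python
import csv
import io


def rows_to_submit(handle, action):
    """Return the rows of a tab-separated table that match the given action."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = list(reader)
    # Without a status column the whole table is submitted.
    if "status" not in (reader.fieldnames or []):
        return rows
    # With a status column only rows carrying exactly this action are kept,
    # so an empty or mismatching status excludes the row.
    return [row for row in rows if (row["status"] or "").strip() == action]


EXAMPLE = (
    "alias\tstatus\ttitle\ttaxon_id\tsample_description\n"
    "sample_alias_4\tmodify\tsample_title_1\t2697049\tsample_description_1\n"
    "sample_alias_5\t\tsample_title_2\t2697049\tsample_description_2\n"
)
selected = rows_to_submit(io.StringIO(EXAMPLE), "modify")
print([row["alias"] for row in selected])
```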

Using Excel templates

We also support specific Excel templates, designed for each sample checklist. Use the --xlsx argument to pass the path to an Excel template file, filled in from this template repo.

The data files

Supported data

Read data
Genome Assembly
Transcriptome Assembly
Template Sequence
Other Analyses

Most files uploaded to the ENA FTP server need to be compressed.

More information on how ENA wants to receive the files can be found here.

Note for data upload: Uploaded files are persistently stored on the ENA server for some time after the upload. Thus, if multiple test submissions are performed, it is possible to skip the data upload with --no_data_upload in subsequent submissions. This also allows uploading (large) datasets separately, e.g. with Aspera. With the --no_data_upload argument, data file(s) still need to be provided with --data if a RUN object is submitted, in order to generate MD5 sums. If the

Releasing and canceling a submission

If you want to release or cancel data, use release or cancel in the --action parameter on the command line. Tables that have to be released or cancelled need an accession column with the corresponding accession ids. This means that, if you did not yet submit your data, you first have to submit it with add and afterwards use the updated table with accession ids.

By default the updated tables after submission will have the action added in their status column. Don't forget to change the values to release or cancel if you want to use one of these actions (or delete the status column if your action applies for the whole table).
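
Changing the status values by hand is error-prone for large tables, so a small script can do it. This is a sketch under assumptions: the column names status and accession follow the text above, the helper retarget_status is hypothetical, and the accession value used in the usage example is a placeholder, not a real ENA accession.

```python
import csv
import io


def retarget_status(in_handle, out_handle, action):
    """Copy a tab-separated table, setting every status value to release or cancel."""
    if action not in {"release", "cancel"}:
        raise ValueError(f"unsupported action: {action!r}")
    reader = csv.DictReader(in_handle, delimiter="\t")
    fields = list(reader.fieldnames or [])
    if "accession" not in fields:
        raise ValueError("no accession column: submit the table with --action add first")
    writer = csv.DictWriter(out_handle, fieldnames=fields, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if "status" in fields:
            row["status"] = action  # replaces the status left by the previous submission
        writer.writerow(row)


source = io.StringIO("alias\tstatus\taccession\nstudy_alias_1\tadded\tPLACEHOLDER_ACCESSION\n")
target = io.StringIO()
retarget_status(source, target, "release")
print(target.getvalue())
```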

NOTE: Releasing a study will make all child elements like runs and experiments public.

Tool overview

inputs:

metadata tables/Excel sheet
- examples in example_table and on this template repo for Excel sheets
- (optional) define actions in status column e.g. add, modify, cancel, release (when not given the whole table is submitted)
- to perform a bulk submission of all objects, the alias ids in the different ENA objects must be consistent with each other: the alias ids in the experiment object link all objects together
experimental data
- examples in example_data

outputs:

a receipt.xml file in the working directory with the receipt from the ENA submission
metadata tables with updated info in the same directory of inputs:
- updated status: added, modified, canceled, released
- accession ids
- submission date
- file checksums in runs table if not given
- taxon id or scientific name in sample table if not given
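
Since the alias ids tie the objects together, a local consistency check before a bulk submission can save a failed round trip. The sketch below uses the column names from the study, experiment and run tables described earlier; find_broken_links is a hypothetical helper that works on rows already loaded as dictionaries.

```python
def find_broken_links(studies, samples, experiments, runs):
    """Return messages for alias references that point at no existing object."""
    study_aliases = {row["alias"] for row in studies}
    sample_aliases = {row["alias"] for row in samples}
    experiment_aliases = {row["alias"] for row in experiments}
    problems = []
    for exp in experiments:
        # Each experiment names its parent study and its sample.
        if exp["study_alias"] not in study_aliases:
            problems.append(f"experiment {exp['alias']}: unknown study_alias {exp['study_alias']}")
        if exp["sample_alias"] not in sample_aliases:
            problems.append(f"experiment {exp['alias']}: unknown sample_alias {exp['sample_alias']}")
    for run in runs:
        # Each run names its parent experiment.
        if run["experiment_alias"] not in experiment_aliases:
            problems.append(f"run {run['alias']}: unknown experiment_alias {run['experiment_alias']}")
    return problems


problems = find_broken_links(
    studies=[{"alias": "study_1"}],
    samples=[{"alias": "sample_1"}],
    experiments=[{"alias": "exp_1", "study_alias": "study_1", "sample_alias": "sample_2"}],
    runs=[{"alias": "run_1", "experiment_alias": "exp_1"}],
)
print(problems)
```

The rows can be loaded with csv.DictReader(handle, delimiter="\t") from the tsv tables.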

Test the tool

add metadata and sequence data

ena-upload-cli --action add --center 'your_center_name' --study example_tables/ENA_template_studies.tsv --sample example_tables/ENA_template_samples.tsv --experiment example_tables/ENA_template_experiments.tsv --run example_tables/ENA_template_runs.tsv --data example_data/*gz --dev --secret .secret.yml

add metadata only

ena-upload-cli --action add --center 'your_center_name' --study example_tables/ENA_template_studies.tsv --sample example_tables/ENA_template_samples.tsv --experiment example_tables/ENA_template_experiments.tsv --run example_tables/ENA_template_runs_md5sums.tsv --dev --secret .secret.yml

add studies

ena-upload-cli --action add --center 'your_center_name' --study example_tables/ENA_template_studies.tsv --dev --secret .secret.yml

modify sample metadata

ena-upload-cli --action modify --center 'your_center_name' --sample example_tables/ENA_template_samples_modify.tsv --dev --secret .secret.yml

viral data

ena-upload-cli --action add --center 'your_center_name' --study example_tables/ENA_template_studies.tsv --sample example_tables/ENA_template_samples_vir.tsv --experiment example_tables/ENA_template_experiments.tsv --run example_tables/ENA_template_runs.tsv --data example_data/*gz --dev --checklist ERC000033 --secret .secret.yml

Using an Excel template

ena-upload-cli --action add --center 'your_center_name' --data example_data/*gz --dev --checklist ERC000033 --secret .secret.yml --xlsx example_tables/ENA_excel_example_ERC000033.xlsx

release submission

ena-upload-cli --action release --center 'your_center_name' --study example_tables/ENA_template_studies_release.tsv --dev --secret .secret.yml

Note for Windows users: Windows, by default, does not support wildcard expansion in command-line arguments. Because of this the --data example_data/*gz argument should be substituted with one containing a list of the data files. For this example, use:
--data example_data/ENA_TEST1.R1.fastq.gz example_data/ENA_TEST2.R1.fastq.gz example_data/ENA_TEST2.R2.fastq.gz
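
For larger file sets, typing every file name becomes impractical. One possible workaround is to let Python expand the pattern and hand the resulting list to --data; the sketch below uses the standard glob module, and expand_data_args is a hypothetical helper.

```python
import glob


def expand_data_args(patterns):
    """Expand wildcard patterns into a sorted list of matching file paths."""
    files = []
    for pattern in patterns:
        matches = sorted(glob.glob(pattern))
        # Keep the literal value when nothing matches, so a typo is not silently dropped.
        files.extend(matches or [pattern])
    return files


print(expand_data_args(["example_data/*gz"]))
```

The returned list can be appended after "--data" in an argument list passed to subprocess.run.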
