Project description

getENA

Sometimes we need to download a sequencing project from ENA (the European Nucleotide Archive). Fortunately, ENA provides a link to each file we need. However, downloading the files manually can take a long time when there are many of them.

I have developed a small Python project that automates this work and downloads the files in parallel to improve performance.

Installation

pip install getENA

Alternatively, from GitHub

pip install git+https://github.com/EnzoAndree/getENA

Usage

Let's say we are interested in Clostridium perfringens sequencing projects. First, we search ENA for public sequencing projects at https://www.ebi.ac.uk/ena/browser/text-search?query=clostridium%20perfringens. There, we choose the project codes that we need, for example:

PRJNA350702 PRJNA285473 PRJNA508810

We have two options to download the FASTQ files: (1) pass the project codes on the command line as space-separated arguments, or (2) make a file containing a list of all the project codes that we need.

For the first option (recommended for few projects, e.g. < 5), we run the following:

getENA.py -acc PRJNA350702 PRJNA285473 PRJNA508810

For the second option (recommended for many projects, e.g. >= 5), we run the following:

getENA.py -accfile ena.list.txt

Where ena.list.txt is the file containing a list of all the project codes.
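The exact layout of ena.list.txt is not spelled out here. As a minimal sketch, assuming the file holds one accession per line, it could be generated from Python like this (write_accession_list is a hypothetical helper, not part of getENA):

```python
from pathlib import Path


def write_accession_list(accessions, path="ena.list.txt"):
    """Write one accession per line (assumed format) and return the path."""
    # Strip whitespace, skip blanks, and drop duplicates while keeping order.
    unique = list(dict.fromkeys(a.strip() for a in accessions if a.strip()))
    out = Path(path)
    out.write_text("\n".join(unique) + "\n")
    return out
```

Calling write_accession_list(["PRJNA350702", "PRJNA285473", "PRJNA508810"]) would produce a file that can then be passed to -accfile, provided the one-per-line assumption holds.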

Alternatively, if you only want to download a few selected genomes from a project, simply pass their run accessions (run_accession) to -acc instead:

getENA.py -acc SRR096826 SRR8867692 SRR7601184
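Since -acc accepts both project codes and run accessions, it can help to check which kind a string is before building a long command line. How getENA tells them apart internally is not described here; the sketch below is a hypothetical helper that uses the accession patterns published in ENA's accession-number documentation:

```python
import re

# Patterns follow ENA's published accession formats (assumed to cover our cases).
PROJECT_RE = re.compile(r"^PRJ(E|D|N)[A-Z][0-9]+$")
RUN_RE = re.compile(r"^(E|D|S)RR[0-9]{6,}$")


def classify_accession(acc):
    """Return 'project', 'run' or 'unknown' for an accession string."""
    acc = acc.strip()
    if PROJECT_RE.match(acc):
        return "project"
    if RUN_RE.match(acc):
        return "run"
    return "unknown"
```

For example, classify_accession("PRJNA350702") gives "project" and classify_accession("SRR096826") gives "run".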

If you want, you can improve performance by increasing the number of read files that are downloaded in parallel (-t option). Be careful, however: ENA aborts the connection if it detects too many simultaneous connections to its FTP server. Empirically, I have observed that 12 parallel connections work properly without ENA cancelling the download.
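To illustrate the idea of capping simultaneous connections, here is a minimal sketch of a bounded worker pool. It is not getENA's implementation: run_in_parallel and its worker argument are hypothetical names, and the default of 12 simply mirrors the empirical limit mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_parallel(tasks, worker, max_workers=12):
    """Apply worker to each task using at most max_workers threads.

    Returns (results, errors): results maps task -> return value,
    errors maps task -> the exception that was raised.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, task): task for task in tasks}
        for future in as_completed(futures):
            task = futures[future]
            try:
                results[task] = future.result()
            except Exception as exc:  # record the failure so it can be retried
                errors[task] = exc
    return results, errors
```

Here worker would be whatever function downloads a single file. Collecting failures separately, rather than stopping at the first one, means an aborted connection only affects the file it was fetching.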

An extreme example of the above command with many parallel connections would be the following:

getENA.py -t 64 -acc PRJNA350702 PRJNA285473 PRJNA508810

One of the main features of getENA.py is that it automatically confirms the integrity of each FASTQ file when you download it. If the connection is lost, if ENA cancels the connection, or if getENA.py is stopped, you can run the program again and resume the download without losing the files that were already downloaded.
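Which checksum getENA.py uses is not stated here. As a sketch of how such a check-and-resume step can work, assuming an MD5 digest is available for each file (ENA file reports publish one per FASTQ file), the hypothetical helpers below decide whether a file still needs to be fetched:

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path, expected_md5):
    """True if the file is missing or its checksum does not match."""
    path = Path(path)
    return not path.is_file() or md5sum(path) != expected_md5.lower()
```

On a second run, files for which needs_download returns False can be skipped, so only missing or truncated files are fetched again.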

By default the output directory of getENA.py is a folder called ENA_out in the current directory. It can be modified with the -o argument. For example:

getENA.py -o Cperfringens -t 64 -acc PRJNA350702 PRJNA285473 PRJNA508810

Output files

The files and folders created follow this layout:

|ENA_out
|-- metadata.tsv
|-- ERR0001_1.fastq.gz
|-- ERR0001_2.fastq.gz
|-- ...
|-- ERR0009_1.fastq.gz
|-- ERR0009_2.fastq.gz
|-- tmp
|---- PRJNA350702.tsv
|---- PRJNA285473.tsv
|---- PRJNA508810.tsv

Where PRJNA350702.tsv, PRJNA285473.tsv and PRJNA508810.tsv contain the metadata of the selected projects and metadata.tsv is a merge of these three files. The ENA_out folder contains all the FASTQ files of every project.
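How getENA.py builds metadata.tsv is not detailed here. A minimal sketch of merging per-project TSV files, assuming each file has a header row and that columns may differ between projects (merge_tsv is a hypothetical helper):

```python
import csv


def merge_tsv(tsv_paths, out_path):
    """Concatenate TSV files into one, taking the union of their columns.

    Returns the number of data rows written.
    """
    rows, columns = [], []
    for path in tsv_paths:
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            # Collect column names in first-seen order.
            for name in reader.fieldnames or []:
                if name not in columns:
                    columns.append(name)
            rows.extend(reader)
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=columns,
            delimiter="\t",
            restval="",  # fill columns a project does not have
            lineterminator="\n",
        )
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

With the layout above, this would be called with the three files in tmp and ENA_out/metadata.tsv as the output path.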

If you only want the assemblies reported in ENA, you can download all the FASTA files for a given taxon ID. For example, the taxon ID of Clostridium perfringens is 1502, so the command line to download all assemblies of this species is:

getENA.py -o Cperfringens -tax 1502

This command creates a genomes directory inside the Cperfringens folder, where all assemblies reported to date are placed.

Licence

GPL v3

Author

Enzo Guerrero-Araya
Twitter: @eguerreroaraya

Project details

Release history

1.2.2 (this version): Jan 12, 2020
1.2.1: Jan 12, 2020
1.2.0: Jan 12, 2020
1.1.0: Dec 27, 2019
1.0.4: Dec 22, 2019
1.0.3: Dec 22, 2019
1.0.2: Dec 22, 2019
1.0.1: Dec 22, 2019

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

getENA-1.2.2-py3-none-any.whl (19.1 kB)

Uploaded Jan 12, 2020 Python 3

File details

Details for the file getENA-1.2.2-py3-none-any.whl.

File metadata

Download URL: getENA-1.2.2-py3-none-any.whl
Upload date: Jan 12, 2020
Size: 19.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/42.0.2 requests-toolbelt/0.9.1 tqdm/4.38.0 CPython/3.7.3

File hashes

Hashes for getENA-1.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a7ea818ee03ba1b23c3b6a22b62a874cc8f3924e87317a9320743ac9a63a8fe0`
MD5	`c1abf186a0922f13a9ee57fe1526aa04`
BLAKE2b-256	`b979aaf72825669d929c9d2121a716cd38db5533806f3e3661dd3027b4b0b606`
