medirect

Multithreaded ncbi edirect and ftract

Project description

medirect - a multiprocessed utility for retrieving records and parsing feature tables from NCBI

As a bioinformatician I build a lot of bacterial DNA reference databases. Part of my job is to gather sequence data where it is available from outside sources, and one of those sources is the NCBI nucleotide database. I designed this package to gather data from NCBI quickly by using multiple processors to make multiple simultaneous data requests to the NCBI database servers. The utilities mefetch and ftract are designed to work like efetch and xtract. They can be slotted in alongside the other NCBI utilities and follow the same edirect documentation, guidelines, requirements and usage policies. They have primarily been tested on the nucleotide database but should work on any type of data available through the NCBI servers.

The mefetch utility is designed to be fast and can easily overwhelm the NCBI servers. For this reason I highlight two points from the usage policy:

Run retrieval scripts on weekends or between 9 pm and 5 am Eastern Time weekdays for any series of more than 100 requests.
Make no more than 3 requests every 1 second.
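The two policy points above can be checked before a large job is launched. The following is a minimal sketch, not part of medirect: the function name is_off_peak and the constant MIN_INTERVAL are hypothetical, and the function assumes the datetime it receives is already expressed in US Eastern Time.

```python
from datetime import datetime


def is_off_peak(eastern_time: datetime) -> bool:
    """Return True if a job of more than 100 requests may run at this time.

    Assumption: eastern_time is already expressed in US Eastern Time.
    """
    if eastern_time.weekday() >= 5:  # Saturday (5) or Sunday (6)
        return True
    # Weekdays: allowed from 9 pm up to, but not including, 5 am.
    return eastern_time.hour >= 21 or eastern_time.hour < 5


# Hypothetical pacing constant: wait at least this many seconds between
# requests to stay at or below 3 requests per second.
MIN_INTERVAL = 1.0 / 3

print(is_off_peak(datetime(2025, 4, 12, 12, 0)))  # a Saturday at noon
print(is_off_peak(datetime(2025, 4, 15, 10, 0)))  # a Tuesday morning
```

A wrapper script could call a check like this and refuse to start a long retrieval during weekday working hours.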

The ftract utility pattern-matches features based on the three-column table structure described in the NCBI feature tables documentation. It is designed to parse data and coordinates from feature tables, which are orders of magnitude smaller and faster to parse than the XML tables handled by the xtract parser utility in the standard edirect package. The entire edirect package is available from the NCBI ftp downloads.

dependencies

Python 3.x
biopython >= 1.68
retrying >= 1.3.3

installation

medirect can be installed in two ways:

For regular users:

% pip3 install medirect

For developers:

% pip3 install git+https://github.com/crosenth/medirect.git
# or
% git clone https://github.com/crosenth/medirect.git
% cd medirect
% pip3 install .

examples

The mefetch executable works exactly like edirect efetch, with an additional multiprocessing argument, -proc, and a few more features.

The -proc argument lets additional processes download records in parallel, which gives a roughly linear (or better, in the timings below) speedup when downloading large datasets.

Here is an example downloading 255,303 Rhizobium sequence accessions using one processor:

% esearch -db nucleotide -query 'Rhizobium' | time mefetch -email user@ema.il -mode text -format acc -proc 1 > accessions.txt
0.53s user 0.11s system 0% cpu 12:43.11 total

This is equivalent to NCBI efetch:

% esearch -db nucleotide -query 'Rhizobium' | time efetch -mode text -format acc > accessions.txt
0.53s user 0.11s system 0% cpu 12:47.54 total

Adding another processor -proc 2:

% esearch -db nucleotide -query 'Rhizobium' | time mefetch -email user@ema.il -proc 2 -mode text -format acc > accessions.txt
0.46s user 0.08s system 0% cpu 5:17.51 total

And another -proc 3 (default):

% esearch -db nucleotide -query 'Rhizobium' | time mefetch -email user@ema.il -proc 3 -mode text -format acc > accessions.txt
0.35s user 0.10s system 0% cpu 2:57.01 total

And -proc 4 (see usage policy):

% esearch -db nucleotide -query 'Rhizobium' | time mefetch -email user@ema.il -proc 4 -mode text -format acc > accessions.txt
0.35s user 0.08s system 0% cpu 1:40.54 total
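To put the timings above side by side, the short calculation below converts each reported wall-clock total to seconds and divides the -proc 1 total by it. It uses only the four mefetch totals listed above. The helper name to_seconds is illustrative, and a single run per setting is not a benchmark, so the ratios should be read as indicative only.

```python
def to_seconds(clock: str) -> float:
    """Convert a 'minutes:seconds' wall-clock string to seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + float(seconds)


# Wall-clock totals reported above for mefetch, keyed by the -proc value.
totals = {1: "12:43.11", 2: "5:17.51", 3: "2:57.01", 4: "1:40.54"}

baseline = to_seconds(totals[1])
speedups = {proc: baseline / to_seconds(clock) for proc, clock in totals.items()}

for proc, speedup in speedups.items():
    print(f"-proc {proc}: {speedup:.2f}x")
```

In these particular runs the measured speedup exceeds the number of processes, which may reflect variation in server load between runs as much as the parallelism itself.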

Results can be returned in the exact order intended by the NCBI server using the -in-order argument. Otherwise, the order is determined by how quickly NCBI returns results to each process.
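The difference between ordered and completion-ordered results can be illustrated with Python's standard library. This is only a sketch of the general idea and is not how mefetch is implemented: it uses threads rather than processes, and fake_fetch is a hypothetical stand-in for a download in which later chunks happen to finish sooner.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(chunk: int) -> int:
    """Stand-in for one download; later chunks finish sooner here."""
    time.sleep(0.05 * (3 - chunk))
    return chunk


chunks = [0, 1, 2]

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() yields results in submission order, whatever the finish order.
    in_order = list(pool.map(fake_fetch, chunks))

    # as_completed() yields each result as soon as its worker finishes.
    futures = [pool.submit(fake_fetch, chunk) for chunk in chunks]
    by_completion = [future.result() for future in as_completed(futures)]

print(in_order)
print(by_completion)
```

Ordered output has to hold back any result that arrives ahead of an earlier one. Completion order can write each result as soon as it arrives.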

The -retmax argument (the chunk size) determines the number of results each process requests at a time. By default it is set to 10,000, the maximum number of records per request according to the NCBI documentation. A -retmax value above 10,000 is automatically reduced to 10,000.
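As a worked example of the chunking arithmetic, the sketch below splits a result set into (retstart, count) pairs with the chunk size capped at 10,000. The function chunk_ranges and the constant MAX_RETMAX are hypothetical names used for illustration. This is not medirect's code, only the arithmetic the paragraph above implies.

```python
MAX_RETMAX = 10_000  # per-request record limit cited above


def chunk_ranges(total: int, retmax: int = MAX_RETMAX) -> list[tuple[int, int]]:
    """Return (retstart, count) pairs that together cover `total` records."""
    retmax = min(retmax, MAX_RETMAX)  # larger values fall back to the limit
    return [(start, min(retmax, total - start)) for start in range(0, total, retmax)]


# The 255,303 Rhizobium accessions from the example above.
ranges = chunk_ranges(255_303)
print(len(ranges))
print(ranges[0], ranges[-1])
```

With the default chunk size, the Rhizobium query is split into 26 requests, the last one covering the remaining 5,303 records.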

By default the -id argument reads the XML output of esearch from stdin. It can also take a comma-delimited list of ids or a text file of ids. When coupled with the -csv argument, the input can be a csv file with additional argument columns. This is useful for bulk downloads where each record needs different positional arguments, such as the seq_start, seq_stop and strand columns produced by ftract.
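A csv input of this kind can also be written by hand. The sketch below builds one with Python's csv module, using the column names that appear in the ftract output elsewhere in this README (id, seq_start, seq_stop, strand). Which other columns mefetch accepts is not stated here, so treat the layout as an assumption based on that output.

```python
import csv
import io

# Regions to request; the column names mirror the ftract output in this README.
regions = [
    {"id": "KN150849.1", "seq_start": 594136, "seq_stop": 595654, "strand": 2},
    {"id": "KN150849.1", "seq_start": 2227751, "seq_stop": 2229271, "strand": 1},
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["id", "seq_start", "seq_stop", "strand"],
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(regions)

csv_text = buffer.getvalue()
print(csv_text, end="")
```

The resulting text could be saved to a file or piped to a command that reads csv from stdin.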

ftract writes csv output for selected features from NCBI feature tables. The required --feature argument takes one or more comma-separated patterns of the form feature_key:qualifier_key:qualifier_value:

% mefetch -id KN150849 -db nucleotide -email user@ema.il -format ft | ftract --feature rrna:product:16s
id,seq_start,seq_stop,strand
KN150849.1,594136,595654,2
KN150849.1,807985,809503,2
KN150849.1,2227751,2229271,1
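To show the kind of matching involved, here is a minimal sketch of a feature table parser. It is not ftract's implementation. It assumes the NCBI layout in which a feature line holds start, stop and feature key, and each qualifier line is indented by three tabs before its key and value. The case-insensitive substring matching and the strand rule (2 when start is greater than stop) are also assumptions, and the sample table and the accession EXAMPLE1.1 are invented for illustration.

```python
import csv
import sys

# Invented sample data in the assumed feature table layout.
SAMPLE = (
    ">Feature ref|EXAMPLE1.1|\n"
    "1520\t1\trRNA\n"
    "\t\t\tproduct\t16S ribosomal RNA\n"
    "2000\t2900\tCDS\n"
    "\t\t\tproduct\thypothetical protein\n"
    "3000\t4500\trRNA\n"
    "\t\t\tproduct\t16S ribosomal RNA\n"
)


def match_features(table, feature, qualifier, value):
    """Yield (id, seq_start, seq_stop, strand) for each matching feature."""
    seq_id = None
    current = None  # (start, stop) of the feature being read, if its key matches
    for line in table.splitlines():
        cols = line.split("\t")
        if line.startswith(">Feature"):
            seq_id = line.split("|")[1]
            current = None
        elif len(cols) >= 3 and cols[0] and cols[2]:
            # Feature line: start, stop, feature key ('<' and '>' mark partials).
            if cols[2].lower() == feature.lower():
                current = (int(cols[0].lstrip("<>")), int(cols[1].lstrip("<>")))
            else:
                current = None
        elif len(cols) >= 5 and not cols[0] and current:
            # Qualifier line: three empty columns, then key and value.
            if cols[3].lower() == qualifier.lower() and value.lower() in cols[4].lower():
                start, stop = current
                strand = 1 if start <= stop else 2
                yield seq_id, min(start, stop), max(start, stop), strand
                current = None  # report each feature once


rows = list(match_features(SAMPLE, "rrna", "product", "16s"))
writer = csv.writer(sys.stdout, lineterminator="\n")
writer.writerow(["id", "seq_start", "seq_stop", "strand"])
writer.writerows(rows)
```

The sketch reads only the first interval of each feature. Extra interval lines of a multi-interval feature are skipped.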

Pipe this output back into mefetch to download these three regions in GenBank format:

% mefetch -id KN150849 -db nucleotide -email user@ema.il -format ft | ftract --feature rrna:product:16s | mefetch -db nucleotide -email user@ema.il -csv -format gb

Finally, combining all these concepts, return all the Burkholderia gladioli 16S rRNA products in fasta format using the default -proc 3:

% esearch -query 'Burkholderia gladioli AND sequence_from_type[Filter]' -db 'nucleotide' | mefetch -email user@ema.il -format ft | ftract --feature rrna:product:16s | mefetch -db nucleotide -email user@ema.il -csv -format fasta
0.24s user 0.05s system 1% cpu 18.596 total

issues

Please use the issue trackers available on GitHub or Bitbucket to report any bugs or feature requests. For all other inquiries, email Chris Rosenthal.

license

Released under the GPLv3 License

Project details

Release history

Version	Release date
0.34.0	Apr 11, 2025 (this version)
0.33.0	Apr 11, 2025
0.32.0	Feb 5, 2025
0.31.0	Dec 31, 2024
0.30.0	Aug 22, 2024
0.29.0	Feb 1, 2024
0.28.0	Feb 1, 2024
0.27.0	Jan 9, 2024
0.26.0	Nov 29, 2023
0.25.0	Nov 29, 2023
0.24.0	Nov 27, 2023
0.23.0	Jul 5, 2023
0.22.0	Jan 25, 2023
0.21.0	Oct 15, 2022
0.20.0	Oct 11, 2022
0.19.0	Sep 14, 2021
0.18.0	Aug 10, 2020
0.17.0	Aug 7, 2020
0.16.0	Jul 31, 2020
0.14.0	May 21, 2019
0.13.0	Apr 30, 2019
0.12.0	Mar 20, 2019
0.11.0	Dec 21, 2018
0.10.0	Dec 21, 2018
0.9.0	May 7, 2018
0.8.0	Jan 8, 2018
0.7.0	Jan 4, 2018
0.6.0	Jan 4, 2018
0.5.0	Nov 28, 2017
0.4.0	Jul 12, 2017
0.3.0	Feb 28, 2017
0.2.0	Feb 27, 2017
0.1.0	Jan 10, 2017

Download files

Source Distribution

medirect-0.34.0.tar.gz (55.0 kB view details)

Uploaded Apr 11, 2025 Source

Built Distribution

medirect-0.34.0-py3-none-any.whl (37.0 kB view details)

Uploaded Apr 11, 2025 Python 3

File details

Details for the file medirect-0.34.0.tar.gz.

File metadata

Download URL: medirect-0.34.0.tar.gz
Upload date: Apr 11, 2025
Size: 55.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.11.10

File hashes

Hashes for medirect-0.34.0.tar.gz
Algorithm	Hash digest
SHA256	`59e76bf4e361e11b06c1a78000dda1bafcc2952afd6a162ca5983d8bc5ac0597`
MD5	`76b5e5cca4de56dc79896d5a72c104a9`
BLAKE2b-256	`167cbee3655ec7df5af3a52d8a2082ff8888824541fa24772ccea7f2821a8085`

File details

Details for the file medirect-0.34.0-py3-none-any.whl.

File metadata

Download URL: medirect-0.34.0-py3-none-any.whl
Upload date: Apr 11, 2025
Size: 37.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.11.10

File hashes

Hashes for medirect-0.34.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7c91c217b1e696f99556f4d86a8568efb98ca7ce0dd0cfe4f1a595d48499632e`
MD5	`9ec1f82570b9aa3cf2e220d5e57b1b6c`
BLAKE2b-256	`9c95d13051d13323bac38b63d8a1fbdd120d85c03425a28018c467f744993bbf`
