multiprocessed ncbi edirect and ftract
Project description
medirect - a multiprocessed utility for retrieving records and parsing feature tables from ncbi
about
As a bioinformatician I build a lot of nucleic acid reference databases for bacteria, and I created this package to help do that quickly. The utilities in this package (mefetch and ftract) can be slotted in alongside other ncbi utilities and follow the same edirect documentation, guidelines, requirements and usage policies.
For large data requests I highlight two points from the usage policy:
Run retrieval scripts on weekends or between 9 pm and 5 am Eastern Time weekdays for any series of more than 100 requests.
Make no more than 3 requests every 1 second.
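The two policy points above can be turned into a small pre-flight check. This is a minimal sketch, not part of medirect: the function names are hypothetical, the datetime passed in is assumed to already be expressed in Eastern Time, and the per-process interval assumes every process issues requests at a steady pace.

```python
from datetime import datetime

MAX_REQUESTS_PER_SECOND = 3  # limit quoted from the usage policy


def is_off_peak(eastern_time):
    """Return True on weekends or between 9 pm and 5 am.

    `eastern_time` is assumed to already be in Eastern Time.
    """
    if eastern_time.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return True
    return eastern_time.hour >= 21 or eastern_time.hour < 5


def per_process_interval(procs, limit=MAX_REQUESTS_PER_SECOND):
    """Seconds each process should wait between its own requests so
    that all processes together stay within `limit` per second."""
    if procs < 1:
        raise ValueError("procs must be at least 1")
    return procs / limit


if __name__ == "__main__":
    now = datetime(2024, 1, 3, 22, 0)  # a Wednesday, 10 pm
    print(is_off_peak(now), per_process_interval(4))
```

With four processes, for example, each process would need to space its own requests at least 4/3 of a second apart to keep the combined rate at three per second.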
Some additional documentation for using ftract:
edirect ftp downloads
dependencies
installation
medirect can be installed in two ways:
For regular users:
% pip3 install medirect
For developers:
% pip3 install git://github.com/crosenth/medirect.git
# or
% git clone git://github.com/crosenth/medirect.git
% cd medirect
% python3 setup.py install
examples
The mefetch executable works exactly like edirect efetch, with an additional multiprocessing argument, -proc, and a few more features.
By allowing additional processes to download records, the -proc argument gives a roughly linear increase in download speed for large datasets.
For example:
% esearch -db nucleotide -query 'Rhizobium' | time mefetch -email user@ema.il -mode text -format acc -proc 1 > accessions.txt
0.53s user 0.11s system 0% cpu 11:43.11 total
Which is equivalent to ncbi efetch:
% esearch -db nucleotide -query 'Rhizobium' | time efetch -mode text -format acc > accessions.txt
0.53s user 0.11s system 0% cpu 12:47.54 total
Adding another process with -proc 2:
% esearch -db nucleotide -query 'Rhizobium' | time mefetch -email user@ema.il -proc 2 -mode text -format acc > accessions.txt
0.46s user 0.08s system 0% cpu 5:17.51 total
And another -proc 3 (default):
% esearch -db nucleotide -query 'Rhizobium' | time mefetch -email user@ema.il -proc 3 -mode text -format acc > accessions.txt
0.35s user 0.10s system 0% cpu 2:57.01 total
And -proc 4 (see usage policy):
% esearch -db nucleotide -query 'Rhizobium' | time mefetch -email user@ema.il -proc 4 -mode text -format acc > accessions.txt
0.35s user 0.08s system 0% cpu 1:40.54 total
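To make the timings above easier to compare, the wall-clock totals can be converted to seconds and expressed as a speedup over the -proc 1 run. This is only arithmetic on the figures quoted above. Each figure is a single run, so the ratios are indicative rather than a benchmark, and the helper names are made up for this sketch.

```python
# Wall-clock totals copied from the mefetch runs above, keyed by -proc value.
TIMINGS = {1: "11:43.11", 2: "5:17.51", 3: "2:57.01", 4: "1:40.54"}


def to_seconds(clock):
    """Convert a 'minutes:seconds' string such as '5:17.51' to seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + float(seconds)


def speedups(timings):
    """Return the speedup of each run relative to the -proc 1 run."""
    baseline = to_seconds(timings[1])
    return {procs: baseline / to_seconds(clock)
            for procs, clock in timings.items()}


if __name__ == "__main__":
    for procs, factor in speedups(TIMINGS).items():
        print(f"-proc {procs}: {factor:.2f}x")
```

On these particular runs the ratios come out at roughly 2.2x, 4.0x and 7.0x for two, three and four processes, which may partly reflect variation in server load between runs.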
Results can be returned in the same order as efetch using the -in-order argument. Otherwise, the order will be determined by how fast ncbi returns results per process.
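The difference between ordered and unordered results can be illustrated with Python's standard library. This is a sketch of the general idea only, not medirect's implementation: it uses threads rather than processes to stay small, and fake_fetch is a hypothetical stand-in for a download request.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(chunk_index):
    """Hypothetical stand-in for one download; sleeps a random short time."""
    time.sleep(random.uniform(0.0, 0.05))
    return chunk_index


def fetch_in_order(chunk_indexes, workers=3):
    # map() yields results in submission order, even if a later
    # chunk finishes downloading before an earlier one.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_fetch, chunk_indexes))


def fetch_as_completed(chunk_indexes, workers=3):
    # as_completed() yields each result as soon as it is ready,
    # so the order depends on how fast each request returns.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fake_fetch, i) for i in chunk_indexes]
        return [future.result() for future in as_completed(futures)]


if __name__ == "__main__":
    print(fetch_in_order(range(6)))
    print(fetch_as_completed(range(6)))
```

The ordered variant may hold finished chunks back while it waits for an earlier one, which is the usual trade-off for a predictable output order.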
The -retmax argument (or chunksize) determines the number of results returned per -proc. By default it is set to 10,000, the maximum number of records allowed per the documentation. A -retmax higher than 10,000 is automatically set back down to 10,000.
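As an illustration of how a chunk size divides up a large result set, the sketch below splits a record count into (retstart, retmax) pairs, the offset and size parameters used by the E-utilities. The function name is hypothetical and the clamping simply mirrors the 10,000 cap described above; how mefetch assigns chunks to processes internally is not shown here.

```python
MAX_RETMAX = 10_000  # cap described in the text above


def chunk_offsets(total_records, retmax=MAX_RETMAX):
    """Return a list of (retstart, retmax) pairs covering total_records.

    A retmax above MAX_RETMAX is clamped back down to MAX_RETMAX.
    """
    if retmax < 1:
        raise ValueError("retmax must be at least 1")
    retmax = min(retmax, MAX_RETMAX)
    return [(start, min(retmax, total_records - start))
            for start in range(0, total_records, retmax)]


if __name__ == "__main__":
    # 25,000 records with an oversized retmax still yields 10,000-record chunks.
    print(chunk_offsets(25_000, retmax=50_000))
```

Each pair could then be handed to a separate worker, with the last chunk covering whatever remains.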
By default, -id reads the xml output of esearch from stdin. The -id argument can also take a comma-delimited list of ids or a text file of ids. When coupled with the -csv argument, the input can be a csv file with additional argument columns. This is useful for bulk downloads with different positional arguments.
ftract produces csv output of selected features from ncbi feature tables. The required -feature argument takes comma-separated patterns of the form feature_key:qualifier_key:qualifier_value
% mefetch -id KN150849 -db nucleotide -email user@ema.il -format ft | ftract --feature rrna::16s
id,seq_start,seq_stop,strand
KN150849.1,594136,595654,2
KN150849.1,807985,809503,2
KN150849.1,2227751,2229271,1
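The csv rows in the example above are easy to post-process. The following sketch parses them with Python's csv module and adds the length of each region. The rows are copied from the example output, the helper name is hypothetical, and the length calculation assumes seq_start and seq_stop are inclusive 1-based coordinates.

```python
import csv
import io

# Rows copied from the ftract example output above.
FTRACT_OUTPUT = """\
id,seq_start,seq_stop,strand
KN150849.1,594136,595654,2
KN150849.1,807985,809503,2
KN150849.1,2227751,2229271,1
"""


def parse_regions(text):
    """Parse ftract-style csv text into a list of dicts with a length field."""
    regions = []
    for row in csv.DictReader(io.StringIO(text)):
        start, stop = int(row["seq_start"]), int(row["seq_stop"])
        regions.append({
            "id": row["id"],
            "seq_start": start,
            "seq_stop": stop,
            "strand": row["strand"],
            # Assumes inclusive coordinates, hence the +1.
            "length": stop - start + 1,
        })
    return regions


if __name__ == "__main__":
    for region in parse_regions(FTRACT_OUTPUT):
        print(region["id"], region["strand"], region["length"])
```

In the E-utilities convention a strand value of 1 denotes the plus strand and 2 the minus strand, so two of the three regions here lie on the minus strand.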
And pipe this back into mefetch to download these three regions in genbank format:
% mefetch -id KN150849 -db nucleotide -email user@ema.il -format ft | ftract --feature rrna:product:16s | mefetch -db nucleotide -email user@ema.il -csv -format gb
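To show what a per-row request might look like, the sketch below turns one csv row into an E-utilities efetch URL carrying the seq_start, seq_stop and strand parameters. It assumes the csv columns map directly onto efetch parameters of the same name; whether mefetch builds its requests this way is an assumption, and the function name is made up for illustration.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(row, db="nucleotide", rettype="gb", retmode="text",
               email="user@ema.il"):
    """Build an efetch URL for one csv row of id,seq_start,seq_stop,strand.

    Assumption: each csv column is passed through as an efetch parameter.
    """
    params = {"db": db, "rettype": rettype, "retmode": retmode, "email": email}
    params.update(row)
    return EFETCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    row = {"id": "KN150849.1", "seq_start": "594136",
           "seq_stop": "595654", "strand": "2"}
    print(efetch_url(row))
```

One such URL per row would retrieve only the requested region rather than the whole record.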
And finally, return all the Burkholderia gladioli 16s rrna products in fasta format like this:
% esearch -query 'Burkholderia gladioli' -db 'nucleotide' | mefetch -email user@ema.il -format ft | ftract --feature rrna:product:16s | mefetch -db nucleotide -email user@ema.il -csv -format fasta
issues
Please use the Issue Tracker(s) available on Github or Bitbucket to report any bugs or feature requests. For all other inquiries email Chris Rosenthal.
license
Copyright (c) 2016 Chris Rosenthal
Released under the GPLv3 License
Source Distribution
File details
Details for the file medirect-0.1.0.tar.gz.
File metadata
- Download URL: medirect-0.1.0.tar.gz
- Upload date:
- Size: 7.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2007a1c6783e6fce91112e6015b3761755ea433e6207b6e11a977939191cce28
MD5 | 4ff945a1ec3a7a7af49af5bcb0685332
BLAKE2b-256 | 60237f6d192e398c2c4878296d99ead6c7eea632ea86278569f7928b8b400054