A Python package for interacting with SRAdb and downloading datasets from SRA/ENA/GEO
A Python package for retrieving metadata and downloading datasets from SRA/ENA/GEO
Documentation
CLI Usage
pysradb supports command line usage. See CLI instructions or quickstart guide.
$ pysradb
usage: pysradb [-h] [--version] [--citation]
{metadata,download,search,gse-to-gsm,gse-to-srp,gsm-to-gse,gsm-to-srp,gsm-to-srr,gsm-to-srs,gsm-to-srx,srp-to-gse,srp-to-srr,srp-to-srs,srp-to-srx,srr-to-gsm,srr-to-srp,srr-to-srs,srr-to-srx,srs-to-gsm,srs-to-srx,srx-to-srp,srx-to-srr,srx-to-srs}
...
pysradb: Query NGS metadata and data from NCBI Sequence Read Archive.
version: 1.0.1
Citation: 10.12688/f1000research.18676.1
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
--citation how to cite
subcommands:
{metadata,download,search,gse-to-gsm,gse-to-srp,gsm-to-gse,gsm-to-srp,gsm-to-srr,gsm-to-srs,gsm-to-srx,srp-to-gse,srp-to-srr,srp-to-srs,srp-to-srx,srr-to-gsm,srr-to-srp,srr-to-srs,srr-to-srx,srs-to-gsm,srs-to-srx,srx-to-srp,srx-to-srr,srx-to-srs}
metadata Fetch metadata for SRA project (SRPnnnn)
download Download SRA project (SRPnnnn)
search Search SRA for matching text
gse-to-gsm Get GSM for a GSE
gse-to-srp Get SRP for a GSE
gsm-to-gse Get GSE for a GSM
gsm-to-srp Get SRP for a GSM
gsm-to-srr Get SRR for a GSM
gsm-to-srs Get SRS for a GSM
gsm-to-srx Get SRX for a GSM
srp-to-gse Get GSE for a SRP
srp-to-srr Get SRR for a SRP
srp-to-srs Get SRS for a SRP
srp-to-srx Get SRX for a SRP
srr-to-gsm Get GSM for a SRR
srr-to-srp Get SRP for a SRR
srr-to-srs Get SRS for a SRR
srr-to-srx Get SRX for a SRR
srs-to-gsm Get GSM for a SRS
srs-to-srx Get SRX for a SRS
srx-to-srp Get SRP for a SRX
srx-to-srr Get SRR for a SRX
srx-to-srs Get SRS for a SRX
Quickstart
A Google Colaboratory version of the most used commands is available in this Colab Notebook. Note that this requires only an active internet connection (no additional downloads are made).
The following notebooks document all the possible features of pysradb:
Installation
To install the stable version using pip:
pip install pysradb
Alternatively, if you use conda:
conda install -c bioconda pysradb
This step will install all the dependencies. If you have an existing environment with a lot of pre-installed packages, conda might be slow. Please consider creating a new environment for pysradb:
conda create -c bioconda -n pysradb python=3.7 pysradb
Dependencies
pandas
requests
tqdm
xmltodict
Installing pysradb in development mode
git clone https://github.com/saketkc/pysradb.git
cd pysradb && pip install -r requirements.txt
pip install -e .
Using pysradb
Obtaining SRA metadata
$ pysradb metadata SRP000941 | head
study_accession experiment_accession experiment_title experiment_desc organism_taxid organism_name library_strategy library_source library_selection sample_accession sample_title instrument total_spots total_size run_accession run_total_spots run_total_bases
SRP000941 SRX056722 Reference Epigenome: ChIP-Seq Analysis of H3K27ac in hESC H1 Cells Reference Epigenome: ChIP-Seq Analysis of H3K27ac in hESC H1 Cells 9606 Homo sapiens ChIP-Seq GENOMIC ChIP SRS184466 Illumina HiSeq 2000 26900401 531654480 SRR179707 26900401 807012030
SRP000941 SRX027889 Reference Epigenome: ChIP-Seq Analysis of H2AK5ac in hESC Cells Reference Epigenome: ChIP-Seq Analysis of H2AK5ac in hESC Cells 9606 Homo sapiens ChIP-Seq GENOMIC ChIP SRS116481 Illumina Genome Analyzer II 37528590 779578968 SRR067978 37528590 1351029240
SRP000941 SRX027888 Reference Epigenome: ChIP-Seq Input from hESC H1 Cells Reference Epigenome: ChIP-Seq Input from hESC H1 Cells 9606 Homo sapiens ChIP-Seq GENOMIC RANDOM SRS116483 Illumina Genome Analyzer II 13603127 3232309537 SRR067977 13603127 489712572
SRP000941 SRX027887 Reference Epigenome: ChIP-Seq Input from hESC H1 Cells Reference Epigenome: ChIP-Seq Input from hESC H1 Cells 9606 Homo sapiens ChIP-Seq GENOMIC RANDOM SRS116562 Illumina Genome Analyzer II 22430523 506327844 SRR067976 22430523 807498828
SRP000941 SRX027886 Reference Epigenome: ChIP-Seq Input from hESC H1 Cells Reference Epigenome: ChIP-Seq Input from hESC H1 Cells 9606 Homo sapiens ChIP-Seq GENOMIC RANDOM SRS116560 Illumina Genome Analyzer II 15342951 301720436 SRR067975 15342951 552346236
SRP000941 SRX027885 Reference Epigenome: ChIP-Seq Input from hESC H1 Cells Reference Epigenome: ChIP-Seq Input from hESC H1 Cells 9606 Homo sapiens ChIP-Seq GENOMIC RANDOM SRS116482 Illumina Genome Analyzer II 39725232 851429082 SRR067974 39725232 1430108352
SRP000941 SRX027884 Reference Epigenome: ChIP-Seq Input from hESC H1 Cells Reference Epigenome: ChIP-Seq Input from hESC H1 Cells 9606 Homo sapiens ChIP-Seq GENOMIC RANDOM SRS116481 Illumina Genome Analyzer II 32633277 544478483 SRR067973 32633277 1174797972
SRP000941 SRX027883 Reference Epigenome: ChIP-Seq Input from hESC H1 Cells Reference Epigenome: ChIP-Seq Input from hESC H1 Cells 9606 Homo sapiens ChIP-Seq GENOMIC RANDOM SRS004118 Illumina Genome Analyzer II 22150965 3262293717 SRR067972 9357767 336879612
SRP000941 SRX027883 Reference Epigenome: ChIP-Seq Input from hESC H1 Cells Reference Epigenome: ChIP-Seq Input from hESC H1 Cells 9606 Homo sapiens ChIP-Seq GENOMIC RANDOM SRS004118 Illumina Genome Analyzer II 22150965 3262293717 SRR067971 12793198 460555128
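The last five columns of the metadata output are numeric or accession fields without spaces, so they can be read from the right even though the title columns contain spaces. The following is a minimal sketch, not part of pysradb: the helper name `bases_per_spot` is hypothetical, and it assumes each record is a single line of whitespace-separated text as printed above. It divides run_total_bases by run_total_spots, which gives a rough bases-per-spot figure for a run.

```python
from typing import Tuple


def bases_per_spot(row: str) -> Tuple[str, float]:
    """Return (run_accession, run_total_bases / run_total_spots) for one record."""
    # Reading from the right, the last five fields are:
    # total_spots, total_size, run_accession, run_total_spots, run_total_bases.
    _rest, _total_spots, _total_size, run, spots, bases = row.rsplit(None, 5)
    return run, int(bases) / int(spots)


# First record of the SRP000941 output shown above.
row = (
    "SRP000941 SRX056722 "
    "Reference Epigenome: ChIP-Seq Analysis of H3K27ac in hESC H1 Cells "
    "Reference Epigenome: ChIP-Seq Analysis of H3K27ac in hESC H1 Cells "
    "9606 Homo sapiens ChIP-Seq GENOMIC ChIP SRS184466 Illumina HiSeq 2000 "
    "26900401 531654480 SRR179707 26900401 807012030"
)

run, ratio = bases_per_spot(row)
print(run, ratio)  # SRR179707 30.0
```

Splitting from the right avoids guessing where the free-text title and description columns end.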
Obtaining detailed SRA metadata
$ pysradb metadata SRP075720 --detailed | head
study_accession experiment_accession experiment_title experiment_desc organism_taxid organism_name library_strategy library_source library_selection sample_accession sample_title instrument total_spots total_size run_accession run_total_spots run_total_bases
SRP075720 SRX1800476 GSM2177569: Kcng4_2la_H9; Mus musculus; RNA-Seq GSM2177569: Kcng4_2la_H9; Mus musculus; RNA-Seq 10090 Mus musculus RNA-Seq TRANSCRIPTOMIC cDNA SRS1467643 Illumina HiSeq 2500 2547148 97658407 SRR3587912 2547148 127357400
SRP075720 SRX1800475 GSM2177568: Kcng4_2la_H8; Mus musculus; RNA-Seq GSM2177568: Kcng4_2la_H8; Mus musculus; RNA-Seq 10090 Mus musculus RNA-Seq TRANSCRIPTOMIC cDNA SRS1467642 Illumina HiSeq 2500 2676053 101904264 SRR3587911 2676053 133802650
SRP075720 SRX1800474 GSM2177567: Kcng4_2la_H7; Mus musculus; RNA-Seq GSM2177567: Kcng4_2la_H7; Mus musculus; RNA-Seq 10090 Mus musculus RNA-Seq TRANSCRIPTOMIC cDNA SRS1467641 Illumina HiSeq 2500 1603567 61729014 SRR3587910 1603567 80178350
SRP075720 SRX1800473 GSM2177566: Kcng4_2la_H6; Mus musculus; RNA-Seq GSM2177566: Kcng4_2la_H6; Mus musculus; RNA-Seq 10090 Mus musculus RNA-Seq TRANSCRIPTOMIC cDNA SRS1467640 Illumina HiSeq 2500 2498920 94977329 SRR3587909 2498920 124946000
SRP075720 SRX1800472 GSM2177565: Kcng4_2la_H5; Mus musculus; RNA-Seq GSM2177565: Kcng4_2la_H5; Mus musculus; RNA-Seq 10090 Mus musculus RNA-Seq TRANSCRIPTOMIC cDNA SRS1467639 Illumina HiSeq 2500 2226670 83473957 SRR3587908 2226670 111333500
SRP075720 SRX1800471 GSM2177564: Kcng4_2la_H4; Mus musculus; RNA-Seq GSM2177564: Kcng4_2la_H4; Mus musculus; RNA-Seq 10090 Mus musculus RNA-Seq TRANSCRIPTOMIC cDNA SRS1467638 Illumina HiSeq 2500 2269546 87486278 SRR3587907 2269546 113477300
SRP075720 SRX1800470 GSM2177563: Kcng4_2la_H3; Mus musculus; RNA-Seq GSM2177563: Kcng4_2la_H3; Mus musculus; RNA-Seq 10090 Mus musculus RNA-Seq TRANSCRIPTOMIC cDNA SRS1467636 Illumina HiSeq 2500 2333284 88669838 SRR3587906 2333284 116664200
SRP075720 SRX1800469 GSM2177562: Kcng4_2la_H2; Mus musculus; RNA-Seq GSM2177562: Kcng4_2la_H2; Mus musculus; RNA-Seq 10090 Mus musculus RNA-Seq TRANSCRIPTOMIC cDNA SRS1467637 Illumina HiSeq 2500 2071159 79689296 SRR3587905 2071159 103557950
SRP075720 SRX1800468 GSM2177561: Kcng4_2la_H1; Mus musculus; RNA-Seq GSM2177561: Kcng4_2la_H1; Mus musculus; RNA-Seq 10090 Mus musculus RNA-Seq TRANSCRIPTOMIC cDNA SRS1467635 Illumina HiSeq 2500 2321657 89307894 SRR3587904 2321657 116082850
Converting SRP to GSE
$ pysradb srp-to-gse SRP075720
study_accession study_alias
SRP075720 GSE81903
Converting GSM to SRP
$ pysradb gsm-to-srp GSM2177186
experiment_alias study_accession
GSM2177186 SRP075720
Converting GSM to GSE
$ pysradb gsm-to-gse GSM2177186
experiment_alias study_alias
GSM2177186 GSE81903
Converting GSM to SRX
$ pysradb gsm-to-srx GSM2177186
experiment_alias experiment_accession
GSM2177186 SRX1800089
Converting GSM to SRR
$ pysradb gsm-to-srr GSM2177186
experiment_alias run_accession
GSM2177186 SRR3587529
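The conversion subcommands all print a header followed by records whose first two columns hold the source and target accessions. The following is a minimal sketch for turning captured output into a lookup table; the helper `parse_conversion` is hypothetical and not part of pysradb, and it assumes that the header and each record arrive on separate lines and that accessions contain no whitespace.

```python
from typing import Dict, List


def parse_conversion(output: str) -> Dict[str, List[str]]:
    """Map each source accession to the list of target accessions."""
    lines = [line for line in output.splitlines() if line.strip()]
    mapping: Dict[str, List[str]] = {}
    for line in lines[1:]:  # the first non-empty line is the header
        # Only the first two columns are used; any extra columns are ignored.
        source, target = line.split()[:2]
        mapping.setdefault(source, []).append(target)
    return mapping


# Output of `pysradb gsm-to-srr GSM2177186` as shown above.
sample = "experiment_alias run_accession\nGSM2177186 SRR3587529\n"
print(parse_conversion(sample))  # {'GSM2177186': ['SRR3587529']}
```

Values are lists because one accession can map to several targets, for example a project with many runs.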
Downloading supplementary files from GEO
$ pysradb download -g GSE161707
Downloading an entire SRA/ENA project (multithreaded)
pysradb makes it easy to download datasets from SRA in parallel. Using 8 threads to download:
$ pysradb download -y -t 8 --out-dir ./pysradb_downloads -p SRP063852
Downloads are organized by SRP/SRX/SRR, mimicking the hierarchy of SRA projects.
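Given that layout, downloaded files can be grouped by project and experiment with a short directory walk. This is a minimal sketch under assumptions: the helper `files_by_experiment` is hypothetical, the text above does not say whether the SRR level is a directory or a file name, and the SRX/SRR names and the `.sra` suffix in the demo are placeholders rather than real accessions.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def files_by_experiment(out_dir):
    """Group files as {(SRP, SRX): [paths relative to the SRX directory]}."""
    root = Path(out_dir)
    grouped = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        parts = path.relative_to(root).parts
        if len(parts) >= 3:  # SRP / SRX / something belonging to an SRR
            grouped[(parts[0], parts[1])].append("/".join(parts[2:]))
    return dict(grouped)


# Demo on a throwaway tree with placeholder SRX/SRR names.
with tempfile.TemporaryDirectory() as tmp:
    run_file = Path(tmp) / "SRP063852" / "SRX0000001" / "SRR0000001.sra"
    run_file.parent.mkdir(parents=True)
    run_file.touch()
    print(files_by_experiment(tmp))
```

The walk accepts any depth below the SRX level, so it works whether runs are stored as files or as per-run directories.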
Downloading only certain samples of interest
$ pysradb metadata SRP000941 --detailed | grep 'study\|RNA-Seq' | pysradb download
This will download all RNA-seq samples coming from this project.
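The grep step keeps the header line (it contains "study") plus every record mentioning RNA-Seq, so that the piped download still receives column names. A rough Python equivalent of that filter is sketched below; `keep_header_and_matches` is a hypothetical helper, not a pysradb function, and the sample records are taken from the metadata outputs shown earlier.

```python
import re
from typing import Iterable, List


def keep_header_and_matches(lines: Iterable[str], pattern: str = r"study|RNA-Seq") -> List[str]:
    # Same idea as the grep call above: keep any line matching either alternative.
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


header = (
    "study_accession experiment_accession experiment_title experiment_desc "
    "organism_taxid organism_name library_strategy library_source library_selection "
    "sample_accession sample_title instrument total_spots total_size "
    "run_accession run_total_spots run_total_bases"
)
rna_row = (
    "SRP075720 SRX1800476 GSM2177569: Kcng4_2la_H9; Mus musculus; RNA-Seq "
    "GSM2177569: Kcng4_2la_H9; Mus musculus; RNA-Seq 10090 Mus musculus RNA-Seq "
    "TRANSCRIPTOMIC cDNA SRS1467643 Illumina HiSeq 2500 2547148 97658407 "
    "SRR3587912 2547148 127357400"
)
chip_row = (
    "SRP000941 SRX056722 Reference Epigenome: ChIP-Seq Analysis of H3K27ac in hESC H1 Cells "
    "Reference Epigenome: ChIP-Seq Analysis of H3K27ac in hESC H1 Cells 9606 Homo sapiens "
    "ChIP-Seq GENOMIC ChIP SRS184466 Illumina HiSeq 2000 26900401 531654480 "
    "SRR179707 26900401 807012030"
)

kept = keep_header_and_matches([header, rna_row, chip_row])
print(len(kept))  # 2
```

Note that the pattern matches anywhere in a line, so a ChIP-Seq record whose title happened to mention "study" would also be kept.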
Ultrafast fastq downloads
With aspera-client installed, pysradb can perform ultrafast downloads.
To download all original fastqs using aspera-client and 8 threads:
$ pysradb download -t 8 --use_ascp -p SRP002605
Refer to the notebook for (shallow) time benchmarks.
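When scripting several downloads, it can help to assemble the command line in one place. The sketch below only builds the argument list from flags that appear in the examples above (`-y`, `-t`, `--out-dir`, `--use_ascp`, `-p`); the function `download_command` and its parameter names are hypothetical, and `-y` is assumed to skip the confirmation prompt mentioned in the 0.8.0 release notes.

```python
import shlex
from typing import List, Optional


def download_command(
    project: str,
    threads: int = 1,
    out_dir: Optional[str] = None,
    use_ascp: bool = False,
    skip_confirmation: bool = False,
) -> List[str]:
    """Build an argument list for `pysradb download` without running it."""
    cmd = ["pysradb", "download"]
    if skip_confirmation:
        cmd.append("-y")  # assumption: answers the confirmation prompt
    cmd += ["-t", str(threads)]
    if out_dir is not None:
        cmd += ["--out-dir", out_dir]
    if use_ascp:
        cmd.append("--use_ascp")  # requires aspera-client to be installed
    cmd += ["-p", project]
    return cmd


cmd = download_command("SRP002605", threads=8, use_ascp=True)
print(" ".join(shlex.quote(part) for part in cmd))
# pysradb download -t 8 --use_ascp -p SRP002605
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once pysradb is installed; keeping it as a list avoids shell quoting issues with directory names.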
Publication
Presentation slides from BOSC (ISMB-ECCB) 2019: https://f1000research.com/slides/8-1183
Citation
Choudhary, Saket. “pysradb: A Python Package to Query next-Generation Sequencing Metadata and Data from NCBI Sequence Read Archive.” F1000Research, vol. 8, F1000 (Faculty of 1000 Ltd), Apr. 2019, p. 532 (https://f1000research.com/articles/8-532/v1)
@article{Choudhary2019,
doi = {10.12688/f1000research.18676.1},
url = {https://doi.org/10.12688/f1000research.18676.1},
year = {2019},
month = apr,
publisher = {F1000 (Faculty of 1000 Ltd)},
volume = {8},
pages = {532},
author = {Saket Choudhary},
title = {pysradb: A {P}ython package to query next-generation sequencing metadata and data from {NCBI} {S}equence {R}ead {A}rchive},
journal = {F1000Research}
}
Zenodo archive: https://zenodo.org/badge/latestdoi/159590788
Zenodo DOI: 10.5281/zenodo.2306881
Questions?
Open an issue or join our Slack Channel.
History
1.1.1 (01-10-2022)
Do not exit if a query returns no hits (#149: https://github.com/saketkc/pysradb/pull/149)
1.1.0 (12-12-2021)
1.0.1 (01-10-2021)
Dropped support for Python 3.6, since pandas 1.2 does not support it
1.0.0 (01-09-2021)
0.11.1 (09-18-2020)
library_layout is now outputted in metadata #56
--detailed unifies columns for ENA fastq links instead of appending _x/_y #59
bugfix for parsing namespace in xml outputs #65
XML errors from NCBI are now handled more gracefully #69
Documentation and dependency updates
0.11.0 (09-04-2020)
pysradb download now supports multiple threads for parallel downloads
pysradb download also supports ultra fast downloads of FASTQs from ENA using aspera-client
0.10.3 (03-26-2020)
Added test cases for SRAweb
API limit exceeding errors are automagically handled
Bug fixes for GSE <=> SRR
Bug fix for metadata - supports multiple SRPs
Contributors
Dibya Gautam
Marius van den Beek
0.10.2 (02-05-2020)
Bug fix: Handle API-rate limit exceeding => Retries
Enhancement: ‘Alternatives’ URLs are now part of --detailed
0.10.1 (02-04-2020)
Bug fix: Handle Python3.6 for capture_output in subprocess.run
0.10.0 (01-31-2020)
All the subcommands (srx-to-srr, srx-to-srs) will now print additional columns where the first two columns represent the relevant conversion
Fixed a bug in fetching entries with a single efetch record
0.9.9 (01-15-2020)
Major fix: some SRRs would go missing as the experiment dict was being created only once per SRR (See #15)
Features: More detailed metadata by default in the SRAweb mode
See notebook: https://colab.research.google.com/drive/1C60V-
0.9.7 (01-20-2020)
Feature: instrument, run size and total spots are now printed in the metadata by default (SRAweb mode only)
Issue: Fixed an issue with srapath failing on SRP. srapath is now run on individual SRRs.
0.9.6 (07-20-2019)
Introduced SRAweb to perform queries over the web if the SQLite is missing or does not contain the relevant record.
0.9.0 (02-27-2019)
Others
This release completely changes the command line interface replacing click with argparse (https://github.com/saketkc/pysradb/pull/3)
Removed stale Python 2 compatible code
0.8.0 (02-26-2019)
New methods/functionality
srr-to-gsm: convert SRR to GSM
SRAmetadb.sqlite.gz file is deleted by default after extraction
When SRAmetadb is not found, confirmation is sought before downloading
Confirmation option before SRA downloads
Bugfix
download() works with wget
Others
--out_dir is now --out-dir
0.7.1 (02-18-2019)
Important: Python2 is no longer supported. Please consider moving to Python3.
Bugfix
Included docs in the index which were missed out in the previous release
0.7.0 (02-08-2019)
New methods/functionality
gsm-to-srr: convert GSM to SRR
gsm-to-srx: convert GSM to SRX
gsm-to-gse: convert GSM to GSE
Renamed methods
The following command line options have been renamed and the changes are not compatible with the 0.6.0 release:
sra-metadata -> metadata.
sra-search -> search.
srametadb -> metadb.
0.6.0 (12-25-2018)
Bugfix
Fixed bugs introduced in 0.5.0 with API changes where multiple redundant columns were output in sra-metadata
New methods/functionality
download now allows piped inputs
0.5.0 (12-24-2018)
New methods/functionality
Support for filtering by SRX Id for SRA downloads.
srr_to_srx: Convert SRR to SRX/SRP
srp_to_srx: Convert SRP to SRX
Stripped down sra-metadata to give minimal information
Added --assay, --desc, --detailed flags for sra-metadata
Improved table printing on terminal
0.4.2 (12-16-2018)
Bugfix
Fixed unicode error in tests for Python2
0.4.0 (12-12-2018)
New methods/functionality
Added a new BASEdb class to handle common database connections
Initial support for GEOmetadb through GEOdb class
Initial support for a command line interface:
- download Download SRA project (SRPnnnn)
- gse-metadata Fetch metadata for GEO ID (GSEnnnn)
- gse-to-gsm Get GSM(s) for GSE
- gsm-metadata Fetch metadata for GSM ID (GSMnnnn)
- sra-metadata Fetch metadata for SRA project (SRPnnnn)
Added three separate notebooks for SRAdb, GEOdb, CLI usage
0.3.0 (12-05-2018)
New methods/functionality
sample_attribute and experiment_attribute are now included by default in the df returned by sra_metadata()
expand_sample_attribute_columns: expands the metadata dataframe based on attributes in the sample_attribute column
New methods to guess cell/tissue/strain: guess_cell_type()/guess_tissue_type()/guess_strain_type()
Improved README and usage instructions
0.2.2 (12-03-2018)
New methods/functionality
search_sra() allows full text search on SRA metadata.
0.2.0 (12-03-2018)
Renamed methods
The following methods have been renamed and the changes are not compatible with 0.1.0 release:
get_query() -> query().
sra_convert() -> sra_metadata().
get_table_counts() -> all_row_counts().
New methods/functionality
download_sradb_file() makes fetching SRAmetadb.sqlite file easy; wget is no longer required.
The ftp protocol is now supported besides fasp, and hence aspera-client is now optional. We, however, strongly recommend aspera-client for faster downloads.
Bug fixes
Silenced SettingWithCopyWarning by explicitly doing operations on a copy of the dataframe instead of the original.
Besides these, all methods now follow a numpydoc compatible documentation.
0.1.0 (12-01-2018)
First release on PyPI.