A Python package for interacting with SRAdb and downloading datasets from SRA/ENA/GEO
A Python package for retrieving metadata and downloading datasets from SRA/ENA/GEO
Documentation
CLI Usage
pysradb supports command line usage. See CLI instructions or quickstart guide.
$ pysradb
usage: pysradb [-h] [--version] [--citation]
               {metadata,download,search,gse-to-gsm,gse-to-srp,gsm-to-gse,gsm-to-srp,gsm-to-srr,gsm-to-srs,gsm-to-srx,srp-to-gse,srp-to-srr,srp-to-srs,srp-to-srx,srr-to-gsm,srr-to-srp,srr-to-srs,srr-to-srx,srs-to-gsm,srs-to-srx,srx-to-srp,srx-to-srr,srx-to-srs}
               ...

pysradb: Query NGS metadata and data from NCBI Sequence Read Archive.
version: 2.0
Citation: 10.12688/f1000research.18676.1

optional arguments:
  -h, --help    show this help message and exit
  --version     show program's version number and exit
  --citation    how to cite

subcommands:
  {metadata,download,search,gse-to-gsm,gse-to-srp,gsm-to-gse,gsm-to-srp,gsm-to-srr,gsm-to-srs,gsm-to-srx,srp-to-gse,srp-to-srr,srp-to-srs,srp-to-srx,srr-to-gsm,srr-to-srp,srr-to-srs,srr-to-srx,srs-to-gsm,srs-to-srx,srx-to-srp,srx-to-srr,srx-to-srs}
    metadata    Fetch metadata for SRA project (SRPnnnn)
    download    Download SRA project (SRPnnnn)
    search      Search SRA for matching text
    gse-to-gsm  Get GSM for a GSE
    gse-to-srp  Get SRP for a GSE
    gsm-to-gse  Get GSE for a GSM
    gsm-to-srp  Get SRP for a GSM
    gsm-to-srr  Get SRR for a GSM
    gsm-to-srs  Get SRS for a GSM
    gsm-to-srx  Get SRX for a GSM
    srp-to-gse  Get GSE for a SRP
    srp-to-srr  Get SRR for a SRP
    srp-to-srs  Get SRS for a SRP
    srp-to-srx  Get SRX for a SRP
    srr-to-gsm  Get GSM for a SRR
    srr-to-srp  Get SRP for a SRR
    srr-to-srs  Get SRS for a SRR
    srr-to-srx  Get SRX for a SRR
    srs-to-gsm  Get GSM for a SRS
    srs-to-srx  Get SRX for a SRS
    srx-to-srp  Get SRP for a SRX
    srx-to-srr  Get SRR for a SRX
    srx-to-srs  Get SRS for a SRX
Quickstart
A Google Colaboratory version of the most-used commands is available in this Colab Notebook. Note that this requires only an active internet connection (no additional downloads are made).
The following notebooks document all the possible features of pysradb:
Installation
To install the stable version using pip:
pip install pysradb
Alternatively, if you use conda:
conda install -c bioconda pysradb
This step will install all the dependencies. If you have an existing environment with many pre-installed packages, conda might be slow. Please consider creating a new environment for pysradb:
conda create -c bioconda -n pysradb python=3.7 pysradb
Dependencies
pandas
requests
tqdm
xmltodict
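To check whether the dependencies listed above are already importable in the current environment, something like the following sketch may help. It is not part of pysradb; the helper name `missing_modules` is hypothetical, and it assumes each dependency's import name matches the name in the list.

```python
from importlib.util import find_spec

# Dependencies as listed in this README (assumed to be importable under the same names).
DEPENDENCIES = ["pandas", "requests", "tqdm", "xmltodict"]


def missing_modules(names):
    """Return the names from `names` that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(DEPENDENCIES)
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    else:
        print("All listed dependencies are importable.")
```

An empty result only says the modules can be located; it does not check their versions.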
Installing pysradb in development mode
git clone https://github.com/saketkc/pysradb.git
cd pysradb && pip install -r requirements.txt
pip install -e .
Using pysradb
Obtaining SRA metadata
$ pysradb metadata SRP000941 | head
study_accession  experiment_accession  experiment_title  experiment_desc  organism_taxid  organism_name  library_strategy  library_source  library_selection  sample_accession  sample_title  instrument  total_spots  total_size  run_accession  run_total_spots  run_total_bases
SRP000941  SRX056722  Reference Epigenome: ChIP-Seq Analysis of H3K27ac in hESC H1 Cells  Reference Epigenome: ChIP-Seq Analysis of H3K27ac in hESC H1 Cells  9606  Homo sapiens  ChIP-Seq  GENOMIC  ChIP  SRS184466  Illumina HiSeq 2000  26900401  531654480  SRR179707  26900401  807012030
SRP000941  SRX027889  Reference Epigenome: ChIP-Seq Analysis of H2AK5ac in hESC Cells  Reference Epigenome: ChIP-Seq Analysis of H2AK5ac in hESC Cells  9606  Homo sapiens  ChIP-Seq  GENOMIC  ChIP  SRS116481  Illumina Genome Analyzer II  37528590  779578968  SRR067978  37528590  1351029240
SRP000941  SRX027888  Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606  Homo sapiens  ChIP-Seq  GENOMIC  RANDOM  SRS116483  Illumina Genome Analyzer II  13603127  3232309537  SRR067977  13603127  489712572
SRP000941  SRX027887  Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606  Homo sapiens  ChIP-Seq  GENOMIC  RANDOM  SRS116562  Illumina Genome Analyzer II  22430523  506327844  SRR067976  22430523  807498828
SRP000941  SRX027886  Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606  Homo sapiens  ChIP-Seq  GENOMIC  RANDOM  SRS116560  Illumina Genome Analyzer II  15342951  301720436  SRR067975  15342951  552346236
SRP000941  SRX027885  Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606  Homo sapiens  ChIP-Seq  GENOMIC  RANDOM  SRS116482  Illumina Genome Analyzer II  39725232  851429082  SRR067974  39725232  1430108352
SRP000941  SRX027884  Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606  Homo sapiens  ChIP-Seq  GENOMIC  RANDOM  SRS116481  Illumina Genome Analyzer II  32633277  544478483  SRR067973  32633277  1174797972
SRP000941  SRX027883  Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606  Homo sapiens  ChIP-Seq  GENOMIC  RANDOM  SRS004118  Illumina Genome Analyzer II  22150965  3262293717  SRR067972  9357767  336879612
SRP000941  SRX027883  Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606  Homo sapiens  ChIP-Seq  GENOMIC  RANDOM  SRS004118  Illumina Genome Analyzer II  22150965  3262293717  SRR067971  12793198  460555128
Obtaining detailed SRA metadata
$ pysradb metadata SRP075720 --detailed | head
study_accession  experiment_accession  experiment_title  experiment_desc  organism_taxid  organism_name  library_strategy  library_source  library_selection  sample_accession  sample_title  instrument  total_spots  total_size  run_accession  run_total_spots  run_total_bases
SRP075720  SRX1800476  GSM2177569: Kcng4_2la_H9; Mus musculus; RNA-Seq  GSM2177569: Kcng4_2la_H9; Mus musculus; RNA-Seq  10090  Mus musculus  RNA-Seq  TRANSCRIPTOMIC  cDNA  SRS1467643  Illumina HiSeq 2500  2547148  97658407  SRR3587912  2547148  127357400
SRP075720  SRX1800475  GSM2177568: Kcng4_2la_H8; Mus musculus; RNA-Seq  GSM2177568: Kcng4_2la_H8; Mus musculus; RNA-Seq  10090  Mus musculus  RNA-Seq  TRANSCRIPTOMIC  cDNA  SRS1467642  Illumina HiSeq 2500  2676053  101904264  SRR3587911  2676053  133802650
SRP075720  SRX1800474  GSM2177567: Kcng4_2la_H7; Mus musculus; RNA-Seq  GSM2177567: Kcng4_2la_H7; Mus musculus; RNA-Seq  10090  Mus musculus  RNA-Seq  TRANSCRIPTOMIC  cDNA  SRS1467641  Illumina HiSeq 2500  1603567  61729014  SRR3587910  1603567  80178350
SRP075720  SRX1800473  GSM2177566: Kcng4_2la_H6; Mus musculus; RNA-Seq  GSM2177566: Kcng4_2la_H6; Mus musculus; RNA-Seq  10090  Mus musculus  RNA-Seq  TRANSCRIPTOMIC  cDNA  SRS1467640  Illumina HiSeq 2500  2498920  94977329  SRR3587909  2498920  124946000
SRP075720  SRX1800472  GSM2177565: Kcng4_2la_H5; Mus musculus; RNA-Seq  GSM2177565: Kcng4_2la_H5; Mus musculus; RNA-Seq  10090  Mus musculus  RNA-Seq  TRANSCRIPTOMIC  cDNA  SRS1467639  Illumina HiSeq 2500  2226670  83473957  SRR3587908  2226670  111333500
SRP075720  SRX1800471  GSM2177564: Kcng4_2la_H4; Mus musculus; RNA-Seq  GSM2177564: Kcng4_2la_H4; Mus musculus; RNA-Seq  10090  Mus musculus  RNA-Seq  TRANSCRIPTOMIC  cDNA  SRS1467638  Illumina HiSeq 2500  2269546  87486278  SRR3587907  2269546  113477300
SRP075720  SRX1800470  GSM2177563: Kcng4_2la_H3; Mus musculus; RNA-Seq  GSM2177563: Kcng4_2la_H3; Mus musculus; RNA-Seq  10090  Mus musculus  RNA-Seq  TRANSCRIPTOMIC  cDNA  SRS1467636  Illumina HiSeq 2500  2333284  88669838  SRR3587906  2333284  116664200
SRP075720  SRX1800469  GSM2177562: Kcng4_2la_H2; Mus musculus; RNA-Seq  GSM2177562: Kcng4_2la_H2; Mus musculus; RNA-Seq  10090  Mus musculus  RNA-Seq  TRANSCRIPTOMIC  cDNA  SRS1467637  Illumina HiSeq 2500  2071159  79689296  SRR3587905  2071159  103557950
SRP075720  SRX1800468  GSM2177561: Kcng4_2la_H1; Mus musculus; RNA-Seq  GSM2177561: Kcng4_2la_H1; Mus musculus; RNA-Seq  10090  Mus musculus  RNA-Seq  TRANSCRIPTOMIC  cDNA  SRS1467635  Illumina HiSeq 2500  2321657  89307894  SRR3587904  2321657  116082850
Converting SRP to GSE
$ pysradb srp-to-gse SRP075720
study_accession  study_alias
SRP075720  GSE81903
Converting GSM to SRP
$ pysradb gsm-to-srp GSM2177186
experiment_alias  study_accession
GSM2177186  SRP075720
Converting GSM to GSE
$ pysradb gsm-to-gse GSM2177186
experiment_alias  study_alias
GSM2177186  GSE81903
Converting GSM to SRX
$ pysradb gsm-to-srx GSM2177186
experiment_alias  experiment_accession
GSM2177186  SRX1800089
Converting GSM to SRR
$ pysradb gsm-to-srr GSM2177186
experiment_alias  run_accession
GSM2177186  SRR3587529
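All of the conversion commands follow the same `<source>-to-<target> <accession>` pattern, so scripts can build them programmatically. The sketch below is an illustration only: `conversion_command` is a hypothetical helper, not a pysradb function. It checks the requested pair against the conversion subcommands listed in the help output above and returns the argument list without running anything.

```python
# Conversion subcommands copied from the `pysradb` help output in this README.
CONVERSIONS = {
    "gse-to-gsm", "gse-to-srp",
    "gsm-to-gse", "gsm-to-srp", "gsm-to-srr", "gsm-to-srs", "gsm-to-srx",
    "srp-to-gse", "srp-to-srr", "srp-to-srs", "srp-to-srx",
    "srr-to-gsm", "srr-to-srp", "srr-to-srs", "srr-to-srx",
    "srs-to-gsm", "srs-to-srx",
    "srx-to-srp", "srx-to-srr", "srx-to-srs",
}


def conversion_command(source, target, accession):
    """Build the argv list for a conversion, e.g. ("gsm", "srp", "GSM2177186")."""
    subcommand = f"{source.lower()}-to-{target.lower()}"
    if subcommand not in CONVERSIONS:
        raise ValueError(f"no such conversion subcommand: {subcommand}")
    return ["pysradb", subcommand, accession]


if __name__ == "__main__":
    # Prints the command as it would be typed in a shell.
    print(" ".join(conversion_command("gsm", "srp", "GSM2177186")))
```

With pysradb installed, the returned list can be passed to `subprocess.run(cmd, capture_output=True, text=True, check=True)` to capture the output.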
Downloading supplementary files from GEO
$ pysradb download -g GSE161707
Downloading an entire SRA/ENA project (multithreaded)
pysradb makes it easy to download datasets from SRA in parallel. To download using 8 threads:
$ pysradb download -y -t 8 --out-dir ./pysradb_downloads -p SRP063852
Downloads are organized by SRP/SRX/SRR mimicking the hierarchy of SRA projects.
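As a rough illustration of that layout, the sketch below builds the expected location of a run under the output directory. It assumes only the SRP/SRX/SRR nesting described above; `run_location` is a hypothetical helper, and the exact file names and extensions pysradb writes are not specified here.

```python
from pathlib import Path


def run_location(out_dir, srp, srx, srr):
    """Return <out_dir>/<SRP>/<SRX>/<SRR>, mirroring the project > experiment > run hierarchy."""
    return Path(out_dir) / srp / srx / srr


if __name__ == "__main__":
    # Accessions taken from the SRP075720 metadata example earlier in this README.
    print(run_location("pysradb_downloads", "SRP075720", "SRX1800476", "SRR3587912"))
```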
Downloading only certain samples of interest
$ pysradb metadata SRP000941 --detailed | grep 'study\|RNA-Seq' | pysradb download
This will download all RNA-seq samples coming from this project.
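The `grep` pattern keeps the header line (which contains "study") plus every line mentioning RNA-Seq, so `pysradb download` still receives a table with column names. The same filtering can be sketched in Python for cases where a substring match is too loose. This is an illustration, not pysradb code: `filter_by_strategy` is a hypothetical helper, and it assumes the metadata text is tab-separated with a `library_strategy` column.

```python
import csv
import io


def filter_by_strategy(metadata_tsv, strategy):
    """Return the header plus the rows whose library_strategy equals `strategy`.

    Assumes `metadata_tsv` is tab-separated text with a header row.
    """
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    fieldnames = reader.fieldnames or []
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Exact match on one column, unlike grep, which matches anywhere in the line.
        if row.get("library_strategy") == strategy:
            writer.writerow(row)
    return out.getvalue()
```

Unlike the `grep` version, this does not pick up rows that merely mention RNA-Seq in a title or description.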
Ultrafast fastq downloads
With aspera-client installed, pysradb can perform ultrafast downloads. To download all original FASTQs using aspera-client with 8 threads:
$ pysradb download -t 8 --use_ascp -p SRP002605
Refer to the notebook for (shallow) time benchmarks.
Publication
Presentation slides from BOSC (ISMB-ECCB) 2019: https://f1000research.com/slides/8-1183
Citation
Choudhary, Saket. “pysradb: A Python Package to Query next-Generation Sequencing Metadata and Data from NCBI Sequence Read Archive.” F1000Research, vol. 8, F1000 (Faculty of 1000 Ltd), Apr. 2019, p. 532 (https://f1000research.com/articles/8-532/v1)
@article{Choudhary2019,
  doi = {10.12688/f1000research.18676.1},
  url = {https://doi.org/10.12688/f1000research.18676.1},
  year = {2019},
  month = apr,
  publisher = {F1000 (Faculty of 1000 Ltd)},
  volume = {8},
  pages = {532},
  author = {Saket Choudhary},
  title = {pysradb: A {P}ython package to query next-generation sequencing metadata and data from {NCBI} {S}equence {R}ead {A}rchive},
  journal = {F1000Research}
}
Zenodo archive: https://zenodo.org/badge/latestdoi/159590788
Zenodo DOI: 10.5281/zenodo.2306881
Questions?
Open an issue or join our Slack Channel.
History
2.0.1 (2023-03-18)
Fix for pysradb download - using public_url
Fix for SRX -> SRR and related conversions (#183 <https://github.com/saketkc/pysradb/pull/183>)
2.0.0 (2023-02-23)
BREAKING change: Overhaul of how urls and associated metadata are returned (not backward compatible); all column names are lower cased by default
Fix extra space in “organism_taxid” column
Added support for Experiment attributes (#89 <https://github.com/saketkc/pysradb/issues/89#issuecomment-1439319532>)
1.4.2 (06-17-2022)
Fix ENA fastq fetching (#163 <https://github.com/saketkc/pysradb/issues/163>)
1.4.1 (06-04-2022)
Fix for fetching alternative URLs
1.4.0 (06-04-2022)
Added ability to fetch alternative URLs (GCP/AWS) for metadata (#161 <https://github.com/saketkc/pysradb/issues/161>)
Fix for xmltodict 0.13.0 no longer defaulting to OrderedDict (#159 <https://github.com/saketkc/pysradb/pull/159>)
Fix for missing experiment model and description in metadata (#160 <https://github.com/saketkc/pysradb/issues/160>)
1.3.0 (02-18-2022)
1.2.0 (01-10-2022)
Do not exit if a query returns no hits (#149 <https://github.com/saketkc/pysradb/pull/149>)
1.1.0 (12-12-2021)
1.0.1 (01-10-2021)
Dropped Python 3.6 support since pandas 1.2 does not support it
1.0.0 (01-09-2021)
0.11.1 (09-18-2020)
library_layout is now outputted in metadata #56
--detailed unifies columns for ENA fastq links instead of appending _x/_y #59
bugfix for parsing namespace in xml outputs #65
XML errors from NCBI are now handled more gracefully #69
Documentation and dependency updates
0.11.0 (09-04-2020)
pysradb download now supports multiple threads for parallel downloads
pysradb download also supports ultra fast downloads of FASTQs from ENA using aspera-client
0.10.3 (03-26-2020)
Added test cases for SRAweb
API limit exceeding errors are automagically handled
Bug fixes for GSE <=> SRR
Bug fix for metadata - supports multiple SRPs
Contributors
Dibya Gautam
Marius van den Beek
0.10.2 (02-05-2020)
Bug fix: Handle API-rate limit exceeding => Retries
Enhancement: 'Alternatives' URLs are now part of --detailed
0.10.1 (02-04-2020)
Bug fix: Handle Python3.6 for capture_output in subprocess.run
0.10.0 (01-31-2020)
All the subcommands (srx-to-srr, srx-to-srs) will now print additional columns where the first two columns represent the relevant conversion
Fixed a bug when fetching entries with a single efetch record
0.9.9 (01-15-2020)
Major fix: some SRRs would go missing as the experiment dict was being created only once per SRR (See #15)
Features: More detailed metadata by default in the SRAweb mode
See notebook: https://colab.research.google.com/drive/1C60V-
0.9.7 (01-20-2020)
Feature: instrument, run size and total spots are now printed in the metadata by default (SRAweb mode only)
Issue: Fixed an issue with srapath failing on SRP. srapath is now run on individual SRRs.
0.9.6 (07-20-2019)
Introduced SRAweb to perform queries over the web if the SQLite is missing or does not contain the relevant record.
0.9.0 (02-27-2019)
Others
This release completely changes the command line interface replacing click with argparse (https://github.com/saketkc/pysradb/pull/3)
Removed stale Python 2 compatible code
0.8.0 (02-26-2019)
New methods/functionality
srr-to-gsm: convert SRR to GSM
SRAmetadb.sqlite.gz file is deleted by default after extraction
When SRAmetadb is not found, confirmation is sought before downloading
Confirmation option before SRA downloads
Bugfix
download() works with wget
Others
--out_dir is now --out-dir
0.7.1 (02-18-2019)
Important: Python2 is no longer supported. Please consider moving to Python3.
Bugfix
Included docs in the index which were missed out in the previous release
0.7.0 (02-08-2019)
New methods/functionality
gsm-to-srr: convert GSM to SRR
gsm-to-srx: convert GSM to SRX
gsm-to-gse: convert GSM to GSE
Renamed methods
The following command line options have been renamed and the changes are not compatible with the 0.6.0 release:
sra-metadata -> metadata.
sra-search -> search.
srametadb -> metadb.
0.6.0 (12-25-2018)
Bugfix
Fixed bugs introduced in 0.5.0 with API changes where multiple redundant columns were output in sra-metadata
New methods/functionality
download now allows piped inputs
0.5.0 (12-24-2018)
New methods/functionality
Support for filtering by SRX Id for SRA downloads.
srr_to_srx: Convert SRR to SRX/SRP
srp_to_srx: Convert SRP to SRX
Stripped down sra-metadata to give minimal information
Added --assay, --desc, --detailed flags for sra-metadata
Improved table printing on terminal
0.4.2 (12-16-2018)
Bugfix
Fixed unicode error in tests for Python2
0.4.0 (12-12-2018)
New methods/functionality
Added a new BASEdb class to handle common database connections
Initial support for GEOmetadb through GEOdb class
Initial support for a command line interface:
- download Download SRA project (SRPnnnn)
- gse-metadata Fetch metadata for GEO ID (GSEnnnn)
- gse-to-gsm Get GSM(s) for GSE
- gsm-metadata Fetch metadata for GSM ID (GSMnnnn)
- sra-metadata Fetch metadata for SRA project (SRPnnnn)
Added three separate notebooks for SRAdb, GEOdb, CLI usage
0.3.0 (12-05-2018)
New methods/functionality
sample_attribute and experiment_attribute are now included by default in the df returned by sra_metadata()
expand_sample_attribute_columns: expand metadata dataframe based on attributes in sample_attribute column
New methods to guess cell/tissue/strain: guess_cell_type()/guess_tissue_type()/guess_strain_type()
Improved README and usage instructions
0.2.2 (12-03-2018)
New methods/functionality
search_sra() allows full text search on SRA metadata.
0.2.0 (12-03-2018)
Renamed methods
The following methods have been renamed and the changes are not compatible with 0.1.0 release:
get_query() -> query().
sra_convert() -> sra_metadata().
get_table_counts() -> all_row_counts().
New methods/functionality
download_sradb_file() makes fetching SRAmetadb.sqlite file easy; wget is no longer required.
The ftp protocol is now supported besides fasp, and hence aspera-client is now optional. We, however, strongly recommend aspera-client for faster downloads.
Bug fixes
Silenced SettingWithCopyWarning by explicitly doing operations on a copy of the dataframe instead of the original.
Besides these, all methods now follow a numpydoc compatible documentation.
0.1.0 (12-01-2018)
First release on PyPI.