
A Python package for interacting with SRAdb and downloading datasets from SRA/ENA/GEO



A Python package for retrieving metadata and downloading datasets from SRA/ENA


CLI Usage

pysradb supports command line usage. The documentation is in progress. See cmdline for some quick usage instructions. See quickstart for a list of instructions for each sub-command.

$ pysradb
 usage: pysradb [-h] [--version] [--citation]
                {metadata,download,search,gse-to-gsm,gse-to-srp,gsm-to-gse,gsm-to-srp,gsm-to-srr,gsm-to-srs,gsm-to-srx,srp-to-gse,srp-to-srr,srp-to-srs,srp-to-srx,srr-to-gsm,srr-to-srp,srr-to-srs,srr-to-srx,srs-to-gsm,srs-to-srx,srx-to-srp,srx-to-srr,srx-to-srs}
                ...

 pysradb: Query NGS metadata and data from NCBI Sequence Read Archive.
 version: 1.0
 Citation: 10.12688/f1000research.18676.1

 optional arguments:
   -h, --help            show this help message and exit
   --version             show program's version number and exit
   --citation            how to cite

 subcommands:
   {metadata,download,search,gse-to-gsm,gse-to-srp,gsm-to-gse,gsm-to-srp,gsm-to-srr,gsm-to-srs,gsm-to-srx,srp-to-gse,srp-to-srr,srp-to-srs,srp-to-srx,srr-to-gsm,srr-to-srp,srr-to-srs,srr-to-srx,srs-to-gsm,srs-to-srx,srx-to-srp,srx-to-srr,srx-to-srs}
     metadata            Fetch metadata for SRA project (SRPnnnn)
     download            Download SRA project (SRPnnnn)
     search              Search SRA for matching text
     gse-to-gsm          Get GSM for a GSE
     gse-to-srp          Get SRP for a GSE
     gsm-to-gse          Get GSE for a GSM
     gsm-to-srp          Get SRP for a GSM
     gsm-to-srr          Get SRR for a GSM
     gsm-to-srs          Get SRS for a GSM
     gsm-to-srx          Get SRX for a GSM
     srp-to-gse          Get GSE for a SRP
     srp-to-srr          Get SRR for a SRP
     srp-to-srs          Get SRS for a SRP
     srp-to-srx          Get SRX for a SRP
     srr-to-gsm          Get GSM for a SRR
     srr-to-srp          Get SRP for a SRR
     srr-to-srs          Get SRS for a SRR
     srr-to-srx          Get SRX for a SRR
     srs-to-gsm          Get GSM for a SRS
     srs-to-srx          Get SRX for a SRS
     srx-to-srp          Get SRP for a SRX
     srx-to-srr          Get SRR for a SRX
     srx-to-srs          Get SRS for a SRX
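
The conversion subcommands follow a `<source>-to-<target>` naming pattern, but not every pair exists (the help text lists `srs-to-srx`, for example, but no `srs-to-srp`). The following is a minimal Python sketch that builds the argument list for a conversion and rejects pairs that are absent from the help output above. `conversion_command` is a hypothetical helper written for this illustration and is not part of pysradb.

```python
# Conversion subcommands copied from the `pysradb` help output above.
CONVERSIONS = frozenset({
    "gse-to-gsm", "gse-to-srp",
    "gsm-to-gse", "gsm-to-srp", "gsm-to-srr", "gsm-to-srs", "gsm-to-srx",
    "srp-to-gse", "srp-to-srr", "srp-to-srs", "srp-to-srx",
    "srr-to-gsm", "srr-to-srp", "srr-to-srs", "srr-to-srx",
    "srs-to-gsm", "srs-to-srx",
    "srx-to-srp", "srx-to-srr", "srx-to-srs",
})


def conversion_command(source, target, accession):
    """Return the argv list for `pysradb <source>-to-<target> <accession>`.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    subcommand = f"{source.lower()}-to-{target.lower()}"
    if subcommand not in CONVERSIONS:
        raise ValueError(f"pysradb has no '{subcommand}' subcommand")
    return ["pysradb", subcommand, accession]


print(conversion_command("GSM", "SRR", "GSM2177186"))
```

The returned list can be passed to `subprocess.run(..., check=True)` when pysradb is installed; returning a list rather than a string avoids shell quoting issues.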

Quickstart

A Google Colaboratory version of the most used commands is available in this Colab Notebook. Note that this requires only an active internet connection (no additional downloads are made).

The following notebooks document all the possible features of pysradb:

  1. Python API
  2. Downloading datasets from SRA - command line
  3. Downloading multiple datasets in parallel - Python API
  4. Converting SRA-to-fastq - command line (requires conda)
  5. Downloading subsets of a project - Python API
  6. Download BAMs
  7. Metadata for multiple SRPs
  8. Multithreaded fastq downloads using Aspera Client
  9. Searching SRA/GEO/ENA

Installation

To install the stable version using pip:

pip install pysradb

Alternatively, if you use conda:

conda install -c bioconda pysradb

This step will install all the dependencies. If you have an existing environment with a lot of pre-installed packages, conda might be slow. Please consider creating a new environment for pysradb:

conda create -c bioconda -n pysradb python=3.7 pysradb

Dependencies

pandas
requests
tqdm
xmltodict
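
To check whether the dependencies listed above are present in the current environment before installing, something like the following sketch can be used. It relies only on the standard library; `missing_dependencies` is a hypothetical helper name, and the check only looks for importable top-level modules, not for particular versions.

```python
import importlib.util

# Runtime dependencies as listed above.
DEPENDENCIES = ("pandas", "requests", "tqdm", "xmltodict")


def missing_dependencies(names=DEPENDENCIES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_dependencies()
print("missing:", ", ".join(missing) if missing else "none")
```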

Installing pysradb in development mode

git clone https://github.com/saketkc/pysradb.git
cd pysradb && pip install -r requirements.txt
pip install -e .

Using pysradb

Obtaining SRA metadata

$ pysradb metadata SRP000941 | head

study_accession experiment_accession experiment_title                                                                                                                 experiment_desc                                                                                                                  organism_taxid  organism_name library_strategy library_source  library_selection sample_accession sample_title instrument                    total_spots total_size    run_accession run_total_spots run_total_bases
SRP000941       SRX056722                                                                         Reference Epigenome: ChIP-Seq Analysis of H3K27ac in hESC H1 Cells                                                               Reference Epigenome: ChIP-Seq Analysis of H3K27ac in hESC H1 Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC    ChIP            SRS184466                              Illumina HiSeq 2000    26900401     531654480   SRR179707     26900401         807012030
SRP000941       SRX027889                                                                            Reference Epigenome: ChIP-Seq Analysis of H2AK5ac in hESC Cells                                                                  Reference Epigenome: ChIP-Seq Analysis of H2AK5ac in hESC Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC    ChIP            SRS116481                      Illumina Genome Analyzer II    37528590     779578968   SRR067978     37528590        1351029240
SRP000941       SRX027888                                                                                     Reference Epigenome: ChIP-Seq Input from hESC H1 Cells                                                                           Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC  RANDOM            SRS116483                      Illumina Genome Analyzer II    13603127    3232309537   SRR067977     13603127         489712572
SRP000941       SRX027887                                                                                     Reference Epigenome: ChIP-Seq Input from hESC H1 Cells                                                                           Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC  RANDOM            SRS116562                      Illumina Genome Analyzer II    22430523     506327844   SRR067976     22430523         807498828
SRP000941       SRX027886                                                                                     Reference Epigenome: ChIP-Seq Input from hESC H1 Cells                                                                           Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC  RANDOM            SRS116560                      Illumina Genome Analyzer II    15342951     301720436   SRR067975     15342951         552346236
SRP000941       SRX027885                                                                                     Reference Epigenome: ChIP-Seq Input from hESC H1 Cells                                                                           Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC  RANDOM            SRS116482                      Illumina Genome Analyzer II    39725232     851429082   SRR067974     39725232        1430108352
SRP000941       SRX027884                                                                                     Reference Epigenome: ChIP-Seq Input from hESC H1 Cells                                                                           Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC  RANDOM            SRS116481                      Illumina Genome Analyzer II    32633277     544478483   SRR067973     32633277        1174797972
SRP000941       SRX027883                                                                                     Reference Epigenome: ChIP-Seq Input from hESC H1 Cells                                                                           Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC  RANDOM            SRS004118                      Illumina Genome Analyzer II    22150965    3262293717   SRR067972      9357767         336879612
SRP000941       SRX027883                                                                                     Reference Epigenome: ChIP-Seq Input from hESC H1 Cells                                                                           Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC  RANDOM            SRS004118                      Illumina Genome Analyzer II    22150965    3262293717   SRR067971     12793198         460555128
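
The output contains one row per run, so an experiment with several runs appears more than once (SRX027883 above has runs SRR067972 and SRR067971). The sketch below groups three of the rows shown above by experiment using plain Python. In these sample rows the per-run `run_total_spots` values add up to the experiment's `total_spots`; this is an observation about the sample, not a documented guarantee.

```python
from collections import defaultdict

# (experiment_accession, total_spots, run_accession, run_total_spots)
# copied from three rows of the output above.
rows = [
    ("SRX027884", 32633277, "SRR067973", 32633277),
    ("SRX027883", 22150965, "SRR067972", 9357767),
    ("SRX027883", 22150965, "SRR067971", 12793198),
]

runs_by_experiment = defaultdict(list)
spots_by_experiment = defaultdict(int)
for srx, total_spots, srr, run_spots in rows:
    runs_by_experiment[srx].append(srr)
    spots_by_experiment[srx] += run_spots

for srx, runs in runs_by_experiment.items():
    print(srx, runs, spots_by_experiment[srx])
```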

Obtaining detailed SRA metadata

$ pysradb metadata SRP075720 --detailed | head

study_accession experiment_accession experiment_title                                  experiment_desc                                   organism_taxid  organism_name library_strategy library_source  library_selection sample_accession sample_title instrument           total_spots total_size run_accession run_total_spots run_total_bases
SRP075720       SRX1800476            GSM2177569: Kcng4_2la_H9; Mus musculus; RNA-Seq   GSM2177569: Kcng4_2la_H9; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467643                    Illumina HiSeq 2500  2547148      97658407  SRR3587912    2547148         127357400
SRP075720       SRX1800475            GSM2177568: Kcng4_2la_H8; Mus musculus; RNA-Seq   GSM2177568: Kcng4_2la_H8; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467642                    Illumina HiSeq 2500  2676053     101904264  SRR3587911    2676053         133802650
SRP075720       SRX1800474            GSM2177567: Kcng4_2la_H7; Mus musculus; RNA-Seq   GSM2177567: Kcng4_2la_H7; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467641                    Illumina HiSeq 2500  1603567      61729014  SRR3587910    1603567          80178350
SRP075720       SRX1800473            GSM2177566: Kcng4_2la_H6; Mus musculus; RNA-Seq   GSM2177566: Kcng4_2la_H6; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467640                    Illumina HiSeq 2500  2498920      94977329  SRR3587909    2498920         124946000
SRP075720       SRX1800472            GSM2177565: Kcng4_2la_H5; Mus musculus; RNA-Seq   GSM2177565: Kcng4_2la_H5; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467639                    Illumina HiSeq 2500  2226670      83473957  SRR3587908    2226670         111333500
SRP075720       SRX1800471            GSM2177564: Kcng4_2la_H4; Mus musculus; RNA-Seq   GSM2177564: Kcng4_2la_H4; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467638                    Illumina HiSeq 2500  2269546      87486278  SRR3587907    2269546         113477300
SRP075720       SRX1800470            GSM2177563: Kcng4_2la_H3; Mus musculus; RNA-Seq   GSM2177563: Kcng4_2la_H3; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467636                    Illumina HiSeq 2500  2333284      88669838  SRR3587906    2333284         116664200
SRP075720       SRX1800469            GSM2177562: Kcng4_2la_H2; Mus musculus; RNA-Seq   GSM2177562: Kcng4_2la_H2; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467637                    Illumina HiSeq 2500  2071159      79689296  SRR3587905    2071159         103557950
SRP075720       SRX1800468            GSM2177561: Kcng4_2la_H1; Mus musculus; RNA-Seq   GSM2177561: Kcng4_2la_H1; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467635                    Illumina HiSeq 2500  2321657      89307894  SRR3587904    2321657         116082850

Converting SRP to GSE

$ pysradb srp-to-gse SRP075720

study_accession study_alias
SRP075720       GSE81903

Converting GSM to SRP

$ pysradb gsm-to-srp GSM2177186

experiment_alias study_accession
GSM2177186       SRP075720

Converting GSM to GSE

$ pysradb gsm-to-gse GSM2177186

experiment_alias study_alias
GSM2177186       GSE81903

Converting GSM to SRX

$ pysradb gsm-to-srx GSM2177186

experiment_alias experiment_accession
GSM2177186       SRX1800089

Converting GSM to SRR

$ pysradb gsm-to-srr GSM2177186

experiment_alias run_accession
GSM2177186       SRR3587529
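
The conversion subcommands print a header line followed by one line per match, with the source accession in the first column and the target in the second. A minimal sketch for turning such output into a dictionary is shown below. It assumes whitespace-separated columns whose first two values contain no spaces; `parse_conversion_output` is a hypothetical helper, not a pysradb function.

```python
from collections import defaultdict


def parse_conversion_output(text):
    """Map first-column accessions to the list of second-column accessions.

    Only the first two columns are used, so any extra columns are ignored.
    """
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, records = lines[0], lines[1:]
    mapping = defaultdict(list)
    for fields in records:
        mapping[fields[0]].append(fields[1])
    return header[:2], dict(mapping)


# Sample text copied from the gsm-to-srr output above.
sample = """\
experiment_alias run_accession
GSM2177186       SRR3587529
"""
columns, mapping = parse_conversion_output(sample)
print(columns, mapping)
```

A list is kept per source accession because one accession can map to several targets, for example an experiment with more than one run.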

Downloading an entire SRA/ENA project (multithreaded)

pysradb makes it easy to download datasets from SRA in parallel. To download using 8 threads:

$ pysradb download -y -t 8 --out-dir ./pysradb_downloads -p SRP063852

Downloads are organized by SRP/SRX/SRR, mimicking the hierarchy of SRA projects.
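
As a rough illustration of that layout, the sketch below composes the SRP/SRX/SRR location for a run beneath an output directory. The accessions are taken from the metadata example earlier in this document. `run_location` is a hypothetical helper, and whether the final SRR level is a directory or a file name with an extension is not specified here, so only the path prefix is built.

```python
from pathlib import PurePosixPath


def run_location(out_dir, srp, srx, srr):
    """Return the SRP/SRX/SRR location of a run below out_dir."""
    return PurePosixPath(out_dir) / srp / srx / srr


print(run_location("./pysradb_downloads", "SRP000941", "SRX027883", "SRR067972"))
```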

Downloading only certain samples of interest

$ pysradb metadata SRP000941 --detailed | grep 'study\|RNA-Seq' | pysradb download

This will download all RNA-seq samples coming from this project.
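
The `grep` pattern `study\|RNA-Seq` matches the header line (which contains `study_accession`) as well as the RNA-Seq rows, so the table piped into `pysradb download` keeps its column names. The same selection can be sketched in Python by filtering on the `library_strategy` column instead of matching whole lines. This sketch assumes the metadata has been saved as tab-separated text; the small table combines one row from each of the two metadata examples above, reduced to four columns for illustration.

```python
import csv
import io


def filter_by_strategy(tsv_text, strategy="RNA-Seq"):
    """Keep the header plus rows whose library_strategy equals `strategy`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row["library_strategy"] == strategy:
            writer.writerow(row)
    return out.getvalue()


table = (
    "study_accession\texperiment_accession\tlibrary_strategy\trun_accession\n"
    "SRP000941\tSRX056722\tChIP-Seq\tSRR179707\n"
    "SRP075720\tSRX1800476\tRNA-Seq\tSRR3587912\n"
)
print(filter_by_strategy(table), end="")
```

Filtering on a named column avoids accidental matches when the search text also appears in a title or description field.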

Ultrafast fastq downloads

With aspera-client installed, pysradb can perform ultrafast downloads. To download all original fastqs using 8 threads:

$ pysradb download -t 8 --use_ascp -p SRP002605
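
Since `--use_ascp` only makes sense when aspera-client is installed, a wrapper script might add the flag conditionally. The sketch below assumes the Aspera executable is named `ascp` and is discoverable on `PATH`; `download_command` is a hypothetical helper, and the `which` parameter exists only so the lookup can be replaced in tests.

```python
import shutil


def download_command(srp, threads=8, which=shutil.which):
    """Build a `pysradb download` argv, adding --use_ascp only if ascp is found."""
    command = ["pysradb", "download", "-t", str(threads)]
    if which("ascp") is not None:  # assumption: the Aspera binary is called ascp
        command.append("--use_ascp")
    command += ["-p", srp]
    return command


print(download_command("SRP002605"))
```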

Refer to the notebook for (shallow) time benchmarks.

Citation

Choudhary, Saket. “pysradb: A Python Package to Query next-Generation Sequencing Metadata and Data from NCBI Sequence Read Archive.” F1000Research, vol. 8, F1000 (Faculty of 1000 Ltd), Apr. 2019, p. 532 (https://f1000research.com/articles/8-532/v1)

@article{Choudhary2019,
doi = {10.12688/f1000research.18676.1},
url = {https://doi.org/10.12688/f1000research.18676.1},
year = {2019},
month = apr,
publisher = {F1000 (Faculty of 1000 Ltd)},
volume = {8},
pages = {532},
author = {Saket Choudhary},
title = {pysradb: A {P}ython package to query next-generation sequencing metadata and data from {NCBI} {S}equence {R}ead {A}rchive},
journal = {F1000Research}
}

Zenodo archive: https://zenodo.org/badge/latestdoi/159590788

Zenodo DOI: 10.5281/zenodo.2306881

Questions?

Join our Slack Channel or open an issue.

History

1.0.1 (01-10-2021)

  • Dropped Python 3.6 support, since pandas 1.2 does not support it

1.0.0 (01-09-2021)

  • Retired metadb and SRAdb based search through CLI - everything defaults to SRAweb
  • SRAweb now supports search
  • N/A is now replaced with pd.NA
  • Two new fields in --detailed: instrument_model and instrument_model_desc #75
  • Updated documentation

0.11.1 (09-18-2020)

  • library_layout is now outputted in metadata #56
  • --detailed unifies columns for ENA fastq links instead of appending _x/_y #59
  • bugfix for parsing namespace in xml outputs #65
  • XML errors from NCBI are now handled more gracefully #69
  • Documentation and dependency updates

0.11.0 (09-04-2020)

  • pysradb download now supports multiple threads for parallel downloads
  • pysradb download also supports ultra fast downloads of FASTQs from ENA using aspera-client

0.10.3 (03-26-2020)

  • Added test cases for SRAweb
  • API limit exceeding errors are automagically handled
  • Bug fixes for GSE <=> SRR
  • Bug fix for metadata - supports multiple SRPs

Contributors

  • Dibya Gautam
  • Marius van den Beek

0.10.2 (02-05-2020)

  • Bug fix: Handle API-rate limit exceeding => Retries
  • Enhancement: ‘Alternatives’ URLs are now part of --detailed

0.10.1 (02-04-2020)

  • Bug fix: Handle Python3.6 for capture_output in subprocess.run

0.10.0 (01-31-2020)

  • All the subcommands (srx-to-srr, srx-to-srs) will now print additional columns where the first two columns represent the relevant conversion
  • Fixed a bug when fetching entries with a single efetch record

0.9.9 (01-15-2020)

  • Major fix: some SRRs would go missing as the experiment dict was being created only once per SRR (See #15)
  • Features: More detailed metadata by default in the SRAweb mode
  • See notebook: https://colab.research.google.com/drive/1C60V-

0.9.7 (01-20-2020)

  • Feature: instrument, run size and total spots are now printed in the metadata by default (SRAweb mode only)
  • Issue: Fixed an issue with srapath failing on SRP. srapath is now run on individual SRRs.

0.9.6 (07-20-2019)

  • Introduced SRAweb to perform queries over the web if the SQLite is missing or does not contain the relevant record.

0.9.0 (02-27-2019)

Others

0.8.0 (02-26-2019)

New methods/functionality

  • srr-to-gsm: convert SRR to GSM
  • SRAmetadb.sqlite.gz file is deleted by default after extraction
  • When SRAmetadb is not found, confirmation is sought before downloading
  • Confirmation option before SRA downloads

Bugfix

  • download() works with wget

Others

  • --out_dir is now --out-dir

0.7.1 (02-18-2019)

Important: Python2 is no longer supported. Please consider moving to Python3.

Bugfix

  • Included docs in the index which were missed out in the previous release

0.7.0 (02-08-2019)

New methods/functionality

  • gsm-to-srr: convert GSM to SRR
  • gsm-to-srx: convert GSM to SRX
  • gsm-to-gse: convert GSM to GSE

Renamed methods

The following command line options have been renamed and the changes are not compatible with the 0.6.0 release:

  • sra-metadata -> metadata.
  • sra-search -> search.
  • srametadb -> metadb.

0.6.0 (12-25-2018)

Bugfix

  • Fixed bugs introduced in 0.5.0 with API changes where multiple redundant columns were output in sra-metadata

New methods/functionality

  • download now allows piped inputs

0.5.0 (12-24-2018)

New methods/functionality

  • Support for filtering by SRX Id for SRA downloads.
  • srr_to_srx: Convert SRR to SRX/SRP
  • srp_to_srx: Convert SRP to SRX
  • Stripped down sra-metadata to give minimal information
  • Added --assay, --desc, --detailed flags for sra-metadata
  • Improved table printing on terminal

0.4.2 (12-16-2018)

Bugfix

  • Fixed unicode error in tests for Python2

0.4.0 (12-12-2018)

New methods/functionality

  • Added a new BASEdb class to handle common database connections
  • Initial support for GEOmetadb through GEOdb class
  • Initial support for a command line interface:
      • download: Download SRA project (SRPnnnn)
      • gse-metadata: Fetch metadata for GEO ID (GSEnnnn)
      • gse-to-gsm: Get GSM(s) for GSE
      • gsm-metadata: Fetch metadata for GSM ID (GSMnnnn)
      • sra-metadata: Fetch metadata for SRA project (SRPnnnn)
  • Added three separate notebooks for SRAdb, GEOdb, CLI usage

0.3.0 (12-05-2018)

New methods/functionality

  • sample_attribute and experiment_attribute are now included by default in the df returned by sra_metadata()
  • expand_sample_attribute_columns: expand metadata dataframe based on attributes in the sample_attribute column
  • New methods to guess cell/tissue/strain: guess_cell_type()/guess_tissue_type()/guess_strain_type()
  • Improved README and usage instructions

0.2.2 (12-03-2018)

New methods/functionality

  • search_sra() allows full text search on SRA metadata.

0.2.0 (12-03-2018)

Renamed methods

The following methods have been renamed and the changes are not compatible with 0.1.0 release:

  • get_query() -> query().
  • sra_convert() -> sra_metadata().
  • get_table_counts() -> all_row_counts().

New methods/functionality

  • download_sradb_file() makes fetching SRAmetadb.sqlite file easy; wget is no longer required.
  • The ftp protocol is now supported besides fasp, so aspera-client is now optional. However, we strongly recommend aspera-client for faster downloads.

Bug fixes

  • Silenced SettingWithCopyWarning by explicitly doing operations on a copy of the dataframe instead of the original.

Besides these, all methods now follow a numpydoc compatible documentation.

0.1.0 (12-01-2018)

  • First release on PyPI.
