Python package for interacting with SRAdb and downloading datasets from SRA

pysradb

Python package for interacting with SRAdb and downloading datasets from SRA. (python3 only!)

CLI Usage

pysradb supports command line usage. The documentation is in progress. See cmdline for some quick usage instructions and quickstart for instructions on each sub-command.

$ pysradb
 usage: pysradb [-h] [--version] [--citation]
                {metadb,metadata,download,search,gse-to-gsm,gse-to-srp,gsm-to-gse,gsm-to-srp,gsm-to-srr,gsm-to-srs,gsm-to-srx,srp-to-gse,srp-to-srr,srp-to-srs,srp-to-srx,srr-to-gsm,srr-to-srp,srr-to-srs,srr-to-srx,srs-to-gsm,srs-to-srx,srx-to-srp,srx-to-srr,srx-to-srs}
                ...

 pysradb: Query NGS metadata and data from NCBI Sequence Read Archive.
 version: 0.9.0.
 Citation: 10.12688/f1000research.18676.1

 optional arguments:
   -h, --help            show this help message and exit
   --version             show program's version number and exit
   --citation            how to cite

 subcommands:
   {metadb,metadata,download,search,gse-to-gsm,gse-to-srp,gsm-to-gse,gsm-to-srp,gsm-to-srr,gsm-to-srs,gsm-to-srx,srp-to-gse,srp-to-srr,srp-to-srs,srp-to-srx,srr-to-gsm,srr-to-srp,srr-to-srs,srr-to-srx,srs-to-gsm,srs-to-srx,srx-to-srp,srx-to-srr,srx-to-srs}
     metadb              Download SRAmetadb.sqlite
     metadata            Fetch metadata for SRA project (SRPnnnn)
     download            Download SRA project (SRPnnnn)
     search              Search SRA for matching text
     gse-to-gsm          Get GSM for a GSE
     gse-to-srp          Get SRP for a GSE
     gsm-to-gse          Get GSE for a GSM
     gsm-to-srp          Get SRP for a GSM
     gsm-to-srr          Get SRR for a GSM
     gsm-to-srs          Get SRS for a GSM
     gsm-to-srx          Get SRX for a GSM
     srp-to-gse          Get GSE for a SRP
     srp-to-srr          Get SRR for a SRP
     srp-to-srs          Get SRS for a SRP
     srp-to-srx          Get SRX for a SRP
     srr-to-gsm          Get GSM for a SRR
     srr-to-srp          Get SRP for a SRR
     srr-to-srs          Get SRS for a SRR
     srr-to-srx          Get SRX for a SRR
     srs-to-gsm          Get GSM for a SRS
     srs-to-srx          Get SRX for a SRS
     srx-to-srp          Get SRP for a SRX
     srx-to-srr          Get SRR for a SRX
     srx-to-srs          Get SRS for a SRX
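
The conversion sub-commands listed above follow a regular `<source>-to-<target>` naming pattern. As a minimal sketch (the helper name `subcommand_for` is hypothetical and not part of pysradb), the following Python snippet derives the sub-command name from an accession prefix and checks it against the list printed in the usage text:

```python
# Conversion sub-commands copied from the `pysradb` usage text above.
CONVERSIONS = {
    "gse-to-gsm", "gse-to-srp", "gsm-to-gse", "gsm-to-srp", "gsm-to-srr",
    "gsm-to-srs", "gsm-to-srx", "srp-to-gse", "srp-to-srr", "srp-to-srs",
    "srp-to-srx", "srr-to-gsm", "srr-to-srp", "srr-to-srs", "srr-to-srx",
    "srs-to-gsm", "srs-to-srx", "srx-to-srp", "srx-to-srr", "srx-to-srs",
}


def subcommand_for(accession, target):
    """Return the sub-command converting `accession` to `target` (hypothetical helper)."""
    source = accession[:3].lower()  # e.g. "GSM2177186" -> "gsm"
    name = f"{source}-to-{target.lower()}"
    if name not in CONVERSIONS:
        raise ValueError(f"no such pysradb sub-command: {name}")
    return name


print(subcommand_for("GSM2177186", "srr"))  # gsm-to-srr
```

Not every pair is available (for example, there is no srs-to-srp in the list), which is why the sketch validates the name instead of assuming it exists.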

Installation

To install stable version using pip:

pip install pysradb

Alternatively, if you use conda:

conda install -c bioconda pysradb

This step will install all the dependencies. If you have an existing environment with a lot of pre-installed packages, conda might be slow. Please consider creating a new environment for pysradb:

conda create -c bioconda -n pysradb python=3 pysradb

Dependencies

pandas>=0.23.4
tqdm>=4.28
requests>=2.22.0
xmltodict>=0.12.0
sra-tools
SRAmetadb.sqlite (optional)

Installing sra-tools

NCBI has been gradually moving SRA files to Google Cloud storage, so the FTP links are becoming obsolete. Since release 0.9.5, pysradb uses srapath, available through NCBI’s sra-tools, to get the SRA location. aspera-client is therefore no longer required, but sra-tools now is; it can be installed through bioconda.

Downloading SRAmetadb (optional)

pysradb can use a SQLite database file of preprocessed metadata made available by the SRAdb project. As of release 0.9.5, however, this database file is no longer a hard requirement.

SRAmetadb can be downloaded using:

wget -c https://starbuck1.s3.amazonaws.com/sradb/SRAmetadb.sqlite.gz && gunzip SRAmetadb.sqlite.gz

Alternatively, you can also download it using pysradb, which by default downloads it into your current working directory:

$ pysradb metadb

You can also specify an alternate directory for download by supplying the --out-dir <OUT_DIR> argument.
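
Where neither wget nor gunzip is available, the same download-and-decompress step can be sketched with the Python standard library. This is an illustration only: the function names are hypothetical, the URL is the one quoted in the wget command above, and the actual download is left commented out because the file is large.

```python
import gzip
import shutil
import urllib.request

# URL quoted in the wget command above.
SRAMETADB_URL = "https://starbuck1.s3.amazonaws.com/sradb/SRAmetadb.sqlite.gz"


def download_file(url, dest):
    """Stream `url` to the local path `dest` without loading it into memory."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)


def gunzip_file(src, dest):
    """Decompress the gzip file `src` into `dest`, streaming in chunks."""
    with gzip.open(src, "rb") as compressed, open(dest, "wb") as out:
        shutil.copyfileobj(compressed, out)


# Uncomment to run the actual (large) download:
# download_file(SRAMETADB_URL, "SRAmetadb.sqlite.gz")
# gunzip_file("SRAmetadb.sqlite.gz", "SRAmetadb.sqlite")
```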

Installing pysradb in development mode

pip install -U pandas tqdm requests xmltodict
git clone https://github.com/saketkc/pysradb.git
cd pysradb
pip install -e .

Using pysradb

Please see usage_scenarios for a few usage scenarios. Here are a few hand-picked examples.

Mode: SRAmetadb or SRAweb

pysradb’s initial versions were completely dependent on the SRAmetadb.sqlite file made available by the SRAdb project; we refer to this as the SRAmetadb mode. With pysradb 0.9.5, however, the dependence on the SQLite file has been made optional. In the absence of the SQLite file, operations are performed using NCBI’s esearch and esummary interface, a mode we refer to as the SRAweb mode. All operations except search can be performed without downloading the SQLite file. NOTE: The additional flags --desc, --detailed and --expand are currently not fully supported in the SRAweb mode and will be supported in a future release. However, the basic functionality of converting one ID to another is available in both SRAweb and SRAmetadb modes.
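
The choice between the two modes can be sketched as a small wrapper that builds the command line. This is a hedged illustration, not pysradb code: `build_command` is a hypothetical helper, and it assumes that omitting `--db` is what selects the SRAweb mode.

```python
import os


def build_command(subcommand, argument, db_path="SRAmetadb.sqlite"):
    """Build a pysradb argument list, adding --db only if the SQLite file exists."""
    if os.path.isfile(db_path):
        # SRAmetadb mode: point pysradb at the local SQLite file.
        return ["pysradb", subcommand, "--db", db_path, argument]
    if subcommand == "search":
        # Per the text above, search is the one operation that needs the file.
        raise FileNotFoundError(f"search requires {db_path}")
    # SRAweb mode (assumption: selected by leaving out --db).
    return ["pysradb", subcommand, argument]


# The resulting list can be handed to subprocess.run(...) if desired.
print(build_command("gsm-to-srr", "GSM2177186", db_path="./no-such-file.sqlite"))
```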

Getting SRA metadata

$ pysradb metadata --db ./SRAmetadb.sqlite SRP000941 --assay --desc --expand | head

study_accession experiment_accession sample_accession run_accession library_strategy batch         biomaterial_provider             biomaterial_type cell_type    collection_method differentiation_method                                                                                                                     differentiation_stage                                                                disease                                                          donor_age donor_ethnicity                 donor_health_status                                                                                 donor_id donor_sex line          lineage                                                               medium                                                                                                                                                                                                   molecule     passage                             sample_term_id  sex     source_name              tissue                   tissue_depot tissue_type
SRP000941       SRX006235            SRS004118        SRR018454     ChIP-Seq         NaN           cellular dynamics international  cell line        NaN          NaN               none                                                                                                                                       none                                                                                 none                                                             NaN       NaN                             NaN                                                                                                 NaN      NaN       h1            embryonic stem cell                                                   mteser                                                                                                                                                                                                   genomic dna  between 30 and 50                   efo_0003042     male    NaN                      NaN                      NaN          NaN
SRP000941       SRX006236            SRS004118        SRR018456     ChIP-Seq         NaN           cellular dynamics international  cell line        NaN          NaN               none                                                                                                                                       none                                                                                 none                                                             NaN       NaN                             NaN                                                                                                 NaN      NaN       h1            embryonic stem cell                                                   mteser                                                                                                                                                                                                   genomic dna  between 30 and 50                   efo_0003042     male    NaN                      NaN                      NaN          NaN
SRP000941       SRX006237            SRS004118        SRR018455     ChIP-Seq         NaN           cellular dynamics international  cell line        NaN          NaN               none                                                                                                                                       none                                                                                 none                                                             NaN       NaN                             NaN                                                                                                 NaN      NaN       h1            embryonic stem cell                                                   mteser                                                                                                                                                                                                   genomic dna  between 30 and 50                   efo_0003042     male    NaN                      NaN                      NaN          NaN
SRP000941       SRX006239            SRS004213        SRR019072     Bisulfite-Seq    #2            thomson laboratory               cell line        NaN          NaN               na                                                                                                                                         embryonic stem cell                                                                  none                                                             NaN       NaN                             NaN                                                                                                 NaN      NaN       h1            na                                                                    tesr                                                                                                                                                                                                     genomic dna  27                                  efo_0003042     male    NaN                      NaN                      NaN          NaN
SRP000941       SRX006239            SRS004213        SRR019080     Bisulfite-Seq    #2            thomson laboratory               cell line        NaN          NaN               na                                                                                                                                         embryonic stem cell                                                                  none                                                             NaN       NaN                             NaN                                                                                                 NaN      NaN       h1            na                                                                    tesr                                                                                                                                                                                                     genomic dna  27                                  efo_0003042     male    NaN                      NaN                      NaN          NaN
SRP000941       SRX006239            SRS004213        SRR019081     Bisulfite-Seq    #2            thomson laboratory               cell line        NaN          NaN               na                                                                                                                                         embryonic stem cell                                                                  none                                                             NaN       NaN                             NaN                                                                                                 NaN      NaN       h1            na                                                                    tesr                                                                                                                                                                                                     genomic dna  27                                  efo_0003042     male    NaN                      NaN                      NaN          NaN
SRP000941       SRX006239            SRS004213        SRR019082     Bisulfite-Seq    #2            thomson laboratory               cell line        NaN          NaN               na                                                                                                                                         embryonic stem cell                                                                  none                                                             NaN       NaN                             NaN                                                                                                 NaN      NaN       h1            na                                                                    tesr                                                                                                                                                                                                     genomic dna  27                                  efo_0003042     male    NaN                      NaN                      NaN          NaN
SRP000941       SRX006239            SRS004213        SRR019083     Bisulfite-Seq    #2            thomson laboratory               cell line        NaN          NaN               na                                                                                                                                         embryonic stem cell                                                                  none                                                             NaN       NaN                             NaN                                                                                                 NaN      NaN       h1            na                                                                    tesr                                                                                                                                                                                                     genomic dna  27                                  efo_0003042     male    NaN                      NaN                      NaN          NaN
SRP000941       SRX006239            SRS004213        SRR019084     Bisulfite-Seq    #2            thomson laboratory               cell line        NaN          NaN               na                                                                                                                                         embryonic stem cell                                                                  none                                                             NaN       NaN                             NaN                                                                                                 NaN      NaN       h1            na                                                                    tesr                                                                                                                                                                                                     genomic dna  27                                  efo_0003042     male    NaN                      NaN                      NaN          NaN

Getting detailed SRA metadata

$ pysradb metadata --db ./SRAmetadb.sqlite SRP075720 --detailed --expand | head

study_accession experiment_accession sample_accession run_accession experiment_title                                  experiment_attribute        taxon_id library_selection library_layout library_strategy library_source  library_name  bases      spots   adapter_spec  avg_read_length developmental_stage retina_id source_name                tissue
SRP075720       SRX1800089           SRS1467259       SRR3587529    GSM2177186: Kcng4_1Ra_A10; Mus musculus; RNA-Seq  GEO Accession: GSM2177186  10090     cDNA              SINGLE -       RNA-Seq          TRANSCRIPTOMIC  None         79101650   1582033  None         50.0             p17                 1ra       mus musculus retina__ p17  retina
SRP075720       SRX1800090           SRS1467260       SRR3587530    GSM2177187: Kcng4_1Ra_A11; Mus musculus; RNA-Seq  GEO Accession: GSM2177187  10090     cDNA              SINGLE -       RNA-Seq          TRANSCRIPTOMIC  None         84573650   1691473  None         50.0             p17                 1ra       mus musculus retina__ p17  retina
SRP075720       SRX1800091           SRS1467261       SRR3587531    GSM2177188: Kcng4_1Ra_A12; Mus musculus; RNA-Seq  GEO Accession: GSM2177188  10090     cDNA              SINGLE -       RNA-Seq          TRANSCRIPTOMIC  None         77835550   1556711  None         50.0             p17                 1ra       mus musculus retina__ p17  retina
SRP075720       SRX1800092           SRS1467262       SRR3587532    GSM2177189: Kcng4_1Ra_A1; Mus musculus; RNA-Seq   GEO Accession: GSM2177189  10090     cDNA              SINGLE -       RNA-Seq          TRANSCRIPTOMIC  None         73905150   1478103  None         50.0             p17                 1ra       mus musculus retina__ p17  retina
SRP075720       SRX1800093           SRS1467263       SRR3587533    GSM2177190: Kcng4_1Ra_A2; Mus musculus; RNA-Seq   GEO Accession: GSM2177190  10090     cDNA              SINGLE -       RNA-Seq          TRANSCRIPTOMIC  None         77193150   1543863  None         50.0             p17                 1ra       mus musculus retina__ p17  retina
SRP075720       SRX1800094           SRS1467264       SRR3587534    GSM2177191: Kcng4_1Ra_A3; Mus musculus; RNA-Seq   GEO Accession: GSM2177191  10090     cDNA              SINGLE -       RNA-Seq          TRANSCRIPTOMIC  None         59205550   1184111  None         50.0             p17                 1ra       mus musculus retina__ p17  retina
SRP075720       SRX1800095           SRS1467265       SRR3587535    GSM2177192: Kcng4_1Ra_A4; Mus musculus; RNA-Seq   GEO Accession: GSM2177192  10090     cDNA              SINGLE -       RNA-Seq          TRANSCRIPTOMIC  None         61794700   1235894  None         50.0             p17                 1ra       mus musculus retina__ p17  retina
SRP075720       SRX1800096           SRS1467266       SRR3587536    GSM2177193: Kcng4_1Ra_A5; Mus musculus; RNA-Seq   GEO Accession: GSM2177193  10090     cDNA              SINGLE -       RNA-Seq          TRANSCRIPTOMIC  None         78437650   1568753  None         50.0             p17                 1ra       mus musculus retina__ p17  retina
SRP075720       SRX1800097           SRS1467267       SRR3587537    GSM2177194: Kcng4_1Ra_A6; Mus musculus; RNA-Seq   GEO Accession: GSM2177194  10090     cDNA              SINGLE -       RNA-Seq          TRANSCRIPTOMIC  None         77392700   1547854  None         50.0             p17                 1ra       mus musculus retina__ p17  retina

Converting SRP to GSE

$ pysradb srp-to-gse --db ./SRAmetadb.sqlite SRP075720

study_accession study_alias
SRP075720       GSE81903

Converting GSM to SRP

$ pysradb gsm-to-srp --db ./SRAmetadb.sqlite GSM2177186

experiment_alias study_accession
GSM2177186       SRP075720

Converting GSM to GSE

$ pysradb gsm-to-gse --db ./SRAmetadb.sqlite GSM2177186

experiment_alias study_alias
GSM2177186       GSE81903

Converting GSM to SRX

$ pysradb gsm-to-srx --db ./SRAmetadb.sqlite GSM2177186

experiment_alias experiment_accession
GSM2177186       SRX1800089

Converting GSM to SRR

$ pysradb gsm-to-srr --db ./SRAmetadb.sqlite GSM2177186

experiment_alias run_accession
GSM2177186       SRR3587529
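
The conversion commands above print a header line followed by whitespace-separated rows. As a minimal sketch (the function name is hypothetical), such two-column output can be turned into Python dictionaries as follows. It assumes no value contains spaces, which holds for the accession columns shown here but not for the free-text columns produced by --desc or --expand.

```python
def parse_conversion_output(text):
    """Parse whitespace-separated conversion output into a list of dicts."""
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    if not lines:
        return []
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


sample = """\
experiment_alias run_accession
GSM2177186       SRR3587529
"""
print(parse_conversion_output(sample))
# [{'experiment_alias': 'GSM2177186', 'run_accession': 'SRR3587529'}]
```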

Complete Metadata for any record

Use the --detailed flag:

$ pysradb gsm-to-srr --db ./SRAmetadb.sqlite GSM2177186 --detailed --desc --expand

experiment_alias run_accession experiment_accession sample_accession study_accession run_alias      sample_alias study_alias developmental_stage retina_id source_name                tissue
GSM2177186       SRR3587529    SRX1800089           SRS1467259       SRP075720       GSM2177186_r1  GSM2177186   GSE81903    p17                 1ra       mus musculus retina__ p17  retina

Getting only the assay type

$ pysradb metadata SRP000941 --db ./SRAmetadb.sqlite --assay  | tr -s '  ' | cut -f5 -d ' ' | sort | uniq -c

999 Bisulfite-Seq
768 ChIP-Seq
  1 library_strategy
121 OTHER
353 RNA-Seq
 28 WGS
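
The same tally can be sketched in Python for readers who prefer not to chain tr, cut, sort and uniq. The function below is a hypothetical helper that counts the fifth whitespace-separated field (library_strategy) of each line. Unlike the shell pipeline, it skips the header line, so library_strategy itself is not counted.

```python
from collections import Counter


def count_assays(lines):
    """Count values in the fifth whitespace-separated field, skipping the header."""
    counts = Counter()
    for line in lines[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) >= 5:
            counts[fields[4]] += 1
    return counts


rows = [
    "study_accession experiment_accession sample_accession run_accession library_strategy",
    "SRP000941       SRX006235            SRS004118        SRR018454     ChIP-Seq",
    "SRP000941       SRX006239            SRS004213        SRR019072     Bisulfite-Seq",
    "SRP000941       SRX006239            SRS004213        SRR019080     Bisulfite-Seq",
]
print(count_assays(rows))
```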

Downloading entire project

pysradb makes it super easy to download datasets from SRA.

$ pysradb download --db ./SRAmetadb.sqlite --out-dir ./pysradb_downloads -p SRP063852

Downloads are organized by SRP/SRX/SRR, mimicking the hierarchy of SRA projects.
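
To make the layout concrete, here is a small sketch of where a given run would be expected under the output directory. `run_location` is a hypothetical helper; the file extension, if any, is not stated above and is deliberately left out. The accessions are the ones used in the conversion examples.

```python
from pathlib import PurePosixPath


def run_location(out_dir, srp, srx, srr):
    """Return out_dir/SRP/SRX/SRR; any file extension is left unspecified here."""
    return PurePosixPath(out_dir) / srp / srx / srr


print(run_location("./pysradb_downloads", "SRP075720", "SRX1800089", "SRR3587529"))
# pysradb_downloads/SRP075720/SRX1800089/SRR3587529
```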

Downloading only certain samples of interest

$ pysradb metadata SRP000941 --assay | grep 'study\|RNA-Seq' | pysradb download

This will download all RNA-seq samples coming from this project using aspera-client, if available. Alternatively, it can also use wget.
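
The grep pattern in the command above keeps two kinds of lines: those containing "study" (the header, presumably so that the downstream command still sees the column names) and those containing "RNA-Seq". A minimal Python sketch of the same filter, with a hypothetical function name, looks like this:

```python
def keep_assay(lines, assay="RNA-Seq"):
    """Keep lines containing 'study' (the header) or the given assay name."""
    # Mirrors the shell filter: grep 'study\|RNA-Seq'
    return [line for line in lines if "study" in line or assay in line]


table = [
    "study_accession experiment_accession sample_accession run_accession library_strategy",
    "SRP000941 SRX006235 SRS004118 SRR018454 ChIP-Seq",
    "SRP000941 SRX007165 SRS004118 SRR020287 RNA-Seq",
]
for line in keep_assay(table):
    print(line)
```

The third row of `table` is made-up sample data for illustration; only the first two rows come from the output shown earlier.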

Demo Notebooks

These notebooks document all the possible features of pysradb:

  1. Python API usage

  2. Command line usage

Citation

Choudhary, Saket. “pysradb: A Python Package to Query next-Generation Sequencing Metadata and Data from NCBI Sequence Read Archive.” F1000Research, vol. 8, F1000 (Faculty of 1000 Ltd), Apr. 2019, p. 532 (https://f1000research.com/articles/8-532/v1)

@article{Choudhary2019,
doi = {10.12688/f1000research.18676.1},
url = {https://doi.org/10.12688/f1000research.18676.1},
year = {2019},
month = apr,
publisher = {F1000 (Faculty of 1000 Ltd)},
volume = {8},
pages = {532},
author = {Saket Choudhary},
title = {pysradb: A Python package to query next-generation sequencing metadata and data from {NCBI} Sequence Read Archive},
journal = {F1000Research}
}

Zenodo archive: https://zenodo.org/badge/latestdoi/159590788

Zenodo DOI: 10.5281/zenodo.2306881

A lot of functionality in pysradb is based on ideas from the original SRAdb package. Please cite the original SRAdb publication:

Zhu, Yuelin, Robert M. Stephens, Paul S. Meltzer, and Sean R. Davis. “SRAdb: query and use public next-generation sequencing data from within R.” BMC bioinformatics 14, no. 1 (2013): 19.

History

0.9.0 (02-27-2019)

Others

0.8.0 (02-26-2019)

New methods/functionality

  • srr-to-gsm: convert SRR to GSM

  • SRAmetadb.sqlite.gz file is deleted by default after extraction

  • When SRAmetadb is not found, confirmation is sought before downloading

  • Confirmation option before SRA downloads

Bugfix

  • download() works with wget

Others

  • --out_dir is now --out-dir

0.7.1 (02-18-2019)

Important: Python2 is no longer supported. Please consider moving to Python3.

Bugfix

  • Included docs in the index which were missed in the previous release

0.7.0 (02-08-2019)

New methods/functionality

  • gsm-to-srr: convert GSM to SRR

  • gsm-to-srx: convert GSM to SRX

  • gsm-to-gse: convert GSM to GSE

Renamed methods

The following command line options have been renamed and the changes are not compatible with the 0.6.0 release:

  • sra-metadata -> metadata.

  • sra-search -> search.

  • srametadb -> metadb.

0.6.0 (12-25-2018)

Bugfix

  • Fixed bugs introduced in 0.5.0 with API changes where multiple redundant columns were output in sra-metadata

New methods/functionality

  • download now allows piped inputs

0.5.0 (12-24-2018)

New methods/functionality

  • Support for filtering by SRX Id for SRA downloads.

  • srr_to_srx: Convert SRR to SRX/SRP

  • srp_to_srx: Convert SRP to SRX

  • Stripped down sra-metadata to give minimal information

  • Added --assay, --desc, --detailed flags for sra-metadata

  • Improved table printing on terminal

0.4.2 (12-16-2018)

Bugfix

  • Fixed unicode error in tests for Python2

0.4.0 (12-12-2018)

New methods/functionality

  • Added a new BASEdb class to handle common database connections

  • Initial support for GEOmetadb through GEOdb class

  • Initial support for a command line interface:

      - download: Download SRA project (SRPnnnn)
      - gse-metadata: Fetch metadata for GEO ID (GSEnnnn)
      - gse-to-gsm: Get GSM(s) for GSE
      - gsm-metadata: Fetch metadata for GSM ID (GSMnnnn)
      - sra-metadata: Fetch metadata for SRA project (SRPnnnn)

  • Added three separate notebooks for SRAdb, GEOdb, CLI usage

0.3.0 (12-05-2018)

New methods/functionality

  • sample_attribute and experiment_attribute are now included by default in the df returned by sra_metadata()

  • expand_sample_attribute_columns: expand metadata dataframe based on attributes in the sample_attribute column

  • New methods to guess cell/tissue/strain: guess_cell_type()/guess_tissue_type()/guess_strain_type()

  • Improved README and usage instructions

0.2.2 (12-03-2018)

New methods/functionality

  • search_sra() allows full text search on SRA metadata.

0.2.0 (12-03-2018)

Renamed methods

The following methods have been renamed and the changes are not compatible with 0.1.0 release:

  • get_query() -> query().

  • sra_convert() -> sra_metadata().

  • get_table_counts() -> all_row_counts().

New methods/functionality

  • download_sradb_file() makes fetching SRAmetadb.sqlite file easy; wget is no longer required.

  • ftp protocol is now supported besides fasp, and hence aspera-client is now optional. We do, however, strongly recommend aspera-client for faster downloads.

Bug fixes

  • Silenced SettingWithCopyWarning by explicitly doing operations on a copy of the dataframe instead of the original.

Besides these, all methods now follow a numpydoc compatible documentation.

0.1.0 (12-01-2018)

  • First release on PyPI.
