Python package for interacting with SRAdb and downloading datasets from SRA
Project description
pysradb
Python package for interacting with SRAdb and downloading datasets from SRA. (python3 only!)
CLI Usage
pysradb supports command line ussage. The documentation is in progress. See cmdline for some quick usage instructions. See quickstart for a list of instructions for each sub-command.
$ pysradb usage: pysradb [-h] [--version] [--citation] {metadb,metadata,download,search,gse-to-gsm,gse-to-srp,gsm-to-gse,gsm-to-srp,gsm-to-srr,gsm-to-srs,gsm-to-srx,srp-to-gse,srp-to-srr,srp-to-srs,srp-to-srx,srr-to-gsm,srr-to-srp,srr-to-srs,srr-to-srx,srs-to-gsm,srs-to-srx,srx-to-srp,srx-to-srr,srx-to-srs} ... pysradb: Query NGS metadata and data from NCBI Sequence Read Archive. version: 0.9.0. Citation: 10.12688/f1000research.18676.1 optional arguments: -h, --help show this help message and exit --version show program's version number and exit --citation how to cite subcommands: {metadb,metadata,download,search,gse-to-gsm,gse-to-srp,gsm-to-gse,gsm-to-srp,gsm-to-srr,gsm-to-srs,gsm-to-srx,srp-to-gse,srp-to-srr,srp-to-srs,srp-to-srx,srr-to-gsm,srr-to-srp,srr-to-srs,srr-to-srx,srs-to-gsm,srs-to-srx,srx-to-srp,srx-to-srr,srx-to-srs} metadb Download SRAmetadb.sqlite metadata Fetch metadata for SRA project (SRPnnnn) download Download SRA project (SRPnnnn) search Search SRA for matching text gse-to-gsm Get GSM for a GSE gse-to-srp Get SRP for a GSE gsm-to-gse Get GSE for a GSM gsm-to-srp Get SRP for a GSM gsm-to-srr Get SRR for a GSM gsm-to-srs Get SRS for a GSM gsm-to-srx Get SRX for a GSM srp-to-gse Get GSE for a SRP srp-to-srr Get SRR for a SRP srp-to-srs Get SRS for a SRP srp-to-srx Get SRX for a SRP srr-to-gsm Get GSM for a SRR srr-to-srp Get SRP for a SRR srr-to-srs Get SRS for a SRR srr-to-srx Get SRX for a SRR srs-to-gsm Get GSM for a SRS srs-to-srx Get SRX for a SRS srx-to-srp Get SRP for a SRX srx-to-srr Get SRR for a SRX srx-to-srs Get SRS for a SRX
Installation
To install stable version using pip:
pip install pysradb
Alternatively, if you use conda:
conda install -c bioconda pysradb
This step will install all the dependencies. If you have an existing environment with a lot of pre-installed packages, conda might be slow. Please consider creating a new enviroment for pysradb:
conda create -c bioconda -n pysradb PYTHON=3 pysradb
Dependecies
pandas>=0.23.4
tqdm>=4.28
requests>=2.22.0
xmltodict>-0.12.0i
sra-tools
SRAmetadb.sqlite (optional)
Installing sratools
NCBI has slowly transitioned towards using Google cloud for storing SRA files. As such the ftp links are slowly getting obsolete. With release 0.9.5, pysradb has moved to utilizing srapath available through NCBI’s sra-tools for getting the SRA location. Thus aspera-client is no longer required. But, sra-tools is now a requirement and can be installed through bioconda.
Downloading SRAmetadb (optional)
pysradb can utilize a SQLite database file that has preprocessed metadata made available by the SRAdb project. Though, with the release 0.9.5, this database file is not a hard requirement for any of the operations.
SRAmetadb can be downloaded using:
wget -c https://starbuck1.s3.amazonaws.com/sradb/SRAmetadb.sqlite.gz && gunzip SRAmetadb.sqlite.gz
Alternatively, you can also download it using pysradb, which by default downloads it into your current working directory:
$ pysradb metadb
You can also specify an alternate directory for download by supplying the --out-dir <OUT_DIR> argument.
Installing pysradb in development mode
pip install -U pandas tqdm requests xmltodict
git clone https://github.com/saketkc/pysradb.git
cd pysradb
pip install -e .
Using pysradb
Please see usage_scenarios for a few usage scenarios. Here are few hand-picked examples.
Mode: SRAmetadb or SRAWeb
pysradb’s initial versions were completely dependent on the SRAmnetadb.sqlite file made available by the SRAdb project, we refer to this as the SRAmetadb mode. However, with `pysradb 0.9.5, the depedence on the SQLite file has been made optional. In the abseence of the SQLite file, the operations are performed usiNCBi’s esrarch and esummary interface, a mode which we refer to as the SRAweb mode. All the operations with the exception of search can be performed withoudownloading the SQLite file. NOTE: The additional flags such as --desc, -detailed and -expand are currently not fully supported in the SRAweb mode and will be supported in a future release. However, all the basic funcuionality of interconverting one ID to another is available in both SRAweb and SRAmetadb mode.
Search
Search for all projects containing “ribosome profiling”:
$ pysradb search "ribosome profiling" | head study_accession experiment_accession sample_accession run_accession DRP000927 DRX002899 DRS002983 DRR003575 DRP000927 DRX002900 DRS002992 DRR003576 DRP000927 DRX002901 DRS003001 DRR003577 DRP000927 DRX002902 DRS003010 DRR003578 DRP000927 DRX002903 DRS003019 DRR003579 DRP000927 DRX002904 DRS003028 DRR003580 DRP000927 DRX002905 DRS003037 DRR003581 DRP000927 DRX002906 DRS003038 DRR003582 DRP003075 DRX019536 DRS026974 DRR021383
Getting SRA metadata
$ pysradb metadata --db ./SRAmetadb.sqlite SRP000941 --assay --desc --expand | head study_accession experiment_accession sample_accession run_accession library_strategy batch biomaterial_provider biomaterial_type cell_type collection_method differentiation_method differentiation_stage disease donor_age donor_ethnicity donor_health_status donor_id donor_sex line lineage medium molecule passage sample_term_id sex source_name tissue tissue_depot tissue_type SRP000941 SRX006235 SRS004118 SRR018454 ChIP-Seq NaN cellular dynamics international cell line NaN NaN none none none NaN NaN NaN NaN NaN h1 embryonic stem cell mteser genomic dna between 30 and 50 efo_0003042 male NaN NaN NaN NaN SRP000941 SRX006236 SRS004118 SRR018456 ChIP-Seq NaN cellular dynamics international cell line NaN NaN none none none NaN NaN NaN NaN NaN h1 embryonic stem cell mteser genomic dna between 30 and 50 efo_0003042 male NaN NaN NaN NaN SRP000941 SRX006237 SRS004118 SRR018455 ChIP-Seq NaN cellular dynamics international cell line NaN NaN none none none NaN NaN NaN NaN NaN h1 embryonic stem cell mteser genomic dna between 30 and 50 efo_0003042 male NaN NaN NaN NaN SRP000941 SRX006239 SRS004213 SRR019072 Bisulfite-Seq #2 thomson laboratory cell line NaN NaN na embryonic stem cell none NaN NaN NaN NaN NaN h1 na tesr genomic dna 27 efo_0003042 male NaN NaN NaN NaN SRP000941 SRX006239 SRS004213 SRR019080 Bisulfite-Seq #2 thomson laboratory cell line NaN NaN na embryonic stem cell none NaN NaN NaN NaN NaN h1 na tesr genomic dna 27 efo_0003042 male NaN NaN NaN NaN SRP000941 SRX006239 SRS004213 SRR019081 Bisulfite-Seq #2 thomson laboratory cell line NaN NaN na embryonic stem cell none NaN NaN NaN NaN NaN h1 na tesr genomic dna 27 efo_0003042 male NaN NaN NaN NaN SRP000941 SRX006239 SRS004213 SRR019082 Bisulfite-Seq #2 thomson laboratory cell line NaN NaN na embryonic stem cell none NaN NaN NaN NaN NaN h1 na tesr genomic dna 27 efo_0003042 male NaN NaN NaN NaN SRP000941 SRX006239 SRS004213 SRR019083 Bisulfite-Seq #2 thomson laboratory cell line NaN NaN na embryonic stem cell none NaN NaN NaN NaN NaN h1 na tesr genomic dna 27 efo_0003042 male NaN NaN NaN NaN SRP000941 SRX006239 SRS004213 SRR019084 Bisulfite-Seq #2 thomson laboratory cell line NaN NaN na embryonic stem cell none NaN NaN NaN NaN NaN h1 na tesr genomic dna 27 efo_0003042 male NaN NaN NaN NaN
Getting detailed SRA metadata
$ pysradb metadata --db ./SRAmetadb.sqlite SRP075720 --detailed --expand | head study_accession experiment_accession sample_accession run_accession experiment_title experiment_attribute taxon_id library_selection library_layout library_strategy library_source library_name bases spots adapter_spec avg_read_length developmental_stage retina_id source_name tissue SRP075720 SRX1800089 SRS1467259 SRR3587529 GSM2177186: Kcng4_1Ra_A10; Mus musculus; RNA-Seq GEO Accession: GSM2177186 10090 cDNA SINGLE - RNA-Seq TRANSCRIPTOMIC None 79101650 1582033 None 50.0 p17 1ra mus musculus retina__ p17 retina SRP075720 SRX1800090 SRS1467260 SRR3587530 GSM2177187: Kcng4_1Ra_A11; Mus musculus; RNA-Seq GEO Accession: GSM2177187 10090 cDNA SINGLE - RNA-Seq TRANSCRIPTOMIC None 84573650 1691473 None 50.0 p17 1ra mus musculus retina__ p17 retina SRP075720 SRX1800091 SRS1467261 SRR3587531 GSM2177188: Kcng4_1Ra_A12; Mus musculus; RNA-Seq GEO Accession: GSM2177188 10090 cDNA SINGLE - RNA-Seq TRANSCRIPTOMIC None 77835550 1556711 None 50.0 p17 1ra mus musculus retina__ p17 retina SRP075720 SRX1800092 SRS1467262 SRR3587532 GSM2177189: Kcng4_1Ra_A1; Mus musculus; RNA-Seq GEO Accession: GSM2177189 10090 cDNA SINGLE - RNA-Seq TRANSCRIPTOMIC None 73905150 1478103 None 50.0 p17 1ra mus musculus retina__ p17 retina SRP075720 SRX1800093 SRS1467263 SRR3587533 GSM2177190: Kcng4_1Ra_A2; Mus musculus; RNA-Seq GEO Accession: GSM2177190 10090 cDNA SINGLE - RNA-Seq TRANSCRIPTOMIC None 77193150 1543863 None 50.0 p17 1ra mus musculus retina__ p17 retina SRP075720 SRX1800094 SRS1467264 SRR3587534 GSM2177191: Kcng4_1Ra_A3; Mus musculus; RNA-Seq GEO Accession: GSM2177191 10090 cDNA SINGLE - RNA-Seq TRANSCRIPTOMIC None 59205550 1184111 None 50.0 p17 1ra mus musculus retina__ p17 retina SRP075720 SRX1800095 SRS1467265 SRR3587535 GSM2177192: Kcng4_1Ra_A4; Mus musculus; RNA-Seq GEO Accession: GSM2177192 10090 cDNA SINGLE - RNA-Seq TRANSCRIPTOMIC None 61794700 1235894 None 50.0 p17 1ra mus musculus retina__ p17 retina SRP075720 SRX1800096 SRS1467266 SRR3587536 GSM2177193: Kcng4_1Ra_A5; Mus musculus; RNA-Seq GEO Accession: GSM2177193 10090 cDNA SINGLE - RNA-Seq TRANSCRIPTOMIC None 78437650 1568753 None 50.0 p17 1ra mus musculus retina__ p17 retina SRP075720 SRX1800097 SRS1467267 SRR3587537 GSM2177194: Kcng4_1Ra_A6; Mus musculus; RNA-Seq GEO Accession: GSM2177194 10090 cDNA SINGLE - RNA-Seq TRANSCRIPTOMIC None 77392700 1547854 None 50.0 p17 1ra mus musculus retina__ p17 retina
Converting SRP to GSE
$ pysradb srp-to-gse --db ./SRAmetadb.sqlite SRP075720 study_accession study_alias SRP075720 GSE81903
Converting GSM to SRP
$ pysradb gsm-to-srp --db ./SRAmetadb.sqlite GSM2177186 experiment_alias study_accession GSM2177186 SRP075720
Converting GSM to GSE
$ pysradb gsm-to-gse --db ./SRAmetadb.sqlite GSM2177186 experiment_alias study_alias GSM2177186 GSE81903
Converting GSM to SRX
$ pysradb gsm-to-srx --db ./SRAmetadb.sqlite GSM2177186 experiment_alias experiment_accession GSM2177186 SRX1800089
Converting GSM to SRR
$ pysradb gsm-to-srr --db ./SRAmetadb.sqlite GSM2177186 experiment_alias run_accession GSM2177186 SRR3587529
Complete Metadata for any record
Use the --detailed flag:
$ pysradb gsm-to-srr --db ./SRAmetadb.sqlite GSM2177186 --detailed --desc --expand experiment_alias run_accession experiment_accession sample_accession study_accession run_alias sample_alias study_alias developmental_stage retina_id source_name tissue GSM2177186 SRR3587529 SRX1800089 SRS1467259 SRP075720 GSM2177186_r1 GSM2177186 GSE81903 p17 1ra mus musculus retina__ p17 retina
Getting only the assay type
$ pysradb metadata SRP000941 --db ./SRAmetadb.sqlite --assay | tr -s ' ' | cut -f5 -d ' ' | sort | uniq -c 999 Bisulfite-Seq 768 ChIP-Seq 1 library_strategy 121 OTHER 353 RNA-Seq 28 WGS
Downloading entire project
pysradb makes it super easy to download datasets from SRA.
$ pysradb download --db ./SRAmetadb.sqlite --out-dir ./pysradb_downloads -p SRP063852
Downloads are organized by SRP/SRX/SRR mimicking the hiererachy of SRA projects.
Downloading only certain samples of interest
$ pysradb metadata SRP000941 --assay | grep 'study\|RNA-Seq' | pysradb download
This will download all RNA-seq samples coming from this project using aspera-client, if available. Alternatively, it can also use wget.
Demo Notebooks
These notebooks document all the possible features of pysradb:
Citation
Choudhary, Saket. “pysradb: A Python Package to Query next-Generation Sequencing Metadata and Data from NCBI Sequence Read Archive.” F1000Research, vol. 8, F1000 (Faculty of 1000 Ltd), Apr. 2019, p. 532 (https://f1000research.com/articles/8-532/v1)
@article{Choudhary2019, doi = {10.12688/f1000research.18676.1}, url = {https://doi.org/10.12688/f1000research.18676.1}, year = {2019}, month = apr, publisher = {F1000 (Faculty of 1000 Ltd)}, volume = {8}, pages = {532}, author = {Saket Choudhary}, title = {pysradb: A Python package to query next-generation sequencing metadata and data from {NCBI} Sequence Read Archive}, journal = {F1000Research} }
Zenodo archive: https://zenodo.org/badge/latestdoi/159590788
Zenodo DOI: 10.5281/zenodo.2306881
A lot of functionality in pysradb is based on ideas from the original SRAdb package. Please cite the original SRAdb publication:
Zhu, Yuelin, Robert M. Stephens, Paul S. Meltzer, and Sean R. Davis. “SRAdb: query and use public next-generation sequencing data from within R.” BMC bioinformatics 14, no. 1 (2013): 19.
Free software: BSD license
Documentation: https://saketkc.github.io/pysradb
History
0.9.0 (02-27-2019)
Others
This release completely changes the command line interface replacing click with argparse (https://github.com/saketkc/pysradb/pull/3)
Removed Python 2 comptaible stale code
0.8.0 (02-26-2019)
New methods/functionality
srr-to-gsm: convert SRR to GSM
SRAmetadb.sqlite.gz file is deleted by default after extraction
When SRAmetadb is not found a confirmation is seeked before downloading
Confirmation option before SRA downloads
Bugfix
download() works with wget
Others
–out_dir is now out-dir
0.7.1 (02-18-2019)
Important: Python2 is no longer supported. Please consider moving to Python3.
Bugfix
Included docs in the index whihch were missed out in the previous release
0.7.0 (02-08-2019)
New methods/functionality
gsm-to-srr: convert GSM to SRR
gsm-to-srx: convert GSM to SRX
gsm-to-gse: convert GSM to GSE
Renamed methods
The following commad line options have been renamed and the changes are not compatible with 0.6.0 release:
sra-metadata -> metadata.
sra-search -> search.
srametadb -> metadb.
0.6.0 (12-25-2018)
Bugfix
Fixed bugs introduced in 0.5.0 with API changes where multiple redundant columns were output in sra-metadata
New methods/functionality
download now allows piped inputs
0.5.0 (12-24-2018)
New methods/functionality
Support for filtering by SRX Id for SRA downloads.
srr_to_srx: Convert SRR to SRX/SRP
srp_to_srx: Convert SRP to SRX
Stripped down sra-metadata to give minimal information
Added –assay, –desc, –detailed flag for sra-metadata
Improved table printing on terminal
0.4.2 (12-16-2018)
Bugfix
Fixed unicode error in tests for Python2
0.4.0 (12-12-2018)
New methods/functionality
Added a new BASEdb class to handle common database connections
Initial support for GEOmetadb through GEOdb class
Initial support or a command line interface: - download Download SRA project (SRPnnnn) - gse-metadata Fetch metadata for GEO ID (GSEnnnn) - gse-to-gsm Get GSM(s) for GSE - gsm-metadata Fetch metadata for GSM ID (GSMnnnn) - sra-metadata Fetch metadata for SRA project (SRPnnnn)
Added three separate notebooks for SRAdb, GEOdb, CLI usage
0.3.0 (12-05-2018)
New methods/functionality
sample_attribute and experiment_attribute are now included by default in the df returned by sra_metadata()
expand_sample_attribute_columns: expand metadata dataframe based on attributes in `sample_attribute column
New methods to guess cell/tissue/strain: guess_cell_type()/guess_tissue_type()/guess_strain_type()
Improved README and usage instructions
0.2.2 (12-03-2018)
New methods/functionality
search_sra() allows full text search on SRA metadata.
0.2.0 (12-03-2018)
Renamed methods
The following methods have been renamed and the changes are not compatible with 0.1.0 release:
get_query() -> query().
sra_convert() -> sra_metadata().
get_table_counts() -> all_row_counts().
New methods/functionality
download_sradb_file() makes fetching SRAmetadb.sqlite file easy; wget is no longer required.
ftp protocol is now supported besides fsp and hence aspera-client is now optional. We however, strongly recommend aspera-client for faster downloads.
Bug fixes
Silenced SettingWithCopyWarning by excplicitly doing operations on a copy of the dataframe instead of the original.
Besides these, all methods now follow a numpydoc compatible documentation.
0.1.0 (12-01-2018)
First release on PyPI.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.