Python package for interacting with SRAdb and downloading datasets from SRA
#######
pysradb
#######
.. image:: https://zenodo.org/badge/159590788.svg
:target: https://zenodo.org/badge/latestdoi/159590788
.. image:: https://img.shields.io/pypi/v/pysradb.svg?style=flat-square
:target: https://pypi.python.org/pypi/pysradb
.. image:: https://img.shields.io/travis/saketkc/pysradb.svg?style=flat-square
:target: https://travis-ci.com/saketkc/pysradb
.. image:: https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square
:target: http://bioconda.github.io/recipes/pysradb/README.html
.. image:: https://codecov.io/gh/saketkc/pysradb/branch/master/graph/badge.svg?style=flat-square
:target: https://codecov.io/gh/saketkc/pysradb
Python package for interacting with SRAdb and downloading datasets from SRA.
(python3 only!)
.. raw:: html
<a href="https://asciinema.org/a/0C3SjYmPTkkemldprUpdVhiKx?speed=5&autoplay=1" target="_blank"><img src="https://asciinema.org/a/0C3SjYmPTkkemldprUpdVhiKx.svg" /></a>
*********
CLI Usage
*********
``pysradb`` supports command line usage. The documentation
is in progress. See `cmdline <https://github.com/saketkc/pysradb/blob/master/docs/cmdline.rst>`_ for
some quick usage instructions. See `quickstart <https://www.saket-choudhary.me/pysradb/quickstart.html#the-full-list-of-possible-pysradb-operations>`_ for
a list of instructions for each sub-command.
::
$ pysradb
Usage: pysradb [OPTIONS] COMMAND [ARGS]...
pysradb: Query NGS metadata and data from NCBI Sequence Read Archive.
Citation: Pending.
Options:
--version Show the version and exit.
-h, --help Show this message and exit.
Commands:
download Download SRA project (SRPnnnn)
gse-to-gsm Get GSM for a GSE
gse-to-srp Get SRP for a GSE
gsm-to-gse Get GSE for a GSM
gsm-to-srp Get SRP for a GSM
gsm-to-srr Get SRR for a GSM
gsm-to-srx Get SRX for a GSM
metadata Fetch metadata for SRA project (SRPnnnn)
metadb Download SRAmetadb.sqlite
search Search SRA for matching text
srp-to-gse Get GSE for a SRP
srp-to-srr Get SRR for a SRP
srp-to-srs Get SRS for a SRP
srr-to-gsm Get GSM for a SRR
srp-to-srx Get SRX for a SRP
srr-to-srp Get SRP for a SRR
srr-to-srs Get SRS for a SRR
srr-to-srx Get SRX for a SRR
srs-to-srx Get SRX for a SRS
srx-to-srp Get SRP for a SRX
srx-to-srr Get SRR for a SRX
srx-to-srs Get SRS for a SRX
************
Installation
************
To install the stable version using ``pip``:
.. code-block:: bash
pip install pysradb
Alternatively, if you use conda:
.. code-block:: bash
conda install -c bioconda pysradb
This step will install all the dependencies except aspera-client_ (which is not required, but highly recommended).
If you have an existing environment with a lot of pre-installed packages, conda might be `slow <https://github.com/bioconda/bioconda-recipes/issues/13774>`_.
Please consider creating a new environment for ``pysradb``:
.. code-block:: bash
conda create -c bioconda -n pysradb python=3 pysradb
Dependencies
============
.. code-block:: bash
pandas>=0.23.4
tqdm>=4.28
click>=7.0
aspera-client
SRAmetadb.sqlite
Downloading SRAmetadb
=====================
We need a SQLite database file that has preprocessed metadata made available by the
`SRAdb <https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-19>`_ project.
SRAmetadb can be downloaded using:
.. code-block:: bash
wget -c https://starbuck1.s3.amazonaws.com/sradb/SRAmetadb.sqlite.gz && gunzip SRAmetadb.sqlite.gz
Alternatively, you can also download it using ``pysradb``, which by default downloads it into your
current working directory:
::
$ pysradb metadb
You can also specify an alternate directory for download by supplying the ``--out-dir <OUT_DIR>`` argument.
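If ``wget`` or ``gunzip`` is not available, the decompression step can also be done from Python. The following is a minimal sketch using only the standard library; the helper name ``gunzip_file`` is hypothetical and is not part of ``pysradb``.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(src, dst=None, chunk_size=1024 * 1024):
    """Decompress ``src`` (a .gz file) to ``dst`` and return the output path.

    Hypothetical helper, not part of pysradb. If ``dst`` is omitted,
    the trailing ``.gz`` suffix is dropped from ``src``.
    """
    src = Path(src)
    if dst is None:
        if src.suffix != ".gz":
            raise ValueError(f"expected a .gz file, got {src.name}")
        dst = src.with_suffix("")
    dst = Path(dst)
    # Stream in chunks so a large database file never has to fit in memory.
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
    return dst
```

Calling ``gunzip_file("SRAmetadb.sqlite.gz")`` would write ``SRAmetadb.sqlite`` next to the archive. Unlike ``gunzip``, this sketch leaves the original ``.gz`` file in place.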
.. _aspera-client:
aspera-client
=============
We strongly recommend using ``aspera-client`` (which uses UDP), since it typically `provides faster downloads <http://www.skullbox.net/tcpudp.php>`_ than ``ftp``/``http`` based downloads.
PDF instructions are available on IBM's `website <https://downloads.asperasoft.com/connect2/>`_.
Direct download links:
- `Linux <https://download.asperasoft.com/download/sw/connect/3.8.1/ibm-aspera-connect-3.8.1.161274-linux-g2.12-64.tar.gz>`_
- `MacOS <https://download.asperasoft.com/download/sw/connect/3.8.1/IBMAsperaConnectInstaller-3.8.1.161274.dmg>`_
- `Windows <https://download.asperasoft.com/download/sw/connect/3.8.1/IBMAsperaConnect-ML-3.8.1.161274.msi>`_
Once you have downloaded the archive for your OS (for example, the Linux tarball), follow these steps to install aspera:
.. code-block:: bash
tar -zxvf ibm-aspera-connect-3.8.1.161274-linux-g2.12-64.tar.gz
bash ibm-aspera-connect-3.8.1.161274-linux-g2.12-64.sh
Installing IBM Aspera Connect
Deploying IBM Aspera Connect (/home/saket/.aspera/connect) for the current user only.
Install complete.
Installing pysradb in development mode
======================================
.. code-block:: bash
pip install -U pandas tqdm
git clone https://github.com/saketkc/pysradb.git
cd pysradb
pip install -e .
*************
Using pysradb
*************
Please see `usage_scenarios <https://saket-choudhary.me/pysradb/usage_scenarios.html>`_ for a few usage scenarios.
Here are a few hand-picked examples.
Getting SRA metadata
====================
::
$ pysradb metadata --db ./SRAmetadb.sqlite SRP000941 --assay --desc --expand | head
study_accession experiment_accession sample_accession run_accession library_strategy batch biomaterial_provider biomaterial_type cell_type collection_method differentiation_method differentiation_stage disease donor_age donor_ethnicity donor_health_status donor_id donor_sex line lineage medium molecule passage sample_term_id sex source_name tissue tissue_depot tissue_type
SRP000941 SRX006235 SRS004118 SRR018454 ChIP-Seq NaN cellular dynamics international cell line NaN NaN none none none NaN NaN NaN NaN NaN h1 embryonic stem cell mteser genomic dna between 30 and 50 efo_0003042 male NaN NaN NaN NaN
SRP000941 SRX006236 SRS004118 SRR018456 ChIP-Seq NaN cellular dynamics international cell line NaN NaN none none none NaN NaN NaN NaN NaN h1 embryonic stem cell mteser genomic dna between 30 and 50 efo_0003042 male NaN NaN NaN NaN
SRP000941 SRX006237 SRS004118 SRR018455 ChIP-Seq NaN cellular dynamics international cell line NaN NaN none none none NaN NaN NaN NaN NaN h1 embryonic stem cell mteser genomic dna between 30 and 50 efo_0003042 male NaN NaN NaN NaN
SRP000941 SRX006239 SRS004213 SRR019072 Bisulfite-Seq #2 thomson laboratory cell line NaN NaN na embryonic stem cell none NaN NaN NaN NaN NaN h1 na tesr genomic dna 27 efo_0003042 male NaN NaN NaN NaN
SRP000941 SRX006239 SRS004213 SRR019080 Bisulfite-Seq #2 thomson laboratory cell line NaN NaN na embryonic stem cell none NaN NaN NaN NaN NaN h1 na tesr genomic dna 27 efo_0003042 male NaN NaN NaN NaN
SRP000941 SRX006239 SRS004213 SRR019081 Bisulfite-Seq #2 thomson laboratory cell line NaN NaN na embryonic stem cell none NaN NaN NaN NaN NaN h1 na tesr genomic dna 27 efo_0003042 male NaN NaN NaN NaN
SRP000941 SRX006239 SRS004213 SRR019082 Bisulfite-Seq #2 thomson laboratory cell line NaN NaN na embryonic stem cell none NaN NaN NaN NaN NaN h1 na tesr genomic dna 27 efo_0003042 male NaN NaN NaN NaN
SRP000941 SRX006239 SRS004213 SRR019083 Bisulfite-Seq #2 thomson laboratory cell line NaN NaN na embryonic stem cell none NaN NaN NaN NaN NaN h1 na tesr genomic dna 27 efo_0003042 male NaN NaN NaN NaN
SRP000941 SRX006239 SRS004213 SRR019084 Bisulfite-Seq #2 thomson laboratory cell line NaN NaN na embryonic stem cell none NaN NaN NaN NaN NaN h1 na tesr genomic dna 27 efo_0003042 male NaN NaN NaN NaN
Getting detailed SRA metadata
=============================
::
$ pysradb metadata --db ./SRAmetadb.sqlite SRP075720 --detailed --expand | head
study_accession experiment_accession sample_accession run_accession experiment_title experiment_attribute taxon_id library_selection library_layout library_strategy library_source library_name bases spots adapter_spec avg_read_length developmental_stage retina_id source_name tissue
SRP075720 SRX1800089 SRS1467259 SRR3587529 GSM2177186: Kcng4_1Ra_A10; Mus musculus; RNA-Seq GEO Accession: GSM2177186 10090 cDNA SINGLE - RNA-Seq TRANSCRIPTOMIC None 79101650 1582033 None 50.0 p17 1ra mus musculus retina__ p17 retina
SRP075720 SRX1800090 SRS1467260 SRR3587530 GSM2177187: Kcng4_1Ra_A11; Mus musculus; RNA-Seq GEO Accession: GSM2177187 10090 cDNA SINGLE - RNA-Seq TRANSCRIPTOMIC None 84573650 1691473 None 50.0 p17 1ra mus musculus retina__ p17 retina
SRP075720 SRX1800091 SRS1467261 SRR3587531 GSM2177188: Kcng4_1Ra_A12; Mus musculus; RNA-Seq GEO Accession: GSM2177188 10090 cDNA SINGLE - RNA-Seq TRANSCRIPTOMIC None 77835550 1556711 None 50.0 p17 1ra mus musculus retina__ p17 retina
SRP075720 SRX1800092 SRS1467262 SRR3587532 GSM2177189: Kcng4_1Ra_A1; Mus musculus; RNA-Seq GEO Accession: GSM2177189 10090 cDNA SINGLE - RNA-Seq TRANSCRIPTOMIC None 73905150 1478103 None 50.0 p17 1ra mus musculus retina__ p17 retina
SRP075720 SRX1800093 SRS1467263 SRR3587533 GSM2177190: Kcng4_1Ra_A2; Mus musculus; RNA-Seq GEO Accession: GSM2177190 10090 cDNA SINGLE - RNA-Seq TRANSCRIPTOMIC None 77193150 1543863 None 50.0 p17 1ra mus musculus retina__ p17 retina
SRP075720 SRX1800094 SRS1467264 SRR3587534 GSM2177191: Kcng4_1Ra_A3; Mus musculus; RNA-Seq GEO Accession: GSM2177191 10090 cDNA SINGLE - RNA-Seq TRANSCRIPTOMIC None 59205550 1184111 None 50.0 p17 1ra mus musculus retina__ p17 retina
SRP075720 SRX1800095 SRS1467265 SRR3587535 GSM2177192: Kcng4_1Ra_A4; Mus musculus; RNA-Seq GEO Accession: GSM2177192 10090 cDNA SINGLE - RNA-Seq TRANSCRIPTOMIC None 61794700 1235894 None 50.0 p17 1ra mus musculus retina__ p17 retina
SRP075720 SRX1800096 SRS1467266 SRR3587536 GSM2177193: Kcng4_1Ra_A5; Mus musculus; RNA-Seq GEO Accession: GSM2177193 10090 cDNA SINGLE - RNA-Seq TRANSCRIPTOMIC None 78437650 1568753 None 50.0 p17 1ra mus musculus retina__ p17 retina
SRP075720 SRX1800097 SRS1467267 SRR3587537 GSM2177194: Kcng4_1Ra_A6; Mus musculus; RNA-Seq GEO Accession: GSM2177194 10090 cDNA SINGLE - RNA-Seq TRANSCRIPTOMIC None 77392700 1547854 None 50.0 p17 1ra mus musculus retina__ p17 retina
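The ``--expand`` flag appears to turn free-text sample attributes into one column per attribute, which is why columns such as ``developmental_stage`` and ``retina_id`` show up in the output above. The sketch below illustrates the idea only and is not pysradb's implementation: the ``||`` separator, the ``key: value`` layout, and the function name ``expand_sample_attribute`` are all assumptions, and the input string is made up for illustration.

```python
def expand_sample_attribute(attribute):
    """Split a 'key: value || key: value' string into a dict.

    Illustrative only: the '||' separator and ':' delimiter are
    assumptions about how sample attributes are stored.
    """
    expanded = {}
    for pair in attribute.split("||"):
        key, sep, value = pair.partition(":")
        if not sep:
            continue  # skip fragments that carry no 'key: value' pair
        # Normalise keys to lowercase snake_case column names.
        column = key.strip().lower().replace(" ", "_")
        expanded[column] = value.strip().lower()
    return expanded


# Made-up attribute string, shaped to resemble the columns shown above.
row = "source_name: Retina || developmental stage: P17 || retina id: 1Ra || tissue: retina"
print(expand_sample_attribute(row))
```

Each resulting key would become one column of the expanded table, with rows that lack a given attribute left empty.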
Converting SRP to GSE
=====================
::
$ pysradb srp-to-gse --db ./SRAmetadb.sqlite SRP075720
study_accession study_alias
SRP075720 GSE81903
Converting GSM to SRP
=====================
::
$ pysradb gsm-to-srp --db ./SRAmetadb.sqlite GSM2177186
experiment_alias study_accession
GSM2177186 SRP075720
Converting GSM to GSE
=====================
::
$ pysradb gsm-to-gse --db ./SRAmetadb.sqlite GSM2177186
experiment_alias study_alias
GSM2177186 GSE81903
Converting GSM to SRX
=====================
::
$ pysradb gsm-to-srx --db ./SRAmetadb.sqlite GSM2177186
experiment_alias experiment_accession
GSM2177186 SRX1800089
Converting GSM to SRR
=====================
::
$ pysradb gsm-to-srr --db ./SRAmetadb.sqlite GSM2177186
experiment_alias run_accession
GSM2177186 SRR3587529
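The conversions above amount to looking up one accession column given another. The following sketch shows that idea with ``sqlite3`` against a toy in-memory table; it is not how ``pysradb`` is implemented. The column names are taken from the outputs above, while the table name ``sra`` and the helper ``convert`` are assumptions made for illustration.

```python
import sqlite3

# Toy stand-in for SRAmetadb.sqlite: one table with the accession columns
# that appear in the outputs above. The table name ``sra`` is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sra (study_accession TEXT, study_alias TEXT, "
    "experiment_accession TEXT, experiment_alias TEXT, run_accession TEXT)"
)
conn.execute(
    "INSERT INTO sra VALUES (?, ?, ?, ?, ?)",
    ("SRP075720", "GSE81903", "SRX1800089", "GSM2177186", "SRR3587529"),
)

ALLOWED_COLUMNS = {
    "study_accession",
    "study_alias",
    "experiment_accession",
    "experiment_alias",
    "run_accession",
}


def convert(conn, source_column, target_column, accession):
    """Return distinct (source, target) pairs for one accession."""
    # Column names cannot be bound as SQL parameters, so whitelist them.
    if source_column not in ALLOWED_COLUMNS or target_column not in ALLOWED_COLUMNS:
        raise ValueError("unknown column")
    query = (
        f"SELECT DISTINCT {source_column}, {target_column} "
        f"FROM sra WHERE {source_column} = ?"
    )
    return conn.execute(query, (accession,)).fetchall()


print(convert(conn, "experiment_alias", "run_accession", "GSM2177186"))
# [('GSM2177186', 'SRR3587529')]
```

Swapping the column pair gives the other conversions, for example ``convert(conn, "study_accession", "study_alias", "SRP075720")`` for SRP to GSE.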
Complete Metadata for any record
================================
Use the ``--detailed`` flag:
::
$ pysradb gsm-to-srr --db ./SRAmetadb.sqlite GSM2177186 --detailed --desc --expand
experiment_alias run_accession experiment_accession sample_accession study_accession run_alias sample_alias study_alias developmental_stage retina_id source_name tissue
GSM2177186 SRR3587529 SRX1800089 SRS1467259 SRP075720 GSM2177186_r1 GSM2177186 GSE81903 p17 1ra mus musculus retina__ p17 retina
Getting only the assay type
===========================
::
$ pysradb metadata SRP000941 --db ./SRAmetadb.sqlite --assay | tr -s ' ' | cut -f5 -d ' ' | sort | uniq -c
999 Bisulfite-Seq
768 ChIP-Seq
1 library_strategy
121 OTHER
353 RNA-Seq
28 WGS
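Note that the shell pipeline also counts the header row (the ``1 library_strategy`` line). As an alternative sketch, the same tally can be computed in Python by locating the column from the header instead of by position. The function name ``count_assays`` is hypothetical, and the sample text reuses the first five columns of rows shown earlier. Splitting on whitespace assumes that no column before ``library_strategy`` contains spaces.

```python
from collections import Counter


def count_assays(metadata_text, column="library_strategy"):
    """Count values of ``column`` in whitespace-delimited metadata text."""
    lines = [line for line in metadata_text.splitlines() if line.strip()]
    header = lines[0].split()
    index = header.index(column)
    # Reading the column position from the header means the header row
    # itself is never counted, unlike in the shell pipeline.
    return Counter(line.split()[index] for line in lines[1:])


sample = """\
study_accession experiment_accession sample_accession run_accession library_strategy
SRP000941 SRX006235 SRS004118 SRR018454 ChIP-Seq
SRP000941 SRX006236 SRS004118 SRR018456 ChIP-Seq
SRP000941 SRX006239 SRS004213 SRR019072 Bisulfite-Seq
"""
print(count_assays(sample))
```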
Downloading entire project
==========================
``pysradb`` makes it super easy to download datasets from SRA.
::
$ pysradb download --db ./SRAmetadb.sqlite --out-dir ./pysradb_downloads -p SRP063852
Downloads are organized by ``SRP/SRX/SRR``, mimicking the hierarchy of SRA projects.
Downloading only certain samples of interest
============================================
::
$ pysradb metadata SRP000941 --assay | grep 'study\|RNA-Seq' | pysradb download
This will download all ``RNA-Seq`` samples from this project using ``aspera-client``, if available.
Otherwise, it can fall back to ``wget``.
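The ``grep 'study\|RNA-Seq'`` step keeps the header line (which contains ``study``) together with every ``RNA-Seq`` row, so that the downstream command still sees column names. A rough Python equivalent of that filter is sketched below; ``keep_assay`` is a hypothetical helper, and the ``RNA-Seq`` row uses made-up accessions purely for illustration.

```python
def keep_assay(lines, assay="RNA-Seq"):
    """Yield the header line plus every row whose text contains ``assay``."""
    for number, line in enumerate(lines):
        # Line 0 is the header; it is always kept so column names survive.
        if number == 0 or assay in line:
            yield line


rows = [
    "study_accession experiment_accession sample_accession run_accession library_strategy\n",
    "SRP000941 SRX006235 SRS004118 SRR018454 ChIP-Seq\n",
    "SRP000941 SRX000001 SRS000001 SRR000001 RNA-Seq\n",  # made-up row
]
for line in keep_assay(rows):
    print(line, end="")
```

Saved as a script, the same generator could be fed ``sys.stdin`` and its output piped on to ``pysradb download``, just as in the shell pipeline.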
**************
Demo Notebooks
**************
These notebooks document all the possible features of ``pysradb``:
1. `Python API usage <https://nbviewer.jupyter.org/github/saketkc/pysradb/blob/master/notebooks/01.SRAdb-demo.ipynb>`_
2. `Command line usage <https://nbviewer.jupyter.org/github/saketkc/pysradb/blob/master/notebooks/03.CommandLine-demo.ipynb>`_
********
Citation
********
Zenodo archive: https://zenodo.org/badge/latestdoi/159590788
DOI: 10.5281/zenodo.2306881
A lot of functionality in ``pysradb`` is based on ideas from the original `SRAdb package <https://bioconductor.org/packages/release/bioc/html/SRAdb.html>`_. Please cite the original SRAdb publication:
Zhu, Yuelin, Robert M. Stephens, Paul S. Meltzer, and Sean R. Davis. "SRAdb: query and use public next-generation sequencing data from within R." BMC bioinformatics 14, no. 1 (2013): 19.
* Free software: BSD license
* Documentation: https://saketkc.github.io/pysradb
#######
History
#######
*******************
0.8.0 (02-26-2019)
*******************
New methods/functionality
=========================
* `srr-to-gsm`: convert SRR to GSM
* SRAmetadb.sqlite.gz file is deleted by default after extraction
* When SRAmetadb is not found, confirmation is sought before downloading
* Confirmation option before SRA downloads
Bugfix
======
* download() works with wget
Others
======
* `--out_dir` is now `--out-dir`
*******************
0.7.1 (02-18-2019)
*******************
Important: Python2 is no longer supported.
Please consider moving to Python3.
Bugfix
======
* Included docs in the index which were left
  out of the previous release
*******************
0.7.0 (02-08-2019)
*******************
New methods/functionality
=========================
* `gsm-to-srr`: convert GSM to SRR
* `gsm-to-srx`: convert GSM to SRX
* `gsm-to-gse`: convert GSM to GSE
Renamed methods
===============
The following command line options have been renamed
and the changes are not compatible with 0.6.0
release:
* `sra-metadata` -> `metadata`.
* `sra-search` -> `search`.
* `srametadb` -> `metadb`.
*******************
0.6.0 (12-25-2018)
*******************
Bugfix
======
* Fixed bugs introduced in 0.5.0 with API changes where
multiple redundant columns were output in `sra-metadata`
New methods/functionality
=========================
* `download` now allows piped inputs
*******************
0.5.0 (12-24-2018)
*******************
New methods/functionality
=========================
* Support for filtering by SRX Id for SRA downloads.
* `srr_to_srx`: Convert SRR to SRX/SRP
* `srp_to_srx`: Convert SRP to SRX
* Stripped down `sra-metadata` to give minimal information
* Added `--assay`, `--desc`, `--detailed` flag for `sra-metadata`
* Improved table printing on terminal
*******************
0.4.2 (12-16-2018)
*******************
Bugfix
======
* Fixed unicode error in tests for Python2
*******************
0.4.0 (12-12-2018)
*******************
New methods/functionality
=========================
* Added a new `BASEdb` class to handle common database connections
* Initial support for GEOmetadb through GEOdb class
* Initial support for a command line interface:
- download Download SRA project (SRPnnnn)
- gse-metadata Fetch metadata for GEO ID (GSEnnnn)
- gse-to-gsm Get GSM(s) for GSE
- gsm-metadata Fetch metadata for GSM ID (GSMnnnn)
- sra-metadata Fetch metadata for SRA project (SRPnnnn)
* Added three separate notebooks for SRAdb, GEOdb, CLI usage
*******************
0.3.0 (12-05-2018)
*******************
New methods/functionality
=========================
* `sample_attribute` and `experiment_attribute` are now included by default in the df returned by `sra_metadata()`
* `expand_sample_attribute_columns`: expand metadata dataframe based on attributes in the `sample_attribute` column
* New methods to guess cell/tissue/strain: `guess_cell_type()`/`guess_tissue_type()`/`guess_strain_type()`
* Improved README and usage instructions
*******************
0.2.2 (12-03-2018)
*******************
New methods/functionality
=========================
* `search_sra()` allows full text search on SRA metadata.
*******************
0.2.0 (12-03-2018)
*******************
Renamed methods
===============
The following methods have been renamed
and the changes are not compatible with 0.1.0
release:
* `get_query()` -> `query()`.
* `sra_convert()` -> `sra_metadata()`.
* `get_table_counts()` -> `all_row_counts()`.
New methods/functionality
=========================
* `download_sradb_file()` makes fetching `SRAmetadb.sqlite` file easy; wget is no longer
required.
* `ftp` protocol is now supported besides `fasp` and hence `aspera-client` is now optional.
  We, however, strongly recommend `aspera-client` for faster downloads.
Bug fixes
=========
* Silenced `SettingWithCopyWarning` by explicitly doing operations on a copy of
the dataframe instead of the original.
Besides these, all methods now have `numpydoc`-compatible documentation.
******************
0.1.0 (12-01-2018)
******************
* First release on PyPI.