pysradb
Python package for interacting with SRAdb and downloading datasets from SRA.
Installation
To install stable version:
pip install pysradb
This step installs all the Python dependencies. aspera-client and the SRAmetadb.sqlite file must be obtained separately, as described below. Both Python 2 and Python 3 are supported.
Dependencies
pandas>=0.23.4
tqdm>=4.28
aspera-client
SRAmetadb.sqlite
SRAmetadb
SRAmetadb can be downloaded as:
wget -c https://starbuck1.s3.amazonaws.com/sradb/SRAmetadb.sqlite.gz && gunzip SRAmetadb.sqlite.gz
Alternatively, you can also download it using pysradb:
from pysradb import download_sradb_file
download_sradb_file()
SRAmetadb.sqlite.gz: 2.44GB [01:10, 36.9MB/s]
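Before pointing pysradb at the file, it can be worth confirming that the decompressed download is actually a SQLite database. The following is a minimal sketch, assuming the file was decompressed to SRAmetadb.sqlite in the working directory. The helper name looks_like_sqlite is hypothetical and not part of pysradb; it relies only on the fact that every SQLite 3 file begins with the 16-byte string "SQLite format 3\0". It catches a wrong or empty file, not a truncated one.

```python
import os

SQLITE_MAGIC = b"SQLite format 3\x00"  # first 16 bytes of any SQLite 3 file


def looks_like_sqlite(path):
    """Return True if `path` exists and starts with the SQLite 3 header."""
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as handle:
        return handle.read(len(SQLITE_MAGIC)) == SQLITE_MAGIC


if __name__ == "__main__":
    # 'SRAmetadb.sqlite' is the decompressed file name used in this README.
    print(looks_like_sqlite("SRAmetadb.sqlite"))
```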
aspera-client
We strongly recommend using aspera-client, which transfers over UDP and is typically faster than ftp/http-based downloads.
PDF instructions are available here: https://downloads.asperasoft.com/connect2/.
Direct download links:
MacOS: https://download.asperasoft.com/download/sw/connect/3.8.1/IBMAsperaConnectInstaller-3.8.1.161274.dmg
Windows: https://download.asperasoft.com/download/sw/connect/3.8.1/IBMAsperaConnect-ML-3.8.1.161274.msi
Once you have downloaded the archive for your OS (Linux in this example), follow these steps to install aspera:
tar -zxvf ibm-aspera-connect-3.8.1.161274-linux-g2.12-64.tar.gz
bash ibm-aspera-connect-3.8.1.161274-linux-g2.12-64.sh
Installing IBM Aspera Connect
Deploying IBM Aspera Connect (/home/saket/.aspera/connect) for the current user only.
Install complete.
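To confirm that the install worked, one can look for the ascp binary. This is a minimal sketch for Python 3, assuming the installer placed it at ~/.aspera/connect/bin/ascp: the per-user directory comes from the log above, while the bin/ascp suffix is an assumption about the Connect layout. find_ascp is a hypothetical helper, not a pysradb function.

```python
import os
import shutil


def find_ascp(home=None):
    """Return the path to an executable `ascp`, or None if none is found."""
    home = home or os.path.expanduser("~")
    # Assumed per-user install location of IBM Aspera Connect.
    candidate = os.path.join(home, ".aspera", "connect", "bin", "ascp")
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    # Fall back to whatever `ascp` is on PATH, if any.
    return shutil.which("ascp")


if __name__ == "__main__":
    print(find_ascp())
```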
Installing pysradb in development mode
pip install -U pandas tqdm
git clone https://github.com/saketkc/pysradb.git
cd pysradb
pip install -e .
Interacting with SRA
Fetch the metadata table (SRA-runtable)
from pysradb import SRAdb
db = SRAdb('SRAmetadb.sqlite')
df = db.sra_metadata('SRP098789')
df.head()
study_accession | experiment_accession | experiment_title | run_accession | taxon_id | library_selection | library_layout | library_strategy | library_source | library_name | bases | spots | adapter_spec | avg_read_length |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SRP098789 | SRX2536403 | GSM2475997: 1.5 µM PF-067446846, 10 min, rep 1; Homo sapiens; OTHER | SRR5227288 | 9606 | other | SINGLE - | OTHER | TRANSCRIPTOMIC | | 2104142750 | 42082855 | | 50 |
SRP098789 | SRX2536404 | GSM2475998: 1.5 µM PF-067446846, 10 min, rep 2; Homo sapiens; OTHER | SRR5227289 | 9606 | other | SINGLE - | OTHER | TRANSCRIPTOMIC | | 2082873050 | 41657461 | | 50 |
SRP098789 | SRX2536405 | GSM2475999: 1.5 µM PF-067446846, 10 min, rep 3; Homo sapiens; OTHER | SRR5227290 | 9606 | other | SINGLE - | OTHER | TRANSCRIPTOMIC | | 2023148650 | 40462973 | | 50 |
SRP098789 | SRX2536406 | GSM2476000: 0.3 µM PF-067446846, 10 min, rep 1; Homo sapiens; OTHER | SRR5227291 | 9606 | other | SINGLE - | OTHER | TRANSCRIPTOMIC | | 2057165950 | 41143319 | | 50 |
SRP098789 | SRX2536407 | GSM2476001: 0.3 µM PF-067446846, 10 min, rep 2; Homo sapiens; OTHER | SRR5227292 | 9606 | other | SINGLE - | OTHER | TRANSCRIPTOMIC | | 3027621850 | 60552437 | | 50 |
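The numeric columns in this output are related: for each of these single-end runs, bases equals spots multiplied by avg_read_length. The sketch below checks that relation on three rows, with the values copied from the table above. It uses plain Python rather than the DataFrame, and is_consistent is a hypothetical helper name.

```python
# (run_accession, bases, spots, avg_read_length), copied from the table above.
rows = [
    ("SRR5227288", 2104142750, 42082855, 50),
    ("SRR5227289", 2082873050, 41657461, 50),
    ("SRR5227290", 2023148650, 40462973, 50),
]


def is_consistent(bases, spots, avg_read_length):
    """True when total bases equals number of reads times read length."""
    return bases == spots * avg_read_length


inconsistent = [run for run, b, s, n in rows if not is_consistent(b, s, n)]
print(inconsistent)  # → []
```

The same check may not hold exactly for other studies, for example when read lengths vary within a run.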
Downloading an entire project, arranged experiment-wise
from pysradb import SRAdb
db = SRAdb('SRAmetadb.sqlite')
df = db.sra_metadata('SRP017942')
db.download(df)
Downloading a subset of experiments
df = db.sra_metadata('SRP000941')
print(df.library_strategy.unique())
['ChIP-Seq' 'Bisulfite-Seq' 'RNA-Seq' 'WGS' 'OTHER']
df_rna = df[df.library_strategy == 'RNA-Seq']
db.download(df=df_rna, out_dir='/pysradb_downloads')
Demo
https://nbviewer.jupyter.org/github/saketkc/pysradb/blob/master/notebooks/demo.ipynb
Citation
Pending.
A lot of functionality in pysradb is based on ideas from the original SRAdb package. Please cite the original SRAdb publication:
Zhu, Yuelin, Robert M. Stephens, Paul S. Meltzer, and Sean R. Davis. “SRAdb: query and use public next-generation sequencing data from within R.” BMC bioinformatics 14, no. 1 (2013): 19.
Free software: BSD license
Documentation: https://saketkc.github.io/pysradb
History
0.2.0 (12-03-2018)
Renamed methods
The following methods have been renamed and the changes are not compatible with 0.1.0 release:
get_query() -> query().
sra_convert() -> sra_metadata().
get_table_counts() -> all_row_counts().
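Scripts that must run against both releases can look methods up by name. This is a minimal sketch under the assumption that only the three renames listed above matter; RENAMED and resolve_method are hypothetical names introduced here, not part of pysradb.

```python
# Old (0.1.0) name -> new (0.2.0) name, as listed above.
RENAMED = {
    "get_query": "query",
    "sra_convert": "sra_metadata",
    "get_table_counts": "all_row_counts",
}


def resolve_method(obj, old_name):
    """Return the bound method for `old_name`, preferring its 0.2.0 name."""
    new_name = RENAMED.get(old_name, old_name)
    method = getattr(obj, new_name, None)
    if method is None:
        # Running against 0.1.0: fall back to the original name.
        method = getattr(obj, old_name)
    return method
```

With a db object created as in the examples above, resolve_method(db, "sra_convert")("SRP098789") would call sra_metadata() on 0.2.0 and sra_convert() on 0.1.0.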
New methods/functionality
download_sradb_file() makes fetching SRAmetadb.sqlite file easy; wget is no longer required.
The ftp protocol is now supported in addition to fasp, so aspera-client is now optional. However, we still strongly recommend aspera-client for faster downloads.
Bug fixes
Silenced SettingWithCopyWarning by explicitly performing operations on a copy of the dataframe instead of the original.
Besides these, all methods now have numpydoc-compatible documentation.
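The SettingWithCopyWarning fix in this release follows a common pandas idiom: take an explicit copy of a filtered dataframe before assigning to it. The sketch below is a generic illustration of that idiom, not the code from pysradb; the accessions and the selected column are made up.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "run_accession": ["SRR1", "SRR2", "SRR3"],  # made-up accessions
        "library_strategy": ["RNA-Seq", "WGS", "RNA-Seq"],
    }
)

# Copy the filtered rows before modifying them, so the assignment below
# cannot be mistaken for a write to a view of `df`.
df_rna = df[df["library_strategy"] == "RNA-Seq"].copy()
df_rna["selected"] = True

print(list(df.columns))  # → ['run_accession', 'library_strategy']
print(len(df_rna))       # → 2
```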
0.1.0 (12-01-2018)
First release on PyPI.
Hashes for pysradb-0.2.0-py2.py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | ba86135628c869fb1d42a026c3af2b6b864e72ffdba69fb86e9d514b2427f4db |
MD5 | f876803313d3da7b3e2df8f9fbed638f |
BLAKE2b-256 | e858b40e6ab98a2e4667f9b4762f8f3329b6e8f7660d875fe082809573db447b |