Utilities for streaming NGS reads from SRA and GA4GH accessions.
ngstream: Streaming NGS reads from public databases
ngstream is a small Python (3.6+) library that makes it easy to stream NGS reads from the Sequence Read Archive (SRA), GA4GH, and (eventually) other public databases, given an accession number.
Dependencies
- Interacting with SRA requires NGS and the Python language bindings to be installed. Follow the instructions here. We recommend installing the SDK from bioconda or Homebrew (brew install sratoolkit) and then installing the Python library from GitHub.
- pysam is required for converting between BAM/CRAM (e.g. downloaded with Htsget) and SAM/FASTQ.
Note that the SRA toolkit caches downloaded data by default; if you mysteriously run out of hard disk space, this is probably why. Instructions on how to configure or disable caching are here. To change the cache location, use the following command (it exits with a nonzero status, but it still works):
vdb-config --root -s /repository/user/main/public/root=<TARGET_DIR>
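If you set the cache location from a setup script, the nonzero exit status needs to be tolerated rather than treated as a failure. The following is a minimal sketch of one way to do that from Python. It assumes `vdb-config` is on your `PATH`; the helper names `build_cache_command` and `set_cache_root` are hypothetical and are not part of ngstream or the SRA toolkit.

```python
import subprocess
from pathlib import Path


def build_cache_command(target_dir):
    """Return the vdb-config argument list that sets the cache root.

    Hypothetical helper: it only assembles the command shown above.
    """
    root = Path(target_dir).expanduser().resolve()
    return [
        "vdb-config",
        "--root",
        "-s",
        f"/repository/user/main/public/root={root}",
    ]


def set_cache_root(target_dir):
    """Run vdb-config and return its exit status without raising."""
    # check=False: a nonzero status is expected for this command,
    # so it must not be turned into a CalledProcessError.
    completed = subprocess.run(build_cache_command(target_dir), check=False)
    return completed.returncode
```

Calling `set_cache_root("/data/sra_cache")` would run the command with that directory as `<TARGET_DIR>`. Because the exit status is not a reliable signal here, you may want to confirm the change by inspecting the toolkit configuration afterwards.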
Installation
pip install ngstream
Building from source
Clone this repository and run:
make
Accessing Reads from SRA
import ngstream

# Use the API to stream reads within your own python program.
with ngstream.open("SRR3618567", protocol="sra") as reader:
    for record in reader:
        # `record` is an `ngstream.api.Record` object if the data is
        # single-end, and a `ngstream.api.Fragment` object if the data
        # is paired-end.
        print(record.as_fastq())
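Because the reader is consumed with a plain `for` loop, standard iterator tools should apply to it. As a sketch, the helper below stops after a fixed number of records, which is useful for a quick look at a large run without streaming all of it. The name `take` is hypothetical, and the demo uses a plain list as a stand-in so that it runs without network access.

```python
from itertools import islice


def take(records, limit):
    """Return an iterator over at most `limit` items from `records`."""
    return islice(records, limit)


# Stand-in data. With ngstream, `records` would be the reader
# obtained from ngstream.open(...), assuming it is a normal iterable.
demo_records = ["read1", "read2", "read3", "read4"]
for record in take(demo_records, 2):
    print(record)
```

Inside the `with` block above, the loop header would become `for record in take(reader, 100):` to stop after the first 100 records.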
Accessing Reads Using HTSGet
import ngstream
from pathlib import Path

url = 'https://era.org/hts/ABC123'
ref = ngstream.GenomeReference("GRCh37", Path("GRCh37_sizes.txt"))
with ngstream.open(url, protocol="htsget", reference=ref) as reader:
    for pair in reader:
        print("\n".join(str(read) for read in pair))
Dump reads to a file (or pair of files)
import ngstream

# Grab 1000 read pairs from an SRA run and write them to FASTQ files.
accession = 'SRR3618567'
with ngstream.open(accession, protocol="sra") as reader:
    files = ngstream.dump_fastq(accession, item_limit=1000)
    print(f"Wrote {reader.read_count} reads from {accession} to {files[0]}, {files[1]}")
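To sanity-check a dump, you can count the records in each output file. The sketch below assumes the files are standard four-line-per-record FASTQ, optionally gzip-compressed. Whether `dump_fastq` compresses its output is not stated here, so the `.gz` handling is a precaution. `count_fastq_records` is a hypothetical helper, not part of ngstream.

```python
import gzip
from pathlib import Path


def count_fastq_records(path):
    """Count records in a FASTQ file, assuming four lines per record."""
    path = Path(path)
    # Assumption: a ".gz" suffix means the file is gzip-compressed.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        line_count = sum(1 for _ in handle)
    if line_count % 4 != 0:
        raise ValueError(f"{path} has {line_count} lines, not a multiple of 4")
    return line_count // 4
```

For a paired-end dump, `count_fastq_records(files[0])` and `count_fastq_records(files[1])` should return the same number.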
Use the command-line tools
# Dump all reads from the ABC123 dataset to ABC123.bam in the current directory.
$ htsget_dump https://era.org/hts/ABC123
Documentation
Coming soon
Developers
- We welcome any contributions via pull requests.
- Unit tests are highly desirable.
- Style-wise, we try to adhere to PEP8, and to the Google Python style guidelines when there is ambiguity.
- We use Google-style docstrings, which are formatted by the Napoleon Sphinx Plugin.
- We run pylint as part of each build and strive to maintain a 10/10 score.
- We enforce a Code of Conduct.
Todo
- Add EGA support https://www.ebi.ac.uk/ega/about/your_EGA_account/download_streaming_client#API
- Acceleration of SRA downloads using prefetch: https://twitter.com/PhilippBayer/status/1076800095910150145
- Replace pysam with bamnostic: https://github.com/betteridiot/bamnostic
- Use pysradb to e.g. convert SRX IDs to list of SRRs: https://github.com/saketkc/pysradb