Utilities for streaming NGS reads from SRA and GA4GH accessions.

ngstream: Streaming NGS reads from public databases

ngstream is a small Python (3.6+) library that makes it easy to stream NGS reads from the Sequence Read Archive (SRA), GA4GH, and (eventually) other public databases, given an accession number.

Dependencies

  • Interacting with SRA requires the NGS SDK and its Python language bindings to be installed. Follow the instructions here. We recommend installing the SDK from bioconda or Homebrew (brew install sratoolkit) and then installing the Python library from GitHub.
  • pysam is required for converting between BAM/CRAM (e.g. downloaded with Htsget) and SAM/FASTQ.

Note that the SRA toolkit caches downloaded data by default. If you mysteriously run out of hard disk space, this is probably why. Instructions on how to configure or disable caching are here. To change the cache location, use the following command (it exits with a nonzero status, but the setting is still applied):

vdb-config --root -s /repository/user/main/public/root=<TARGET_DIR>
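
For scripted setups, the same command can be wrapped in Python. This is a minimal sketch and not part of ngstream: the helper names build_vdb_config_command and set_cache_root are hypothetical, and it assumes vdb-config is on the PATH. It does not treat a nonzero exit status as an error, in line with the note above.

```python
import subprocess


def build_vdb_config_command(target_dir):
    """Return the vdb-config argument list that sets the cache root."""
    return [
        "vdb-config",
        "--root",
        "-s",
        f"/repository/user/main/public/root={target_dir}",
    ]


def set_cache_root(target_dir):
    """Run vdb-config and return its exit status without raising on nonzero."""
    # check=False: the command is reported to exit nonzero even when it works.
    completed = subprocess.run(build_vdb_config_command(target_dir), check=False)
    return completed.returncode
```

Calling set_cache_root("/data/sra_cache") runs the command shown above. It still raises FileNotFoundError if vdb-config is not installed.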

Installation

pip install ngstream

Building from source

Clone this repository and run:

make

Accessing Reads from SRA

import ngstream

# Use the API to stream reads within your own python program.
with ngstream.open("SRR3618567", protocol="sra") as reader:
    for record in reader:
        # `record` is an `ngstream.api.Record` object if the data is
        # single-end, and a `ngstream.api.Fragment` object if the data
        # is paired-end.
        print(record.as_fastq())
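
Since the reader is consumed with a plain for loop, the standard library's itertools.islice should be enough to stop after a fixed number of records instead of streaming a whole run. The sketch below is an illustration only: take is a hypothetical helper, and a plain list stands in for the reader so the example runs without ngstream or network access.

```python
from itertools import islice


def take(reader, limit):
    """Return an iterator over at most `limit` items from any iterable."""
    return islice(reader, limit)


# A plain list stands in for an ngstream reader here.
fake_reader = ["read1", "read2", "read3"]
first_two = list(take(fake_reader, 2))
```

With a real reader, the loop body would stay the same as above: for record in take(reader, 1000): print(record.as_fastq()).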

Accessing Reads Using HTSGet

import ngstream
from pathlib import Path

url = 'https://era.org/hts/ABC123'
ref = ngstream.GenomeReference("GRCh37", Path("GRCh37_sizes.txt"))

with ngstream.open(url, protocol="htsget", reference=ref) as reader:
    for pair in reader:
        print("\n".join(str(read) for read in pair))
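
For background, an htsget server answers a request with a JSON "ticket" that lists the URLs of the data blocks. The client fetches each block in order and concatenates them into a single BAM/CRAM stream. The sketch below illustrates that step using only the standard library. It is not ngstream's implementation, and fetch_block and assemble_payload are hypothetical names. The ticket layout (an "htsget" object with a "urls" array, each entry carrying a "url" and optional "headers") follows the GA4GH htsget specification.

```python
import base64
from urllib.parse import unquote_to_bytes
from urllib.request import Request, urlopen


def fetch_block(entry):
    """Return the bytes for one entry of the ticket's "urls" array."""
    url = entry["url"]
    if url.startswith("data:"):
        # Inline block: data:[<mediatype>][;base64],<data>
        header, _, payload = url.partition(",")
        if header.endswith(";base64"):
            return base64.b64decode(payload)
        return unquote_to_bytes(payload)
    # Remote block: send any headers the ticket asks for.
    request = Request(url, headers=entry.get("headers", {}))
    with urlopen(request) as response:
        return response.read()


def assemble_payload(ticket):
    """Concatenate all blocks of a parsed htsget ticket, in order."""
    return b"".join(fetch_block(entry) for entry in ticket["htsget"]["urls"])
```

The result is the raw BAM/CRAM byte stream, which a tool such as pysam would then decode.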

Dump reads to a file (or pair of files)

import ngstream

# Grab 1000 read pairs from an SRA run and write them to FASTQ files.
accession = 'SRR3618567'
with ngstream.open(accession, protocol="sra") as reader:
    # For paired-end data, `files` holds the two output FASTQ paths.
    files = ngstream.dump_fastq(accession, item_limit=1000)
    print(f"Wrote {reader.read_count} reads from {accession} to {files[0]}, {files[1]}")
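
To make the "pair of files" layout concrete, here is a rough sketch of how paired reads are conventionally split across two FASTQ files, with mate 1 in one file and mate 2 in the other, in matching order. It does not use ngstream and makes no claim about how dump_fastq works internally: the helpers format_fastq and write_pairs, the (name, sequence, qualities) tuples, and the _1.fastq/_2.fastq naming are all assumptions made for illustration.

```python
from pathlib import Path


def format_fastq(name, sequence, qualities):
    """Format one read as a four-line FASTQ record."""
    if len(sequence) != len(qualities):
        raise ValueError("sequence and quality strings must have equal length")
    return f"@{name}\n{sequence}\n+\n{qualities}\n"


def write_pairs(pairs, prefix):
    """Write mate 1 and mate 2 of each pair to two files, in the same order."""
    paths = (Path(f"{prefix}_1.fastq"), Path(f"{prefix}_2.fastq"))
    with paths[0].open("w") as first, paths[1].open("w") as second:
        for mate1, mate2 in pairs:
            first.write(format_fastq(*mate1))
            second.write(format_fastq(*mate2))
    return paths
```

Keeping the two files in lockstep matters because downstream tools typically match mates by their position in each file.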

Use the command-line tools

# Dump all reads from the ABC123 dataset to ABC123.bam in the current directory.
$ htsget_dump https://era.org/hts/ABC123

Documentation

Coming soon

Developers

  • We welcome any contributions via pull requests.
  • Unit tests are highly desirable.
  • Style-wise, we try to adhere to PEP 8, and to the Google Python style guidelines when there is ambiguity.
  • We use Google-style docstrings, which are formatted by the Napoleon Sphinx Plugin.
  • We run pylint as part of each build and strive to maintain a 10/10 score.
  • We enforce a Code of Conduct.
