Utilities for streaming NGS reads from SRA and GA4GH accessions.

Project description

ngstream: Streaming NGS reads from public databases

ngstream is a small Python (3.6+) library that makes it easy to stream NGS reads from the Sequence Read Archive (SRA), GA4GH, and (eventually) other public databases, given an accession number.

Dependencies

  • Interacting with SRA requires NGS and the Python language bindings to be installed. Follow the instructions here. We recommend installing the SDK from bioconda or Homebrew (brew install sratoolkit) and then installing the Python library from GitHub.
  • pysam is required for converting between BAM/CRAM (e.g. downloaded with Htsget) and SAM/FASTQ.
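
To illustrate what converting an aligned BAM/SAM record back to FASTQ involves, here is a minimal pure-Python sketch. It is not pysam's or ngstream's code; the function name `aligned_to_fastq` and its arguments are hypothetical. It relies only on the SAM convention that a record with FLAG bit 0x10 stores its sequence reverse complemented, so the original read is recovered by reverse complementing the sequence and reversing the quality string.

```python
# Translation table for complementing bases (hypothetical helper, not part of pysam).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

# SAM FLAG bit that marks SEQ as stored reverse complemented.
REVERSE_FLAG = 0x10


def aligned_to_fastq(name, flag, seq, qual):
    """Format one aligned record as a four-line FASTQ entry.

    `seq` and `qual` are the SEQ and QUAL strings as stored in the alignment.
    """
    if flag & REVERSE_FLAG:
        # Undo the strand flip so the output matches the read as sequenced.
        seq = seq.translate(_COMPLEMENT)[::-1]
        qual = qual[::-1]
    return f"@{name}\n{seq}\n+\n{qual}\n"


forward = aligned_to_fastq("r1", 0, "AACG", "ABCD")
reverse = aligned_to_fastq("r1", 16, "AACG", "ABCD")
```

In practice a library such as pysam handles this, along with pairing and the binary decoding of BAM/CRAM, which is why it is listed as a dependency.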

Note that the SRA toolkit caches downloaded data by default; if you mysteriously run out of hard disk space, this is probably why. Instructions on how to configure or disable caching are here. To change the cache location, use the following command (it exits with a nonzero status, but it still works):

vdb-config --root -s /repository/user/main/public/root=<TARGET_DIR>
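
If you want to set the cache location from a setup script, the command above can be wrapped as in the sketch below. The function names are hypothetical, and the wrapper deliberately does not treat a nonzero exit status as an error, since the command is described above as not returning 0 even when it works.

```python
import subprocess


def build_vdb_config_command(target_dir):
    """Return the argument list for relocating the SRA toolkit cache."""
    return [
        "vdb-config",
        "--root",
        "-s",
        f"/repository/user/main/public/root={target_dir}",
    ]


def set_sra_cache_dir(target_dir):
    """Run vdb-config and return its exit status without raising on nonzero.

    Raises FileNotFoundError if vdb-config is not on the PATH.
    """
    # check=False: a nonzero exit status is expected here and is not an error.
    completed = subprocess.run(build_vdb_config_command(target_dir), check=False)
    return completed.returncode
```

Calling `set_sra_cache_dir("/data/sra_cache")` would run the same command as shown above with `/data/sra_cache` as the target directory.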

Installation

pip install ngstream

Building from source

Clone this repository and run:

make

Accessing Reads from SRA

import ngstream

# Use the API to stream reads within your own Python program.
with ngstream.open("SRR3618567", protocol="sra") as reader:
    for record in reader:
        # `record` is an `ngstream.api.Record` object if the data is
        # single-end, and an `ngstream.api.Fragment` object if the data
        # is paired-end.
        print(record.as_fastq())
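
Because the reader is consumed as an iterator, you can preview a run without streaming all of it. The sketch below assumes only that each record exposes `as_fastq()` as in the example above; `first_fastq_records` and `FakeRecord` are hypothetical names, and `FakeRecord` is a stand-in so the snippet runs without network access.

```python
from itertools import islice


def first_fastq_records(records, limit):
    """Yield FASTQ text for at most `limit` records from any iterable."""
    # islice stops pulling from the iterable after `limit` items, so a
    # streaming reader is never consumed further than necessary.
    for record in islice(records, limit):
        yield record.as_fastq()


class FakeRecord:
    """Hypothetical stand-in for a streamed record, for demonstration only."""

    def __init__(self, name, sequence, qualities):
        self.name = name
        self.sequence = sequence
        self.qualities = qualities

    def as_fastq(self):
        return f"@{self.name}\n{self.sequence}\n+\n{self.qualities}"


demo = [FakeRecord(f"read{i}", "ACGT", "IIII") for i in range(5)]
preview = list(first_fastq_records(demo, 2))
```

With a real reader, you would pass `reader` in place of `demo` inside the `with` block.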

Accessing Reads Using HTSGet

import ngstream
from pathlib import Path

url = 'https://era.org/hts/ABC123'
ref = ngstream.GenomeReference("GRCh37", Path("GRCh37_sizes.txt"))

with ngstream.open(url, protocol="htsget", reference=ref) as reader:
    for pair in reader:
        print("\n".join(str(read) for read in pair))
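
For background, an htsget server does not return the reads directly. It answers with a JSON "ticket" listing URLs whose contents the client fetches and concatenates in order. The sketch below illustrates that step only; it is not ngstream's implementation. The names `iter_ticket_blocks`, `assemble`, and `fake_fetch` are hypothetical, and the example URL and payloads are made up so the snippet runs offline.

```python
import base64
from urllib.parse import unquote_to_bytes


def iter_ticket_blocks(ticket, fetch):
    """Yield the byte blocks named by an htsget ticket, in order.

    `fetch(url, headers)` is a caller-supplied (hypothetical) function that
    returns the bytes found at an http(s) URL.
    """
    for entry in ticket["htsget"]["urls"]:
        url = entry["url"]
        if url.startswith("data:"):
            # Inline block: data:[<mediatype>][;base64],<payload>
            header, _, payload = url.partition(",")
            if header.endswith(";base64"):
                yield base64.b64decode(payload)
            else:
                yield unquote_to_bytes(payload)
        else:
            # Remote block: pass along any headers the ticket asks for.
            yield fetch(url, entry.get("headers", {}))


def assemble(ticket, fetch):
    """Concatenate all blocks of a ticket into one bytes object."""
    return b"".join(iter_ticket_blocks(ticket, fetch))


# Made-up ticket with one inline block and one remote block.
demo_ticket = {
    "htsget": {
        "format": "BAM",
        "urls": [
            {
                "url": "data:application/vnd.ga4gh.bam;base64,"
                + base64.b64encode(b"HEAD").decode("ascii")
            },
            {
                "url": "https://example.org/block/1",
                "headers": {"Range": "bytes=0-3"},
            },
        ],
    }
}


def fake_fetch(url, headers):
    """Offline stand-in for an HTTP GET; always returns the same bytes."""
    return b"BODY"


assembled = assemble(demo_ticket, fake_fetch)
```

A real client would replace `fake_fetch` with an HTTP request and then decode the assembled BAM or CRAM bytes, which is the work `ngstream.open` with `protocol="htsget"` takes care of.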

Dump reads to a file (or pair of files)

import ngstream

# Grab 1000 read pairs from an SRA run and write them to FASTQ files.
accession = 'SRR3618567'
with ngstream.open(accession, protocol="sra", item_limit=1000) as reader:
    files = ngstream.dump_fastq(reader)
    print(f"Wrote {reader.read_count} reads from {accession} to {files[0]}, {files[1]}")

Use the command-line tools

# Dump all reads from the ABC123 dataset to ABC123.bam in the current directory.
$ htsget_dump https://era.org/hts/ABC123

Documentation

Coming soon

Developers

  • We welcome contributions via pull requests.
  • Unit tests are highly desirable.
  • Style-wise, we enforce black code style. Please use make reformat.
  • We use Google-style docstrings, which are formatted by the Napoleon Sphinx Plugin.
  • We run pylint as part of each build and strive to maintain a 10/10 score.
  • We enforce a Code of Conduct.

Download files

Source Distribution

ngstream-0.2.2.tar.gz (38.9 kB)

Uploaded Source

Built Distribution

ngstream-0.2.2-py3-none-any.whl (33.3 kB)

Uploaded Python 3

File details

Details for the file ngstream-0.2.2.tar.gz.

File metadata

  • Download URL: ngstream-0.2.2.tar.gz
  • Upload date:
  • Size: 38.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.6.0 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.8.0

File hashes

Hashes for ngstream-0.2.2.tar.gz
Algorithm Hash digest
SHA256 f7737bcf4fff2fbca0068b3097d39837654cd53b6e320385583b997f7a633322
MD5 af69953b4fd7d7801309211ced22e10d
BLAKE2b-256 d8d3fafa8217ec6c5d648a386ab06d0a629ee90c1d7a139fc33db2d9aa8ee3c5

See more details on using hashes here.

File details

Details for the file ngstream-0.2.2-py3-none-any.whl.

File metadata

  • Download URL: ngstream-0.2.2-py3-none-any.whl
  • Upload date:
  • Size: 33.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.6.0 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.8.0

File hashes

Hashes for ngstream-0.2.2-py3-none-any.whl
Algorithm Hash digest
SHA256 9cb20e195ca96e8c2ff62c0a05b0aabc64d666b91e5d4477bf6e8f97778af5a2
MD5 29ec5e9aa46008d99fa06109b838cbba
BLAKE2b-256 805c998ba26ae2d21b50b3e69552ebe0862fff865bbbb9ac3a217bc98b0a74d9
