Python tools (backed by Rust) for sequence analysis

prseq (Python)

Python tools for sequence analysis, powered by Rust.

Overview

prseq provides Python bindings to a high-performance Rust library for FASTA and FASTQ parsing. It includes:

  • Pythonic API: Full type hints and Python-native data structures
  • CLI Tools: Ready-to-use command-line utilities
  • Rust Performance: Fast parsing with automatic compression detection
  • Memory Efficient: Streaming parsers for large files
  • Universal Input: Files, compressed files, and stdin support

The core parsing is implemented in the Rust prseq library.
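
The overview does not say how compression is detected. A common approach, shown here as a minimal sketch and not as prseq's actual implementation, is to inspect the leading bytes of the input for the gzip and bzip2 magic numbers. The names `detect_compression` and `open_maybe_compressed` are hypothetical and exist only for this illustration.

```python
import bz2
import gzip
import io
from typing import IO

GZIP_MAGIC = b"\x1f\x8b"  # First two bytes of every gzip stream
BZIP2_MAGIC = b"BZh"      # First three bytes of every bzip2 stream


def detect_compression(head: bytes) -> str:
    """Return 'gzip', 'bzip2', or 'none' based on leading magic bytes."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(BZIP2_MAGIC):
        return "bzip2"
    return "none"


def open_maybe_compressed(raw: bytes) -> IO[bytes]:
    """Wrap raw bytes in a reader that decompresses when needed."""
    kind = detect_compression(raw[:3])
    stream = io.BytesIO(raw)
    if kind == "gzip":
        return gzip.GzipFile(fileobj=stream)
    if kind == "bzip2":
        return bz2.BZ2File(stream)
    return stream


plain = b">seq1\nATCG\n"
print(detect_compression(gzip.compress(plain)))
print(open_maybe_compressed(bz2.compress(plain)).read() == plain)
```

Detecting by content instead of by file extension is what lets the same logic work for stdin, where no file name is available.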

Installation

Using uv (recommended)

uv add prseq

Using pip

pip install prseq

From source (developers)

git clone https://github.com/VirologyCharite/prseq.git
cd prseq/python
pip install maturin
maturin develop

Quick Start

Command Line Tools

# Analyze a FASTA file
fasta-info sequences.fasta
fasta-stats sequences.fasta.gz  # Works with compressed files
fasta-filter 100 sequences.fasta  # Keep sequences ≥100bp

# Analyze a FASTQ file
fastq-info reads.fastq
fastq-stats reads.fastq.bz2
fastq-filter 50 reads.fastq  # Keep sequences ≥50bp

# All tools support stdin
cat sequences.fasta | fasta-stats
gunzip -c reads.fastq.gz | fastq-filter 75

Python API

import prseq
from pathlib import Path

# FASTA files
records = prseq.read_fasta("sequences.fasta")
for record in records:
    print(f"{record.id}: {len(record.sequence)} bp")

# FASTQ files
records = prseq.read_fastq("reads.fastq")
for record in records:
    print(f"{record.id}: {len(record.sequence)} bp, quality: {len(record.quality)}")

# Streaming for large files - accepts str, Path, file object, or None
for record in prseq.FastaReader("large.fasta"):  # String path
    if len(record.sequence) > 1000:
        print(f"Long sequence: {record.id}")

for record in prseq.FastaReader(Path("large.fasta")):  # Path object
    print(f"{record.id}")

# Read from stdin
for record in prseq.FastqReader():  # None = stdin
    print(f"Read: {record.id}")

# Read from file object (must use binary mode 'rb')
with open("sequences.fasta", "rb") as f:
    for record in prseq.FastaReader(f):
        print(f"{record.id}")

Python API Reference

FASTA Support

from pathlib import Path
from prseq import FastaRecord, FastaReader, read_fasta

# FastaRecord - represents a single sequence
record = FastaRecord(id="seq1", sequence="ATCG")
print(record.id)        # "seq1"
print(record.sequence)  # "ATCG"

# Read all records into memory
records = read_fasta("file.fasta")
records = read_fasta("file.fasta.gz")  # Auto-detects compression
records = read_fasta(None)  # Read from stdin

# Stream records (memory efficient) - source can be:
# - str: file path
# - Path: pathlib.Path object
# - file object: open file in binary mode
# - None: read from stdin

reader = FastaReader("large.fasta")  # String path
reader = FastaReader(Path("large.fasta"))  # Path object
reader = FastaReader()  # None = stdin

with open("file.fasta", "rb") as f:  # Binary mode required
    reader = FastaReader(f)  # File object
    for record in reader:
        print(f"{record.id}: {len(record.sequence)}")

# Performance tuning
reader = FastaReader("file.fasta", sequence_size_hint=50000)
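
To make the difference between `read_fasta` and `FastaReader` concrete, here is a pure-Python sketch of what streaming means: records are yielded one at a time, so only the current record is held in memory. This is an illustration only, not prseq's Rust implementation. `Record` and `iter_fasta` are hypothetical names, and treating the whole header line after `>` as the id is an assumption.

```python
import io
from typing import BinaryIO, Iterator, List, NamedTuple, Optional


class Record(NamedTuple):
    id: str
    sequence: str


def iter_fasta(handle: BinaryIO) -> Iterator[Record]:
    """Yield FASTA records lazily from a file object opened in binary mode."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw_line in handle:
        line = raw_line.decode("ascii").strip()
        if not line:
            continue  # Skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield Record(header, "".join(chunks))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped; collect them until the next header
            chunks.append(line)
    if header is not None:
        yield Record(header, "".join(chunks))


data = io.BytesIO(b">seq1\nATCG\nGGCC\n>seq2\nTTAA\n")
for record in iter_fasta(data):
    print(f"{record.id}: {len(record.sequence)} bp")
```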

FASTQ Support

from pathlib import Path
from prseq import FastqRecord, FastqReader, read_fastq

# FastqRecord - represents a single read
record = FastqRecord(id="read1", sequence="ATCG", quality="IIII")
print(record.id)        # "read1"
print(record.sequence)  # "ATCG"
print(record.quality)   # "IIII"

# Read all records into memory
records = read_fastq("reads.fastq")
records = read_fastq("reads.fastq.bz2")  # Auto-detects compression
records = read_fastq(None)  # Read from stdin

# Stream records (memory efficient) - source can be:
# - str: file path
# - Path: pathlib.Path object
# - file object: open file in binary mode
# - None: read from stdin

reader = FastqReader("large.fastq")  # String path
reader = FastqReader(Path("large.fastq"))  # Path object
reader = FastqReader()  # None = stdin

with open("reads.fastq", "rb") as f:  # Binary mode required
    reader = FastqReader(f)  # File object
    for record in reader:
        # Validate quality length matches sequence
        assert len(record.sequence) == len(record.quality)
        print(f"{record.id}: {len(record.sequence)} bp")

# Performance tuning for short/long reads
reader = FastqReader("reads.fastq", sequence_size_hint=150)  # Short reads
reader = FastqReader("nanopore.fastq", sequence_size_hint=10000)  # Long reads
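
The `quality` attribute above is a string with one character per base. The sketch below shows one way to turn it into numeric scores, assuming the Phred+33 encoding used by most current sequencers; the README does not state which encoding prseq expects. The helper names `phred_scores` and `mean_error_probability` are hypothetical.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into per-base Phred scores."""
    return [ord(ch) - offset for ch in quality]


def mean_error_probability(quality: str, offset: int = 33) -> float:
    """Average per-base error probability, where P = 10 ** (-Q / 10)."""
    scores = phred_scores(quality, offset)
    if not scores:
        return 0.0
    return sum(10 ** (-q / 10) for q in scores) / len(scores)


print(phred_scores("IIII"))  # [40, 40, 40, 40]
```

Averaging error probabilities instead of raw scores avoids overstating the quality of reads that mix very high and very low scores.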

Advanced Usage

import prseq

# Filter sequences by length
def filter_by_length(filename, min_length):
    for record in prseq.FastaReader(filename):
        if len(record.sequence) >= min_length:
            yield record

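# Calculate GC content (fraction of G and C bases, case-insensitive)
def gc_content(sequence):
    if not sequence:
        return 0.0
    upper = sequence.upper()  # Normalize once so lowercase bases are counted
    return (upper.count('G') + upper.count('C')) / len(sequence)

# Process compressed files
records = prseq.read_fasta("sequences.fasta.gz")
# Guard against an empty file to avoid dividing by zero
avg_gc = sum(gc_content(r.sequence) for r in records) / len(records) if records else 0.0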

# Convert FASTQ to FASTA
def fastq_to_fasta(fastq_file, fasta_file):
    with open(fasta_file, 'w') as f:
        for record in prseq.FastqReader(fastq_file):
            f.write(f">{record.id}\n{record.sequence}\n")
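
The conversion above writes each sequence on a single line. Long sequences are often wrapped to a fixed width instead. The helper below is a small sketch of that variation; `format_fasta` and its default width of 60 are illustrative choices, not part of prseq.

```python
def format_fasta(record_id: str, sequence: str, width: int = 60) -> str:
    """Format one FASTA record, wrapping the sequence at `width` columns."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Slice the sequence into consecutive chunks of at most `width` characters
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{record_id}\n" + "".join(line + "\n" for line in lines)


print(format_fasta("seq1", "ATCGATCGAT", width=4), end="")
```

Inside `fastq_to_fasta`, the `f.write(...)` call could pass `format_fasta(record.id, record.sequence)` to get wrapped output.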

CLI Tools

FASTA Tools

Command       Description                    Example
fasta-info    Show basic file information    fasta-info sequences.fasta
fasta-stats   Calculate sequence statistics  fasta-stats sequences.fasta.gz
fasta-filter  Filter by minimum length       fasta-filter 100 sequences.fasta
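
The table does not list which statistics `fasta-stats` reports. As an illustration of the kind of summary such a tool typically produces, the sketch below computes count, total, minimum, maximum, mean, and N50 from a collection of sequence lengths. The function `length_stats` and its choice of metrics are assumptions, not a description of the CLI's output.

```python
from typing import Dict, Iterable


def length_stats(lengths: Iterable[int]) -> Dict[str, float]:
    """Summarize sequence lengths, including N50."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        return {"count": 0, "total": 0, "min": 0, "max": 0, "mean": 0.0, "n50": 0}
    total = sum(ordered)
    # N50: walking from the longest sequence down, the length at which
    # the running sum first reaches at least half of the total
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "count": len(ordered),
        "total": total,
        "min": ordered[-1],
        "max": ordered[0],
        "mean": total / len(ordered),
        "n50": n50,
    }


print(length_stats([2, 3, 4, 5, 6]))
```

With prseq installed, the lengths could be supplied lazily, for example `length_stats(len(r.sequence) for r in prseq.FastaReader("sequences.fasta"))`.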

FASTQ Tools

Command       Description                    Example
fastq-info    Show basic file information    fastq-info reads.fastq
fastq-stats   Calculate sequence statistics  fastq-stats reads.fastq.bz2
fastq-filter  Filter by minimum length       fastq-filter 50 reads.fastq

CLI Examples

# Basic usage
fasta-info genome.fasta
fastq-stats reads.fastq

# With compressed files (auto-detected)
fasta-stats sequences.fasta.gz
fastq-info reads.fastq.bz2

# Using stdin (great for pipelines)
cat sequences.fasta | fasta-stats
gunzip -c reads.fastq.gz | fastq-filter 100

# Performance tuning for large sequences
fasta-stats --size-hint 50000 genome.fasta
fastq-filter --size-hint 10000 150 nanopore.fastq

Development

Prerequisites

  • Python 3.8-3.12
  • Rust 1.70+
  • maturin for building Python extensions

Setup

cd python
pip install maturin
maturin develop

Testing

# Run all tests
python -m pytest tests/ -v

# Run integration tests
python -m pytest tests/ -v --integration

# Type checking with MyPy
mypy src/prseq

Building

# Development build
maturin develop

# Production wheel
maturin build --release

Publishing

cd python
maturin publish

Type Checking

The package includes full type hints and is configured for MyPy with Python 3.8+ compatibility. Type stubs are automatically generated for the Rust extension modules.
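
One way downstream code can benefit from the type hints is to annotate helpers against the record attributes shown in this README (`id` and `sequence`). The sketch below uses a `typing.Protocol`, available from Python 3.8, so the same helper accepts any object with those attributes. `HasSequence`, `ids_at_least`, and `DemoRecord` are hypothetical names, and whether prseq's record classes satisfy the protocol under a static checker depends on the shipped stubs, which is an assumption here.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Protocol


class HasSequence(Protocol):
    """Structural type: anything with string `id` and `sequence` attributes."""
    id: str
    sequence: str


def ids_at_least(records: Iterable[HasSequence], min_length: int) -> Iterator[str]:
    """Yield ids of records whose sequence has at least `min_length` bases."""
    for record in records:
        if len(record.sequence) >= min_length:
            yield record.id


@dataclass
class DemoRecord:
    # Stand-in record so the example runs without prseq installed
    id: str
    sequence: str


demo = [DemoRecord("seq1", "ATCG"), DemoRecord("seq2", "AT")]
print(list(ids_at_least(demo, 3)))  # ['seq1']
```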

Rust Core

The Python package is built on top of the Rust prseq library, which provides the high-performance parsing implementation. If you need Rust-native parsing without Python, check out the Rust crate directly.

License

This project is licensed under the MIT License - see the LICENSE file for details.

