Python tools (backed by Rust) for sequence analysis

prseq (Python)

Python tools for sequence analysis, powered by Rust.

Overview

prseq provides Python bindings to a high-performance Rust library for FASTA and FASTQ parsing. It includes:

  • Pythonic API: Full type hints and Python-native data structures
  • CLI Tools: Ready-to-use command-line utilities
  • Rust Performance: Fast parsing with automatic compression detection
  • Memory Efficient: Streaming parsers for large files
  • Universal Input: Files, compressed files, and stdin support

The core parsing is implemented in the Rust prseq library.
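
Compression detection of this kind is commonly done by inspecting the first bytes of the input rather than trusting the file extension. The sketch below illustrates that idea in pure Python for the two formats used in the examples (gzip and bzip2). It is not prseq's Rust implementation, and `sniff_compression` and `open_maybe_compressed` are hypothetical helper names.

```python
import bz2
import gzip
from typing import BinaryIO

# Magic numbers that begin gzip and bzip2 streams.
GZIP_MAGIC = b"\x1f\x8b"
BZIP2_MAGIC = b"BZh"


def sniff_compression(head: bytes) -> str:
    """Return 'gzip', 'bzip2', or 'none' based on the leading bytes."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(BZIP2_MAGIC):
        return "bzip2"
    return "none"


def open_maybe_compressed(raw: BinaryIO):
    """Wrap a binary stream in the matching decompressor, if any.

    The stream must be seekable so the sniffed bytes can be re-read.
    """
    head = raw.read(3)
    raw.seek(0)
    kind = sniff_compression(head)
    if kind == "gzip":
        return gzip.GzipFile(fileobj=raw)
    if kind == "bzip2":
        return bz2.BZ2File(raw)
    return raw
```

Sniffing the content means a gzip file with a misleading name is still handled correctly. The seek requirement is a limitation of this sketch only; a pipe such as stdin would need buffered peeking instead.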

Installation

Using uv (recommended)

uv add prseq

Using pip

pip install prseq

From source (developers)

git clone https://github.com/VirologyCharite/prseq.git
cd prseq/python
pip install maturin
maturin develop

Quick Start

Command Line Tools

# Analyze a FASTA file
fasta-info sequences.fasta
fasta-stats sequences.fasta.gz  # Works with compressed files
fasta-filter 100 sequences.fasta  # Keep sequences ≥100bp

# Analyze a FASTQ file
fastq-info reads.fastq
fastq-stats reads.fastq.bz2
fastq-filter 50 reads.fastq  # Keep sequences ≥50bp

# All tools support stdin
cat sequences.fasta | fasta-stats
gunzip -c reads.fastq.gz | fastq-filter 75

Python API

import prseq
from pathlib import Path

# FASTA files
records = prseq.read_fasta("sequences.fasta")
for record in records:
    print(f"{record.id}: {len(record.sequence)} bp")

# FASTQ files
records = prseq.read_fastq("reads.fastq")
for record in records:
    print(f"{record.id}: {len(record.sequence)} bp, quality: {len(record.quality)}")

# Streaming for large files - accepts str, Path, file object, or None
for record in prseq.FastaReader("large.fasta"):  # String path
    if len(record.sequence) > 1000:
        print(f"Long sequence: {record.id}")

for record in prseq.FastaReader(Path("large.fasta")):  # Path object
    print(f"{record.id}")

# Read from stdin
for record in prseq.FastqReader():  # None = stdin
    print(f"Read: {record.id}")

# Read from file object (must use binary mode 'rb')
with open("sequences.fasta", "rb") as f:
    for record in prseq.FastaReader(f):
        print(f"{record.id}")
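
Because the readers are iterated record by record, summary statistics can be accumulated in a single pass without keeping the records in memory. The following is a minimal sketch that assumes only that each record exposes a `sequence` attribute, as in the examples above. `length_summary` is a hypothetical helper, and the `Rec` namedtuple merely stands in for prseq records so the snippet runs on its own.

```python
from collections import namedtuple


def length_summary(records):
    """Single-pass count, total, min, max and mean of sequence lengths."""
    count = total = 0
    shortest = longest = None
    for record in records:
        n = len(record.sequence)
        count += 1
        total += n
        shortest = n if shortest is None else min(shortest, n)
        longest = n if longest is None else max(longest, n)
    mean = total / count if count else 0.0
    return {"count": count, "total": total, "min": shortest,
            "max": longest, "mean": mean}


# Stand-in records for illustration only.
Rec = namedtuple("Rec", ["id", "sequence"])
demo = [Rec("a", "ACGT"), Rec("b", "ACGTACGT"), Rec("c", "AC")]
print(length_summary(demo))
```

With prseq installed, the same function should accept `prseq.FastaReader("large.fasta")` in place of `demo`, since it only iterates once and reads `record.sequence`.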

Python API Reference

FASTA Support

from pathlib import Path
from prseq import FastaRecord, FastaReader, read_fasta

# FastaRecord - represents a single sequence
record = FastaRecord(id="seq1", sequence="ATCG")
print(record.id)        # "seq1"
print(record.sequence)  # "ATCG"

# Read all records into memory
records = read_fasta("file.fasta")
records = read_fasta("file.fasta.gz")  # Auto-detects compression
records = read_fasta(None)  # Read from stdin

# Stream records (memory efficient) - source can be:
# - str: file path
# - Path: pathlib.Path object
# - file object: open file in binary mode
# - None: read from stdin

reader = FastaReader("large.fasta")  # String path
reader = FastaReader(Path("large.fasta"))  # Path object
reader = FastaReader()  # None = stdin

with open("file.fasta", "rb") as f:  # Binary mode required
    reader = FastaReader(f)  # File object
    for record in reader:
        print(f"{record.id}: {len(record.sequence)}")

# Performance tuning
reader = FastaReader("file.fasta", sequence_size_hint=50000)

FASTQ Support

from pathlib import Path
from prseq import FastqRecord, FastqReader, read_fastq

# FastqRecord - represents a single read
record = FastqRecord(id="read1", sequence="ATCG", quality="IIII")
print(record.id)        # "read1"
print(record.sequence)  # "ATCG"
print(record.quality)   # "IIII"

# Read all records into memory
records = read_fastq("reads.fastq")
records = read_fastq("reads.fastq.bz2")  # Auto-detects compression
records = read_fastq(None)  # Read from stdin

# Stream records (memory efficient) - source can be:
# - str: file path
# - Path: pathlib.Path object
# - file object: open file in binary mode
# - None: read from stdin

reader = FastqReader("large.fastq")  # String path
reader = FastqReader(Path("large.fastq"))  # Path object
reader = FastqReader()  # None = stdin

with open("reads.fastq", "rb") as f:  # Binary mode required
    reader = FastqReader(f)  # File object
    for record in reader:
        # Validate quality length matches sequence
        assert len(record.sequence) == len(record.quality)
        print(f"{record.id}: {len(record.sequence)} bp")

# Performance tuning for short/long reads
reader = FastqReader("reads.fastq", sequence_size_hint=150)  # Short reads
reader = FastqReader("nanopore.fastq", sequence_size_hint=10000)  # Long reads
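
A FASTQ quality string stores one ASCII character per base, so it usually needs decoding before it can be used numerically. The sketch below assumes the common Phred+33 encoding, which this README does not confirm for any particular input, and assumes `record.quality` is a plain string as in the `FastqRecord` example. The function names are hypothetical and not part of prseq.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer Phred scores (Phred+33 by default)."""
    return [ord(ch) - offset for ch in quality]


def mean_quality(quality: str, offset: int = 33) -> float:
    """Arithmetic mean of the decoded scores; 0.0 for an empty string."""
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores) if scores else 0.0


def error_probability(score: float) -> float:
    """Convert a Phred score Q into an error probability, 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


print(phred_scores("IIII"))  # 'I' is ASCII 73, so each score is 73 - 33 = 40
```

A read filter could then keep records where `mean_quality(record.quality)` exceeds a chosen threshold. Older Illumina data may use an offset of 64, in which case `offset=64` would be the appropriate argument.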

Advanced Usage

import prseq

# Filter sequences by length
def filter_by_length(filename, min_length):
    for record in prseq.FastaReader(filename):
        if len(record.sequence) >= min_length:
            yield record

# Calculate GC content as a fraction between 0.0 and 1.0
def gc_content(sequence):
    if not sequence:
        return 0.0
    upper = sequence.upper()  # Uppercase once so lowercase bases are counted too
    return (upper.count('G') + upper.count('C')) / len(sequence)

# Process compressed files
records = prseq.read_fasta("sequences.fasta.gz")
# Guard against an empty file to avoid dividing by zero
avg_gc = sum(gc_content(r.sequence) for r in records) / len(records) if records else 0.0

# Convert FASTQ to FASTA
def fastq_to_fasta(fastq_file, fasta_file):
    with open(fasta_file, 'w') as f:
        for record in prseq.FastqReader(fastq_file):
            f.write(f">{record.id}\n{record.sequence}\n")
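
The converter above writes each sequence on a single line. Where wrapped sequence lines are preferred, a small formatter can be used instead. This is a sketch rather than part of prseq: `format_fasta` is a hypothetical helper and the default width of 60 is an arbitrary choice.

```python
def format_fasta(record_id: str, sequence: str, width: int = 60) -> str:
    """Return one FASTA record with the sequence wrapped at `width` columns."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Slice the sequence into fixed-width chunks; the last one may be shorter.
    chunks = (sequence[i:i + width] for i in range(0, len(sequence), width))
    body = "".join(chunk + "\n" for chunk in chunks)
    return ">" + record_id + "\n" + body


print(format_fasta("read1", "ACGTACGTAC", width=4), end="")
```

Inside `fastq_to_fasta`, the `f.write(...)` call could be replaced with `f.write(format_fasta(record.id, record.sequence))` to produce wrapped output.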

CLI Tools

FASTA Tools

| Command | Description | Example |
|---|---|---|
| fasta-info | Show basic file information | fasta-info sequences.fasta |
| fasta-stats | Calculate sequence statistics | fasta-stats sequences.fasta.gz |
| fasta-filter | Filter by minimum length | fasta-filter 100 sequences.fasta |

FASTQ Tools

| Command | Description | Example |
|---|---|---|
| fastq-info | Show basic file information | fastq-info reads.fastq |
| fastq-stats | Calculate sequence statistics | fastq-stats reads.fastq.bz2 |
| fastq-filter | Filter by minimum length | fastq-filter 50 reads.fastq |

CLI Examples

# Basic usage
fasta-info genome.fasta
fastq-stats reads.fastq

# With compressed files (auto-detected)
fasta-stats sequences.fasta.gz
fastq-info reads.fastq.bz2

# Using stdin (great for pipelines)
cat sequences.fasta | fasta-stats
gunzip -c reads.fastq.gz | fastq-filter 100

# Performance tuning for large sequences
fasta-stats --size-hint 50000 genome.fasta
fastq-filter --size-hint 10000 150 nanopore.fastq
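
The same commands can be driven from a Python script. The sketch below builds the argument list in the order shown in the examples above (`--size-hint` first, then the minimum length, then an optional file) and runs it with `subprocess`. It assumes the tools are installed on `PATH` and write the filtered records to standard output, which this README implies but does not state. `filter_command` and `run_filter` are hypothetical names.

```python
import subprocess
from typing import List, Optional


def filter_command(tool: str, min_length: int,
                   path: Optional[str] = None,
                   size_hint: Optional[int] = None) -> List[str]:
    """Build the argv list for fasta-filter or fastq-filter."""
    if tool not in ("fasta-filter", "fastq-filter"):
        raise ValueError("unknown tool: " + tool)
    argv = [tool]
    if size_hint is not None:
        argv += ["--size-hint", str(size_hint)]
    argv.append(str(min_length))
    if path is not None:
        argv.append(path)  # Omit the path to make the tool read stdin
    return argv


def run_filter(argv: List[str], stdin_data: Optional[bytes] = None) -> bytes:
    """Run the command and return its stdout; raises if the tool exits non-zero."""
    result = subprocess.run(argv, input=stdin_data,
                            stdout=subprocess.PIPE, check=True)
    return result.stdout


print(filter_command("fastq-filter", 150, "nanopore.fastq", size_hint=10000))
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.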

Development

Prerequisites

  • Python 3.8-3.12
  • Rust 1.70+
  • maturin for building Python extensions

Setup

cd python
pip install maturin
maturin develop

Testing

# Run all tests
python -m pytest tests/ -v

# Run integration tests
python -m pytest tests/ -v --integration

# Type checking with MyPy
mypy src/prseq

Building

# Development build
maturin develop

# Production wheel
maturin build --release

Publishing

cd python
maturin publish

Type Checking

The package includes full type hints and is configured for MyPy with Python 3.8+ compatibility. Type stubs are automatically generated for the Rust extension modules.
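
Type hints are most useful when downstream helpers are annotated as well. One way to write a helper that accepts both FASTA and FASTQ records is a structural `Protocol`, sketched below. The `SequenceRecord` protocol, `longest_record` and `DemoRecord` are hypothetical and not part of prseq. The sketch assumes only that records expose `id` and `sequence` as strings, as in the examples above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Protocol


class SequenceRecord(Protocol):
    """Structural type: anything with readable string `id` and `sequence`."""

    @property
    def id(self) -> str: ...

    @property
    def sequence(self) -> str: ...


def longest_record(records: Iterable[SequenceRecord]) -> Optional[SequenceRecord]:
    """Return the record with the longest sequence, or None if there are none."""
    best: Optional[SequenceRecord] = None
    for record in records:
        if best is None or len(record.sequence) > len(best.sequence):
            best = record  # On ties the earliest record is kept
    return best


@dataclass
class DemoRecord:
    """Stand-in record used only to make this example self-contained."""
    id: str
    sequence: str


print(longest_record([DemoRecord("a", "AC"), DemoRecord("b", "ACGT")]))
```

Declaring the protocol members as read-only properties lets a type checker accept classes that expose either plain attributes or getters.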

Rust Core

The Python package is built on top of the Rust prseq library, which provides the high-performance parsing implementation. If you need Rust-native parsing without Python, check out the Rust crate directly.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

| File | Size | Interpreter | Platform |
|---|---|---|---|
| prseq-0.0.28-pp311-pypy311_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 394.9 kB | PyPy | manylinux: glibc 2.17+ x86-64 |
| prseq-0.0.28-cp313-cp313-win_amd64.whl | 240.8 kB | CPython 3.13 | Windows x86-64 |
| prseq-0.0.28-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 394.9 kB | CPython 3.13 | manylinux: glibc 2.17+ x86-64 |
| prseq-0.0.28-cp313-cp313-macosx_11_0_arm64.whl | 331.7 kB | CPython 3.13 | macOS 11.0+ ARM64 |
| prseq-0.0.28-cp312-cp312-win_amd64.whl | 241.0 kB | CPython 3.12 | Windows x86-64 |
| prseq-0.0.28-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 394.7 kB | CPython 3.12 | manylinux: glibc 2.17+ x86-64 |
| prseq-0.0.28-cp312-cp312-macosx_11_0_arm64.whl | 332.0 kB | CPython 3.12 | macOS 11.0+ ARM64 |
| prseq-0.0.28-cp311-cp311-win_amd64.whl | 240.4 kB | CPython 3.11 | Windows x86-64 |
| prseq-0.0.28-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 395.0 kB | CPython 3.11 | manylinux: glibc 2.17+ x86-64 |
| prseq-0.0.28-cp311-cp311-macosx_11_0_arm64.whl | 335.1 kB | CPython 3.11 | macOS 11.0+ ARM64 |
| prseq-0.0.28-cp310-cp310-win_amd64.whl | 240.3 kB | CPython 3.10 | Windows x86-64 |
| prseq-0.0.28-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 394.8 kB | CPython 3.10 | manylinux: glibc 2.17+ x86-64 |
| prseq-0.0.28-cp310-cp310-macosx_11_0_arm64.whl | 335.2 kB | CPython 3.10 | macOS 11.0+ ARM64 |

