Python tools (backed by Rust) for sequence analysis

prseq (Python)

Python tools for sequence analysis, powered by Rust.

Overview

prseq provides Python bindings to a high-performance Rust library for FASTA and FASTQ parsing. It includes:

  • Pythonic API: Full type hints and Python-native data structures
  • CLI Tools: Ready-to-use command-line utilities
  • Rust Performance: Fast parsing with automatic compression detection
  • Memory Efficient: Streaming parsers for large files
  • Universal Input: Files, compressed files, and stdin support

The core parsing is implemented in the Rust prseq library.
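
To make "automatic compression detection" concrete, here is a minimal pure-Python sketch of one common approach: inspecting the leading magic bytes of a stream. This is an illustration only, not prseq's Rust implementation; the helper names detect_compression and open_maybe_compressed are hypothetical.

```python
import bz2
import gzip
import io

# Leading "magic" bytes that identify each compressed format.
GZIP_MAGIC = b"\x1f\x8b"
BZIP2_MAGIC = b"BZh"


def detect_compression(head: bytes) -> str:
    """Return 'gzip', 'bzip2', or 'none' based on the first bytes of a stream."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(BZIP2_MAGIC):
        return "bzip2"
    return "none"


def open_maybe_compressed(raw: io.BufferedReader):
    """Wrap a buffered binary stream in the matching decompressor (hypothetical helper)."""
    head = raw.peek(3)[:3]  # peek() looks ahead without consuming bytes
    kind = detect_compression(head)
    if kind == "gzip":
        return gzip.GzipFile(fileobj=raw, mode="rb")
    if kind == "bzip2":
        return bz2.BZ2File(raw, mode="rb")
    return raw
```

Because the check looks at content rather than the file extension, the same logic would also cover data piped through stdin.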

Installation

Using uv (recommended)

uv add prseq

Using pip

pip install prseq

From source (developers)

git clone https://github.com/VirologyCharite/prseq.git
cd prseq/python
pip install maturin
maturin develop

Quick Start

Command Line Tools

# Analyze a FASTA file
fasta-info sequences.fasta
fasta-stats sequences.fasta.gz  # Works with compressed files
fasta-filter 100 sequences.fasta  # Keep sequences ≥100bp

# Analyze a FASTQ file
fastq-info reads.fastq
fastq-stats reads.fastq.bz2
fastq-filter 50 reads.fastq  # Keep sequences ≥50bp

# All tools support stdin
cat sequences.fasta | fasta-stats
gunzip -c reads.fastq.gz | fastq-filter 75

Python API

import prseq
from pathlib import Path

# FASTA files
records = prseq.read_fasta("sequences.fasta")
for record in records:
    print(f"{record.id}: {len(record.sequence)} bp")

# FASTQ files
records = prseq.read_fastq("reads.fastq")
for record in records:
    print(f"{record.id}: {len(record.sequence)} bp, quality: {len(record.quality)}")

# Streaming for large files - accepts str, Path, file object, or None
for record in prseq.FastaReader("large.fasta"):  # String path
    if len(record.sequence) > 1000:
        print(f"Long sequence: {record.id}")

for record in prseq.FastaReader(Path("large.fasta")):  # Path object
    print(f"{record.id}")

# Read from stdin
for record in prseq.FastqReader():  # None = stdin
    print(f"Read: {record.id}")

# Read from file object (must use binary mode 'rb')
with open("sequences.fasta", "rb") as f:
    for record in prseq.FastaReader(f):
        print(f"{record.id}")
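
Since the readers are plain iterators, a whole file can be summarized in one pass without holding the records in memory. The sketch below assumes only that each record exposes .id and .sequence, as in the examples above; Record is a stand-in type so the snippet runs without prseq installed, and summarize is a hypothetical helper.

```python
from typing import Iterable, NamedTuple, Optional


class Record(NamedTuple):
    """Stand-in for a parsed record; only .id and .sequence are used."""
    id: str
    sequence: str


class Summary(NamedTuple):
    count: int
    total_bases: int
    longest_id: Optional[str]


def summarize(records: Iterable) -> Summary:
    """Consume records once, keeping only running totals in memory."""
    count = 0
    total = 0
    longest_id = None
    longest_len = -1
    for record in records:
        count += 1
        length = len(record.sequence)
        total += length
        if length > longest_len:  # first record wins on ties
            longest_len, longest_id = length, record.id
    return Summary(count, total, longest_id)
```

With prseq installed, summarize(prseq.FastaReader("large.fasta")) should behave the same way, since it relies only on the attributes shown above.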

Python API Reference

FASTA Support

from pathlib import Path
from prseq import FastaRecord, FastaReader, read_fasta

# FastaRecord - represents a single sequence
record = FastaRecord(id="seq1", sequence="ATCG")
print(record.id)        # "seq1"
print(record.sequence)  # "ATCG"

# Read all records into memory
records = read_fasta("file.fasta")
records = read_fasta("file.fasta.gz")  # Auto-detects compression
records = read_fasta(None)  # Read from stdin

# Stream records (memory efficient) - source can be:
# - str: file path
# - Path: pathlib.Path object
# - file object: open file in binary mode
# - None: read from stdin

reader = FastaReader("large.fasta")  # String path
reader = FastaReader(Path("large.fasta"))  # Path object
reader = FastaReader()  # None = stdin

with open("file.fasta", "rb") as f:  # Binary mode required
    reader = FastaReader(f)  # File object
    for record in reader:
        print(f"{record.id}: {len(record.sequence)}")

# Performance tuning
reader = FastaReader("file.fasta", sequence_size_hint=50000)
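
To illustrate what streaming from a binary file object involves, here is a small pure-Python generator that yields one record at a time. It is a sketch, not the Rust parser behind FastaReader: iter_fasta is a hypothetical name, and using the full header line (minus ">") as the identifier is an assumption of this example.

```python
import io
from typing import BinaryIO, Iterator, List, Optional, Tuple


def iter_fasta(handle: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs one at a time from a binary FASTA stream."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw_line in handle:
        line = raw_line.decode("utf-8").strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # a sequence may span several lines
        else:
            raise ValueError("sequence data found before the first '>' header")
    if header is not None:
        yield header, "".join(chunks)  # emit the final record
```

Only the record currently being assembled is held in memory, which is the property that makes streaming suitable for large files.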

FASTQ Support

from pathlib import Path
from prseq import FastqRecord, FastqReader, read_fastq

# FastqRecord - represents a single read
record = FastqRecord(id="read1", sequence="ATCG", quality="IIII")
print(record.id)        # "read1"
print(record.sequence)  # "ATCG"
print(record.quality)   # "IIII"

# Read all records into memory
records = read_fastq("reads.fastq")
records = read_fastq("reads.fastq.bz2")  # Auto-detects compression
records = read_fastq(None)  # Read from stdin

# Stream records (memory efficient) - source can be:
# - str: file path
# - Path: pathlib.Path object
# - file object: open file in binary mode
# - None: read from stdin

reader = FastqReader("large.fastq")  # String path
reader = FastqReader(Path("large.fastq"))  # Path object
reader = FastqReader()  # None = stdin

with open("reads.fastq", "rb") as f:  # Binary mode required
    reader = FastqReader(f)  # File object
    for record in reader:
        # Validate quality length matches sequence
        assert len(record.sequence) == len(record.quality)
        print(f"{record.id}: {len(record.sequence)} bp")

# Performance tuning for short/long reads
reader = FastqReader("reads.fastq", sequence_size_hint=150)  # Short reads
reader = FastqReader("nanopore.fastq", sequence_size_hint=10000)  # Long reads
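
The quality field is returned as a string, so turning it into numeric scores is left to the caller. The sketch below assumes the Phred+33 encoding (Sanger / Illumina 1.8+), which this README does not state explicitly; phred_scores and mean_quality are hypothetical helper names.

```python
from typing import List

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def phred_scores(quality: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert a quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def mean_quality(quality: str, offset: int = PHRED_OFFSET) -> float:
    """Arithmetic mean of the Phred scores, or 0.0 for an empty string."""
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores) if scores else 0.0
```

For example, mean_quality(record.quality) could be used to drop low-quality reads while iterating over a FastqReader.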

Advanced Usage

import prseq

# Filter sequences by length
def filter_by_length(filename, min_length):
    for record in prseq.FastaReader(filename):
        if len(record.sequence) >= min_length:
            yield record

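# Calculate GC content as a fraction between 0 and 1
def gc_content(sequence):
    if not sequence:
        return 0.0
    upper = sequence.upper()  # Uppercase once so lowercase (soft-masked) bases count
    return (upper.count('G') + upper.count('C')) / len(sequence)

# Process compressed files
records = prseq.read_fasta("sequences.fasta.gz")
# Mean of the per-sequence GC fractions; guard against an empty file
avg_gc = sum(gc_content(r.sequence) for r in records) / len(records) if records else 0.0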

# Convert FASTQ to FASTA
def fastq_to_fasta(fastq_file, fasta_file):
    with open(fasta_file, 'w') as f:
        for record in prseq.FastqReader(fastq_file):
            f.write(f">{record.id}\n{record.sequence}\n")
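
Another statistic that is easy to build on top of the readers is N50, sketched below. The function only needs sequence lengths, so it runs without prseq; n50 is a hypothetical helper and is not claimed to match the output of the CLI tools.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50: the largest length L such that sequences of
    length >= L together cover at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # no sequences given
```

A typical call would be n50(len(r.sequence) for r in prseq.FastaReader(filename)), which keeps only the lengths in memory rather than the sequences.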

CLI Tools

FASTA Tools

Command       Description                    Example
fasta-info    Show basic file information    fasta-info sequences.fasta
fasta-stats   Calculate sequence statistics  fasta-stats sequences.fasta.gz
fasta-filter  Filter by minimum length       fasta-filter 100 sequences.fasta

FASTQ Tools

Command       Description                    Example
fastq-info    Show basic file information    fastq-info reads.fastq
fastq-stats   Calculate sequence statistics  fastq-stats reads.fastq.bz2
fastq-filter  Filter by minimum length       fastq-filter 50 reads.fastq

CLI Examples

# Basic usage
fasta-info genome.fasta
fastq-stats reads.fastq

# With compressed files (auto-detected)
fasta-stats sequences.fasta.gz
fastq-info reads.fastq.bz2

# Using stdin (great for pipelines)
cat sequences.fasta | fasta-stats
gunzip -c reads.fastq.gz | fastq-filter 100

# Performance tuning for large sequences
fasta-stats --size-hint 50000 genome.fasta
fastq-filter --size-hint 10000 150 nanopore.fastq
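
When driving the filter tools from a Python pipeline, it can help to build the argument list in one place. The sketch below mirrors the argument order used in the examples above (optional --size-hint, then the minimum length, then an optional file). filter_command is a hypothetical helper, and that fasta-filter accepts --size-hint is an assumption, since the examples only show the flag with fasta-stats and fastq-filter.

```python
from typing import List, Optional


def filter_command(tool: str, min_length: int, path: Optional[str] = None,
                   size_hint: Optional[int] = None) -> List[str]:
    """Build an argument list for fasta-filter or fastq-filter."""
    if tool not in ("fasta-filter", "fastq-filter"):
        raise ValueError(f"unknown tool: {tool}")
    args = [tool]
    if size_hint is not None:
        args += ["--size-hint", str(size_hint)]
    args.append(str(min_length))
    if path is not None:
        args.append(path)  # leave the path out to read from stdin
    return args
```

The resulting list can be passed to subprocess.run(args, check=True), provided the tool is installed and on PATH.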

Development

Prerequisites

  • Python 3.8-3.12
  • Rust 1.70+
  • maturin for building Python extensions

Setup

cd python
pip install maturin
maturin develop

Testing

# Run all tests
python -m pytest tests/ -v

# Run integration tests
python -m pytest tests/ -v --integration

# Type checking with MyPy
mypy src/prseq

Building

# Development build
maturin develop

# Production wheel
maturin build --release

Publishing

cd python
maturin publish

Type Checking

The package ships full type hints and is configured for mypy with Python 3.8+ compatibility. Type stubs for the Rust extension modules are generated automatically.
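
As an illustration of how typed downstream code might look, the sketch below annotates a helper with a structural Protocol so that it accepts any record type exposing .id and .sequence. HasSequence, at_least and DemoRecord are hypothetical names for this example and are not part of prseq.

```python
from typing import Iterable, Iterator, NamedTuple, Protocol, TypeVar


class HasSequence(Protocol):
    """Structural type: anything with read-only .id and .sequence strings."""

    @property
    def id(self) -> str: ...

    @property
    def sequence(self) -> str: ...


R = TypeVar("R", bound=HasSequence)


def at_least(records: Iterable[R], min_length: int) -> Iterator[R]:
    """Lazily yield the records whose sequence has at least min_length characters."""
    for record in records:
        if len(record.sequence) >= min_length:
            yield record


class DemoRecord(NamedTuple):
    """Stand-in record used to exercise the helper without prseq installed."""
    id: str
    sequence: str
```

Because the TypeVar is bound to the Protocol, a type checker can infer that filtering an iterable of one record type yields that same type.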

Rust Core

The Python package is built on top of the Rust prseq library, which provides the high-performance parsing implementation. If you need Rust-native parsing without Python, check out the Rust crate directly.

License

This project is licensed under the MIT License - see the LICENSE file for details.
