prseq (Python)

Python tools for sequence analysis, powered by Rust.

Overview

prseq provides Python bindings to a high-performance Rust library for FASTA and FASTQ parsing. It includes:

  • Pythonic API: Full type hints and Python-native data structures
  • CLI Tools: Ready-to-use command-line utilities
  • Rust Performance: Fast parsing with automatic compression detection
  • Memory Efficient: Streaming parsers for large files
  • Universal Input: Files, compressed files, and stdin support

The core parsing is implemented in the Rust prseq library.
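
Automatic compression detection is commonly done by inspecting the first bytes of the input instead of trusting the file extension. The sketch below illustrates that general idea in pure Python for the two formats used in this README's examples (gzip and bzip2). It is not prseq's Rust implementation, and `sniff_compression` and `open_maybe_compressed` are hypothetical names introduced only for this illustration.

```python
import bz2
import gzip
import io

# Leading "magic" bytes that identify each container format.
GZIP_MAGIC = b"\x1f\x8b"
BZIP2_MAGIC = b"BZh"


def sniff_compression(head: bytes) -> str:
    """Return 'gzip', 'bzip2', or 'none' based on the first bytes of a stream."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(BZIP2_MAGIC):
        return "bzip2"
    return "none"


def open_maybe_compressed(raw: io.BufferedReader):
    """Wrap a buffered binary stream in the matching decompressor (hypothetical helper)."""
    head = raw.peek(3)[:3]  # peek() does not advance the stream position
    kind = sniff_compression(head)
    if kind == "gzip":
        return gzip.open(raw, "rb")
    if kind == "bzip2":
        return bz2.open(raw, "rb")
    return raw  # Treat anything else as uncompressed text
```

Because the check looks at content rather than the file name, the same approach also works for data arriving on stdin, where no extension is available.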

Installation

Using uv (recommended)

uv add prseq

Using pip

pip install prseq

From source (developers)

git clone https://github.com/VirologyCharite/prseq.git
cd prseq/python
pip install maturin
maturin develop

Quick Start

Command Line Tools

# Analyze a FASTA file
fasta-info sequences.fasta
fasta-stats sequences.fasta.gz  # Works with compressed files
fasta-filter 100 sequences.fasta  # Keep sequences ≥100bp

# Analyze a FASTQ file
fastq-info reads.fastq
fastq-stats reads.fastq.bz2
fastq-filter 50 reads.fastq  # Keep sequences ≥50bp

# All tools support stdin
cat sequences.fasta | fasta-stats
gunzip -c reads.fastq.gz | fastq-filter 75

Python API

import prseq
from pathlib import Path

# FASTA files
records = prseq.read_fasta("sequences.fasta")
for record in records:
    print(f"{record.id}: {len(record.sequence)} bp")

# FASTQ files
records = prseq.read_fastq("reads.fastq")
for record in records:
    print(f"{record.id}: {len(record.sequence)} bp, quality: {len(record.quality)}")

# Streaming for large files - accepts str, Path, file object, or None
for record in prseq.FastaReader("large.fasta"):  # String path
    if len(record.sequence) > 1000:
        print(f"Long sequence: {record.id}")

for record in prseq.FastaReader(Path("large.fasta")):  # Path object
    print(f"{record.id}")

# Read from stdin
for record in prseq.FastqReader():  # None = stdin
    print(f"Read: {record.id}")

# Read from file object (must use binary mode 'rb')
with open("sequences.fasta", "rb") as f:
    for record in prseq.FastaReader(f):
        print(f"{record.id}")

Python API Reference

FASTA Support

from pathlib import Path
from prseq import FastaRecord, FastaReader, read_fasta

# FastaRecord - represents a single sequence
record = FastaRecord(id="seq1", sequence="ATCG")
print(record.id)        # "seq1"
print(record.sequence)  # "ATCG"

# Read all records into memory
records = read_fasta("file.fasta")
records = read_fasta("file.fasta.gz")  # Auto-detects compression
records = read_fasta(None)  # Read from stdin

# Stream records (memory efficient) - source can be:
# - str: file path
# - Path: pathlib.Path object
# - file object: open file in binary mode
# - None: read from stdin

reader = FastaReader("large.fasta")  # String path
reader = FastaReader(Path("large.fasta"))  # Path object
reader = FastaReader()  # None = stdin

with open("file.fasta", "rb") as f:  # Binary mode required
    reader = FastaReader(f)  # File object
    for record in reader:
        print(f"{record.id}: {len(record.sequence)}")

# Performance tuning
reader = FastaReader("file.fasta", sequence_size_hint=50000)
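
This README does not say how to choose a value for `sequence_size_hint`. One plausible approach, assuming the hint is meant to approximate a typical sequence length, is to sample the lengths of the first few records and take their median. The helper below is a hypothetical sketch of that idea; `suggest_size_hint` is not part of prseq.

```python
import itertools
import statistics
from typing import Iterable


def suggest_size_hint(lengths: Iterable[int], sample: int = 1000, default: int = 0) -> int:
    """Return the median of the first `sample` lengths, or `default` if there are none."""
    head = list(itertools.islice(lengths, sample))  # Only consume a bounded prefix
    if not head:
        return default
    return int(statistics.median(head))
```

A possible use is to compute `suggest_size_hint(len(r.sequence) for r in FastaReader(path))` on one pass and supply the result as `sequence_size_hint` when opening similar files later. Whether this measurably helps depends on the data, so it is worth benchmarking.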

FASTQ Support

from pathlib import Path
from prseq import FastqRecord, FastqReader, read_fastq

# FastqRecord - represents a single read
record = FastqRecord(id="read1", sequence="ATCG", quality="IIII")
print(record.id)        # "read1"
print(record.sequence)  # "ATCG"
print(record.quality)   # "IIII"

# Read all records into memory
records = read_fastq("reads.fastq")
records = read_fastq("reads.fastq.bz2")  # Auto-detects compression
records = read_fastq(None)  # Read from stdin

# Stream records (memory efficient) - source can be:
# - str: file path
# - Path: pathlib.Path object
# - file object: open file in binary mode
# - None: read from stdin

reader = FastqReader("large.fastq")  # String path
reader = FastqReader(Path("large.fastq"))  # Path object
reader = FastqReader()  # None = stdin

with open("reads.fastq", "rb") as f:  # Binary mode required
    reader = FastqReader(f)  # File object
    for record in reader:
        # Validate quality length matches sequence
        assert len(record.sequence) == len(record.quality)
        print(f"{record.id}: {len(record.sequence)} bp")

# Performance tuning for short/long reads
reader = FastqReader("reads.fastq", sequence_size_hint=150)  # Short reads
reader = FastqReader("nanopore.fastq", sequence_size_hint=10000)  # Long reads
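
The `quality` field is shown above as a plain string such as "IIII". To work with numeric scores, each character has to be decoded. The sketch below assumes the common Phred+33 encoding (Sanger / Illumina 1.8+); older data may use a different offset. `phred_scores` and `mean_error_probability` are hypothetical helpers, not prseq functions.

```python
from typing import List

PHRED_OFFSET = 33  # Assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def phred_scores(quality: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def mean_error_probability(quality: str, offset: int = PHRED_OFFSET) -> float:
    """Average per-base error probability, using P = 10 ** (-Q / 10)."""
    scores = phred_scores(quality, offset)
    if not scores:
        return 0.0
    return sum(10 ** (-q / 10) for q in scores) / len(scores)
```

With this encoding, the quality string "IIII" from the `FastqRecord` example decodes to a score of 40 for every base, which corresponds to an error probability of 1 in 10,000.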

Advanced Usage

import prseq

# Filter sequences by length
def filter_by_length(filename, min_length):
    for record in prseq.FastaReader(filename):
        if len(record.sequence) >= min_length:
            yield record

# Calculate GC content (fraction of G and C bases, case-insensitive)
def gc_content(sequence):
    seq = sequence.upper()
    gc_count = seq.count('G') + seq.count('C')
    return gc_count / len(seq) if seq else 0.0

# Process compressed files: mean per-record GC content
records = prseq.read_fasta("sequences.fasta.gz")
avg_gc = sum(gc_content(r.sequence) for r in records) / len(records) if records else 0.0

# Convert FASTQ to FASTA
def fastq_to_fasta(fastq_file, fasta_file):
    with open(fasta_file, 'w') as f:
        for record in prseq.FastqReader(fastq_file):
            f.write(f">{record.id}\n{record.sequence}\n")
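
The GC example above loads every record into memory with `read_fasta`. For large files, the same kind of statistic can be accumulated while streaming. The sketch below works on any iterable of objects with a `sequence` attribute; `Record` is a stand-in used only so the example runs without prseq installed, and `streaming_gc_fraction` is a hypothetical helper. Note that it computes the overall, base-weighted GC fraction, which differs from the mean of per-record values when record lengths vary.

```python
from typing import Iterable, NamedTuple


class Record(NamedTuple):
    """Stand-in for a record with `id` and `sequence` attributes (not a prseq class)."""
    id: str
    sequence: str


def streaming_gc_fraction(records: Iterable[Record]) -> float:
    """Overall GC fraction across all bases, using constant memory."""
    gc = 0
    total = 0
    for record in records:
        seq = record.sequence.upper()
        gc += seq.count("G") + seq.count("C")
        total += len(seq)
    return gc / total if total else 0.0
```

Passing a streaming reader such as `prseq.FastaReader("large.fasta")` in place of the stand-in list should work the same way, since only `record.sequence` is accessed.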

CLI Tools

FASTA Tools

Command       Description                    Example
fasta-info    Show basic file information    fasta-info sequences.fasta
fasta-stats   Calculate sequence statistics  fasta-stats sequences.fasta.gz
fasta-filter  Filter by minimum length       fasta-filter 100 sequences.fasta

FASTQ Tools

Command       Description                    Example
fastq-info    Show basic file information    fastq-info reads.fastq
fastq-stats   Calculate sequence statistics  fastq-stats reads.fastq.bz2
fastq-filter  Filter by minimum length       fastq-filter 50 reads.fastq

CLI Examples

# Basic usage
fasta-info genome.fasta
fastq-stats reads.fastq

# With compressed files (auto-detected)
fasta-stats sequences.fasta.gz
fastq-info reads.fastq.bz2

# Using stdin (great for pipelines)
cat sequences.fasta | fasta-stats
gunzip -c reads.fastq.gz | fastq-filter 100

# Performance tuning for large sequences
fasta-stats --size-hint 50000 genome.fasta
fastq-filter --size-hint 10000 150 nanopore.fastq

Development

Prerequisites

  • Python 3.8-3.12
  • Rust 1.70+
  • maturin for building Python extensions

Setup

cd python
pip install maturin
maturin develop

Testing

# Run all tests
python -m pytest tests/ -v

# Run integration tests
python -m pytest tests/ -v --integration

# Type checking with mypy
mypy src/prseq

Building

# Development build
maturin develop

# Production wheel
maturin build --release

Publishing

cd python
maturin publish

Type Checking

The package ships with full type hints and is configured for mypy, targeting Python 3.8+. Type stubs for the Rust extension modules are generated automatically.
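
Because FASTA and FASTQ records both expose `id` and `sequence`, helper code can be typed once against a structural protocol instead of a concrete record class. The sketch below shows one way to do that with `typing.Protocol` (available from Python 3.8). `SequenceRecord`, `at_least`, and `DemoRecord` are hypothetical names for illustration, not part of prseq.

```python
from typing import Iterable, Iterator, NamedTuple, Protocol


class SequenceRecord(Protocol):
    """Anything with read-only `id` and `sequence` string attributes."""

    @property
    def id(self) -> str: ...

    @property
    def sequence(self) -> str: ...


def at_least(records: Iterable[SequenceRecord], min_length: int) -> Iterator[SequenceRecord]:
    """Yield only the records whose sequence has at least `min_length` characters."""
    for record in records:
        if len(record.sequence) >= min_length:
            yield record


class DemoRecord(NamedTuple):
    """Stand-in record used to exercise the helper without prseq installed."""
    id: str
    sequence: str


kept = [r.id for r in at_least([DemoRecord("a", "ACGT"), DemoRecord("b", "AC")], 3)]
```

A type checker accepts any class with matching attributes here, so the same function can be reused for both record types without inheritance.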

Rust Core

The Python package is built on top of the Rust prseq library, which provides the high-performance parsing implementation. If you need Rust-native parsing without Python, check out the Rust crate directly.

License

This project is licensed under the MIT License - see the LICENSE file for details.
