Python tools (backed by Rust) for sequence analysis

prseq (Python)

Python tools for sequence analysis, powered by Rust.

Overview

prseq provides Python bindings to a high-performance Rust library for FASTA and FASTQ parsing. It includes:

  • Pythonic API: Full type hints and Python-native data structures
  • CLI Tools: Ready-to-use command-line utilities
  • Rust Performance: Fast parsing with automatic compression detection
  • Memory Efficient: Streaming parsers for large files
  • Universal Input: Files, compressed files, and stdin support

The core parsing is implemented in the Rust prseq library.
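
Compression detection of this kind is usually done by looking at the first bytes of the input rather than the file extension. The sketch below illustrates that general idea using only the Python standard library. It is not prseq's implementation (that lives in the Rust core), and `sniff_compression` and `open_maybe_compressed` are hypothetical helper names introduced here for illustration.

```python
import bz2
import gzip
import io
from typing import BinaryIO

# Leading "magic" bytes that identify each compressed format.
GZIP_MAGIC = b"\x1f\x8b"
BZIP2_MAGIC = b"BZh"


def sniff_compression(head: bytes) -> str:
    """Return 'gzip', 'bzip2', or 'none' based on the first bytes of a stream."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(BZIP2_MAGIC):
        return "bzip2"
    return "none"


def open_maybe_compressed(raw: BinaryIO) -> BinaryIO:
    """Wrap a binary stream in the matching decompressor, if one is needed."""
    buffered = io.BufferedReader(raw)
    # peek() looks at upcoming bytes without consuming them.
    kind = sniff_compression(buffered.peek(3)[:3])
    if kind == "gzip":
        return gzip.GzipFile(fileobj=buffered, mode="rb")
    if kind == "bzip2":
        return bz2.BZ2File(buffered, mode="rb")
    return buffered
```

Because the check reads the stream itself, the same approach works for piped input, where no file name is available.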

Installation

Using uv (recommended)

uv add prseq

Using pip

pip install prseq

From source (developers)

git clone https://github.com/VirologyCharite/prseq.git
cd prseq/python
pip install maturin
maturin develop

Quick Start

Command Line Tools

# Analyze a FASTA file
fasta-info sequences.fasta
fasta-stats sequences.fasta.gz  # Works with compressed files
fasta-filter 100 sequences.fasta  # Keep sequences ≥100bp

# Analyze a FASTQ file
fastq-info reads.fastq
fastq-stats reads.fastq.bz2
fastq-filter 50 reads.fastq  # Keep sequences ≥50bp

# All tools support stdin
cat sequences.fasta | fasta-stats
gunzip -c reads.fastq.gz | fastq-filter 75

Python API

import prseq
from pathlib import Path

# FASTA files
records = prseq.read_fasta("sequences.fasta")
for record in records:
    print(f"{record.id}: {len(record.sequence)} bp")

# FASTQ files
records = prseq.read_fastq("reads.fastq")
for record in records:
    print(f"{record.id}: {len(record.sequence)} bp, quality: {len(record.quality)}")

# Streaming for large files - accepts str, Path, file object, or None
for record in prseq.FastaReader("large.fasta"):  # String path
    if len(record.sequence) > 1000:
        print(f"Long sequence: {record.id}")

for record in prseq.FastaReader(Path("large.fasta")):  # Path object
    print(f"{record.id}")

# Read from stdin
for record in prseq.FastqReader():  # None = stdin
    print(f"Read: {record.id}")

# Read from file object (must use binary mode 'rb')
with open("sequences.fasta", "rb") as f:
    for record in prseq.FastaReader(f):
        print(f"{record.id}")

Python API Reference

FASTA Support

from pathlib import Path
from prseq import FastaRecord, FastaReader, read_fasta

# FastaRecord - represents a single sequence
record = FastaRecord(id="seq1", sequence="ATCG")
print(record.id)        # "seq1"
print(record.sequence)  # "ATCG"

# Read all records into memory
records = read_fasta("file.fasta")
records = read_fasta("file.fasta.gz")  # Auto-detects compression
records = read_fasta(None)  # Read from stdin

# Stream records (memory efficient) - source can be:
# - str: file path
# - Path: pathlib.Path object
# - file object: open file in binary mode
# - None: read from stdin

reader = FastaReader("large.fasta")  # String path
reader = FastaReader(Path("large.fasta"))  # Path object
reader = FastaReader()  # None = stdin

with open("file.fasta", "rb") as f:  # Binary mode required
    reader = FastaReader(f)  # File object
    for record in reader:
        print(f"{record.id}: {len(record.sequence)}")

# Performance tuning
reader = FastaReader("file.fasta", sequence_size_hint=50000)

FASTQ Support

from pathlib import Path
from prseq import FastqRecord, FastqReader, read_fastq

# FastqRecord - represents a single read
record = FastqRecord(id="read1", sequence="ATCG", quality="IIII")
print(record.id)        # "read1"
print(record.sequence)  # "ATCG"
print(record.quality)   # "IIII"

# Read all records into memory
records = read_fastq("reads.fastq")
records = read_fastq("reads.fastq.bz2")  # Auto-detects compression
records = read_fastq(None)  # Read from stdin

# Stream records (memory efficient) - source can be:
# - str: file path
# - Path: pathlib.Path object
# - file object: open file in binary mode
# - None: read from stdin

reader = FastqReader("large.fastq")  # String path
reader = FastqReader(Path("large.fastq"))  # Path object
reader = FastqReader()  # None = stdin

with open("reads.fastq", "rb") as f:  # Binary mode required
    reader = FastqReader(f)  # File object
    for record in reader:
        # Validate quality length matches sequence
        assert len(record.sequence) == len(record.quality)
        print(f"{record.id}: {len(record.sequence)} bp")

# Performance tuning for short/long reads
reader = FastqReader("reads.fastq", sequence_size_hint=150)  # Short reads
reader = FastqReader("nanopore.fastq", sequence_size_hint=10000)  # Long reads
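
The quality string can be turned into numeric scores once the encoding is known. The sketch below assumes the common Phred+33 encoding (ASCII code minus 33) and assumes `record.quality` is a Python `str`, as in the example above. This README does not state which encoding a given file uses, and `phred_scores` and `mean_quality` are hypothetical helpers, not part of prseq.

```python
from typing import List

PHRED_OFFSET = 33  # Assumption: Sanger / Illumina 1.8+ style encoding


def phred_scores(quality: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def mean_quality(quality: str, offset: int = PHRED_OFFSET) -> float:
    """Arithmetic mean of the Phred scores; 0.0 for an empty string."""
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores) if scores else 0.0


print(phred_scores("II5!"))   # [40, 40, 20, 0]
print(mean_quality("II5!"))   # 25.0
```

With a streaming reader this could be applied per record, for example `mean_quality(record.quality)` inside the loop shown above.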

Advanced Usage

import prseq

# Filter sequences by length
def filter_by_length(filename, min_length):
    for record in prseq.FastaReader(filename):
        if len(record.sequence) >= min_length:
            yield record

# Calculate GC content
def gc_content(sequence):
    seq = sequence.upper()  # Normalize case once instead of per count
    gc_count = seq.count('G') + seq.count('C')
    return gc_count / len(seq) if seq else 0.0

# Process compressed files
records = prseq.read_fasta("sequences.fasta.gz")
# Guard against an empty file to avoid dividing by zero
avg_gc = sum(gc_content(r.sequence) for r in records) / len(records) if records else 0.0

# Convert FASTQ to FASTA
def fastq_to_fasta(fastq_file, fasta_file):
    with open(fasta_file, 'w') as f:
        for record in prseq.FastqReader(fastq_file):
            f.write(f">{record.id}\n{record.sequence}\n")
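
Summary statistics follow the same streaming pattern. The sketch below computes count, total length, min, max, mean, and N50 from any iterable of objects that have a `sequence` attribute, so it should work with a streaming reader as well as a list. `length_stats` is a hypothetical helper, not part of prseq, and its output is not meant to match what `fasta-stats` prints.

```python
from typing import Dict, Iterable


def length_stats(records: Iterable) -> Dict[str, float]:
    """Summarize sequence lengths; only the lengths are kept in memory."""
    lengths = sorted((len(r.sequence) for r in records), reverse=True)
    if not lengths:
        return {"count": 0, "total": 0, "min": 0, "max": 0, "mean": 0.0, "n50": 0}

    total = sum(lengths)

    # N50: the largest length L such that sequences of length >= L
    # together cover at least half of the total bases.
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    return {
        "count": len(lengths),
        "total": total,
        "min": lengths[-1],
        "max": lengths[0],
        "mean": total / len(lengths),
        "n50": n50,
    }
```

Assuming the reader yields records as shown earlier, a call might look like `length_stats(prseq.FastaReader("large.fasta"))`.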

CLI Tools

FASTA Tools

Command       Description                    Example
fasta-info    Show basic file information    fasta-info sequences.fasta
fasta-stats   Calculate sequence statistics  fasta-stats sequences.fasta.gz
fasta-filter  Filter by minimum length       fasta-filter 100 sequences.fasta

FASTQ Tools

Command       Description                    Example
fastq-info    Show basic file information    fastq-info reads.fastq
fastq-stats   Calculate sequence statistics  fastq-stats reads.fastq.bz2
fastq-filter  Filter by minimum length       fastq-filter 50 reads.fastq

CLI Examples

# Basic usage
fasta-info genome.fasta
fastq-stats reads.fastq

# With compressed files (auto-detected)
fasta-stats sequences.fasta.gz
fastq-info reads.fastq.bz2

# Using stdin (great for pipelines)
cat sequences.fasta | fasta-stats
gunzip -c reads.fastq.gz | fastq-filter 100

# Performance tuning for large sequences
fasta-stats --size-hint 50000 genome.fasta
fastq-filter --size-hint 10000 150 nanopore.fastq

Development

Prerequisites

  • Python 3.8-3.12
  • Rust 1.70+
  • maturin for building Python extensions

Setup

cd python
pip install maturin
maturin develop

Testing

# Run all tests
python -m pytest tests/ -v

# Run integration tests
python -m pytest tests/ -v --integration

# Type checking with mypy
mypy src/prseq

Building

# Development build
maturin develop

# Production wheel
maturin build --release

Publishing

cd python
maturin publish

Type Checking

The package includes full type hints and is configured for mypy with Python 3.8+ compatibility. Type stubs are automatically generated for the Rust extension modules.

Rust Core

The Python package is built on top of the Rust prseq library, which provides the high-performance parsing implementation. If you need Rust-native parsing without Python, check out the Rust crate directly.

License

This project is licensed under the MIT License - see the LICENSE file for details.
