Python tools (backed by Rust) for sequence analysis

prseq (Python)

Python tools for sequence analysis, powered by Rust.

Overview

prseq provides Python bindings to a high-performance Rust library for FASTA and FASTQ parsing. It includes:

  • Pythonic API: Full type hints and Python-native data structures
  • CLI Tools: Ready-to-use command-line utilities
  • Rust Performance: Fast parsing with automatic compression detection
  • Memory Efficient: Streaming parsers for large files
  • Universal Input: Files, compressed files, and stdin support

The core parsing is implemented in the Rust prseq library.
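
One way automatic compression detection can work is to inspect the first bytes of the input instead of trusting the file extension. The sketch below illustrates that idea with the Python standard library only. The names detect_compression and decompress_if_needed are hypothetical and are not part of prseq; the actual detection in the Rust core may work differently.

```python
import bz2
import gzip

# Leading "magic" bytes that identify each compressed format.
GZIP_MAGIC = b"\x1f\x8b"
BZIP2_MAGIC = b"BZh"


def detect_compression(head: bytes) -> str:
    """Classify input as 'gzip', 'bzip2' or 'none' from its first bytes."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(BZIP2_MAGIC):
        return "bzip2"
    return "none"


def decompress_if_needed(data: bytes) -> bytes:
    """Return the plain bytes, decompressing only when a magic number matches.

    Illustrative helper (assumption): it works on an in-memory buffer,
    whereas a streaming parser would peek at the stream instead.
    """
    kind = detect_compression(data[:3])
    if kind == "gzip":
        return gzip.decompress(data)
    if kind == "bzip2":
        return bz2.decompress(data)
    return data
```

Plain FASTA and FASTQ input starts with ">" or "@", so neither collides with the two magic numbers checked here.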

Installation

Using uv (recommended)

uv add prseq

Using pip

pip install prseq

From source (developers)

git clone https://github.com/VirologyCharite/prseq.git
cd prseq/python
pip install maturin
maturin develop

Quick Start

Command Line Tools

# Analyze a FASTA file
fasta-info sequences.fasta
fasta-stats sequences.fasta.gz  # Works with compressed files
fasta-filter 100 sequences.fasta  # Keep sequences ≥100bp

# Analyze a FASTQ file
fastq-info reads.fastq
fastq-stats reads.fastq.bz2
fastq-filter 50 reads.fastq  # Keep sequences ≥50bp

# All tools support stdin
cat sequences.fasta | fasta-stats
gunzip -c reads.fastq.gz | fastq-filter 75

Python API

import prseq
from pathlib import Path

# FASTA files
records = prseq.read_fasta("sequences.fasta")
for record in records:
    print(f"{record.id}: {len(record.sequence)} bp")

# FASTQ files
records = prseq.read_fastq("reads.fastq")
for record in records:
    print(f"{record.id}: {len(record.sequence)} bp, quality: {len(record.quality)}")

# Streaming for large files - accepts str, Path, file object, or None
for record in prseq.FastaReader("large.fasta"):  # String path
    if len(record.sequence) > 1000:
        print(f"Long sequence: {record.id}")

for record in prseq.FastaReader(Path("large.fasta")):  # Path object
    print(f"{record.id}")

# Read from stdin
for record in prseq.FastqReader():  # None = stdin
    print(f"Read: {record.id}")

# Read from file object (must use binary mode 'rb')
with open("sequences.fasta", "rb") as f:
    for record in prseq.FastaReader(f):
        print(f"{record.id}")

Python API Reference

FASTA Support

from pathlib import Path
from prseq import FastaRecord, FastaReader, read_fasta

# FastaRecord - represents a single sequence
record = FastaRecord(id="seq1", sequence="ATCG")
print(record.id)        # "seq1"
print(record.sequence)  # "ATCG"

# Read all records into memory
records = read_fasta("file.fasta")
records = read_fasta("file.fasta.gz")  # Auto-detects compression
records = read_fasta(None)  # Read from stdin

# Stream records (memory efficient) - source can be:
# - str: file path
# - Path: pathlib.Path object
# - file object: open file in binary mode
# - None: read from stdin

reader = FastaReader("large.fasta")  # String path
reader = FastaReader(Path("large.fasta"))  # Path object
reader = FastaReader()  # None = stdin

with open("file.fasta", "rb") as f:  # Binary mode required
    reader = FastaReader(f)  # File object
    for record in reader:
        print(f"{record.id}: {len(record.sequence)}")

# Performance tuning
reader = FastaReader("file.fasta", sequence_size_hint=50000)
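
Because a streaming reader yields one record at a time, summary statistics can be accumulated in a single pass without holding the file in memory. The following is a minimal sketch under that assumption. summarize_lengths and the Record stand-in are hypothetical names used for illustration; the function only relies on each record having a sequence attribute, as shown for FastaRecord above.

```python
from collections import namedtuple

# Stand-in with the same attribute names the README shows on FastaRecord.
Record = namedtuple("Record", ["id", "sequence"])


def summarize_lengths(records):
    """Return count, total, min, max and mean sequence length in one pass."""
    count = 0
    total = 0
    longest = 0
    shortest = None
    for record in records:
        n = len(record.sequence)
        count += 1
        total += n
        longest = max(longest, n)
        shortest = n if shortest is None else min(shortest, n)
    return {
        "count": count,
        "total": total,
        "min": 0 if shortest is None else shortest,
        "max": longest,
        "mean": total / count if count else 0.0,
    }


# Works on any iterable of records, including a generator.
demo = (Record(f"seq{i}", "ATCG" * i) for i in range(1, 4))
print(summarize_lengths(demo))
```

Any iterable of records with a sequence attribute, such as a FastaReader, should be usable in place of the demo generator.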

FASTQ Support

from pathlib import Path
from prseq import FastqRecord, FastqReader, read_fastq

# FastqRecord - represents a single read
record = FastqRecord(id="read1", sequence="ATCG", quality="IIII")
print(record.id)        # "read1"
print(record.sequence)  # "ATCG"
print(record.quality)   # "IIII"

# Read all records into memory
records = read_fastq("reads.fastq")
records = read_fastq("reads.fastq.bz2")  # Auto-detects compression
records = read_fastq(None)  # Read from stdin

# Stream records (memory efficient) - source can be:
# - str: file path
# - Path: pathlib.Path object
# - file object: open file in binary mode
# - None: read from stdin

reader = FastqReader("large.fastq")  # String path
reader = FastqReader(Path("large.fastq"))  # Path object
reader = FastqReader()  # None = stdin

with open("reads.fastq", "rb") as f:  # Binary mode required
    reader = FastqReader(f)  # File object
    for record in reader:
        # Validate quality length matches sequence
        assert len(record.sequence) == len(record.quality)
        print(f"{record.id}: {len(record.sequence)} bp")

# Performance tuning for short/long reads
reader = FastqReader("reads.fastq", sequence_size_hint=150)  # Short reads
reader = FastqReader("nanopore.fastq", sequence_size_hint=10000)  # Long reads
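
The quality attribute is shown above as a plain string, so turning it into numeric scores is left to the caller. The sketch below assumes the common Phred+33 encoding (Sanger and Illumina 1.8+); the README does not state which encoding a given file uses, so treat the offset as an assumption. phred_scores and mean_error_probability are hypothetical helper names, not part of prseq.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into a list of integer Phred scores.

    Assumes Phred+33 by default; pass offset=64 for older Illumina files.
    """
    return [ord(ch) - offset for ch in quality]


def mean_error_probability(quality, offset=33):
    """Average per-base error probability, using P = 10 ** (-Q / 10)."""
    scores = phred_scores(quality, offset)
    if not scores:
        return 0.0
    return sum(10 ** (-q / 10) for q in scores) / len(scores)


print(phred_scores("IIII"))  # [40, 40, 40, 40]
```

Averaging error probabilities and not raw scores avoids overstating quality, since the Phred scale is logarithmic.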

Advanced Usage

import prseq

# Filter sequences by length
def filter_by_length(filename, min_length):
    for record in prseq.FastaReader(filename):
        if len(record.sequence) >= min_length:
            yield record

# Calculate GC content as a fraction between 0 and 1
def gc_content(sequence):
    if not sequence:
        return 0.0
    upper = sequence.upper()  # Uppercase once so soft-masked bases are counted
    return (upper.count('G') + upper.count('C')) / len(sequence)

# Process compressed files
records = prseq.read_fasta("sequences.fasta.gz")
# Guard against an empty file to avoid dividing by zero
avg_gc = sum(gc_content(r.sequence) for r in records) / len(records) if records else 0.0

# Convert FASTQ to FASTA
def fastq_to_fasta(fastq_file, fasta_file):
    with open(fasta_file, 'w') as f:
        for record in prseq.FastqReader(fastq_file):
            f.write(f">{record.id}\n{record.sequence}\n")
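
The converter above writes each sequence on a single line, which is valid FASTA but can produce very long lines for genomes or long reads. As an optional variation, the sketch below wraps the sequence at a fixed width. write_fasta_record is a hypothetical helper, and the default width of 60 is a common convention and not a requirement of the format or of prseq.

```python
import io


def write_fasta_record(handle, record_id, sequence, width=60):
    """Write one FASTA record to a text handle, wrapping the sequence lines."""
    if width <= 0:
        raise ValueError("width must be positive")
    handle.write(f">{record_id}\n")
    for start in range(0, len(sequence), width):
        handle.write(sequence[start:start + width] + "\n")


# Demonstration with an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_fasta_record(buffer, "seq1", "ATCGATCG", width=4)
print(buffer.getvalue(), end="")
```

The helper takes an id and a sequence string instead of a record object, so it can be called with the fields of either a FASTA or a FASTQ record.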

CLI Tools

FASTA Tools

Command | Description | Example
fasta-info | Show basic file information | fasta-info sequences.fasta
fasta-stats | Calculate sequence statistics | fasta-stats sequences.fasta.gz
fasta-filter | Filter by minimum length | fasta-filter 100 sequences.fasta

FASTQ Tools

Command | Description | Example
fastq-info | Show basic file information | fastq-info reads.fastq
fastq-stats | Calculate sequence statistics | fastq-stats reads.fastq.bz2
fastq-filter | Filter by minimum length | fastq-filter 50 reads.fastq

CLI Examples

# Basic usage
fasta-info genome.fasta
fastq-stats reads.fastq

# With compressed files (auto-detected)
fasta-stats sequences.fasta.gz
fastq-info reads.fastq.bz2

# Using stdin (great for pipelines)
cat sequences.fasta | fasta-stats
gunzip -c reads.fastq.gz | fastq-filter 100

# Performance tuning for large sequences
fasta-stats --size-hint 50000 genome.fasta
fastq-filter --size-hint 10000 150 nanopore.fastq

Development

Prerequisites

  • Python 3.8-3.12
  • Rust 1.70+
  • maturin for building Python extensions

Setup

cd python
pip install maturin
maturin develop

Testing

# Run all tests
python -m pytest tests/ -v

# Run integration tests
python -m pytest tests/ -v --integration

# Type checking with MyPy
mypy src/prseq

Building

# Development build
maturin develop

# Production wheel
maturin build --release

Publishing

cd python
maturin publish

Type Checking

The package includes full type hints and is configured for MyPy with Python 3.8+ compatibility. Type stubs are automatically generated for the Rust extension modules.
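
Downstream code can stay type-checked without depending on a concrete record class by describing only the attributes it needs. The sketch below uses typing.Protocol, available from Python 3.8, for that purpose. HasSequence, DemoRecord and at_least are hypothetical names introduced for illustration; whether prseq's own stubs satisfy this protocol under MyPy is an assumption to verify against the installed package.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Protocol


class HasSequence(Protocol):
    """Structural type: any object exposing `id` and `sequence` strings."""

    id: str
    sequence: str


@dataclass
class DemoRecord:
    """Stand-in record used only to make this example runnable."""

    id: str
    sequence: str


def at_least(records: Iterable[HasSequence], min_length: int) -> Iterator[HasSequence]:
    """Lazily yield the records whose sequence has at least min_length bases."""
    for record in records:
        if len(record.sequence) >= min_length:
            yield record


kept = list(at_least([DemoRecord("a", "ATCG"), DemoRecord("b", "AT")], 3))
print([record.id for record in kept])  # ['a']
```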

Rust Core

The Python package is built on top of the Rust prseq library, which provides the high-performance parsing implementation. If you need Rust-native parsing without Python, check out the Rust crate directly.

License

This project is licensed under the MIT License - see the LICENSE file for details.
