Python tools (backed by Rust) for sequence analysis

prseq (Python)

Python tools for sequence analysis, powered by Rust.

Overview

prseq provides Python bindings to a high-performance Rust library for FASTA and FASTQ parsing. It includes:

  • Pythonic API: Full type hints and Python-native data structures
  • CLI Tools: Ready-to-use command-line utilities
  • Rust Performance: Fast parsing with automatic compression detection
  • Memory Efficient: Streaming parsers for large files
  • Universal Input: Files, compressed files, and stdin support

The core parsing is implemented in the Rust prseq library.
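
How the Rust library detects compression is not described here. As an illustration of the general idea only, the Python sketch below guesses the format from the first bytes of a stream, using the gzip and bzip2 magic numbers. `sniff_compression` and `open_maybe_compressed` are hypothetical helper names and are not part of prseq.

```python
import bz2
import gzip
import io

# Magic numbers that open gzip and bzip2 streams.
GZIP_MAGIC = b"\x1f\x8b"
BZIP2_MAGIC = b"BZh"


def sniff_compression(head: bytes) -> str:
    """Guess the compression format from the first bytes of a stream."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(BZIP2_MAGIC):
        return "bzip2"
    return "none"


def open_maybe_compressed(data: bytes) -> io.BufferedIOBase:
    """Return a binary reader over the decompressed bytes (hypothetical helper)."""
    kind = sniff_compression(data[:3])
    raw = io.BytesIO(data)
    if kind == "gzip":
        return gzip.GzipFile(fileobj=raw)
    if kind == "bzip2":
        return bz2.BZ2File(raw)
    return raw


payload = b">seq1\nATCG\n"
for blob in (payload, gzip.compress(payload), bz2.compress(payload)):
    print(sniff_compression(blob[:3]), open_maybe_compressed(blob).read() == payload)
```

Sniffing the content rather than trusting the file extension is what lets the same approach work on stdin, where no file name is available.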

Installation

Using uv (recommended)

uv add prseq

Using pip

pip install prseq

From source (developers)

git clone https://github.com/VirologyCharite/prseq.git
cd prseq/python
pip install maturin
maturin develop

Quick Start

Command Line Tools

# Analyze a FASTA file
fasta-info sequences.fasta
fasta-stats sequences.fasta.gz  # Works with compressed files
fasta-filter 100 sequences.fasta  # Keep sequences ≥100bp

# Analyze a FASTQ file
fastq-info reads.fastq
fastq-stats reads.fastq.bz2
fastq-filter 50 reads.fastq  # Keep sequences ≥50bp

# All tools support stdin
cat sequences.fasta | fasta-stats
gunzip -c reads.fastq.gz | fastq-filter 75

Python API

import prseq
from pathlib import Path

# FASTA files
records = prseq.read_fasta("sequences.fasta")
for record in records:
    print(f"{record.id}: {len(record.sequence)} bp")

# FASTQ files
records = prseq.read_fastq("reads.fastq")
for record in records:
    print(f"{record.id}: {len(record.sequence)} bp, quality: {len(record.quality)}")

# Streaming for large files - accepts str, Path, file object, or None
for record in prseq.FastaReader("large.fasta"):  # String path
    if len(record.sequence) > 1000:
        print(f"Long sequence: {record.id}")

for record in prseq.FastaReader(Path("large.fasta")):  # Path object
    print(f"{record.id}")

# Read from stdin
for record in prseq.FastqReader():  # None = stdin
    print(f"Read: {record.id}")

# Read from file object (must use binary mode 'rb')
with open("sequences.fasta", "rb") as f:
    for record in prseq.FastaReader(f):
        print(f"{record.id}")

Python API Reference

FASTA Support

from pathlib import Path
from prseq import FastaRecord, FastaReader, read_fasta

# FastaRecord - represents a single sequence
record = FastaRecord(id="seq1", sequence="ATCG")
print(record.id)        # "seq1"
print(record.sequence)  # "ATCG"

# Read all records into memory
records = read_fasta("file.fasta")
records = read_fasta("file.fasta.gz")  # Auto-detects compression
records = read_fasta(None)  # Read from stdin

# Stream records (memory efficient) - source can be:
# - str: file path
# - Path: pathlib.Path object
# - file object: open file in binary mode
# - None: read from stdin

reader = FastaReader("large.fasta")  # String path
reader = FastaReader(Path("large.fasta"))  # Path object
reader = FastaReader()  # None = stdin

with open("file.fasta", "rb") as f:  # Binary mode required
    reader = FastaReader(f)  # File object
    for record in reader:
        print(f"{record.id}: {len(record.sequence)}")

# Performance tuning
reader = FastaReader("file.fasta", sequence_size_hint=50000)
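
To make "streaming" concrete: a streaming parser yields one record at a time instead of building a list, so memory use is bounded by the largest record rather than by the file size. The pure-Python generator below sketches that idea. It is not prseq's implementation (which lives in Rust), `iter_fasta` is a hypothetical name, and treating the whole header line as the id is an assumption.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def iter_fasta(handle: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (id, sequence) pairs one at a time from a binary FASTA stream."""
    header = None
    chunks = []
    for raw in handle:
        line = raw.decode("ascii").strip()
        if not line:
            continue  # Skip blank lines.
        if line.startswith(">"):
            if header is not None:
                # A new header closes the previous record.
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        # Emit the final record once the stream is exhausted.
        yield header, "".join(chunks)


data = io.BytesIO(b">seq1\nATCG\nGGCC\n>seq2\nTTAA\n")
for seq_id, sequence in iter_fasta(data):
    print(f"{seq_id}: {len(sequence)} bp")
```

The sketch reads from a binary handle, mirroring the binary-mode requirement for file objects noted above.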

FASTQ Support

from pathlib import Path
from prseq import FastqRecord, FastqReader, read_fastq

# FastqRecord - represents a single read
record = FastqRecord(id="read1", sequence="ATCG", quality="IIII")
print(record.id)        # "read1"
print(record.sequence)  # "ATCG"
print(record.quality)   # "IIII"

# Read all records into memory
records = read_fastq("reads.fastq")
records = read_fastq("reads.fastq.bz2")  # Auto-detects compression
records = read_fastq(None)  # Read from stdin

# Stream records (memory efficient) - source can be:
# - str: file path
# - Path: pathlib.Path object
# - file object: open file in binary mode
# - None: read from stdin

reader = FastqReader("large.fastq")  # String path
reader = FastqReader(Path("large.fastq"))  # Path object
reader = FastqReader()  # None = stdin

with open("reads.fastq", "rb") as f:  # Binary mode required
    reader = FastqReader(f)  # File object
    for record in reader:
        # Validate quality length matches sequence
        assert len(record.sequence) == len(record.quality)
        print(f"{record.id}: {len(record.sequence)} bp")

# Performance tuning for short/long reads
reader = FastqReader("reads.fastq", sequence_size_hint=150)  # Short reads
reader = FastqReader("nanopore.fastq", sequence_size_hint=10000)  # Long reads
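
The quality field is a string with one character per base. This README does not say how to turn it into numeric scores, so the sketch below assumes the common Phred+33 encoding. `phred_scores` and `mean_quality` are hypothetical helpers and are not part of prseq.

```python
from typing import List

PHRED_OFFSET = 33  # Assumption: Phred+33 encoding.


def phred_scores(quality: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(char) - offset for char in quality]


def mean_quality(quality: str) -> float:
    """Arithmetic mean of the Phred scores; 0.0 for an empty string."""
    scores = phred_scores(quality)
    return sum(scores) / len(scores) if scores else 0.0


print(phred_scores("IIII"))  # [40, 40, 40, 40]
print(mean_quality("II55"))  # 30.0
```

With a real file, `record.quality` from a `FastqReader` loop would be passed to these helpers in place of the literal strings.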

Advanced Usage

import prseq

# Filter sequences by length
def filter_by_length(filename, min_length):
    for record in prseq.FastaReader(filename):
        if len(record.sequence) >= min_length:
            yield record

# Calculate GC content
def gc_content(sequence):
    if not sequence:
        return 0.0  # Avoid dividing by zero on empty sequences
    upper = sequence.upper()  # Uppercase once so lowercase bases are counted
    return (upper.count('G') + upper.count('C')) / len(sequence)

# Process compressed files
records = prseq.read_fasta("sequences.fasta.gz")
# Guard against an empty file before averaging
avg_gc = sum(gc_content(r.sequence) for r in records) / len(records) if records else 0.0

# Convert FASTQ to FASTA
def fastq_to_fasta(fastq_file, fasta_file):
    with open(fasta_file, 'w') as f:
        for record in prseq.FastqReader(fastq_file):
            f.write(f">{record.id}\n{record.sequence}\n")
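
Because the readers are plain iterables of records, summary statistics can be written against any object that exposes a `.sequence` attribute. The sketch below computes a few length statistics, including N50. `length_summary` is a hypothetical helper, the `Record` namedtuple is only a stand-in so the example runs without input files, and the result is not necessarily what `fasta-stats` reports.

```python
from collections import namedtuple
from typing import Dict, Iterable

# Stand-in for prseq records so the sketch runs without input files.
Record = namedtuple("Record", ["id", "sequence"])


def length_summary(records: Iterable) -> Dict[str, int]:
    """Summarize sequence lengths for any iterable of objects with .sequence."""
    lengths = sorted((len(r.sequence) for r in records), reverse=True)
    if not lengths:
        return {"count": 0, "total": 0, "min": 0, "max": 0, "n50": 0}
    total = sum(lengths)
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        if running * 2 >= total:  # First length that covers half of all bases.
            n50 = length
            break
    return {
        "count": len(lengths),
        "total": total,
        "min": lengths[-1],
        "max": lengths[0],
        "n50": n50,
    }


demo = [Record("a", "A" * 10), Record("b", "C" * 6), Record("c", "G" * 4)]
print(length_summary(demo))
```

Only the lengths are kept in memory, so passing a streaming reader instead of the demo list would avoid holding every sequence at once.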

CLI Tools

FASTA Tools

Command       Description                    Example
fasta-info    Show basic file information    fasta-info sequences.fasta
fasta-stats   Calculate sequence statistics  fasta-stats sequences.fasta.gz
fasta-filter  Filter by minimum length       fasta-filter 100 sequences.fasta

FASTQ Tools

Command       Description                    Example
fastq-info    Show basic file information    fastq-info reads.fastq
fastq-stats   Calculate sequence statistics  fastq-stats reads.fastq.bz2
fastq-filter  Filter by minimum length       fastq-filter 50 reads.fastq

CLI Examples

# Basic usage
fasta-info genome.fasta
fastq-stats reads.fastq

# With compressed files (auto-detected)
fasta-stats sequences.fasta.gz
fastq-info reads.fastq.bz2

# Using stdin (great for pipelines)
cat sequences.fasta | fasta-stats
gunzip -c reads.fastq.gz | fastq-filter 100

# Performance tuning for large sequences
fasta-stats --size-hint 50000 genome.fasta
fastq-filter --size-hint 10000 150 nanopore.fastq

Development

Prerequisites

  • Python 3.8-3.12
  • Rust 1.70+
  • maturin for building Python extensions

Setup

cd python
pip install maturin
maturin develop

Testing

# Run all tests
python -m pytest tests/ -v

# Run integration tests
python -m pytest tests/ -v --integration

# Type checking with MyPy
mypy src/prseq

Building

# Development build
maturin develop

# Production wheel
maturin build --release

Publishing

cd python
maturin publish

Type Checking

The package includes full type hints and is configured for mypy with Python 3.8+ compatibility. Type stubs are automatically generated for the Rust extension modules.
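
As a sketch of how downstream code might take advantage of type hints, the example below describes a record structurally with `typing.Protocol` (available from Python 3.8) and annotates a helper against it. `HasSequence`, `at_least`, and `DemoRecord` are hypothetical names introduced for illustration; whether prseq's own record classes satisfy the protocol under mypy depends on the generated stubs and is an assumption here.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Protocol


class HasSequence(Protocol):
    """Structural type: anything with string `id` and `sequence` attributes."""

    id: str
    sequence: str


def at_least(records: Iterable[HasSequence], min_length: int) -> Iterator[HasSequence]:
    """Yield the records whose sequence has at least min_length characters."""
    for record in records:
        if len(record.sequence) >= min_length:
            yield record


@dataclass
class DemoRecord:
    """Stand-in used only so the sketch runs without prseq installed."""

    id: str
    sequence: str


kept = list(at_least([DemoRecord("a", "ATCG"), DemoRecord("b", "AT")], 3))
print([r.id for r in kept])  # ['a']
```

A protocol keeps helpers like this independent of the concrete record class, so the same function can accept FASTA or FASTQ records.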

Rust Core

The Python package is built on top of the Rust prseq library, which provides the high-performance parsing implementation. If you need Rust-native parsing without Python, check out the Rust crate directly.

License

This project is licensed under the MIT License - see the LICENSE file for details.
