Python tools (backed by Rust) for sequence analysis

prseq (Python)

Python tools for sequence analysis, powered by Rust.

Overview

prseq provides Python bindings to a high-performance Rust library for FASTA and FASTQ parsing. It includes:

  • Pythonic API: Full type hints and Python-native data structures
  • CLI Tools: Ready-to-use command-line utilities
  • Rust Performance: Fast parsing with automatic compression detection
  • Memory Efficient: Streaming parsers for large files
  • Universal Input: Files, compressed files, and stdin support

The core parsing is implemented in the Rust prseq library.
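
To make "automatic compression detection" concrete, here is a minimal pure-Python sketch of one common approach: inspecting the leading magic bytes of a file. This is an illustration only, not prseq's Rust implementation; the helper names `detect_compression` and `open_text` are hypothetical and not part of the prseq API.

```python
import bz2
import gzip
from pathlib import Path
from typing import IO, Union

GZIP_MAGIC = b"\x1f\x8b"  # gzip streams start with these two bytes
BZIP2_MAGIC = b"BZh"      # bzip2 streams start with these three bytes


def detect_compression(head: bytes) -> str:
    """Classify a stream from its first few bytes (hypothetical helper)."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(BZIP2_MAGIC):
        return "bzip2"
    return "none"


def open_text(path: Union[str, Path]) -> IO[str]:
    """Open a plain, gzip, or bzip2 file as text, chosen by content, not extension."""
    path = Path(path)
    with path.open("rb") as f:
        kind = detect_compression(f.read(3))
    if kind == "gzip":
        return gzip.open(path, "rt")
    if kind == "bzip2":
        return bz2.open(path, "rt")
    return path.open("r")
```

Detecting by content rather than by file extension is what lets the same check work for data arriving on stdin, where no file name is available.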

Installation

Using uv (recommended)

uv add prseq

Using pip

pip install prseq

From source (developers)

git clone https://github.com/VirologyCharite/prseq.git
cd prseq/python
pip install maturin
maturin develop

Quick Start

Command Line Tools

# Analyze a FASTA file
fasta-info sequences.fasta
fasta-stats sequences.fasta.gz  # Works with compressed files
fasta-filter 100 sequences.fasta  # Keep sequences ≥100bp

# Analyze a FASTQ file
fastq-info reads.fastq
fastq-stats reads.fastq.bz2
fastq-filter 50 reads.fastq  # Keep sequences ≥50bp

# All tools support stdin
cat sequences.fasta | fasta-stats
gunzip -c reads.fastq.gz | fastq-filter 75

Python API

import prseq
from pathlib import Path

# FASTA files
records = prseq.read_fasta("sequences.fasta")
for record in records:
    print(f"{record.id}: {len(record.sequence)} bp")

# FASTQ files
records = prseq.read_fastq("reads.fastq")
for record in records:
    print(f"{record.id}: {len(record.sequence)} bp, quality: {len(record.quality)}")

# Streaming for large files - accepts str, Path, file object, or None
for record in prseq.FastaReader("large.fasta"):  # String path
    if len(record.sequence) > 1000:
        print(f"Long sequence: {record.id}")

for record in prseq.FastaReader(Path("large.fasta")):  # Path object
    print(f"{record.id}")

# Read from stdin
for record in prseq.FastqReader():  # None = stdin
    print(f"Read: {record.id}")

# Read from file object (must use binary mode 'rb')
with open("sequences.fasta", "rb") as f:
    for record in prseq.FastaReader(f):
        print(f"{record.id}")
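
For readers who want to see what "streaming" means here, the following pure-Python generator sketches the idea: it holds only one record in memory at a time and reads from a binary file object, as in the example above. It is a conceptual sketch, not prseq's parser; `SketchRecord` and `iter_fasta` are hypothetical names, and treating the whole header line as the id is an assumption.

```python
import io
from typing import BinaryIO, Iterator, List, NamedTuple, Optional


class SketchRecord(NamedTuple):
    """Stand-in record type for this sketch (not prseq.FastaRecord)."""
    id: str
    sequence: str


def iter_fasta(handle: BinaryIO) -> Iterator[SketchRecord]:
    """Yield one record at a time from a binary FASTA stream."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw_line in handle:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(b">"):
            if header is not None:
                yield SketchRecord(header, "".join(chunks))
            # Assumption: the full header line (minus '>') is used as the id
            header = line[1:].decode("ascii")
            chunks = []
        elif header is not None:
            chunks.append(line.decode("ascii"))
    if header is not None:
        yield SketchRecord(header, "".join(chunks))


demo = io.BytesIO(b">seq1\nATCG\nGGCC\n>seq2\nTTAA\n")
records = list(iter_fasta(demo))
```

Because the function is a generator, a caller that stops iterating early never pays for parsing the rest of the file.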

Python API Reference

FASTA Support

from pathlib import Path
from prseq import FastaRecord, FastaReader, read_fasta

# FastaRecord - represents a single sequence
record = FastaRecord(id="seq1", sequence="ATCG")
print(record.id)        # "seq1"
print(record.sequence)  # "ATCG"

# Read all records into memory
records = read_fasta("file.fasta")
records = read_fasta("file.fasta.gz")  # Auto-detects compression
records = read_fasta(None)  # Read from stdin

# Stream records (memory efficient) - source can be:
# - str: file path
# - Path: pathlib.Path object
# - file object: open file in binary mode
# - None: read from stdin

reader = FastaReader("large.fasta")  # String path
reader = FastaReader(Path("large.fasta"))  # Path object
reader = FastaReader()  # None = stdin

with open("file.fasta", "rb") as f:  # Binary mode required
    reader = FastaReader(f)  # File object
    for record in reader:
        print(f"{record.id}: {len(record.sequence)}")

# Performance tuning
reader = FastaReader("file.fasta", sequence_size_hint=50000)

FASTQ Support

from pathlib import Path
from prseq import FastqRecord, FastqReader, read_fastq

# FastqRecord - represents a single read
record = FastqRecord(id="read1", sequence="ATCG", quality="IIII")
print(record.id)        # "read1"
print(record.sequence)  # "ATCG"
print(record.quality)   # "IIII"

# Read all records into memory
records = read_fastq("reads.fastq")
records = read_fastq("reads.fastq.bz2")  # Auto-detects compression
records = read_fastq(None)  # Read from stdin

# Stream records (memory efficient) - source can be:
# - str: file path
# - Path: pathlib.Path object
# - file object: open file in binary mode
# - None: read from stdin

reader = FastqReader("large.fastq")  # String path
reader = FastqReader(Path("large.fastq"))  # Path object
reader = FastqReader()  # None = stdin

with open("reads.fastq", "rb") as f:  # Binary mode required
    reader = FastqReader(f)  # File object
    for record in reader:
        # Validate quality length matches sequence
        assert len(record.sequence) == len(record.quality)
        print(f"{record.id}: {len(record.sequence)} bp")

# Performance tuning for short/long reads
reader = FastqReader("reads.fastq", sequence_size_hint=150)  # Short reads
reader = FastqReader("nanopore.fastq", sequence_size_hint=10000)  # Long reads
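
The quality field is a plain string, so turning it into numeric scores is left to the caller. A minimal sketch, assuming the file uses the common Phred+33 encoding (the README does not state which encoding prseq expects); the helper names are hypothetical:

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to Phred scores (assumes Phred+33 by default)."""
    return [ord(ch) - offset for ch in quality]


def mean_quality(quality: str, offset: int = 33) -> float:
    """Arithmetic mean of the Phred scores, or 0.0 for an empty string."""
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores) if scores else 0.0


def error_probability(score: float) -> float:
    """Probability that a base call with this Phred score is wrong: 10^(-Q/10)."""
    return 10 ** (-score / 10)


scores = phred_scores("IIII")  # 'I' is ASCII 73, so each score is 73 - 33 = 40
```

These functions could be applied to `record.quality` inside any of the reader loops above, for example to skip reads whose mean quality falls below a threshold.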

Advanced Usage

import prseq

# Filter sequences by length
def filter_by_length(filename, min_length):
    for record in prseq.FastaReader(filename):
        if len(record.sequence) >= min_length:
            yield record

# Calculate GC content as a fraction between 0 and 1
def gc_content(sequence):
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count('G') + upper.count('C')) / len(sequence)

# Process compressed files
records = prseq.read_fasta("sequences.fasta.gz")
# Guard against an empty file to avoid dividing by zero
avg_gc = sum(gc_content(r.sequence) for r in records) / len(records) if records else 0.0

# Convert FASTQ to FASTA
def fastq_to_fasta(fastq_file, fasta_file):
    with open(fasta_file, 'w') as f:
        for record in prseq.FastqReader(fastq_file):
            f.write(f">{record.id}\n{record.sequence}\n")
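
The converter above writes each sequence on a single line. If downstream tools expect wrapped FASTA, a small formatter can be dropped in. This is a sketch with a hypothetical helper name, `format_fasta`, and an assumed default width of 60 columns:

```python
def format_fasta(record_id: str, sequence: str, width: int = 60) -> str:
    """Return one FASTA entry with the sequence wrapped at `width` columns."""
    if width < 1:
        raise ValueError("width must be at least 1")
    # Slice the sequence into fixed-size chunks, one per output line
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    body = "".join(line + "\n" for line in lines)
    return f">{record_id}\n{body}"


entry = format_fasta("seq1", "ATCGATCG", width=4)
```

Inside `fastq_to_fasta`, the `f.write(...)` call could then become `f.write(format_fasta(record.id, record.sequence))`.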

CLI Tools

FASTA Tools

| Command | Description | Example |
|---|---|---|
| fasta-info | Show basic file information | fasta-info sequences.fasta |
| fasta-stats | Calculate sequence statistics | fasta-stats sequences.fasta.gz |
| fasta-filter | Filter by minimum length | fasta-filter 100 sequences.fasta |

FASTQ Tools

| Command | Description | Example |
|---|---|---|
| fastq-info | Show basic file information | fastq-info reads.fastq |
| fastq-stats | Calculate sequence statistics | fastq-stats reads.fastq.bz2 |
| fastq-filter | Filter by minimum length | fastq-filter 50 reads.fastq |
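
The exact fields reported by the stats commands are not listed here. As an illustration of the kind of length statistics such tools typically compute, here is a hedged sketch in Python; the function `length_stats` and its output keys are hypothetical and do not describe the actual output of fasta-stats or fastq-stats.

```python
from typing import Dict, Sequence, Union

Number = Union[int, float]


def length_stats(lengths: Sequence[int]) -> Dict[str, Number]:
    """Summarise sequence lengths: count, total, min, max, mean, and N50."""
    if not lengths:
        return {"count": 0, "total": 0, "min": 0, "max": 0, "mean": 0.0, "n50": 0}
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    # N50: the length at which the running sum of the longest sequences
    # first reaches at least half of the total number of bases
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "count": len(ordered),
        "total": total,
        "min": ordered[-1],
        "max": ordered[0],
        "mean": total / len(ordered),
        "n50": n50,
    }


stats = length_stats([2, 3, 4, 5, 6])
```

The input could be produced from a streaming reader with a generator such as `len(r.sequence) for r in reader`, collected into a list.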

CLI Examples

# Basic usage
fasta-info genome.fasta
fastq-stats reads.fastq

# With compressed files (auto-detected)
fasta-stats sequences.fasta.gz
fastq-info reads.fastq.bz2

# Using stdin (great for pipelines)
cat sequences.fasta | fasta-stats
gunzip -c reads.fastq.gz | fastq-filter 100

# Performance tuning for large sequences
fasta-stats --size-hint 50000 genome.fasta
fastq-filter --size-hint 10000 150 nanopore.fastq

Development

Prerequisites

  • Python 3.8-3.12
  • Rust 1.70+
  • maturin for building Python extensions

Setup

cd python
pip install maturin
maturin develop

Testing

# Run all tests
python -m pytest tests/ -v

# Run integration tests
python -m pytest tests/ -v --integration

# Type checking with mypy
mypy src/prseq

Building

# Development build
maturin develop

# Production wheel
maturin build --release

Publishing

cd python
maturin publish

Type Checking

The package includes full type hints and is configured for mypy with Python 3.8+ compatibility. Type stubs are automatically generated for the Rust extension modules.
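
One way downstream code can benefit from the type hints is by writing helpers against a structural type rather than a concrete class. The sketch below uses `typing.Protocol` (available from Python 3.8); `HasSequence`, `ids_at_least`, and `DemoRecord` are hypothetical names, and whether prseq's record classes satisfy this protocol under mypy depends on the generated stubs, which is an assumption here.

```python
from typing import Iterable, List, NamedTuple, Protocol


class HasSequence(Protocol):
    """Anything exposing read-only `id` and `sequence` string attributes."""

    @property
    def id(self) -> str: ...

    @property
    def sequence(self) -> str: ...


def ids_at_least(records: Iterable[HasSequence], min_length: int) -> List[str]:
    """Return the ids of records whose sequence has at least `min_length` bases."""
    return [r.id for r in records if len(r.sequence) >= min_length]


class DemoRecord(NamedTuple):
    """Stand-in record used only for this example."""
    id: str
    sequence: str


long_ids = ids_at_least([DemoRecord("a", "ATCG"), DemoRecord("b", "AT")], 3)
```

Writing helpers this way keeps them usable with both FASTA and FASTQ records, since both expose `id` and `sequence`.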

Rust Core

The Python package is built on top of the Rust prseq library, which provides the high-performance parsing implementation. If you need Rust-native parsing without Python, check out the Rust crate directly.

License

This project is licensed under the MIT License - see the LICENSE file for details.
