High-performance PyO3 Python bindings for rustkmer k-mer library

RustKmer PyO3 Python Bindings

High-performance Python bindings for the RustKmer k-mer counting and querying library using PyO3.

Features

High Performance: Native Rust extensions with minimal Python overhead
Memory Efficient: Optimized memory usage for large genomic datasets
Flexible Loading: Choose between Preload, MemoryMapped, or Lazy loading modes
Complete API: Full access to k-mer counting, database querying, and fuzzy matching
Pythonic Interface: Clean, intuitive Python API design
Compatible: Works with Python 3.11+

Database Loading Modes

The PyDatabase class supports three loading modes to balance performance and memory usage:

LoadMode.Preload

Description: Loads all k-mers into an in-memory HashMap for the fastest queries
Performance: Fastest query speed (sub-millisecond)
Memory Usage: High (stores all k-mers in memory)
Best For: Applications with frequent queries on the same database

LoadMode.MemoryMapped

Description: Uses memory-mapped file access for balanced performance
Performance: Good query speed with moderate memory usage
Memory Usage: Low to Moderate (OS manages caching)
Best For: Large databases where memory is limited

LoadMode.Lazy

Description: Loads k-mers on demand using binary search
Performance: Slower queries but no memory overhead for unused k-mers
Memory Usage: Very Low (stores only the sorted index)
Best For: Applications with infrequent queries or very large databases
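
The trade-off between Preload and Lazy can be sketched in plain Python. This is not how RustKmer is implemented; it only contrasts a hash-table lookup with a binary search over sorted keys. The data and the function names preload_table and lazy_count are hypothetical.

```python
from bisect import bisect_left

# Hypothetical stand-in data: k-mers in sorted order with parallel counts,
# loosely resembling a sorted on-disk index.
SORTED_KMERS = ["AAC", "ACG", "CGT", "GTA"]
COUNTS = [4, 7, 2, 5]


def preload_table(kmers, counts):
    """Preload-style: copy every record into a dict (O(1) lookups, most memory)."""
    return dict(zip(kmers, counts))


def lazy_count(kmers, counts, kmer):
    """Lazy-style: binary search the sorted keys (O(log n) lookups, no extra table)."""
    i = bisect_left(kmers, kmer)
    if i < len(kmers) and kmers[i] == kmer:
        return counts[i]
    return 0  # a missing k-mer is reported as count 0 in this sketch


table = preload_table(SORTED_KMERS, COUNTS)
print(table.get("CGT", 0))
print(lazy_count(SORTED_KMERS, COUNTS, "CGT"))
```

Both lookups return the same count; the difference is whether the whole table is built up front or each query pays for a search.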

Installation

# Install from source
maturin develop --release

# Or with pip (when published)
pip install pyrustkmer

Quick Start

from pyrustkmer import PyDatabase, PyCounter, LoadMode

# Create a k-mer counter
counter = PyCounter(k=21, canonical=True)

# Add sequences to count k-mers (a sequence must be at least k bases long to contain a k-mer)
counter.add_sequence("ATCGATCGATCGATCGATCGATCG")

# Get statistics
stats = counter.get_stats()
print(f"Counted {stats.unique_kmers} unique k-mers")

# Load database with different modes
# Preload: Fastest queries, highest memory usage
db_preload = PyDatabase("genome.rkdb", LoadMode.Preload)

# MemoryMapped: Balanced performance
db_mmap = PyDatabase("genome.rkdb", LoadMode.MemoryMapped)

# Lazy: Lowest memory usage, binary search
db_lazy = PyDatabase("genome.rkdb", LoadMode.Lazy)

# Query k-mers using unified API
result = db_preload.query_exact("ATCGATCGATCGATCG")
print(f"K-mer count: {result.count}")

# Check memory usage
memory_info = db_preload.get_memory_usage()
print(f"Memory usage: {memory_info}")

# Batch query
results = db_preload.query_exact_batch(["ATCGATCGATCGATCG", "GGGGGGGGGGGGGGGGGGGGG"])
print(f"Batch results: {len(results)} queries processed")
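
For readers new to k-mer counting, here is a minimal pure-Python sketch of what a counter with canonical=True might do. It assumes the common convention that the canonical form of a k-mer is the lexicographically smaller of the k-mer and its reverse complement; whether RustKmer uses exactly this convention is an assumption, and the function names are illustrative only.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Assumed convention: the smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def count_kmers(sequence, k, use_canonical=True):
    """Slide a window of width k over the sequence and tally each k-mer."""
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        counts[canonical(kmer) if use_canonical else kmer] += 1
    return counts


counts = count_kmers("ATCGATCGATCGATCG", k=4)
print(sum(counts.values()), "k-mers,", len(counts), "unique")

# A 16-base sequence contains no 21-mers, so the tally stays empty.
print(len(count_kmers("ATCGATCGATCGATCG", k=21)))
```

A sequence of length n contains n - k + 1 k-mers, which is why inputs shorter than k contribute nothing.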

Unified Query API

Version 0.4.1 of the PyO3 bindings introduces a unified query interface with consistent method naming:

Exact Query Methods

# Single k-mer query
result = db.query_exact("ATGCGATGCTAGCGCTAGCTAG")

# Batch query
results = db.query_exact_batch(["AAAAA", "TTTTT", "CCCCC"])

Prefix Query Methods

# Optimized prefix query
results = db.query_prefix("ATGCG")

# Batch prefix query
prefixes = ["ATG", "CGA", "GCT"]
results = db.query_prefix_batch(prefixes)
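
One way a prefix query can be answered efficiently is a range search over sorted k-mers. The sketch below is an illustration of that idea in plain Python, not RustKmer's actual implementation; prefix_range is a hypothetical name.

```python
from bisect import bisect_left


def prefix_range(sorted_kmers, prefix):
    """Return every k-mer that starts with prefix, using two binary searches."""
    lo = bisect_left(sorted_kmers, prefix)
    # "~" sorts after every uppercase base letter, so prefix + "~" is an
    # upper bound for all strings that begin with prefix.
    hi = bisect_left(sorted_kmers, prefix + "~")
    return sorted_kmers[lo:hi]


kmers = sorted(["ATGCA", "ATGCG", "ATGGA", "CGATT", "GCTAA"])
print(prefix_range(kmers, "ATGC"))
```

Because matches form a contiguous run in sorted order, the cost is two O(log n) searches plus the size of the result.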

Hybrid Query Methods (Wildcard Support)

# Pattern-based hybrid query with {N} syntax
result = db.query_hybrid("ATGCG{N3}CGAT")

# Batch hybrid query
patterns = ["ATG{1}CGAT", "CGA{2}TGC"]
results = db.query_hybrid_batch(patterns)

# Parse pattern syntax
info = db.parse_pattern("ATGCG{N3}CGAT")
# Returns: {'prefix': 'ATGCG', 'suffix': 'CGAT', 'n_count': 3, ...}
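
To make the pattern syntax concrete, here is a small pure-Python sketch that splits a hybrid pattern into the three fields shown above. It is an assumption about the grammar, not the library's parser: it accepts both the {N3} and the {3} spellings because both appear in the examples, and it handles a single wildcard block only.

```python
import re

# Assumed grammar: <prefix>{N<count>}<suffix>, with the "N" optional.
_HYBRID = re.compile(r"^([ACGT]*)\{N?(\d+)\}([ACGT]*)$")


def parse_hybrid(pattern):
    """Split a hybrid pattern into prefix, suffix, and wildcard count."""
    match = _HYBRID.match(pattern)
    if match is None:
        raise ValueError(f"not a hybrid pattern: {pattern!r}")
    prefix, n_count, suffix = match.groups()
    return {"prefix": prefix, "suffix": suffix, "n_count": int(n_count)}


def matches_hybrid(info, kmer):
    """True if kmer has the right length and the fixed prefix and suffix."""
    expected = len(info["prefix"]) + info["n_count"] + len(info["suffix"])
    return (
        len(kmer) == expected
        and kmer.startswith(info["prefix"])
        and kmer.endswith(info["suffix"])
    )


info = parse_hybrid("ATGCG{N3}CGAT")
print(info)
print(matches_hybrid(info, "ATGCGTTTCGAT"))
```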

Fuzzy Query Methods

from pyrustkmer import PyFuzzyQuery

fuzzy = PyFuzzyQuery(db)
result = fuzzy.query_fuzzy("ATNNN", max_mutations=2)
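
The exact semantics of max_mutations are not spelled out here. As a rough sketch, assume that N matches any base and that max_mutations caps the number of mismatches at the remaining positions (a Hamming-distance filter). The helper names below are hypothetical and the code does not call the library.

```python
def mismatches(pattern, kmer):
    """Count positions where kmer differs from pattern, ignoring N wildcards."""
    if len(pattern) != len(kmer):
        raise ValueError("pattern and k-mer must have the same length")
    return sum(1 for p, base in zip(pattern, kmer) if p != "N" and p != base)


def fuzzy_filter(pattern, kmers, max_mutations):
    """Keep k-mers of matching length within max_mutations mismatches."""
    return [
        kmer
        for kmer in kmers
        if len(kmer) == len(pattern) and mismatches(pattern, kmer) <= max_mutations
    ]


candidates = ["ATCGA", "GCCGA", "GTTTT"]
print(fuzzy_filter("ATNNN", candidates, max_mutations=1))
```

Under these assumptions "GCCGA" is rejected because both fixed positions differ, while "GTTTT" passes with a single mismatch.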

Legacy API (Backward Compatible)

The legacy methods are still available but marked as deprecated:

# Old methods (still work but show deprecation warnings)
result = db.query("ATGCGATGCTAGCGCTAGCTAG")  # Use query_exact() instead
result = db.fuzzy_query("ATNNN", max_mutations=2)  # Use PyFuzzyQuery.query_fuzzy() instead
result = db.query_prefix_optimized("ATGCG")  # Use query_prefix() instead

Formatter Methods

All query results support multiple output formats:

PyQueryResult Formatting

result = db.query_exact("ATGCGATGCTAGCGCTAGCTAG")
print(result.to_json())
print(result.to_csv())
print(result.to_tsv())

PyPrefixQueryResult Formatting

results = db.query_prefix("ATGCG")
print(results.to_json())
print(results.to_csv())
print(results.to_tsv())
print(results.to_table())  # ASCII table format

PyFuzzyResult Formatting

fuzzy = PyFuzzyQuery(db)
result = fuzzy.query_fuzzy("ATNNN", max_mutations=2)
print(result.to_json())
print(result.to_csv())
print(result.to_tsv())

PyDatabaseStats Formatting

stats = db.get_stats()
print(stats.to_json())
print(stats.to_csv())
print(stats.to_tsv())
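
The formatter methods return strings, so their output can be written to a file or passed to another tool. The sketch below shows one plausible shape for such methods using only the standard library. The QueryResult class and its two fields are hypothetical stand-ins; the real result objects may expose more fields or lay out their output differently.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass


@dataclass
class QueryResult:
    """Hypothetical stand-in for a query result with two fields."""

    kmer: str
    count: int

    def to_json(self):
        return json.dumps(asdict(self))

    def _delimited(self, delimiter):
        # One header row and one value row, joined with the given delimiter.
        buffer = io.StringIO()
        writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
        row = asdict(self)
        writer.writerow(list(row.keys()))
        writer.writerow(list(row.values()))
        return buffer.getvalue()

    def to_csv(self):
        return self._delimited(",")

    def to_tsv(self):
        return self._delimited("\t")


result = QueryResult(kmer="ATGCG", count=12)
print(result.to_json())
print(result.to_csv(), end="")
print(result.to_tsv(), end="")
```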

API Reference

PyCounter

High-performance k-mer counter for counting k-mers in DNA sequences.

Available Methods:

add_kmer(kmer) - Add single k-mer
add_sequence(sequence) - Add from sequence string
add_from_fasta(path) - Read FASTA file
add_from_fastq(path) - Read FASTQ file
get_count(kmer) - Query specific k-mer count
get_all_counts() - Get all counts as dict
reset() - Clear counter
save_database(path) - Save to RKDB format
get_stats() - Get counter statistics
is_empty() - Check if empty
kmer_length (property) - Get k-mer size
canonical (property) - Get canonical mode flag

PyDatabase

Efficient database querying for k-mer count lookups.

Query Methods:

query_exact(kmer) - Single exact k-mer query
query_exact_batch(kmers) - Batch exact queries
query_prefix(prefix) - Prefix-based queries
query_prefix_batch(prefixes) - Batch prefix queries
query_hybrid(pattern) - Pattern queries with {N} wildcards
query_hybrid_batch(patterns) - Batch pattern queries
parse_pattern(pattern) - Parse hybrid pattern syntax

Fuzzy Query: Use PyFuzzyQuery class for fuzzy matching with mutations.

Utility Methods:

get_stats() - Get database statistics
get_memory_usage() - Get memory usage info
database_info() - Get database metadata
exists(kmer) - Check if k-mer exists
export_all_kmers() - Export all k-mers
dump(limit, offset) - Paginated database dump

PyFuzzyQuery

Advanced fuzzy matching with wildcard and mutation support.

Methods:

query_fuzzy(pattern, max_mutations) - Fuzzy k-mer query

PyFormatter

K-mer result formatting utilities.

Methods:

format_kmer(kmer) - Format k-mer string
format_count(count) - Format count result
canonical (property) - Get or set canonical k-mer mode

Test Suite

Test Coverage

PyCounter Tests (77 tests)

Basic creation, k-mer sizes, invalid inputs
Add k-mer, add sequence, get count
Reset, is_empty, get_stats
Canonical mode, FASTA/FASTQ files
Save database, edge cases, memory usage

Formatter Tests (52 tests)

PyQueryResult formatting (to_json, to_csv, to_tsv, to_dict)
PyPrefixQueryResult formatting (to_json, to_csv, to_tsv, to_table)
PyFuzzyResult formatting (to_json, to_csv, to_tsv)
PyDatabaseStats formatting (to_json, to_csv, to_tsv)
Format consistency, edge cases, integration tests

API Tests

Module import and class detection
Method signature verification
Legacy vs new API compatibility

Running Tests

# Run all tests with coverage
pytest pyo3/tests/ -v --cov=pyrustkmer

# Run specific test file
pytest pyo3/tests/test_counter.py -v

# Run with coverage report
pytest pyo3/tests/ --cov-report=term-missing --cov-report=html

Build Status

Current Version: 0.4.1

Build Status: ✅ Successful

All PyO3 compilation errors resolved
105 tests passing with 100% coverage
New unified API methods properly exported

Known Warnings (Non-blocking)

The following warnings don't affect functionality:

Unused type alias: RustKmerResult
Unused structs: QueryResultSerializable, PrefixQueryResultSerializable, etc.
Unused functions: validate_kmer, py_string_to_string, string_vec_to_py_list

Requirements

Python 3.11+
Rust toolchain (1.80+)
maturin build tool
numpy>=1.21

Version History

v0.4.1 (Current)

✅ Fixed Python module export issues
✅ Added unified query API methods (query_exact, query_prefix, etc.)
✅ Improved PyFuzzyQuery integration
✅ Enhanced formatter output options
✅ 105 tests passing, 100% coverage

v0.4.0

Initial PyO3 binding release
Basic k-mer counting and querying
Multiple load modes (Preload, MemoryMapped, Lazy)
Fuzzy query support

Migration Guide

Upgrading from v0.4.0 to v0.4.1

Old API (Still works but deprecated):

result = db.query("ATGCGATGCTAGCGCTAGCTAG")
result = db.fuzzy_query("ATNNN", max_mutations=2)
result = db.query_prefix_optimized("ATGCG")

New API (Recommended):

result = db.query_exact("ATGCGATGCTAGCGCTAGCTAG")
result = PyFuzzyQuery(db).query_fuzzy("ATNNN", max_mutations=2)
result = db.query_prefix("ATGCG")
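
While migrating, it can help to list which deprecated calls a piece of code still makes. The sketch below records warnings with the standard warnings module. It assumes the legacy methods emit DeprecationWarning, which is not confirmed here; legacy_query is a hypothetical stand-in for a deprecated method such as db.query().

```python
import warnings


def legacy_query(kmer):
    """Hypothetical stand-in for a deprecated method such as db.query()."""
    warnings.warn(
        "query() is deprecated; use query_exact() instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return 0


def collect_deprecations(func, *args, **kwargs):
    """Call func and return the DeprecationWarnings raised during the call."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record even normally hidden warnings
        func(*args, **kwargs)
    return [w for w in caught if issubclass(w.category, DeprecationWarning)]


found = collect_deprecations(legacy_query, "ATGCG")
for warning in found:
    print(warning.message)
```

In a test suite, running Python with -W error::DeprecationWarning turns such warnings into exceptions, so remaining legacy calls fail loudly.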

Performance Benefits

The unified API provides:

66% memory reduction: One PyDatabase instance serves every query type instead of three
3x faster loading: The database is loaded once, with no duplicate loads
Simplified API: Single entry point for all query types
Better batch processing: Batch methods reduce the number of Python/Rust boundary crossings

License

MIT License

This version: 0.5.2 (Feb 28, 2026)

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

pyrustkmer-0.5.2-cp312-cp312-macosx_11_0_arm64.whl (651.6 kB)

Uploaded Feb 28, 2026; CPython 3.12, macOS 11.0+ ARM64

File details

Details for the file pyrustkmer-0.5.2-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

Download URL: pyrustkmer-0.5.2-cp312-cp312-macosx_11_0_arm64.whl
Upload date: Feb 28, 2026
Size: 651.6 kB
Tags: CPython 3.12, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for pyrustkmer-0.5.2-cp312-cp312-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`aa7cc109ffaf1d8751776c5f3dbaf7769bcd297c130897d8a7beb47eb9415700`
MD5	`2c6a5f947af647e33be10e6b219b236a`
BLAKE2b-256	`38ca286f3ec318afc0d4c57fc095b9a12f3fce5e04cb3b70c7fe5c0f61e3cdb4`
