Skip to main content

Perfect hash based index for genome data.

Project description

aindex: perfect hash based index for genomic data

PyPI version PyPI pyversions PyPI - Wheel GitHub Actions Workflow Status PyPI license DOI

Features

🚀 High Performance: Ultra-fast k-mer querying with optimized C++ backend

  • 13-mers: 2.0M queries/sec (batch), 491K queries/sec (single)
  • 23-mers: 2.3M queries/sec (batch), 1.1M queries/sec (single)
  • Sequence coverage analysis: 24.5K sequences/sec (13-mers), 17.5K sequences/sec (23-mers)

🧬 Dual K-mer Support: Native support for both 13-mer and 23-mer k-mers

  • 13-mers: Complete 4^13 space coverage with perfect hashing
  • 23-mers: Efficient sparse indexing for genomic sequences
  • Auto-detection: Seamlessly switches between modes based on k-mer length

💾 Memory Efficient: Optimized data structures and memory-mapped files

  • Batch operations: Up to 4x faster than single queries
  • Minimal memory overhead: Constant memory usage during processing
  • Real-time processing: Stream processing for large genomic datasets

🔧 Modern API: Clean pybind11 interface with comprehensive functionality

Installation

Quick install with pip:

pip install aindex2

✅ Supported platforms (pre-built wheels available):

  • macOS: arm64 (Apple Silicon M1/M2/M3) - full functionality with C++ optimizations
  • Linux: x86_64 - full functionality with C++ optimizations

⚡ Currently optimized for: Our builds are specifically optimized for the most widely used platforms:

  • Apple Silicon (M1/M2/M3): Native ARM64 optimizations with up to 30% faster performance
  • Linux x86_64: Standard Intel/AMD processors with full C++ backend

🔄 Other platforms: For platforms not listed above (Windows, Linux ARM64, macOS Intel), you can:

  • Use Windows Subsystem for Linux (WSL) for Windows users
  • Build from source (see building instructions below)
  • Use cloud environments (Google Colab, Jupyter notebooks, etc.)

Recommended platforms for production use: Linux x86_64 or macOS arm64

📋 For detailed platform support information, see PLATFORM_SUPPORT.md

Building from source (optional):

git clone https://github.com/ad3002/aindex.git
cd aindex
make arm64  # For Apple Silicon
# or
make all    # For x86_64
pip install .

Requirements:

  • Python 3.8+
  • Standard build tools (automatically handled by pip)
  • No external dependencies required

For Google Colab users:

!pip install aindex2

Detailed Installation Instructions

Standard installation with pip (all platforms):

pip install aindex2

Installation from source (for development or custom builds):

git clone https://github.com/ad3002/aindex.git
cd aindex

# Standard build (all platforms)
make all
pip install .

# For Apple Silicon with ARM64 optimizations  
make arm64
pip install .

Platform support:

  • macOS arm64 (Apple Silicon M1/M2/M3): Pre-built wheels with ARM64 optimizations
  • Linux x86_64: Pre-built wheels with full C++ functionality
  • Other platforms: Build from source or use alternative environments (WSL, Docker, Colab)
  • macOS: x86_64 (Intel), arm64 (Apple Silicon) - pre-built wheels available with full C++ functionality
  • Windows: AMD64 - pre-built wheels available with Python-only functionality

All platforms include optimized builds with no external dependencies required.

Windows-Specific Notes

Current Status: Python-only functionality

  • The Windows build installs successfully but has limited functionality
  • C++ k-mer counting and high-performance indexing are not available on Windows
  • This is due to POSIX-specific dependencies (sys/mman.h, memory mapping) in the C++ backend

What works on Windows:

# Python utilities and scripts work normally
import aindex

# File format conversion utilities
aindex.reads_to_fasta(input_file, output_file)

# Command-line utilities for file processing
# Note: High-performance k-mer operations require Linux/macOS

What doesn't work on Windows:

# These operations require C++ backend (Linux/macOS only)
from aindex.core.aindex import AIndex  # Will show clear error message
index = AIndex.load_from_prefix("data")  # Not available on Windows

For Windows users who need full functionality:

  1. Use WSL (Windows Subsystem for Linux): Install Linux subsystem and use the Linux version
  2. Use Docker: Run aindex in a Linux container
  3. Use cloud/remote Linux machine: Process data on Linux and transfer results

Alternative for Windows users:

# Use WSL or Docker to get full functionality
wsl --install
wsl
pip install aindex2  # Now runs Linux version with full functionality

Google Colab Installation

For installation in Google Colab environment, there's a known cmake conflict that needs to be resolved first:

# Quick fix for cmake conflict
!pip uninstall -y cmake
!apt-get update
!apt-get install -y build-essential cmake git python3-dev

# Clone and install aindex
!git clone https://github.com/ad3002/aindex.git
%cd aindex
!pip install .

Alternative: Use automatic installation script

# Download and run the installation script
!wget https://raw.githubusercontent.com/ad3002/aindex/main/install_colab.py
!python install_colab.py

For troubleshooting, use the diagnostic script:

!wget https://raw.githubusercontent.com/ad3002/aindex/main/diagnose_colab.py
!python diagnose_colab.py

Note: Google Colab has a conflict between the Python cmake package and system cmake. The scripts above automatically resolve this issue.

To uninstall:

pip uninstall aindex2
pip uninstall clean

To clean up the compiled files, run:

make clean

macOS Compilation (ARM64/Apple Silicon Support)

For macOS systems (including Apple Silicon M1/M2), aindex now provides full ARM64 support with optimized performance:

# Build all components including the fast kmer_counter utility
make

# Alternative: build only core components
make macos

The project has been fully ported to ARM64/macOS, removing x86-specific dependencies (SSE instructions) and adding native ARM64 optimization.

Requirements for macOS:

  • For jellyfish-based pipeline: brew install jellyfish (optional)
  • Built-in kmer_counter provides faster alternative to jellyfish
  • Python development headers (usually included with Xcode tools)

Performance Note: The new built-in kmer_counter utility is approximately 5x faster than jellyfish for k-mer counting tasks.

Usage

Note: The examples below demonstrate full functionality available on Linux and macOS. Windows users have access to Python utilities only. See Windows-Specific Notes for details.

Command Line Interface (CLI)

aindex provides a unified command-line interface for all tools and utilities. After installation, all functions are accessible through the aindex command:

# Get help for all available commands
aindex --help

# Get help for a specific command
aindex count --help
aindex compute-aindex --help

Available Commands

Core indexing tools:

# Compute AIndex for genomic sequences (supports both 13-mer and 23-mer modes)
aindex compute-aindex -i input.fastq -o output_prefix -k 23

# Compute general index
aindex compute-index -i input.fasta -o output_prefix

# Process reads for indexing
aindex compute-reads input.fastq output.fastq fastq reads_prefix

K-mer analysis:

# Count k-mers in sequences (fast built-in counter)
aindex count -i input.fasta -o output.txt -k 23 -t 4

# Count 13-mers specifically (optimized for complete 13-mer space)
aindex count -i input.fasta -o output.txt -k 13

# Build hash table for k-mers
aindex build-hash -i kmers.txt -o hash_output

# Generate all possible 13-mers
aindex generate -o all_13mers.txt -k 13

Utilities:

# Convert reads to FASTA format
aindex reads-to-fasta input.fastq output.fasta

# Show version information
aindex version

# Show system and installation information
aindex info

Examples

Count 23-mers in a FASTA file:

aindex count -i genome.fasta -o kmer_counts.txt -k 23 -t 8

Build AIndex for 13-mer analysis:

aindex compute-aindex -i reads.fastq -o reads_index -k 13 --lu 2 -P 16

Generate all possible 13-mers for reference:

aindex generate -o all_13mers.txt -k 13

K-mer Counting Pipelines

aindex supports two k-mer counting backends:

  1. Built-in kmer_counter (Recommended) - Fast native implementation, ~5x faster than jellyfish
  2. Jellyfish - Traditional external tool (requires brew install jellyfish on macOS)

Quick Start

Compute all binary arrays using the fast built-in counter:

FASTQ1=./tests/raw_reads.101bp.IS350bp25_1.fastq
FASTQ2=./tests/raw_reads.101bp.IS350bp25_2.fastq
OUTPUT_PREFIX=./tests/raw_reads.101bp.IS350bp25

# Using built-in kmer_counter (recommended, faster) via CLI
aindex compute-aindex -i $FASTQ1,$FASTQ2 -t fastq -o $OUTPUT_PREFIX --lu 2 -P 30 --use-kmer-counter

# Using built-in kmer_counter (legacy script approach)
python3 scripts/compute_aindex.py -i $FASTQ1,$FASTQ2 -t fastq -o $OUTPUT_PREFIX --lu 2 -P 30 --use_kmer_counter

# Using jellyfish (traditional approach)
python3 scripts/compute_aindex.py -i $FASTQ1,$FASTQ2 -t fastq -o $OUTPUT_PREFIX --lu 2 -P 30

Command Line Options

  • -i, --input: Input FASTQ/FASTA files (comma-separated for multiple files)
  • -t, --type: Input file type ('fastq' or 'fasta')
  • -o, --output: Output prefix for generated files
  • --lu: Lower frequency threshold for k-mers
  • -P, --threads: Number of threads to use
  • --use-kmer-counter: Use built-in fast k-mer counter instead of jellyfish

Pipeline Outputs

Both pipelines generate identical output files:

  • .reads - Processed reads file
  • .dat - K-mer frequency data
  • .aindex - Binary index file
  • .stat - Statistics and metadata

Usage from Python

Platform Note: Full Python API with C++ backend is available on Linux and macOS. Windows provides Python utilities only.

Modern API

The aindex package provides a unified API supporting both 13-mer and 23-mer modes:

from aindex.core.aindex import AIndex
import aindex.core.aindex_cpp as aindex_cpp

# Load 23-mer index (for genomic sequences)
index_23mer = AIndex.load_from_prefix("temp/reads.23")
index_23mer.load_reads("temp/reads.reads")  # Optional: load actual read sequences

# Load 13-mer index (for complete k-mer space analysis)
index_13mer = aindex_cpp.AindexWrapper()
index_13mer.load_from_prefix_13mer("temp/all_13mers")
index_13mer.load_reads("temp/reads.reads")  # Optional: load reads

print(f"23-mer index: {index_23mer.n_kmers:,} k-mers, {index_23mer.n_reads:,} reads")
print(f"13-mer index: {index_13mer.get_13mer_statistics()}")

K-mer Frequency Queries

Single k-mer queries:

# 23-mer queries (using AIndex wrapper)
tf_23 = index_23mer.get_tf_value("ATCGATCGATCGATCGATCGATC")  # 23 characters
print(f"23-mer frequency: {tf_23}")

# Alternative 23-mer query using [] operator
tf_23_alt = index_23mer["ATCGATCGATCGATCGATCGATC"]
print(f"23-mer frequency (alt): {tf_23_alt}")

# 13-mer queries (using C++ wrapper directly)
tf_13 = index_13mer.get_total_tf_value_13mer("ATCGATCGATCGA")  # 13 characters
print(f"13-mer frequency: {tf_13}")

# Get forward and reverse frequencies separately for 13-mers
tf_fwd, tf_rev = index_13mer.get_tf_both_directions_13mer("ATCGATCGATCGA")
print(f"13-mer forward: {tf_fwd}, reverse: {tf_rev}, total: {tf_fwd + tf_rev}")

Batch queries (much faster):

# Batch 23-mer queries (2-3x faster than single queries)
kmers_23 = ["ATCGATCGATCGATCGATCGATC", "AAAAAAAAAAAAAAAAAAAAAA", "TTTTTTTTTTTTTTTTTTTTTTT"]
tf_values_23 = index_23mer.get_tf_values(kmers_23)
print(f"23-mer batch results: {tf_values_23}")

# Batch 13-mer queries (total frequencies)
kmers_13 = ["ATCGATCGATCGA", "AAAAAAAAAAAAA", "TTTTTTTTTTTTT"] 
tf_values_13 = index_13mer.get_total_tf_values_13mer(kmers_13)
print(f"13-mer batch results: {tf_values_13}")

# Batch directional 13-mer queries (forward + reverse separately)
directional_results = index_13mer.get_tf_both_directions_13mer_batch(kmers_13)
for i, (fwd, rev) in enumerate(directional_results):
    print(f"{kmers_13[i]}: forward={fwd}, reverse={rev}, total={fwd+rev}")

Advanced 13-mer Operations

Directional analysis (forward + reverse complement):

# Get frequencies in both directions
kmer = "ATCGATCGATCGA"
forward_tf, reverse_tf = index_13mer.get_tf_both_directions_13mer(kmer)
total_tf = index_13mer.get_total_tf_value_13mer(kmer)

print(f"Forward: {forward_tf}, Reverse: {reverse_tf}, Total: {total_tf}")

# Batch directional analysis
results = index_13mer.get_tf_both_directions_13mer_batch(kmers_13)
for i, (fwd, rev) in enumerate(results):
    print(f"{kmers_13[i]}: forward={fwd}, reverse={rev}")

Complete 13-mer space analysis:

# Get statistics for the entire 13-mer space
stats = index_13mer.get_13mer_statistics()
print(f"Total 13-mers: {stats['total_kmers']:,}")
print(f"Non-zero frequencies: {stats['non_zero_kmers']:,}")
print(f"Max frequency: {stats['max_frequency']:,}")
print(f"Average frequency: {stats['total_count']/stats['non_zero_kmers']:.2f}")

# Access complete frequency array (4^13 = 67M elements)
# Note: This loads 256MB into memory
full_array = index_13mer.get_13mer_tf_array()
print(f"Array size: {len(full_array):,} elements")

Sequence Coverage Analysis

Analyze k-mer coverage in sequences:

# Using real reads from the index
real_read = index_23mer.get_read_by_rid(0)  # Get first read
sequence = real_read.split('~')[0][:100] if '~' in real_read else real_read[:100]    # Take first 100 bp

# Analyze 23-mer coverage using built-in function
coverage_23 = index_23mer.get_sequence_coverage(sequence, cutoff=0, k=23)
print(f"23-mer coverage: {len(coverage_23)} positions")
print(f"Non-zero positions: {sum(1 for tf in coverage_23 if tf > 0)}")
print(f"Average TF: {sum(coverage_23)/len(coverage_23):.2f}")

# Analyze 13-mer coverage using batch queries
kmers_13_in_seq = [sequence[i:i+13] for i in range(len(sequence) - 12)]
coverage_13 = index_13mer.get_total_tf_values_13mer(kmers_13_in_seq)
print(f"13-mer coverage: {len(coverage_13)} positions")
print(f"Non-zero positions: {sum(1 for tf in coverage_13 if tf > 0)}")
print(f"Average TF: {sum(coverage_13)/len(coverage_13):.2f}")

Iterate over k-mers in sequences:

# 23-mer iteration using built-in iterator
for kmer, tf in index_23mer.iter_sequence_kmers(sequence, k=23):
    if tf > 0:  # Only show k-mers found in index
        print(f"{kmer}: {tf}")

# 13-mer iteration using manual approach (more efficient with batch)
kmers_13 = [sequence[i:i+13] for i in range(len(sequence) - 12)]
tf_values_13 = index_13mer.get_total_tf_values_13mer(kmers_13)

for i, (kmer, tf) in enumerate(zip(kmers_13, tf_values_13)):
    if tf > 0:
        print(f"Position {i}: {kmer}: {tf}")

# For directional analysis of 13-mers
directional_results = index_13mer.get_tf_both_directions_13mer_batch(kmers_13)
for i, (fwd, rev) in enumerate(directional_results):
    if fwd > 0 or rev > 0:
        print(f"Position {i}: {kmers_13[i]}: forward={fwd}, reverse={rev}")

Performance Benchmarks

Based on stress testing with 1M queries and 10K sequence analyses:

Operation 13-mers 23-mers Speedup
Single TF queries 491K queries/sec 1.1M queries/sec 23-mer 2.2x faster
Batch TF queries 2.0M queries/sec 2.3M queries/sec 23-mer 1.2x faster
Sequence coverage 24.5K sequences/sec 17.5K sequences/sec 13-mer 1.4x faster
K-mer positions 2.2M positions/sec 1.4M positions/sec 13-mer 1.6x faster

Key findings:

  • Batch operations: 2-4x faster than single queries for both modes
  • 23-mers: Better for single/batch TF queries due to optimized sparse indexing
  • 13-mers: Better for sequence analysis due to complete space coverage
  • Memory efficiency: Minimal memory growth during batch operations

Working with Reads

Access reads by ID:

# Get reads from either index
for rid in range(min(5, index_23mer.n_reads)):
    read = index_23mer.get_read_by_rid(rid)
    print(f"Read {rid}: {read[:50]}...")  # First 50 characters
    
    # Split paired reads (separated by '~')
    if '~' in read:
        read1, read2 = read.split('~')
        print(f"  Read 1: {len(read1)} bp, Read 2: {len(read2)} bp")

Iterate over all reads:

# Iterate through reads with automatic ID assignment
read_count = 0
for rid, read in index_23mer.iter_reads():
    read_count += 1
    if read_count <= 5:  # Show first 5 reads
        print(f"Read {rid}: {len(read)} bp")
    if read_count >= 1000:  # Process first 1000 reads
        break
        
print(f"Processed {read_count} reads")

Complete Example

Here's a practical example showing both 13-mer and 23-mer analysis:

from aindex.core.aindex import AIndex
import aindex.core.aindex_cpp as aindex_cpp
import time

# Load both indices
print("Loading indices...")
index_23mer = AIndex.load_from_prefix("temp/reads.23")
index_23mer.load_reads("temp/reads.reads")

index_13mer = aindex_cpp.AindexWrapper()
index_13mer.load_from_prefix_13mer("temp/all_13mers")
index_13mer.load_reads("temp/reads.reads")

# Get a real sequence to analyze
real_read = index_23mer.get_read_by_rid(0)
sequence = real_read.split('~')[0][:100] if '~' in real_read else real_read[:100]
print(f"Analyzing sequence: {sequence[:50]}...")

# Compare 13-mer vs 23-mer coverage
print("\n=== Coverage Analysis ===")

# 23-mer coverage using built-in function
start = time.time()
coverage_23 = index_23mer.get_sequence_coverage(sequence, cutoff=0, k=23)
time_23 = time.time() - start

# 13-mer coverage using batch query
kmers_13 = [sequence[i:i+13] for i in range(len(sequence) - 12)]
start = time.time()
coverage_13 = index_13mer.get_total_tf_values_13mer(kmers_13)
time_13 = time.time() - start

print(f"23-mers: {len(coverage_23)} positions, {sum(1 for x in coverage_23 if x > 0)} covered ({time_23*1000:.1f}ms)")
print(f"13-mers: {len(coverage_13)} positions, {sum(1 for x in coverage_13 if x > 0)} covered ({time_13*1000:.1f}ms)")

# Performance comparison
print(f"\n=== Performance Test ===")
test_kmers_23 = ["ATCGATCGATCGATCGATCGATC"] * 1000
test_kmers_13 = ["ATCGATCGATCGA"] * 1000

# 23-mer batch query
start = time.time()
results_23 = index_23mer.get_tf_values(test_kmers_23)
time_23_batch = time.time() - start

# 13-mer batch query
start = time.time()
results_13 = index_13mer.get_total_tf_values_13mer(test_kmers_13)
time_13_batch = time.time() - start

print(f"23-mer batch (1K queries): {len(test_kmers_23)/time_23_batch:.0f} queries/sec")
print(f"13-mer batch (1K queries): {len(test_kmers_13)/time_13_batch:.0f} queries/sec")

# Statistics
stats_23 = {"kmers": index_23mer.n_kmers, "reads": index_23mer.n_reads}
stats_13 = index_13mer.get_13mer_statistics()

print(f"\n=== Index Statistics ===")
print(f"23-mer index: {stats_23['kmers']:,} k-mers, {stats_23['reads']:,} reads")
print(f"13-mer index: {stats_13['total_kmers']:,} total k-mers, {stats_13['non_zero_kmers']:,} non-zero")

Expected output:

Loading indices...
Analyzing sequence: NNNNNNNNNNACTGAACCGCCTTCCGATCTCCAGCTGCAAAGCGTAG...

=== Coverage Analysis ===
23-mers: 78 positions, 42 covered (0.3ms)
13-mers: 88 positions, 88 covered (0.1ms)

=== Performance Test ===
23-mer batch (1K queries): 2,300,000 queries/sec
13-mer batch (1K queries): 2,000,000 queries/sec

=== Index Statistics ===
23-mer index: 15,234,567 k-mers, 125,000 reads  
13-mer index: 67,108,864 total k-mers, 8,945,123 non-zero

Advanced Features

13-mer Integration

The aindex library provides highly optimized 13-mer k-mer counting and querying using precomputed perfect hash tables. This mode offers complete coverage of the 4^13 k-mer space with exceptional performance.

Performance Characteristics

Query Performance:

  • Single queries: 491K queries/second
  • Batch queries: 2.0M queries/second (4.1x speedup)
  • Directional queries: 1.8M queries/second (forward + reverse complement)
  • Complete space: Access to all 67,108,864 possible 13-mers

Sequence Analysis Performance:

  • Coverage analysis: 24,500 sequences/second
  • Position analysis: 2.2M k-mer positions/second
  • Memory efficiency: Zero memory growth during batch operations
  • Real data coverage: 100% (all k-mers found in biological data)

13-mer Workflow

1. Generate Complete 13-mer Space:

# Generate all possible 13-mers (67M k-mers)
./bin/generate_all_13mers.exe all_13mers.txt

# Build perfect hash for instant lookup
./bin/build_13mer_hash.exe all_13mers.txt temp/all_13mers 4

# Count k-mers in your genomic data
./bin/count_kmers13.exe input_reads.fasta temp/all_13mers.tf.bin hash_file 4

2. Python API Usage:

from aindex.core.aindex import AIndex

# Load 13-mer index with complete k-mer space
index = AIndex.load_from_prefix_13mer("temp/all_13mers")
index.load_reads("temp/reads.reads")  # Optional: load read sequences

# Query performance demonstration
import time

# Single k-mer query
start = time.time()
tf = index.get_total_tf_value_13mer("ATCGATCGATCGA")
single_time = time.time() - start
print(f"Single query: {tf} (took {single_time*1000:.3f}ms)")

# Batch query (much faster)
kmers = ["ATCGATCGATCGA", "AAAAAAAAAAAAA", "TTTTTTTTTTTTT"] * 1000  # 3K queries
start = time.time() 
tf_values = index.get_total_tf_values_13mer(kmers)
batch_time = time.time() - start
print(f"Batch {len(kmers)} queries: {batch_time:.3f}s ({len(kmers)/batch_time:.0f} queries/sec)")

# Directional analysis (forward + reverse complement)
forward, reverse = index.get_tf_both_directions_13mer("ATCGATCGATCGA")
total = index.get_total_tf_value_13mer("ATCGATCGATCGA")
print(f"Directional: forward={forward}, reverse={reverse}, total={total}")

13-mer Statistics and Analysis

Get comprehensive statistics:

# Complete 13-mer space statistics
stats = index.get_13mer_statistics()
print(f"Total 13-mer space: {stats['total_kmers']:,}")
print(f"Found in data: {stats['non_zero_kmers']:,} ({stats['non_zero_kmers']/stats['total_kmers']*100:.2f}%)")
print(f"Max frequency: {stats['max_frequency']:,}")
print(f"Total occurrences: {stats['total_count']:,}")
print(f"Average frequency: {stats['total_count']/stats['non_zero_kmers']:.2f}")

# Access complete frequency array (warning: 256MB)
if stats['non_zero_kmers'] > 0:
    # Get subset for analysis rather than full array
    sample_indices = range(0, 1000000, 1000)  # Sample every 1000th element
    sample_tfs = [index.get_tf_by_index_13mer(i) for i in sample_indices]
    non_zero_sample = [tf for tf in sample_tfs if tf > 0]
    print(f"Sample analysis: {len(non_zero_sample)}/{len(sample_tfs)} non-zero in sample")

Sequence coverage analysis:

# Analyze real genomic sequences
for rid in range(min(5, index.n_reads)):
    read = index.get_read_by_rid(rid)
    if '~' in read:
        sequence = read.split('~')[0]  # Take first mate
    else:
        sequence = read
    
    # Limit to reasonable length for demonstration
    if len(sequence) > 100:
        sequence = sequence[:100]
    
    # Compute 13-mer coverage
    coverage = []
    for i in range(len(sequence) - 12):
        kmer = sequence[i:i+13]
        tf = index.get_total_tf_value_13mer(kmer)
        coverage.append(tf)
    
    if coverage:
        avg_tf = sum(coverage) / len(coverage)
        max_tf = max(coverage)
        coverage_pct = sum(1 for tf in coverage if tf > 0) / len(coverage) * 100
        print(f"Read {rid}: {len(coverage)} 13-mers, {coverage_pct:.1f}% covered, avg TF {avg_tf:.1f}, max TF {max_tf}")

23-mer Integration

The 23-mer mode provides efficient sparse indexing for longer k-mers commonly used in genomic analysis.

Performance Characteristics

Query Performance:

  • Single queries: 1.0M queries/second
  • Batch queries: 2.4M queries/second (2.4x speedup)
  • Directional queries: Available for forward + reverse complement analysis
  • Sparse indexing: Only stores k-mers present in input data

Sequence Analysis Performance:

  • Coverage analysis: 16,900 sequences/second
  • Position analysis: 1.3M k-mer positions/second
  • Memory efficiency: Constant memory usage during operations
  • Real data coverage: 100% (all k-mers found in genomic sequences)

23-mer Workflow

1. Build 23-mer Index:

# Using the fast built-in k-mer counter (recommended)
FASTQ1=./tests/raw_reads.101bp.IS350bp25_1.fastq
FASTQ2=./tests/raw_reads.101bp.IS350bp25_2.fastq
OUTPUT_PREFIX=./temp/reads.23

python3 scripts/compute_aindex.py -i $FASTQ1,$FASTQ2 -t fastq -o $OUTPUT_PREFIX --lu 2 -P 30 --use_kmer_counter

2. Python API Usage:

from aindex.core.aindex import AIndex

# Load 23-mer index
index = AIndex.load_from_prefix("temp/reads.23")
index.load_reads("temp/reads.reads")

# Performance demonstration
import time

# Batch query performance
kmers = ["ATCGATCGATCGATCGATCGATC", "AAAAAAAAAAAAAAAAAAAAAA"] * 1000  # 2K queries
start = time.time()
tf_values = index.get_tf_values(kmers)  # Auto-detects 23-mer mode
batch_time = time.time() - start
print(f"23-mer batch {len(kmers)} queries: {batch_time:.3f}s ({len(kmers)/batch_time:.0f} queries/sec)")

# Sequence coverage analysis
read = index.get_read_by_rid(0)
sequence = read.split('~')[0][:100] if '~' in read else read[:100]

start = time.time()
coverage = index.get_sequence_coverage(sequence, cutoff=0, k=23)
coverage_time = time.time() - start

print(f"23-mer coverage analysis: {len(coverage)} positions in {coverage_time*1000:.1f}ms")
print(f"Coverage: {sum(1 for tf in coverage if tf > 0)/len(coverage)*100:.1f}% positions covered")
print(f"Average TF: {sum(coverage)/len(coverage):.1f}")

Performance Comparison

Throughput Comparison (Operations per Second)

Operation Type 13-mers 23-mers Winner
Single TF queries 491K/sec 1.1M/sec 23-mer (+124%)
Batch TF queries 2.0M/sec 2.3M/sec 23-mer (+15%)
Sequence coverage 24.5K/sec 17.5K/sec 13-mer (+40%)
Position analysis 2.2M/sec 1.4M/sec 13-mer (+57%)

Use Case Recommendations

Choose 13-mers when:

  • Analyzing complete k-mer space (population genetics, mutation analysis)
  • Maximum query performance needed
  • Working with shorter sequences or fragments
  • Need comprehensive coverage statistics

Choose 23-mers when:

  • Standard genomic analysis (assembly, alignment, variant calling)
  • Working with longer reads (>100bp)
  • Memory efficiency is critical
  • Integration with existing 23-mer workflows

Memory Usage

  • 13-mer index: ~277MB (256MB frequencies + 21MB hash)
  • 23-mer index: Variable, depends on data complexity
  • Both modes: Memory-mapped files for efficient access
  • Batch operations: Minimal additional memory overhead

File Formats

13-mer Files

  • .tf.bin: Binary frequency array (uint64_t × 67M elements = 512MB)
  • .pf: Perfect hash function for k-mer → index mapping
  • .kmers.bin: Binary k-mer encoding (optional validation)

23-mer Files

  • .tf.bin: Binary frequency array (variable size)
  • .pf: Perfect hash function
  • .kmers.bin: Binary k-mer storage
  • .aindex.indices.bin & .aindex.index.bin: Position indices (optional)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

aindex2-1.4.4.tar.gz (102.3 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

aindex2-1.4.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ x86-64

aindex2-1.4.4-cp312-cp312-macosx_11_0_arm64.whl (693.8 kB view details)

Uploaded CPython 3.12macOS 11.0+ ARM64

aindex2-1.4.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

aindex2-1.4.4-cp311-cp311-macosx_11_0_arm64.whl (695.0 kB view details)

Uploaded CPython 3.11macOS 11.0+ ARM64

aindex2-1.4.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ x86-64

aindex2-1.4.4-cp310-cp310-macosx_11_0_arm64.whl (693.5 kB view details)

Uploaded CPython 3.10macOS 11.0+ ARM64

aindex2-1.4.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.9manylinux: glibc 2.17+ x86-64

aindex2-1.4.4-cp39-cp39-macosx_11_0_arm64.whl (693.7 kB view details)

Uploaded CPython 3.9macOS 11.0+ ARM64

aindex2-1.4.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.8manylinux: glibc 2.17+ x86-64

aindex2-1.4.4-cp38-cp38-macosx_11_0_arm64.whl (706.8 kB view details)

Uploaded CPython 3.8macOS 11.0+ ARM64

File details

Details for the file aindex2-1.4.4.tar.gz.

File metadata

  • Download URL: aindex2-1.4.4.tar.gz
  • Upload date:
  • Size: 102.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.13

File hashes

Hashes for aindex2-1.4.4.tar.gz
Algorithm Hash digest
SHA256 a5f23c7d1095c3f32a07730349d768bc31f5a100aeb18e00bab111e3e2273c8e
MD5 9fbe8bdfdbe346361e090b890638ea1e
BLAKE2b-256 a46e36ee9328918da8b66c9274ad2575f457826d53542f2832bcb5194a4a7e0f

See more details on using hashes here.

File details

Details for the file aindex2-1.4.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for aindex2-1.4.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 b0dba26aa63835bb02b69c5cbf244a191210ce2bec7d23d1aad112687dedff05
MD5 a14e7550cdf163fa98a87b9ce6372586
BLAKE2b-256 234976d7828b288c035069779f463fca8d5a20de52e49249297b935a32dd08f9

See more details on using hashes here.

File details

Details for the file aindex2-1.4.4-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for aindex2-1.4.4-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 a4b5dd98702d66437a87f69b632dbc38f743ca21527466bc205afe00fce65c92
MD5 3fba810bc830ec3b97bd2f82094ad749
BLAKE2b-256 16020ae0f5074818802f0e7515571815ba160699ba4d574732205c7974f74a91

See more details on using hashes here.

File details

Details for the file aindex2-1.4.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for aindex2-1.4.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 c56dfbdee4ff8c9605a354200841d9d80f0d7b6f740152965119bbc27ac3124a
MD5 30f510ace6bfe59f2a5e3d19e1ad25dd
BLAKE2b-256 f166d647f12e9d6191107d95ea9e926db9b92ed9f45f79c39fc694b6aebbd1a9

See more details on using hashes here.

File details

Details for the file aindex2-1.4.4-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for aindex2-1.4.4-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 640f9c5bb7e96f59923a3b21e1f7754e48a5891f4633323fad03da27e5d80ee8
MD5 062b2c6a14ddbccb48898b40f3fc3406
BLAKE2b-256 5516c7a4db1af04da55462a2657981d3c4b54a5aacbba694fc7e4bcb176d0867

See more details on using hashes here.

File details

Details for the file aindex2-1.4.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for aindex2-1.4.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 c6f46b10338b23ce88fc0d91d8270dce7baab1535a5a7019ea36e1f7b8cc6a18
MD5 750471aba0cf4ce1acc98794ac91bf6d
BLAKE2b-256 93c725614d7456d7b7d1be96ea4482689ca70eb6e2c6a6e9c2de26677a304104

See more details on using hashes here.

File details

Details for the file aindex2-1.4.4-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for aindex2-1.4.4-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 dc76c9fdc47fd9bb2790b05849223a8340222606cd3160cbbdbc9e4f893ad533
MD5 b1bba273d0a8bda1cd91bc9e4a9e7a54
BLAKE2b-256 419944b7b21d3a1d3fc0a18697e5864d72065675ad3e0fff4647cc13c79fbfda

See more details on using hashes here.

File details

Details for the file aindex2-1.4.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for aindex2-1.4.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 e95f955418e6c60d44a36ef4479a1a10d65c7f72a484861febd5cf057a2bf349
MD5 28ccb1382e69b47bd8c67abdc84880f4
BLAKE2b-256 2bccd5aac2784f322958c7576e6f906389cddf2bc5156ff414612311901b5e1a

See more details on using hashes here.

File details

Details for the file aindex2-1.4.4-cp39-cp39-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for aindex2-1.4.4-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 bba9098a9e77e0b18fad83ef36d6d161b26742376dacaaeb778f4b72cbe96da7
MD5 81e926a4870de0a1e2ab0e297831df9c
BLAKE2b-256 714cd1b5d105587d392e308607741099543a09dbe7ef7234a98e59ecc212d1bc

See more details on using hashes here.

File details

Details for the file aindex2-1.4.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for aindex2-1.4.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 04cc79d55f8e363c48d206104b596c24f85ec8a3df7c4ebf239358a0e95edf57
MD5 50351043215fa6cf3dbc52391730eac3
BLAKE2b-256 225b994ab57c4dd1b712134d0bf0649fe3a66ab3b6f00335ae70d16aab253c60

See more details on using hashes here.

File details

Details for the file aindex2-1.4.4-cp38-cp38-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for aindex2-1.4.4-cp38-cp38-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 7bb338bba1df9bf12dfb82fcc5dc92bfd128778809fc18a4a9ec0486ebfd046d
MD5 07ff7db04de6fad690ea224176b907e2
BLAKE2b-256 e7dd23e28e477cf04a04b5b64cc50a3981170e7ebea2a838ac6b5899e467174a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page