Perfect hash based index for genome data.

aindex: perfect hash based index for genomic data

Features

🚀 High Performance: Ultra-fast k-mer querying with optimized C++ backend

  • 13-mers: 2.0M queries/sec (batch), 491K queries/sec (single)
  • 23-mers: 2.3M queries/sec (batch), 1.1M queries/sec (single)
  • Sequence coverage analysis: 24.5K sequences/sec (13-mers), 17.5K sequences/sec (23-mers)

🧬 Dual K-mer Support: Native support for both 13-mer and 23-mer k-mers

  • 13-mers: Complete 4^13 space coverage with perfect hashing
  • 23-mers: Efficient sparse indexing for genomic sequences
  • Auto-detection: Seamlessly switches between modes based on k-mer length
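
The difference between the two modes comes down to the size of the k-mer space. The sketch below uses a plain 2-bit-per-base encoding to show that every 13-mer maps to a distinct integer below 4^13, which is why a complete table is feasible for k=13 while k=23 (4^23 = 2^46 possible k-mers) needs a sparse index. This encoding is an illustration only and is not claimed to be the hash function aindex uses; `encode_kmer` and `decode_kmer` are hypothetical helper names.

```python
# Illustrative only: a direct 2-bit encoding, not aindex's actual hash function.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer: str) -> int:
    """Map a DNA k-mer to an integer in [0, 4**k) using 2 bits per base."""
    value = 0
    for base in kmer:
        # Raises KeyError for N or any other non-ACGT symbol.
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def decode_kmer(value: int, k: int) -> str:
    """Inverse of encode_kmer for a k-mer of known length k."""
    bases = "ACGT"
    return "".join(bases[(value >> (2 * (k - 1 - i))) & 3] for i in range(k))


K = 13
print(f"13-mer space: {4 ** K:,} entries")
print(f"23-mer space: {4 ** 23:,} entries")
print(encode_kmer("A" * K), encode_kmer("T" * K))  # smallest and largest index
```

Because the mapping is a bijection, a flat array indexed by the encoded value can hold one counter per 13-mer with no collisions.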

💾 Memory Efficient: Optimized data structures and memory-mapped files

  • Batch operations: Up to 4x faster than single queries
  • Minimal memory overhead: Constant memory usage during processing
  • Real-time processing: Stream processing for large genomic datasets

🔧 Modern API: Clean pybind11 interface with comprehensive functionality

Installation

Quick install with pip:

pip install aindex2

✅ Supported platforms (pre-built wheels available):

  • macOS: arm64 (Apple Silicon M1/M2/M3) - full functionality with C++ optimizations
  • Linux: x86_64 - full functionality with C++ optimizations

⚡ Currently optimized for: Builds are tuned for the two most widely used platforms:

  • Apple Silicon (M1/M2/M3): Native ARM64 optimizations with up to 30% faster performance
  • Linux x86_64: Standard Intel/AMD processors with full C++ backend

🔄 Other platforms: For platforms not listed above (Windows, Linux ARM64, macOS Intel), you can:

  • Use Windows Subsystem for Linux (WSL) for Windows users
  • Build from source (see building instructions below)
  • Use cloud environments (Google Colab, Jupyter notebooks, etc.)

Recommended platforms for production use: Linux x86_64 or macOS arm64

📋 For detailed platform support information, see PLATFORM_SUPPORT.md

Building from source (optional):

git clone https://github.com/ad3002/aindex.git
cd aindex
make arm64  # For Apple Silicon
# or
make all    # For x86_64
pip install .

Requirements:

  • Python 3.8+
  • Standard build tools (automatically handled by pip)
  • No external dependencies required

For Google Colab users:

!pip install aindex2

Detailed Installation Instructions

Standard installation with pip (all platforms):

pip install aindex2

Installation from source (for development or custom builds):

git clone https://github.com/ad3002/aindex.git
cd aindex

# Option 1: standard build
make all
pip install .

# Option 2 (instead of Option 1): Apple Silicon with ARM64 optimizations
make arm64
pip install .

Platform support:

  • macOS arm64 (Apple Silicon M1/M2/M3): Pre-built wheels with ARM64 optimizations
  • Linux x86_64: Pre-built wheels with full C++ functionality
  • Windows: Python-only functionality; the C++ backend is not available (see Windows-Specific Notes)
  • Other platforms (macOS Intel, Linux ARM64): Build from source or use alternative environments (WSL, Docker, Colab)

Pre-built wheels require no external dependencies.

Windows-Specific Notes

Current Status: Python-only functionality

  • The Windows build installs successfully but has limited functionality
  • C++ k-mer counting and high-performance indexing are not available on Windows
  • This is due to POSIX-specific dependencies (sys/mman.h, memory mapping) in the C++ backend

What works on Windows:

# Python utilities and scripts work normally
import aindex

# File format conversion utilities
aindex.reads_to_fasta(input_file, output_file)

# Command-line utilities for file processing
# Note: High-performance k-mer operations require Linux/macOS

What doesn't work on Windows:

# These operations require C++ backend (Linux/macOS only)
from aindex.core.aindex import AIndex  # Will show clear error message
index = AIndex.load_from_prefix("data")  # Not available on Windows
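
For scripts that must run on both Windows and Linux/macOS, the import can be guarded so the program degrades gracefully. This is a minimal sketch: the import path is the one used in the examples in this document, `try_load_backend` is a hypothetical helper, and because the exact exception raised on Windows is not stated here, the sketch catches broadly.

```python
import sys


def try_load_backend():
    """Return (AIndex, None) if the C++ backend imports, else (None, reason)."""
    try:
        # Import path taken from the examples in this document.
        from aindex.core.aindex import AIndex
    except Exception as exc:  # broad on purpose: the exception type is an assumption
        return None, f"{type(exc).__name__}: {exc}"
    return AIndex, None


backend, reason = try_load_backend()
if backend is None:
    print(f"C++ backend unavailable on {sys.platform}: {reason}")
else:
    print("C++ backend available")
```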

For Windows users who need full functionality:

  1. Use WSL (Windows Subsystem for Linux): Install Linux subsystem and use the Linux version
  2. Use Docker: Run aindex in a Linux container
  3. Use cloud/remote Linux machine: Process data on Linux and transfer results

Alternative for Windows users:

# Use WSL or Docker to get full functionality
wsl --install
wsl
pip install aindex2  # Now runs Linux version with full functionality

Google Colab Installation

For installation in Google Colab environment, there's a known cmake conflict that needs to be resolved first:

# Quick fix for cmake conflict
!pip uninstall -y cmake
!apt-get update
!apt-get install -y build-essential cmake git python3-dev

# Clone and install aindex
!git clone https://github.com/ad3002/aindex.git
%cd aindex
!pip install .

Alternative: Use automatic installation script

# Download and run the installation script
!wget https://raw.githubusercontent.com/ad3002/aindex/main/install_colab.py
!python install_colab.py

For troubleshooting, use the diagnostic script:

!wget https://raw.githubusercontent.com/ad3002/aindex/main/diagnose_colab.py
!python diagnose_colab.py

Note: Google Colab has a conflict between the Python cmake package and system cmake. The scripts above automatically resolve this issue.

To uninstall:

pip uninstall aindex2

To clean up the compiled files, run:

make clean

macOS Compilation (ARM64/Apple Silicon Support)

For macOS systems (including Apple Silicon M1/M2), aindex now provides full ARM64 support with optimized performance:

# Build all components including the fast kmer_counter utility
make

# Alternative: build only core components
make macos

The project has been fully ported to ARM64/macOS, removing x86-specific dependencies (SSE instructions) and adding native ARM64 optimization.

Requirements for macOS:

  • For jellyfish-based pipeline: brew install jellyfish (optional)
  • Built-in kmer_counter provides faster alternative to jellyfish
  • Python development headers (usually included with Xcode tools)

Performance Note: The new built-in kmer_counter utility is approximately 5x faster than jellyfish for k-mer counting tasks.

Usage

Note: The examples below demonstrate full functionality available on Linux and macOS. Windows users have access to Python utilities only. See Windows-Specific Notes for details.

Command Line Interface (CLI)

aindex provides a unified command-line interface for all tools and utilities. After installation, all functions are accessible through the aindex command:

# Get help for all available commands
aindex --help

# Get help for a specific command
aindex count --help
aindex compute-aindex --help

Available Commands

Core indexing tools:

# Compute AIndex for genomic sequences (supports both 13-mer and 23-mer modes)
aindex compute-aindex -i input.fastq -o output_prefix -k 23

# Compute general index
aindex compute-index -i input.fasta -o output_prefix

# Process reads for indexing
aindex compute-reads input.fastq output.fastq fastq reads_prefix

K-mer analysis:

# Count k-mers in sequences (fast built-in counter)
aindex count -i input.fasta -o output.txt -k 23 -t 4

# Count 13-mers specifically (optimized for complete 13-mer space)
aindex count -i input.fasta -o output.txt -k 13

# Build hash table for k-mers
aindex build-hash -i kmers.txt -o hash_output

# Generate all possible 13-mers
aindex generate -o all_13mers.txt -k 13

Utilities:

# Convert reads to FASTA format
aindex reads-to-fasta input.fastq output.fasta

# Show version information
aindex version

# Show system and installation information
aindex info

Examples

Count 23-mers in a FASTA file:

aindex count -i genome.fasta -o kmer_counts.txt -k 23 -t 8

Build AIndex for 13-mer analysis:

aindex compute-aindex -i reads.fastq -o reads_index -k 13 --lu 2 -P 16

Generate all possible 13-mers for reference:

aindex generate -o all_13mers.txt -k 13

K-mer Counting Pipelines

aindex supports two k-mer counting backends:

  1. Built-in kmer_counter (Recommended) - Fast native implementation, ~5x faster than jellyfish
  2. Jellyfish - Traditional external tool (requires brew install jellyfish on macOS)

Quick Start

Compute all binary arrays using the fast built-in counter:

FASTQ1=./tests/raw_reads.101bp.IS350bp25_1.fastq
FASTQ2=./tests/raw_reads.101bp.IS350bp25_2.fastq
OUTPUT_PREFIX=./tests/raw_reads.101bp.IS350bp25

# Using built-in kmer_counter (recommended, faster) via CLI
aindex compute-aindex -i $FASTQ1,$FASTQ2 -t fastq -o $OUTPUT_PREFIX --lu 2 -P 30 --use-kmer-counter

# Using built-in kmer_counter (legacy script approach)
python3 scripts/compute_aindex.py -i $FASTQ1,$FASTQ2 -t fastq -o $OUTPUT_PREFIX --lu 2 -P 30 --use_kmer_counter

# Using jellyfish (traditional approach)
python3 scripts/compute_aindex.py -i $FASTQ1,$FASTQ2 -t fastq -o $OUTPUT_PREFIX --lu 2 -P 30

Command Line Options

  • -i, --input: Input FASTQ/FASTA files (comma-separated for multiple files)
  • -t, --type: Input file type ('fastq' or 'fasta')
  • -o, --output: Output prefix for generated files
  • --lu: Lower frequency threshold for k-mers
  • -P, --threads: Number of threads to use
  • --use-kmer-counter: Use built-in fast k-mer counter instead of jellyfish (the legacy scripts/compute_aindex.py script spells this flag --use_kmer_counter)
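
When the pipeline is driven from Python, the options above can be assembled into an argument list. The sketch below only builds and prints the command, using the flags listed in this section; `build_compute_aindex_cmd` is a hypothetical helper, not part of aindex.

```python
import shlex


def build_compute_aindex_cmd(inputs, file_type, output_prefix, lu, threads,
                             use_kmer_counter=True):
    """Assemble an `aindex compute-aindex` command from the documented flags."""
    cmd = [
        "aindex", "compute-aindex",
        "-i", ",".join(inputs),   # comma-separated input files
        "-t", file_type,          # 'fastq' or 'fasta'
        "-o", output_prefix,
        "--lu", str(lu),
        "-P", str(threads),
    ]
    if use_kmer_counter:
        cmd.append("--use-kmer-counter")
    return cmd


cmd = build_compute_aindex_cmd(
    ["./tests/raw_reads.101bp.IS350bp25_1.fastq",
     "./tests/raw_reads.101bp.IS350bp25_2.fastq"],
    "fastq",
    "./tests/raw_reads.101bp.IS350bp25",
    lu=2,
    threads=30,
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing a list rather than a shell string avoids quoting problems when file paths contain spaces.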

Pipeline Outputs

Both pipelines generate identical output files:

  • .reads - Processed reads file
  • .dat - K-mer frequency data
  • .aindex - Binary index file
  • .stat - Statistics and metadata
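
A quick way to confirm that a pipeline run completed is to check for these files. The sketch assumes each output is named as the output prefix followed directly by the extension; that naming is an assumption and should be checked against the files a real run produces. `missing_outputs` is a hypothetical helper.

```python
from pathlib import Path

# Extensions listed above; the prefix + suffix naming is an assumption.
OUTPUT_SUFFIXES = (".reads", ".dat", ".aindex", ".stat")


def missing_outputs(prefix: str):
    """Return the expected output paths that do not exist on disk."""
    candidates = [Path(prefix + suffix) for suffix in OUTPUT_SUFFIXES]
    return [path for path in candidates if not path.exists()]


missing = missing_outputs("./tests/raw_reads.101bp.IS350bp25")
if missing:
    print("missing:", ", ".join(str(path) for path in missing))
else:
    print("all expected outputs present")
```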

Usage from Python

Platform Note: Full Python API with C++ backend is available on Linux and macOS. Windows provides Python utilities only.

Modern API

The aindex package provides a unified API supporting both 13-mer and 23-mer modes:

from aindex.core.aindex import AIndex
import aindex.core.aindex_cpp as aindex_cpp

# Load 23-mer index (for genomic sequences)
index_23mer = AIndex.load_from_prefix("temp/reads.23")
index_23mer.load_reads("temp/reads.reads")  # Optional: load actual read sequences

# Load 13-mer index (for complete k-mer space analysis)
index_13mer = aindex_cpp.AindexWrapper()
index_13mer.load_from_prefix_13mer("temp/all_13mers")
index_13mer.load_reads("temp/reads.reads")  # Optional: load reads

print(f"23-mer index: {index_23mer.n_kmers:,} k-mers, {index_23mer.n_reads:,} reads")
print(f"13-mer index: {index_13mer.get_13mer_statistics()}")

K-mer Frequency Queries

Single k-mer queries:

# 23-mer queries (using AIndex wrapper)
tf_23 = index_23mer.get_tf_value("ATCGATCGATCGATCGATCGATC")  # 23 characters
print(f"23-mer frequency: {tf_23}")

# Alternative 23-mer query using [] operator
tf_23_alt = index_23mer["ATCGATCGATCGATCGATCGATC"]
print(f"23-mer frequency (alt): {tf_23_alt}")

# 13-mer queries (using C++ wrapper directly)
tf_13 = index_13mer.get_total_tf_value_13mer("ATCGATCGATCGA")  # 13 characters
print(f"13-mer frequency: {tf_13}")

# Get forward and reverse frequencies separately for 13-mers
tf_fwd, tf_rev = index_13mer.get_tf_both_directions_13mer("ATCGATCGATCGA")
print(f"13-mer forward: {tf_fwd}, reverse: {tf_rev}, total: {tf_fwd + tf_rev}")

Batch queries (much faster):

# Batch 23-mer queries (2-3x faster than single queries)
kmers_23 = ["ATCGATCGATCGATCGATCGATC", "A" * 23, "T" * 23]  # every query must be exactly 23 characters
tf_values_23 = index_23mer.get_tf_values(kmers_23)
print(f"23-mer batch results: {tf_values_23}")

# Batch 13-mer queries (total frequencies)
kmers_13 = ["ATCGATCGATCGA", "AAAAAAAAAAAAA", "TTTTTTTTTTTTT"] 
tf_values_13 = index_13mer.get_total_tf_values_13mer(kmers_13)
print(f"13-mer batch results: {tf_values_13}")

# Batch directional 13-mer queries (forward + reverse separately)
directional_results = index_13mer.get_tf_both_directions_13mer_batch(kmers_13)
for i, (fwd, rev) in enumerate(directional_results):
    print(f"{kmers_13[i]}: forward={fwd}, reverse={rev}, total={fwd+rev}")

Advanced 13-mer Operations

Directional analysis (forward + reverse complement):

# Get frequencies in both directions
kmer = "ATCGATCGATCGA"
forward_tf, reverse_tf = index_13mer.get_tf_both_directions_13mer(kmer)
total_tf = index_13mer.get_total_tf_value_13mer(kmer)

print(f"Forward: {forward_tf}, Reverse: {reverse_tf}, Total: {total_tf}")

# Batch directional analysis
results = index_13mer.get_tf_both_directions_13mer_batch(kmers_13)
for i, (fwd, rev) in enumerate(results):
    print(f"{kmers_13[i]}: forward={fwd}, reverse={rev}")
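
To make the forward/reverse terminology concrete, the sketch below computes a reverse complement in plain Python and sums the two strand counts from a toy dictionary. It assumes that the "reverse" value refers to the count of the reverse-complement k-mer; the dictionary and its counts are made up for illustration and do not come from aindex.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement each base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


kmer = "ATCGATCGATCGA"
rc = reverse_complement(kmer)
print(f"{kmer} -> {rc}")

# Toy counts standing in for an index (hypothetical values).
toy_counts = {"ATCGATCGATCGA": 3, "TCGATCGATCGAT": 2}
forward = toy_counts.get(kmer, 0)
reverse = toy_counts.get(rc, 0)
print(f"forward={forward}, reverse={reverse}, total={forward + reverse}")
```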

Complete 13-mer space analysis:

# Get statistics for the entire 13-mer space
stats = index_13mer.get_13mer_statistics()
print(f"Total 13-mers: {stats['total_kmers']:,}")
print(f"Non-zero frequencies: {stats['non_zero_kmers']:,}")
print(f"Max frequency: {stats['max_frequency']:,}")
print(f"Average frequency: {stats['total_count']/stats['non_zero_kmers']:.2f}")

# Access complete frequency array (4^13 = 67M elements)
# Note: This loads 256MB into memory
full_array = index_13mer.get_13mer_tf_array()
print(f"Array size: {len(full_array):,} elements")

Sequence Coverage Analysis

Analyze k-mer coverage in sequences:

# Using real reads from the index
real_read = index_23mer.get_read_by_rid(0)  # Get first read
sequence = real_read.split('~')[0][:100] if '~' in real_read else real_read[:100]    # Take first 100 bp

# Analyze 23-mer coverage using built-in function
coverage_23 = index_23mer.get_sequence_coverage(sequence, cutoff=0, k=23)
print(f"23-mer coverage: {len(coverage_23)} positions")
print(f"Non-zero positions: {sum(1 for tf in coverage_23 if tf > 0)}")
print(f"Average TF: {sum(coverage_23)/len(coverage_23):.2f}")

# Analyze 13-mer coverage using batch queries
kmers_13_in_seq = [sequence[i:i+13] for i in range(len(sequence) - 12)]
coverage_13 = index_13mer.get_total_tf_values_13mer(kmers_13_in_seq)
print(f"13-mer coverage: {len(coverage_13)} positions")
print(f"Non-zero positions: {sum(1 for tf in coverage_13 if tf > 0)}")
print(f"Average TF: {sum(coverage_13)/len(coverage_13):.2f}")
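
The position counts reported by these snippets follow from the sliding-window rule: a sequence of length L contains L - k + 1 k-mers. The helpers below are hypothetical names and run without an index.

```python
def sequence_kmers(sequence: str, k: int):
    """All overlapping k-mers of a sequence, left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def coverage_summary(tf_values):
    """Summarise a list of per-position frequencies."""
    n = len(tf_values)
    if n == 0:
        return {"positions": 0, "non_zero": 0, "average_tf": 0.0}
    return {
        "positions": n,
        "non_zero": sum(1 for tf in tf_values if tf > 0),
        "average_tf": sum(tf_values) / n,
    }


toy_sequence = "ACGT" * 25  # 100 bp
for k in (13, 23):
    print(f"k={k}: {len(sequence_kmers(toy_sequence, k))} positions")

print(coverage_summary([0, 2, 4]))
```

The empty-input guard matters for reads shorter than k, where the average would otherwise divide by zero.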

Iterate over k-mers in sequences:

# 23-mer iteration using built-in iterator
for kmer, tf in index_23mer.iter_sequence_kmers(sequence, k=23):
    if tf > 0:  # Only show k-mers found in index
        print(f"{kmer}: {tf}")

# 13-mer iteration using manual approach (more efficient with batch)
kmers_13 = [sequence[i:i+13] for i in range(len(sequence) - 12)]
tf_values_13 = index_13mer.get_total_tf_values_13mer(kmers_13)

for i, (kmer, tf) in enumerate(zip(kmers_13, tf_values_13)):
    if tf > 0:
        print(f"Position {i}: {kmer}: {tf}")

# For directional analysis of 13-mers
directional_results = index_13mer.get_tf_both_directions_13mer_batch(kmers_13)
for i, (fwd, rev) in enumerate(directional_results):
    if fwd > 0 or rev > 0:
        print(f"Position {i}: {kmers_13[i]}: forward={fwd}, reverse={rev}")

Performance Benchmarks

Based on stress testing with 1M queries and 10K sequence analyses:

Operation          | 13-mers             | 23-mers             | Speedup
Single TF queries  | 491K queries/sec    | 1.1M queries/sec    | 23-mer 2.2x faster
Batch TF queries   | 2.0M queries/sec    | 2.3M queries/sec    | 23-mer 1.2x faster
Sequence coverage  | 24.5K sequences/sec | 17.5K sequences/sec | 13-mer 1.4x faster
K-mer positions    | 2.2M positions/sec  | 1.4M positions/sec  | 13-mer 1.6x faster

Key findings:

  • Batch operations: 2-4x faster than single queries for both modes
  • 23-mers: Better for single/batch TF queries due to optimized sparse indexing
  • 13-mers: Better for sequence analysis due to complete space coverage
  • Memory efficiency: Minimal memory growth during batch operations
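
The speedup figures can be reproduced from the throughput numbers reported in this section. The rates below are copied from the benchmark figures; `speedup` is a hypothetical helper.

```python
def speedup(rate_13: float, rate_23: float):
    """Return which mode is faster and by what factor."""
    if rate_13 >= rate_23:
        return "13-mer", rate_13 / rate_23
    return "23-mer", rate_23 / rate_13


# (13-mer rate, 23-mer rate) per second, as reported above.
benchmarks = {
    "Single TF queries": (491_000, 1_100_000),
    "Batch TF queries": (2_000_000, 2_300_000),
    "Sequence coverage": (24_500, 17_500),
    "K-mer positions": (2_200_000, 1_400_000),
}

for name, (rate_13, rate_23) in benchmarks.items():
    faster, ratio = speedup(rate_13, rate_23)
    print(f"{name}: {faster} {ratio:.2f}x faster")

# Batch versus single queries within each mode
print(f"13-mer batch/single: {2_000_000 / 491_000:.1f}x")
print(f"23-mer batch/single: {2_300_000 / 1_100_000:.1f}x")
```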

Working with Reads

Access reads by ID:

# Get reads from either index
for rid in range(min(5, index_23mer.n_reads)):
    read = index_23mer.get_read_by_rid(rid)
    print(f"Read {rid}: {read[:50]}...")  # First 50 characters
    
    # Split paired reads (separated by '~')
    if '~' in read:
        read1, read2 = read.split('~')
        print(f"  Read 1: {len(read1)} bp, Read 2: {len(read2)} bp")
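
The split on '~' can be wrapped in a small helper that handles single-end and paired records uniformly. `split_mates` is a hypothetical name; it uses `str.partition` so that unpacking never fails, whereas `read.split('~')` into two names raises if a record contains more than one separator.

```python
def split_mates(read: str):
    """Return (mate1, mate2); mate2 is None for single-end records."""
    left, separator, right = read.partition("~")
    return (left, right) if separator else (left, None)


print(split_mates("ACGTACGT~TTGACCAA"))
print(split_mates("ACGTACGT"))
```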

Iterate over all reads:

# Iterate through reads with automatic ID assignment
read_count = 0
for rid, read in index_23mer.iter_reads():
    read_count += 1
    if read_count <= 5:  # Show first 5 reads
        print(f"Read {rid}: {len(read)} bp")
    if read_count >= 1000:  # Process first 1000 reads
        break
        
print(f"Processed {read_count} reads")

Complete Example

Here's a practical example showing both 13-mer and 23-mer analysis:

from aindex.core.aindex import AIndex
import aindex.core.aindex_cpp as aindex_cpp
import time

# Load both indices
print("Loading indices...")
index_23mer = AIndex.load_from_prefix("temp/reads.23")
index_23mer.load_reads("temp/reads.reads")

index_13mer = aindex_cpp.AindexWrapper()
index_13mer.load_from_prefix_13mer("temp/all_13mers")
index_13mer.load_reads("temp/reads.reads")

# Get a real sequence to analyze
real_read = index_23mer.get_read_by_rid(0)
sequence = real_read.split('~')[0][:100] if '~' in real_read else real_read[:100]
print(f"Analyzing sequence: {sequence[:50]}...")

# Compare 13-mer vs 23-mer coverage
print("\n=== Coverage Analysis ===")

# 23-mer coverage using built-in function
start = time.time()
coverage_23 = index_23mer.get_sequence_coverage(sequence, cutoff=0, k=23)
time_23 = time.time() - start

# 13-mer coverage using batch query
kmers_13 = [sequence[i:i+13] for i in range(len(sequence) - 12)]
start = time.time()
coverage_13 = index_13mer.get_total_tf_values_13mer(kmers_13)
time_13 = time.time() - start

print(f"23-mers: {len(coverage_23)} positions, {sum(1 for x in coverage_23 if x > 0)} covered ({time_23*1000:.1f}ms)")
print(f"13-mers: {len(coverage_13)} positions, {sum(1 for x in coverage_13 if x > 0)} covered ({time_13*1000:.1f}ms)")

# Performance comparison
print(f"\n=== Performance Test ===")
test_kmers_23 = ["ATCGATCGATCGATCGATCGATC"] * 1000
test_kmers_13 = ["ATCGATCGATCGA"] * 1000

# 23-mer batch query
start = time.time()
results_23 = index_23mer.get_tf_values(test_kmers_23)
time_23_batch = time.time() - start

# 13-mer batch query
start = time.time()
results_13 = index_13mer.get_total_tf_values_13mer(test_kmers_13)
time_13_batch = time.time() - start

print(f"23-mer batch (1K queries): {len(test_kmers_23)/time_23_batch:,.0f} queries/sec")
print(f"13-mer batch (1K queries): {len(test_kmers_13)/time_13_batch:,.0f} queries/sec")

# Statistics
stats_23 = {"kmers": index_23mer.n_kmers, "reads": index_23mer.n_reads}
stats_13 = index_13mer.get_13mer_statistics()

print(f"\n=== Index Statistics ===")
print(f"23-mer index: {stats_23['kmers']:,} k-mers, {stats_23['reads']:,} reads")
print(f"13-mer index: {stats_13['total_kmers']:,} total k-mers, {stats_13['non_zero_kmers']:,} non-zero")

Example output (exact values depend on your data and hardware):

Loading indices...
Analyzing sequence: NNNNNNNNNNACTGAACCGCCTTCCGATCTCCAGCTGCAAAGCGTAG...

=== Coverage Analysis ===
23-mers: 78 positions, 42 covered (0.3ms)
13-mers: 88 positions, 88 covered (0.1ms)

=== Performance Test ===
23-mer batch (1K queries): 2,300,000 queries/sec
13-mer batch (1K queries): 2,000,000 queries/sec

=== Index Statistics ===
23-mer index: 15,234,567 k-mers, 125,000 reads  
13-mer index: 67,108,864 total k-mers, 8,945,123 non-zero

Advanced Features

13-mer Integration

The aindex library provides highly optimized 13-mer k-mer counting and querying using precomputed perfect hash tables. This mode offers complete coverage of the 4^13 k-mer space with exceptional performance.

Performance Characteristics

Query Performance:

  • Single queries: 491K queries/second
  • Batch queries: 2.0M queries/second (4.1x speedup)
  • Directional queries: 1.8M queries/second (forward + reverse complement)
  • Complete space: Access to all 67,108,864 possible 13-mers

Sequence Analysis Performance:

  • Coverage analysis: 24,500 sequences/second
  • Position analysis: 2.2M k-mer positions/second
  • Memory efficiency: Zero memory growth during batch operations
  • Real data coverage: 100% (all k-mers found in biological data)

13-mer Workflow

1. Generate Complete 13-mer Space:

# Generate all possible 13-mers (67M k-mers)
./bin/generate_all_13mers.exe all_13mers.txt

# Build perfect hash for instant lookup
./bin/build_13mer_hash.exe all_13mers.txt temp/all_13mers 4

# Count k-mers in your genomic data
./bin/count_kmers13.exe input_reads.fasta temp/all_13mers.tf.bin hash_file 4

2. Python API Usage:

from aindex.core.aindex import AIndex

# Load 13-mer index with complete k-mer space
index = AIndex.load_from_prefix_13mer("temp/all_13mers")
index.load_reads("temp/reads.reads")  # Optional: load read sequences

# Query performance demonstration
import time

# Single k-mer query
start = time.time()
tf = index.get_total_tf_value_13mer("ATCGATCGATCGA")
single_time = time.time() - start
print(f"Single query: {tf} (took {single_time*1000:.3f}ms)")

# Batch query (much faster)
kmers = ["ATCGATCGATCGA", "AAAAAAAAAAAAA", "TTTTTTTTTTTTT"] * 1000  # 3K queries
start = time.time() 
tf_values = index.get_total_tf_values_13mer(kmers)
batch_time = time.time() - start
print(f"Batch {len(kmers)} queries: {batch_time:.3f}s ({len(kmers)/batch_time:.0f} queries/sec)")

# Directional analysis (forward + reverse complement)
forward, reverse = index.get_tf_both_directions_13mer("ATCGATCGATCGA")
total = index.get_total_tf_value_13mer("ATCGATCGATCGA")
print(f"Directional: forward={forward}, reverse={reverse}, total={total}")
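
Single-shot timings like the ones above are noisy, especially for sub-millisecond calls. One common approach, sketched here with a toy dictionary in place of a real index, is to use `time.perf_counter` and keep the best of several repeats. `measure_throughput` and `toy_batch_query` are hypothetical names, and the numbers this prints say nothing about aindex performance.

```python
import time


def measure_throughput(query_fn, items, repeats=3):
    """Return items processed per second, using the fastest of several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        query_fn(items)
        best = min(best, time.perf_counter() - start)
    return len(items) / best if best > 0 else float("inf")


# Toy stand-in for an index: a plain dict lookup (not aindex).
toy_index = {"ATCGATCGATCGA": 7}


def toy_batch_query(kmers):
    return [toy_index.get(kmer, 0) for kmer in kmers]


kmers = ["ATCGATCGATCGA", "AAAAAAAAAAAAA", "TTTTTTTTTTTTT"] * 1000
qps = measure_throughput(toy_batch_query, kmers)
print(f"{qps:,.0f} queries/sec (toy dict lookup)")
```

To time a real index, pass its batch query method as `query_fn` in place of `toy_batch_query`.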

13-mer Statistics and Analysis

Get comprehensive statistics:

# Complete 13-mer space statistics
stats = index.get_13mer_statistics()
print(f"Total 13-mer space: {stats['total_kmers']:,}")
print(f"Found in data: {stats['non_zero_kmers']:,} ({stats['non_zero_kmers']/stats['total_kmers']*100:.2f}%)")
print(f"Max frequency: {stats['max_frequency']:,}")
print(f"Total occurrences: {stats['total_count']:,}")
print(f"Average frequency: {stats['total_count']/stats['non_zero_kmers']:.2f}")

# Access complete frequency array (warning: 256MB)
if stats['non_zero_kmers'] > 0:
    # Get subset for analysis rather than full array
    sample_indices = range(0, 1000000, 1000)  # Sample every 1000th element
    sample_tfs = [index.get_tf_by_index_13mer(i) for i in sample_indices]
    non_zero_sample = [tf for tf in sample_tfs if tf > 0]
    print(f"Sample analysis: {len(non_zero_sample)}/{len(sample_tfs)} non-zero in sample")

Sequence coverage analysis:

# Analyze real genomic sequences
for rid in range(min(5, index.n_reads)):
    read = index.get_read_by_rid(rid)
    if '~' in read:
        sequence = read.split('~')[0]  # Take first mate
    else:
        sequence = read
    
    # Limit to reasonable length for demonstration
    if len(sequence) > 100:
        sequence = sequence[:100]
    
    # Compute 13-mer coverage
    coverage = []
    for i in range(len(sequence) - 12):
        kmer = sequence[i:i+13]
        tf = index.get_total_tf_value_13mer(kmer)
        coverage.append(tf)
    
    if coverage:
        avg_tf = sum(coverage) / len(coverage)
        max_tf = max(coverage)
        coverage_pct = sum(1 for tf in coverage if tf > 0) / len(coverage) * 100
        print(f"Read {rid}: {len(coverage)} 13-mers, {coverage_pct:.1f}% covered, avg TF {avg_tf:.1f}, max TF {max_tf}")

23-mer Integration

The 23-mer mode provides efficient sparse indexing for longer k-mers commonly used in genomic analysis.

Performance Characteristics

Query Performance:

  • Single queries: 1.1M queries/second
  • Batch queries: 2.3M queries/second (2.1x speedup)
  • Directional queries: Available for forward + reverse complement analysis
  • Sparse indexing: Only stores k-mers present in input data

Sequence Analysis Performance:

  • Coverage analysis: 17,500 sequences/second
  • Position analysis: 1.4M k-mer positions/second
  • Memory efficiency: Constant memory usage during operations
  • Real data coverage: 100% (all k-mers found in genomic sequences)
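
The idea behind sparse indexing can be sketched with a plain counter: only k-mers that actually occur in the input are stored, and rare ones can be dropped by a minimum-count threshold. This is a conceptual illustration, not aindex's implementation. Treating the threshold like the --lu option (keep k-mers seen at least that many times) and skipping k-mers that contain N are both assumptions; `count_kmers` is a hypothetical helper.

```python
from collections import Counter


def count_kmers(reads, k, min_count=1):
    """Count k-mers present in the reads and keep those seen at least min_count times."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # assumption: ambiguous bases are skipped
                counts[kmer] += 1
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


reads = ["ACGT" * 6 + "A", "ACGT" * 6 + "T"]  # two 25 bp toy reads
all_kmers = count_kmers(reads, k=23)
kept = count_kmers(reads, k=23, min_count=2)
print(f"{len(all_kmers)} distinct 23-mers, {len(kept)} kept with min_count=2")
print(f"possible 23-mers: {4 ** 23:,}")
```

The stored set grows with the data rather than with the 4^23 possible 23-mers, which is why the size of a 23-mer index depends on the input.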

23-mer Workflow

1. Build 23-mer Index:

# Using the fast built-in k-mer counter (recommended)
FASTQ1=./tests/raw_reads.101bp.IS350bp25_1.fastq
FASTQ2=./tests/raw_reads.101bp.IS350bp25_2.fastq
OUTPUT_PREFIX=./temp/reads.23

python3 scripts/compute_aindex.py -i $FASTQ1,$FASTQ2 -t fastq -o $OUTPUT_PREFIX --lu 2 -P 30 --use_kmer_counter

2. Python API Usage:

from aindex.core.aindex import AIndex

# Load 23-mer index
index = AIndex.load_from_prefix("temp/reads.23")
index.load_reads("temp/reads.reads")

# Performance demonstration
import time

# Batch query performance
kmers = ["ATCGATCGATCGATCGATCGATC", "A" * 23] * 1000  # 2K queries, each exactly 23 characters
start = time.time()
tf_values = index.get_tf_values(kmers)  # Auto-detects 23-mer mode
batch_time = time.time() - start
print(f"23-mer batch {len(kmers)} queries: {batch_time:.3f}s ({len(kmers)/batch_time:.0f} queries/sec)")

# Sequence coverage analysis
read = index.get_read_by_rid(0)
sequence = read.split('~')[0][:100] if '~' in read else read[:100]

start = time.time()
coverage = index.get_sequence_coverage(sequence, cutoff=0, k=23)
coverage_time = time.time() - start

print(f"23-mer coverage analysis: {len(coverage)} positions in {coverage_time*1000:.1f}ms")
print(f"Coverage: {sum(1 for tf in coverage if tf > 0)/len(coverage)*100:.1f}% positions covered")
print(f"Average TF: {sum(coverage)/len(coverage):.1f}")

Performance Comparison

Throughput Comparison (Operations per Second)

Operation Type     | 13-mers   | 23-mers   | Winner
Single TF queries  | 491K/sec  | 1.1M/sec  | 23-mer (+124%)
Batch TF queries   | 2.0M/sec  | 2.3M/sec  | 23-mer (+15%)
Sequence coverage  | 24.5K/sec | 17.5K/sec | 13-mer (+40%)
Position analysis  | 2.2M/sec  | 1.4M/sec  | 13-mer (+57%)

Use Case Recommendations

Choose 13-mers when:

  • Analyzing complete k-mer space (population genetics, mutation analysis)
  • Maximum sequence-coverage or position-analysis throughput is needed (23-mers are faster for plain TF queries)
  • Working with shorter sequences or fragments
  • Need comprehensive coverage statistics

Choose 23-mers when:

  • Standard genomic analysis (assembly, alignment, variant calling)
  • Working with longer reads (>100bp)
  • Memory efficiency is critical
  • Integration with existing 23-mer workflows

Memory Usage

  • 13-mer index: ~277MB (256MB frequencies + 21MB hash)
  • 23-mer index: Variable, depends on data complexity
  • Both modes: Memory-mapped files for efficient access
  • Batch operations: Minimal additional memory overhead
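
The size of the 13-mer frequency array follows from the number of entries times the width of each entry. A quick check of the arithmetic, for 4-byte and 8-byte counters (`array_size_mib` is a hypothetical helper):

```python
N_13MERS = 4 ** 13  # number of possible 13-mers


def array_size_mib(n_entries: int, bytes_per_entry: int) -> float:
    """Size of a flat array in MiB (1 MiB = 2**20 bytes)."""
    return n_entries * bytes_per_entry / 2 ** 20


print(f"{N_13MERS:,} entries")
for width in (4, 8):
    print(f"{width}-byte entries: {array_size_mib(N_13MERS, width):.0f} MiB")
```

A 256MB array therefore corresponds to 4-byte entries, and 8-byte entries double that to 512MB.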

File Formats

13-mer Files

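  • .tf.bin: Binary frequency array with 4^13 = 67,108,864 entries (512MB if entries are 8-byte uint64_t; the 256MB figure quoted under Memory Usage corresponds to 4-byte entries)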
  • .pf: Perfect hash function for k-mer → index mapping
  • .kmers.bin: Binary k-mer encoding (optional validation)

23-mer Files

  • .tf.bin: Binary frequency array (variable size)
  • .pf: Perfect hash function
  • .kmers.bin: Binary k-mer storage
  • .aindex.indices.bin & .aindex.index.bin: Position indices (optional)

Download files

Download the file for your platform.

Source Distribution

  • aindex2-1.4.2.tar.gz (93.2 kB)

Built Distributions

  • aindex2-1.4.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB): CPython 3.12, manylinux (glibc 2.17+), x86-64
  • aindex2-1.4.2-cp312-cp312-macosx_11_0_arm64.whl (685.9 kB): CPython 3.12, macOS 11.0+, ARM64
  • aindex2-1.4.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB): CPython 3.11, manylinux (glibc 2.17+), x86-64
  • aindex2-1.4.2-cp311-cp311-macosx_11_0_arm64.whl (687.4 kB): CPython 3.11, macOS 11.0+, ARM64
  • aindex2-1.4.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB): CPython 3.10, manylinux (glibc 2.17+), x86-64
  • aindex2-1.4.2-cp310-cp310-macosx_11_0_arm64.whl (685.4 kB): CPython 3.10, macOS 11.0+, ARM64
  • aindex2-1.4.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB): CPython 3.9, manylinux (glibc 2.17+), x86-64
  • aindex2-1.4.2-cp39-cp39-macosx_11_0_arm64.whl (685.6 kB): CPython 3.9, macOS 11.0+, ARM64
  • aindex2-1.4.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB): CPython 3.8, manylinux (glibc 2.17+), x86-64
  • aindex2-1.4.2-cp38-cp38-macosx_11_0_arm64.whl (698.8 kB): CPython 3.8, macOS 11.0+, ARM64
