Skip to main content

Perfect hash based index for genome data.

Project description

aindex: perfect hash based index for genomic data

PyPI version PyPI pyversions PyPI - Wheel GitHub Actions Workflow Status PyPI license DOI

Features

🚀 High Performance: Ultra-fast k-mer querying with optimized C++ backend

  • 13-mers: 2.0M queries/sec (batch), 491K queries/sec (single)
  • 23-mers: 2.3M queries/sec (batch), 1.1M queries/sec (single)
  • Sequence coverage analysis: 24.5K sequences/sec (13-mers), 17.5K sequences/sec (23-mers)

🧬 Dual K-mer Support: Native support for both 13-mer and 23-mer k-mers

  • 13-mers: Complete 4^13 space coverage with perfect hashing
  • 23-mers: Efficient sparse indexing for genomic sequences
  • Auto-detection: Seamlessly switches between modes based on k-mer length

💾 Memory Efficient: Optimized data structures and memory-mapped files

  • Batch operations: Up to 4x faster than single queries
  • Minimal memory overhead: Constant memory usage during processing
  • Real-time processing: Stream processing for large genomic datasets

🔧 Modern API: Clean pybind11 interface with comprehensive functionality

Installation

Quick install with pip:

pip install aindex2

⚠️ Note for Apple M1/M2 Mac users: Pre-built wheels are not yet available for Apple Silicon (ARM64) architecture. Please install from source:

# Install system dependencies
brew install cmake

# Install from source
git clone https://github.com/ad3002/aindex.git
cd aindex
make
pip install .

For Google Colab users:

!pip uninstall -y cmake && !apt-get update && !apt-get install -y build-essential cmake git python3-dev
!pip install aindex2

Detailed Installation Instructions

Standard installation with pip (Linux x86_64, Intel Mac):

pip install aindex2

Installation from source (recommended for M1/M2 Mac, development, or troubleshooting):

git clone https://github.com/ad3002/aindex.git
cd aindex
make
pip install .

Platform-specific notes:

  • Linux x86_64: Pre-built wheels available via pip
  • Intel Mac: Pre-built wheels available via pip
  • Apple Silicon Mac (M1/M2): Install from source (see above)
  • Windows: Install from source (WSL recommended)

This will create the necessary executables in the bin directory.

Google Colab Installation

For installation in Google Colab environment, there's a known cmake conflict that needs to be resolved first:

# Quick fix for cmake conflict
!pip uninstall -y cmake
!apt-get update
!apt-get install -y build-essential cmake git python3-dev

# Clone and install aindex
!git clone https://github.com/ad3002/aindex.git
%cd aindex
!pip install .

Alternative: Use automatic installation script

# Download and run the installation script
!wget https://raw.githubusercontent.com/ad3002/aindex/main/install_colab.py
!python install_colab.py

For troubleshooting, use the diagnostic script:

!wget https://raw.githubusercontent.com/ad3002/aindex/main/diagnose_colab.py
!python diagnose_colab.py

Note: Google Colab has a conflict between the Python cmake package and system cmake. The scripts above automatically resolve this issue.

To uninstall:

pip uninstall aindex2
pip uninstall clean

To clean up the compiled files, run:

make clean

macOS Compilation (ARM64/Apple Silicon Support)

For macOS systems (including Apple Silicon M1/M2), aindex now provides full ARM64 support with optimized performance:

# Build all components including the fast kmer_counter utility
make

# Alternative: build only core components
make macos

The project has been fully ported to ARM64/macOS, removing x86-specific dependencies (SSE instructions) and adding native ARM64 optimization.

Requirements for macOS:

  • For jellyfish-based pipeline: brew install jellyfish (optional)
  • Built-in kmer_counter provides faster alternative to jellyfish
  • Python development headers (usually included with Xcode tools)

Performance Note: The new built-in kmer_counter utility is approximately 5x faster than jellyfish for k-mer counting tasks.

Usage

Command Line Interface (CLI)

aindex provides a unified command-line interface for all tools and utilities. After installation, all functions are accessible through the aindex command:

# Get help for all available commands
aindex --help

# Get help for a specific command
aindex count --help
aindex compute-aindex --help

Available Commands

Core indexing tools:

# Compute AIndex for genomic sequences (supports both 13-mer and 23-mer modes)
aindex compute-aindex -i input.fastq -o output_prefix -k 23

# Compute general index
aindex compute-index -i input.fasta -o output_prefix

# Process reads for indexing
aindex compute-reads input.fastq output.fastq fastq reads_prefix

K-mer analysis:

# Count k-mers in sequences (fast built-in counter)
aindex count -i input.fasta -o output.txt -k 23 -t 4

# Count 13-mers specifically (optimized for complete 13-mer space)
aindex count -i input.fasta -o output.txt -k 13

# Build hash table for k-mers
aindex build-hash -i kmers.txt -o hash_output

# Generate all possible 13-mers
aindex generate -o all_13mers.txt -k 13

Utilities:

# Convert reads to FASTA format
aindex reads-to-fasta input.fastq output.fasta

# Show version information
aindex version

# Show system and installation information
aindex info

Examples

Count 23-mers in a FASTA file:

aindex count -i genome.fasta -o kmer_counts.txt -k 23 -t 8

Build AIndex for 13-mer analysis:

aindex compute-aindex -i reads.fastq -o reads_index -k 13 --lu 2 -P 16

Generate all possible 13-mers for reference:

aindex generate -o all_13mers.txt -k 13

K-mer Counting Pipelines

aindex supports two k-mer counting backends:

  1. Built-in kmer_counter (Recommended) - Fast native implementation, ~5x faster than jellyfish
  2. Jellyfish - Traditional external tool (requires brew install jellyfish on macOS)

Quick Start

Compute all binary arrays using the fast built-in counter:

FASTQ1=./tests/raw_reads.101bp.IS350bp25_1.fastq
FASTQ2=./tests/raw_reads.101bp.IS350bp25_2.fastq
OUTPUT_PREFIX=./tests/raw_reads.101bp.IS350bp25

# Using built-in kmer_counter (recommended, faster) via CLI
aindex compute-aindex -i $FASTQ1,$FASTQ2 -t fastq -o $OUTPUT_PREFIX --lu 2 -P 30 --use-kmer-counter

# Using built-in kmer_counter (legacy script approach)
python3 scripts/compute_aindex.py -i $FASTQ1,$FASTQ2 -t fastq -o $OUTPUT_PREFIX --lu 2 -P 30 --use_kmer_counter

# Using jellyfish (traditional approach)
python3 scripts/compute_aindex.py -i $FASTQ1,$FASTQ2 -t fastq -o $OUTPUT_PREFIX --lu 2 -P 30

Command Line Options

  • -i, --input: Input FASTQ/FASTA files (comma-separated for multiple files)
  • -t, --type: Input file type ('fastq' or 'fasta')
  • -o, --output: Output prefix for generated files
  • --lu: Lower frequency threshold for k-mers
  • -P, --threads: Number of threads to use
  • --use-kmer-counter: Use built-in fast k-mer counter instead of jellyfish

Pipeline Outputs

Both pipelines generate identical output files:

  • .reads - Processed reads file
  • .dat - K-mer frequency data
  • .aindex - Binary index file
  • .stat - Statistics and metadata

Usage from Python

Modern API

The aindex package provides a unified API supporting both 13-mer and 23-mer modes:

from aindex.core.aindex import AIndex
import aindex.core.aindex_cpp as aindex_cpp

# Load 23-mer index (for genomic sequences)
index_23mer = AIndex.load_from_prefix("temp/reads.23")
index_23mer.load_reads("temp/reads.reads")  # Optional: load actual read sequences

# Load 13-mer index (for complete k-mer space analysis)
index_13mer = aindex_cpp.AindexWrapper()
index_13mer.load_from_prefix_13mer("temp/all_13mers")
index_13mer.load_reads("temp/reads.reads")  # Optional: load reads

print(f"23-mer index: {index_23mer.n_kmers:,} k-mers, {index_23mer.n_reads:,} reads")
print(f"13-mer index: {index_13mer.get_13mer_statistics()}")

K-mer Frequency Queries

Single k-mer queries:

# 23-mer queries (using AIndex wrapper)
tf_23 = index_23mer.get_tf_value("ATCGATCGATCGATCGATCGATC")  # 23 characters
print(f"23-mer frequency: {tf_23}")

# Alternative 23-mer query using [] operator
tf_23_alt = index_23mer["ATCGATCGATCGATCGATCGATC"]
print(f"23-mer frequency (alt): {tf_23_alt}")

# 13-mer queries (using C++ wrapper directly)
tf_13 = index_13mer.get_total_tf_value_13mer("ATCGATCGATCGA")  # 13 characters
print(f"13-mer frequency: {tf_13}")

# Get forward and reverse frequencies separately for 13-mers
tf_fwd, tf_rev = index_13mer.get_tf_both_directions_13mer("ATCGATCGATCGA")
print(f"13-mer forward: {tf_fwd}, reverse: {tf_rev}, total: {tf_fwd + tf_rev}")

Batch queries (much faster):

# Batch 23-mer queries (2-3x faster than single queries)
kmers_23 = ["ATCGATCGATCGATCGATCGATC", "AAAAAAAAAAAAAAAAAAAAAA", "TTTTTTTTTTTTTTTTTTTTTTT"]
tf_values_23 = index_23mer.get_tf_values(kmers_23)
print(f"23-mer batch results: {tf_values_23}")

# Batch 13-mer queries (total frequencies)
kmers_13 = ["ATCGATCGATCGA", "AAAAAAAAAAAAA", "TTTTTTTTTTTTT"] 
tf_values_13 = index_13mer.get_total_tf_values_13mer(kmers_13)
print(f"13-mer batch results: {tf_values_13}")

# Batch directional 13-mer queries (forward + reverse separately)
directional_results = index_13mer.get_tf_both_directions_13mer_batch(kmers_13)
for i, (fwd, rev) in enumerate(directional_results):
    print(f"{kmers_13[i]}: forward={fwd}, reverse={rev}, total={fwd+rev}")

Advanced 13-mer Operations

Directional analysis (forward + reverse complement):

# Get frequencies in both directions
kmer = "ATCGATCGATCGA"
forward_tf, reverse_tf = index_13mer.get_tf_both_directions_13mer(kmer)
total_tf = index_13mer.get_total_tf_value_13mer(kmer)

print(f"Forward: {forward_tf}, Reverse: {reverse_tf}, Total: {total_tf}")

# Batch directional analysis
results = index_13mer.get_tf_both_directions_13mer_batch(kmers_13)
for i, (fwd, rev) in enumerate(results):
    print(f"{kmers_13[i]}: forward={fwd}, reverse={rev}")

Complete 13-mer space analysis:

# Get statistics for the entire 13-mer space
stats = index_13mer.get_13mer_statistics()
print(f"Total 13-mers: {stats['total_kmers']:,}")
print(f"Non-zero frequencies: {stats['non_zero_kmers']:,}")
print(f"Max frequency: {stats['max_frequency']:,}")
print(f"Average frequency: {stats['total_count']/stats['non_zero_kmers']:.2f}")

# Access complete frequency array (4^13 = 67M elements)
# Note: This loads 256MB into memory
full_array = index_13mer.get_13mer_tf_array()
print(f"Array size: {len(full_array):,} elements")

Sequence Coverage Analysis

Analyze k-mer coverage in sequences:

# Using real reads from the index
real_read = index_23mer.get_read_by_rid(0)  # Get first read
sequence = real_read.split('~')[0][:100] if '~' in real_read else real_read[:100]    # Take first 100 bp

# Analyze 23-mer coverage using built-in function
coverage_23 = index_23mer.get_sequence_coverage(sequence, cutoff=0, k=23)
print(f"23-mer coverage: {len(coverage_23)} positions")
print(f"Non-zero positions: {sum(1 for tf in coverage_23 if tf > 0)}")
print(f"Average TF: {sum(coverage_23)/len(coverage_23):.2f}")

# Analyze 13-mer coverage using batch queries
kmers_13_in_seq = [sequence[i:i+13] for i in range(len(sequence) - 12)]
coverage_13 = index_13mer.get_total_tf_values_13mer(kmers_13_in_seq)
print(f"13-mer coverage: {len(coverage_13)} positions")
print(f"Non-zero positions: {sum(1 for tf in coverage_13 if tf > 0)}")
print(f"Average TF: {sum(coverage_13)/len(coverage_13):.2f}")

Iterate over k-mers in sequences:

# 23-mer iteration using built-in iterator
for kmer, tf in index_23mer.iter_sequence_kmers(sequence, k=23):
    if tf > 0:  # Only show k-mers found in index
        print(f"{kmer}: {tf}")

# 13-mer iteration using manual approach (more efficient with batch)
kmers_13 = [sequence[i:i+13] for i in range(len(sequence) - 12)]
tf_values_13 = index_13mer.get_total_tf_values_13mer(kmers_13)

for i, (kmer, tf) in enumerate(zip(kmers_13, tf_values_13)):
    if tf > 0:
        print(f"Position {i}: {kmer}: {tf}")

# For directional analysis of 13-mers
directional_results = index_13mer.get_tf_both_directions_13mer_batch(kmers_13)
for i, (fwd, rev) in enumerate(directional_results):
    if fwd > 0 or rev > 0:
        print(f"Position {i}: {kmers_13[i]}: forward={fwd}, reverse={rev}")

Performance Benchmarks

Based on stress testing with 1M queries and 10K sequence analyses:

Operation 13-mers 23-mers Speedup
Single TF queries 491K queries/sec 1.1M queries/sec 23-mer 2.2x faster
Batch TF queries 2.0M queries/sec 2.3M queries/sec 23-mer 1.2x faster
Sequence coverage 24.5K sequences/sec 17.5K sequences/sec 13-mer 1.4x faster
K-mer positions 2.2M positions/sec 1.4M positions/sec 13-mer 1.6x faster

Key findings:

  • Batch operations: 2-4x faster than single queries for both modes
  • 23-mers: Better for single/batch TF queries due to optimized sparse indexing
  • 13-mers: Better for sequence analysis due to complete space coverage
  • Memory efficiency: Minimal memory growth during batch operations

Working with Reads

Access reads by ID:

# Get reads from either index
for rid in range(min(5, index_23mer.n_reads)):
    read = index_23mer.get_read_by_rid(rid)
    print(f"Read {rid}: {read[:50]}...")  # First 50 characters
    
    # Split paired reads (separated by '~')
    if '~' in read:
        read1, read2 = read.split('~')
        print(f"  Read 1: {len(read1)} bp, Read 2: {len(read2)} bp")

Iterate over all reads:

# Iterate through reads with automatic ID assignment
read_count = 0
for rid, read in index_23mer.iter_reads():
    read_count += 1
    if read_count <= 5:  # Show first 5 reads
        print(f"Read {rid}: {len(read)} bp")
    if read_count >= 1000:  # Process first 1000 reads
        break
        
print(f"Processed {read_count} reads")

Complete Example

Here's a practical example showing both 13-mer and 23-mer analysis:

from aindex.core.aindex import AIndex
import aindex.core.aindex_cpp as aindex_cpp
import time

# Load both indices
print("Loading indices...")
index_23mer = AIndex.load_from_prefix("temp/reads.23")
index_23mer.load_reads("temp/reads.reads")

index_13mer = aindex_cpp.AindexWrapper()
index_13mer.load_from_prefix_13mer("temp/all_13mers")
index_13mer.load_reads("temp/reads.reads")

# Get a real sequence to analyze
real_read = index_23mer.get_read_by_rid(0)
sequence = real_read.split('~')[0][:100] if '~' in real_read else real_read[:100]
print(f"Analyzing sequence: {sequence[:50]}...")

# Compare 13-mer vs 23-mer coverage
print("\n=== Coverage Analysis ===")

# 23-mer coverage using built-in function
start = time.time()
coverage_23 = index_23mer.get_sequence_coverage(sequence, cutoff=0, k=23)
time_23 = time.time() - start

# 13-mer coverage using batch query
kmers_13 = [sequence[i:i+13] for i in range(len(sequence) - 12)]
start = time.time()
coverage_13 = index_13mer.get_total_tf_values_13mer(kmers_13)
time_13 = time.time() - start

print(f"23-mers: {len(coverage_23)} positions, {sum(1 for x in coverage_23 if x > 0)} covered ({time_23*1000:.1f}ms)")
print(f"13-mers: {len(coverage_13)} positions, {sum(1 for x in coverage_13 if x > 0)} covered ({time_13*1000:.1f}ms)")

# Performance comparison
print(f"\n=== Performance Test ===")
test_kmers_23 = ["ATCGATCGATCGATCGATCGATC"] * 1000
test_kmers_13 = ["ATCGATCGATCGA"] * 1000

# 23-mer batch query
start = time.time()
results_23 = index_23mer.get_tf_values(test_kmers_23)
time_23_batch = time.time() - start

# 13-mer batch query
start = time.time()
results_13 = index_13mer.get_total_tf_values_13mer(test_kmers_13)
time_13_batch = time.time() - start

print(f"23-mer batch (1K queries): {len(test_kmers_23)/time_23_batch:.0f} queries/sec")
print(f"13-mer batch (1K queries): {len(test_kmers_13)/time_13_batch:.0f} queries/sec")

# Statistics
stats_23 = {"kmers": index_23mer.n_kmers, "reads": index_23mer.n_reads}
stats_13 = index_13mer.get_13mer_statistics()

print(f"\n=== Index Statistics ===")
print(f"23-mer index: {stats_23['kmers']:,} k-mers, {stats_23['reads']:,} reads")
print(f"13-mer index: {stats_13['total_kmers']:,} total k-mers, {stats_13['non_zero_kmers']:,} non-zero")

Expected output:

Loading indices...
Analyzing sequence: NNNNNNNNNNACTGAACCGCCTTCCGATCTCCAGCTGCAAAGCGTAG...

=== Coverage Analysis ===
23-mers: 78 positions, 42 covered (0.3ms)
13-mers: 88 positions, 88 covered (0.1ms)

=== Performance Test ===
23-mer batch (1K queries): 2,300,000 queries/sec
13-mer batch (1K queries): 2,000,000 queries/sec

=== Index Statistics ===
23-mer index: 15,234,567 k-mers, 125,000 reads  
13-mer index: 67,108,864 total k-mers, 8,945,123 non-zero

Advanced Features

13-mer Integration

The aindex library provides highly optimized 13-mer k-mer counting and querying using precomputed perfect hash tables. This mode offers complete coverage of the 4^13 k-mer space with exceptional performance.

Performance Characteristics

Query Performance:

  • Single queries: 491K queries/second
  • Batch queries: 2.0M queries/second (4.1x speedup)
  • Directional queries: 1.8M queries/second (forward + reverse complement)
  • Complete space: Access to all 67,108,864 possible 13-mers

Sequence Analysis Performance:

  • Coverage analysis: 24,500 sequences/second
  • Position analysis: 2.2M k-mer positions/second
  • Memory efficiency: Zero memory growth during batch operations
  • Real data coverage: 100% (all k-mers found in biological data)

13-mer Workflow

1. Generate Complete 13-mer Space:

# Generate all possible 13-mers (67M k-mers)
./bin/generate_all_13mers.exe all_13mers.txt

# Build perfect hash for instant lookup
./bin/build_13mer_hash.exe all_13mers.txt temp/all_13mers 4

# Count k-mers in your genomic data
./bin/count_kmers13.exe input_reads.fasta temp/all_13mers.tf.bin hash_file 4

2. Python API Usage:

from aindex.core.aindex import AIndex

# Load 13-mer index with complete k-mer space
index = AIndex.load_from_prefix_13mer("temp/all_13mers")
index.load_reads("temp/reads.reads")  # Optional: load read sequences

# Query performance demonstration
import time

# Single k-mer query
start = time.time()
tf = index.get_total_tf_value_13mer("ATCGATCGATCGA")
single_time = time.time() - start
print(f"Single query: {tf} (took {single_time*1000:.3f}ms)")

# Batch query (much faster)
kmers = ["ATCGATCGATCGA", "AAAAAAAAAAAAA", "TTTTTTTTTTTTT"] * 1000  # 3K queries
start = time.time() 
tf_values = index.get_total_tf_values_13mer(kmers)
batch_time = time.time() - start
print(f"Batch {len(kmers)} queries: {batch_time:.3f}s ({len(kmers)/batch_time:.0f} queries/sec)")

# Directional analysis (forward + reverse complement)
forward, reverse = index.get_tf_both_directions_13mer("ATCGATCGATCGA")
total = index.get_total_tf_value_13mer("ATCGATCGATCGA")
print(f"Directional: forward={forward}, reverse={reverse}, total={total}")

13-mer Statistics and Analysis

Get comprehensive statistics:

# Complete 13-mer space statistics
stats = index.get_13mer_statistics()
print(f"Total 13-mer space: {stats['total_kmers']:,}")
print(f"Found in data: {stats['non_zero_kmers']:,} ({stats['non_zero_kmers']/stats['total_kmers']*100:.2f}%)")
print(f"Max frequency: {stats['max_frequency']:,}")
print(f"Total occurrences: {stats['total_count']:,}")
print(f"Average frequency: {stats['total_count']/stats['non_zero_kmers']:.2f}")

# Access complete frequency array (warning: 256MB)
if stats['non_zero_kmers'] > 0:
    # Get subset for analysis rather than full array
    sample_indices = range(0, 1000000, 1000)  # Sample every 1000th element
    sample_tfs = [index.get_tf_by_index_13mer(i) for i in sample_indices]
    non_zero_sample = [tf for tf in sample_tfs if tf > 0]
    print(f"Sample analysis: {len(non_zero_sample)}/{len(sample_tfs)} non-zero in sample")

Sequence coverage analysis:

# Analyze real genomic sequences
for rid in range(min(5, index.n_reads)):
    read = index.get_read_by_rid(rid)
    if '~' in read:
        sequence = read.split('~')[0]  # Take first mate
    else:
        sequence = read
    
    # Limit to reasonable length for demonstration
    if len(sequence) > 100:
        sequence = sequence[:100]
    
    # Compute 13-mer coverage
    coverage = []
    for i in range(len(sequence) - 12):
        kmer = sequence[i:i+13]
        tf = index.get_total_tf_value_13mer(kmer)
        coverage.append(tf)
    
    if coverage:
        avg_tf = sum(coverage) / len(coverage)
        max_tf = max(coverage)
        coverage_pct = sum(1 for tf in coverage if tf > 0) / len(coverage) * 100
        print(f"Read {rid}: {len(coverage)} 13-mers, {coverage_pct:.1f}% covered, avg TF {avg_tf:.1f}, max TF {max_tf}")

23-mer Integration

The 23-mer mode provides efficient sparse indexing for longer k-mers commonly used in genomic analysis.

Performance Characteristics

Query Performance:

  • Single queries: 1.0M queries/second
  • Batch queries: 2.4M queries/second (2.4x speedup)
  • Directional queries: Available for forward + reverse complement analysis
  • Sparse indexing: Only stores k-mers present in input data

Sequence Analysis Performance:

  • Coverage analysis: 16,900 sequences/second
  • Position analysis: 1.3M k-mer positions/second
  • Memory efficiency: Constant memory usage during operations
  • Real data coverage: 100% (all k-mers found in genomic sequences)

23-mer Workflow

1. Build 23-mer Index:

# Using the fast built-in k-mer counter (recommended)
FASTQ1=./tests/raw_reads.101bp.IS350bp25_1.fastq
FASTQ2=./tests/raw_reads.101bp.IS350bp25_2.fastq
OUTPUT_PREFIX=./temp/reads.23

python3 scripts/compute_aindex.py -i $FASTQ1,$FASTQ2 -t fastq -o $OUTPUT_PREFIX --lu 2 -P 30 --use_kmer_counter

2. Python API Usage:

from aindex.core.aindex import AIndex

# Load 23-mer index
index = AIndex.load_from_prefix("temp/reads.23")
index.load_reads("temp/reads.reads")

# Performance demonstration
import time

# Batch query performance
kmers = ["ATCGATCGATCGATCGATCGATC", "AAAAAAAAAAAAAAAAAAAAAA"] * 1000  # 2K queries
start = time.time()
tf_values = index.get_tf_values(kmers)  # Auto-detects 23-mer mode
batch_time = time.time() - start
print(f"23-mer batch {len(kmers)} queries: {batch_time:.3f}s ({len(kmers)/batch_time:.0f} queries/sec)")

# Sequence coverage analysis
read = index.get_read_by_rid(0)
sequence = read.split('~')[0][:100] if '~' in read else read[:100]

start = time.time()
coverage = index.get_sequence_coverage(sequence, cutoff=0, k=23)
coverage_time = time.time() - start

print(f"23-mer coverage analysis: {len(coverage)} positions in {coverage_time*1000:.1f}ms")
print(f"Coverage: {sum(1 for tf in coverage if tf > 0)/len(coverage)*100:.1f}% positions covered")
print(f"Average TF: {sum(coverage)/len(coverage):.1f}")

Performance Comparison

Throughput Comparison (Operations per Second)

Operation Type 13-mers 23-mers Winner
Single TF queries 491K/sec 1.1M/sec 23-mer (+124%)
Batch TF queries 2.0M/sec 2.3M/sec 23-mer (+15%)
Sequence coverage 24.5K/sec 17.5K/sec 13-mer (+40%)
Position analysis 2.2M/sec 1.4M/sec 13-mer (+57%)

Use Case Recommendations

Choose 13-mers when:

  • Analyzing complete k-mer space (population genetics, mutation analysis)
  • Maximum query performance needed
  • Working with shorter sequences or fragments
  • Need comprehensive coverage statistics

Choose 23-mers when:

  • Standard genomic analysis (assembly, alignment, variant calling)
  • Working with longer reads (>100bp)
  • Memory efficiency is critical
  • Integration with existing 23-mer workflows

Memory Usage

  • 13-mer index: ~277MB (256MB frequencies + 21MB hash)
  • 23-mer index: Variable, depends on data complexity
  • Both modes: Memory-mapped files for efficient access
  • Batch operations: Minimal additional memory overhead

File Formats

13-mer Files

  • .tf.bin: Binary frequency array (uint64_t × 67M elements = 512MB)
  • .pf: Perfect hash function for k-mer → index mapping
  • .kmers.bin: Binary k-mer encoding (optional validation)

23-mer Files

  • .tf.bin: Binary frequency array (variable size)
  • .pf: Perfect hash function
  • .kmers.bin: Binary k-mer storage
  • .aindex.indices.bin & .aindex.index.bin: Position indices (optional)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

aindex2-1.3.21.tar.gz (78.0 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

aindex2-1.3.21-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (546.6 kB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ x86-64

aindex2-1.3.21-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (545.1 kB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

aindex2-1.3.21-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (543.7 kB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ x86-64

aindex2-1.3.21-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (543.7 kB view details)

Uploaded CPython 3.9manylinux: glibc 2.17+ x86-64

aindex2-1.3.21-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (387.3 kB view details)

Uploaded CPython 3.8manylinux: glibc 2.17+ x86-64

File details

Details for the file aindex2-1.3.21.tar.gz.

File metadata

  • Download URL: aindex2-1.3.21.tar.gz
  • Upload date:
  • Size: 78.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.13

File hashes

Hashes for aindex2-1.3.21.tar.gz
Algorithm Hash digest
SHA256 8f524635810b355fbe100fec054ae5cb7bc6495e916eccbe340ecbb1636b9cd5
MD5 b688c9124d91e95ef827384ae43daa2f
BLAKE2b-256 80706fca56e364b20cf73841d2611a9dc4f5ce120265aa4d398404a009d5bf68

See more details on using hashes here.

File details

Details for the file aindex2-1.3.21-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for aindex2-1.3.21-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 7221ca7e3200a322ab400f44ba97c6140b3c900872fcbd536765b2e18065babf
MD5 d5d05cfc210169206df3afcfa0a75cf2
BLAKE2b-256 7a65dfa37223320525254272605af0c0850fe12760f21cff21dce105f464ba98

See more details on using hashes here.

File details

Details for the file aindex2-1.3.21-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for aindex2-1.3.21-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 706d0f579715250ab6bcdedae1c1ee79c6693109e4bd6c65485352587f9c4dd6
MD5 0a6c71088102b5722fefbce41048dc64
BLAKE2b-256 5dc4cd271ef6e354ac47cfc335af5d1170ae7056f3288034dd3a0b3c8fa15538

See more details on using hashes here.

File details

Details for the file aindex2-1.3.21-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for aindex2-1.3.21-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 eda6eb4641558b751f9b16558f45712be9a3a89769e591ba9a0dc31072f084e2
MD5 d42b5419879593e774ada26559396edd
BLAKE2b-256 977799ef14ac64941a60345a6704a68a17e9a2f4243b6c6c4b6bf56e840d35ed

See more details on using hashes here.

File details

Details for the file aindex2-1.3.21-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for aindex2-1.3.21-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 eb4fd5ce30589ad54c458468b680f45375dc1a52e97ea7cb00c4439cde895f62
MD5 ad5cf53109674f5b9687552463435f46
BLAKE2b-256 0489b1e5991b0faece3fea9280308f20d3cf49fca4c7c050f72dbcf36cb388c5

See more details on using hashes here.

File details

Details for the file aindex2-1.3.21-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for aindex2-1.3.21-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 736fb458ab942cf687252f4f4300983b579bd2a48b70b6d1061541fadf8e6823
MD5 e5f70eb080a39a70c13952cbe517def4
BLAKE2b-256 0cd4f14be1e8993a01fc1c8d33b5c4e057bad1d005591264cdc7356e668a02b6

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page