Perfect hash based index for genome data.

These details have not been verified by PyPI

Project links

Homepage

Project description

aindex: perfect hash based index for genomic data

PyPI - Wheel GitHub Actions Workflow Status

Features

🚀 High Performance: Ultra-fast k-mer querying with optimized C++ backend

13-mers: 2.0M queries/sec (batch), 491K queries/sec (single)
23-mers: 2.3M queries/sec (batch), 1.1M queries/sec (single)
Sequence coverage analysis: 24.5K sequences/sec (13-mers), 17.5K sequences/sec (23-mers)

🧬 Dual K-mer Support: Native support for both 13-mer and 23-mer k-mers

13-mers: Complete 4^13 space coverage with perfect hashing
23-mers: Efficient sparse indexing for genomic sequences
Auto-detection: Seamlessly switches between modes based on k-mer length

💾 Memory Efficient: Optimized data structures and memory-mapped files

Batch operations: Up to 4x faster than single queries
Minimal memory overhead: Constant memory usage during processing
Real-time processing: Stream processing for large genomic datasets

🔧 Modern API: Clean pybind11 interface with comprehensive functionality

Installation

Quick install with pip:

pip install aindex2

✅ Supported platforms (pre-built wheels available):

macOS: arm64 (Apple Silicon M1/M2/M3) - full functionality with C++ optimizations
Linux: x86_64 - full functionality with C++ optimizations

⚡ Currently optimized for: Our builds are specifically optimized for the most widely used platforms:

Apple Silicon (M1/M2/M3): Native ARM64 optimizations with up to 30% faster performance
Linux x86_64: Standard Intel/AMD processors with full C++ backend

🔄 Other platforms: For platforms not listed above (Windows, Linux ARM64, macOS Intel), you can:

Use Windows Subsystem for Linux (WSL) for Windows users
Build from source (see building instructions below)
Use cloud environments (Google Colab, Jupyter notebooks, etc.)

Recommended platforms for production use: Linux x86_64 or macOS arm64

📋 For detailed platform support information, see PLATFORM_SUPPORT.md

Building from source (optional):

git clone https://github.com/ad3002/aindex.git
cd aindex
make arm64  # For Apple Silicon
# or
make all    # For x86_64
pip install .

Requirements:

Python 3.8+
Standard build tools (automatically handled by pip)
No external dependencies required

For Google Colab users:

!pip install aindex2

Detailed Installation Instructions

Standard installation with pip (all platforms):

pip install aindex2

Installation from source (for development or custom builds):

git clone https://github.com/ad3002/aindex.git
cd aindex

# Standard build (all platforms)
make all
pip install .

# For Apple Silicon with ARM64 optimizations  
make arm64
pip install .

Platform support:

macOS arm64 (Apple Silicon M1/M2/M3): Pre-built wheels with ARM64 optimizations
Linux x86_64: Pre-built wheels with full C++ functionality
Other platforms: Build from source or use alternative environments (WSL, Docker, Colab)
macOS: x86_64 (Intel), arm64 (Apple Silicon) - pre-built wheels available with full C++ functionality
Windows: AMD64 - pre-built wheels available with Python-only functionality

All platforms include optimized builds with no external dependencies required.

Windows-Specific Notes

Current Status: Python-only functionality

The Windows build installs successfully but has limited functionality
C++ k-mer counting and high-performance indexing are not available on Windows
This is due to POSIX-specific dependencies (sys/mman.h, memory mapping) in the C++ backend

What works on Windows:

# Python utilities and scripts work normally
import aindex

# File format conversion utilities
aindex.reads_to_fasta(input_file, output_file)

# Command-line utilities for file processing
# Note: High-performance k-mer operations require Linux/macOS

What doesn't work on Windows:

# These operations require C++ backend (Linux/macOS only)
from aindex.core.aindex import AIndex  # Will show clear error message
index = AIndex.load_from_prefix("data")  # Not available on Windows

For Windows users who need full functionality:

Use WSL (Windows Subsystem for Linux): Install Linux subsystem and use the Linux version
Use Docker: Run aindex in a Linux container
Use cloud/remote Linux machine: Process data on Linux and transfer results

Alternative for Windows users:

# Use WSL or Docker to get full functionality
wsl --install
wsl
pip install aindex2  # Now runs Linux version with full functionality

Google Colab Installation

For installation in Google Colab environment, there's a known cmake conflict that needs to be resolved first:

# Quick fix for cmake conflict
!pip uninstall -y cmake
!apt-get update
!apt-get install -y build-essential cmake git python3-dev

# Clone and install aindex
!git clone https://github.com/ad3002/aindex.git
%cd aindex
!pip install .

Alternative: Use automatic installation script

# Download and run the installation script
!wget https://raw.githubusercontent.com/ad3002/aindex/main/install_colab.py
!python install_colab.py

For troubleshooting, use the diagnostic script:

!wget https://raw.githubusercontent.com/ad3002/aindex/main/diagnose_colab.py
!python diagnose_colab.py

Note: Google Colab has a conflict between the Python cmake package and system cmake. The scripts above automatically resolve this issue.

To uninstall:

pip uninstall aindex2
pip uninstall clean

To clean up the compiled files, run:

make clean

macOS Compilation (ARM64/Apple Silicon Support)

For macOS systems (including Apple Silicon M1/M2), aindex now provides full ARM64 support with optimized performance:

# Build all components including the fast kmer_counter utility
make

# Alternative: build only core components
make macos

The project has been fully ported to ARM64/macOS, removing x86-specific dependencies (SSE instructions) and adding native ARM64 optimization.

Requirements for macOS:

For jellyfish-based pipeline: brew install jellyfish (optional)
Built-in kmer_counter provides faster alternative to jellyfish
Python development headers (usually included with Xcode tools)

Performance Note: The new built-in kmer_counter utility is approximately 5x faster than jellyfish for k-mer counting tasks.

Usage

Note: The examples below demonstrate full functionality available on Linux and macOS. Windows users have access to Python utilities only. See Windows-Specific Notes for details.

Command Line Interface (CLI)

aindex provides a unified command-line interface for all tools and utilities. After installation, all functions are accessible through the aindex command:

# Get help for all available commands
aindex --help

# Get help for a specific command
aindex count --help
aindex compute-aindex --help

Available Commands

Core indexing tools:

# Compute AIndex for genomic sequences (supports both 13-mer and 23-mer modes)
aindex compute-aindex -i input.fastq -o output_prefix -k 23

# Compute general index
aindex compute-index -i input.fasta -o output_prefix

# Process reads for indexing
aindex compute-reads input.fastq output.fastq fastq reads_prefix

K-mer analysis:

# Count k-mers in sequences (fast built-in counter)
aindex count -i input.fasta -o output.txt -k 23 -t 4

# Count 13-mers specifically (optimized for complete 13-mer space)
aindex count -i input.fasta -o output.txt -k 13

# Build hash table for k-mers
aindex build-hash -i kmers.txt -o hash_output

# Generate all possible 13-mers
aindex generate -o all_13mers.txt -k 13

Utilities:

# Convert reads to FASTA format
aindex reads-to-fasta input.fastq output.fasta

# Show version information
aindex version

# Show system and installation information
aindex info

Examples

Count 23-mers in a FASTA file:

aindex count -i genome.fasta -o kmer_counts.txt -k 23 -t 8

Build AIndex for 13-mer analysis:

aindex compute-aindex -i reads.fastq -o reads_index -k 13 --lu 2 -P 16

Generate all possible 13-mers for reference:

aindex generate -o all_13mers.txt -k 13

K-mer Counting Pipelines

aindex supports two k-mer counting backends:

Built-in kmer_counter (Recommended) - Fast native implementation, ~5x faster than jellyfish
Jellyfish - Traditional external tool (requires brew install jellyfish on macOS)

Quick Start

Compute all binary arrays using the fast built-in counter:

FASTQ1=./tests/raw_reads.101bp.IS350bp25_1.fastq
FASTQ2=./tests/raw_reads.101bp.IS350bp25_2.fastq
OUTPUT_PREFIX=./tests/raw_reads.101bp.IS350bp25

# Using built-in kmer_counter (recommended, faster) via CLI
aindex compute-aindex -i $FASTQ1,$FASTQ2 -t fastq -o $OUTPUT_PREFIX --lu 2 -P 30 --use-kmer-counter

# Using built-in kmer_counter (legacy script approach)
python3 scripts/compute_aindex.py -i $FASTQ1,$FASTQ2 -t fastq -o $OUTPUT_PREFIX --lu 2 -P 30 --use_kmer_counter

# Using jellyfish (traditional approach)
python3 scripts/compute_aindex.py -i $FASTQ1,$FASTQ2 -t fastq -o $OUTPUT_PREFIX --lu 2 -P 30

Command Line Options

-i, --input: Input FASTQ/FASTA files (comma-separated for multiple files)
-t, --type: Input file type ('fastq' or 'fasta')
-o, --output: Output prefix for generated files
--lu: Lower frequency threshold for k-mers
-P, --threads: Number of threads to use
--use-kmer-counter: Use built-in fast k-mer counter instead of jellyfish

Pipeline Outputs

Both pipelines generate identical output files:

.reads - Processed reads file
.dat - K-mer frequency data
.aindex - Binary index file
.stat - Statistics and metadata

Usage from Python

Platform Note: Full Python API with C++ backend is available on Linux and macOS. Windows provides Python utilities only.

Modern API

The aindex package provides a unified API supporting both 13-mer and 23-mer modes:

from aindex.core.aindex import AIndex
import aindex.core.aindex_cpp as aindex_cpp

# Load 23-mer index (for genomic sequences)
index_23mer = AIndex.load_from_prefix("temp/reads.23")
index_23mer.load_reads("temp/reads.reads")  # Optional: load actual read sequences

# Load 13-mer index (for complete k-mer space analysis)
index_13mer = aindex_cpp.AindexWrapper()
index_13mer.load_from_prefix_13mer("temp/all_13mers")
index_13mer.load_reads("temp/reads.reads")  # Optional: load reads

print(f"23-mer index: {index_23mer.n_kmers:,} k-mers, {index_23mer.n_reads:,} reads")
print(f"13-mer index: {index_13mer.get_13mer_statistics()}")

K-mer Frequency Queries

Single k-mer queries:

# 23-mer queries (using AIndex wrapper)
tf_23 = index_23mer.get_tf_value("ATCGATCGATCGATCGATCGATC")  # 23 characters
print(f"23-mer frequency: {tf_23}")

# Alternative 23-mer query using [] operator
tf_23_alt = index_23mer["ATCGATCGATCGATCGATCGATC"]
print(f"23-mer frequency (alt): {tf_23_alt}")

# 13-mer queries (using C++ wrapper directly)
tf_13 = index_13mer.get_total_tf_value_13mer("ATCGATCGATCGA")  # 13 characters
print(f"13-mer frequency: {tf_13}")

# Get forward and reverse frequencies separately for 13-mers
tf_fwd, tf_rev = index_13mer.get_tf_both_directions_13mer("ATCGATCGATCGA")
print(f"13-mer forward: {tf_fwd}, reverse: {tf_rev}, total: {tf_fwd + tf_rev}")

Batch queries (much faster):

# Batch 23-mer queries (2-3x faster than single queries)
kmers_23 = ["ATCGATCGATCGATCGATCGATC", "AAAAAAAAAAAAAAAAAAAAAA", "TTTTTTTTTTTTTTTTTTTTTTT"]
tf_values_23 = index_23mer.get_tf_values(kmers_23)
print(f"23-mer batch results: {tf_values_23}")

# Batch 13-mer queries (total frequencies)
kmers_13 = ["ATCGATCGATCGA", "AAAAAAAAAAAAA", "TTTTTTTTTTTTT"] 
tf_values_13 = index_13mer.get_total_tf_values_13mer(kmers_13)
print(f"13-mer batch results: {tf_values_13}")

# Batch directional 13-mer queries (forward + reverse separately)
directional_results = index_13mer.get_tf_both_directions_13mer_batch(kmers_13)
for i, (fwd, rev) in enumerate(directional_results):
    print(f"{kmers_13[i]}: forward={fwd}, reverse={rev}, total={fwd+rev}")

Advanced 13-mer Operations

Directional analysis (forward + reverse complement):

# Get frequencies in both directions
kmer = "ATCGATCGATCGA"
forward_tf, reverse_tf = index_13mer.get_tf_both_directions_13mer(kmer)
total_tf = index_13mer.get_total_tf_value_13mer(kmer)

print(f"Forward: {forward_tf}, Reverse: {reverse_tf}, Total: {total_tf}")

# Batch directional analysis
results = index_13mer.get_tf_both_directions_13mer_batch(kmers_13)
for i, (fwd, rev) in enumerate(results):
    print(f"{kmers_13[i]}: forward={fwd}, reverse={rev}")

Complete 13-mer space analysis:

# Get statistics for the entire 13-mer space
stats = index_13mer.get_13mer_statistics()
print(f"Total 13-mers: {stats['total_kmers']:,}")
print(f"Non-zero frequencies: {stats['non_zero_kmers']:,}")
print(f"Max frequency: {stats['max_frequency']:,}")
print(f"Average frequency: {stats['total_count']/stats['non_zero_kmers']:.2f}")

# Access complete frequency array (4^13 = 67M elements)
# Note: This loads 256MB into memory
full_array = index_13mer.get_13mer_tf_array()
print(f"Array size: {len(full_array):,} elements")

Sequence Coverage Analysis

Analyze k-mer coverage in sequences:

# Using real reads from the index
real_read = index_23mer.get_read_by_rid(0)  # Get first read
sequence = real_read.split('~')[0][:100] if '~' in real_read else real_read[:100]    # Take first 100 bp

# Analyze 23-mer coverage using built-in function
coverage_23 = index_23mer.get_sequence_coverage(sequence, cutoff=0, k=23)
print(f"23-mer coverage: {len(coverage_23)} positions")
print(f"Non-zero positions: {sum(1 for tf in coverage_23 if tf > 0)}")
print(f"Average TF: {sum(coverage_23)/len(coverage_23):.2f}")

# Analyze 13-mer coverage using batch queries
kmers_13_in_seq = [sequence[i:i+13] for i in range(len(sequence) - 12)]
coverage_13 = index_13mer.get_total_tf_values_13mer(kmers_13_in_seq)
print(f"13-mer coverage: {len(coverage_13)} positions")
print(f"Non-zero positions: {sum(1 for tf in coverage_13 if tf > 0)}")
print(f"Average TF: {sum(coverage_13)/len(coverage_13):.2f}")

Iterate over k-mers in sequences:

# 23-mer iteration using built-in iterator
for kmer, tf in index_23mer.iter_sequence_kmers(sequence, k=23):
    if tf > 0:  # Only show k-mers found in index
        print(f"{kmer}: {tf}")

# 13-mer iteration using manual approach (more efficient with batch)
kmers_13 = [sequence[i:i+13] for i in range(len(sequence) - 12)]
tf_values_13 = index_13mer.get_total_tf_values_13mer(kmers_13)

for i, (kmer, tf) in enumerate(zip(kmers_13, tf_values_13)):
    if tf > 0:
        print(f"Position {i}: {kmer}: {tf}")

# For directional analysis of 13-mers
directional_results = index_13mer.get_tf_both_directions_13mer_batch(kmers_13)
for i, (fwd, rev) in enumerate(directional_results):
    if fwd > 0 or rev > 0:
        print(f"Position {i}: {kmers_13[i]}: forward={fwd}, reverse={rev}")

Performance Benchmarks

Based on stress testing with 1M queries and 10K sequence analyses:

Operation	13-mers	23-mers	Speedup
Single TF queries	491K queries/sec	1.1M queries/sec	23-mer 2.2x faster
Batch TF queries	2.0M queries/sec	2.3M queries/sec	23-mer 1.2x faster
Sequence coverage	24.5K sequences/sec	17.5K sequences/sec	13-mer 1.4x faster
K-mer positions	2.2M positions/sec	1.4M positions/sec	13-mer 1.6x faster

Key findings:

Batch operations: 2-4x faster than single queries for both modes
23-mers: Better for single/batch TF queries due to optimized sparse indexing
13-mers: Better for sequence analysis due to complete space coverage
Memory efficiency: Minimal memory growth during batch operations

Working with Reads

Access reads by ID:

# Get reads from either index
for rid in range(min(5, index_23mer.n_reads)):
    read = index_23mer.get_read_by_rid(rid)
    print(f"Read {rid}: {read[:50]}...")  # First 50 characters
    
    # Split paired reads (separated by '~')
    if '~' in read:
        read1, read2 = read.split('~')
        print(f"  Read 1: {len(read1)} bp, Read 2: {len(read2)} bp")

Iterate over all reads:

# Iterate through reads with automatic ID assignment
read_count = 0
for rid, read in index_23mer.iter_reads():
    read_count += 1
    if read_count <= 5:  # Show first 5 reads
        print(f"Read {rid}: {len(read)} bp")
    if read_count >= 1000:  # Process first 1000 reads
        break
        
print(f"Processed {read_count} reads")

Complete Example

Here's a practical example showing both 13-mer and 23-mer analysis:

from aindex.core.aindex import AIndex
import aindex.core.aindex_cpp as aindex_cpp
import time

# Load both indices
print("Loading indices...")
index_23mer = AIndex.load_from_prefix("temp/reads.23")
index_23mer.load_reads("temp/reads.reads")

index_13mer = aindex_cpp.AindexWrapper()
index_13mer.load_from_prefix_13mer("temp/all_13mers")
index_13mer.load_reads("temp/reads.reads")

# Get a real sequence to analyze
real_read = index_23mer.get_read_by_rid(0)
sequence = real_read.split('~')[0][:100] if '~' in real_read else real_read[:100]
print(f"Analyzing sequence: {sequence[:50]}...")

# Compare 13-mer vs 23-mer coverage
print("\n=== Coverage Analysis ===")

# 23-mer coverage using built-in function
start = time.time()
coverage_23 = index_23mer.get_sequence_coverage(sequence, cutoff=0, k=23)
time_23 = time.time() - start

# 13-mer coverage using batch query
kmers_13 = [sequence[i:i+13] for i in range(len(sequence) - 12)]
start = time.time()
coverage_13 = index_13mer.get_total_tf_values_13mer(kmers_13)
time_13 = time.time() - start

print(f"23-mers: {len(coverage_23)} positions, {sum(1 for x in coverage_23 if x > 0)} covered ({time_23*1000:.1f}ms)")
print(f"13-mers: {len(coverage_13)} positions, {sum(1 for x in coverage_13 if x > 0)} covered ({time_13*1000:.1f}ms)")

# Performance comparison
print(f"\n=== Performance Test ===")
test_kmers_23 = ["ATCGATCGATCGATCGATCGATC"] * 1000
test_kmers_13 = ["ATCGATCGATCGA"] * 1000

# 23-mer batch query
start = time.time()
results_23 = index_23mer.get_tf_values(test_kmers_23)
time_23_batch = time.time() - start

# 13-mer batch query
start = time.time()
results_13 = index_13mer.get_total_tf_values_13mer(test_kmers_13)
time_13_batch = time.time() - start

print(f"23-mer batch (1K queries): {len(test_kmers_23)/time_23_batch:.0f} queries/sec")
print(f"13-mer batch (1K queries): {len(test_kmers_13)/time_13_batch:.0f} queries/sec")

# Statistics
stats_23 = {"kmers": index_23mer.n_kmers, "reads": index_23mer.n_reads}
stats_13 = index_13mer.get_13mer_statistics()

print(f"\n=== Index Statistics ===")
print(f"23-mer index: {stats_23['kmers']:,} k-mers, {stats_23['reads']:,} reads")
print(f"13-mer index: {stats_13['total_kmers']:,} total k-mers, {stats_13['non_zero_kmers']:,} non-zero")

Expected output:

Loading indices...
Analyzing sequence: NNNNNNNNNNACTGAACCGCCTTCCGATCTCCAGCTGCAAAGCGTAG...

=== Coverage Analysis ===
23-mers: 78 positions, 42 covered (0.3ms)
13-mers: 88 positions, 88 covered (0.1ms)

=== Performance Test ===
23-mer batch (1K queries): 2,300,000 queries/sec
13-mer batch (1K queries): 2,000,000 queries/sec

=== Index Statistics ===
23-mer index: 15,234,567 k-mers, 125,000 reads  
13-mer index: 67,108,864 total k-mers, 8,945,123 non-zero

Advanced Features

13-mer Integration

The aindex library provides highly optimized 13-mer k-mer counting and querying using precomputed perfect hash tables. This mode offers complete coverage of the 4^13 k-mer space with exceptional performance.

Performance Characteristics

Query Performance:

Single queries: 491K queries/second
Batch queries: 2.0M queries/second (4.1x speedup)
Directional queries: 1.8M queries/second (forward + reverse complement)
Complete space: Access to all 67,108,864 possible 13-mers

Sequence Analysis Performance:

Coverage analysis: 24,500 sequences/second
Position analysis: 2.2M k-mer positions/second
Memory efficiency: Zero memory growth during batch operations
Real data coverage: 100% (all k-mers found in biological data)

13-mer Workflow

1. Generate Complete 13-mer Space:

# Generate all possible 13-mers (67M k-mers)
./bin/generate_all_13mers.exe all_13mers.txt

# Build perfect hash for instant lookup
./bin/build_13mer_hash.exe all_13mers.txt temp/all_13mers 4

# Count k-mers in your genomic data
./bin/count_kmers13.exe input_reads.fasta temp/all_13mers.tf.bin hash_file 4

2. Python API Usage:

from aindex.core.aindex import AIndex

# Load 13-mer index with complete k-mer space
index = AIndex.load_from_prefix_13mer("temp/all_13mers")
index.load_reads("temp/reads.reads")  # Optional: load read sequences

# Query performance demonstration
import time

# Single k-mer query
start = time.time()
tf = index.get_total_tf_value_13mer("ATCGATCGATCGA")
single_time = time.time() - start
print(f"Single query: {tf} (took {single_time*1000:.3f}ms)")

# Batch query (much faster)
kmers = ["ATCGATCGATCGA", "AAAAAAAAAAAAA", "TTTTTTTTTTTTT"] * 1000  # 3K queries
start = time.time() 
tf_values = index.get_total_tf_values_13mer(kmers)
batch_time = time.time() - start
print(f"Batch {len(kmers)} queries: {batch_time:.3f}s ({len(kmers)/batch_time:.0f} queries/sec)")

# Directional analysis (forward + reverse complement)
forward, reverse = index.get_tf_both_directions_13mer("ATCGATCGATCGA")
total = index.get_total_tf_value_13mer("ATCGATCGATCGA")
print(f"Directional: forward={forward}, reverse={reverse}, total={total}")

13-mer Statistics and Analysis

Get comprehensive statistics:

# Complete 13-mer space statistics
stats = index.get_13mer_statistics()
print(f"Total 13-mer space: {stats['total_kmers']:,}")
print(f"Found in data: {stats['non_zero_kmers']:,} ({stats['non_zero_kmers']/stats['total_kmers']*100:.2f}%)")
print(f"Max frequency: {stats['max_frequency']:,}")
print(f"Total occurrences: {stats['total_count']:,}")
print(f"Average frequency: {stats['total_count']/stats['non_zero_kmers']:.2f}")

# Access complete frequency array (warning: 256MB)
if stats['non_zero_kmers'] > 0:
    # Get subset for analysis rather than full array
    sample_indices = range(0, 1000000, 1000)  # Sample every 1000th element
    sample_tfs = [index.get_tf_by_index_13mer(i) for i in sample_indices]
    non_zero_sample = [tf for tf in sample_tfs if tf > 0]
    print(f"Sample analysis: {len(non_zero_sample)}/{len(sample_tfs)} non-zero in sample")

Sequence coverage analysis:

# Analyze real genomic sequences
for rid in range(min(5, index.n_reads)):
    read = index.get_read_by_rid(rid)
    if '~' in read:
        sequence = read.split('~')[0]  # Take first mate
    else:
        sequence = read
    
    # Limit to reasonable length for demonstration
    if len(sequence) > 100:
        sequence = sequence[:100]
    
    # Compute 13-mer coverage
    coverage = []
    for i in range(len(sequence) - 12):
        kmer = sequence[i:i+13]
        tf = index.get_total_tf_value_13mer(kmer)
        coverage.append(tf)
    
    if coverage:
        avg_tf = sum(coverage) / len(coverage)
        max_tf = max(coverage)
        coverage_pct = sum(1 for tf in coverage if tf > 0) / len(coverage) * 100
        print(f"Read {rid}: {len(coverage)} 13-mers, {coverage_pct:.1f}% covered, avg TF {avg_tf:.1f}, max TF {max_tf}")

23-mer Integration

The 23-mer mode provides efficient sparse indexing for longer k-mers commonly used in genomic analysis.

Performance Characteristics

Query Performance:

Single queries: 1.0M queries/second
Batch queries: 2.4M queries/second (2.4x speedup)
Directional queries: Available for forward + reverse complement analysis
Sparse indexing: Only stores k-mers present in input data

Sequence Analysis Performance:

Coverage analysis: 16,900 sequences/second
Position analysis: 1.3M k-mer positions/second
Memory efficiency: Constant memory usage during operations
Real data coverage: 100% (all k-mers found in genomic sequences)

23-mer Workflow

1. Build 23-mer Index:

# Using the fast built-in k-mer counter (recommended)
FASTQ1=./tests/raw_reads.101bp.IS350bp25_1.fastq
FASTQ2=./tests/raw_reads.101bp.IS350bp25_2.fastq
OUTPUT_PREFIX=./temp/reads.23

python3 scripts/compute_aindex.py -i $FASTQ1,$FASTQ2 -t fastq -o $OUTPUT_PREFIX --lu 2 -P 30 --use_kmer_counter

2. Python API Usage:

from aindex.core.aindex import AIndex

# Load 23-mer index
index = AIndex.load_from_prefix("temp/reads.23")
index.load_reads("temp/reads.reads")

# Performance demonstration
import time

# Batch query performance
kmers = ["ATCGATCGATCGATCGATCGATC", "AAAAAAAAAAAAAAAAAAAAAA"] * 1000  # 2K queries
start = time.time()
tf_values = index.get_tf_values(kmers)  # Auto-detects 23-mer mode
batch_time = time.time() - start
print(f"23-mer batch {len(kmers)} queries: {batch_time:.3f}s ({len(kmers)/batch_time:.0f} queries/sec)")

# Sequence coverage analysis
read = index.get_read_by_rid(0)
sequence = read.split('~')[0][:100] if '~' in read else read[:100]

start = time.time()
coverage = index.get_sequence_coverage(sequence, cutoff=0, k=23)
coverage_time = time.time() - start

print(f"23-mer coverage analysis: {len(coverage)} positions in {coverage_time*1000:.1f}ms")
print(f"Coverage: {sum(1 for tf in coverage if tf > 0)/len(coverage)*100:.1f}% positions covered")
print(f"Average TF: {sum(coverage)/len(coverage):.1f}")

Performance Comparison

Throughput Comparison (Operations per Second)

Operation Type	13-mers	23-mers	Winner
Single TF queries	491K/sec	1.1M/sec	23-mer (+124%)
Batch TF queries	2.0M/sec	2.3M/sec	23-mer (+15%)
Sequence coverage	24.5K/sec	17.5K/sec	13-mer (+40%)
Position analysis	2.2M/sec	1.4M/sec	13-mer (+57%)

Use Case Recommendations

Choose 13-mers when:

Analyzing complete k-mer space (population genetics, mutation analysis)
Maximum query performance needed
Working with shorter sequences or fragments
Need comprehensive coverage statistics

Choose 23-mers when:

Standard genomic analysis (assembly, alignment, variant calling)
Working with longer reads (>100bp)
Memory efficiency is critical
Integration with existing 23-mer workflows

Memory Usage

13-mer index: ~277MB (256MB frequencies + 21MB hash)
23-mer index: Variable, depends on data complexity
Both modes: Memory-mapped files for efficient access
Batch operations: Minimal additional memory overhead

File Formats

13-mer Files

.tf.bin: Binary frequency array (uint64_t × 67M elements = 512MB)
.pf: Perfect hash function for k-mer → index mapping
.kmers.bin: Binary k-mer encoding (optional validation)

23-mer Files

.tf.bin: Binary frequency array (variable size)
.pf: Perfect hash function
.kmers.bin: Binary k-mer storage
.aindex.indices.bin & .aindex.index.bin: Position indices (optional)

Project details

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

1.4.4

Jun 18, 2025

This version

1.4.2

Jun 17, 2025

1.3.25

Jun 14, 2025

1.3.24

Jun 14, 2025

1.3.23

Jun 14, 2025

1.3.22

Jun 14, 2025

1.3.21

Jun 14, 2025

1.3.20

Jun 14, 2025

1.3.18

Jun 14, 2025

1.3.17

Jun 14, 2025

1.3.15

Jun 14, 2025

1.3.1

Jun 14, 2025

1.3.0

Jun 14, 2025

1.1.3

Sep 23, 2024

1.1.2

Jul 25, 2024

1.1.1

Jul 24, 2024

1.1.0

Jul 24, 2024

1.0.5

Jul 24, 2024

1.0.4

Jul 24, 2024

1.0.3

Jul 24, 2024

1.0.2

Jul 24, 2024

1.0.1

Jul 24, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

aindex2-1.4.2.tar.gz (93.2 kB view details)

Uploaded Jun 17, 2025 Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

aindex2-1.4.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB view details)

Uploaded Jun 17, 2025 CPython 3.12manylinux: glibc 2.17+ x86-64

aindex2-1.4.2-cp312-cp312-macosx_11_0_arm64.whl (685.9 kB view details)

Uploaded Jun 17, 2025 CPython 3.12macOS 11.0+ ARM64

aindex2-1.4.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB view details)

Uploaded Jun 17, 2025 CPython 3.11manylinux: glibc 2.17+ x86-64

aindex2-1.4.2-cp311-cp311-macosx_11_0_arm64.whl (687.4 kB view details)

Uploaded Jun 17, 2025 CPython 3.11macOS 11.0+ ARM64

aindex2-1.4.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB view details)

Uploaded Jun 17, 2025 CPython 3.10manylinux: glibc 2.17+ x86-64

aindex2-1.4.2-cp310-cp310-macosx_11_0_arm64.whl (685.4 kB view details)

Uploaded Jun 17, 2025 CPython 3.10macOS 11.0+ ARM64

aindex2-1.4.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB view details)

Uploaded Jun 17, 2025 CPython 3.9manylinux: glibc 2.17+ x86-64

aindex2-1.4.2-cp39-cp39-macosx_11_0_arm64.whl (685.6 kB view details)

Uploaded Jun 17, 2025 CPython 3.9macOS 11.0+ ARM64

aindex2-1.4.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB view details)

Uploaded Jun 17, 2025 CPython 3.8manylinux: glibc 2.17+ x86-64

aindex2-1.4.2-cp38-cp38-macosx_11_0_arm64.whl (698.8 kB view details)

Uploaded Jun 17, 2025 CPython 3.8macOS 11.0+ ARM64

File details

Details for the file aindex2-1.4.2.tar.gz.

File metadata

Download URL: aindex2-1.4.2.tar.gz
Upload date: Jun 17, 2025
Size: 93.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.13

File hashes

Hashes for aindex2-1.4.2.tar.gz
Algorithm	Hash digest
SHA256	`577c75fb749e919543faf373b3fa95c601f08b778f65d058e10e4ebaff8aac00`
MD5	`525350536185eb69f4e2df7de189775b`
BLAKE2b-256	`73b2e45f82dad92211a6083fdf091c956b62c93e33b74868e94f60b4f90eb4d7`

See more details on using hashes here.

File details

Details for the file aindex2-1.4.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: aindex2-1.4.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: Jun 17, 2025
Size: 1.0 MB
Tags: CPython 3.12, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.13

File hashes

Hashes for aindex2-1.4.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`71ad8441604fb6c7f9670b94420813d82c9d883a7c24c777fffd6d76172b2ffc`
MD5	`1135d79e60a05301237c248ce0aea8a5`
BLAKE2b-256	`7904284909f71d153179d70a23ca978871a2ca495f164d9d39a1ee4b58b83d21`

See more details on using hashes here.

File details

Details for the file aindex2-1.4.2-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

Download URL: aindex2-1.4.2-cp312-cp312-macosx_11_0_arm64.whl
Upload date: Jun 17, 2025
Size: 685.9 kB
Tags: CPython 3.12, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.13

File hashes

Hashes for aindex2-1.4.2-cp312-cp312-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`ee11d8ca23d03630cc634dd537d6406667f135cfaa2d3500cd5efa400e79eb58`
MD5	`ca1bf2c9c819c11081f9254fbf90f879`
BLAKE2b-256	`1397c0eddddb84ec1dffa5e74eff0b89903f3671ea502ae96cdf43c3da71c302`

See more details on using hashes here.

File details

Details for the file aindex2-1.4.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: aindex2-1.4.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: Jun 17, 2025
Size: 1.0 MB
Tags: CPython 3.11, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.13

File hashes

Hashes for aindex2-1.4.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`2da6c2a03585c6982d3007de43daf19a967a9c703a0da732ea79385c174f9fad`
MD5	`ccc2de610132c65aad21a56c3953c448`
BLAKE2b-256	`bc06260d5a53285d6587f9cd6745ac479d52d0452f3e582d320c227caf3a0e10`

See more details on using hashes here.

File details

Details for the file aindex2-1.4.2-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

Download URL: aindex2-1.4.2-cp311-cp311-macosx_11_0_arm64.whl
Upload date: Jun 17, 2025
Size: 687.4 kB
Tags: CPython 3.11, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.13

File hashes

Hashes for aindex2-1.4.2-cp311-cp311-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`4eccc3980943ce8224cdc96360fc7e87f997d841ab2a2e6bcf791bc95c03799b`
MD5	`8a40197d05516eca14d1d8df936f1e96`
BLAKE2b-256	`911705dce504d6388943b0b29f1e55aa999a4b51fe7e2d5e0a0351c57ed1ee11`

See more details on using hashes here.

File details

Details for the file aindex2-1.4.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: aindex2-1.4.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: Jun 17, 2025
Size: 1.0 MB
Tags: CPython 3.10, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.13

File hashes

Hashes for aindex2-1.4.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`eff7288259f3899aee3b9bed27f82540fb6e3dac61a073f00c4b9a520c0f0aeb`
MD5	`1edfd6979357318939d0a13ffeac025d`
BLAKE2b-256	`78a53f0fcaf2b044f8514d66f8f55a6ce59abd3315c581e127bdc97e5d2c80b9`

See more details on using hashes here.

File details

Details for the file aindex2-1.4.2-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

Download URL: aindex2-1.4.2-cp310-cp310-macosx_11_0_arm64.whl
Upload date: Jun 17, 2025
Size: 685.4 kB
Tags: CPython 3.10, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.13

File hashes

Hashes for aindex2-1.4.2-cp310-cp310-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`2defcf4b324cd089f81f21ae91aff9ce7ff2cf4e9bdfd2a41bb811b2042c8991`
MD5	`61943025b94fba441c29cf226797d28f`
BLAKE2b-256	`185581d8cca489c57443b04293d3ea649ee521752b1ff7ea2e60ab9c8ba351ec`

See more details on using hashes here.

File details

Details for the file aindex2-1.4.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: aindex2-1.4.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: Jun 17, 2025
Size: 1.0 MB
Tags: CPython 3.9, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.13

File hashes

Hashes for aindex2-1.4.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`b6b71807cf1c2dd1612721bd3ff56c3e5c606458efb13e27a5e4ee07ed2bbb57`
MD5	`cf9c643038dc6f3e3a2e96c8c6e5db84`
BLAKE2b-256	`d0228cc72ec055eec723e5184bf7b1683e720aa0a2519852810edc09133cfeff`

See more details on using hashes here.

File details

Details for the file aindex2-1.4.2-cp39-cp39-macosx_11_0_arm64.whl.

File metadata

Download URL: aindex2-1.4.2-cp39-cp39-macosx_11_0_arm64.whl
Upload date: Jun 17, 2025
Size: 685.6 kB
Tags: CPython 3.9, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.13

File hashes

Hashes for aindex2-1.4.2-cp39-cp39-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`98c46de614a54cda38b996678d458b6ea525666ada576963dfd708ae7ae7fc18`
MD5	`6bb548eb49ada2b0f50673ced809e4b1`
BLAKE2b-256	`00104831ba621d131fa82ad8ee03cc5bdf6e588bd0713447516c32732ce317d6`

See more details on using hashes here.

File details

Details for the file aindex2-1.4.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: aindex2-1.4.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: Jun 17, 2025
Size: 1.0 MB
Tags: CPython 3.8, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.13

File hashes

Hashes for aindex2-1.4.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`5684c5e9c86357cd389f2638fe71f9a951ec4222e42148a79935099e27cba9e5`
MD5	`68eada44efe4431e59d810fd962a8439`
BLAKE2b-256	`333131816324d4f96ae8e8a5b36a3cad741e35b0e63fe03fd112796900467db0`

See more details on using hashes here.

File details

Details for the file aindex2-1.4.2-cp38-cp38-macosx_11_0_arm64.whl.

File metadata

Download URL: aindex2-1.4.2-cp38-cp38-macosx_11_0_arm64.whl
Upload date: Jun 17, 2025
Size: 698.8 kB
Tags: CPython 3.8, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.13

File hashes

Hashes for aindex2-1.4.2-cp38-cp38-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`45fe859aa88400af35a778510fd6d19a89a8ee61f3cef5a574c94dd708c23090`
MD5	`77a93b7c0fe512288cbcdf1e65582303`
BLAKE2b-256	`699d75ecaed3026dce2809225e0745b54425a5536d307b1b2621d354b9c3129e`

See more details on using hashes here.

aindex2 1.4.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

aindex: perfect hash based index for genomic data

Features

Installation

Detailed Installation Instructions

Windows-Specific Notes

Google Colab Installation

macOS Compilation (ARM64/Apple Silicon Support)

Usage

Command Line Interface (CLI)

Available Commands

Examples

K-mer Counting Pipelines

Quick Start

Command Line Options

Pipeline Outputs

Usage from Python

Modern API

K-mer Frequency Queries

Advanced 13-mer Operations

Sequence Coverage Analysis

Performance Benchmarks

Working with Reads

Complete Example

Advanced Features

13-mer Integration

Performance Characteristics

13-mer Workflow

13-mer Statistics and Analysis

23-mer Integration

Performance Characteristics

23-mer Workflow

Performance Comparison

Throughput Comparison (Operations per Second)

Use Case Recommendations

Memory Usage

File Formats

13-mer Files

23-mer Files

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distributions

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes

File details