# accuralai-canonicalize

Canonicalization helpers and plugins for the AccuralAI LLM pipeline.

`accuralai-canonicalize` provides advanced canonicalization utilities and plugins for the AccuralAI pipeline. The library includes token optimization, semantic caching, and comprehensive metrics tracking to maximize LLM efficiency and reduce costs.
## Features

### 🚀 Advanced Token Optimization

- **Whitespace Preservation**: Original whitespace is preserved for better response quality (compression disabled by default)
- **Phrase Deduplication**: Removes repeated phrases to reduce token count
- **Structure Optimization**: Normalizes punctuation and formatting for efficiency
- **Context-Aware Processing**: Optimizes conversation history and system prompts

### 🧠 Semantic Caching

- **Semantic Cache Keys**: Groups similar requests for better cache hit rates
- **Key Phrase Extraction**: Identifies important concepts for intelligent grouping
- **Hierarchical Caching**: Multiple cache-key strategies for different use cases

### 📊 Comprehensive Metrics

- **Token Savings Tracking**: Monitor compression ratios and optimization effectiveness
- **Performance Analytics**: Track which optimizations provide the most benefit
- **Real-time Statistics**: Live metrics during canonicalization

### 🔧 Flexible Configuration

- **Granular Controls**: Enable/disable specific optimization features
- **Validation Options**: Configurable length limits and quality checks
- **Backward Compatibility**: Drop-in replacement for existing implementations
## Installation

Install alongside `accuralai-core` to enable the canonicalizer:

```bash
pip install accuralai-core accuralai-canonicalize
```
## Usage

### Basic Usage

The canonicalizer automatically integrates with the AccuralAI pipeline:

```bash
accuralai-core generate --prompt "Hello there!"
```
### Advanced Configuration

Configure the canonicalizer in your `config.toml`:

```toml
[canonicalizer]
plugin = "advanced"  # Use the advanced canonicalizer

[canonicalizer.options]
# Basic options
normalize_tags = true
auto_cache_key = true
cache_key_metadata_fields = ["topic", "domain"]

# Advanced token optimization
enable_deduplication = true
deduplication_min_length = 10
enable_structure_optimization = true
enable_whitespace_compression = false  # Disabled by default; preserving whitespace improves response quality

# Semantic caching
use_semantic_cache_keys = true
semantic_key_max_phrases = 5

# Context-aware processing
optimize_conversation_history = true
max_history_entries = 50
compress_system_prompt = true

# Metrics and telemetry
track_metrics = true
log_optimization_stats = false

# Validation
max_prompt_length = 10000
min_prompt_length = 1
```
### Programmatic Usage

```python
from accuralai_canonicalize.canonicalizer import AdvancedCanonicalizer, CanonicalizerOptions
from accuralai_core.contracts.models import GenerateRequest

# Create canonicalizer with custom options
options = CanonicalizerOptions(
    enable_deduplication=True,
    enable_structure_optimization=True,
    use_semantic_cache_keys=True,
    track_metrics=True,
)

canonicalizer = AdvancedCanonicalizer(options=options)

# Process a request (inside an async context)
request = GenerateRequest(
    prompt=" Hello world!!! Hello world!!! ",
    system_prompt="You are a helpful assistant.",
    tags=["test"],
)
canonical = await canonicalizer.canonicalize(request)

# Access optimization metrics
metrics = canonicalizer.metrics
print(f"Tokens saved: {metrics.tokens_saved}")
print(f"Compression ratio: {metrics.compression_ratio:.2%}")
```
## Configuration Options

### Basic Options

- `prompt_template`: Template string for prompt formatting
- `normalize_tags`: Normalize and deduplicate tags
- `default_tags`: Default tags to add to all requests
- `metadata_defaults`: Default metadata values
- `auto_cache_key`: Automatically generate cache keys
- `cache_key_metadata_fields`: Metadata fields to include in cache keys

### Advanced Token Optimization

- `enable_deduplication`: Remove repeated phrases (default: `true`)
- `deduplication_min_length`: Minimum phrase length for deduplication (default: `10`)
- `enable_structure_optimization`: Optimize punctuation and formatting (default: `true`)
- `enable_whitespace_compression`: Compress whitespace (default: `false`; preserving whitespace improves response quality)

### Semantic Caching

- `use_semantic_cache_keys`: Use semantic similarity for cache keys (default: `false`)
- `semantic_key_max_phrases`: Maximum phrases to extract for semantic keys (default: `5`)

### Context-Aware Processing

- `optimize_conversation_history`: Optimize conversation history (default: `true`)
- `max_history_entries`: Maximum history entries to keep (default: `50`)
- `compress_system_prompt`: Apply optimizations to the system prompt (default: `true`; whitespace is still preserved)

### Metrics and Telemetry

- `track_metrics`: Track optimization metrics (default: `true`)
- `log_optimization_stats`: Log optimization statistics (default: `false`)

### Validation

- `max_prompt_length`: Maximum prompt length (default: none)
- `min_prompt_length`: Minimum prompt length (default: `1`)
## Optimization Examples

### Whitespace Preservation

```text
Input:  " Hello world with multiple spaces "
Output: " Hello world with multiple spaces "   # Whitespace preserved by default
```
### Deduplication

```text
Input:  "Hello world Hello world Hello world"
Output: "Hello world"
```
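One way a pass like this can work is to collapse immediately repeated phrases with a back-referencing regex, honoring a minimum phrase length like `deduplication_min_length`. This is an illustrative sketch, not the library's actual implementation:

```python
import re


def deduplicate_phrases(text: str, min_length: int = 10) -> str:
    """Collapse immediately repeated phrases of at least min_length characters.

    Illustrative sketch only; mirrors the deduplication_min_length option.
    """
    # Lazily match a phrase of >= min_length chars, then one or more
    # whitespace-separated exact repeats of it.
    pattern = re.compile(r"(?P<phrase>.{%d,}?)(?:\s+(?P=phrase))+" % min_length)
    prev = None
    while prev != text:  # repeat until no more collapses apply
        prev = text
        text = pattern.sub(lambda m: m.group("phrase"), text)
    return text
```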
### Structure Optimization

```text
Input:  "Hello!!! How are you??? Fine..."
Output: "Hello! How are you? Fine..."
```
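The punctuation normalization shown above can be sketched with a couple of regex substitutions that collapse runs of `!` and `?` while leaving ellipses alone. Illustrative only; the library's actual rules may differ:

```python
import re


def optimize_structure(text: str) -> str:
    """Collapse repeated exclamation and question marks; leave '...' intact."""
    text = re.sub(r"!{2,}", "!", text)   # "Hello!!!" -> "Hello!"
    text = re.sub(r"\?{2,}", "?", text)  # "you???"   -> "you?"
    return text
```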
### Semantic Cache Keys

```text
Input:  "Explain Python programming concepts"
Output: Cache key "sem:a1b2c3d4e5f6g7h8" (groups with similar Python questions)
```
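A semantic key of this shape can be derived by extracting a prompt's salient terms and hashing them, so differently worded but similar requests map to the same key. The sketch below is an assumption about the approach, not the library's key-extraction logic (which the README says is phrase-based and more sophisticated):

```python
import hashlib

# Tiny illustrative stopword list; a real implementation would be richer.
STOPWORDS = {"a", "an", "the", "explain", "please", "what", "is", "are", "how", "to", "me"}


def semantic_cache_key(prompt: str, max_phrases: int = 5) -> str:
    """Hash the prompt's sorted keywords into a 'sem:'-prefixed cache key."""
    words = [w.strip(".,!?").lower() for w in prompt.split()]
    keywords = sorted({w for w in words if w and w not in STOPWORDS})[:max_phrases]
    digest = hashlib.sha256(" ".join(keywords).encode()).hexdigest()[:16]
    return f"sem:{digest}"
```

Because keywords are deduplicated and sorted before hashing, "Explain Python programming concepts" and "Please explain the Python programming concepts!" produce the same key.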
## Metrics and Monitoring

The canonicalizer provides detailed metrics about optimization effectiveness:

```python
metrics = canonicalizer.metrics
print(f"Original tokens: {metrics.original_token_count}")
print(f"Optimized tokens: {metrics.optimized_token_count}")
print(f"Tokens saved: {metrics.tokens_saved}")
print(f"Compression ratio: {metrics.compression_ratio:.2%}")
print(f"Deduplication applied: {metrics.deduplication_applied}")
print(f"Whitespace compression applied: {metrics.whitespace_compression_applied}")
print(f"Structure optimization applied: {metrics.structure_optimization_applied}")
```
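The derived fields relate to the raw counts in the obvious way: `tokens_saved` is the difference between the original and optimized counts, and `compression_ratio` is that difference over the original count. `OptimizationMetrics` below is a minimal hypothetical stand-in to show the arithmetic, not the library's class:

```python
from dataclasses import dataclass


@dataclass
class OptimizationMetrics:
    """Minimal stand-in for the canonicalizer's metrics object (illustrative)."""
    original_token_count: int
    optimized_token_count: int

    @property
    def tokens_saved(self) -> int:
        return self.original_token_count - self.optimized_token_count

    @property
    def compression_ratio(self) -> float:
        if self.original_token_count == 0:
            return 0.0
        return self.tokens_saved / self.original_token_count


m = OptimizationMetrics(original_token_count=120, optimized_token_count=90)
print(f"Tokens saved: {m.tokens_saved}")                # 30
print(f"Compression ratio: {m.compression_ratio:.2%}")  # 25.00%
```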
## Performance Benefits

Typical token savings with the advanced canonicalizer:

- **Whitespace Preservation**: no savings by default; whitespace is kept to improve response quality
- **Deduplication**: 10-30% reduction (when applicable)
- **Structure Optimization**: 3-8% reduction
- **Combined Optimization**: 10-35% total reduction (with whitespace preservation)
## Migration from Standard Canonicalizer

The advanced canonicalizer is backward compatible. To migrate:

1. Update your configuration to use `plugin = "advanced"`.
2. Optionally enable additional features:

   ```toml
   [canonicalizer.options]
   enable_deduplication = true
   use_semantic_cache_keys = true
   track_metrics = true
   ```

3. Monitor metrics to measure optimization effectiveness.
## Contributing

Contributions are welcome! Please see the main AccuralAI repository for contribution guidelines.

## License

Apache-2.0