Natural Language compressor for LLMs (Compressed Language Model).

CLM

Semantic Token Encoding for LLMs

Compress transcripts, structured data, and system prompts — 60–95% fewer tokens, no model retraining.


CLM is a patent-pending semantic compression library. It encodes verbose content into compact structured token sequences that LLMs interpret with equal or better accuracy, at a fraction of the token cost.

Three targets, one encoder:

| Encoder | Input | Typical Compression |
| --- | --- | --- |
| Thread | Support calls, chat transcripts, email threads | 62–80% |
| Structured Data | Product catalogs, knowledge bases, business rules | 40–85% |
| System Prompt | Task instructions, role definitions, agent configs | 65–90% |

Installation

pip install clm-core

Install the spaCy model for your language:

python -m spacy download en_core_web_sm   # English
python -m spacy download pt_core_news_sm  # Portuguese
python -m spacy download es_core_news_sm  # Spanish
python -m spacy download fr_core_news_sm  # French
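
Before constructing an encoder, it can help to confirm that the model for the chosen language is importable. The sketch below uses only the standard library. `SPACY_MODELS` and `model_installed` are illustrative names, not part of clm_core, and the mapping simply mirrors the four download commands above.

```python
import importlib.util

# Language code -> spaCy model package, mirroring the download commands above.
SPACY_MODELS = {
    "en": "en_core_web_sm",
    "pt": "pt_core_news_sm",
    "es": "es_core_news_sm",
    "fr": "fr_core_news_sm",
}

def model_installed(lang: str) -> bool:
    """Return True if the spaCy model package for `lang` can be imported."""
    package = SPACY_MODELS[lang]  # raises KeyError for languages not listed above
    return importlib.util.find_spec(package) is not None

if __name__ == "__main__":
    for lang, package in SPACY_MODELS.items():
        status = "installed" if model_installed(lang) else "missing"
        print(f"{lang}: {package} ({status})")
```

This works because models fetched with `python -m spacy download` are installed as ordinary Python packages under the model's name.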

Usage

All three encoders share the same interface. CLM auto-detects the input type.

from clm_core import CLMConfig, CLMEncoder

cfg = CLMConfig(lang="en")
encoder = CLMEncoder(cfg=cfg)

Thread Encoder — Transcripts

result = encoder.encode(input_=transcript, metadata={"channel": "voice"})
print(result.compressed)
# [INTERACTION:SUPPORT:CHANNEL=VOICE] [DURATION=6m] [LANG=EN]
# [DOMAIN:BILLING] [SERVICE:SUBSCRIPTION]
# [CUSTOMER_INTENT:REPORT_DUPLICATE_CHARGE] [CONTEXT:EMAIL_PROVIDED]
# [AGENT_ACTIONS:ACCOUNT_VERIFIED→DIAGNOSTIC_PERFORMED→REFUND_INITIATED]
# [SYSTEM_ACTIONS:PAYMENT_RETRY_DETECTED]
# [RESOLUTION:REFUND_INITIATED] [STATE:PENDING_CUSTOMER]
# [COMMITMENT:REFUND_3-5_BUSINESS_DAYS] [ARTIFACT:REFUND_REF=RFD-908712]
# [SENTIMENT:NEUTRAL→GRATEFUL]
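
When only the compressed string is at hand (for example, in stored logs), the bracketed tokens can be read back with a small regex. This is a minimal sketch that assumes every token has the form `[NAME:VALUE]` or `[NAME=VALUE]`, as in the output above. `parse_tokens` is a hypothetical helper, not a clm_core function, and it does not reproduce the library's own parsing.

```python
import re

# Matches one bracketed token such as [DOMAIN:BILLING] or [LANG=EN].
TOKEN_RE = re.compile(r"\[([^\[\]]+)\]")

def parse_tokens(compressed: str) -> dict:
    """Split bracketed tokens into a {name: value} dict.

    Values containing the arrow separator become lists of steps.
    """
    parsed = {}
    for body in TOKEN_RE.findall(compressed):
        # The name ends at the first ':' or '=', whichever comes first.
        match = re.match(r"([A-Z_]+)[:=](.*)", body)
        if match is None:
            continue  # skip tokens that are not NAME:VALUE or NAME=VALUE
        name, value = match.groups()
        parsed[name] = value.split("→") if "→" in value else value
    return parsed

sample = (
    "[DOMAIN:BILLING] [LANG=EN] "
    "[AGENT_ACTIONS:ACCOUNT_VERIFIED→DIAGNOSTIC_PERFORMED→REFUND_INITIATED] "
    "[SENTIMENT:NEUTRAL→GRATEFUL]"
)
print(parse_tokens(sample))
```

Nested values such as `SUPPORT:CHANNEL=VOICE` are kept as a single string under the leading name. Splitting them further would require knowing the token hierarchy.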

Parse into a structured dict for downstream use:

data = result.to_dict()
# {"channel": "VOICE", "domain": "BILLING", "customerIntent": "REPORT_DUPLICATE_CHARGE",
#  "state": "PENDING_CUSTOMER", "agentActions": [...], "commitments": [...], ...}

Structured Data Encoder

catalog = [{"article_id": "KB-001", "title": "Reset Password", "content": "...", "tags": ["security"]}]
result = encoder.encode(catalog)
print(result.compressed)
# {article_id,title,content,tags}[KB-001,Reset Password,To reset your password...,security]
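
The output shape above lists the field names once, followed by bracketed values. To see why that saves tokens on repetitive records, here is a simplified illustration of the idea. `encode_rows` is a hypothetical function, not CLM's encoder. Joining list values with `+` and taking field order from the first record are assumptions of this sketch, and it does not escape commas inside values.

```python
def encode_rows(records: list[dict]) -> str:
    """Emit field names once, then one bracketed value group per record."""
    if not records:
        return ""
    fields = list(records[0])  # field order taken from the first record
    header = "{" + ",".join(fields) + "}"
    rows = []
    for record in records:
        values = []
        for field in fields:
            value = record.get(field, "")
            # Join list values with '+' so they do not clash with the ',' separator.
            if isinstance(value, list):
                value = "+".join(str(item) for item in value)
            values.append(str(value))
        rows.append("[" + ",".join(values) + "]")
    return header + "".join(rows)

catalog = [
    {"article_id": "KB-001", "title": "Reset Password", "tags": ["security"]},
    {"article_id": "KB-002", "title": "Update Email", "tags": ["account", "security"]},
]
print(encode_rows(catalog))
# {article_id,title,tags}[KB-001,Reset Password,security][KB-002,Update Email,account+security]
```

Each key appears once in the header instead of once per record, so the saving grows with the number of records.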

System Prompt Encoder

result = encoder.encode(system_prompt)
print(result.compressed)
# [REQ:ANALYZE] [TARGET:TRANSCRIPT:DOMAIN=QA]
# [EXTRACT:COMPLIANCE,DISCLOSURES,SOFT_SKILLS,SENTIMENT]
# [OUT_JSON:{summary,qa_scores,violations,recommendations}]

Performance

Based on tests across a dataset of 5,000+ samples:

Thread Encoder

| Metric | Value |
| --- | --- |
| Token reduction | 72–80% |
| Latency improvement | Up to 56% |
| Semantic preservation | Validated via Shannon Entropy |
| Languages | EN, PT, ES, FR |
| Schema version | v2.0 |
| Language detection | detect_lang (default: on) |
| Context values | include_ctx_values: emit raw NER values alongside context tokens |
| Duration estimation | estimate_thread_duration: infer duration from content |
| Built-in summary | include_summary + optional custom_summary_template (Jinja2) |
| Custom redaction | redaction_pattern: regex for PII placeholder detection |
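
The table names Shannon entropy as the validation measure but does not describe the procedure. As a reminder of the quantity itself, the sketch below computes the entropy, in bits per token, of a token sequence's empirical distribution. How CLM applies it to compare original and compressed text is not stated here, so this shows the formula only.

```python
import math
from collections import Counter

def shannon_entropy(tokens: list[str]) -> float:
    """Entropy in bits per token of the empirical token distribution."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    # H = sum over distinct tokens of p * log2(1 / p), with p = count / total.
    return sum((n / total) * math.log2(total / n) for n in counts.values())

print(shannon_entropy(["a", "a", "b", "b"]))  # 1.0: two equally likely tokens
print(shannon_entropy(["a", "a", "a", "a"]))  # 0.0: no uncertainty
```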

Structured Data Encoder

| Metric | Value |
| --- | --- |
| Token reduction | 40–85% |
| Supports | Single objects, arrays, nested structures |
| Field filtering | Importance threshold + required/excluded |
| Per-field truncation | Configurable |
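
The field filtering and truncation rows can be pictured with a small stand-alone function. This is a sketch of the general idea, not clm_core's implementation. `filter_record`, its parameters, and the importance scores are all hypothetical. In this sketch an excluded field is dropped even if it is also listed as required.

```python
def filter_record(record, importance, threshold=0.5,
                  required=(), excluded=(), max_len=None):
    """Keep required fields plus those scoring at or above the threshold."""
    max_len = max_len or {}
    kept = {}
    for field, value in record.items():
        if field in excluded:
            continue  # exclusion wins over everything else
        if field not in required and importance.get(field, 0.0) < threshold:
            continue  # low-importance field that nobody asked to keep
        limit = max_len.get(field)
        if limit is not None and isinstance(value, str) and len(value) > limit:
            value = value[:limit] + "..."  # per-field truncation
        kept[field] = value
    return kept

article = {
    "article_id": "KB-001",
    "title": "Reset Password",
    "content": "To reset your password, open Settings and choose Security.",
    "internal_notes": "draft",
}
scores = {"article_id": 0.2, "title": 0.9, "content": 0.8, "internal_notes": 0.9}

print(filter_record(
    article, scores,
    required={"article_id"}, excluded={"internal_notes"},
    max_len={"content": 22},
))
```

Here `article_id` survives despite its low score because it is required, `internal_notes` is dropped despite its high score because it is excluded, and `content` is cut to 22 characters.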

System Prompt Encoder

| Metric | Value |
| --- | --- |
| Token reduction | 65–90% |
| Output | Hierarchical CLM token vocabulary |
| Type inference | Optional (infer_types=True) |
| Attribute preservation | Optional (add_attrs=True) |
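
The token-reduction figures in the tables above are percentages of the original token count. To reproduce such a figure on your own data, count tokens before and after compression with the tokenizer of the model you target, then apply the formula below. `token_reduction` is an illustrative helper, not part of clm_core.

```python
def token_reduction(original_tokens: int, compressed_tokens: int) -> float:
    """Percentage of tokens removed by compression."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return 100.0 * (1 - compressed_tokens / original_tokens)

# A 1,000-token transcript compressed to 250 tokens:
print(f"{token_reduction(1000, 250):.1f}%")  # 75.0%
```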

Documentation

Official documentation: https://yanickjair.github.io/cllm

| Topic | Link |
| --- | --- |
| Getting started | docs/index.md |
| Thread Encoder | docs/thread_encoder/index.md |
| Transcript encoding | docs/thread_encoder/transcript_encoder.md |
| Free-Form Encoder | docs/thread_encoder/free_form_encoder.md |
| Structured Data Encoder | docs/sd_encoder.md |
| System Prompt Encoder | docs/sys_prompt/index.md |
| CLM Configuration | docs/advanced/clm_configuration.md |
| Token hierarchy | docs/advanced/clm_tokenization.md |
| Output reference | docs/advanced/clm_output.md |

License

Dual-licensed:

  • AGPL-3.0 — free for open source use (LICENSE-AGPL)
  • Commercial — for proprietary products and SaaS (contact)
