Natural Language compressor for LLMs (Compressed Language Model).
CLM
Semantic Token Encoding for LLMs
Compress transcripts, structured data, and system prompts — 60–95% fewer tokens, no model retraining.
CLM is a patent-pending semantic compression library. It encodes verbose content into compact structured token sequences that LLMs interpret with equal or better accuracy, at a fraction of the token cost.
Three targets, one encoder:
| Encoder | Input | Typical Compression |
|---|---|---|
| Thread | Support calls, chat transcripts, email threads | 62–80% |
| Structured Data | Product catalogs, knowledge bases, business rules | 40–85% |
| System Prompt | Task instructions, role definitions, agent configs | 65–90% |
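To make the ranges in the table concrete, the arithmetic can be sketched as below. The helper name `tokens_after` and the 10,000-token input are illustrative assumptions and not part of CLM; the two percentages are the endpoints of the Thread range from the table.

```python
def tokens_after(tokens_before: int, reduction: float) -> int:
    """Return the token count left after removing `reduction` (0.0 to 1.0) of the tokens."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0.0 and 1.0")
    return round(tokens_before * (1.0 - reduction))


# Hypothetical 10,000-token transcript at the Thread range endpoints (62% and 80%).
low = tokens_after(10_000, 0.62)
high = tokens_after(10_000, 0.80)
print(low, high)
```

Actual savings depend on the input and on the tokenizer of the target model, so the ranges are best treated as typical values, not guarantees.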
Installation
pip install clm-core
Install the spaCy model for your language:
python -m spacy download en_core_web_sm # English
python -m spacy download pt_core_news_sm # Portuguese
python -m spacy download es_core_news_sm # Spanish
python -m spacy download fr_core_news_sm # French
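Before constructing an encoder, it can help to check that the matching spaCy model is installed. The sketch below uses only the standard library; the model names come from the commands above, while the lowercase language codes used as keys and the helper names `model_for` and `is_installed` are assumptions made for illustration.

```python
import importlib.util

# Model names are taken from the install commands above; the lowercase
# language codes used as keys are an assumption for this sketch.
SPACY_MODELS = {
    "en": "en_core_web_sm",
    "pt": "pt_core_news_sm",
    "es": "es_core_news_sm",
    "fr": "fr_core_news_sm",
}


def model_for(lang: str) -> str:
    """Return the spaCy model name listed for a language code."""
    try:
        return SPACY_MODELS[lang.lower()]
    except KeyError:
        raise ValueError(f"no spaCy model listed for language {lang!r}") from None


def is_installed(package: str) -> bool:
    """True if the package (spaCy models install as packages) is importable here."""
    return importlib.util.find_spec(package) is not None


for lang in SPACY_MODELS:
    name = model_for(lang)
    hint = "installed" if is_installed(name) else f"missing: python -m spacy download {name}"
    print(f"{lang}: {name} ({hint})")
```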
Usage
All three encoders share the same interface. CLM auto-detects the input type.
from clm_core import CLMConfig, CLMEncoder
cfg = CLMConfig(lang="en")
encoder = CLMEncoder(cfg=cfg)
Thread Encoder — Transcripts
# `transcript` is the conversation to encode, supplied by the caller
result = encoder.encode(input_=transcript, metadata={"channel": "voice"})
print(result.compressed)
[INTERACTION:SUPPORT:CHANNEL=VOICE] [DURATION=6m] [LANG=EN]
[DOMAIN:BILLING] [SERVICE:SUBSCRIPTION]
[CUSTOMER_INTENT:REPORT_DUPLICATE_CHARGE] [CONTEXT:EMAIL_PROVIDED]
[AGENT_ACTIONS:ACCOUNT_VERIFIED→DIAGNOSTIC_PERFORMED→REFUND_INITIATED]
[SYSTEM_ACTIONS:PAYMENT_RETRY_DETECTED]
[RESOLUTION:REFUND_INITIATED] [STATE:PENDING_CUSTOMER]
[COMMITMENT:REFUND_3-5_BUSINESS_DAYS] [ARTIFACT:REFUND_REF=RFD-908712]
[SENTIMENT:NEUTRAL→GRATEFUL]
Parse into a structured dict for downstream use:
data = result.to_dict()
# {"channel": "VOICE", "domain": "BILLING", "customerIntent": "REPORT_DUPLICATE_CHARGE",
# "state": "PENDING_CUSTOMER", "agentActions": [...], "commitments": [...], ...}
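When only the compressed string is available (for example, in a service that does not have `clm-core` installed), the bracketed tokens can be split with a few lines of standard-library code. This is a hand-rolled sketch based on the sample output above, not a description of how `result.to_dict()` works; the function name `parse_tokens` and the splitting rules are assumptions.

```python
import re

TOKEN_RE = re.compile(r"\[([^\[\]]+)\]")


def parse_tokens(compressed: str) -> dict:
    """Split bracketed tokens into {KEY: value}; arrow-separated chains become lists."""
    parsed = {}
    for body in TOKEN_RE.findall(compressed):
        # Split on whichever of ':' or '=' appears first in the token body.
        sep = re.search(r"[:=]", body)
        if sep is None:
            parsed[body] = True  # bare flag token with no value
            continue
        key, value = body[:sep.start()], body[sep.end():]
        parsed[key] = value.split("→") if "→" in value else value
    return parsed


sample = (
    "[DOMAIN:BILLING] [DURATION=6m] "
    "[AGENT_ACTIONS:ACCOUNT_VERIFIED→DIAGNOSTIC_PERFORMED→REFUND_INITIATED] "
    "[ARTIFACT:REFUND_REF=RFD-908712]"
)
tokens = parse_tokens(sample)
print(tokens["AGENT_ACTIONS"])
```

A repeated key overwrites the earlier value in this sketch, and nested values such as `REFUND_REF=RFD-908712` are left as plain strings.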
Structured Data Encoder
catalog = [{"article_id": "KB-001", "title": "Reset Password", "content": "...", "tags": ["security"]}]
result = encoder.encode(catalog)
print(result.compressed)
# {article_id,title,content,tags}[KB-001,Reset Password,To reset your password...,security]
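The output above states the field names once and then lists the values, which is where most of the savings on repetitive records come from. A minimal sketch of that layout follows. It is not CLM's implementation: the function name `encode_records`, the `+` separator for list values, the one-bracket-group-per-record layout, and the truncation rule are all assumptions.

```python
def encode_records(records: list[dict], max_len: int = 40) -> str:
    """Emit the field names once, then one bracketed row of values per record."""
    if not records:
        return ""
    fields = list(records[0])  # field order is taken from the first record
    rows = []
    for record in records:
        cells = []
        for field in fields:
            value = record.get(field, "")
            if isinstance(value, (list, tuple)):
                text = "+".join(str(v) for v in value)  # separator is an assumption
            else:
                text = str(value)
            if len(text) > max_len:
                text = text[:max_len].rstrip() + "..."
            cells.append(text)
        rows.append("[" + ",".join(cells) + "]")
    return "{" + ",".join(fields) + "}" + "".join(rows)


catalog = [{"article_id": "KB-001", "title": "Reset Password",
            "content": "...", "tags": ["security"]}]
print(encode_records(catalog))
```

Values that contain commas or brackets are not escaped here, so a real encoder would need a quoting rule.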
System Prompt Encoder
# `system_prompt` is the verbose prompt text to encode, supplied by the caller
result = encoder.encode(system_prompt)
print(result.compressed)
# [REQ:ANALYZE] [TARGET:TRANSCRIPT:DOMAIN=QA]
# [EXTRACT:COMPLIANCE,DISCLOSURES,SOFT_SKILLS,SENTIMENT]
# [OUT_JSON:{summary,qa_scores,violations,recommendations}]
Performance
Based on tests across a dataset of 5,000+ samples:
Thread Encoder
| Metric | Value |
|---|---|
| Token reduction | 72–80% |
| Latency improvement | Up to 56% |
| Semantic preservation | Validated via Shannon entropy |
| Languages | EN, PT, ES, FR |
| Schema version | v2.0 |
| Language detection | detect_lang (default: on) |
| Context values | include_ctx_values — emit raw NER values alongside context tokens |
| Duration estimation | estimate_thread_duration — infer duration from content |
| Built-in summary | include_summary + optional custom_summary_template (Jinja2) |
| Custom redaction | redaction_pattern — regex for PII placeholder detection |
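For readers unfamiliar with the measure named in the table, Shannon entropy over a token sequence is H = -Σ p·log2(p), where p is the relative frequency of each distinct token. The sketch below shows only the measure itself; how CLM applies it during validation is not described here, and the function name and sample strings are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(tokens: list[str]) -> float:
    """Entropy in bits per token: H = -sum(p * log2(p)) over token frequencies."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Whitespace splitting stands in for a real tokenizer in this sketch.
verbose = "the customer said that the customer was charged twice".split()
compact = "CUSTOMER_INTENT REPORT_DUPLICATE_CHARGE".split()
print(shannon_entropy(verbose), shannon_entropy(compact))
```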
Structured Data Encoder
| Metric | Value |
|---|---|
| Token reduction | 40–85% |
| Supports | Single objects, arrays, nested structures |
| Field filtering | Importance threshold + required/excluded |
| Per-field truncation | Configurable |
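The filtering and truncation options in the table can be pictured with a small standalone function. This is a sketch of the general idea only: the name `filter_fields`, its parameters, the importance scores, and the rule that exclusion beats `required` are assumptions, not CLM's configuration API.

```python
def filter_fields(record, required=(), excluded=(), importance=None,
                  threshold=0.5, max_len=None):
    """Keep required fields plus fields scoring at or above the threshold."""
    importance = importance or {}
    max_len = max_len or {}  # per-field character limits
    kept = {}
    for field, value in record.items():
        if field in excluded:
            continue  # exclusion wins, even over `required`
        if field not in required and importance.get(field, 0.0) < threshold:
            continue  # below the importance threshold
        limit = max_len.get(field)
        if limit is not None and isinstance(value, str) and len(value) > limit:
            value = value[:limit].rstrip() + "..."
        kept[field] = value
    return kept


record = {
    "article_id": "KB-001",
    "title": "Reset Password",
    "content": "To reset your password, open Settings and choose Security.",
    "tags": "security",
    "internal_notes": "draft",
}
kept = filter_fields(
    record,
    required=("article_id",),
    excluded=("internal_notes",),
    importance={"title": 0.9, "content": 0.7, "tags": 0.2, "internal_notes": 0.95},
    max_len={"content": 20},
)
print(kept)
```

Here `tags` is dropped for scoring below the threshold, `internal_notes` is dropped despite its high score because it is excluded, and `content` is cut to 20 characters.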
System Prompt Encoder
| Metric | Value |
|---|---|
| Token reduction | 65–90% |
| Output | Hierarchical CLM token vocabulary |
| Type inference | Optional (infer_types=True) |
| Attribute preservation | Optional (add_attrs=True) |
Documentation
Official documentation: https://yanickjair.github.io/cllm
| Topic | Link |
|---|---|
| Getting started | docs/index.md |
| Thread Encoder | docs/thread_encoder/index.md |
| Transcript encoding | docs/thread_encoder/transcript_encoder.md |
| Free-Form Encoder | docs/thread_encoder/free_form_encoder.md |
| Structured Data Encoder | docs/sd_encoder.md |
| System Prompt Encoder | docs/sys_prompt/index.md |
| CLM Configuration | docs/advanced/clm_configuration.md |
| Token hierarchy | docs/advanced/clm_tokenization.md |
| Output reference | docs/advanced/clm_output.md |
License
Dual-licensed:
- AGPL-3.0 — free for open source use (LICENSE-AGPL)
- Commercial — for proprietary products and SaaS (contact)
Issues · Discussions · Contact
Download files
File details
Details for the file clm_core-1.0.9.tar.gz.
File metadata
- Download URL: clm_core-1.0.9.tar.gz
- Upload date:
- Size: 242.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2b707a5ed747b6dd3b4bca23e360ffa5cb5cd0fec78e744996bdbcf1ef7302e7 |
| MD5 | aa8d019cdcad3480c22669cbb74cf0cd |
| BLAKE2b-256 | 435699af127b813e4453553c278f7d6a1c6363054ce2706b8809323cb6c17e7a |
File details
Details for the file clm_core-1.0.9-py3-none-any.whl.
File metadata
- Download URL: clm_core-1.0.9-py3-none-any.whl
- Upload date:
- Size: 233.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 504c298af9997cce4e046ad7fa4e7683dfdf83a6c28a94b5427d2abc67728008 |
| MD5 | 39c3196302da38ea7bbb31d4b4e43d44 |
| BLAKE2b-256 | e7f9af3de6429f72cc4fea6d3ae6a49629cde0d8beff5e0cd21f29615b6c0fce |
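A downloaded file can be checked against the published SHA256 digest with the standard library. In the sketch below, the helper names and the local file path are assumptions; the expected digest is the SHA256 value listed for clm_core-1.0.9.tar.gz.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published value, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()


# Published SHA256 for clm_core-1.0.9.tar.gz.
EXPECTED_SDIST = "2b707a5ed747b6dd3b4bca23e360ffa5cb5cd0fec78e744996bdbcf1ef7302e7"

sdist = Path("clm_core-1.0.9.tar.gz")  # assumed local download location
if sdist.exists():
    print("sdist OK" if verify(sdist, EXPECTED_SDIST) else "sdist MISMATCH")
```

Alternatively, pip's hash-checking mode can enforce the same digest at install time when it is pinned in a requirements file.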