Natural Language compressor for LLMs (Compressed Language Model).

CLM

Semantic Token Encoding for LLMs

Compress transcripts, structured data, and system prompts: 40–90% fewer tokens depending on input type, no model retraining.


CLM is a patent-pending semantic compression library. It encodes verbose content into compact structured token sequences that LLMs interpret with equal or better accuracy, at a fraction of the token cost.

Three targets, one encoder:

Encoder | Input | Typical Compression
Thread | Support calls, chat transcripts, email threads | 62–80%
Structured Data | Product catalogs, knowledge bases, business rules | 40–85%
System Prompt | Task instructions, role definitions, agent configs | 65–90%

Installation

pip install clm-core

Install the spaCy model for your language:

python -m spacy download en_core_web_sm   # English
python -m spacy download pt_core_news_sm  # Portuguese
python -m spacy download es_core_news_sm  # Spanish
python -m spacy download fr_core_news_sm  # French
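
spaCy models are installed as regular Python packages, so a missing model can be detected before an encoder is built. The following is a minimal sketch using only the standard library; `SPACY_MODELS` and `missing_model` are illustrative names and not part of clm-core.

```python
from importlib.util import find_spec
from typing import Optional

# Language code -> spaCy model, mirroring the download commands above.
SPACY_MODELS = {
    "en": "en_core_web_sm",
    "pt": "pt_core_news_sm",
    "es": "es_core_news_sm",
    "fr": "fr_core_news_sm",
}


def missing_model(lang: str) -> Optional[str]:
    """Return the model name if it is not installed, otherwise None."""
    model = SPACY_MODELS[lang]  # raises KeyError for unlisted languages
    # A downloaded spaCy model is an importable package, so find_spec finds it.
    return model if find_spec(model) is None else None


if __name__ == "__main__":
    for lang in SPACY_MODELS:
        model = missing_model(lang)
        if model is not None:
            print(f"{lang}: run `python -m spacy download {model}`")
```

Running the script prints a download hint for each language whose model is absent and nothing when all four are installed.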

Usage

All three encoders share the same interface. CLM auto-detects the input type.

from clm_core import CLMConfig, CLMEncoder

cfg = CLMConfig(lang="en")
encoder = CLMEncoder(cfg=cfg)

Thread Encoder — Transcripts

result = encoder.encode(input_=transcript, metadata={"channel": "voice"})
print(result.compressed)
# [INTERACTION:SUPPORT:CHANNEL=VOICE] [DURATION=6m] [LANG=EN]
# [DOMAIN:BILLING] [SERVICE:SUBSCRIPTION]
# [CUSTOMER_INTENT:REPORT_DUPLICATE_CHARGE] [CONTEXT:EMAIL_PROVIDED]
# [AGENT_ACTIONS:ACCOUNT_VERIFIED→DIAGNOSTIC_PERFORMED→REFUND_INITIATED]
# [SYSTEM_ACTIONS:PAYMENT_RETRY_DETECTED]
# [RESOLUTION:REFUND_INITIATED] [STATE:PENDING_CUSTOMER]
# [COMMITMENT:REFUND_3-5_BUSINESS_DAYS] [ARTIFACT:REFUND_REF=RFD-908712]
# [SENTIMENT:NEUTRAL→GRATEFUL]

Parse into a structured dict for downstream use:

data = result.to_dict()
# {"channel": "VOICE", "domain": "BILLING", "customerIntent": "REPORT_DUPLICATE_CHARGE",
#  "state": "PENDING_CUSTOMER", "agentActions": [...], "commitments": [...], ...}
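
`result.to_dict()` already returns structured data. When only the compressed string is available (for example, one read back from a log), the bracket syntax is regular enough to split with the standard library. The sketch below illustrates the token layout and is not clm-core's parser: `split_tokens` and `SAMPLE` are hypothetical names, and the rule that a key ends at the first `:` or `=` is an assumption inferred from the example output.

```python
import re

# One token is everything between a pair of square brackets.
TOKEN_RE = re.compile(r"\[([^\[\]]+)\]")
# Assumed layout: KEY, then an optional ':' or '=', then the rest as the value.
BODY_RE = re.compile(r"([^:=]+)[:=]?(.*)", re.S)

SAMPLE = (
    "[INTERACTION:SUPPORT:CHANNEL=VOICE] [DURATION=6m] [LANG=EN]\n"
    "[AGENT_ACTIONS:ACCOUNT_VERIFIED→DIAGNOSTIC_PERFORMED→REFUND_INITIATED]\n"
    "[SENTIMENT:NEUTRAL→GRATEFUL]"
)


def split_tokens(compressed: str) -> dict:
    """Map each token key to its raw value, or to a list for arrow sequences."""
    tokens = {}
    for body in TOKEN_RE.findall(compressed):
        match = BODY_RE.match(body)
        if match is None:  # body starts with a separator; skip it
            continue
        key, value = match.group(1), match.group(2)
        # Arrow-separated values read as ordered steps, so keep them as a list.
        tokens[key] = value.split("→") if "→" in value else value
    return tokens


if __name__ == "__main__":
    for key, value in split_tokens(SAMPLE).items():
        print(key, value)
```

Nested values such as `SUPPORT:CHANNEL=VOICE` are left as raw strings here; the documented token hierarchy would be needed to split them further.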

Structured Data Encoder

catalog = [{"article_id": "KB-001", "title": "Reset Password", "content": "...", "tags": ["security"]}]
result = encoder.encode(catalog)
print(result.compressed)
# {article_id,title,content,tags}[KB-001,Reset Password,To reset your password...,security]
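
The output above states the field names once and then lists values only, which is where most of the savings on repetitive records come from. A minimal sketch of that header-plus-rows idea follows. It is not clm-core's encoder: `encode_records` is a hypothetical name, and both the multi-record layout (one bracket group per record) and the `+` used to join list values are assumptions made for illustration.

```python
def encode_records(records: list) -> str:
    """Emit field names once, then one bracketed value row per record."""
    if not records:
        return ""
    fields = list(records[0])  # field order taken from the first record
    header = "{" + ",".join(fields) + "}"
    rows = []
    for record in records:
        cells = []
        for field in fields:
            value = record.get(field, "")
            if isinstance(value, list):
                # Assumption: flatten lists with '+' so they fit in one cell.
                value = "+".join(str(item) for item in value)
            cells.append(str(value))
        rows.append("[" + ",".join(cells) + "]")
    return header + "".join(rows)


if __name__ == "__main__":
    catalog = [
        {"article_id": "KB-001", "title": "Reset Password",
         "content": "...", "tags": ["security"]},
    ]
    print(encode_records(catalog))
```

The sketch does not escape commas or brackets inside values, so it is only safe for data that does not contain them.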

System Prompt Encoder

result = encoder.encode(system_prompt)
print(result.compressed)
# [REQ:ANALYZE] [TARGET:TRANSCRIPT:DOMAIN=QA]
# [EXTRACT:COMPLIANCE,DISCLOSURES,SOFT_SKILLS,SENTIMENT]
# [OUT_JSON:{summary,qa_scores,violations,recommendations}]

Performance

Measured on a test dataset of 5,000+ samples:

Thread Encoder

Metric | Value
Token reduction | 72–80%
Latency improvement | Up to 56%
Semantic preservation | Validated via Shannon Entropy
Languages | EN, PT, ES, FR
Schema version | v2.0
Language detection | detect_lang (default: on)
Context values | include_ctx_values: emit raw NER values alongside context tokens
Duration estimation | estimate_thread_duration: infer duration from content
Built-in summary | include_summary + optional custom_summary_template (Jinja2)
Custom redaction | redaction_pattern: regex for PII placeholder detection
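
The table names Shannon entropy as the validation measure without describing the procedure. For reference, the quantity itself is H = -Σ p(x) log2 p(x), taken over the relative frequency of each distinct token. The sketch below computes it in bits per token; `shannon_entropy` is an illustrative helper, and how CLM applies the measure to original and compressed text is not specified here.

```python
import math
from collections import Counter


def shannon_entropy(tokens: list) -> float:
    """Shannon entropy, in bits per token, of the empirical token distribution."""
    total = len(tokens)
    if total == 0:
        return 0.0
    counts = Counter(tokens)
    # H = -sum(p * log2(p)) over the relative frequency p of each distinct token.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    compressed = "[DOMAIN:BILLING] [SERVICE:SUBSCRIPTION] [LANG=EN]"
    print(shannon_entropy(compressed.split()))
```

A sequence that repeats one token has entropy 0, and a sequence of n equally frequent tokens has entropy log2(n).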

Structured Data Encoder

Metric | Value
Token reduction | 40–85%
Supports | Single objects, arrays, nested structures
Field filtering | Importance threshold + required/excluded
Per-field truncation | Configurable
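
Field filtering and truncation can be pictured as one pass over each record. The sketch below is an assumption-heavy illustration rather than clm-core's implementation: `filter_record`, its parameter names, the importance scores, and the choice to let exclusion override `required` are all hypothetical. The real option names are covered in docs/sd_encoder.md.

```python
def filter_record(record, importance, threshold=0.5,
                  required=(), excluded=(), max_len=None):
    """Keep required or sufficiently important fields, truncating long strings."""
    kept = {}
    for field, value in record.items():
        if field in excluded:
            continue  # assumption: exclusion wins over everything else
        if field not in required and importance.get(field, 0.0) < threshold:
            continue  # below the importance threshold and not required
        if max_len is not None and isinstance(value, str) and len(value) > max_len:
            value = value[:max_len] + "..."  # mark the cut explicitly
        kept[field] = value
    return kept


if __name__ == "__main__":
    record = {
        "article_id": "KB-001",
        "title": "Reset Password",
        "content": "To reset your password, open Settings.",
        "internal_notes": "draft",
        "tags": ["security"],
    }
    scores = {"title": 0.9, "content": 0.8, "tags": 0.2, "internal_notes": 0.9}
    print(filter_record(record, scores, required={"article_id"},
                        excluded={"internal_notes"}, max_len=8))
```

Here `article_id` survives because it is required, `tags` is dropped for scoring below the threshold, and `internal_notes` is dropped despite its high score because it is excluded.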

System Prompt Encoder

Metric | Value
Token reduction | 65–90%
Output | Hierarchical CLM token vocabulary
Type inference | Optional (infer_types=True)
Attribute preservation | Optional (add_attrs=True)
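
The token reduction figures in this section can be read as the share of the original token count that compression removes. The sketch below shows one way to reproduce such a number on your own data. `token_reduction` and `naive_token_count` are illustrative helpers, and whitespace splitting is only a stand-in: real counts depend on the target model's tokenizer.

```python
def token_reduction(original_tokens: int, compressed_tokens: int) -> float:
    """Percentage of tokens removed by compression."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return 100.0 * (original_tokens - compressed_tokens) / original_tokens


def naive_token_count(text: str) -> int:
    """Rough stand-in for a tokenizer: count whitespace-separated chunks."""
    return len(text.split())


if __name__ == "__main__":
    original = "Please analyze the following transcript and report any compliance issues"
    compressed = "[REQ:ANALYZE] [TARGET:TRANSCRIPT] [EXTRACT:COMPLIANCE]"
    reduction = token_reduction(naive_token_count(original),
                                naive_token_count(compressed))
    print(f"{reduction:.1f}% fewer tokens (whitespace approximation)")
```

For a meaningful comparison, count both strings with the same tokenizer the downstream LLM uses.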

Documentation

Official documentation: https://yanickjair.github.io/cllm

Topic | Link
Getting started | docs/index.md
Thread Encoder | docs/thread_encoder/index.md
Transcript encoding | docs/thread_encoder/transcript_encoder.md
Free-Form Encoder | docs/thread_encoder/free_form_encoder.md
Structured Data Encoder | docs/sd_encoder.md
System Prompt Encoder | docs/sys_prompt/index.md
CLM Configuration | docs/advanced/clm_configuration.md
Token hierarchy | docs/advanced/clm_tokenization.md
Output reference | docs/advanced/clm_output.md

License

Dual-licensed:

  • AGPL-3.0 — free for open source use (LICENSE-AGPL)
  • Commercial — for proprietary products and SaaS (contact)

Issues · Discussions · Contact

