Natural Language compressor for LLMs (Compressed Language Model).

CLM

Semantic Token Encoding for LLMs

Compress transcripts, structured data, and system prompts — 60–95% fewer tokens, no model retraining.


CLM is a patent-pending semantic compression library. It encodes verbose content into compact structured token sequences that LLMs interpret with equal or better accuracy, at a fraction of the token cost.

Three targets, one encoder:

| Encoder | Input | Typical Compression |
|---|---|---|
| Thread | Support calls, chat transcripts, email threads | 62–80% |
| Structured Data | Product catalogs, knowledge bases, business rules | 40–85% |
| System Prompt | Task instructions, role definitions, agent configs | 65–90% |

Installation

pip install clm-core

Install the spaCy model for your language:

python -m spacy download en_core_web_sm   # English
python -m spacy download pt_core_news_sm  # Portuguese
python -m spacy download es_core_news_sm  # Spanish
python -m spacy download fr_core_news_sm  # French
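
Before constructing an encoder it can help to confirm that the spaCy model for your language is actually installed. The following is a minimal sketch, not part of CLM: it relies on the fact that spaCy models install as ordinary Python packages, and the `SPACY_MODELS` mapping and both helper names are hypothetical, built only from the model names listed above.

```python
from importlib.util import find_spec

# Language code -> spaCy model package, taken from the install commands above.
# The mapping and helper names are illustrative, not part of clm_core.
SPACY_MODELS = {
    "en": "en_core_web_sm",
    "pt": "pt_core_news_sm",
    "es": "es_core_news_sm",
    "fr": "fr_core_news_sm",
}


def model_installed(package_name: str) -> bool:
    """Return True if the named top-level package is importable."""
    return find_spec(package_name) is not None


def missing_models(langs):
    """Return the model names for the given language codes that are not installed."""
    return [
        SPACY_MODELS[lang]
        for lang in langs
        if not model_installed(SPACY_MODELS[lang])
    ]
```

Calling `missing_models(["en", "pt"])` returns the models that still need a `python -m spacy download ...` run; an empty list means both are present.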

Usage

All three encoders share the same interface. CLM auto-detects the input type.

from clm_core import CLMConfig, CLMEncoder

cfg = CLMConfig(lang="en")
encoder = CLMEncoder(cfg=cfg)

Thread Encoder — Transcripts

result = encoder.encode(input_=transcript, metadata={"channel": "voice"})
print(result.compressed)
# [INTERACTION:SUPPORT:CHANNEL=VOICE] [DURATION=6m] [LANG=EN]
# [DOMAIN:BILLING] [SERVICE:SUBSCRIPTION]
# [CUSTOMER_INTENT:REPORT_DUPLICATE_CHARGE] [CONTEXT:EMAIL_PROVIDED]
# [AGENT_ACTIONS:ACCOUNT_VERIFIED→DIAGNOSTIC_PERFORMED→REFUND_INITIATED]
# [SYSTEM_ACTIONS:PAYMENT_RETRY_DETECTED]
# [RESOLUTION:REFUND_INITIATED] [STATE:PENDING_CUSTOMER]
# [COMMITMENT:REFUND_3-5_BUSINESS_DAYS] [ARTIFACT:REFUND_REF=RFD-908712]
# [SENTIMENT:NEUTRAL→GRATEFUL]

Parse into a structured dict for downstream use:

data = result.to_dict()
# {"channel": "VOICE", "domain": "BILLING", "customerIntent": "REPORT_DUPLICATE_CHARGE",
#  "state": "PENDING_CUSTOMER", "agentActions": [...], "commitments": [...], ...}
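
If only the compressed string is available (for example, stored in a log), the bracketed tokens can be split with the standard library. This is a rough sketch that assumes the token shape seen in the sample output, `[KEY:VALUE]` or `[KEY=VALUE]`; `parse_tokens` is a hypothetical helper, not a clm_core function, and `result.to_dict()` remains the supported route.

```python
import re

# Matches the text between one pair of square brackets.
TOKEN_RE = re.compile(r"\[([^\[\]]+)\]")


def parse_tokens(compressed: str) -> list[tuple[str, str]]:
    """Split bracketed tokens into (key, value) pairs."""
    pairs = []
    for body in TOKEN_RE.findall(compressed):
        # Split on the first ':' or '=' only; everything after it is the value.
        parts = re.split(r"[:=]", body, maxsplit=1)
        key = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        pairs.append((key, value))
    return pairs


sample = (
    "[DOMAIN:BILLING] [DURATION=6m] "
    "[AGENT_ACTIONS:ACCOUNT_VERIFIED→DIAGNOSTIC_PERFORMED→REFUND_INITIATED]"
)
parsed = dict(parse_tokens(sample))
# Arrow-separated values read as an ordered sequence of steps.
steps = parsed["AGENT_ACTIONS"].split("→")
```

Nested tokens such as `[INTERACTION:SUPPORT:CHANNEL=VOICE]` come back with the key `INTERACTION` and the remainder as one value, so a second pass is needed to pull out sub-attributes.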

Structured Data Encoder

catalog = [{"article_id": "KB-001", "title": "Reset Password", "content": "...", "tags": ["security"]}]
result = encoder.encode(catalog)
print(result.compressed)
# {article_id,title,content,tags}[KB-001,Reset Password,To reset your password...,security]
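
The output above suggests why this format is compact: field names appear once in a header instead of being repeated for every record. The sketch below illustrates that general idea only. `encode_rows` is a hypothetical function, not CLM's encoder; the second catalog entry is made-up sample data, and joining list values with `+` is an arbitrary choice for the illustration.

```python
import json


def encode_rows(records):
    """Write field names once as a header, then one bracketed row per record."""
    if not records:
        return ""
    fields = list(records[0].keys())
    header = "{" + ",".join(fields) + "}"
    rows = []
    for rec in records:
        cells = []
        for field in fields:
            value = rec.get(field, "")
            # Join list values with '+' so they stay inside a single cell.
            if isinstance(value, list):
                cells.append("+".join(map(str, value)))
            else:
                cells.append(str(value))
        rows.append("[" + ",".join(cells) + "]")
    return header + "".join(rows)


catalog = [
    {"article_id": "KB-001", "title": "Reset Password", "tags": ["security"]},
    {"article_id": "KB-002", "title": "Update Email", "tags": ["account", "security"]},
]
compact = encode_rows(catalog)
verbose = json.dumps(catalog)
```

Here `compact` is shorter than `verbose` because keys, quotes, and braces are not repeated per record. The sketch does not escape commas inside values, which a real encoder would have to handle.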

System Prompt Encoder

result = encoder.encode(system_prompt)
print(result.compressed)
# [REQ:ANALYZE] [TARGET:TRANSCRIPT:DOMAIN=QA]
# [EXTRACT:COMPLIANCE,DISCLOSURES,SOFT_SKILLS,SENTIMENT]
# [OUT_JSON:{summary,qa_scores,violations,recommendations}]

Performance

Measured on a test dataset of 5,000+ samples:

Thread Encoder

| Metric | Value |
|---|---|
| Token reduction | 72–80% |
| Latency improvement | Up to 56% |
| Semantic preservation | Validated via Shannon Entropy |
| Languages | EN, PT, ES, FR |
| Schema version | v2.0 |
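
For readers unfamiliar with the metric named above, Shannon entropy of a symbol sequence is H = -Σ p_i · log2(p_i), where p_i is the relative frequency of symbol i. The README does not say how CLM applies it during validation, so the sketch below only computes the quantity itself; `shannon_entropy` is a hypothetical helper.

```python
import math
from collections import Counter


def shannon_entropy(symbols) -> float:
    """Entropy in bits per symbol of any iterable of hashable items."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log2(count / total)
        for count in counts.values()
    )
```

Two equally likely symbols give 1 bit per symbol and four equally likely symbols give 2 bits. A sequence with a single repeated symbol gives 0.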

Structured Data Encoder

| Metric | Value |
|---|---|
| Token reduction | 40–85% |
| Supports | Single objects, arrays, nested structures |
| Field filtering | Importance threshold + required/excluded |
| Per-field truncation | Configurable |
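
To make the filtering and truncation rows concrete, here is one way such rules could combine. Everything in this sketch is an assumption: `filter_fields`, its parameter names, the importance scores, and the sample record are hypothetical and do not reflect CLM's configuration options, which are described in docs/advanced/clm_configuration.md.

```python
def filter_fields(record, importance, threshold=0.5,
                  required=(), excluded=(), max_len=None):
    """Keep fields that are required or score at or above the threshold.

    Excluded fields are always dropped, even if also listed as required.
    String values longer than max_len are cut and marked with '...'.
    """
    kept = {}
    for key, value in record.items():
        if key in excluded:
            continue
        if key not in required and importance.get(key, 0.0) < threshold:
            continue
        if max_len is not None and isinstance(value, str) and len(value) > max_len:
            value = value[:max_len] + "..."
        kept[key] = value
    return kept


record = {
    "article_id": "KB-001",
    "title": "Reset Password",
    "content": "To reset your password, open Settings.",
    "internal_notes": "draft",
}
scores = {"article_id": 0.9, "title": 0.8, "content": 0.6, "internal_notes": 0.7}

filtered = filter_fields(
    record, scores, threshold=0.7,
    required=("content",), excluded=("internal_notes",), max_len=14,
)
```

In this example `content` survives despite its low score because it is required, `internal_notes` is dropped despite its high score because it is excluded, and the long `content` string is truncated.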

System Prompt Encoder

| Metric | Value |
|---|---|
| Token reduction | 65–90% |
| Output | Hierarchical CLM token vocabulary |
| Type inference | Optional (infer_types=True) |
| Attribute preservation | Optional (add_attrs=True) |
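
The token reduction figures in these tables are percentages of the original token count. A quick way to check them on your own data is sketched below. Both functions are hypothetical helpers, and the whitespace split is only a crude stand-in: real measurements should use the tokenizer of the model you are targeting.

```python
def token_reduction(original_tokens: int, compressed_tokens: int) -> float:
    """Percentage of tokens saved relative to the original count."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return 100.0 * (1 - compressed_tokens / original_tokens)


def rough_token_count(text: str) -> int:
    """Whitespace-based approximation; not a model tokenizer."""
    return len(text.split())
```

For example, a transcript of 1,000 tokens that compresses to 250 tokens gives `token_reduction(1000, 250)` = 75.0, which falls inside the range reported for the Thread Encoder.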

Documentation

| Topic | Link |
|---|---|
| Getting started | docs/index.md |
| Thread Encoder | docs/thread_encoder/index.md |
| Transcript encoding | docs/thread_encoder/transcript_encoder.md |
| Structured Data Encoder | docs/sd_encoder.md |
| System Prompt Encoder | docs/sys_prompt/index.md |
| CLM Configuration | docs/advanced/clm_configuration.md |
| Token hierarchy | docs/advanced/clm_tokenization.md |
| Output reference | docs/advanced/clm_output.md |

License

Dual-licensed:

  • AGPL-3.0 — free for open source use (LICENSE-AGPL)
  • Commercial — for proprietary products and SaaS (contact)

Issues · Discussions · Contact
