Natural Language compressor for LLMs (Compressed Language Model).

CLM

Semantic Token Encoding for LLMs

Compress transcripts, structured data, and system prompts — 60–95% fewer tokens, no model retraining.


CLM is a patent-pending semantic compression library. It encodes verbose content into compact structured token sequences that LLMs interpret with equal or better accuracy, at a fraction of the token cost.

Three targets, one encoder:

| Encoder | Input | Typical Compression |
|---|---|---|
| Thread | Support calls, chat transcripts, email threads | 62–80% |
| Structured Data | Product catalogs, knowledge bases, business rules | 40–85% |
| System Prompt | Task instructions, role definitions, agent configs | 65–90% |

Installation

pip install clm-core

Install the spaCy model for your language:

python -m spacy download en_core_web_sm   # English
python -m spacy download pt_core_news_sm  # Portuguese
python -m spacy download es_core_news_sm  # Spanish
python -m spacy download fr_core_news_sm  # French
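
Before constructing an encoder, it can help to confirm that the model for the chosen language is actually installed. The sketch below is not part of CLM: `SPACY_MODELS`, `spacy_model_installed`, and `install_hint` are hypothetical helper names. It relies only on the fact that spaCy models are installed as importable Python packages, and it takes the model names from the commands above.

```python
from importlib.util import find_spec

# Language code -> spaCy model package, copied from the install commands above.
SPACY_MODELS = {
    "en": "en_core_web_sm",
    "pt": "pt_core_news_sm",
    "es": "es_core_news_sm",
    "fr": "fr_core_news_sm",
}


def spacy_model_installed(lang: str) -> bool:
    """Return True if the spaCy model package for `lang` can be imported."""
    package = SPACY_MODELS.get(lang.lower())
    if package is None:
        raise ValueError(f"No model listed for language {lang!r}")
    # find_spec returns None when a top-level package is not installed.
    return find_spec(package) is not None


def install_hint(lang: str) -> str:
    """Return the download command for the model that `lang` needs."""
    return f"python -m spacy download {SPACY_MODELS[lang.lower()]}"


if not spacy_model_installed("en"):
    print("Missing model, run:", install_hint("en"))
```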

Usage

All three encoders share the same interface. CLM auto-detects the input type.

from clm_core import CLMConfig, CLMEncoder

cfg = CLMConfig(lang="en")
encoder = CLMEncoder(cfg=cfg)

Thread Encoder — Transcripts

result = encoder.encode(input_=transcript, metadata={"channel": "voice"})
print(result.compressed)
# [INTERACTION:SUPPORT:CHANNEL=VOICE] [DURATION=6m] [LANG=EN]
# [DOMAIN:BILLING] [SERVICE:SUBSCRIPTION]
# [CUSTOMER_INTENT:REPORT_DUPLICATE_CHARGE] [CONTEXT:EMAIL_PROVIDED]
# [AGENT_ACTIONS:ACCOUNT_VERIFIED→DIAGNOSTIC_PERFORMED→REFUND_INITIATED]
# [SYSTEM_ACTIONS:PAYMENT_RETRY_DETECTED]
# [RESOLUTION:REFUND_INITIATED] [STATE:PENDING_CUSTOMER]
# [COMMITMENT:REFUND_3-5_BUSINESS_DAYS] [ARTIFACT:REFUND_REF=RFD-908712]
# [SENTIMENT:NEUTRAL→GRATEFUL]

Parse into a structured dict for downstream use:

data = result.to_dict()
# {"channel": "VOICE", "domain": "BILLING", "customerIntent": "REPORT_DUPLICATE_CHARGE",
#  "state": "PENDING_CUSTOMER", "agentActions": [...], "commitments": [...], ...}
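
To get a feel for what the bracketed output carries, the tokens can also be read with a few lines of standard-library Python. This is only an illustrative sketch, not how `result.to_dict()` works: `parse_tokens` is a hypothetical helper, it keeps the raw upper-case token names rather than the camelCase keys shown above, and it assumes each token has the form `[KEY:VALUE]` or `[KEY=VALUE]` with `→` separating the steps of a sequence.

```python
import re

# One token is everything between a pair of square brackets.
TOKEN_RE = re.compile(r"\[([^\[\]]+)\]")
# The key ends at the first ':' or '=' inside the token.
SPLIT_RE = re.compile(r"[:=]")


def parse_tokens(compressed: str) -> dict:
    """Split a compressed string into {KEY: value} pairs (illustrative only)."""
    parsed = {}
    for body in TOKEN_RE.findall(compressed):
        parts = SPLIT_RE.split(body, maxsplit=1)
        key = parts[0]
        value = parts[1] if len(parts) == 2 else ""
        # Arrow-separated values become an ordered list of steps.
        parsed[key] = value.split("→") if "→" in value else value
    return parsed


sample = (
    "[DOMAIN:BILLING] [DURATION=6m] "
    "[AGENT_ACTIONS:ACCOUNT_VERIFIED→DIAGNOSTIC_PERFORMED→REFUND_INITIATED] "
    "[ARTIFACT:REFUND_REF=RFD-908712]"
)
tokens = parse_tokens(sample)
print(tokens["DOMAIN"])         # BILLING
print(tokens["AGENT_ACTIONS"])  # ['ACCOUNT_VERIFIED', 'DIAGNOSTIC_PERFORMED', 'REFUND_INITIATED']
```

Nested values such as `REFUND_REF=RFD-908712` are left as plain strings here; for real use, prefer `result.to_dict()`.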

Structured Data Encoder

catalog = [{"article_id": "KB-001", "title": "Reset Password", "content": "...", "tags": ["security"]}]
result = encoder.encode(catalog)
print(result.compressed)
# {article_id,title,content,tags}[KB-001,Reset Password,To reset your password...,security]
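
The output above states the field names once and then lists values positionally, which is where most of the savings over repeated JSON keys come from. The following sketch reproduces that shape to make the idea concrete. It is not CLM's encoder: `encode_records` is a hypothetical function, it does no importance filtering or truncation, it does not escape commas inside values, and joining list values with `+` is an assumption.

```python
def encode_records(records: list[dict]) -> str:
    """Emit one {header} followed by one [row] per record (illustrative only)."""
    if not records:
        return ""
    # Field order is taken from the first record.
    fields = list(records[0].keys())
    header = "{" + ",".join(fields) + "}"
    rows = []
    for record in records:
        cells = []
        for field in fields:
            value = record.get(field, "")
            # Flatten list-like values into a single cell.
            if isinstance(value, (list, tuple)):
                value = "+".join(str(v) for v in value)
            cells.append(str(value))
        rows.append("[" + ",".join(cells) + "]")
    return header + "".join(rows)


catalog = [
    {"article_id": "KB-001", "title": "Reset Password", "tags": ["security"]},
    {"article_id": "KB-002", "title": "Update Email", "tags": ["account", "security"]},
]
print(encode_records(catalog))
# {article_id,title,tags}[KB-001,Reset Password,security][KB-002,Update Email,account+security]
```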

System Prompt Encoder

result = encoder.encode(system_prompt)
print(result.compressed)
# [REQ:ANALYZE] [TARGET:TRANSCRIPT:DOMAIN=QA]
# [EXTRACT:COMPLIANCE,DISCLOSURES,SOFT_SKILLS,SENTIMENT]
# [OUT_JSON:{summary,qa_scores,violations,recommendations}]

Performance

Measured on a test dataset of 5,000+ samples:

Thread Encoder

| Metric | Value |
|---|---|
| Token reduction | 72–80% |
| Latency improvement | Up to 56% |
| Semantic preservation | Validated via Shannon Entropy |
| Languages | EN, PT, ES, FR |
| Schema version | v2.0 |
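
The two headline metrics can be reproduced on your own data with standard formulas. The sketch below computes token reduction as a percentage and the Shannon entropy of a token sequence, H = -Σ p·log2(p). How CLM applies entropy in its validation is not described here, so treat this as a generic illustration; `token_reduction` and `shannon_entropy` are hypothetical helper names, and the token counts are assumed to come from whichever tokenizer your model uses.

```python
import math
from collections import Counter


def token_reduction(original_tokens: int, compressed_tokens: int) -> float:
    """Percentage of tokens saved relative to the original."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return 100.0 * (1 - compressed_tokens / original_tokens)


def shannon_entropy(tokens: list[str]) -> float:
    """Shannon entropy in bits per token of the empirical token distribution."""
    total = len(tokens)
    if total == 0:
        return 0.0
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(token_reduction(1000, 250))            # 75.0
print(shannon_entropy(["a", "b", "a", "b"]))  # 1.0
```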

Structured Data Encoder

| Metric | Value |
|---|---|
| Token reduction | 40–85% |
| Supports | Single objects, arrays, nested structures |
| Field filtering | Importance threshold + required/excluded |
| Per-field truncation | Configurable |
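
Field filtering and truncation, as listed in the table above, might behave roughly like the sketch below. Everything here is an assumption made for illustration: the function `filter_record`, its parameter names, the per-field importance scores, and the rule that exclusion takes precedence over everything else are not taken from CLM's API. See the configuration docs for the real options.

```python
def filter_record(record, importance, threshold=0.5,
                  required=(), excluded=(), max_len=None):
    """Keep required fields and fields scoring at or above the threshold.

    Hypothetical helper: excluded fields are always dropped, and string
    values longer than max_len are cut and marked with '...'.
    """
    kept = {}
    for field, value in record.items():
        if field in excluded:
            continue
        if field not in required and importance.get(field, 0.0) < threshold:
            continue
        if max_len is not None and isinstance(value, str) and len(value) > max_len:
            value = value[:max_len] + "..."
        kept[field] = value
    return kept


article = {
    "article_id": "KB-001",
    "title": "Reset Password",
    "content": "To reset your password, open Settings.",
    "internal_notes": "draft",
}
scores = {"article_id": 0.2, "title": 0.9, "content": 0.8, "internal_notes": 0.9}

print(filter_record(article, scores, required=("article_id",),
                    excluded=("internal_notes",), max_len=8))
# {'article_id': 'KB-001', 'title': 'Reset Pa...', 'content': 'To reset...'}
```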

System Prompt Encoder

| Metric | Value |
|---|---|
| Token reduction | 65–90% |
| Output | Hierarchical CLM token vocabulary |
| Type inference | Optional (infer_types=True) |
| Attribute preservation | Optional (add_attrs=True) |

Documentation

| Topic | Link |
|---|---|
| Getting started | docs/index.md |
| Thread Encoder | docs/thread_encoder/index.md |
| Transcript encoding | docs/thread_encoder/transcript_encoder.md |
| Structured Data Encoder | docs/sd_encoder.md |
| System Prompt Encoder | docs/sys_prompt/index.md |
| CLM Configuration | docs/advanced/clm_configuration.md |
| Token hierarchy | docs/advanced/clm_tokenization.md |
| Output reference | docs/advanced/clm_output.md |

License

Dual-licensed:

  • AGPL-3.0 — free for open source use (LICENSE-AGPL)
  • Commercial — for proprietary products and SaaS (contact)
