clm-core

Natural Language compressor for LLMs (Compressed Language Model).

These details have not been verified by PyPI

Project description

CLM

Semantic Token Encoding for LLMs

Compress transcripts, structured data, and system prompts — 60–95% fewer tokens, no model retraining.

CLM is a patent-pending semantic compression library. It encodes verbose content into compact structured token sequences that LLMs interpret with equal or better accuracy, at a fraction of the token cost.

Three targets, one encoder:

Encoder	Input	Typical Compression
Thread	Support calls, chat transcripts, email threads	62–80%
Structured Data	Product catalogs, knowledge bases, business rules	40–85%
System Prompt	Task instructions, role definitions, agent configs	65–90%
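
The percentages in the table read most naturally as token reductions relative to the original input, although the source does not say which tokenizer was used to count. As a minimal sketch of that arithmetic, assuming you already have token counts from your own model's tokenizer (`token_reduction` is a hypothetical helper, not part of clm-core):

```python
def token_reduction(original_tokens: int, compressed_tokens: int) -> float:
    """Return the percentage of tokens removed by compression."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return 100.0 * (1 - compressed_tokens / original_tokens)


# Hypothetical counts: a 1,000-token transcript encoded into 250 tokens.
print(f"{token_reduction(1000, 250):.1f}%")  # 75.0%
```

A result of 75% would fall inside the 62–80% range listed for the Thread encoder.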

Installation

pip install clm-core

Install the spaCy model for your language:

python -m spacy download en_core_web_sm   # English
python -m spacy download pt_core_news_sm  # Portuguese
python -m spacy download es_core_news_sm  # Spanish
python -m spacy download fr_core_news_sm  # French
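
spaCy models installed this way are ordinary Python packages, so their presence can be checked before constructing an encoder. The sketch below uses only the standard library; the language-code keys and the helper names `model_installed` and `missing_models` are assumptions for illustration, not clm-core API.

```python
import importlib.util

# Language codes (assumed) mapped to the spaCy models listed above.
SPACY_MODELS = {
    "en": "en_core_web_sm",
    "pt": "pt_core_news_sm",
    "es": "es_core_news_sm",
    "fr": "fr_core_news_sm",
}


def model_installed(package_name: str) -> bool:
    """Return True if the named top-level package is on the import path."""
    return importlib.util.find_spec(package_name) is not None


def missing_models(langs):
    """Return the model names for `langs` that are not installed."""
    return [SPACY_MODELS[lang] for lang in langs
            if not model_installed(SPACY_MODELS[lang])]


# Prints an empty list when the English model is available.
print(missing_models(["en"]))
```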

Usage

All three encoders share the same interface. CLM auto-detects the input type.

from clm_core import CLMConfig, CLMEncoder

cfg = CLMConfig(lang="en")
encoder = CLMEncoder(cfg=cfg)

Thread Encoder — Transcripts

result = encoder.encode(input_=transcript, metadata={"channel": "voice"})
print(result.compressed)

[INTERACTION:SUPPORT:CHANNEL=VOICE] [DURATION=6m] [LANG=EN]
[DOMAIN:BILLING] [SERVICE:SUBSCRIPTION]
[CUSTOMER_INTENT:REPORT_DUPLICATE_CHARGE] [CONTEXT:EMAIL_PROVIDED]
[AGENT_ACTIONS:ACCOUNT_VERIFIED→DIAGNOSTIC_PERFORMED→REFUND_INITIATED]
[SYSTEM_ACTIONS:PAYMENT_RETRY_DETECTED]
[RESOLUTION:REFUND_INITIATED] [STATE:PENDING_CUSTOMER]
[COMMITMENT:REFUND_3-5_BUSINESS_DAYS] [ARTIFACT:REFUND_REF=RFD-908712]
[SENTIMENT:NEUTRAL→GRATEFUL]

Parse into a structured dict for downstream use:

data = result.to_dict()
# {"channel": "VOICE", "domain": "BILLING", "customerIntent": "REPORT_DUPLICATE_CHARGE",
#  "state": "PENDING_CUSTOMER", "agentActions": [...], "commitments": [...], ...}
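
If only the compressed string is available (for example, after it has been stored or logged), the bracketed tokens can also be split with a small regex. This is an illustrative sketch based on the sample output above, not a description of how `result.to_dict()` works; `parse_tokens` is a hypothetical helper and it keeps the original token names rather than the camelCase keys shown in the dict.

```python
import re

# One bracketed token, e.g. "[DOMAIN:BILLING]".
TOKEN_RE = re.compile(r"\[([^\[\]]+)\]")


def parse_tokens(compressed: str) -> dict:
    """Split bracketed tokens into a {name: value} dict.

    The name is the text before the first ':' or '='. The remainder is
    kept as a raw string, except that '→' chains become lists.
    A repeated name overwrites the earlier value in this sketch.
    """
    parsed = {}
    for body in TOKEN_RE.findall(compressed):
        parts = re.split(r"[:=]", body, maxsplit=1)
        name = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        parsed[name] = value.split("→") if "→" in value else value
    return parsed


sample = (
    "[DOMAIN:BILLING] [DURATION=6m] "
    "[AGENT_ACTIONS:ACCOUNT_VERIFIED→DIAGNOSTIC_PERFORMED→REFUND_INITIATED]"
)
print(parse_tokens(sample))
```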

Structured Data Encoder

catalog = [{"article_id": "KB-001", "title": "Reset Password", "content": "...", "tags": ["security"]}]
result = encoder.encode(catalog)
print(result.compressed)
# {article_id,title,content,tags}[KB-001,Reset Password,To reset your password...,security]
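
The output above states the field names once and then lists values positionally, which is where the savings over repeated JSON keys come from. The following sketch reproduces that layout for a list of flat records. It is not the library's implementation: `encode_records`, the `+` separator for list values, the concatenation of multiple rows, and the second catalog entry are all assumptions made for illustration.

```python
def encode_records(records):
    """Encode a list of dicts as '{fields}[row][row]...'.

    Field order follows the first record. List values are joined with '+'.
    Values containing commas or brackets are not escaped in this sketch.
    """
    if not records:
        return ""
    fields = list(records[0])
    header = "{" + ",".join(fields) + "}"
    rows = []
    for record in records:
        cells = []
        for field in fields:
            value = record.get(field, "")
            if isinstance(value, (list, tuple)):
                value = "+".join(str(item) for item in value)
            cells.append(str(value))
        rows.append("[" + ",".join(cells) + "]")
    return header + "".join(rows)


# Hypothetical sample data modeled on the catalog entry above.
sample_catalog = [
    {"article_id": "KB-001", "title": "Reset Password", "tags": ["security"]},
    {"article_id": "KB-002", "title": "Update Email", "tags": ["account", "security"]},
]
print(encode_records(sample_catalog))
# {article_id,title,tags}[KB-001,Reset Password,security][KB-002,Update Email,account+security]
```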

System Prompt Encoder

result = encoder.encode(system_prompt)
print(result.compressed)
# [REQ:ANALYZE] [TARGET:TRANSCRIPT:DOMAIN=QA]
# [EXTRACT:COMPLIANCE,DISCLOSURES,SOFT_SKILLS,SENTIMENT]
# [OUT_JSON:{summary,qa_scores,violations,recommendations}]

Performance

Measured on a test dataset of more than 5,000 samples:

Thread Encoder

Metric	Value
Token reduction	72–80%
Latency improvement	Up to 56%
Semantic preservation	Validated via Shannon Entropy
Languages	EN, PT, ES, FR
Schema version	v2.0
Language detection	`detect_lang` (default: on)
Context values	`include_ctx_values` — emit raw NER values alongside context tokens
Duration estimation	`estimate_thread_duration` — infer duration from content
Built-in summary	`include_summary` + optional `custom_summary_template` (Jinja2)
Custom redaction	`redaction_pattern` — regex for PII placeholder detection
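
The table names Shannon entropy as the validation measure but does not describe the procedure. For reference only, the quantity itself can be computed over any token sequence as below; how (or whether) clm-core compares entropies before and after encoding is not stated in the source, and `shannon_entropy` is a hypothetical helper.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy, in bits per token, of the empirical token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = sum over symbols of p * log2(1 / p), with p = n / total.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Two equally likely symbols carry one bit per token.
print(shannon_entropy(["a", "b", "a", "b"]))  # 1.0
```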

Structured Data Encoder

Metric	Value
Token reduction	40–85%
Supports	Single objects, arrays, nested structures
Field filtering	Importance threshold + required/excluded
Per-field truncation	Configurable

System Prompt Encoder

Metric	Value
Token reduction	65–90%
Output	Hierarchical CLM token vocabulary
Type inference	Optional (`infer_types=True`)
Attribute preservation	Optional (`add_attrs=True`)

Documentation

Official documentation: https://yanickjair.github.io/cllm

Topic	Link
Getting started	docs/index.md
Thread Encoder	docs/thread_encoder/index.md
Transcript encoding	docs/thread_encoder/transcript_encoder.md
Free-Form Encoder	docs/thread_encoder/free_form_encoder.md
Structured Data Encoder	docs/sd_encoder.md
System Prompt Encoder	docs/sys_prompt/index.md
CLM Configuration	docs/advanced/clm_configuration.md
Token hierarchy	docs/advanced/clm_tokenization.md
Output reference	docs/advanced/clm_output.md

License

Dual-licensed:

AGPL-3.0 — free for open source use (LICENSE-AGPL)
Commercial — for proprietary products and SaaS (contact)

Issues · Discussions · Contact

Project details

Release history

Version	Released	Notes
1.0.9	Mar 24, 2026	This version
1.0.8	Mar 8, 2026
1.0.7	Mar 1, 2026
1.0.6	Feb 22, 2026
1.0.5	Feb 22, 2026
1.0.4	Feb 21, 2026
1.0.3	Feb 17, 2026
1.0.2	Feb 14, 2026
1.0.1	Feb 14, 2026
1.0.0	Feb 7, 2026
0.0.9	Jan 26, 2026
0.0.8	Jan 26, 2026
0.0.7	Jan 25, 2026
0.0.5	Jan 25, 2026
0.0.4	Jan 25, 2026
0.0.3.2	Jan 22, 2026
0.0.3.1	Jan 19, 2026
0.0.3	Jan 25, 2026
0.0.3a0	Jan 19, 2026	Pre-release
0.0.2	Jan 11, 2026
0.0.1	Jan 2, 2026

Download files

Source Distribution

clm_core-1.0.8.tar.gz (244.1 kB)

Uploaded Mar 8, 2026 Source

Built Distribution

clm_core-1.0.8-py3-none-any.whl (239.8 kB)

Uploaded Mar 8, 2026 Python 3

File details

Details for the file clm_core-1.0.8.tar.gz.

File metadata

Download URL: clm_core-1.0.8.tar.gz
Upload date: Mar 8, 2026
Size: 244.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for clm_core-1.0.8.tar.gz
Algorithm	Hash digest
SHA256	`01f4dda9da273723a42abf19aec3299ad8146e60b25420e4b5fc569d26bf1abd`
MD5	`3658fae84e6bca72195e04a3a670ae91`
BLAKE2b-256	`410d06c2517e6eee4716056c7bd4bae98ac5061dde766358489e07ee20ee6470`
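
A downloaded file can be checked against the digests above with Python's standard `hashlib`. The sketch below is one way to do it; `file_digest` and `verify` are hypothetical helper names, and the `"blake2b-256"` label is handled by constructing BLAKE2b with a 32-byte digest.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    if algorithm == "blake2b-256":
        hasher = hashlib.blake2b(digest_size=32)
    else:
        hasher = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify(path, expected_hex, algorithm="sha256"):
    """Return True if the file's digest matches the published value."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()


# SHA256 published above for the source distribution.
EXPECTED_SHA256 = "01f4dda9da273723a42abf19aec3299ad8146e60b25420e4b5fc569d26bf1abd"
# Example call, assuming the archive is in the current directory:
# verify("clm_core-1.0.8.tar.gz", EXPECTED_SHA256)
```

pip can perform the same check at install time through its hash-checking mode when a `--hash` entry is supplied in a requirements file.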

File details

Details for the file clm_core-1.0.8-py3-none-any.whl.

File metadata

Download URL: clm_core-1.0.8-py3-none-any.whl
Upload date: Mar 8, 2026
Size: 239.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for clm_core-1.0.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7ccf4dddc82d79e36f0f3aaa0cad63f72649a5b0695face60e445d45aaf56e5a`
MD5	`ab125c0a57d05ff9bd533c2361775325`
BLAKE2b-256	`274786f8f6a258b0d25662568be244a168a42e51c87df1cabd9e1b606b720e3d`
