clm-core

Thread and prompt encoder components for Compressed Language Model workflows.

CLM

Semantic Token Encoding for LLMs

Compress transcripts, structured data, and system prompts — 60–95% fewer tokens, no model retraining.

CLM is an open-source semantic compression library. It encodes verbose content into compact structured token sequences that LLMs interpret with equal or better accuracy, at a fraction of the token cost.

Three targets, one encoder:

Encoder	Input	Typical Compression
Thread	Support calls, chat transcripts, email threads	62–80%
Structured Data	Product catalogs, knowledge bases, business rules	40–85%
System Prompt	Task instructions, role definitions, agent configs	65–90%

Installation

pip install clm-core

Install the spaCy model for your language:

python -m spacy download en_core_web_sm   # English
python -m spacy download pt_core_news_sm  # Portuguese
python -m spacy download es_core_news_sm  # Spanish
python -m spacy download fr_core_news_sm  # French
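
spaCy models are installed as ordinary Python packages, so you can check for one before constructing an encoder. The sketch below is only an illustration: the `SPACY_MODELS` mapping and the `missing_model` helper are hypothetical names, not part of clm-core, and the mapping simply mirrors the four download commands above.

```python
import importlib.util
from typing import Optional

# Hypothetical mapping (not part of clm-core): language code -> spaCy model
# package, copied from the download commands listed above.
SPACY_MODELS = {
    "en": "en_core_web_sm",
    "pt": "pt_core_news_sm",
    "es": "es_core_news_sm",
    "fr": "fr_core_news_sm",
}


def missing_model(lang: str) -> Optional[str]:
    """Return the model package name if it is not installed, else None."""
    package = SPACY_MODELS[lang]  # raises KeyError for an unlisted language
    # find_spec returns None when no importable package has that name.
    return None if importlib.util.find_spec(package) else package


if __name__ == "__main__":
    for lang in SPACY_MODELS:
        package = missing_model(lang)
        if package:
            print(f"{lang}: run `python -m spacy download {package}`")
```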

If you want the structured-data encoder as part of the same install, add the extra:

pip install "clm-core[sd_encoder]"

Usage

All three encoders share the same interface. CLM auto-detects the input type.

from clm_core import CLMConfig, CLMEncoder

cfg = CLMConfig(lang="en")
encoder = CLMEncoder(cfg=cfg)

Thread Encoder — Transcripts

# `transcript` holds the raw conversation text to compress (not shown here)
result = encoder.encode(input_=transcript, metadata={"channel": "voice"})
print(result.compressed)

Example output:

[INTERACTION:SUPPORT:CHANNEL=VOICE] [DURATION=6m] [LANG=EN]
[DOMAIN:BILLING] [SERVICE:SUBSCRIPTION]
[CUSTOMER_INTENT:REPORT_DUPLICATE_CHARGE] [CONTEXT:EMAIL_PROVIDED]
[AGENT_ACTIONS:ACCOUNT_VERIFIED→DIAGNOSTIC_PERFORMED→REFUND_INITIATED]
[SYSTEM_ACTIONS:PAYMENT_RETRY_DETECTED]
[RESOLUTION:REFUND_INITIATED] [STATE:PENDING_CUSTOMER]
[COMMITMENT:REFUND_3-5_BUSINESS_DAYS] [ARTIFACT:REFUND_REF=RFD-908712]
[SENTIMENT:NEUTRAL→GRATEFUL]

Parse into a structured dict for downstream use:

data = result.to_dict()
# {"channel": "VOICE", "domain": "BILLING", "customerIntent": "REPORT_DUPLICATE_CHARGE",
#  "state": "PENDING_CUSTOMER", "agentActions": [...], "commitments": [...], ...}
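
`to_dict()` is the supported way to get structured output. For readers curious how the bracketed sequence maps onto keys and values, here is a rough standalone sketch. It is not clm-core's parser: `parse_tokens` is a hypothetical helper, it assumes every token looks like `[KEY:VALUE]` or `[KEY=VALUE]`, it keeps the raw upper-case keys rather than the camelCase names shown above, and a repeated key overwrites the earlier one.

```python
import re

# One bracketed token: everything between '[' and the next ']'.
TOKEN_RE = re.compile(r"\[([^\]]+)\]")
# Key is the text before the first ':' or '='; the rest is the value.
KEY_VALUE_RE = re.compile(r"([^:=]+)[:=](.*)")


def parse_tokens(compressed: str) -> dict:
    """Parse '[KEY:VALUE] [KEY=VALUE] ...' into a flat dict (illustrative only)."""
    parsed = {}
    for body in TOKEN_RE.findall(compressed):
        match = KEY_VALUE_RE.match(body)
        if match is None:
            parsed[body] = ""  # bare token with no separator
            continue
        key, value = match.group(1), match.group(2)
        # Arrow-separated values (e.g. action chains) become ordered lists.
        parsed[key] = value.split("→") if "→" in value else value
    return parsed


if __name__ == "__main__":
    sample = (
        "[DOMAIN:BILLING] [DURATION=6m]\n"
        "[AGENT_ACTIONS:ACCOUNT_VERIFIED→DIAGNOSTIC_PERFORMED→REFUND_INITIATED]"
    )
    print(parse_tokens(sample))
```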

Structured Data Encoder (SDE)

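SDE was moved to a standalone sub-library, `sd_encoder`; see docs/sd_encoder.md for details. With the `sd_encoder` extra installed, structured data goes through the same `encode` call: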

catalog = [{"article_id": "KB-001", "title": "Reset Password", "content": "...", "tags": ["security"]}]
result = encoder.encode(catalog)
print(result.compressed)
# {article_id,title,content,tags}[KB-001,Reset Password,To reset your password...,security]
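
The output format states the field names once and then lists only values, which is where most of the savings on repetitive records come from. A minimal sketch of that idea follows. It is not the `sd_encoder` implementation: `encode_records` is a hypothetical helper, it takes the schema from the first record, does no escaping, filtering, or truncation, and the `|` separator for multi-valued fields is an assumption.

```python
def encode_records(records: list) -> str:
    """Render a list of dicts as '{field,...}[value,...][value,...]' (illustrative only)."""
    if not records:
        return ""
    fields = list(records[0].keys())  # schema comes from the first record
    header = "{" + ",".join(fields) + "}"
    rows = []
    for record in records:
        cells = []
        for field in fields:
            value = record.get(field, "")
            if isinstance(value, (list, tuple)):
                # Assumed separator for multi-valued fields such as tags.
                value = "|".join(str(item) for item in value)
            cells.append(str(value))
        rows.append("[" + ",".join(cells) + "]")
    return header + "".join(rows)


if __name__ == "__main__":
    catalog = [
        {"article_id": "KB-001", "title": "Reset Password", "content": "...", "tags": ["security"]},
    ]
    print(encode_records(catalog))
```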

System Prompt Encoder

System prompts are encoded through the same CLMEncoder interface as the other inputs. CLM usually classifies the prompt for you, but it helps to think of each prompt as either a task prompt or a configuration prompt.

cfg = CLMConfig(lang="en")
encoder = CLMEncoder(cfg=cfg)

task_prompt = """
You are a customer service quality analyst.
Analyze call transcripts for compliance issues and sentiment problems.
Return the result as JSON.
"""

task_result = encoder.encode(task_prompt)
print(task_result.metadata["prompt_mode"])
print(task_result.compressed)

If you need the step-by-step guide, start with docs/sys_prompt/index.md.

Task prompts are usually compressed into a single CL token sequence. Configuration prompts can contain `{{placeholder}}` variables, which are bound to concrete values after compression:

config_prompt = """
<role>You are a helpful support agent</role>

<custom_rules>
Always greet the customer as {{customer_name}}.
</custom_rules>
"""

result = encoder.encode(config_prompt)
bound_prompt = encoder.bind(result, customer_name="Melissa")

print(result.metadata["prompt_mode"])
print(result.compressed)
print(bound_prompt)
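
How `encoder.bind` works internally is not described here. Conceptually, binding fills `{{name}}` placeholders with keyword arguments, which can be sketched in a few lines. `bind_placeholders` below is a hypothetical stand-in that operates on plain text, not the library's method, and raising on a missing value is a design choice made for this sketch.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER_RE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def bind_placeholders(text: str, **values) -> str:
    """Replace each {{name}} in text with values[name] (illustrative only)."""

    def substitute(match):
        name = match.group(1)
        if name not in values:
            # Fail loudly rather than ship a prompt with an unfilled slot.
            raise KeyError(f"no value supplied for placeholder {name!r}")
        return str(values[name])

    return PLACEHOLDER_RE.sub(substitute, text)


if __name__ == "__main__":
    rule = "Always greet the customer as {{customer_name}}."
    print(bind_placeholders(rule, customer_name="Melissa"))
```

Binding after compression means the compressed prompt can be produced once and reused with different values per request.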

Performance

Measured on a test dataset of 5,000+ samples:

Thread Encoder

Metric	Value
Token reduction	72–80%
Latency improvement	Up to 56%
Semantic preservation	Validated via Shannon Entropy
Languages	EN, PT, ES, FR
Schema version	v2.0
Language detection	`detect_lang` (default: on)
Context values	`include_ctx_values` — emit raw NER values alongside context tokens
Duration estimation	`estimate_thread_duration` — infer duration from content
Built-in summary	`include_summary` + optional `custom_summary_template` (Jinja2)
Custom redaction	`redaction_pattern` — regex for PII placeholder detection

Structured Data Encoder

Metric	Value
Token reduction	40–85%
Supports	Single objects, arrays, nested structures
Field filtering	Importance threshold + required/excluded
Per-field truncation	Configurable

System Prompt Encoder

Metric	Value
Token reduction	65–90%
Output	Hierarchical CLM token vocabulary
Type inference	Optional (`infer_types=True`)
Attribute preservation	Optional (`add_attrs=True`)
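
The token-reduction figures above are percentages of the original token count. To check them against your own data, something like the following sketch could be used. The `token_reduction` helper is hypothetical, and the default whitespace split is only a crude stand-in for a real tokenizer; pass your model's token counter as `count_tokens` for meaningful numbers.

```python
from typing import Callable, Optional


def token_reduction(
    original: str,
    compressed: str,
    count_tokens: Optional[Callable[[str], int]] = None,
) -> float:
    """Return the percentage of tokens saved by compression (illustrative only)."""
    # Assumption: whitespace-separated words approximate tokens when no
    # tokenizer is supplied. Real LLM tokenizers will give different counts.
    count = count_tokens or (lambda text: len(text.split()))
    before = count(original)
    after = count(compressed)
    if before == 0:
        raise ValueError("original text has no tokens")
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    original = "the customer called to report that they were charged twice this month"
    compressed = "[CUSTOMER_INTENT:REPORT_DUPLICATE_CHARGE]"
    print(f"{token_reduction(original, compressed):.1f}% fewer tokens")
```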

Documentation

Official documentation: https://yanickjair.github.io/cllm

Topic	Link
Getting started	docs/index.md
Thread Encoder	docs/thread_encoder/index.md
Transcript encoding	docs/thread_encoder/transcript_encoder.md
Free-Form Encoder	docs/thread_encoder/free_form_encoder.md
Structured Data Encoder	docs/sd_encoder.md
System Prompt Encoder	docs/sys_prompt/index.md
CLM Configuration	docs/advanced/clm_configuration.md
Token hierarchy	docs/advanced/clm_tokenization.md
Output reference	docs/advanced/clm_output.md

Release

This repository publishes two independent Python packages from one main branch:

Package	Source	Workflow	PyPI trigger
`clm-core`	`clm_core/`	`.github/workflows/publish.yml`	`clm_core-v*` tags
`sd_encoder`	`crates/sd_encoder/python/`	`.github/workflows/publish-sd-encoder.yml`	`sd_encoder-v*` tags

Use package-specific tags instead of release branches.
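
The tag prefixes in the table are what tie a tag to a package. As a rough illustration of that convention (the `parse_release_tag` helper is hypothetical and is not taken from the repository's workflows), a tag can be split into its package and version like this:

```python
from typing import Optional, Tuple

# Tag prefixes from the table above: the `*` in `clm_core-v*` is the version.
TAG_PREFIXES = {
    "clm-core": "clm_core-v",
    "sd_encoder": "sd_encoder-v",
}


def parse_release_tag(tag: str) -> Optional[Tuple[str, str]]:
    """Return (package, version) for a release tag, or None if it matches neither."""
    for package, prefix in TAG_PREFIXES.items():
        if tag.startswith(prefix) and len(tag) > len(prefix):
            return package, tag[len(prefix):]
    return None


if __name__ == "__main__":
    for tag in ("clm_core-v1.0.9", "sd_encoder-v0.1.0", "v2.0.0"):
        print(tag, "->", parse_release_tag(tag))
```

A check like this could also be used before tagging to confirm that the version in the tag matches the version file you just updated.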

Release `clm-core`

1. Update clm_core/__version__.py.
2. Commit and push the change to main.
3. Create and push a matching tag:

git tag clm_core-v1.0.9
git push origin clm_core-v1.0.9

Release `sd_encoder`

1. Update the version in crates/sd_encoder/Cargo.toml.
2. Commit and push the change to main.
3. Create and push a matching tag:

git tag sd_encoder-v0.1.0
git push origin sd_encoder-v0.1.0

Pushes to main publish changed packages to TestPyPI. Release tags publish the matching package to PyPI. Both workflows also support manual dispatch with one of three options: `none`, `testpypi`, or `pypi`.

License

MIT — see LICENSE.

Issues · Discussions · Contact

Release history

Version	Released
1.3.3 (this version)	Jul 19, 2026
1.3.2	Jul 16, 2026
1.3.1	Jul 1, 2026
1.3.0	Jun 30, 2026
1.2.0	May 26, 2026
1.1.0	May 23, 2026
1.0.9	Mar 24, 2026
1.0.8	Mar 8, 2026
1.0.7	Mar 1, 2026
1.0.6	Feb 22, 2026
1.0.5	Feb 22, 2026
1.0.4	Feb 21, 2026
1.0.3	Feb 17, 2026
1.0.2	Feb 14, 2026
1.0.1	Feb 14, 2026
1.0.0	Feb 7, 2026
0.0.9	Jan 26, 2026
0.0.8	Jan 26, 2026
0.0.7	Jan 25, 2026
0.0.5	Jan 25, 2026
0.0.4	Jan 25, 2026
0.0.3.2	Jan 22, 2026
0.0.3.1	Jan 19, 2026
0.0.3	Jan 25, 2026
0.0.3a0 (pre-release)	Jan 19, 2026
0.0.2	Jan 11, 2026
0.0.1	Jan 2, 2026

Download files

Source Distribution

clm_core-1.3.3.tar.gz (248.6 kB)

Uploaded Jul 19, 2026 Source

Built Distribution

clm_core-1.3.3-py3-none-any.whl (242.5 kB)

Uploaded Jul 19, 2026 Python 3

File details

Details for the file clm_core-1.3.3.tar.gz.

File metadata

Download URL: clm_core-1.3.3.tar.gz
Upload date: Jul 19, 2026
Size: 248.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.14

File hashes

Hashes for clm_core-1.3.3.tar.gz
Algorithm	Hash digest
SHA256	`31a46f83276ad41e8b99ac36362b780d06247a1de6bf6f1d7e102d1316e99d96`
MD5	`8e8d617fabd9d8741a2848a3f5efcb23`
BLAKE2b-256	`c2d82a5d50b369bfa9bebb2f2cbdfa82bc0854d90ff75586229106376f63b7ff`

File details

Details for the file clm_core-1.3.3-py3-none-any.whl.

File metadata

Download URL: clm_core-1.3.3-py3-none-any.whl
Upload date: Jul 19, 2026
Size: 242.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.14

File hashes

Hashes for clm_core-1.3.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fdd8b2f42ae4ffa169b330cb21530cba2e87e9c8dc9e88c0e7a01460fd07552b`
MD5	`36e1f05d73d4ecf12c90e3ad7c0ac488`
BLAKE2b-256	`db5df0ce4fe39142275c59a440377f31e5e2c9839d79c508a9bc890194d64a68`
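
If you download either file manually, you can compare its SHA256 digest against the values listed above. A minimal sketch using only the standard library follows; the `sha256_of` and `matches_digest` helper names are illustrative.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example (the path is wherever you saved the download):
# matches_digest("clm_core-1.3.3.tar.gz", "<SHA256 value from the table above>")
```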
