clm-core

Natural Language compressor for LLMs (Compressed Language Model).

CLLM

Compressed Language Models via Semantic Token Encoding

Enterprise-grade compression for transcripts, structured data, and system prompts - achieving 60-95% token reduction.

🚀 Overview

CLLM is a patent-pending compression technology that dramatically reduces LLM token consumption through semantic encoding. Unlike simple abbreviation or character-level compression, CLLM preserves the meaning of your content using structured token vocabularies.

Three Core Compression Targets

1. Transcripts (Contact Centers)

Customer service conversations
Support call logs
Agent-customer interactions
95.9% redundancy reduction (Shannon Entropy validated)

2. Structured Data (Enterprise)

NBA (Next Best Action) catalogs
Product configurations
Business rule sets
Metadata and taxonomies

3. System Prompts (Enterprise)

Agent instructions
Role definitions
Operational guidelines
Task specifications

Benefits

60-95% token reduction across all three targets
Equal or better LLM responses with compressed inputs
Up to 73% faster processing with reduced latency
Massive cost savings for high-volume applications
No model training required - works with existing LLMs

The Problem

In high-volume LLM environments, verbose content creates significant challenges:

Thousands of API calls per user per day
Rapidly escalating token costs at scale
Infrastructure bottlenecks under heavy load
Deployment blocked by scalability concerns
Conversation data consuming excessive context windows

The Solution

Transcript Compression:

Customer: Hi, I need help with my account balance.
Agent: I'd be happy to help. Can I have your account number?
Customer: It's 12345678.
Agent: Your current balance is $1,450.32.

↓

[CTX:CUSTOMER_SERVICE][TOPIC:ACCOUNT_BALANCE][DATA:ACC=12345678,BAL=1450.32]

System Prompt Compression:

You are a customer service quality analyst. Analyze transcripts for compliance 
violations and sentiment issues in agent responses.

↓

[REQ:ANALYZE][TARGET:TRANSCRIPT:DOMAIN=SERVICE][EXTRACT:COMPLIANCE,SENTIMENT:SOURCE=AGENT]

Result: 85-92% token reduction, identical semantic meaning, faster processing.
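The bracketed output is easy to split back into its parts. The following is a minimal sketch, assuming a grammar inferred only from the two examples above (tokens of the form `[CATEGORY:body]`, concatenated one after another); `split_tokens` is a hypothetical helper, not part of the clm-core API.

```python
import re

# Assumed grammar, inferred from the examples above: "[CATEGORY:body]" tokens,
# concatenated with no nesting of square brackets.
TOKEN_RE = re.compile(r"\[([^\[\]]+)\]")


def split_tokens(compressed: str) -> list[tuple[str, str]]:
    """Return (category, body) pairs; body is '' when a token has no colon."""
    pairs = []
    for inner in TOKEN_RE.findall(compressed):
        category, _, body = inner.partition(":")
        pairs.append((category, body))
    return pairs


example = "[CTX:CUSTOMER_SERVICE][TOPIC:ACCOUNT_BALANCE][DATA:ACC=12345678,BAL=1450.32]"
for category, body in split_tokens(example):
    print(category, "->", body)
```

Splitting on the first colon only keeps bodies such as `TRANSCRIPT:DOMAIN=SERVICE` intact for a later, more detailed pass.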

✨ Key Features

Three Compression Targets: Transcripts, Structured Data, System Prompts
Contact Center Focused: Built for high-volume customer service operations
Semantic Compression: Preserves meaning, not just characters
Hierarchical Token Vocabulary: REQ, TARGET, EXTRACT, CTX, OUT, REF
Multilingual Support: English, Portuguese, Spanish, French
High Accuracy: 91.5% validation rate on a dataset of 5,000+ samples
Zero Training: Works with GPT-4, Claude, and other modern LLMs out-of-the-box
Production Ready: Battle-tested on real contact center transcripts and enterprise catalogs

📦 Installation

Install CLLM using pip:

pip install clm-core

Required: Install spaCy Language Model

CLLM uses spaCy for natural language processing. Install the appropriate language model:

# English
python -m spacy download en_core_web_sm

# Portuguese
python -m spacy download pt_core_news_sm

# Spanish
python -m spacy download es_core_news_sm

# French
python -m spacy download fr_core_news_sm
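Before running the encoder, it can help to confirm that the model for your language is present. This is a small sketch using only the standard library; it relies on the fact that spaCy models installed with `spacy download` are importable Python packages. The helper names `model_for` and `is_installed` are illustrative and not part of clm-core.

```python
import importlib.util

# Model names are the ones listed in the install commands above.
SPACY_MODELS = {
    "en": "en_core_web_sm",
    "pt": "pt_core_news_sm",
    "es": "es_core_news_sm",
    "fr": "fr_core_news_sm",
}


def model_for(lang: str) -> str:
    """Return the spaCy model package name for a supported language code."""
    try:
        return SPACY_MODELS[lang]
    except KeyError:
        raise ValueError(f"unsupported language code: {lang!r}") from None


def is_installed(lang: str) -> bool:
    """True if the model package for `lang` can be found in this environment."""
    return importlib.util.find_spec(model_for(lang)) is not None


for code in SPACY_MODELS:
    status = "installed" if is_installed(code) else "missing"
    print(f"{code}: {model_for(code)} ({status})")
```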

🏗️ Architecture

Semantic Token Categories

Token	Purpose	Example
`REQ`	Actions/operations	`[REQ:ANALYZE]`, `[REQ:EXTRACT]`
`TARGET`	Objects/data sources	`[TARGET:TRANSCRIPT]`, `[TARGET:DOCUMENT]`
`EXTRACT`	Fields to extract	`[EXTRACT:SENTIMENT,INTENT]`
`CTX`	Contextual information	`[CTX:CUSTOMER_SERVICE]`
`OUT`	Output formats	`[OUT:JSON]`, `[OUT:TABLE]`
`REF`	References/IDs	`[REF:CASE=12345]`
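To make the token shape concrete, here is a hypothetical builder that assembles tokens in the style of the table. Only the six category names come from this README; the function `make_token` and its exact formatting rules are assumptions for illustration, not the library's implementation.

```python
# Category names come from the table above; the builder itself is hypothetical.
CATEGORIES = {"REQ", "TARGET", "EXTRACT", "CTX", "OUT", "REF"}


def make_token(category: str, *values: str, **attrs: str) -> str:
    """Build "[CATEGORY:V1,V2:KEY=VAL]" from bare values and optional attributes."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown token category: {category!r}")
    parts = [category]
    if values:
        parts.append(",".join(v.upper() for v in values))
    if attrs:
        parts.append(",".join(f"{k.upper()}={v}" for k, v in attrs.items()))
    return "[" + ":".join(parts) + "]"


print(make_token("REQ", "analyze"))                          # [REQ:ANALYZE]
print(make_token("EXTRACT", "sentiment", "intent"))          # [EXTRACT:SENTIMENT,INTENT]
print(make_token("REF", case="12345"))                       # [REF:CASE=12345]
print(make_token("TARGET", "transcript", domain="SERVICE"))  # [TARGET:TRANSCRIPT:DOMAIN=SERVICE]
```

Rejecting unknown categories up front keeps the vocabulary closed, which is what lets a downstream model rely on a small, fixed set of prefixes.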

Compression Strategy

Intent Detection: Identifies the primary action (analyze, extract, summarize)
Target Extraction: Determines the data source and domain
Pattern Recognition: Maps verbose phrases to semantic tokens
Redundancy Removal: Eliminates 95.9% redundant information (Shannon Entropy validated)
Structure Preservation: Maintains relationships between concepts
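The first three steps can be illustrated with a toy keyword matcher. This is a deliberately simplified sketch: the keyword tables are invented for the example, and the real encoder uses spaCy and a much larger vocabulary that this README does not show.

```python
# Hypothetical keyword tables for illustration; the real vocabulary is not shown here.
INTENTS = {"analyze": "ANALYZE", "extract": "EXTRACT", "summarize": "SUMMARIZE"}
TARGETS = {"transcript": "TRANSCRIPT", "document": "DOCUMENT"}
FIELDS = {"compliance": "COMPLIANCE", "sentiment": "SENTIMENT", "intent": "INTENT"}


def first_match(words, table):
    """Return the code for the first word that starts with a table key, else None."""
    for word in words:
        for key, code in table.items():
            if word.startswith(key):
                return code
    return None


def compress_prompt(prompt: str) -> str:
    words = [w.strip(".,:;").lower() for w in prompt.split()]
    tokens = []
    intent = first_match(words, INTENTS)   # step 1: intent detection
    if intent:
        tokens.append(f"[REQ:{intent}]")
    target = first_match(words, TARGETS)   # step 2: target extraction
    if target:
        tokens.append(f"[TARGET:{target}]")
    fields = []                            # step 3: map phrases to field tokens
    for word in words:
        for key, code in FIELDS.items():
            if word.startswith(key) and code not in fields:
                fields.append(code)
    if fields:
        tokens.append("[EXTRACT:" + ",".join(fields) + "]")
    return "".join(tokens)


prompt = (
    "You are a customer service quality analyst. Analyze transcripts for "
    "compliance violations and sentiment issues in agent responses."
)
print(compress_prompt(prompt))  # [REQ:ANALYZE][TARGET:TRANSCRIPT][EXTRACT:COMPLIANCE,SENTIMENT]
```

The toy version drops the `DOMAIN` and `SOURCE` attributes that the real output carries, which is where structure preservation comes in.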

📊 Performance Metrics

Based on production testing with 5,000+ samples across all three targets:

Metric	Result
Average Compression	75-92%
Validation Accuracy	91.5%
Test Pass Rate	88.2%
Processing Speed Improvement	Up to 73%
Multilingual Coverage	4 languages

Compression by Target

Target	Average Compression	Use Case
Transcripts	85-92%	Customer service calls
Structured Data	70-85%	NBA catalogs, configs
System Prompts	75-90%	Agent instructions

Real-World Example: Contact Center NBA

Original: System prompt (2,847 tokens) + NBA catalog (uncompressed)

Compressed: 966 tokens (66% reduction)
Latency: 1.88 seconds
Quality: Identical recommendations
Cost per 1000 calls: $2.40 → $0.82
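The quoted figures can be checked with a few lines of arithmetic. The sketch below treats the 2,847-token system prompt as the baseline, which is an assumption, since the uncompressed catalog size is not given.

```python
def reduction_pct(original: float, compressed: float) -> float:
    """Percentage saved when going from `original` to `compressed`."""
    return (1 - compressed / original) * 100


# Figures quoted above; the 2,847-token baseline is an assumption (see lead-in).
print(f"token reduction: {reduction_pct(2847, 966):.1f}%")   # 66.1%
print(f"cost reduction:  {reduction_pct(2.40, 0.82):.1f}%")  # 65.8%
```

Both numbers round to the stated 66%, so the token and cost figures are consistent with each other.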

🔧 API Reference

CLMConfig

CLMConfig(
    lang: str = "en",           # Language code: en, pt, es, fr
    ds_config: SDCompressionConfig = SDCompressionConfig(),  # Configuration for Structured Data compression
    sys_prompt_config: SysPromptConfig = SysPromptConfig(), # Configuration for System Prompt compression
)

CLMEncoder (for Transcripts)

encoder = CLMEncoder(cfg=CLMConfig(...))

encoder.encode(
    input_: Any,              # the transcript text
    metadata: dict = {},      # call metadata (call_id, agent, channel, ...)
    verbose: bool = True,
) -> CLMOutput

CLMEncoder (for System Prompts)

encoder = CLMEncoder(cfg=CLMConfig(...))

encoder.encode(
    input_: Any,              # the system prompt text
    verbose: bool = False,
) -> CLMOutput

CLMEncoder (for Structured Data)

encoder = CLMEncoder(cfg=CLMConfig(ds_config=SDCompressionConfig(...)))

encoder.encode(
    input_: Any,              # the records to compress, e.g. a list of dicts
    verbose: bool = False,
) -> CLMOutput

Result Objects

# All result types include:
result.compressed               # Compressed text string
result.original                 # Original token count
result.component                # Transcript, Structured Data, System Prompt
result.compression_ratio        # Ratio as decimal (0.0-1.0)
result.metadata                 # Optional: encoding details

🎓 Use Cases

1. Transcript Compression (Contact Centers)

Compress customer service conversations for analysis and AI processing:

from clm_core import CLMEncoder, CLMConfig

# Billing Issue - Mocking CX Transcript
transcript = "Customer: Hi Raj, I noticed an extra charge on my card for my plan this month. It looks like I was billed twice for the same subscription.\nAgent: I'm sorry to hear that, let’s take a look together. Can I have your account email or billing ID to verify your record?\nCustomer: Sure, it’s melissa.jordan@example.com.\nAgent: Thanks, Melissa. Give me just a moment... alright, I can see two transactions on your file — one processed on the 2nd and another on the 3rd. It seems the system retried payment even after the first one succeeded.\nCustomer: Oh wow, that explains it. So I’m not crazy then.\nAgent: Not at all. It’s a known issue we had earlier this week with duplicate processing. The good news is, you’re eligible for a full refund on the second charge.\nCustomer: Great. How long will it take to show up?\nAgent: Once I file the refund, it usually reflects within 3–5 business days depending on your bank. I’ll also send you a confirmation email with the reference number.\nCustomer: That works. Thank you for sorting it out so quickly.\nAgent: My pleasure. I’ve just submitted the refund request now — your reference number is RFD-908712. You should see that update later today.\nCustomer: Perfect. I appreciate your help, Raj.\nAgent: Anytime! Is there anything else I can check for you today?\nCustomer: No, that’s all. Thanks again!\nAgent: Thank you for calling us, Melissa. Have a great day ahead!"
cfg = CLMConfig(lang="en")
encoder = CLMEncoder(cfg=cfg)
compressed = encoder.encode(
    input_=transcript,
    metadata={'call_id': 'CX-0001', 'agent': 'Raj', 'duration': '7m', 'channel': 'voice', 'issue_type': 'Billing Dispute'},
)

↓

[CALL:SUPPORT:AGENT=Raj:DURATION=7m:CHANNEL=voice] 
[CUSTOMER] [CONTACT:EMAIL=MELISSA.JORDAN@EXAMPLE.COM] 
[ISSUE:BILLING_DISPUTE:SEVERITY=LOW] [ACTION:TROUBLESHOOT:RESULT=COMPLETED] 
[ACTION:REFUND:REFERENCE=RFD-908712:TIMELINE=TODAY:RESULT=COMPLETED] 
[RESOLUTION:RESOLVED:TIMELINE=TODAY] [SENTIMENT:NEUTRAL→SATISFIED→GRATEFUL]
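Downstream code often needs individual fields from this output, such as the refund reference. Here is a sketch of a parser, assuming the layout seen in the sample above (colon-separated parts, `KEY=VALUE` attributes, `→` between sentiment stages); `parse_token` and `parse_compressed` are hypothetical helpers, not library functions.

```python
import re


def parse_token(inner: str) -> dict:
    """Split "CATEGORY:VALUE:KEY=VAL:..." into category, bare values and attributes."""
    category, *rest = inner.split(":")
    values, attrs = [], {}
    for part in rest:
        key, sep, val = part.partition("=")
        if sep:
            attrs[key] = val
        else:
            values.append(part)
    return {"category": category, "values": values, "attrs": attrs}


def parse_compressed(text: str) -> list[dict]:
    """Parse every bracketed token in `text`, in order of appearance."""
    return [parse_token(inner) for inner in re.findall(r"\[([^\[\]]+)\]", text)]


sample = (
    "[ISSUE:BILLING_DISPUTE:SEVERITY=LOW] "
    "[ACTION:REFUND:REFERENCE=RFD-908712:TIMELINE=TODAY:RESULT=COMPLETED] "
    "[SENTIMENT:NEUTRAL→SATISFIED→GRATEFUL]"
)
tokens = parse_compressed(sample)
refund = next(t for t in tokens if t["values"] == ["REFUND"])
print(refund["attrs"]["REFERENCE"])        # RFD-908712
sentiment = next(t for t in tokens if t["category"] == "SENTIMENT")
print(sentiment["values"][0].split("→"))   # ['NEUTRAL', 'SATISFIED', 'GRATEFUL']
```

Splitting on every colon is only safe while values contain no colons themselves; timestamps or URLs would need an escaping rule that the sample does not reveal.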

2. Structured Data Compression (Enterprise)

Optimize NBA catalogs, product configs, and business rules:

from clm_core import CLMEncoder, CLMConfig
from clm_core.types import SDCompressionConfig

# Knowledge Base structured data
kb_catalog = [
    {
        "article_id": "KB-001",
        "title": "How to Reset Password",
        "content": "To reset your password, go to the login page and click...",
        "category": "Account",
        "tags": ["password", "security", "account"],
        "views": 1523,
        "last_updated": "2024-10-15",
    }
]
config = CLMConfig(
    ds_config=SDCompressionConfig(
        dataset_name="ARTICLE",
        auto_detect=True,
        required_fields=["article_id", "title"],
        field_importance={"tags": 0.8, "content": 0.9},
        max_field_length=100,  # Longer for articles
    )
)

compressor = CLMEncoder(cfg=config)
compressed = compressor.encode(kb_catalog)

↓

[KB_CATALOG:1]{ARTICLE_ID,TITLE,CONTENT,CATEGORY,VIEWS,LAST_UPDATED}
[KB-001,HOW_TO_RESET_PASSWORD,TO_RESET_YOUR_PASSWORD,GO_TO_THE_LOGIN_PAGE_AND_CLICK...,ACCOUNT,1523,2024-10-15]
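The saving here comes from stating the field names once in a header instead of repeating them in every record. The sketch below imitates that layout on a reduced record; it is not the library's algorithm, and `encode_records`, `normalize`, and the `skip_fields` parameter are hypothetical names.

```python
import re


def normalize(value) -> str:
    """Upper-case strings and join words with underscores; other scalars use str()."""
    if isinstance(value, str):
        return re.sub(r"\s+", "_", value.strip()).upper()
    return str(value)


def encode_records(name: str, records: list[dict], skip_fields=()) -> str:
    """Emit one header line with the field names, then one bracketed row per record."""
    fields = [f for f in records[0] if f not in skip_fields]
    header = (
        "[" + name.upper() + ":" + str(len(records)) + "]"
        + "{" + ",".join(f.upper() for f in fields) + "}"
    )
    rows = ["[" + ",".join(normalize(r[f]) for f in fields) + "]" for r in records]
    return "\n".join([header, *rows])


records = [
    {
        "article_id": "KB-001",
        "title": "How to Reset Password",
        "category": "Account",
        "tags": ["password", "security", "account"],
        "views": 1523,
    }
]
print(encode_records("kb_catalog", records, skip_fields=("tags",)))
```

With many records the header cost is paid once, so the relative saving grows with catalog size.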

3. System Prompt Compression (Enterprise)

Streamline agent instructions and role definitions:

from clm_core import CLMEncoder, CLMConfig

encoder = CLMEncoder(cfg=CLMConfig(lang="en"))
compressed = encoder.encode(
    "You are a Call QA & Compliance Scoring System for customer service operations.\n\nTASK:\nAnalyze the transcript and score the agent’s compliance across required QA categories.\n\nANALYSIS CRITERIA:\n\nMandatory disclosures and verification steps\n\nPolicy adherence\n\nSoft-skill behaviors (empathy, clarity, ownership)\n\nProcess accuracy\n\nCompliance violations or risks\n\nCustomer sentiment trajectory\n\nOUTPUT FORMAT:\n\n{\n  \"summary\": \"short_summary\",\n  \"qa_scores\": {\n    \"verification\": 0.0,\n    \"policy_adherence\": 0.0,\n    \"soft_skills\": 0.0,\n    \"accuracy\": 0.0,\n    \"compliance\": 0.0\n  },\n  \"violations\": [\"list_any_detected\"],\n  \"recommendations\": [\"improvement_suggestions\"]\n}\n\n\nSCORING:\n0.00–0.49: Fail\n0.50–0.74: Needs Improvement\n0.75–0.89: Good\n0.90–1.00: Excellent"
)

↓

[REQ:ANALYZE] [TARGET:TRANSCRIPT:DOMAIN=QA] 
[EXTRACT:COMPLIANCE,DISCLOSURES,VERIFICATION,POLICY,SOFT_SKILLS,ACCURACY,SENTIMENT:TYPE=LIST,DOMAIN=LEGAL] 
[OUT_JSON:{summary,qa_scores:{verification,policy_adherence,soft_skills,accuracy,compliance},violations,recommendations}:ENUMS={"ranges": [{"min": 0.0, "max": 0.49, "label": "FAIL"}, {"min": 0.5, "max": 0.74, "label": "NEEDS_IMPROVEMENT"}, {"min": 0.75, "max": 0.89, "label": "GOOD"}, {"min": 0.9, "max": 1.0, "label": "EXCELLENT"}]}]
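The `ENUMS` payload in the output is plain JSON, so a consumer can use it directly to turn a numeric score into its label. This is a minimal sketch; the helper `label_for` is hypothetical, and rounding to two decimals is an assumption made to cover the small gaps between ranges (for example between 0.49 and 0.5).

```python
import json

# The ENUMS payload copied from the compressed output above.
ENUMS = json.loads(
    '{"ranges": [{"min": 0.0, "max": 0.49, "label": "FAIL"}, '
    '{"min": 0.5, "max": 0.74, "label": "NEEDS_IMPROVEMENT"}, '
    '{"min": 0.75, "max": 0.89, "label": "GOOD"}, '
    '{"min": 0.9, "max": 1.0, "label": "EXCELLENT"}]}'
)


def label_for(score: float, ndigits: int = 2):
    """Return the label whose [min, max] range contains the rounded score, else None."""
    rounded = round(score, ndigits)
    for r in ENUMS["ranges"]:
        if r["min"] <= rounded <= r["max"]:
            return r["label"]
    return None


print(label_for(0.82))  # GOOD
print(label_for(0.30))  # FAIL
print(label_for(1.50))  # None
```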

🧪 Testing

Run the test suite:

# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest

# Run with coverage
pytest --cov=clm_core --cov-report=html

# Run specific test category
pytest tests/test_encoder.py -v

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

# Clone the repository
git clone https://github.com/YanickJar/cllm.git
cd cllm

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

📄 License

CLLM is dual-licensed to give you flexibility:

1️⃣ AGPL-3.0 (Open Source)

For open source projects, research, and evaluation, CLLM is available under the GNU Affero General Public License v3.0.

You can freely use CLLM if you:

✅ Keep your project open source (AGPL-compatible)
✅ Share all modifications and derivative works
✅ Open source any SaaS/web service that uses CLLM

Important: If you offer CLLM functionality over a network (SaaS, API, web service), the AGPL requires you to make your complete application source code available to users.

2️⃣ Commercial License

For commercial use without AGPL restrictions, we offer commercial licenses:

Commercial license includes:

✅ No requirement to open source your application
✅ Use in proprietary/closed-source products
✅ SaaS and API services without source disclosure
✅ Full patent grants for CLLM technology
✅ Priority support and consulting
✅ Custom integrations and features

Pricing:

💡 Startup: <$1M revenue - Contact for pricing
🏢 Enterprise: Custom pricing based on scale
🤝 OEM/Integration: Volume licensing available

📧 Get a commercial license: license@cllm.io

Patent Notice

CLLM includes patent-pending technology:

Application Number: [Pending]
Technology: Semantic Token Encoding for LLM Compression

Patent Grant:

AGPL-3.0 users receive a royalty-free patent license for AGPL-compliant use
Commercial licensees receive full patent rights per license agreement

For questions about patents or licensing: yanick.jair.ta@gmail.com

Which License Should I Choose?

Use Case	Recommended License
Open source project	AGPL-3.0 (Free)
Research/Academic	AGPL-3.0 (Free)
Internal tools (not distributed)	AGPL-3.0 (Free)
Closed-source product	Commercial
SaaS/API service	Commercial
Enterprise deployment	Commercial

Not sure? Contact us at yanick.jair.ta@gmail.com - we're happy to help!

🔗 Links

Documentation: docs.cllm.io (coming soon)
PyPI: pypi.org/project/clm-core
Issues: GitHub Issues
Changelog: CHANGELOG.md

💡 Citation

If you use CLLM in your research or production systems, please cite:

@software{cllm2025,
  title = {CLLM: Compressed Language Models via Semantic Token Encoding},
  author = {Andrade, Yanick},
  year = {2025},
  url = {https://github.com/YanickJar/cllm}
}

🙏 Acknowledgments

CLLM was developed to solve real-world scalability challenges in enterprise contact center operations, where high-volume LLM usage creates significant cost and infrastructure barriers.

Built with: Python, spaCy, Pydantic

Made with ❤️ for the LLM community

⭐ Star us on GitHub • 🐛 Report Bug • 💬 Discussions

