Concise Python SDK
Official Python client for Concise - Token compression for LLMs.
Full-stack LLM cost optimization platform
- Input compression: up to 50% reduction
- Output optimization: 60-70% reduction
- Combined: about 70% total cost savings
Works with GPT-4, Claude, Gemini, and other LLMs.
Installation
pip install concise-sdk
Quick Start
Direct Compression API
from concise import Concise
client = Concise(api_key="your-api-key")
result = client.compress(
    "Your long prompt here...",
    level="auto"
)
print(f"Original: {result.original_tokens} tokens")
print(f"Compressed: {result.compressed_tokens} tokens")
print(f"Saved: {result.tokens_saved} tokens ({(1-result.compression_ratio)*100:.1f}%)")
print(f"Compressed text: {result.compressed_text}")
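The fields printed above relate to each other by simple arithmetic. The following is a minimal sketch of that relationship, assuming `compression_ratio` means compressed tokens divided by original tokens (the `1 - compression_ratio` expression suggests this, but the source does not state it). `savings_summary` is a hypothetical helper, not part of the SDK.

```python
def savings_summary(original_tokens: int, compressed_tokens: int) -> dict:
    """Derive savings figures from two token counts (illustrative, not SDK code)."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    # Assumption: ratio = compressed / original, so 1 - ratio is the fraction saved.
    ratio = compressed_tokens / original_tokens
    return {
        "tokens_saved": original_tokens - compressed_tokens,
        "compression_ratio": ratio,
        "percent_saved": (1 - ratio) * 100,
    }


summary = savings_summary(original_tokens=200, compressed_tokens=120)
print(f"Saved: {summary['tokens_saved']} tokens ({summary['percent_saved']:.1f}%)")
```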
OpenAI Drop-in Replacement
Replace your OpenAI import with Concise for automatic compression:
# Before:
# from openai import OpenAI
# After:
from concise import OpenAI
client = OpenAI(api_key="your-concise-key")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    compression_enabled=True,  # Automatic token compression
    compression_level="balanced"
)
print(response.choices[0].message.content)
Features
Input Optimization (Compression)
- Direct Compression API - Compress any text before sending to LLMs
- OpenAI Drop-in - Replace from openai import OpenAI with from concise import OpenAI
- Automatic Strategy Selection - Detects Python code vs natural language
- GPU-Accelerated - 285ms compression time (or instant with caching)
- Zero Context Loss - Preserves semantic meaning
Output Optimization (NEW - TALE)
- Token Budget Prompting - Reduce output tokens by 60-70%
- Budget Estimation - 3 strategies: fixed, zero_shot, adaptive
- Output Validation - Check if LLM stayed within budget
- LLM-Agnostic - Works with all models (GPT-4, Claude, Gemini, etc.)
- Quality Retention - 95%+ accuracy maintained
Developer Experience
- Type Hints - Full type annotations for better IDE support
- Error Handling - Comprehensive exception types
Compression Levels
| Level | Reduction | Use Case |
|---|---|---|
| auto | 30-50% | Automatic strategy (recommended) |
| aggressive | 50% | Maximum compression, natural language |
| balanced | 30% | Good trade-off |
| conservative | 20% | Light compression, preserve structure |
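To get a feel for what each level means for a given prompt, the nominal reductions in the table can be turned into a rough size estimate. This is only a sketch: `LEVEL_REDUCTION_PCT` and `estimate_compressed_tokens` are hypothetical names, and the actual reduction will depend on the text.

```python
# Nominal reductions (in percent) copied from the table; "auto" is a range.
LEVEL_REDUCTION_PCT = {
    "auto": (30, 50),
    "aggressive": (50, 50),
    "balanced": (30, 30),
    "conservative": (20, 20),
}


def estimate_compressed_tokens(original_tokens: int, level: str) -> tuple:
    """Return a (low, high) estimate of the token count after compression."""
    if level not in LEVEL_REDUCTION_PCT:
        raise ValueError(f"unknown level: {level!r}")
    min_pct, max_pct = LEVEL_REDUCTION_PCT[level]
    # The largest reduction leaves the smallest remaining token count.
    low = original_tokens * (100 - max_pct) // 100
    high = original_tokens * (100 - min_pct) // 100
    return low, high


for level in LEVEL_REDUCTION_PCT:
    print(level, estimate_compressed_tokens(1000, level))
```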
Examples
Full-Stack Optimization (Input + Output)
Combine compression and TALE for maximum savings:
from concise import Concise
from openai import OpenAI
client = Concise(api_key="your-api-key")
llm = OpenAI()  # openai>=1.0 client; reads OPENAI_API_KEY from the environment
prompt = "Write a function to implement binary search"
# 1. Compress INPUT tokens (up to 50% reduction)
compressed = client.compress(prompt, level="auto")
print(f"Input: {compressed.tokens_saved} tokens saved")
# 2. Optimize for OUTPUT tokens (60-70% reduction)
optimized = client.optimize_for_output(
    compressed.compressed_text,
    strategy="fixed"
)
print(f"Expected output budget: {optimized.estimated_budget} tokens")
# 3. Send to the LLM (openai.ChatCompletion.create was removed in openai 1.0)
response = llm.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": optimized.optimized_prompt}]
)
# 4. Validate savings
validation = client.validate_output(
    output=response.choices[0].message.content,
    budget=optimized.estimated_budget
)
print(f"Output: {validation.tokens_saved} tokens saved")
# Result: ~70% total cost reduction!
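How close a workload gets to the ~70% figure depends on its mix of input and output tokens and on their prices. The arithmetic can be sketched as follows. The token counts and per-token prices are placeholders chosen for illustration, and `combined_savings` is a hypothetical helper, not an SDK function.

```python
def combined_savings(input_tokens, output_tokens, input_price, output_price,
                     input_reduction, output_reduction):
    """Fraction of total cost saved when input and output are both reduced."""
    baseline = input_tokens * input_price + output_tokens * output_price
    if baseline <= 0:
        raise ValueError("baseline cost must be positive")
    optimized = (input_tokens * (1 - input_reduction) * input_price
                 + output_tokens * (1 - output_reduction) * output_price)
    return 1 - optimized / baseline


# Placeholder workload: equal token counts, output priced at 3x input.
saving = combined_savings(
    input_tokens=1000, output_tokens=1000,
    input_price=1.0, output_price=3.0,
    input_reduction=0.5, output_reduction=0.7,
)
print(f"Total cost reduction: {saving:.0%}")
```

With these placeholder numbers the total comes out at 65%. A workload dominated by output tokens moves toward the output reduction, and one dominated by input tokens moves toward the input reduction.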
Output Optimization (TALE)
Reduce output tokens by 60-70%:
from concise import Concise
client = Concise(api_key="your-api-key")
# Optimize prompt to reduce output
result = client.optimize_for_output(
    "Explain how binary search works",
    strategy="fixed"  # or "zero_shot", "adaptive"
)
print(f"Estimated budget: {result.estimated_budget} tokens")
print(f"Optimized prompt: {result.optimized_prompt}")
# Send to your LLM...
# The LLM will generate 60-70% fewer tokens while maintaining quality
Strategies:
- fixed: Fast heuristic (70% confidence, <10ms)
- zero_shot: LLM self-estimation (85% confidence, 1 extra call)
- adaptive: User history-based (85% confidence)
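The source does not describe how the budget is computed or how the prompt is rewritten. As a rough illustration of token budget prompting in general, the sketch below estimates a budget with a simple length-based heuristic and appends it to the prompt as an instruction. Both functions, the heuristic, and the instruction wording are assumptions made for this example, not the SDK's behaviour.

```python
def fixed_budget(prompt: str, tokens_per_word: int = 20, minimum: int = 50) -> int:
    """Hypothetical heuristic: scale the budget with the prompt's word count."""
    word_count = len(prompt.split())
    return max(minimum, word_count * tokens_per_word)


def add_token_budget(prompt: str, budget: int) -> str:
    """Append a budget instruction to the prompt (wording is an assumption)."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return f"{prompt}\n\nUse at most {budget} tokens."


question = "Explain how binary search works"
budget = fixed_budget(question)
print(add_token_budget(question, budget))
```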
Python Code Compression
from concise import Concise
client = Concise(api_key="your-api-key")
code = """
def fibonacci(n):
    '''Calculate fibonacci number'''
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
"""
result = client.compress(code, level="auto")
# Strategy: token_compression_code
# Reduction: 39%
# Time: 27ms
Natural Language Compression
result = client.compress(
    "FastAPI is a modern, fast web framework for building APIs with Python 3.8+",
    level="aggressive"
)
# Strategy: token_compression_text
# Reduction: 50%
# Time: 285ms (or 0ms if cached)
Using with OpenAI
from concise import OpenAI
client = OpenAI(api_key="your-concise-key")
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "system",
            "content": "You are a Python expert. Help users write clean, efficient code."
        },
        {
            "role": "user",
            "content": "Write a function to validate email addresses using regex"
        }
    ],
    compression_enabled=True,
    compression_level="balanced"
)
print(response.choices[0].message.content)
Context Manager
from concise import Concise
with Concise(api_key="your-api-key") as client:
    result = client.compress("Long text here...")
    print(f"Saved {result.tokens_saved} tokens")
Environment Variable
Set the CONCISE_API_KEY environment variable in your shell:
export CONCISE_API_KEY=your-api-key
Then create the client without passing a key:
from concise import Concise
# API key loaded from environment
client = Concise()
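If you want the same lookup in your own configuration code, it can be sketched as below. `resolve_api_key` is a hypothetical helper, and the rule that an explicit argument takes precedence over the environment is an assumption; the source does not say how the SDK resolves conflicts.

```python
import os


def resolve_api_key(api_key=None, env=None):
    """Return an explicit key if given, else CONCISE_API_KEY from the environment."""
    env = os.environ if env is None else env
    # Assumption: an explicitly passed key wins over the environment variable.
    key = api_key or env.get("CONCISE_API_KEY")
    if not key:
        raise ValueError("No API key: pass api_key or set CONCISE_API_KEY")
    return key


print(resolve_api_key(env={"CONCISE_API_KEY": "your-api-key"}))
```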
Error Handling
from concise import Concise, AuthenticationError, APIError, RateLimitError
client = Concise(api_key="your-api-key")
try:
    result = client.compress("text")
except AuthenticationError:
    print("Invalid API key")
except RateLimitError:
    print("Rate limit exceeded")
except APIError as e:
    print(f"API error: {e} (status: {e.status_code})")
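Rather than only printing a message on a rate limit, a caller may prefer to retry. A minimal sketch of exponential backoff follows. To keep it runnable without the SDK it defines its own stand-in `RateLimitError`; `with_retries` and its parameters are hypothetical and not part of the SDK.

```python
import time


class RateLimitError(Exception):
    """Stand-in for concise.RateLimitError so the sketch runs without the SDK."""


def with_retries(call, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run call(), retrying on RateLimitError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)  # delay doubles on each retry
```

With the real SDK this would wrap the call as `with_retries(lambda: client.compress("text"))`, importing `RateLimitError` from `concise` instead of defining the stand-in.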
Performance
| Type | Strategy | Reduction | Time |
|---|---|---|---|
| Python code | python-minifier | 39% | 27ms |
| Natural language | LLMLingua-2 GPU | 50% | 285ms |
| Cached requests | Cache hit | 50% | 0ms |
Caching
Concise automatically caches compression results:
- First request: GPU compression (285ms)
- Repeated requests: Instant (0ms)
- 240,000x speedup for cached requests
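The source does not say where this cache lives or how it is keyed. To illustrate the general idea, here is a sketch of a client-side cache keyed on a hash of the level and text. `CompressionCache` is a hypothetical class, and the whitespace-collapsing function stands in for a real compression call.

```python
import hashlib


class CompressionCache:
    """Memoize a compress function by (level, text); illustrative only."""

    def __init__(self, compress_fn):
        self._compress_fn = compress_fn
        self._store = {}
        self.hits = 0
        self.misses = 0

    def compress(self, text, level="auto"):
        # Hash level and text together so different levels never share an entry.
        key = hashlib.sha256(f"{level}\x00{text}".encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._compress_fn(text, level)
        self._store[key] = result
        return result


# Stand-in for a real compression call: collapse runs of whitespace.
demo_cache = CompressionCache(lambda text, level: " ".join(text.split()))
demo_cache.compress("Long   text   here")
demo_cache.compress("Long   text   here")  # served from the cache
print(f"hits={demo_cache.hits} misses={demo_cache.misses}")
```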
API Reference
Concise
Main client for direct compression API.
__init__(api_key, base_url, timeout)
Initialize client.
Parameters:
- api_key (str, optional): Your Concise API key
- base_url (str, optional): API base URL (default: https://api.concise.dev/v1)
- timeout (int, optional): Request timeout in seconds (default: 30)
compress(text, level)
Compress text to reduce token count.
Parameters:
- text (str): Text to compress
- level (str): Compression level ("auto", "aggressive", "balanced", "conservative")
Returns:
CompressionResult: Object with compression metrics
optimize_for_output(prompt, strategy, target_budget)
Optimize prompt to reduce output tokens using TALE.
Parameters:
- prompt (str): Prompt to optimize
- strategy (str): Estimation strategy ("fixed", "zero_shot", "adaptive")
- target_budget (int, optional): Manual token budget override
Returns:
TALEOptimizeResult: Optimized prompt and budget info
validate_output(output, budget, tolerance)
Validate that LLM output stayed within token budget.
Parameters:
- output (str): The LLM's generated output
- budget (int): Token budget from optimize_for_output
- tolerance (float): Allow the output to exceed the budget by this percentage (default: 0.2)
Returns:
TALEValidateResult: Compliance status and metrics
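The check itself is not specified in the source. One plausible reading is sketched below, assuming a tolerance of 0.2 means the output may run 20% over budget and that `exceeded_by` is measured against the tolerance-adjusted limit. `BudgetCheck` and `check_budget` are hypothetical names, and the function takes a token count directly so that no tokenizer is needed.

```python
from dataclasses import dataclass


@dataclass
class BudgetCheck:
    within_budget: bool
    actual_tokens: int
    budget_tokens: int
    max_allowed_tokens: int
    budget_utilization: float
    exceeded_by: int


def check_budget(actual_tokens: int, budget: int, tolerance: float = 0.2) -> BudgetCheck:
    """Compare a token count against a budget with some allowed overshoot."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    # Assumption: tolerance is a fraction, so 0.2 allows 20% above the budget.
    max_allowed = round(budget * (1 + tolerance))
    return BudgetCheck(
        within_budget=actual_tokens <= max_allowed,
        actual_tokens=actual_tokens,
        budget_tokens=budget,
        max_allowed_tokens=max_allowed,
        budget_utilization=actual_tokens / budget,
        # Assumption: overshoot is counted from the tolerance-adjusted limit.
        exceeded_by=max(0, actual_tokens - max_allowed),
    )


print(check_budget(actual_tokens=110, budget=100))
```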
health()
Check API health status.
Returns:
dict: Status and version info
OpenAI
OpenAI-compatible client with automatic compression.
chat.completions.create()
Create chat completion with compression.
Additional Parameters:
- compression_enabled (bool): Enable compression (default: True)
- compression_level (str): Compression level (default: "auto")
Types
CompressionResult
from dataclasses import dataclass
from typing import Optional
@dataclass
class CompressionResult:
    original_text: str
    compressed_text: str
    original_tokens: int
    compressed_tokens: int
    tokens_saved: int
    compression_ratio: float
    strategy: str
    compression_time_ms: float
    cache_hit: Optional[bool]
TALEOptimizeResult
@dataclass
class TALEOptimizeResult:
    optimized_prompt: str
    original_prompt: str
    estimated_budget: int
    budget_metadata: dict
    prompt_additions: dict
TALEValidateResult
@dataclass
class TALEValidateResult:
    within_budget: bool
    actual_tokens: int
    budget_tokens: int
    max_allowed_tokens: int
    budget_utilization: float
    tokens_saved: int
    exceeded_by: int
Requirements
- Python 3.8+
- httpx>=0.25.0
Getting Your API Key
- Sign up at concise.dev
- Create an API key in the dashboard
- Use the key with this SDK
Support
- Documentation: docs.concise.dev
- Issues: github.com/concise/python-sdk/issues
- Email: support@concise.dev
License
MIT License - see LICENSE file for details