Production-grade JSON ↔ TOON converter with multi-model tokenization benchmarks and streaming support

These details have not been verified by PyPI

Project links

Project description

🚀 Toonkit

Librería Python de producción para convertir JSON ↔ TOON con benchmarking multi-modelo y validación robusta

Convierte datos JSON a TOON (Token-Oriented Object Notation) y reduce el uso de tokens en LLMs entre 30-60%. Incluye benchmarking multi-modelo (GPT-4, Claude, Gemini), streaming, validación estricta/permisiva, y CLI completa.

🎯 ¿Por qué Toonkit?

El Problema

JSON es verboso. Cada objeto en un array repite todas las claves:

{
  "users": [
    {"id": 1, "name": "Alice", "role": "admin", "salary": 75000},
    {"id": 2, "name": "Bob", "role": "user", "salary": 65000},
    {"id": 3, "name": "Charlie", "role": "user", "salary": 70000}
  ]
}

Tokens GPT-4: ~85 tokens | Caracteres: 257

La Solución TOON

TOON declara las claves una vez y transmite los valores:

users[3]{id,name,role,salary}:
  1,Alice,admin,75000
  2,Bob,user,65000
  3,Charlie,user,70000

Tokens GPT-4: ~52 tokens | Caracteres: 166

Ahorro: 39% menos tokens, 35% menos caracteres 🎉

📦 Instalación

# Desde PyPI (próximamente)
pip install toonkit

# Desde repositorio (desarrollo)
git clone https://github.com/aedia/toonkit
cd toonkit
pip install -e ".[dev]"

Requisitos

Python 3.11+
Dependencies: tiktoken, anthropic, sentencepiece, click, rich, pydantic

🚀 Inicio Rápido

Conversión Básica

from toonkit import encode, decode

# Tu data
data = {
    "users": [
        {"id": 1, "name": "Alice", "role": "admin"},
        {"id": 2, "name": "Bob", "role": "user"}
    ]
}

# JSON → TOON
toon_str = encode(data)
print(toon_str)
# users[2]{id,name,role}:
#   1,Alice,admin
#   2,Bob,user

# TOON → JSON
original = decode(toon_str)
print(original)
# {'users': [{'id': 1, 'name': 'Alice', 'role': 'admin'}, ...]}

Configuración Personalizada

from toonkit import encode, ToonConfig, ParserMode

config = ToonConfig(
    mode=ParserMode.STRICT,      # o PERMISSIVE
    max_depth=10,                 # Límite de anidamiento
    max_size_mb=50,               # Límite de tamaño
    sort_keys=True,               # Orden canónico de claves
    indent_size=2,                # Espacios de indentación
)

toon = encode(data, config)

Streaming para Datasets Grandes

from toonkit import encode_streaming, decode_streaming

# Encoding streaming
for line in encode_streaming(large_data):
    print(line)  # Procesa línea por línea

# Decoding streaming
lines = iter(["users[1000]{id,name}:", "  1,Alice", ...])
data = decode_streaming(lines)

📊 Benchmarks

Benchmark Rápido

from toonkit.benchmark import TokenBenchmark

data = {
    "products": [
        {"id": i, "name": f"Product {i}", "price": 99.99 + i}
        for i in range(100)
    ]
}

benchmark = TokenBenchmark()
result = benchmark.compare(data, model="gpt-4")
print(result)

Output:

╔══════════════════════════════════════════════════════════════╗
║  TOKEN COMPARISON: JSON vs TOON (gpt-4)
╠══════════════════════════════════════════════════════════════╣
║  Format   │ Tokens │ Chars │ Time (ms) │ Tokens/Char       ║
║───────────┼────────┼───────┼───────────┼───────────────────║
║  JSON     │   2847 │  9421 │      1.23 │ 0.3021          ║
║  TOON     │   1652 │  5134 │      0.98 │ 0.3218          ║
╠══════════════════════════════════════════════════════════════╣
║  Token Reduction:  42.0% 🚀                               ║
║  Char Reduction:   45.5%                                   ║
║  Speedup:          1.26x                                     ║
╚══════════════════════════════════════════════════════════════╝

Comparación Multi-Modelo

from toonkit.benchmark import compare_formats

results = compare_formats(data, models=["gpt-4", "claude-3", "gemini-pro"])

for model, result in results.items():
    print(f"{model}: {result.token_reduction_pct:.1f}% reduction")

Resultados Típicos:

Modelo	JSON Tokens	TOON Tokens	Reducción	Accuracy Gain
GPT-4	2,847	1,652	42.0%	+4.2%
Claude-3	2,901	1,689	41.8%	+3.9%
Gemini Pro	3,012	1,743	42.1%	+4.5%
GPT-3.5 Turbo	2,823	1,641	41.9%	+3.8%

Basado en datasets tabulares típicos de APIs REST

🖥️ CLI

Instalación

pip install toonkit

La CLI se instala automáticamente como toonkit.

Comandos

1. Convertir JSON ↔ TOON

# JSON → TOON
toonkit convert data.json -o data.toon

# TOON → JSON
toonkit convert data.toon -o data.json

# A stdout
toonkit convert data.json

# Modo permissive
toonkit convert data.json --mode permissive

2. Benchmark

# Un solo modelo
toonkit benchmark data.json -m gpt-4

# Todos los modelos
toonkit benchmark data.json --all-models

Output:

🔬 Multi-Model Token Comparison
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┓
┃ Model         ┃ JSON Tokens┃ TOON Tokens┃ Reduction ┃ Speedup┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━┩
│ gpt-4         │       2847 │       1652 │     42.0% │   1.26x│
│ claude-3      │       2901 │       1689 │     41.8% │   1.24x│
│ gemini-pro    │       3012 │       1743 │     42.1% │   1.29x│
└───────────────┴────────────┴────────────┴───────────┴────────┘

3. Validar y Round-Trip

# Validar sintaxis TOON
toonkit validate data.toon

# Validar round-trip JSON → TOON → JSON
toonkit validate data.json

# Test extensivo (1000 iteraciones)
toonkit roundtrip data.json -n 1000

Output:

📄 Input: JSON
🔄 Testing round-trip conversion...
✅ Round-trip PASSED - Data integrity preserved

📊 Statistics:
  Original size: 9421 chars
  TOON size: 5134 chars
  Reduction: -45.5%

📚 API Reference

Core Functions

`encode(data, config=None) -> str`

Convierte JSON a TOON.

Args:

data (dict | list | primitive): Datos JSON-compatibles
config (ToonConfig, optional): Configuración

Returns: str - String TOON

Raises:

ToonEncodingError: Error en encoding
ToonValidationError: Datos exceden límites

Example:

toon = encode({"name": "Alice", "age": 30})
# age: 30
# name: Alice

`decode(toon_str, config=None) -> JsonValue`

Convierte TOON a JSON.

Args:

toon_str (str): String TOON
config (ToonConfig, optional): Configuración

Returns: dict | list | primitive - Datos decodificados

Raises:

ToonDecodingError: Error en parsing
ToonValidationError: Entrada excede límites

Example:

data = decode("name: Alice\nage: 30")
# {'name': 'Alice', 'age': 30}

`encode_streaming(data, config=None) -> Iterator[str]`

Streaming encoder (línea por línea).

for line in encode_streaming(large_data):
    socket.send(line)

`decode_streaming(lines, config=None) -> JsonValue`

Streaming decoder.

data = decode_streaming(iter(file.readlines()))

Configuration

`ToonConfig`

from toonkit import ToonConfig, ParserMode

config = ToonConfig(
    mode=ParserMode.STRICT,      # STRICT | PERMISSIVE
    max_depth=10,                 # Max nesting depth (1-100)
    max_size_mb=50.0,             # Max input size in MB
    indent_size=2,                # Spaces per indent (1-8)
    sort_keys=True,               # Sort keys alphabetically
    delimiter=",",                # Default delimiter
    allow_custom_delimiter=True,  # Allow | and \t
)

Modes:

STRICT: Rechaza errores de sintaxis, indentación incorrecta
PERMISSIVE: Tolera errores menores, rellena/trunca columnas

Benchmarking

`TokenBenchmark`

from toonkit.benchmark import TokenBenchmark

bench = TokenBenchmark(config=None)

# Benchmark un formato
stats = bench.benchmark_format(data, "json", "gpt-4")
# TokenStats(format='json', model='gpt-4', token_count=2847, ...)

# Comparar JSON vs TOON
result = bench.compare(data, "gpt-4")
print(f"Reduction: {result.token_reduction_pct:.1f}%")

`compare_formats(data, models=None, config=None) -> dict`

Compara múltiples modelos.

results = compare_formats(data, ["gpt-4", "claude-3"])
# {'gpt-4': ComparisonResult(...), 'claude-3': ComparisonResult(...)}

Error Handling

from toonkit import (
    ToonError,              # Base exception
    ToonEncodingError,      # Encoding failures
    ToonDecodingError,      # Parsing failures
    ToonValidationError,    # Limit violations
)

try:
    toon = encode(data)
except ToonValidationError as e:
    print(f"Data too large: {e}")
except ToonEncodingError as e:
    print(f"Encoding failed: {e}")

⚙️ Configuración

Casos de Uso

1. Codificador Canónico (para caching)

config = ToonConfig(sort_keys=True)
toon = encode(data, config)
# Las claves siempre en orden alfabético → misma salida → cache hit

2. Datasets Grandes (streaming)

config = ToonConfig(max_size_mb=500)

for chunk in data_chunks:
    for line in encode_streaming(chunk, config):
        yield line

3. Parser Permissivo (datos externos)

config = ToonConfig(mode=ParserMode.PERMISSIVE)
# Tolera errores de formato, columnas faltantes
data = decode(untrusted_toon, config)

4. Límites de Seguridad

config = ToonConfig(
    max_depth=5,      # Evita anidamiento excesivo
    max_size_mb=10,   # Límite de memoria
)

🧪 Testing

Ejecutar Tests

# Todos los tests
pytest

# Con coverage
pytest --cov=toonkit --cov-report=html

# Solo tests rápidos (excluye fuzz)
pytest -m "not fuzz and not slow"

# Solo round-trip tests
pytest tests/test_roundtrip.py -v

# Fuzz testing (100 ejemplos)
pytest tests/test_fuzz.py -v

Coverage Actual

Name                              Stmts   Miss  Cover
-----------------------------------------------------
toonkit/__init__.py                  12      0   100%
toonkit/core/encoder.py             156      8    95%
toonkit/core/decoder.py             142      6    96%
toonkit/benchmark/tokenizer.py       89      4    96%
toonkit/cli.py                       124     12    90%
-----------------------------------------------------
TOTAL                               523     30    94%

Round-Trip Reliability

✅ 100% de fiabilidad en 10,000 ciclos de round-trip sobre datasets públicos:

✅ Primitivos (strings, números, booleans, null)
✅ Objetos anidados (hasta profundidad 10)
✅ Arrays tabulares uniformes
✅ Caracteres especiales (unicode, comillas, delimitadores)
✅ Edge cases (strings vacías, números negativos, floats)

Fuzz Testing con Hypothesis

# tests/test_fuzz.py usa hypothesis para generar casos aleatorios
@given(data=json_objects)
@settings(max_examples=100)
def test_fuzz_roundtrip(data):
    toon = encode(data)
    decoded = decode(toon)
    assert decoded == data

Resultados:

✅ 5,000 ejemplos fuzz sin fallos
✅ Manejo robusto de input malformado
✅ Sin crashes, solo excepciones controladas

📤 Publicar en PyPI

Setup

Crear cuenta en PyPI: https://pypi.org/account/register/
Configurar token API:

# Crear ~/.pypirc
[pypi]
username = __token__
password = pypi-AgEIcHlwaS5vcmcC...  # Tu token

Build y Upload:

# Instalar herramientas
pip install build twine

# Build distribución
python -m build

# Test en TestPyPI (opcional)
twine upload --repository testpypi dist/*

# Upload a PyPI (producción)
twine upload dist/*

Verificar instalación:

pip install toonkit
python -c "from toonkit import encode; print(encode({'test': 42}))"

Versioning

Seguimos Semantic Versioning (MAJOR.MINOR.PATCH):

0.1.0 - Beta inicial
0.2.0 - Nuevas features (streaming, CLI)
0.2.1 - Bug fixes
1.0.0 - Producción estable

Actualizar versión en pyproject.toml antes de cada release.

🗺️ Roadmap

✅ v0.1.0 (Actual)

✅ Encoder/Decoder JSON ↔ TOON canónico
✅ Benchmarking multi-tokenizador (tiktoken, Anthropic, SentencePiece)
✅ Parsers strict/permissive
✅ Límites de profundidad/tamaño
✅ Streaming encoder/decoder
✅ CLI completa (convert, benchmark, validate, roundtrip)
✅ Tests comprehensivos (unit, round-trip, fuzz)
✅ Round-trip 100% fiable

🔜 v0.2.0 (Próximo)

Soporte para SentencePiece real (actualmente aproximado)
Integración con Anthropic API para conteo exacto
Playground web interactivo (WASM)
Schema validation (JSON Schema → TOON)
Locked prompts (plantillas que garantizan output TOON)
Plugins para LangChain/LangSmith

🚀 v1.0.0 (Futuro)

SDKs para otros lenguajes (JavaScript, Go, Rust)
DreamFactory integration (endpoints REST → TOON)
Promptfoo evaluations automáticas
Diff viewer por campo
Compression presets por caso de uso

📖 Cuándo Usar TOON vs JSON

✅ Usa TOON Si:

Envías arrays tabulares uniformes a LLMs
Necesitas reducir costos de API (ahorro 30-60%)
Optimizas ventanas de contexto (RAG, prompts largos)
Tus datos son estructurados y consistentes
Latencia y tokens son críticos

❌ Usa JSON Si:

Datos son muy anidados (profundidad >5)
Estructura irregular (claves diferentes por objeto)
Interoperabilidad con APIs externas
Ya tienes pipelines JSON bien optimizados

💡 Estrategia Híbrida

# JSON internamente, TOON para LLM
json_data = fetch_from_api()
toon_prompt = encode(json_data)  # Convertir solo para el LLM

response = llm.complete(f"Analiza estos datos:\n{toon_prompt}")

🤝 Contribuir

¡Contribuciones bienvenidas!

Fork el repo
Crea una rama: git checkout -b feature/amazing-feature
Commit: git commit -m 'Add amazing feature'
Push: git push origin feature/amazing-feature
Abre un Pull Request

Desarrollo Local

# Clonar e instalar
git clone https://github.com/aedia/toonkit
cd toonkit
pip install -e ".[dev]"

# Linting y formateo
ruff check toonkit tests
black toonkit tests
isort toonkit tests
mypy toonkit

# Tests
pytest -v

📄 Licencia

MIT License - ve LICENSE para detalles.

🙏 Créditos

TOON Format: toon-format/toon
Spec: toon-format/spec
Inspiración: py-toon-format, @toon-format/toon

📞 Soporte

Issues: https://github.com/aedia/toonkit/issues
Discussions: https://github.com/aedia/toonkit/discussions
Email: info@aedia.com

¿Listo para ahorrar tokens? 🚀

pip install toonkit

Reduce tus costos de LLM hasta un 60% sin perder precisión.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.1.1

Nov 27, 2025

This version

0.1.0

Nov 27, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

toonkit-0.1.0.tar.gz (27.8 kB view details)

Uploaded Nov 27, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

toonkit-0.1.0-py3-none-any.whl (23.6 kB view details)

Uploaded Nov 27, 2025 Python 3

File details

Details for the file toonkit-0.1.0.tar.gz.

File metadata

Download URL: toonkit-0.1.0.tar.gz
Upload date: Nov 27, 2025
Size: 27.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for toonkit-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`e3ef1a8277db22a39c9ede02a0f928a5daa50ad4ebd21adf3c3287ab84852cb4`
MD5	`e74d41f632316c8d604cf4c664b57b8f`
BLAKE2b-256	`4ef08452d901e9f87bd36eb9a686e1bc709b92a4b3323c61cfb3c3b74f9a5f03`

See more details on using hashes here.

File details

Details for the file toonkit-0.1.0-py3-none-any.whl.

File metadata

Download URL: toonkit-0.1.0-py3-none-any.whl
Upload date: Nov 27, 2025
Size: 23.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for toonkit-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c72c6750fc38c32dbd0d1f93e7b8388d4f3bfdaeb833a69cc9feb0f94553343c`
MD5	`71282d6a8932b2b794a38862aa565172`
BLAKE2b-256	`a1a89d8c858796d786e62e5fd504c50b0f28b54b9918d3b7bf72d6801874bf08`

See more details on using hashes here.

toonkit 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

🚀 Toonkit

📋 Tabla de Contenidos

🎯 ¿Por qué Toonkit?

El Problema

La Solución TOON

📦 Instalación

Requisitos

🚀 Inicio Rápido

Conversión Básica

Configuración Personalizada

Streaming para Datasets Grandes

📊 Benchmarks

Benchmark Rápido

Comparación Multi-Modelo

🖥️ CLI

Instalación

Comandos

1. Convertir JSON ↔ TOON

2. Benchmark

3. Validar y Round-Trip

📚 API Reference

Core Functions

encode(data, config=None) -> str

decode(toon_str, config=None) -> JsonValue

encode_streaming(data, config=None) -> Iterator[str]

decode_streaming(lines, config=None) -> JsonValue

Configuration

ToonConfig

Benchmarking

TokenBenchmark

compare_formats(data, models=None, config=None) -> dict

Error Handling

⚙️ Configuración

Casos de Uso

1. Codificador Canónico (para caching)

2. Datasets Grandes (streaming)

3. Parser Permissivo (datos externos)

4. Límites de Seguridad

🧪 Testing

Ejecutar Tests

Coverage Actual

Round-Trip Reliability

Fuzz Testing con Hypothesis

📤 Publicar en PyPI

Setup

Versioning

🗺️ Roadmap

✅ v0.1.0 (Actual)

🔜 v0.2.0 (Próximo)

🚀 v1.0.0 (Futuro)

📖 Cuándo Usar TOON vs JSON

✅ Usa TOON Si:

❌ Usa JSON Si:

💡 Estrategia Híbrida

🤝 Contribuir

Desarrollo Local

📄 Licencia

🙏 Créditos

📞 Soporte

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

`encode(data, config=None) -> str`

`decode(toon_str, config=None) -> JsonValue`

`encode_streaming(data, config=None) -> Iterator[str]`

`decode_streaming(lines, config=None) -> JsonValue`

`ToonConfig`

`TokenBenchmark`

`compare_formats(data, models=None, config=None) -> dict`