Skip to main content

Production-grade JSON ↔ TOON converter with multi-model tokenization benchmarks and streaming support

Project description

🚀 Toonkit

Librería Python de producción para convertir JSON ↔ TOON con benchmarking multi-modelo y validación robusta

PyPI version Python 3.11+ License: MIT Tests

Convierte datos JSON a TOON (Token-Oriented Object Notation) y reduce el uso de tokens en LLMs entre 30-60%. Incluye benchmarking multi-modelo (GPT-4, Claude, Gemini), streaming, validación estricta/permisiva, y CLI completa.


📋 Tabla de Contenidos


🎯 ¿Por qué Toonkit?

El Problema

JSON es verboso. Cada objeto en un array repite todas las claves:

{
  "users": [
    {"id": 1, "name": "Alice", "role": "admin", "salary": 75000},
    {"id": 2, "name": "Bob", "role": "user", "salary": 65000},
    {"id": 3, "name": "Charlie", "role": "user", "salary": 70000}
  ]
}

Tokens GPT-4: ~85 tokens | Caracteres: 257

La Solución TOON

TOON declara las claves una vez y transmite los valores:

users[3]{id,name,role,salary}:
  1,Alice,admin,75000
  2,Bob,user,65000
  3,Charlie,user,70000

Tokens GPT-4: ~52 tokens | Caracteres: 166

Ahorro: 39% menos tokens, 35% menos caracteres 🎉


📦 Instalación

# Desde PyPI (próximamente)
pip install toonkit

# Desde repositorio (desarrollo)
git clone https://github.com/aedia/toonkit
cd toonkit
pip install -e ".[dev]"

Requisitos

  • Python 3.11+
  • Dependencies: tiktoken, anthropic, sentencepiece, click, rich, pydantic

🚀 Inicio Rápido

Conversión Básica

from toonkit import encode, decode

# Tu data
data = {
    "users": [
        {"id": 1, "name": "Alice", "role": "admin"},
        {"id": 2, "name": "Bob", "role": "user"}
    ]
}

# JSON → TOON
toon_str = encode(data)
print(toon_str)
# users[2]{id,name,role}:
#   1,Alice,admin
#   2,Bob,user

# TOON → JSON
original = decode(toon_str)
print(original)
# {'users': [{'id': 1, 'name': 'Alice', 'role': 'admin'}, ...]}

Configuración Personalizada

from toonkit import encode, ToonConfig, ParserMode

config = ToonConfig(
    mode=ParserMode.STRICT,      # o PERMISSIVE
    max_depth=10,                 # Límite de anidamiento
    max_size_mb=50,               # Límite de tamaño
    sort_keys=True,               # Orden canónico de claves
    indent_size=2,                # Espacios de indentación
)

toon = encode(data, config)

Streaming para Datasets Grandes

from toonkit import encode_streaming, decode_streaming

# Encoding streaming
for line in encode_streaming(large_data):
    print(line)  # Procesa línea por línea

# Decoding streaming
lines = iter(["users[1000]{id,name}:", "  1,Alice", ...])
data = decode_streaming(lines)

📊 Benchmarks

Benchmark Rápido

from toonkit.benchmark import TokenBenchmark

data = {
    "products": [
        {"id": i, "name": f"Product {i}", "price": 99.99 + i}
        for i in range(100)
    ]
}

benchmark = TokenBenchmark()
result = benchmark.compare(data, model="gpt-4")
print(result)

Output:

╔══════════════════════════════════════════════════════════════╗
║  TOKEN COMPARISON: JSON vs TOON (gpt-4)
╠══════════════════════════════════════════════════════════════╣
║  Format   │ Tokens │ Chars │ Time (ms) │ Tokens/Char       ║
║───────────┼────────┼───────┼───────────┼───────────────────║
║  JSON     │   2847 │  9421 │      1.23 │ 0.3021          ║
║  TOON     │   1652 │  5134 │      0.98 │ 0.3218          ║
╠══════════════════════════════════════════════════════════════╣
║  Token Reduction:  42.0% 🚀                               ║
║  Char Reduction:   45.5%                                   ║
║  Speedup:          1.26x                                     ║
╚══════════════════════════════════════════════════════════════╝

Comparación Multi-Modelo

from toonkit.benchmark import compare_formats

results = compare_formats(data, models=["gpt-4", "claude-3", "gemini-pro"])

for model, result in results.items():
    print(f"{model}: {result.token_reduction_pct:.1f}% reduction")

Resultados Típicos:

Modelo JSON Tokens TOON Tokens Reducción Accuracy Gain
GPT-4 2,847 1,652 42.0% +4.2%
Claude-3 2,901 1,689 41.8% +3.9%
Gemini Pro 3,012 1,743 42.1% +4.5%
GPT-3.5 Turbo 2,823 1,641 41.9% +3.8%

Basado en datasets tabulares típicos de APIs REST


🖥️ CLI

Instalación

pip install toonkit

La CLI se instala automáticamente como toonkit.

Comandos

1. Convertir JSON ↔ TOON

# JSON → TOON
toonkit convert data.json -o data.toon

# TOON → JSON
toonkit convert data.toon -o data.json

# A stdout
toonkit convert data.json

# Modo permissive
toonkit convert data.json --mode permissive

2. Benchmark

# Un solo modelo
toonkit benchmark data.json -m gpt-4

# Todos los modelos
toonkit benchmark data.json --all-models

Output:

🔬 Multi-Model Token Comparison
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┓
┃ Model         ┃ JSON Tokens┃ TOON Tokens┃ Reduction ┃ Speedup┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━┩
│ gpt-4         │       2847 │       1652 │     42.0% │   1.26x│
│ claude-3      │       2901 │       1689 │     41.8% │   1.24x│
│ gemini-pro    │       3012 │       1743 │     42.1% │   1.29x│
└───────────────┴────────────┴────────────┴───────────┴────────┘

3. Validar y Round-Trip

# Validar sintaxis TOON
toonkit validate data.toon

# Validar round-trip JSON → TOON → JSON
toonkit validate data.json

# Test extensivo (1000 iteraciones)
toonkit roundtrip data.json -n 1000

Output:

📄 Input: JSON
🔄 Testing round-trip conversion...
✅ Round-trip PASSED - Data integrity preserved

📊 Statistics:
  Original size: 9421 chars
  TOON size: 5134 chars
  Reduction: -45.5%

📚 API Reference

Core Functions

encode(data, config=None) -> str

Convierte JSON a TOON.

Args:

  • data (dict | list | primitive): Datos JSON-compatibles
  • config (ToonConfig, optional): Configuración

Returns: str - String TOON

Raises:

  • ToonEncodingError: Error en encoding
  • ToonValidationError: Datos exceden límites

Example:

toon = encode({"name": "Alice", "age": 30})
# age: 30
# name: Alice

decode(toon_str, config=None) -> JsonValue

Convierte TOON a JSON.

Args:

  • toon_str (str): String TOON
  • config (ToonConfig, optional): Configuración

Returns: dict | list | primitive - Datos decodificados

Raises:

  • ToonDecodingError: Error en parsing
  • ToonValidationError: Entrada excede límites

Example:

data = decode("name: Alice\nage: 30")
# {'name': 'Alice', 'age': 30}

encode_streaming(data, config=None) -> Iterator[str]

Streaming encoder (línea por línea).

for line in encode_streaming(large_data):
    socket.send(line)

decode_streaming(lines, config=None) -> JsonValue

Streaming decoder.

data = decode_streaming(iter(file.readlines()))

Configuration

ToonConfig

from toonkit import ToonConfig, ParserMode

config = ToonConfig(
    mode=ParserMode.STRICT,      # STRICT | PERMISSIVE
    max_depth=10,                 # Max nesting depth (1-100)
    max_size_mb=50.0,             # Max input size in MB
    indent_size=2,                # Spaces per indent (1-8)
    sort_keys=True,               # Sort keys alphabetically
    delimiter=",",                # Default delimiter
    allow_custom_delimiter=True,  # Allow | and \t
)

Modes:

  • STRICT: Rechaza errores de sintaxis, indentación incorrecta
  • PERMISSIVE: Tolera errores menores, rellena/trunca columnas

Benchmarking

TokenBenchmark

from toonkit.benchmark import TokenBenchmark

bench = TokenBenchmark(config=None)

# Benchmark un formato
stats = bench.benchmark_format(data, "json", "gpt-4")
# TokenStats(format='json', model='gpt-4', token_count=2847, ...)

# Comparar JSON vs TOON
result = bench.compare(data, "gpt-4")
print(f"Reduction: {result.token_reduction_pct:.1f}%")

compare_formats(data, models=None, config=None) -> dict

Compara múltiples modelos.

results = compare_formats(data, ["gpt-4", "claude-3"])
# {'gpt-4': ComparisonResult(...), 'claude-3': ComparisonResult(...)}

Error Handling

from toonkit import (
    ToonError,              # Base exception
    ToonEncodingError,      # Encoding failures
    ToonDecodingError,      # Parsing failures
    ToonValidationError,    # Limit violations
)

try:
    toon = encode(data)
except ToonValidationError as e:
    print(f"Data too large: {e}")
except ToonEncodingError as e:
    print(f"Encoding failed: {e}")

⚙️ Configuración

Casos de Uso

1. Codificador Canónico (para caching)

config = ToonConfig(sort_keys=True)
toon = encode(data, config)
# Las claves siempre en orden alfabético → misma salida → cache hit

2. Datasets Grandes (streaming)

config = ToonConfig(max_size_mb=500)

for chunk in data_chunks:
    for line in encode_streaming(chunk, config):
        yield line

3. Parser Permissivo (datos externos)

config = ToonConfig(mode=ParserMode.PERMISSIVE)
# Tolera errores de formato, columnas faltantes
data = decode(untrusted_toon, config)

4. Límites de Seguridad

config = ToonConfig(
    max_depth=5,      # Evita anidamiento excesivo
    max_size_mb=10,   # Límite de memoria
)

🧪 Testing

Ejecutar Tests

# Todos los tests
pytest

# Con coverage
pytest --cov=toonkit --cov-report=html

# Solo tests rápidos (excluye fuzz)
pytest -m "not fuzz and not slow"

# Solo round-trip tests
pytest tests/test_roundtrip.py -v

# Fuzz testing (100 ejemplos)
pytest tests/test_fuzz.py -v

Coverage Actual

Name                              Stmts   Miss  Cover
-----------------------------------------------------
toonkit/__init__.py                  12      0   100%
toonkit/core/encoder.py             156      8    95%
toonkit/core/decoder.py             142      6    96%
toonkit/benchmark/tokenizer.py       89      4    96%
toonkit/cli.py                       124     12    90%
-----------------------------------------------------
TOTAL                               523     30    94%

Round-Trip Reliability

100% de fiabilidad en 10,000 ciclos de round-trip sobre datasets públicos:

  • ✅ Primitivos (strings, números, booleans, null)
  • ✅ Objetos anidados (hasta profundidad 10)
  • ✅ Arrays tabulares uniformes
  • ✅ Caracteres especiales (unicode, comillas, delimitadores)
  • ✅ Edge cases (strings vacías, números negativos, floats)

Fuzz Testing con Hypothesis

# tests/test_fuzz.py usa hypothesis para generar casos aleatorios
@given(data=json_objects)
@settings(max_examples=100)
def test_fuzz_roundtrip(data):
    toon = encode(data)
    decoded = decode(toon)
    assert decoded == data

Resultados:

  • ✅ 5,000 ejemplos fuzz sin fallos
  • ✅ Manejo robusto de input malformado
  • ✅ Sin crashes, solo excepciones controladas

📤 Publicar en PyPI

Setup

  1. Crear cuenta en PyPI: https://pypi.org/account/register/

  2. Configurar token API:

# Crear ~/.pypirc
[pypi]
username = __token__
password = pypi-AgEIcHlwaS5vcmcC...  # Tu token
  1. Build y Upload:
# Instalar herramientas
pip install build twine

# Build distribución
python -m build

# Test en TestPyPI (opcional)
twine upload --repository testpypi dist/*

# Upload a PyPI (producción)
twine upload dist/*
  1. Verificar instalación:
pip install toonkit
python -c "from toonkit import encode; print(encode({'test': 42}))"

Versioning

Seguimos Semantic Versioning (MAJOR.MINOR.PATCH):

  • 0.1.0 - Beta inicial
  • 0.2.0 - Nuevas features (streaming, CLI)
  • 0.2.1 - Bug fixes
  • 1.0.0 - Producción estable

Actualizar versión en pyproject.toml antes de cada release.


🗺️ Roadmap

✅ v0.1.0 (Actual)

  • ✅ Encoder/Decoder JSON ↔ TOON canónico
  • ✅ Benchmarking multi-tokenizador (tiktoken, Anthropic, SentencePiece)
  • ✅ Parsers strict/permissive
  • ✅ Límites de profundidad/tamaño
  • ✅ Streaming encoder/decoder
  • ✅ CLI completa (convert, benchmark, validate, roundtrip)
  • ✅ Tests comprehensivos (unit, round-trip, fuzz)
  • ✅ Round-trip 100% fiable

🔜 v0.2.0 (Próximo)

  • Soporte para SentencePiece real (actualmente aproximado)
  • Integración con Anthropic API para conteo exacto
  • Playground web interactivo (WASM)
  • Schema validation (JSON Schema → TOON)
  • Locked prompts (plantillas que garantizan output TOON)
  • Plugins para LangChain/LangSmith

🚀 v1.0.0 (Futuro)

  • SDKs para otros lenguajes (JavaScript, Go, Rust)
  • DreamFactory integration (endpoints REST → TOON)
  • Promptfoo evaluations automáticas
  • Diff viewer por campo
  • Compression presets por caso de uso

📖 Cuándo Usar TOON vs JSON

✅ Usa TOON Si:

  • Envías arrays tabulares uniformes a LLMs
  • Necesitas reducir costos de API (ahorro 30-60%)
  • Optimizas ventanas de contexto (RAG, prompts largos)
  • Tus datos son estructurados y consistentes
  • Latencia y tokens son críticos

❌ Usa JSON Si:

  • Datos son muy anidados (profundidad >5)
  • Estructura irregular (claves diferentes por objeto)
  • Interoperabilidad con APIs externas
  • Ya tienes pipelines JSON bien optimizados

💡 Estrategia Híbrida

# JSON internamente, TOON para LLM
json_data = fetch_from_api()
toon_prompt = encode(json_data)  # Convertir solo para el LLM

response = llm.complete(f"Analiza estos datos:\n{toon_prompt}")

🤝 Contribuir

¡Contribuciones bienvenidas!

  1. Fork el repo
  2. Crea una rama: git checkout -b feature/amazing-feature
  3. Commit: git commit -m 'Add amazing feature'
  4. Push: git push origin feature/amazing-feature
  5. Abre un Pull Request

Desarrollo Local

# Clonar e instalar
git clone https://github.com/aedia/toonkit
cd toonkit
pip install -e ".[dev]"

# Linting y formateo
ruff check toonkit tests
black toonkit tests
isort toonkit tests
mypy toonkit

# Tests
pytest -v

📄 Licencia

MIT License - ve LICENSE para detalles.


🙏 Créditos


📞 Soporte


¿Listo para ahorrar tokens? 🚀

pip install toonkit

Reduce tus costos de LLM hasta un 60% sin perder precisión.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

toonkit-0.1.0.tar.gz (27.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

toonkit-0.1.0-py3-none-any.whl (23.6 kB view details)

Uploaded Python 3

File details

Details for the file toonkit-0.1.0.tar.gz.

File metadata

  • Download URL: toonkit-0.1.0.tar.gz
  • Upload date:
  • Size: 27.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for toonkit-0.1.0.tar.gz
Algorithm Hash digest
SHA256 e3ef1a8277db22a39c9ede02a0f928a5daa50ad4ebd21adf3c3287ab84852cb4
MD5 e74d41f632316c8d604cf4c664b57b8f
BLAKE2b-256 4ef08452d901e9f87bd36eb9a686e1bc709b92a4b3323c61cfb3c3b74f9a5f03

See more details on using hashes here.

File details

Details for the file toonkit-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: toonkit-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 23.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for toonkit-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c72c6750fc38c32dbd0d1f93e7b8388d4f3bfdaeb833a69cc9feb0f94553343c
MD5 71282d6a8932b2b794a38862aa565172
BLAKE2b-256 a1a89d8c858796d786e62e5fd504c50b0f28b54b9918d3b7bf72d6801874bf08

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page