
Rust-accelerated structured data encoder for LLM token compression

CLM Encoder

Structured Data Encoder

Rust-Accelerated SDE for LLMs


Compress structured data into compact token sequences — 40–85% fewer tokens, no model retraining, no heavy dependencies.


sd-encoder is the standalone Structured Data Encoder from CLM, compiled in Rust and exposed as a Python extension. It encodes dicts, lists, and nested objects into compact token sequences that LLMs interpret with equal or better accuracy at a fraction of the token cost.

Install it on its own if you only need structured data encoding — no spaCy, no NLP stack, no unnecessary overhead.

Input             Typical compression
Product catalogs  55–85%
Knowledge bases   40–75%
Business rules    50–80%
API responses     45–70%

Installation

pip install sd-encoder

No additional downloads required. Pre-built wheels are available for Linux (x86_64, aarch64), macOS (Intel, Apple Silicon), and Windows.


Quick Start

from sd_encoder import SDEncoderV2, SDCompressionConfig

config = SDCompressionConfig(preserve_structure=True, auto_detect=True)
encoder = SDEncoderV2(config)

catalog = [
    {"article_id": "KB-001", "title": "Reset Password", "content": "To reset your password...", "tags": ["security"]},
    {"article_id": "KB-002", "title": "Update Billing",  "content": "To update your billing...",  "tags": ["billing"]},
]

result = encoder.encode_validated(catalog)
print(result.compressed)
# {article_id,title,content,tags}[KB-001,Reset Password,To reset your password...,security][KB-002,Update Billing,To update your billing...,billing]

print(f"{result.compression_ratio():.1f}% reduction")
print(f"{result.n_tokens()} → {result.c_tokens()} tokens")

Configuration

SDCompressionConfig controls field selection, truncation, and structure preservation. All parameters are optional.

from sd_encoder import SDCompressionConfig, FieldImportance

config = SDCompressionConfig(
    # Field selection
    required_fields=["id", "title", "status"],      # always include these
    excluded_fields=["internal_notes", "raw_log"],  # always drop these
    drop_non_required_fields=False,                 # if True, emit only required_fields

    # Importance filtering
    auto_detect=True,                               # infer importance from field name/value
    importance_threshold=FieldImportance.MEDIUM,    # drop fields below this level
    field_importance={                              # explicit overrides
        "summary": FieldImportance.HIGH,
        "version": FieldImportance.LOW,
    },

    # Truncation
    max_truncation_length=300,                      # global string truncation
    max_truncation_mapping={                        # per-field truncation
        "description": 150,
        "content": 500,
    },

    # Structure
    preserve_structure=True,                        # encode nested objects inline
    default_fields_order=["id", "title", "status"], # pin ordering of known fields
)
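
To make the truncation options concrete, here is a minimal pure-Python sketch of how a global limit and a per-field mapping could interact. It is not the Rust implementation: sketch_truncate is a hypothetical helper, and two details are assumptions, namely that a per-field limit overrides the global one and that no ellipsis is appended.

```python
from typing import Any, Dict, Optional


def sketch_truncate(
    record: Dict[str, Any],
    max_length: Optional[int] = None,
    per_field: Optional[Dict[str, int]] = None,
) -> Dict[str, Any]:
    """Illustrative only: cut string values to a per-field or global limit."""
    per_field = per_field or {}
    out = {}
    for key, value in record.items():
        # Assumption: a per-field limit takes precedence over the global one.
        limit = per_field.get(key, max_length)
        if isinstance(value, str) and limit is not None:
            value = value[:limit]
        out[key] = value
    return out


row = {"id": 7, "description": "x" * 400, "content": "y" * 400}
cut = sketch_truncate(row, max_length=300, per_field={"description": 150, "content": 500})
print(len(cut["description"]), len(cut["content"]))  # 150 400
```

Non-string values such as the integer id pass through untouched in this sketch.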

Field Importance

FieldImportance defines the levels used by importance_threshold and by auto-detection. Values are ordered: NEVER < LOW < MEDIUM < HIGH < CRITICAL.

from sd_encoder import FieldImportance

FieldImportance.LOW      # drop when filtering
FieldImportance.MEDIUM   # include by default
FieldImportance.HIGH     # always include unless explicitly excluded
FieldImportance.CRITICAL # never drop (ids, names, titles)

# Comparable
FieldImportance.HIGH >= FieldImportance.MEDIUM  # True
int(FieldImportance.HIGH)                       # 3

Auto-detection applies heuristics to field names and values when auto_detect=True:

Pattern                     Detected importance
id, uuid, name, title       CRITICAL
status, priority, details   HIGH
description, type, channel  MEDIUM
source, version, metadata   LOW
_*, *_at, *_date            NEVER
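
The name patterns in the table can be approximated with a small stand-alone sketch. This is an illustration, not the library's detector: Importance, RULES, sketch_detect, and sketch_filter are hypothetical names, only name matching is covered (the real heuristics also look at values), and the glob-style matching, the numeric levels other than HIGH = 3, and the MEDIUM default for unknown names are assumptions.

```python
from enum import IntEnum
from fnmatch import fnmatchcase


class Importance(IntEnum):
    """Stand-in for FieldImportance; numeric values are assumed."""
    NEVER = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Patterns copied from the table above, checked in order; first match wins.
RULES = [
    (("_*", "*_at", "*_date"), Importance.NEVER),
    (("id", "uuid", "name", "title"), Importance.CRITICAL),
    (("status", "priority", "details"), Importance.HIGH),
    (("description", "type", "channel"), Importance.MEDIUM),
    (("source", "version", "metadata"), Importance.LOW),
]


def sketch_detect(field: str, default: Importance = Importance.MEDIUM) -> Importance:
    """Illustrative only: map a field name to an importance level."""
    for patterns, level in RULES:
        if any(fnmatchcase(field, p) for p in patterns):
            return level
    return default  # assumption: unknown names fall back to MEDIUM


def sketch_filter(record: dict, threshold: Importance) -> dict:
    """Keep fields whose detected importance is at or above the threshold."""
    return {k: v for k, v in record.items() if sketch_detect(k) >= threshold}


ticket = {"id": "T-42", "status": "open", "version": 3, "created_at": "2024-01-01"}
print(sketch_filter(ticket, Importance.MEDIUM))
# {'id': 'T-42', 'status': 'open'}
```

With a MEDIUM threshold, version (LOW) and created_at (NEVER) fall below the cut and are dropped.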

Output

encode_validated runs compression, strips redundant whitespace from the result, and falls back to the original input if the compressed output is larger.

result = encoder.encode_validated(data)

result.compressed        # str — the encoded token sequence
result.original          # original input, returned as Python dict/list
result.component         # "ds_compression"
result.n_tokens()        # estimated token count of original
result.c_tokens()        # estimated token count of compressed
result.compression_ratio()  # float — percentage reduction

# Validate manually if needed
result = encoder.encode(data)
result.validate_compression_ratio()  # fall back to original if compressed is larger
result.validate_compressed()         # strip redundant whitespace

Use encode directly when you want to inspect the output before deciding whether to validate.
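
The validation step can be pictured with a short pure-Python sketch. It only mirrors the behavior described above; estimate_tokens, reduction_percent, and sketch_validate are hypothetical helpers, and the four-characters-per-token estimate is an assumption (the estimator the library uses is not documented here).

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough stand-in estimator: about four characters per token (assumption)."""
    return max(1, math.ceil(len(text) / 4))


def reduction_percent(original: str, compressed: str) -> float:
    """Percentage reduction in estimated tokens, as a float."""
    n, c = estimate_tokens(original), estimate_tokens(compressed)
    return (1 - c / n) * 100


def sketch_validate(original: str, compressed: str) -> str:
    """Collapse repeated whitespace, then fall back if the result is larger."""
    cleaned = " ".join(compressed.split())
    if estimate_tokens(cleaned) > estimate_tokens(original):
        return original
    return cleaned


original = '{"id": 1, "name": "Laptop", "status": "active"}'
compressed = "{id,name,status}[1,Laptop,active]"
print(sketch_validate(original, compressed))   # the compressed string is kept
print(reduction_percent(original, compressed))  # 25.0 with this estimator
```

Very small inputs are where the fallback matters most, since the header can cost more than it saves.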


Benchmarks

Run the Rust load benchmarks with:

make bench

The load benchmark measures:

  • encode_payload_size: latency and rows/sec for catalog-style table payloads and nested ticket payloads at 10, 100, 1,000, and 5,000 rows.
  • encode_parallel_load: aggregate throughput with 1, 2, 4, and 8 concurrent workers encoding 100-row payloads.

Criterion writes detailed reports under target/criterion/. Use the thrpt line for capacity estimates and the time interval for latency bounds on the tested machine.

For a compact table that answers "how long does compression take and what ratio do I get?", run:

make profile-load

This prints average latency, p95 latency, estimated original/compressed tokens, compression ratio, and compressed bytes for flat records, catalog tables, nested ticket bundles, and API-style responses.
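
For readers who want comparable numbers from Python, the following sketch shows one way to compute average and p95 latency over repeated calls. It does not reproduce what make profile-load does internally; summarize and time_calls are hypothetical helpers, the nearest-rank percentile is one of several common definitions, and the workload is a stand-in.

```python
import statistics
import time
from typing import Callable, Dict, List


def summarize(samples_ms: List[float]) -> Dict[str, float]:
    """Average and nearest-rank p95 of latency samples in milliseconds."""
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil of 0.95 * n, 1-based
    return {"avg_ms": statistics.fmean(ordered), "p95_ms": ordered[rank - 1]}


def time_calls(fn: Callable[[], object], runs: int = 200) -> Dict[str, float]:
    """Time `fn` repeatedly with a monotonic high-resolution clock."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return summarize(samples)


# Stand-in workload; swap in a call such as encoder.encode_validated(payload).
payload = [{"id": i, "name": f"item-{i}"} for i in range(100)]
print(time_calls(lambda: str(payload)))
```

Results from a loop like this include Python call overhead, so they are not directly comparable with the Criterion reports.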


Encoding Examples

Single object:

encoder.encode_validated({"id": "T-42", "title": "Login fails", "status": "open", "priority": "high"})
# {id,title,status,priority}[T-42,Login fails,open,high]

Nested object:

encoder.encode_validated({
    "user": {"id": "U-1", "name": "Ana"},
    "ticket": {"id": "T-42", "status": "open"}
})
# {user:{id,name},ticket:{id,status}}[U-1,Ana][T-42,open]
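
The layout of the nested output can be reproduced by a few lines of plain Python. This is only a reading aid for the format shown above: sketch_encode_nested is a hypothetical helper that handles one shape (a dict whose values are all flat dicts), and how the encoder treats mixed or deeper structures is not covered.

```python
from typing import Any, Dict


def sketch_encode_nested(obj: Dict[str, Dict[str, Any]]) -> str:
    """Illustrative only: handles a dict whose values are all flat dicts."""
    # Header: each top-level key followed by the keys of its nested object.
    header = ",".join(
        key + ":{" + ",".join(child) + "}" for key, child in obj.items()
    )
    # Body: one bracketed group of values per nested object, in header order.
    body = "".join(
        "[" + ",".join(str(v) for v in child.values()) + "]"
        for child in obj.values()
    )
    return "{" + header + "}" + body


print(sketch_encode_nested({
    "user": {"id": "U-1", "name": "Ana"},
    "ticket": {"id": "T-42", "status": "open"},
}))
# {user:{id,name},ticket:{id,status}}[U-1,Ana][T-42,open]
```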

List of dicts (table encoding):

encoder.encode_validated([
    {"id": 1, "name": "Laptop",  "status": "active"},
    {"id": 2, "name": "Monitor", "status": "active"},
])
# {id,name,status}[1,Laptop,active][2,Monitor,active]
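
Table encoding saves tokens because field names appear once in the header instead of once per row. The sketch below reproduces the two flat outputs shown in this section. It is an illustration rather than the library's encoder: sketch_encode_table is a hypothetical helper, it assumes every record shares the keys of the first one, and it does no escaping of commas or brackets inside values.

```python
from typing import Any, Dict, List, Union

Record = Dict[str, Any]


def sketch_encode_table(data: Union[Record, List[Record]]) -> str:
    """Illustrative only: one shared header, then one bracketed row per record."""
    rows = [data] if isinstance(data, dict) else list(data)
    if not rows:
        return ""
    # Assumption: every record has the same keys as the first one.
    fields = list(rows[0])
    header = "{" + ",".join(fields) + "}"
    body = "".join(
        "[" + ",".join(str(row.get(f, "")) for f in fields) + "]" for row in rows
    )
    return header + body


print(sketch_encode_table([
    {"id": 1, "name": "Laptop", "status": "active"},
    {"id": 2, "name": "Monitor", "status": "active"},
]))
# {id,name,status}[1,Laptop,active][2,Monitor,active]
```

A single dict is treated as a one-row table, which matches the single-object example.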

With field filtering:

config = SDCompressionConfig(
    required_fields=["id", "title"],
    drop_non_required_fields=True,
)
encoder = SDEncoderV2(config)
encoder.encode_validated({"id": 1, "title": "Test", "internal_log": "...", "raw": "..."})
# {id,title}[1,Test]
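
The selection rules behind this example can be sketched in plain Python. sketch_select is a hypothetical helper, not part of sd-encoder, and the choice to let excluded fields win over required ones when a name appears in both lists is an assumption.

```python
from typing import Any, Dict, Iterable


def sketch_select(
    record: Dict[str, Any],
    required: Iterable[str] = (),
    excluded: Iterable[str] = (),
    drop_non_required: bool = False,
) -> Dict[str, Any]:
    """Illustrative only: apply excluded/required rules to one record."""
    required_set, excluded_set = set(required), set(excluded)
    kept = {}
    for key, value in record.items():
        if key in excluded_set:
            continue  # assumption: exclusion wins over everything else
        if drop_non_required and key not in required_set:
            continue  # emit only the required fields
        kept[key] = value
    return kept


record = {"id": 1, "title": "Test", "internal_log": "...", "raw": "..."}
print(sketch_select(record, required=["id", "title"], drop_non_required=True))
# {'id': 1, 'title': 'Test'}
```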

Relationship to CLM

sd-encoder is the engine behind the Structured Data encoder in CLM. If you need thread or system prompt encoding alongside structured data, install the full library instead:

pip install clm-core

sd-encoder is the right choice when:

  • You only need structured data encoding
  • You want to avoid the spaCy dependency
  • You're deploying in a constrained environment
  • You're integrating encoding into a Rust or polyglot pipeline

License

Dual-licensed:

  • AGPL-3.0 — free for open source use (LICENSE-AGPL)
  • Commercial — for proprietary products and SaaS (contact)


