Dictionary-based text compression optimized for short strings and URLs

wrdz

Dictionary-based compression for short text strings and URLs.

Key Features

  • Optimized for Short Text: Designed for strings under 140 characters
  • URL Compression: Special dictionary trained on URL patterns
  • UTF-8 Support: Handles any valid UTF-8 text
  • Domain Adaptable: Train custom dictionaries for your use case
  • Pure Python: No external runtime dependencies

Quick Start

pip install wrdz

from wrdz import compress, decompress

# English text compression
text = "The quick brown fox jumps over the lazy dog"
compressed = compress(text)
original_size = len(text.encode("utf-8"))  # len(text) counts characters, not bytes
print(f"Original: {original_size} bytes")
print(f"Compressed: {len(compressed)} bytes")
print(f"Ratio: {len(compressed) / original_size:.2%}")

# URL compression
from wrdz import compress_urls, decompress_urls

url = "https://github.com/pjwerneck/wrdz"
compressed = compress_urls(url)
url_size = len(url.encode("utf-8"))  # byte length, not character count
print(f"Original: {url_size} bytes")
print(f"Compressed: {len(compressed)} bytes")
print(f"Ratio: {len(compressed) / url_size:.2%}")

Training Custom Dictionaries

from wrdz.train import train_dictionary, save_dictionary

# Train on domain-specific text
cbook, dbook = train_dictionary(
    text="your training data",
    max_sub_len=4,    # Max sequence length
    dict_size=8192    # Dictionary entries
)

# Save for reuse
save_dictionary(cbook, dbook, "domain.dict")

# Use in compression
from wrdz.base import base_compress, base_decompress

compressed = base_compress("text", cbook)
original = base_decompress(compressed, dbook)
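
To build intuition for what max_sub_len and dict_size control, here is a minimal sketch of frequency-based dictionary selection. It is a toy illustration, not wrdz's training code: the function toy_train and its scoring rule (frequency times byte length, ignoring overlaps between candidates) are assumptions made for this example.

```python
from collections import Counter


def toy_train(text: str, max_sub_len: int = 4, dict_size: int = 8) -> list[str]:
    """Pick the substrings that look most worth a dictionary slot (toy version)."""
    counts: Counter[str] = Counter()
    # Count every substring of length 1..max_sub_len.
    for n in range(1, max_sub_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1

    # Assumed scoring rule: bytes a sequence would cover across the corpus.
    def score(seq: str) -> int:
        return counts[seq] * len(seq.encode("utf-8"))

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(counts, key=lambda seq: (-score(seq), seq))
    return ranked[:dict_size]


entries = toy_train("the cat and the hat and the bat", max_sub_len=4, dict_size=5)
print(entries)
```

A larger max_sub_len lets longer sequences compete for slots, while dict_size caps how many survive. A real trainer would also have to account for overlapping candidates and for the cost of the codes it assigns.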

Compression Benchmarks

The tables below show compression ratios (compressed size divided by original size, so lower is better) for different dictionary sizes and maximum sequence lengths. The Δ% column shows the relative improvement over smaz, which serves as the baseline.
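
The Δ% values in the tables are consistent with a relative size reduction against smaz. The helper below is an inferred reading of the tables, not a formula stated by the project, and the name improvement_pct is hypothetical.

```python
def improvement_pct(wrdz_ratio: float, smaz_ratio: float) -> float:
    """Relative improvement of wrdz over smaz, in percent (inferred formula)."""
    return (smaz_ratio - wrdz_ratio) / smaz_ratio * 100


# Values taken from the first rows of the en_US (short) and URL tables.
print(f"{improvement_pct(0.671, 0.907):+.1f}")  # +26.0
print(f"{improvement_pct(0.552, 0.830):+.1f}")  # +33.5
```

A negative result means smaz produced the smaller output for that configuration.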

US English

The en_US dictionary is trained on a 1M-line subset of the cnn_dailymail dataset.

Dict Size  Max Seq  Short wrdz  Short smaz  Short Δ%  Long wrdz  Long smaz  Long Δ%
16384      4        0.671       0.907       +26.0     0.521      0.621      +16.1
8192       4        0.704       0.907       +22.4     0.526      0.621      +15.4
4096       4        0.736       0.907       +18.9     0.540      0.621      +13.1
2048       4        0.799       0.907       +11.9     0.559      0.621      +10.1
1024       4        0.867       0.907       +4.4      0.591      0.621      +4.8
512        4        0.906       0.907       +0.1      0.627      0.621      -1.0
256        4        0.919       0.907       -1.3      0.651      0.621      -4.9

URLs

The urls dictionary is trained on the ada-url dataset.

Dict Size  Max Seq  wrdz Ratio  smaz Ratio  Improvement %
8192       4        0.552       0.830       +33.5
16384      4        0.552       0.830       +33.5
4096       4        0.562       0.830       +32.2
2048       4        0.587       0.830       +29.2
1024       4        0.611       0.830       +26.3
512        4        0.641       0.830       +22.7
256        4        0.666       0.830       +19.8

Technical Details

  1. Dictionary Training

    • Analyzes frequency of character sequences in training data
    • Selects sequences that maximize compression
    • Assigns variable-length binary codes based on frequency

  2. Compression Format

    • 1-bit flag per token (dictionary/raw)
    • Variable-length dictionary codes (4-14 bits)
    • Raw UTF-8 encoding with 2-bit length prefix
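
The format described above can be sketched with a toy encoder and decoder. This is an illustration under stated assumptions, not wrdz's actual bit layout: the codebook CBOOK is made up, the flag polarity (1 for dictionary, 0 for raw) is a guess, the 2-bit prefix is assumed to store the UTF-8 byte count minus one, and bits are kept as a string of "0" and "1" characters rather than packed into bytes.

```python
# Hypothetical toy codebook: sequence -> prefix-free bit code.
CBOOK = {"the ": "0000", "and ": "0001", "at": "00100", "qu": "00101"}
DBOOK = {code: seq for seq, code in CBOOK.items()}
MAX_SUB_LEN = max(len(seq) for seq in CBOOK)


def encode(text: str) -> str:
    bits = []
    i = 0
    while i < len(text):
        # Greedy match: try the longest dictionary sequence first.
        for n in range(min(MAX_SUB_LEN, len(text) - i), 0, -1):
            code = CBOOK.get(text[i:i + n])
            if code is not None:
                bits.append("1" + code)  # flag 1 = dictionary token
                i += n
                break
        else:
            # No match: emit one character as raw UTF-8 (1 to 4 bytes).
            raw = text[i].encode("utf-8")
            bits.append("0" + format(len(raw) - 1, "02b"))  # flag 0 + length
            bits.extend(format(byte, "08b") for byte in raw)
            i += 1
    return "".join(bits)


def decode(bits: str) -> str:
    out = []
    i = 0
    while i < len(bits):
        flag, i = bits[i], i + 1
        if flag == "1":
            # Extend the window until it matches a (prefix-free) code.
            for j in range(i + 1, len(bits) + 1):
                if bits[i:j] in DBOOK:
                    break
            else:
                raise ValueError("truncated or unknown dictionary code")
            out.append(DBOOK[bits[i:j]])
            i = j
        else:
            n = int(bits[i:i + 2], 2) + 1  # number of UTF-8 bytes
            i += 2
            raw = bytes(int(bits[k:k + 8], 2) for k in range(i, i + 8 * n, 8))
            out.append(raw.decode("utf-8"))
            i += 8 * n
    return "".join(out)


sample = "the cat and the café"
encoded = encode(sample)
assert decode(encoded) == sample
print(len(encoded), "bits vs", 8 * len(sample.encode("utf-8")), "raw bits")
```

In this sketch a raw ASCII character costs 11 bits (flag, length prefix, one byte), more than its plain 8 bits, so the scheme only pays off when the dictionary covers most of the input.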

License

MIT License. See LICENSE file for details.
