Dictionary-based text compression optimized for short strings and URLs
wrdz
Dictionary-based compression for short text strings and URLs.
Key Features
- Optimized for Short Text: Designed for strings under 140 characters
- URL Compression: Special dictionary trained on URL patterns
- UTF-8 Support: Handles any valid UTF-8 text
- Domain Adaptable: Train custom dictionaries for your use case
- Pure Python: No external dependencies for runtime
Quick Start
pip install wrdz
from wrdz import compress, decompress
# English text compression
text = "The quick brown fox jumps over the lazy dog"
compressed = compress(text)
print(f"Original: {len(text)} bytes")
print(f"Compressed: {len(compressed)} bytes")
print(f"Ratio: {len(compressed)/len(text):.2%}")
# URL compression
from wrdz import compress_urls, decompress_urls
url = "https://github.com/pjwerneck/wrdz"
compressed = compress_urls(url)
print(f"Original: {len(url)} bytes")
print(f"Compressed: {len(compressed)} bytes")
print(f"Ratio: {len(compressed)/len(url):.2%}")
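The snippets above divide by `len(text)`, which counts characters. For ASCII input that equals the byte count, but for other UTF-8 text it does not. A small helper that measures the original size in encoded bytes may give more comparable numbers. The sketch below is not part of wrdz: `compression_ratio` and `zlib_compress` are hypothetical names, and zlib only stands in as a compressor so the example runs without wrdz installed.

```python
import zlib
from typing import Callable


def compression_ratio(text: str, compress_fn: Callable[[str], bytes]) -> float:
    """Return compressed size divided by the UTF-8 size of the input."""
    original_size = len(text.encode("utf-8"))  # bytes, not characters
    if original_size == 0:
        raise ValueError("cannot compute a ratio for empty text")
    return len(compress_fn(text)) / original_size


def zlib_compress(text: str) -> bytes:
    # Stand-in compressor (an assumption for this sketch, not the wrdz API).
    return zlib.compress(text.encode("utf-8"))


text = "héllo wörld " * 20
print(f"characters: {len(text)}, bytes: {len(text.encode('utf-8'))}")
print(f"zlib ratio: {compression_ratio(text, zlib_compress):.2%}")
```

Assuming `compress` returns a bytes-like object, as the Quick Start output suggests, it could be passed in place of `zlib_compress`.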
Training Custom Dictionaries
from wrdz.train import train_dictionary, save_dictionary
# Train on domain-specific text
cbook, dbook = train_dictionary(
    text="your training data",
    max_sub_len=4,   # Max sequence length
    dict_size=8192,  # Dictionary entries
)
# Save for reuse
save_dictionary(cbook, dbook, "domain.dict")
# Use in compression
from wrdz.base import base_compress, base_decompress
compressed = base_compress("text", cbook)
original = base_decompress(compressed, dbook)
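To give a rough feel for what `max_sub_len` and `dict_size` control, here is a toy sketch of frequency-based selection. It is an illustration under assumptions, not the wrdz training code: `count_substrings` and `select_entries` are hypothetical helpers, and the score (occurrences times length) is a naive stand-in for whatever wrdz actually optimizes.

```python
from collections import Counter


def count_substrings(text: str, max_sub_len: int) -> Counter:
    """Count every substring of length 1..max_sub_len."""
    counts: Counter = Counter()
    for start in range(len(text)):
        for length in range(1, max_sub_len + 1):
            if start + length > len(text):
                break
            counts[text[start:start + length]] += 1
    return counts


def select_entries(text: str, max_sub_len: int, dict_size: int) -> list[str]:
    """Keep the dict_size substrings with the largest naive savings score."""
    counts = count_substrings(text, max_sub_len)
    # Naive score: occurrences times characters covered.
    # Ties break on the substring itself so the result is deterministic.
    ranked = sorted(counts, key=lambda s: (-counts[s] * len(s), s))
    return ranked[:dict_size]


entries = select_entries("abababab", max_sub_len=2, dict_size=3)
print(entries)  # ['ab', 'ba', 'a']
```

In this toy model, a larger `max_sub_len` lets longer sequences compete for a slot, while a larger `dict_size` keeps more of the ranked list.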
Compression Benchmarks
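The tables below show compression ratios (compressed size divided by original
size, so lower is better) for different dictionary sizes and maximum sequence
lengths, with smaz as the baseline. The Δ% and Improvement % columns show how
much smaller the wrdz output is relative to the smaz output.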
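The improvement figures appear to be the relative reduction of the wrdz ratio against the smaz ratio. That formula is an assumption inferred from the published numbers: it reproduces the table entries to within 0.1, and the small differences are presumably because the tables were computed from unrounded ratios. `improvement_pct` is a hypothetical helper name.

```python
def improvement_pct(wrdz_ratio: float, smaz_ratio: float) -> float:
    """Relative size reduction of wrdz versus smaz, in percent."""
    return (smaz_ratio - wrdz_ratio) / smaz_ratio * 100


# en_US, dict size 16384, short strings
print(f"{improvement_pct(0.671, 0.907):+.1f}")  # +26.0
# URLs, dict size 8192
print(f"{improvement_pct(0.552, 0.830):+.1f}")  # +33.5
```

A negative value means wrdz produced larger output than smaz for that configuration.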
US English
The en_US dictionary is trained on a 1M-line subset of the cnn_dailymail dataset.
| Dict Size | Max Seq | Short wrdz | Short smaz | Short Δ% | Long wrdz | Long smaz | Long Δ% |
|---|---|---|---|---|---|---|---|
| 16384 | 4 | 0.671 | 0.907 | +26.0 | 0.521 | 0.621 | +16.1 |
| 8192 | 4 | 0.704 | 0.907 | +22.4 | 0.526 | 0.621 | +15.4 |
| 4096 | 4 | 0.736 | 0.907 | +18.9 | 0.540 | 0.621 | +13.1 |
| 2048 | 4 | 0.799 | 0.907 | +11.9 | 0.559 | 0.621 | +10.1 |
| 1024 | 4 | 0.867 | 0.907 | +4.4 | 0.591 | 0.621 | +4.8 |
| 512 | 4 | 0.906 | 0.907 | +0.1 | 0.627 | 0.621 | -1.0 |
| 256 | 4 | 0.919 | 0.907 | -1.3 | 0.651 | 0.621 | -4.9 |
URLs
The urls dictionary is trained on the ada-url dataset.
| Dict Size | Max Seq | wrdz Ratio | smaz Ratio | Improvement % |
|---|---|---|---|---|
| 16384 | 4 | 0.552 | 0.830 | +33.5 |
| 8192 | 4 | 0.552 | 0.830 | +33.5 |
| 4096 | 4 | 0.562 | 0.830 | +32.2 |
| 2048 | 4 | 0.587 | 0.830 | +29.2 |
| 1024 | 4 | 0.611 | 0.830 | +26.3 |
| 512 | 4 | 0.641 | 0.830 | +22.7 |
| 256 | 4 | 0.666 | 0.830 | +19.8 |
Technical Details
- Dictionary Training
  - Analyzes frequency of character sequences in training data
  - Selects sequences that maximize compression
  - Assigns variable-length binary codes based on frequency
- Compression Format
  - 1-bit flag per token (dictionary/raw)
  - Variable-length dictionary codes (4-14 bits)
  - Raw UTF-8 encoding with 2-bit length prefix
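The token layout described above can be sketched with a toy encoder and decoder. This is a minimal illustration under several assumptions, not the wrdz implementation: the codebook is hand-written and hypothetical, flag `1` is taken to mean a dictionary token, the 2-bit prefix is assumed to store the UTF-8 byte length minus one, matching is greedy longest-first, and bits are kept as a string of `0`/`1` characters instead of being packed into bytes.

```python
# Hypothetical prefix-free codebook; real wrdz codes come from training.
CODEBOOK = {"the ": "0000", "qu": "0001", "ick": "00100"}
DECODEBOOK = {code: seq for seq, code in CODEBOOK.items()}
MAX_SEQ = max(map(len, CODEBOOK))


def encode(text: str) -> str:
    """Encode text as a string of '0'/'1' characters."""
    bits = []
    i = 0
    while i < len(text):
        # Try the longest dictionary match first.
        for length in range(min(MAX_SEQ, len(text) - i), 0, -1):
            code = CODEBOOK.get(text[i:i + length])
            if code is not None:
                bits.append("1" + code)  # flag 1: dictionary token
                i += length
                break
        else:
            raw = text[i].encode("utf-8")  # 1 to 4 bytes per character
            bits.append("0" + format(len(raw) - 1, "02b"))  # flag 0 + length
            bits.extend(format(byte, "08b") for byte in raw)
            i += 1
    return "".join(bits)


def decode(bits: str) -> str:
    """Invert encode(); assumes well-formed input."""
    out = []
    pos = 0
    while pos < len(bits):
        flag = bits[pos]
        pos += 1
        if flag == "1":
            code = ""
            # Prefix-free codes: extend until a full code is recognized.
            while code not in DECODEBOOK:
                code += bits[pos]
                pos += 1
            out.append(DECODEBOOK[code])
        else:
            n_bytes = int(bits[pos:pos + 2], 2) + 1
            pos += 2
            raw = bytes(
                int(bits[pos + 8 * k:pos + 8 * k + 8], 2) for k in range(n_bytes)
            )
            pos += 8 * n_bytes
            out.append(raw.decode("utf-8"))
    return "".join(out)


text = "the quick é"
bits = encode(text)
assert decode(bits) == text
print(len(bits), "bits vs", 8 * len(text.encode("utf-8")), "bits raw")
# 46 bits vs 96 bits raw
```

Note the cost of a miss in this scheme: a raw single-byte character takes 11 bits instead of 8, which is one plausible reason small dictionaries can end up larger than the input.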
License
MIT License. See LICENSE file for details.