Standard Compression Scheme for Unicode
This package implements SCSU as a Python text codec.
Benefits of Unicode compression
Short strings can be compressed with less overhead than general compression algorithms and with fewer bytes than popular Unicode transformations like UTF-8 or UTF-16:
- ¿Qué es Unicode? ("What is Unicode?" in Spanish) is encoded as 18 bytes in UTF-8, but only 16 bytes in SCSU, the same length as when encoded in ISO-8859-1.
- ユニコードとは何か? ("What is Unicode?" in Japanese) is encoded as 30 bytes in UTF-8, 20 bytes in Shift JIS and EUC-JP, but only 15 bytes in SCSU.
- 𐑢𐑳𐑑 𐑦𐑟 𐑿𐑯𐑦𐑒𐑴𐑛? ("What is Unicode?" in the Shavian alphabet) is encoded as 47 bytes in UTF-8, but only 17 bytes in SCSU.
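The UTF-8 and legacy-encoding figures for these samples can be checked with Python's built-in codecs alone; the SCSU figures need this package and are not reproduced here. A minimal sketch, in which `samples` and `byte_counts` are illustrative names rather than part of the package:

```python
# Sample strings from the comparison above. The Japanese string ends with
# U+FF1F FULLWIDTH QUESTION MARK, which the stated byte counts imply.
samples = {
    "Spanish": ("¿Qué es Unicode?", ["utf-8", "utf-16-be", "iso-8859-1"]),
    "Japanese": ("ユニコードとは何か\uff1f", ["utf-8", "shift_jis", "euc_jp"]),
    "Shavian": ("𐑢𐑳𐑑 𐑦𐑟 𐑿𐑯𐑦𐑒𐑴𐑛?", ["utf-8", "utf-16-be"]),
}


def byte_counts(text, encodings):
    """Return the encoded length of text, in bytes, for each codec name."""
    return {name: len(text.encode(name)) for name in encodings}


for label, (text, encodings) in samples.items():
    print(label, byte_counts(text, encodings))
```

Only standard-library codecs are used, so the comparison runs without installing anything.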
In an extreme case, SCSU can compress long strings of emoji:
import scsu  # Importing the package registers the SCSU codec.

emoji = "".join(chr(0x1F600 + n) for n in range(0x50))
sms_data = emoji.encode("UTF-16BE")  # 320 bytes
scsu_data = emoji.encode("SCSU")  # 83 bytes
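The 83-byte figure follows from how SCSU handles a run of characters that all fall within one 128-character block: a three-byte "define extended window" sequence (the SDX tag from the SCSU specification), then one byte per character. The sketch below is a hand-rolled illustration of that arithmetic, not this package's encoder; the function name and `window` parameter are hypothetical, and the exact bytes the package emits may differ.

```python
SDX = 0x0B  # SCSU tag: define an extended (supplementary-plane) dynamic window


def encode_one_supplementary_window(text, window=0):
    """Encode non-empty text whose characters share one 128-character block above U+FFFF."""
    base = ord(text[0]) & ~0x7F  # start of the 128-character window
    if base < 0x10000 or any(not base <= ord(c) < base + 0x80 for c in text):
        raise ValueError("text does not fit in one supplementary window")
    offset = (base - 0x10000) >> 7  # 13-bit window offset
    header = bytes([SDX, (window << 5) | (offset >> 8), offset & 0xFF])
    # Each character becomes a single byte in the range 0x80-0xFF.
    return header + bytes(0x80 + ord(c) - base for c in text)


emoji = "".join(chr(0x1F600 + n) for n in range(0x50))
data = encode_one_supplementary_window(emoji)
print(len(emoji.encode("utf-16-be")), len(data))
```

A three-byte header plus 80 single-byte characters gives 83 bytes, against 320 bytes for the same text in UTF-16.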
Requirements
This package requires Python 3.10 or above.
Usage
Simply import the library and the SCSU codec is ready to use:
import scsu  # Importing the package registers the codec.

s = "¿Qué es Unicode?"
b = s.encode("SCSU")
To automatically add and remove a byte-order mark signature, use SCSU-SIG instead of SCSU.
Errata
Because of CPython bug #79792, the following code does not flush the encoding buffer:
with open(file, mode="w", encoding="SCSU-SIG") as f:
    f.write(s)  # Never flushes the encoding buffer.
A workaround is to import the codecs module, then replace open with codecs.open:
import codecs

with codecs.open(file, "w", encoding="SCSU-SIG") as f:
    f.write(s)  # Always flushes the encoding buffer.
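Another possible workaround is to encode the whole string in a single call and write the resulting bytes in binary mode, so no incremental encoder is left to flush. This is a sketch under the assumption that one-shot `str.encode` is unaffected by the bug; `write_encoded` is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def write_encoded(path, text, encoding):
    """Encode text in one call and write the resulting bytes to path."""
    data = text.encode(encoding)  # One-shot encode: nothing is left buffered.
    Path(path).write_bytes(data)
    return len(data)


# With this package installed and imported, the call would look like:
#     write_encoded(file, s, "SCSU-SIG")
```

The helper takes the codec name as a parameter, so it behaves the same way with any registered text codec.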
Reading an encoded file with the built-in open is not affected and works as expected:
with open(file, mode="r", encoding="SCSU-SIG") as f:
    print(f.read())
Credits
The encoding logic is heavily based on the sample encoder described in "A survey of Unicode compression" by Doug Ewell, originally written by Richard Gillam for his book Unicode Demystified, with some added optimizations:
- A two-character lookahead buffer.
  - This avoids a case where switching from Unicode to single-byte mode requires two window switches.
- Compression of sequential static window characters into a single new dynamic window.
  - This avoids a case where a long string of punctuation is encoded as multiple quoted characters.
- Use of the Latin-1 Supplement window whenever possible.
  - When encoding a string that contains only ASCII and Latin-1 Supplement characters, this results in a string that is valid in both SCSU and ISO-8859-1.
Decoding logic, however, is entirely original.
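The Latin-1 compatibility noted in the list above follows from the SCSU initial state: the stream starts in single-byte mode with the active dynamic window at U+0080, so printable ASCII bytes stand for themselves and bytes 0x80-0xFF map to U+0080-U+00FF. The sketch below decodes only that tag-free subset; it is an illustration of the property, not the package's decoder, and `decode_initial_state` is a hypothetical name.

```python
# Bytes below 0x20 are SCSU tags, except these pass-through control characters.
PASS_THROUGH_CONTROLS = {0x00, 0x09, 0x0A, 0x0D}


def decode_initial_state(data):
    """Decode SCSU bytes that never leave the initial single-byte state."""
    chars = []
    for byte in data:
        if byte < 0x20 and byte not in PASS_THROUGH_CONTROLS:
            raise ValueError(f"tag byte 0x{byte:02X} is outside this sketch")
        # ASCII maps to itself; 0x80-0xFF map through the window at U+0080.
        chars.append(chr(byte))
    return "".join(chars)


text = "¿Qué es Unicode?"
print(decode_initial_state(text.encode("iso-8859-1")) == text)
```

For tag-free input the mapping matches ISO-8859-1 decoding byte for byte, which is why such a string is valid in both encodings.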
File details
Details for the file scsu-0.3.tar.gz.
File metadata
- Download URL: scsu-0.3.tar.gz
- Upload date:
- Size: 11.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 342049c15e5be34ba044905e3beea78df7f33dc1caaf77465c1c36eb5eff217c |
| MD5 | 96dfb385bfcdc146911614a32aea0bff |
| BLAKE2b-256 | da8628cfcc0585911bb1c572afae149f93e89d1d048b4608f61f235a198b7578 |
File details
Details for the file scsu-0.3-py3-none-any.whl.
File metadata
- Download URL: scsu-0.3-py3-none-any.whl
- Upload date:
- Size: 14.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 92511526212dca955da3f7b1055bc05c12ca0c2332dbe2c28f7949e8e05a6e16 |
| MD5 | fa8dddad9a3e89ea4b2ae93853b7130e |
| BLAKE2b-256 | fd2bb3bf1697a732416ccff8d133a6fef89632e98d38839c4465ffb770cd5320 |