
Standard Compression Scheme for Unicode

This package implements SCSU as a Python text codec.

Benefits of Unicode compression

Short strings can be compressed with less overhead than general compression algorithms and with fewer bytes than popular Unicode transformations like UTF-8 or UTF-16:

  • ¿Qué es Unicode? ("What is Unicode?" in Spanish) is encoded as 18 bytes in UTF-8, but only 16 bytes in SCSU, the same length as when encoded in ISO-8859-1.
  • ユニコードとは何か? ("What is Unicode?" in Japanese) is encoded as 30 bytes in UTF-8, 20 bytes in Shift JIS and EUC-JP, but only 15 bytes in SCSU.
  • 𐑢𐑳𐑑 𐑦𐑟 𐑿𐑯𐑦𐑒𐑴𐑛? ("What is Unicode?" in the Shavian alphabet) is encoded as 47 bytes in UTF-8, but only 17 bytes in SCSU.
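
The UTF-8 and legacy-encoding byte counts above can be checked with the standard library alone. The following is a minimal sketch: the helper name encoded_sizes is hypothetical, and an SCSU entry only appears if the scsu package is installed. The Japanese sample is assumed to end with a fullwidth question mark (U+FF1F), which is what the stated byte counts imply.

```python
try:
    import scsu  # noqa: F401  (importing registers the SCSU codec, per the Usage section)
    HAVE_SCSU = True
except ImportError:
    HAVE_SCSU = False


def encoded_sizes(text, encodings):
    """Return {encoding name: byte length} for each encoding that can represent text."""
    sizes = {}
    for name in encodings:
        try:
            sizes[name] = len(text.encode(name))
        except (LookupError, UnicodeEncodeError):
            # Unknown codec, or the text is not representable in it: skip.
            continue
    return sizes


spanish = "\u00bfQu\u00e9 es Unicode?"  # ¿Qué es Unicode?
japanese = "ユニコードとは何か\uff1f"  # assumed fullwidth question mark

encodings = ["utf-8", "iso-8859-1", "shift_jis", "euc_jp"]
if HAVE_SCSU:
    encodings.append("SCSU")

for sample in (spanish, japanese):
    print(sample, encoded_sizes(sample, encodings))
```

Encodings that cannot represent a sample are left out of its result, so the Spanish line reports no Shift JIS size and the Japanese line reports no ISO-8859-1 size.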

In an extreme case, SCSU can compress long strings of emoji:

import scsu  # importing the package registers the SCSU codec

emoji = "".join(chr(0x1F600 + n) for n in range(0x50))  # 80 emoji
sms_data = emoji.encode("UTF-16BE")  # 320 bytes
scsu_data = emoji.encode("SCSU")  # 83 bytes
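
The 83-byte figure is consistent with how SCSU's extended windows work: one three-byte tag can point a 128-character window at U+1F600, after which each emoji costs a single byte. The sketch below builds such a byte sequence by hand from the rules in UTS #6. It illustrates the arithmetic only and is not this package's encoder; the function name is hypothetical, and the package may emit a different but equally valid byte sequence.

```python
SDX = 0x0B  # UTS #6 single-byte-mode tag: define (and select) an extended window


def scsu_extended_window_sketch(text, window=0):
    """Hand-encode text whose characters all fit one 128-character
    supplementary-plane window. Illustration only, not the package's encoder."""
    if not text:
        return b""
    code_points = [ord(ch) for ch in text]
    start = min(code_points) & ~0x7F  # align the window to a multiple of 0x80
    if start < 0x10000 or max(code_points) >= start + 0x80:
        raise ValueError("text does not fit a single extended window")
    # Window number in the top 3 bits, then a 13-bit offset counted
    # in units of 0x80 above U+10000.
    arg = (window << 13) | ((start - 0x10000) >> 7)
    out = bytearray([SDX, arg >> 8, arg & 0xFF])
    # Characters in the active dynamic window are encoded as bytes 0x80..0xFF.
    out.extend(0x80 + (cp - start) for cp in code_points)
    return bytes(out)


emoji = "".join(chr(0x1F600 + n) for n in range(0x50))
print(len(emoji.encode("UTF-16BE")))            # 80 characters x 4 bytes each
print(len(scsu_extended_window_sketch(emoji)))  # 3-byte tag + 80 x 1 byte
```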

Requirements

This package requires Python 3.10 or above.

Usage

Simply import the library and the SCSU codec is ready to use:

import scsu  # importing the package registers the codec

s = "¿Qué es Unicode?"
b = s.encode("SCSU")

To automatically add and remove a byte-order mark signature, use SCSU-SIG instead of SCSU.
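
A round trip through the codec might look like the sketch below. The helper roundtrip is hypothetical and works with any registered codec. When the scsu package is not installed, the sketch falls back to the standard library's utf-8-sig, an analogous signature-writing codec, so that it still runs.

```python
import codecs

try:
    import scsu  # noqa: F401  (registers SCSU and SCSU-SIG, per the text above)
except ImportError:
    pass

try:
    codecs.lookup("SCSU-SIG")
    CODEC = "SCSU-SIG"
except LookupError:
    CODEC = "utf-8-sig"  # stand-in so the sketch runs without the package


def roundtrip(text, encoding):
    """Encode text, decode the result, and return (encoded bytes, decoded text)."""
    data = text.encode(encoding)
    return data, data.decode(encoding)


original = "\u00bfQu\u00e9 es Unicode?"  # ¿Qué es Unicode?
data, decoded = roundtrip(original, CODEC)
print(CODEC, len(data), decoded == original)
```

The signature is written on encode and stripped again on decode, so the decoded text should compare equal to the original.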

Errata

Because of CPython bug #79792, the following code does not flush the encoding buffer when writing a file:

with open(file, mode="w", encoding="SCSU-SIG") as f:
    f.write(s)  # Never flushes the encoding buffer.

A workaround is to import the codecs module, then replace open with codecs.open:

import codecs

with codecs.open(file, "w", encoding="SCSU-SIG") as f:
    f.write(s)  # Always flushes the encoding buffer.

Reading an encoded file is not affected, so the built-in open works as expected:

with open(file, mode="r", encoding="SCSU-SIG") as f:
    print(f.read())
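
Another way to sidestep the buffering issue, offered here as a sketch rather than a recommendation from the package author, is to encode the whole string in memory and write the resulting bytes in binary mode, so that no incremental encoder is involved. The function name write_encoded is hypothetical.

```python
import os
import tempfile


def write_encoded(path, text, encoding):
    """Encode the whole string at once and write the bytes in binary mode."""
    data = text.encode(encoding)
    with open(path, "wb") as f:
        f.write(data)
    return len(data)


# Demonstrated with a standard-library codec so the sketch runs anywhere.
# With the scsu package imported, "SCSU-SIG" could presumably be passed
# as the encoding instead (assumption based on the Usage section).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    written = write_encoded(path, "\u00bfQu\u00e9 es Unicode?", "utf-8-sig")
    with open(path, "rb") as f:
        print(written, len(f.read()))
```

This trades streaming for simplicity: the whole encoded result is held in memory, which is fine for short strings but less suitable for very large files.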

Credits

The encoding logic is heavily based on a sample encoder originally written by Richard Gillam for his book Unicode Demystified and described in "A survey of Unicode compression" by Doug Ewell, with some added optimizations:

  • A two-character lookahead buffer.
    • This avoids a case where switching from Unicode to single-byte mode requires two window switches.
  • Compression of sequential static window characters into a single new dynamic window.
    • This avoids a case where a long string of punctuation is encoded as multiple quoted characters.
  • Use of the Latin-1 Supplement window whenever possible.
    • When encoding a string that contains only ASCII and Latin-1 Supplement characters, this results in output that is valid as both SCSU and ISO-8859-1.
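
The last point follows from SCSU's initial state as defined in UTS #6: bytes 0x20 to 0x7F pass through as ASCII, and the default active window maps bytes 0x80 to 0xFF onto U+0080 to U+00FF. A rough check for whether a string qualifies is sketched below. The function name is hypothetical, and the claim that the package emits exactly the ISO-8859-1 bytes for such input is taken from the description above rather than verified here.

```python
# C0 controls that SCSU single-byte mode passes through unchanged (per UTS #6).
PASS_THROUGH_CONTROLS = {0x00, 0x09, 0x0A, 0x0D}


def is_latin1_compatible(text):
    """True if every character is a pass-through control, in the
    0x20..0x7F ASCII range, or in the U+0080..U+00FF range."""
    return all(
        cp in PASS_THROUGH_CONTROLS or 0x20 <= cp <= 0xFF
        for cp in map(ord, text)
    )


print(is_latin1_compatible("\u00bfQu\u00e9 es Unicode?"))  # ASCII and Latin-1 Supplement only
print(is_latin1_compatible("ユニコード"))  # katakana needs a different window
```

Other C0 control characters are excluded because SCSU reserves those byte values as tags, so they would have to be quoted and the output would no longer match ISO-8859-1 byte for byte.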

Decoding logic, however, is entirely original.
