Standard Compression Scheme for Unicode

This package implements SCSU as a Python text codec.

Benefits of Unicode compression

Short strings can be compressed with less overhead than general-purpose compression algorithms incur, and into fewer bytes than popular Unicode encoding forms such as UTF-8 or UTF-16 need:

  • ¿Qué es Unicode? ("What is Unicode?" in Spanish) is encoded as 18 bytes in UTF-8, but only 16 bytes in SCSU, the same length as in ISO-8859-1.
  • ユニコードとは何か？ ("What is Unicode?" in Japanese) is encoded as 30 bytes in UTF-8, 20 bytes in Shift JIS and EUC-JP, but only 15 bytes in SCSU.
  • 𐑢𐑳𐑑 𐑦𐑟 𐑿𐑯𐑦𐑒𐑴𐑛? ("What is Unicode?" in the Shavian alphabet) is encoded as 47 bytes in UTF-8, but only 17 bytes in SCSU.
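
The UTF-8 and legacy-encoding byte counts quoted above can be checked with the standard library alone. The sketch below is illustrative only: `byte_counts` is a hypothetical helper name, the Japanese sample is assumed to end in a full-width question mark (U+FF1F), which is what the quoted lengths imply, and the zlib figure is included just to show the fixed overhead a general-purpose compressor adds to a short string. The SCSU counts are not reproduced here because they require this package.

```python
import zlib

# Sample strings from the list above. The Japanese sample is assumed to end
# in a full-width question mark (U+FF1F).
SAMPLES = {
    "Spanish": ("¿Qué es Unicode?", ["utf-8", "iso-8859-1"]),
    "Japanese": ("ユニコードとは何か\uff1f", ["utf-8", "shift_jis", "euc_jp"]),
    "Shavian": ("𐑢𐑳𐑑 𐑦𐑟 𐑿𐑯𐑦𐑒𐑴𐑛?", ["utf-8", "utf-16-be"]),
}


def byte_counts(text, encodings):
    """Return {encoding name: encoded length in bytes} for text."""
    return {name: len(text.encode(name)) for name in encodings}


for label, (text, encodings) in SAMPLES.items():
    counts = byte_counts(text, encodings)
    # zlib wraps its output in a header and a checksum, so very short
    # inputs tend to grow rather than shrink.
    counts["utf-8 + zlib"] = len(zlib.compress(text.encode("utf-8")))
    print(label, counts)
```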

In an extreme case, SCSU can compress long strings of emoji:

import scsu  # Importing the package registers the SCSU codec.

emoji = "".join(chr(0x1F600 + n) for n in range(0x50))  # 80 emoji
sms_data = emoji.encode("UTF-16BE")  # 320 bytes
scsu_data = emoji.encode("SCSU")  # 83 bytes
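
The 83-byte figure is consistent with how SCSU windows are specified in UTS #6: three bytes define a 128-character dynamic window covering U+1F600 to U+1F67F, after which each emoji costs a single byte. The hand-rolled sketch below builds such a byte sequence without using this package. `encode_in_one_extended_window` is a hypothetical helper, the tag value and bit layout come from a reading of UTS #6, and the package's real output may differ byte for byte (for example in which window slot it chooses).

```python
SDX = 0x0B  # UTS #6 single-byte-mode tag: define an extended (supplementary) window


def encode_in_one_extended_window(text, window=0):
    """Encode non-empty text whose characters share one 128-character
    supplementary-plane window: 3 header bytes, then 1 byte per character."""
    base = min(ord(c) for c in text) & ~0x7F  # window start, aligned to 0x80
    if base < 0x10000 or any(ord(c) >= base + 0x80 for c in text):
        raise ValueError("text does not fit in one supplementary window")
    index = (base - 0x10000) >> 7  # 13-bit window offset, in units of 0x80
    # Header: tag, then window slot (high 3 bits) and offset (low 13 bits).
    header = bytes([SDX, (window << 5) | (index >> 8), index & 0xFF])
    # Body: each character becomes 0x80 plus its position inside the window.
    body = bytes(0x80 + ord(c) - base for c in text)
    return header + body


emoji = "".join(chr(0x1F600 + n) for n in range(0x50))
print(len(emoji.encode("utf-16-be")))  # 4 bytes per emoji (one surrogate pair each)
print(len(encode_in_one_extended_window(emoji)))  # 3 header bytes + 1 byte per emoji
```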

Requirements

This package requires Python 3.10 or above.

Usage

Simply import the library and the SCSU codec is ready to use:

import scsu  # Importing the package registers the codec.

s = "¿Qué es Unicode?"
b = s.encode("SCSU")

To automatically add and remove a byte-order mark signature, use SCSU-SIG instead of SCSU.
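
Conceptually, a signature-aware codec prepends a fixed byte sequence when encoding and drops it when decoding, much like Python's built-in `utf-8-sig`. The sketch below shows that idea on raw bytes. It assumes the signature is the three bytes `0E FE FF` (U+FEFF quoted with the SQU tag), which is what UTS #6 recommends; the text above does not confirm the exact bytes this package writes, and both helper names are hypothetical.

```python
SCSU_SIGNATURE = b"\x0e\xfe\xff"  # assumed: SQU tag (0x0E) + U+FEFF, per UTS #6


def add_signature(payload: bytes) -> bytes:
    """Prepend the signature to already-encoded SCSU bytes."""
    return SCSU_SIGNATURE + payload


def strip_signature(data: bytes) -> bytes:
    """Drop a leading signature if present; otherwise return data unchanged."""
    if data.startswith(SCSU_SIGNATURE):
        return data[len(SCSU_SIGNATURE):]
    return data


payload = b"Unicode"  # printable ASCII passes through SCSU unchanged
signed = add_signature(payload)
print(signed.hex(" "))
print(strip_signature(signed) == payload)
```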

Errata

Because of CPython bug #79792, the following code never flushes the encoder's buffer when writing through the built-in open:

with open(file, mode="w", encoding="SCSU-SIG") as f:
    f.write(s)  # Never flushes the encoding buffer.

A workaround is to import the codecs module, then replace open with codecs.open:

import codecs

with codecs.open(file, "w", encoding="SCSU-SIG") as f:
    f.write(s)  # Always flushes the encoding buffer.

Reading an encoded file with the built-in open is not affected and works as expected:

with open(file, mode="r", encoding="SCSU-SIG") as f:
    print(f.read())
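
Another way to sidestep the buffering problem, not part of the original notes, is to encode the whole string in memory and write the resulting bytes in binary mode, which bypasses the text layer entirely. This is only a sketch: `write_encoded` is a hypothetical helper, it assumes the one-shot `str.encode` call returns the complete output (as in the Usage section), and it holds the entire encoded text in memory. The demonstration uses a standard-library codec so that it runs without this package.

```python
import os
import tempfile


def write_encoded(path, text, encoding):
    """Encode text in one shot and write the bytes in binary mode."""
    data = text.encode(encoding)  # one-shot encode, no incremental encoder involved
    with open(path, "wb") as f:
        f.write(data)
    return len(data)


# Demonstration with a standard-library codec; after `import scsu`,
# "SCSU-SIG" could be passed instead.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    written = write_encoded(path, "¿Qué es Unicode?", "utf-8-sig")
    with open(path, "rb") as f:
        assert len(f.read()) == written
```

With this package imported, the equivalent call would be `write_encoded(file, s, "SCSU-SIG")`.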

Credits

The encoding logic is heavily based on the sample encoder described in "A survey of Unicode compression" by Doug Ewell, which Richard Gillam originally wrote for his book Unicode Demystified. This package adds some encoding optimizations:

  • A two-character lookahead buffer.
    • This avoids a case where switching from Unicode to single-byte mode requires two window switches.
  • Compression of sequential static window characters into a single new dynamic window.
    • This avoids a case where a long string of punctuation is encoded as multiple quoted characters.
  • Use of the Latin-1 Supplement window whenever possible.
    • When a string contains only ASCII and Latin-1 Supplement characters, the encoded output is valid as both SCSU and ISO-8859-1.
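
The last point follows from SCSU's initial state as specified in UTS #6: in single-byte mode, printable ASCII and four control characters pass through unchanged, and the default active window maps bytes 0x80 to 0xFF onto U+0080 to U+00FF. The sketch below checks whether a string stays within that range. `is_latin1_compatible` is a hypothetical helper written from the specification, not code from this package, and whether the package emits identical bytes for every such string is an assumption.

```python
# Control characters that SCSU's single-byte mode passes through as-is (UTS #6);
# other values below 0x20 are tag bytes and would need quoting.
PASS_THROUGH_CONTROLS = {0x00, 0x09, 0x0A, 0x0D}


def is_latin1_compatible(text):
    """True if every character can be one SCSU byte equal to its ISO-8859-1 byte."""
    return all(
        cp in PASS_THROUGH_CONTROLS or 0x20 <= cp <= 0xFF
        for cp in map(ord, text)
    )


print(is_latin1_compatible("¿Qué es Unicode?"))  # ASCII and Latin-1 Supplement only
print(is_latin1_compatible("ユニコード"))  # katakana lies outside the default window
print(is_latin1_compatible("bell\x07"))  # U+0007 collides with an SCSU tag byte
```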

Decoding logic, however, is entirely original.
