
Standard Compression Scheme for Unicode

This package implements SCSU as a Python text codec.

Benefits of Unicode compression

SCSU compresses short strings with less overhead than general-purpose compression algorithms, and into fewer bytes than common Unicode encoding forms such as UTF-8 or UTF-16:

  • "¿Qué es Unicode?" ("What is Unicode?" in Spanish) is encoded as 18 bytes in UTF-8, but only 16 bytes in SCSU, the same length as when encoded in ISO-8859-1.
  • "ユニコードとは何か?" ("What is Unicode?" in Japanese) is encoded as 30 bytes in UTF-8, 20 bytes in Shift JIS and EUC-JP, but only 15 bytes in SCSU.
  • "𐑢𐑳𐑑 𐑦𐑟 𐑿𐑯𐑦𐑒𐑴𐑛?" ("What is Unicode?" in the Shavian alphabet) is encoded as 47 bytes in UTF-8, but only 17 bytes in SCSU.
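
The UTF-8 and ISO-8859-1 figures above can be reproduced with Python's built-in codecs alone. The sketch below does not use SCSU at all; `samples` and `encoded_sizes` are illustrative names, not part of this package.

```python
# Reproduce the non-SCSU byte counts quoted above with standard-library codecs.
samples = {
    "Spanish": "¿Qué es Unicode?",
    "Shavian": "𐑢𐑳𐑑 𐑦𐑟 𐑿𐑯𐑦𐑒𐑴𐑛?",
}


def encoded_sizes(text, encodings):
    """Map each encoding name to the encoded length of text, in bytes.

    Encodings that cannot represent the text are left out of the result.
    """
    sizes = {}
    for name in encodings:
        try:
            sizes[name] = len(text.encode(name))
        except UnicodeEncodeError:
            continue  # For example, ISO-8859-1 has no Shavian letters.
    return sizes


for label, text in samples.items():
    print(label, encoded_sizes(text, ("utf-8", "utf-16-be", "iso-8859-1")))
```

With this package installed and imported, adding "SCSU" to the tuple of encodings should give the SCSU figures as well.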

In an extreme case, SCSU can compress long strings of emoji:

import scsu  # Importing the package registers the "SCSU" codec.

emoji = "".join(chr(0x1F600 + n) for n in range(80))
sms_data = emoji.encode("UTF-16BE")  # 320 bytes
scsu_data = emoji.encode("SCSU")  # 83 bytes
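
The arithmetic behind these numbers can be checked without the codec. The sketch below assumes the encoder covers the whole run with one extended dynamic window, which the SCSU specification (UTS #6) defines with a three-byte tag; `WINDOW_SIZE` and the other names are illustrative, and the estimate is not taken from this package's source.

```python
emoji = "".join(chr(0x1F600 + n) for n in range(80))

# Every code point lies outside the Basic Multilingual Plane, so UTF-16 needs
# a surrogate pair (4 bytes) per character and UTF-8 also needs 4 bytes.
utf16_size = len(emoji.encode("utf-16-be"))
utf8_size = len(emoji.encode("utf-8"))

# Assumption: a dynamic window spans 128 consecutive code points and starts
# on a multiple of 0x80.
WINDOW_SIZE = 0x80
window_start = ord(emoji[0]) & ~(WINDOW_SIZE - 1)
fits_one_window = all(
    window_start <= ord(ch) < window_start + WINDOW_SIZE for ch in emoji
)

# Assumption: 3 bytes to define the window, then 1 byte per character.
estimated_scsu_size = 3 + len(emoji) if fits_one_window else None

print(utf16_size, utf8_size, estimated_scsu_size)
```

Under these assumptions the estimate matches the 83 bytes quoted in the example.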

Requirements

This package requires Python 3.10 or above.

Usage

Simply import the library and the SCSU codec is ready to use:

import scsu  # Importing the package registers the codec.

s = "¿Qué es Unicode?"
b = s.encode("SCSU")
assert b.decode("SCSU") == s

To automatically add and remove a byte-order mark signature, use "SCSU-SIG" instead of "SCSU".
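
When the origin of some bytes is unknown, a small check can decide which codec name to use. This is a minimal sketch, assuming the signature is the byte sequence 0E FE FF that UTS #6 gives for U+FEFF in SCSU; `has_scsu_signature` and `pick_codec` are hypothetical helpers, not functions provided by this package.

```python
# Assumption: the signature is U+FEFF in its SCSU form, the bytes 0E FE FF
# (per UTS #6). The exact bytes this package writes are not shown above.
SCSU_SIGNATURE = b"\x0e\xfe\xff"


def has_scsu_signature(data: bytes) -> bool:
    """Return True if data starts with the SCSU byte-order mark signature."""
    return data.startswith(SCSU_SIGNATURE)


def pick_codec(data: bytes) -> str:
    """Choose the codec name to pass to bytes.decode for SCSU data."""
    return "SCSU-SIG" if has_scsu_signature(data) else "SCSU"
```

After `import scsu`, something like `data.decode(pick_codec(data))` would then decode either form.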

Errata

Because of CPython bug #79792, the following code does not flush the encoding buffer:

with open(file, mode="w", encoding="SCSU-SIG") as f:
    f.write(s)  # Never flushes the encoding buffer.

A workaround is to import the codecs module, then replace open with codecs.open:

import codecs

with codecs.open("output.txt", "w", encoding="SCSU-SIG") as f:
    f.write(s)  # Always flushes the encoding buffer.

Reading is not affected, however; the built-in open decodes an encoded file correctly:

with open(file, mode="r", encoding="SCSU-SIG") as f:
    print(f.read())
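
For writing, another possible approach (not taken from this package's documentation) is to skip the text layer entirely: encode the whole string with `str.encode` and write the bytes in binary mode. The sketch below assumes the one-shot encode shown under Usage is unaffected by the bug; `write_encoded` is a hypothetical helper name.

```python
from pathlib import Path


def write_encoded(path, text, encoding="SCSU-SIG"):
    """Encode text in a single call and write the bytes in binary mode.

    The default encoding requires `import scsu` to have run first.
    Returns the number of bytes written.
    """
    data = text.encode(encoding)  # One-shot encode of the whole string.
    Path(path).write_bytes(data)
    return len(data)
```

This holds the whole encoded output in memory, so it suits strings of moderate size better than very large files.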

Credits

The encoding logic is heavily based on the sample encoder that Richard Gillam wrote for his book Unicode Demystified, as described in "A survey of Unicode compression" by Doug Ewell, with some added optimizations:

  • A two-character lookahead buffer.
    • This avoids a case where switching from Unicode to single-byte mode requires two window switches.
  • Compression of sequential static window characters into a single new dynamic window.
    • This avoids a case where a long string of punctuation is encoded as multiple quoted characters.
  • Use of the Latin-1 Supplement window whenever possible.
    • When a string contains only ASCII and Latin-1 Supplement characters, the encoded output is valid as both SCSU and ISO-8859-1.

Decoding logic, however, is entirely original.
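
The Latin-1 Supplement optimization listed above suggests a quick way to predict when the SCSU output should coincide with ISO-8859-1. The sketch below is an assumption based on the initial state described in UTS #6, not on this package's source, and `is_latin1_compatible` is a hypothetical helper.

```python
# Control characters that SCSU single-byte mode passes through unchanged.
# Assumption (UTS #6): the other byte values below 0x20 are reserved as tags.
_PASS_THROUGH_CONTROLS = {0x00, 0x09, 0x0A, 0x0D}


def is_latin1_compatible(text: str) -> bool:
    """Return True if text could be encoded byte-for-byte as ISO-8859-1.

    Assumes the encoder stays in the initial single-byte mode, where the
    active dynamic window covers U+0080 to U+00FF.
    """
    for ch in text:
        code_point = ord(ch)
        if code_point > 0xFF:
            return False  # Outside ASCII and the Latin-1 Supplement.
        if code_point < 0x20 and code_point not in _PASS_THROUGH_CONTROLS:
            return False  # Would collide with an SCSU tag byte.
    return True
```

For example, `is_latin1_compatible("¿Qué es Unicode?")` is true, which is consistent with the 16-byte figure quoted earlier for both SCSU and ISO-8859-1.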
