Standard Compression Scheme for Unicode

This package implements SCSU as a Python text codec.

Benefits of Unicode compression

Text compressed with SCSU typically takes fewer bytes than the same text encoded as UTF-8 or UTF-16.

Sample Text | In UTF-8 | In UTF-16 | In SCSU
¿Qué es Unicode? | 18 bytes | 32 bytes | 16 bytes
Що таке Юнікод? | 27 bytes | 30 bytes | 16 bytes
Ի՞նչ է Յունիկոդը ? | 32 bytes | 36 bytes | 20 bytes
यूनिकोड क्या है? | 42 bytes | 32 bytes | 17 bytes
ユニコードとは何か? | 30 bytes | 20 bytes | 15 bytes
什麼是Unicode(統一碼/標準萬國碼)? | 44 bytes | 44 bytes | 38 bytes
𐑢𐑳𐑑 𐑦𐑟 𐑿𐑯𐑦𐑒𐑴𐑛? | 47 bytes | 50 bytes | 17 bytes
😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏 | 64 bytes | 64 bytes | 19 bytes
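
The UTF-8 and UTF-16 columns can be reproduced with the standard library alone. The following is a minimal sketch, not part of the package: `encoded_sizes` is a hypothetical helper name, and it assumes the UTF-16 figures exclude a byte-order mark (which is consistent with 16 characters giving 32 bytes in the first row). The SCSU size is reported only if the scsu package happens to be installed.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of `text` in each encoding (hypothetical helper)."""
    sizes = {
        "UTF-8": len(text.encode("utf-8")),
        # "utf-16-le" omits the 2-byte BOM that plain "utf-16" would prepend.
        "UTF-16": len(text.encode("utf-16-le")),
    }
    try:
        import scsu  # noqa: F401  (importing the module registers the SCSU codec)
    except ImportError:
        return sizes  # Package not installed: report the stdlib encodings only.
    sizes["SCSU"] = len(text.encode("SCSU"))
    return sizes


for sample in ("\u00bfQu\u00e9 es Unicode?", "Що таке Юнікод?"):
    print(sample, encoded_sizes(sample))
```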

Requirements

This package requires Python 3.10 or above.

Usage

Source code

Simply import the module and the SCSU codec is ready to use:

import scsu  # Importing the module makes the SCSU codec available.

s = "¿Qué es Unicode?"
b = s.encode("SCSU")

To automatically add and remove a byte-order mark signature, use SCSU-SIG instead of SCSU.
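
As an illustration, the sketch below round-trips a string through a codec and checks for a leading signature. It rests on two assumptions: that SCSU-SIG writes the signature defined in UTS #6 (the bytes 0E FE FF), and that decoding is done with the same codec name. `has_scsu_signature` and `roundtrip` are hypothetical helper names, not part of the package.

```python
from typing import Optional

# SCSU signature per UTS #6: the SQU tag (0x0E) followed by U+FEFF.
SCSU_SIGNATURE = b"\x0e\xfe\xff"


def has_scsu_signature(data: bytes) -> bool:
    """Report whether a byte string starts with the SCSU signature."""
    return data.startswith(SCSU_SIGNATURE)


def roundtrip(text: str, codec: str = "SCSU-SIG") -> Optional[str]:
    """Encode and decode `text`; return None if scsu is not installed."""
    try:
        import scsu  # noqa: F401  (importing the module registers the codecs)
    except ImportError:
        return None
    return text.encode(codec).decode(codec)


print(has_scsu_signature(b"\x0e\xfe\xffUnicode"))
print(roundtrip("Unicode"))
```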

Command line interface

To compress a file, use the "encode" command:

python3 -m scsu encode unicode.txt > scsu.txt

To decompress a file, use the "decode" command:

python3 -m scsu decode scsu.txt > unicode.txt

To automatically add and remove a byte-order mark signature, add the -s option after the encode/decode command.
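
The same commands can be driven from a script. This is only a sketch: `scsu_command` and `run_to_file` are hypothetical names, and it assumes the command writes its result to standard output (as the shell redirections above suggest) and that -s goes directly after the encode/decode command.

```python
import subprocess
import sys


def scsu_command(action: str, path: str, signature: bool = False) -> list:
    """Build the argument list for `python -m scsu <action> [-s] <path>`."""
    if action not in ("encode", "decode"):
        raise ValueError("action must be 'encode' or 'decode'")
    cmd = [sys.executable, "-m", "scsu", action]
    if signature:
        cmd.append("-s")  # Placed after the command, as described above.
    cmd.append(path)
    return cmd


def run_to_file(action: str, src: str, dst: str, signature: bool = False) -> None:
    """Run the command and capture its standard output in `dst`."""
    with open(dst, "wb") as out:
        subprocess.run(scsu_command(action, src, signature), stdout=out, check=True)
```

For example, `run_to_file("encode", "unicode.txt", "scsu.txt", signature=True)` would be the scripted counterpart of the encode command with -s, provided the package is installed.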

Errata

Because of CPython bug #79792, the following sample code does not flush the encoding buffer:

with open(file, mode="w", encoding="SCSU-SIG") as f:
    f.write(s)  # Never flushes the encoding buffer.

A workaround is to import the codecs module, then replace open with codecs.open:

import codecs

with codecs.open(file, "w", encoding="SCSU-SIG") as f:
    f.write(s)  # Always flushes the encoding buffer.
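
Another possibility, not taken from the original notes, is to encode the whole string in one call and write the bytes in binary mode, so that no text-mode buffering is involved. This is a sketch that assumes the text fits in memory and that the codec's one-shot encoder emits complete output. `write_encoded` is a hypothetical helper name, and the demonstration uses the standard "utf-8-sig" codec so that it runs without the scsu package.

```python
def write_encoded(path: str, text: str, encoding: str) -> None:
    """Encode `text` in a single call, then write the bytes unchanged."""
    data = text.encode(encoding)  # One-shot encode: nothing is left in a buffer.
    with open(path, "wb") as f:
        f.write(data)


# With the scsu package imported, the call would look like:
#     write_encoded("scsu.txt", s, "SCSU-SIG")
```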

Credits

The encoding logic is heavily based on a sample encoder described in "A survey of Unicode compression" by Doug Ewell, originally written by Richard Gillam in his book Unicode Demystified.

Enhancements to the encoding logic include:

  • A two-character lookahead buffer to avoid a case where switching from Unicode to single-byte mode requires two window switches.
  • Compression of sequential static window characters into a single new dynamic window, to avoid a case where a long string of punctuation is encoded as multiple quoted characters.
  • Use of the Latin-1 Supplement window whenever possible, so that transcoding text encoded as ISO-8859-1 yields a byte string that is valid as both SCSU and ISO-8859-1.
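
The Latin-1 property in the last item follows from the SCSU initial state defined in UTS #6: single-byte mode passes NUL, TAB, LF, CR and bytes 0x20 to 0x7F through unchanged, and maps bytes 0x80 to 0xFF into the default window at U+0080. The sketch below checks whether a string qualifies. `latin1_is_valid_scsu` is a hypothetical helper, and the check describes the format itself, not this package's encoder.

```python
# Control characters that SCSU single-byte mode passes through unchanged (UTS #6).
PASSTHROUGH_CONTROLS = {0x00, 0x09, 0x0A, 0x0D}


def latin1_is_valid_scsu(text: str) -> bool:
    """True if the ISO-8859-1 bytes of `text` also decode to `text` as SCSU."""
    for ch in text:
        cp = ord(ch)
        if cp > 0xFF:
            return False  # Not representable in ISO-8859-1 at all.
        if cp < 0x20 and cp not in PASSTHROUGH_CONTROLS:
            return False  # Other C0 bytes are SCSU tag bytes, not literals.
    return True


print(latin1_is_valid_scsu("\u00bfQu\u00e9 es Unicode?"))
```

This is consistent with the first row of the size table, where the 16-character Spanish sample takes 16 bytes in SCSU.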

Decoding logic, however, is entirely original.
