Standard Compression Scheme for Unicode

This package implements SCSU as a Python text codec.

Benefits of Unicode compression

Text encoded as SCSU typically occupies fewer bytes than the same text encoded as UTF-8 or UTF-16.

Sample Text | In UTF-8 | In UTF-16 | In SCSU
¿Qué es Unicode? | 18 bytes | 32 bytes | 16 bytes
Що таке Юнікод? | 27 bytes | 30 bytes | 16 bytes
Ի՞նչ է Յունիկոդը ? | 32 bytes | 36 bytes | 20 bytes
यूनिकोड क्या है? | 42 bytes | 32 bytes | 17 bytes
ユニコードとは何か? | 30 bytes | 20 bytes | 15 bytes
什麼是Unicode(統一碼/標準萬國碼)? | 44 bytes | 44 bytes | 38 bytes
𐑢𐑳𐑑 𐑦𐑟 𐑿𐑯𐑦𐑒𐑴𐑛? | 47 bytes | 50 bytes | 17 bytes
😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏 | 64 bytes | 64 bytes | 19 bytes
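
The UTF-8 and UTF-16 figures in the table can be reproduced with the standard library alone. The sketch below is illustrative only: byte_counts is a hypothetical helper, not part of this package. It assumes the UTF-16 column excludes a byte-order mark (which matches the figures shown), and it reports an SCSU count only when the scsu module is installed.

```python
def byte_counts(text: str) -> dict:
    """Return the encoded size of text, in bytes, per encoding."""
    counts = {
        "UTF-8": len(text.encode("utf-8")),
        # "utf-16-le" omits the byte-order mark that plain "utf-16" would add.
        "UTF-16": len(text.encode("utf-16-le")),
    }
    try:
        import scsu  # Importing the module registers the SCSU codec.
    except ImportError:
        return counts  # scsu is not installed: report the stdlib encodings only.
    counts["SCSU"] = len(text.encode("SCSU"))
    return counts


print(byte_counts("¿Qué es Unicode?"))
```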

Requirements

This package requires Python 3.10 or above.

Usage

Source code

Simply import the module and the SCSU codec is ready to use:

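import scsu  # Importing the module registers the SCSU codec.

s = "¿Qué es Unicode?"  # Any str value.
b = s.encode("SCSU")  # b holds the compressed bytes.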

To automatically add and remove a byte-order mark signature, use SCSU-SIG instead of SCSU.
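
Because the codec is looked up by name, it should work with the usual str.encode and bytes.decode pair. The following is a minimal sketch built around a hypothetical helper, roundtrip, that accepts any registered codec name. It is demonstrated with the standard library's utf-8-sig codec so that it runs without this package installed.

```python
import codecs


def roundtrip(text: str, codec: str) -> bool:
    """Encode text with the named codec, decode it again, and compare."""
    codecs.lookup(codec)  # Raises LookupError if the codec is not registered.
    data = text.encode(codec)
    return data.decode(codec) == text


# utf-8-sig adds a signature on encode and strips it on decode.
print(roundtrip("¿Qué es Unicode?", "utf-8-sig"))  # True
```

After import scsu, calling roundtrip(s, "SCSU") or roundtrip(s, "SCSU-SIG") would be expected to return True as well, assuming the codec decodes its own output losslessly.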

Command line interface

To compress a file, use the "encode" command:

python3 -m scsu encode unicode.txt > scsu.txt

To decompress a file, use the "decode" command:

python3 -m scsu decode scsu.txt > unicode.txt

To automatically add and remove a byte-order mark signature, add the -s option after the encode/decode command.
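
To drive the command line interface from a script, the argument list can be assembled programmatically. This is a sketch only: scsu_argv is a hypothetical helper, and the placement of -s directly after the command follows the description above rather than any documented argument parser.

```python
import sys


def scsu_argv(action: str, path: str, signature: bool = False) -> list:
    """Build an argument list for "python -m scsu <action> [-s] <path>"."""
    if action not in ("encode", "decode"):
        raise ValueError("action must be 'encode' or 'decode'")
    argv = [sys.executable, "-m", "scsu", action]
    if signature:
        argv.append("-s")  # Placed right after the encode/decode command.
    argv.append(path)
    return argv


print(scsu_argv("encode", "unicode.txt", signature=True)[1:])
# ['-m', 'scsu', 'encode', '-s', 'unicode.txt']
```

The resulting list can be passed to subprocess.run with stdout set to a file opened in binary mode, which mirrors the shell redirection shown above.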

Errata

Because of CPython bug #79792, the built-in open function does not flush the encoding buffer, as in this sample code:

with open(file, mode="w", encoding="SCSU-SIG") as f:
    f.write(s)  # Never flushes the encoding buffer.

A workaround is to import the codecs module, then replace open with codecs.open:

import codecs

with codecs.open(file, "w", encoding="SCSU-SIG") as f:
    f.write(s)  # Always flushes the encoding buffer.
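
An alternative that sidesteps incremental encoding altogether is to encode the whole string in memory and write the resulting bytes in binary mode. This is not the author's documented workaround, only a sketch under the assumption that the text fits comfortably in memory. write_encoded is a hypothetical helper, shown here with the standard library's utf-8-sig codec so it runs without this package.

```python
import os
import tempfile


def write_encoded(path: str, text: str, codec: str) -> None:
    """Encode text in one call, then write the bytes without a text layer."""
    data = text.encode(codec)  # One-shot encoding: nothing is left buffered.
    with open(path, "wb") as f:
        f.write(data)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "sample.bin")
    write_encoded(target, "¿Qué?", "utf-8-sig")
    with open(target, "rb") as f:
        print(f.read())
```

With this package imported, the same call with "SCSU-SIG" as the codec name should produce a complete file, because str.encode does not go through the incremental encoder used by open.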

Credits

The encoding logic is heavily based on a sample encoder originally written by Richard Gillam for his book Unicode Demystified and described in "A survey of Unicode compression" by Doug Ewell.

Enhancements to the encoding logic include:

  • A two-character lookahead buffer to avoid a case where switching from Unicode to single-byte mode requires two window switches.
  • Compression of sequential static window characters into a single new dynamic window, to avoid a case where a long string of punctuation is encoded as multiple quoted characters.
  • Use of the Latin-1 Supplement window whenever possible, so that transcoding text encoded as ISO-8859-1 yields a byte string that is valid as both SCSU and ISO-8859-1.

Decoding logic, however, is entirely original.
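
The Latin-1 Supplement enhancement listed above relies on the SCSU initial state: in single-byte mode, bytes 0x20 to 0x7F stand for ASCII, and bytes 0x80 to 0xFF are offsets into the active dynamic window, which starts at U+0080. A byte string that contains no tag bytes therefore reads the same under both encodings. The sketch below checks that condition. passes_through_scsu is a hypothetical helper and does not use or reflect this package's internals.

```python
# Control bytes that SCSU single-byte mode passes through unchanged;
# every other byte below 0x20 is a tag.
PASS_THROUGH_CONTROLS = frozenset({0x00, 0x09, 0x0A, 0x0D})


def passes_through_scsu(data: bytes) -> bool:
    """Return True if data contains no SCSU single-byte-mode tag bytes."""
    return all(b >= 0x20 or b in PASS_THROUGH_CONTROLS for b in data)


latin1 = "¿Qué es Unicode?".encode("iso-8859-1")
print(len(latin1), passes_through_scsu(latin1))  # 16 True
```

The 16-byte result agrees with the SCSU size listed for this sample in the table above.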
