Standard Compression Scheme for Unicode
This package implements SCSU as a Python text codec.
Benefits of Unicode compression
Compressed strings typically have fewer bytes than strings encoded as UTF-8 or UTF-16.
| Sample Text | In UTF-8 | In UTF-16 | In SCSU |
|---|---|---|---|
| ¿Qué es Unicode? | 18 bytes | 32 bytes | 16 bytes |
| Що таке Юнікод? | 27 bytes | 30 bytes | 16 bytes |
| Ի՞նչ է Յունիկոդը ? | 32 bytes | 36 bytes | 20 bytes |
| यूनिकोड क्या है? | 42 bytes | 32 bytes | 17 bytes |
| ユニコードとは何か? | 30 bytes | 20 bytes | 15 bytes |
| 什麼是Unicode(統一碼/標準萬國碼)? | 44 bytes | 44 bytes | 38 bytes |
| 𐑢𐑳𐑑 𐑦𐑟 𐑿𐑯𐑦𐑒𐑴𐑛? | 47 bytes | 50 bytes | 17 bytes |
| 😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏 | 64 bytes | 64 bytes | 19 bytes |
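The UTF-8 and UTF-16 columns can be reproduced with the standard library alone. The sketch below is one way to do that, not part of this package: the function name `encoded_sizes` is made up for illustration, the UTF-16 figure assumes the byte-order mark is not counted (which is what the Spanish and Ukrainian rows imply), and the SCSU figure is only filled in when this package is installed.

```python
from typing import Dict, Optional


def encoded_sizes(text: str) -> Dict[str, Optional[int]]:
    """Return the encoded size of text, in bytes, for each encoding."""
    sizes: Dict[str, Optional[int]] = {
        "UTF-8": len(text.encode("utf-8")),
        # "utf-16-le" leaves out the 2-byte byte-order mark that "utf-16" adds.
        "UTF-16": len(text.encode("utf-16-le")),
        "SCSU": None,
    }
    try:
        import scsu  # noqa: F401  (importing registers the codec)
    except ImportError:
        return sizes  # Package not installed: leave the SCSU size unknown.
    sizes["SCSU"] = len(text.encode("SCSU"))
    return sizes


print(encoded_sizes("¿Qué es Unicode?"))
```

Without the package installed, the SCSU entry stays `None` rather than guessing a value.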
Requirements
This package requires Python 3.10 or above.
Usage
Source code
Simply import the module and the SCSU codec is ready to use:
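    import scsu  # Importing the module registers the SCSU codec.
    s = "¿Qué es Unicode?"
    b = s.encode("SCSU")  # b holds the compressed bytes.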
To automatically add and remove a byte-order mark signature, use SCSU-SIG instead of SCSU.
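A quick way to get comfortable with the two codec names is a round-trip check. The following is a minimal sketch that assumes the package is installed; the helper name `roundtrip` is hypothetical, and the SCSU part is skipped when the import fails.

```python
def roundtrip(text: str, encoding: str) -> bytes:
    """Encode text, check that decoding restores it, and return the bytes."""
    data = text.encode(encoding)
    if data.decode(encoding) != text:
        raise ValueError(f"{encoding} round trip changed the text")
    return data


try:
    import scsu  # noqa: F401  (assumed installed; importing registers the codecs)
except ImportError:
    print("scsu is not installed; skipping the SCSU round trip")
else:
    sample = "Що таке Юнікод?"
    plain = roundtrip(sample, "SCSU")
    signed = roundtrip(sample, "SCSU-SIG")
    # Going by the description above, the SCSU-SIG bytes should carry an
    # extra signature that decoding removes again.
    print(len(plain), len(signed))
```

The helper works with any registered codec, so it can also be tried with a standard one such as "utf-8-sig", which adds and strips a signature in a comparable way.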
Command line interface
To compress a file, use the "encode" command:
    python3 -m scsu encode unicode.txt > scsu.txt
To decompress a file, use the "decode" command:
    python3 -m scsu decode scsu.txt > unicode.txt
To automatically add and remove a byte-order mark signature, add the -s option after the encode/decode command.
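The same commands can be driven from a Python script. This is a sketch only: `scsu_command` and `run_scsu` are hypothetical helpers, and the argument order simply mirrors the commands shown above (the -s option placed directly after encode/decode, with output captured from standard output as the shell redirection does).

```python
import subprocess
import sys
from typing import List


def scsu_command(action: str, source: str, signature: bool = False) -> List[str]:
    """Build the argument list for `python -m scsu <action> [-s] <source>`."""
    if action not in ("encode", "decode"):
        raise ValueError(f"unknown action: {action!r}")
    cmd = [sys.executable, "-m", "scsu", action]
    if signature:
        cmd.append("-s")  # Placed right after the encode/decode command.
    cmd.append(source)
    return cmd


def run_scsu(action: str, source: str, target: str, signature: bool = False) -> None:
    """Run the command and write its standard output to target."""
    with open(target, "wb") as out:
        subprocess.run(scsu_command(action, source, signature), stdout=out, check=True)
```

For example, `run_scsu("encode", "unicode.txt", "scsu.txt", signature=True)` would be the scripted counterpart of the encode command with -s, assuming the package is installed for the running interpreter.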
Errata
CPython bug #79792 prevents the following sample code from flushing the encoding buffer:
    with open(file, mode="w", encoding="SCSU-SIG") as f:
        f.write(s)  # Never flushes the encoding buffer.
A workaround is to import the codecs module, then replace open with codecs.open:
    import codecs
    with codecs.open(file, "w", encoding="SCSU-SIG") as f:
        f.write(s)  # Always flushes the encoding buffer.
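Another option, not taken from the project itself, is to sidestep the text layer: encode the whole string in one call and write the resulting bytes in binary mode, so nothing is left in an encoder buffer. The sketch below assumes the text fits in memory; `write_encoded` is a hypothetical helper, and the demonstration uses the standard "utf-8-sig" codec as a stand-in so that it runs without this package.

```python
import tempfile
from pathlib import Path


def write_encoded(path, text: str, encoding: str = "SCSU-SIG") -> int:
    """Encode text in one call, write the bytes, and return the byte count."""
    data = text.encode(encoding)  # One-shot encode: nothing stays buffered.
    return Path(path).write_bytes(data)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "sample.txt"
    write_encoded(target, "¿Qué?", encoding="utf-8-sig")
    assert target.read_bytes().decode("utf-8-sig") == "¿Qué?"
```

To use the default "SCSU-SIG" encoding, import scsu first so that the codec is registered.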
Credits
The encoding logic is heavily based on the sample encoder that Richard Gillam originally wrote for his book Unicode Demystified, as described in "A survey of Unicode compression" by Doug Ewell.
Enhancements to the encoding logic include:
- A two-character lookahead buffer to avoid a case where switching from Unicode to single-byte mode requires two window switches.
- Compression of sequential static window characters into a single new dynamic window, to avoid a case where a long string of punctuation is encoded as multiple quoted characters.
- Use of the Latin-1 Supplement window whenever possible, so that encoding Latin-1 text produces a byte string that is valid as both SCSU and ISO-8859-1.
Decoding logic, however, is entirely original.
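The Latin-1 enhancement listed above builds on SCSU's initial state as defined by the standard: a decoder starts in single-byte mode with a window over U+0080 to U+00FF active, so bytes 0x20 to 0xFF, plus NUL, tab, line feed, and carriage return, stand for themselves. The sketch below checks whether a string stays inside that subset. It is derived from the SCSU specification rather than from this package's code, and the name `latin1_is_scsu_passthrough` is made up for illustration.

```python
# C0 control codes that SCSU passes through unchanged in single-byte mode;
# the other values below 0x20 are reserved for SCSU tags.
PASSTHROUGH_CONTROLS = {0x00, 0x09, 0x0A, 0x0D}


def latin1_is_scsu_passthrough(text: str) -> bool:
    """Return True if the ISO-8859-1 bytes of text are also valid SCSU."""
    for ch in text:
        cp = ord(ch)
        if cp > 0xFF:
            return False  # Not representable in ISO-8859-1 at all.
        if cp < 0x20 and cp not in PASSTHROUGH_CONTROLS:
            return False  # Would collide with an SCSU tag byte.
    return True


sample = "¿Qué es Unicode?"
if latin1_is_scsu_passthrough(sample):
    data = sample.encode("latin-1")
    print(len(data))  # 16, the same size as the SCSU column for this sample
```

Text that fails the check, such as Cyrillic or a string containing U+0001, needs window switches or quoting and so cannot be byte-identical to ISO-8859-1.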