
A package for Unicode-friendly string compression using Unishox2


unishox2-py3


This package enables Python projects to easily use Unishox2 from siara-cc/Unishox, which is a C library for compressing short strings. Unishox2 has many potential applications, and this package can enable developers to make use of it for several of those:

  • ✅ Unicode-native text compression. Unishox2 is NOT English- or ASCII-only!
  • ✅ Bandwidth and storage cost reduction for databases and/or cloud services.
  • ⚠️ Byte columns can result in a faster retrieval speed when used as join keys in RDBMSes.
    • Author's note: haven't tested this, but I'd trust the claim generally
  • ⛔️ Compression for low memory devices such as Arduino and ESP8266.
    • Author's note: just use the C bindings for this, they're very approachable

Want to learn more about Unishox2? Read the source paper here.

How to Use

This package is available on PyPI and can be installed with pip3 install unishox2-py3. Please note that this package only supports Python 3 and does not work with Python 2. You can see its CI status and testing matrix in this repository's Actions tab.

Getting started with unishox2-py3 is easy. If you want to give it a try via the command line, you can use demo.py to compress some sample strings or try one of your own.

If you're looking to integrate, unishox2-py3 currently provides two functions that pass data to Unishox2's corresponding simple APIs. Both use the default optimization preset, which is good for most data. These are:

  • unishox2.compress(str)
    • Arguments:
      • str - This requires a Unicode string as input (generally, this is your default in Python).
    • Returns a tuple:
      • bytes - The compressed data.
      • int - The original length of the string.
  • unishox2.decompress(bytes, int)
    • Takes two arguments:
      • bytes - The compressed data.
      • int - The original length of the string.
    • Returns:
      • str - A string, the original data.

Taken together, this looks like:

import unishox2

# the string we want to compress
original_data = "What the developers know:\n1. Whole codebase is spaghetti\n2. Also, spaghetti is delicious."

# drop that in as-is, nothing else is needed for compression
compressed_data, original_size = unishox2.compress(original_data)
# compressed_data now holds bytes, such as b'\x87\xbfi\x85\x1d\x9a\xe9\xfd ...'
# original_size now holds an integer, such as 89

# to get the original string back, we need compressed_data AND original_size
decompressed_data = unishox2.decompress(compressed_data, original_size)
# decompressed_data now holds a string, such as "What the developers know:\n..."

Important Notes

First, you must have the original_size, or know the maximum original_size possible for your data, because Unishox2 does not dynamically allocate memory for the resulting string when decompressing. If you need to track the exact size (e.g. some documents are kilobytes while others are gigabytes) and you are saving the Unishox2-compressed data to a database, you must store the original_size value as well.
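A minimal sketch of keeping original_size next to the compressed bytes, using SQLite from the standard library. The table name, column names, and the helper functions open_store, save_document, and load_document are hypothetical and not part of unishox2-py3. The codec is passed in as a pair of callables with the same shapes described above, so unishox2.compress and unishox2.decompress should be usable here as-is.

```python
import sqlite3
from typing import Callable, Tuple

# Shapes of the two functions described in this README:
# compress(str) -> (bytes, int) and decompress(bytes, int) -> str
Compress = Callable[[str], Tuple[bytes, int]]
Decompress = Callable[[bytes, int], str]


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database with one table holding the compressed body and its original size."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "id INTEGER PRIMARY KEY, "
        "body BLOB NOT NULL, "
        "original_size INTEGER NOT NULL)"
    )
    return conn


def save_document(conn: sqlite3.Connection, text: str, compress: Compress) -> int:
    """Compress text and store both values returned by the codec; return the new row id."""
    body, original_size = compress(text)
    cursor = conn.execute(
        "INSERT INTO documents (body, original_size) VALUES (?, ?)",
        (body, original_size),
    )
    conn.commit()
    return cursor.lastrowid


def load_document(conn: sqlite3.Connection, doc_id: int, decompress: Decompress) -> str:
    """Fetch the stored bytes together with original_size and decompress them."""
    row = conn.execute(
        "SELECT body, original_size FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    if row is None:
        raise KeyError(doc_id)
    body, original_size = row
    return decompress(body, original_size)
```

With the package installed, usage would look like save_document(conn, text, unishox2.compress) followed by load_document(conn, doc_id, unishox2.decompress).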

As mentioned before, any reasonable maximum for the resultant data also works. So if you are storing usernames that must be 3-20 characters in length, you can skip saving the original_size and use 20 as the original_size for all values during decompression.
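One caveat when picking a fixed maximum for Unicode data: this README does not say whether original_size counts characters or bytes. The sketch below assumes, without confirmation, that it counts UTF-8 bytes, in which case a 20-character limit can need more than 20 bytes. The helper conservative_original_size is hypothetical. It only over-allocates, so it is harmless if the assumption turns out to be wrong.

```python
# UTF-8 encodes every Unicode code point in at most 4 bytes.
UTF8_MAX_BYTES_PER_CODE_POINT = 4


def conservative_original_size(max_chars: int) -> int:
    """Upper bound on the UTF-8 byte length of any string of at most max_chars characters."""
    if max_chars < 0:
        raise ValueError("max_chars must be non-negative")
    return max_chars * UTF8_MAX_BYTES_PER_CODE_POINT


ascii_name = "a" * 20
cjk_name = "名" * 20

print(len(ascii_name.encode("utf-8")))  # 20 bytes: one byte per ASCII character
print(len(cjk_name.encode("utf-8")))    # 60 bytes: three bytes per character here
print(conservative_original_size(20))   # 80 bytes: safe for any 20-character string
```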

Conversely, if you give an original_size value that is too small, too little memory will be allocated. Unishox2 will then write past the end of that buffer (it is C under the hood, bound to Python as a module), and your Python program is likely to crash.

Performance

While Unishox2 doesn't guarantee compression for all short strings (see the test cases for some examples where the output is larger than the input), it tends to compress better than many competitors in real-world use cases involving short strings. In addition, because unishox2-py3 uses a C module instead of reimplementing Unishox2 in Python, the performance overhead is acceptable for most applications.

When tested on Reddit data (technical subreddits, mostly English-oriented, 3.3m entries), the average number of bytes required for storing each post's title was:

  • Original: 60.34
  • zlib(1): 61.83 (+2.47%)
  • zlib(9): 61.80 (+2.42%)
  • smaz: 43.46 (-27.98%)
  • Unishox2: 40.08 (-33.58%)

And the average number of bytes required for storing each text post's body was:

  • Original: 561.07
  • zlib(1): 319.93 (-42.98%)
  • zlib(9): 312.87 (-44.23%)
  • smaz: 369.04 (-34.23%)
  • Unishox2: 310.56 (-44.65%)

And the average number of bytes required for storing the URL that any link posts pointed to:

  • Original: 25.72
  • zlib(1): 30.08 (+16.96%)
  • zlib(9): 30.08 (+16.96%)
  • smaz: 20.78 (-19.21%)
  • Unishox2: 19.76 (-23.16%)

Unishox2 shows clear benefits over traditional compressors when compressing short strings, and remains comparable even on moderate-length documents. Unishox2 would be expected to pull farther ahead of smaz for non-English posts as well, though I don't have data to test that. I welcome a PR with additional tests.
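The zlib rows that come out larger than the original are easy to reproduce with the standard library alone. The sketch below is not the benchmark used for the numbers above. It measures a single made-up URL, and the helper zlib_sizes is hypothetical. It shows the fixed header and checksum overhead that zlib adds to every output, which tends to outweigh any savings on very short inputs.

```python
import zlib


def zlib_sizes(text: str, levels=(1, 9)) -> dict:
    """Return the UTF-8 size of text and its zlib-compressed size at each level."""
    raw = text.encode("utf-8")
    sizes = {"original": len(raw)}
    for level in levels:
        sizes[f"zlib({level})"] = len(zlib.compress(raw, level))
    return sizes


# A short sample URL, chosen only for illustration.
sample_url = "https://example.com/a/b"

for name, size in zlib_sizes(sample_url).items():
    print(f"{name}: {size} bytes")
```

For an input this short, both zlib levels produce more bytes than the original, which matches the direction of the title and URL results above.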

Credits

First and foremost, thank you to Arun of Siara Logics for the incredible compression library - Unishox2 is lean, fast, and versatile. I am looking forward to using this in my projects and hope it benefits others as well.

In addition, this package is largely based on work from originell/smaz-py3, and would not have been developed as quickly or tested as thoroughly without Luis' lead.

Finally, I would like to thank Josh Bicking for his debugging insights and pragmatic thoughts on C as a whole.
