lzhw
A compression library to compress big lists and/or pandas DataFrames using an optimized algorithm (lzhw) developed from Lempel-Ziv, Huffman, and LZ-Welch techniques.
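lzhw's own implementation isn't reproduced here, but the LZ-Welch (LZW) technique it builds on can be sketched in a few lines. This is an illustrative textbook version, not the library's code: repeated runs of symbols get replaced by dictionary indices, and the dictionary grows as longer patterns are seen.

```python
def lzw_compress(seq):
    """Textbook LZW sketch: emit the dictionary index of the longest known prefix."""
    # seed the dictionary with each distinct symbol, in first-seen order
    codebook = {(s,): i for i, s in enumerate(dict.fromkeys(seq))}
    result, current = [], ()
    for symbol in seq:
        candidate = current + (symbol,)
        if candidate in codebook:
            current = candidate        # keep extending the current match
        else:
            result.append(codebook[current])
            codebook[candidate] = len(codebook)  # learn the new, longer pattern
            current = (symbol,)
    if current:
        result.append(codebook[current])
    return result

weather = ["Sunny", "Sunny", "Rain", "Sunny", "Sunny", "Rain"]
print(lzw_compress(weather))  # [0, 0, 1, 2, 1] -- 5 codes for 6 symbols
```

The repeated pair ("Sunny", "Sunny") is emitted as a single code the second time it appears, which is where the compression comes from on repetitive data.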
Quick Start
pip install lzhw
import lzhw
sample_data = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
               "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain",
               "Rain", "Rain", "Sunny", "Sunny", "Overcaste"]
compressed = lzhw.LZHW(sample_data)
## let's see what the compressed object looks like:
print(compressed)
# 1111101101010011111101101010011100000010
## its size
print(compressed.size())
# 32
## size of original
from sys import getsizeof
print(getsizeof(sample_data))
# 216
print(compressed.space_saving())
# space saving from original to compressed is 85%
## Let's decompress and check whether there is any information loss
decomp = compressed.decompress()
print(decomp == sample_data)
# True
As we saw, the LZHW class saved 85% of the space used to store the original list, without any loss. The class also has some useful helper methods, such as space_saving(), size(), and decompress() to revert back to the original.
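Part of that saving comes from the Huffman component of the algorithm: frequent symbols get shorter bit codes. A minimal sketch of Huffman coding (again, an illustration of the technique, not lzhw's implementation), using the weather frequencies from the sample above:

```python
import heapq
from collections import Counter

def huffman_codes(seq):
    """Minimal Huffman coding sketch: frequent symbols get shorter bit codes."""
    counts = Counter(seq)
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # heap entries: (subtree frequency, tiebreaker, symbols in the subtree)
    heap = [(f, i, [s]) for i, (s, f) in enumerate(counts.items())]
    heapq.heapify(heap)
    codes = {s: "" for s in counts}
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        for s in left:                      # left subtree gets a leading 0
            codes[s] = "0" + codes[s]
        for s in right:                     # right subtree gets a leading 1
            codes[s] = "1" + codes[s]
        heapq.heappush(heap, (f1 + f2, tiebreak, left + right))
        tiebreak += 1
    return codes

codes = huffman_codes(["Sunny"] * 7 + ["Rain"] * 7 + ["Overcast"] * 4)
# the rarest symbol ("Overcast") gets a code at least as long as the others'
```

Merging the two least frequent subtrees first is what pushes rare symbols deeper into the code tree, so the total bit count stays small.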
Another example with numeric data.
from random import sample, choices
numbers = choices(sample(range(0, 5), 5), k = 20)
comp_num = lzhw.LZHW(numbers)
print(getsizeof(numbers) > comp_num.size())
# True
print(numbers == list(map(int, comp_num.decompress()))) ## make it int again
# True
print(comp_num.space_saving())
# space saving from original to compressed is 88%
Let's look at how the compressed object is stored and what it looks like when printed. The LZHW class has an attribute called compressed, which is the encoded bit string stored as an integer.
print(comp_num.compressed) # how the compressed is saved (as integer of the bit string)
# 103596881534874
print(comp_num)
# 10111100011100010000111010100101101111110011010
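The integer and the printed bit string are two views of the same value; the integer is simply the bit string interpreted in base 2. Plain Python shows the round trip, using the exact values printed above:

```python
# The bit string printed above and the integer stored in comp_num.compressed
bits = "10111100011100010000111010100101101111110011010"
as_int = int(bits, 2)
print(as_int)  # 103596881534874 -- matches comp_num.compressed above
# bin() drops leading zeros, so pad back to the original length when restoring
restored = format(as_int, "0{}b".format(len(bits)))
assert restored == bits
```

Storing the bit string as one integer avoids the per-character overhead of keeping it as a Python string.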
We can also write the compressed data to files using the save_to_file method, and read it back and decompress it using the decompress_from_file function.
status = ["Good", "Bad", "Bad", "Bad", "Good", "Good", "Average", "Average", "Good",
          "Average", "Average", "Bad", "Average", "Good", "Bad", "Bad", "Good"]
comp_status = lzhw.LZHW(status)
comp_status.save_to_file("status.txt")
decomp_status = lzhw.decompress_from_file("status.txt")
print(status == decomp_status)
# True
Compressing DataFrames
lzhw doesn't work only on lists; it can also compress pandas DataFrames and save them into compressed files to decompress later.
import pandas as pd
df = pd.DataFrame({"a": [1, 1, 2, 2, 1, 3, 4, 4],
                   "b": ["A", "A", "B", "B", "A", "C", "D", "D"]})
comp_df = lzhw.CompressedDF(df)
Let's check the space saved by compression:
comp_space = 0
for i in range(len(comp_df.compressed)):
    comp_space += comp_df.compressed[i].size()
print(comp_space, getsizeof(df))
# 56 712
## Test information loss
print(comp_df.compressed[0].decompress() == list(map(str, df.a)))
# True
Saving and Loading Compressed DataFrames
With lzhw we can save a data frame into a compressed file and then read it back using the save_to_file method and the decompress_df_from_file function.
## Save to file
comp_df.save_to_file("comp_df.txt")
## Load the file
original = lzhw.decompress_df_from_file("comp_df.txt")
print(original)
# a b
#0 1 A
#1 1 A
#2 2 B
#3 2 B
#4 1 A
#5 3 C
#6 4 D
#7 4 D
Compressing Bigger DataFrames
Let's try to compress a real-world dataframe, the german_credit.xlsx file. The original file is 219 KB on disk.
gc_original = pd.read_excel("examples/german_credit.xlsx")
comp_gc = lzhw.CompressedDF(gc_original)
## Compare sizes in Python:
comp_space = 0
for i in range(len(comp_gc.compressed)):
    comp_space += comp_gc.compressed[i].size()
print(comp_space, getsizeof(gc_original))
# 12932 548852
print(comp_gc.compressed[0].decompress() == list(map(str, gc_original.iloc[:, 0])))
# True
Huge space saving, 97%, with no information loss!
Let's now write the compressed dataframe into a file and compare the sizes of the files.
comp_gc.save_to_file("gc_compressed.txt")
The compressed file is 87 KB, meaning that in total we saved around 60%. Future versions will be optimized to save more space.
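The on-disk saving follows from quick arithmetic on the two file sizes quoted above:

```python
# File sizes from the text above (KB on disk)
original_kb, compressed_kb = 219, 87
saving = 1 - compressed_kb / original_kb
print("total saving ~ {:.0%}".format(saving))  # total saving ~ 60%
```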
Let's now check whether we lose any information when we reload the file.
## Load the file
gc_original2 = lzhw.decompress_df_from_file("gc_compressed.txt")
print(list(gc_original2.iloc[:, 13]) == list(map(str, gc_original.iloc[:, 13])))
# True
print(gc_original.shape == gc_original2.shape)
# True
Perfect! There is no information loss at all.
Using the lzhw Command Line Interface
In the lzhw_cli folder, there is a Python script that can be used on the command line to compress and decompress files.
$python lzhw_cli.py
usage: lzhw_cli.py [-h] [-d] -f INPUT -o OUTPUT
lzhw_cli.py: error: the following arguments are required: -f/--input, -o/--output
$python lzhw_cli.py -h
usage: lzhw_cli.py [-h] [-d] -f INPUT -o OUTPUT
Data Frame Compressor
optional arguments:
  -h, --help            show this help message and exit
  -d, --decompress      decompress input into output
  -f INPUT, --input INPUT
                        input file to be (de)compressed
  -o OUTPUT, --output OUTPUT
                        output where to save result
$python lzhw_cli.py -f "file_to_compress" -o "output"
compressed successfully
$python lzhw_cli.py -d -f "file_to_decompress" -o "output"
decompressed successfully
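Judging from the help text above, the script's argument handling corresponds to a standard argparse setup. A sketch that reproduces the same interface (this is an assumption about the structure, not the actual lzhw_cli.py source):

```python
import argparse

def build_parser():
    """argparse setup matching the help text shown above."""
    parser = argparse.ArgumentParser(description="Data Frame Compressor")
    parser.add_argument("-d", "--decompress", action="store_true",
                        help="decompress input into output")
    parser.add_argument("-f", "--input", required=True,
                        help="input file to be (de)compressed")
    parser.add_argument("-o", "--output", required=True,
                        help="output where to save result")
    return parser

# parse the same flags as the compression example above
args = build_parser().parse_args(["-f", "file_to_compress", "-o", "output"])
print(args.decompress, args.input, args.output)  # False file_to_compress output
```

Because -f and -o are declared with required=True, omitting them reproduces the "the following arguments are required" error shown earlier.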