String Processing Tools
Jerome
A collection of functions that illustrate several techniques useful in text compression.
About The Project
Jerome is a library of string functions written in pure Python. The library's name is taken from St. Jerome of Stridon, who is considered the patron saint of archivists.
- Zero dependencies[^1]
- 100% test coverage
[^1]: Examples use augustine_text to generate material to compress
Getting Started
To get a local copy up and running follow these simple steps.
Installing with pip
pip install jerome
For information about cloning and dev setup see: Contributing
Usage
Here is an example showing basic usage.
from datetime import datetime
from jerome import (SymbolKeeper, common, forward_bw, replacer, reverse_bw,
                    runlength_decode, runlength_encode)
from jerome.sample_text import words
# 75K words of procedurally generated text
# This is about the length of a novel.
text = words(75000)
text_length = len(text)
compression_start = datetime.now()
# SymbolKeeper is used to portion out unused symbols
k = SymbolKeeper(
    reserved=set(text)
)  # These appear in our text so we don't want to use them as placeholders
# common is a utility function for finding commonly occurring words
# We're using k from above to create a dictionary where each key is a word
# and the value is a single symbol replacement for that word
replacements = {word: next(k) for word in common(text, min_length=4)}
# {'dolore': '\x00', 'elit,': '\x02', 'labore': '\x03', ...
# Run replacements
replaced = replacer(text, replacements)
# Burrows Wheeler transform the text to improve runlength result
transformed = forward_bw(replaced)
# Runlength encode
runcoded = runlength_encode(transformed)
print(
    f"""| step | result |
| ---- | ------ |
| Original Text size | {text_length} |
| With words replaced | {len(replaced)} |
| Encoded | {(rlen := len(runcoded))} |
| Reasonable length | {(dlen := len(str(list(replacements.items()))))} |
| Compressed size % | {round(((rlen+dlen)/text_length)*100, 2)} |
"""
)
compression_end = datetime.now()
# Reverse the whole thing
assert (unruncoded := runlength_decode(runcoded)) == transformed
assert (untransformed := reverse_bw(unruncoded)) == replaced
assert replacer(untransformed, replacements, reverse=True) == text
print(
    f"| Compression time | {round((compression_end-compression_start).total_seconds() * 1000.0)} ms |"
)
print(
    f"| Decompression time | {round((datetime.now()-compression_end).total_seconds() * 1000.0)} ms |"
)
print(
    f"| Total time | {round((datetime.now()-compression_start).total_seconds() * 1000.0)} ms |"
)
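For readers unfamiliar with the two transforms used in the last steps of the listing above, here is a minimal standalone sketch of a Burrows-Wheeler transform and run-length coding. It is not Jerome's implementation: the function names (`naive_bwt`, `naive_inverse_bwt`, `run_lengths`, `expand_runs`) and the `SENTINEL` marker are hypothetical, chosen only for illustration, and the sorted-rotations approach is far too slow for novel-length text.

```python
from itertools import groupby

# Hypothetical end-of-text marker (an assumption); it must not occur in the input.
SENTINEL = "\x03"


def naive_bwt(text: str) -> str:
    """Forward Burrows-Wheeler transform built from sorted rotations."""
    if SENTINEL in text:
        raise ValueError("input must not contain the sentinel character")
    s = text + SENTINEL
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def naive_inverse_bwt(transformed: str) -> str:
    """Invert naive_bwt by repeatedly prepending the transform and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    # The original text is the row that ends with the sentinel.
    original = next(row for row in table if row.endswith(SENTINEL))
    return original[:-1]


def run_lengths(text: str) -> list:
    """Collapse consecutive repeats into (character, count) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(text)]


def expand_runs(runs) -> str:
    """Rebuild the text from (character, count) pairs."""
    return "".join(ch * count for ch, count in runs)
```

The transform tends to place equal characters next to each other, which is what gives the run-length step longer runs to collapse. For example, `naive_bwt("banana")` puts the two `n` characters side by side.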
Example compression of randomized text:
| step | result |
| ---- | ------ |
| Original Text length | 402270 |
| With words replaced | 198228 |
| Encoded | 74160 |
| Reasonable length | 5724 |
| Compressed size % | 19.86 |
| Compression time | 1045 ms |
| Decompression time | 172 ms |
| Total time | 1217 ms |
NOTE: Times were measured on a Ryzen 3600X @ 3.9 GHz.
This is only an example of how the text functions in Jerome work. Python's built-in bz2 module is both many times faster and many times better at compressing than the example above.
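To get a bz2 baseline for comparison, something like the following sketch could be used. It relies only on the standard library; the small vocabulary and the generated `text` are stand-ins (an assumption) for the procedurally generated sample text, so the ratio it prints is not comparable to the table above unless you substitute the same input. Note that bz2 works on bytes, so sizes here are byte counts rather than character counts.

```python
import bz2
import random

# Stand-in input (an assumption): words drawn at random from a small vocabulary.
vocabulary = ["lorem", "ipsum", "dolor", "sit", "amet",
              "consectetur", "adipiscing", "elit"]
rng = random.Random(0)  # fixed seed so repeated runs use the same text
text = " ".join(rng.choice(vocabulary) for _ in range(10_000))

raw = text.encode("utf-8")
compressed = bz2.compress(raw)

# Compressed size as a percentage of the original byte length.
ratio = len(compressed) / len(raw) * 100
print(f"| bz2 compressed size % | {ratio:.2f} |")

# Round trip to confirm nothing was lost.
restored = bz2.decompress(compressed).decode("utf-8")
assert restored == text
```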
Additional Documentation
Roadmap
- In place BWT
- Huffman Coding
- Additional examples
See the open issues for a list of proposed features (and known issues).
Contributing
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Add tests; we aim for 100% test coverage (see Using Coverage)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Cloning / Development setup
- Clone the repo and install
git clone https://github.com/kajuberdut/Jerome.git
cd Jerome
pipenv install --dev
- Run tests
pipenv shell
py.test
For more about pipenv see: Pipenv Github
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Patrick Shechet - patrick.shechet@gmail.com
Project Link: https://github.com/kajuberdut/Jerome