
String Processing Tools



Jerome

A collection of functions that illustrate several techniques useful in text compression.

Table of Contents

  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact

About The Project

Jerome is a library of string functions written in pure Python. The library's name is taken from St. Jerome of Stridon, who is considered the patron saint of archivists.

  • Zero dependencies[^1]
  • 100% test coverage

[^1]: Examples use augustine_text to generate material to compress

Getting Started

To get a local copy up and running, follow these simple steps.

Installing with pip

pip install jerome

For information about cloning and dev setup see: Contributing

Usage

Here is an example showing basic usage.

from datetime import datetime

from jerome import (SymbolKeeper, common, forward_bw, replacer, reverse_bw,
                    runlength_decode, runlength_encode)
from jerome.sample_text import words


# 75K words of procedurally generated text.
# This is about the length of a novel.
text = words(75000)
text_length = len(text)

compression_start = datetime.now()
# SymbolKeeper is used to portion out unused symbols
k = SymbolKeeper(
    reserved=set(text)
)  # These appear in our text so we don't want to use them as placeholders

# common is a utility function for finding commonly occurring words
# We're using k from above to create a dictionary where each key is a word
#  and the value is a single symbol replacement for that word
replacements = {word: next(k) for word in common(text, min_length=4)}
# {'dolore': '\x00', 'elit,': '\x02', 'labore': '\x03', ...

# Run replacements
replaced = replacer(text, replacements)
# Apply the Burrows-Wheeler transform to improve the run-length result
transformed = forward_bw(replaced)
# Run-length encode
runcoded = runlength_encode(transformed)

print(
    f"""| step | result |
| ---- | ------ |
| Original Text size | {text_length} |
| With words replaced | {len(replaced)} |
| Encoded | {(rlen := len(runcoded))} |
| Reasonable length | {(dlen := len(str([(k,v) for k,v in replacements.items()])))} |
| Compressed size % | {round(((rlen+dlen)/text_length)*100, 2)} |
"""
)
compression_end = datetime.now()


# Reverse the whole thing
assert (unruncoded := runlength_decode(runcoded)) == transformed
assert (untransformed := reverse_bw(unruncoded)) == replaced
assert replacer(untransformed, replacements, reverse=True) == text
print(
    f"| Compression time |  {round((compression_end-compression_start).total_seconds() * 1000.0)} ms |"
)
print(
    f"| Decompression time |  {round((datetime.now()-compression_end).total_seconds() * 1000.0)} ms |"
)
print(
    f"| Total time |  {round((datetime.now()-compression_start).total_seconds() * 1000.0)} ms |"
)

Example compression of randomized text:

| step | result |
| ---- | ------ |
| Original Text length | 402270 |
| With words replaced | 198228 |
| Encoded | 74160 |
| Reasonable length | 5724 |
| Compressed size % | 19.86 |
| Compression time | 1045 ms |
| Decompression time | 172 ms |
| Total time | 1217 ms |
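
The "Compressed size %" row can be reproduced from the other rows. In the example code, "Reasonable length" is the length of the string form of the replacement dictionary, and it is added to the encoded length before dividing by the original length. A small sketch of that arithmetic, using the figures from the table (the variable names are chosen here for illustration):

```python
# Figures copied from the example results table above.
original_length = 402270    # "Original Text length"
encoded_length = 74160      # "Encoded"
dictionary_length = 5724    # "Reasonable length" (stringified replacements)

# Same formula as in the example: (rlen + dlen) / text_length * 100
compressed_percent = round(
    (encoded_length + dictionary_length) / original_length * 100, 2
)
print(compressed_percent)  # 19.86
```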

NOTE: Times were measured on a Ryzen 3600X @ 3.9 GHz.
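
To see why the example applies a Burrows-Wheeler transform before run-length encoding, here is a minimal standalone sketch of both techniques. It is not Jerome's implementation: the function names, the `end` marker parameter, and the naive sort-all-rotations approach are assumptions made for illustration, and the approach is far too slow for novel-length text.

```python
from itertools import groupby


def bwt_forward(text: str, end: str = "\x00") -> str:
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
    if end in text:
        raise ValueError("end marker must not appear in the text")
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def bwt_inverse(transformed: str, end: str = "\x00") -> str:
    """Invert the transform by repeatedly prepending the column and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    # The original text is the row that ends with the end marker.
    original = next(row for row in table if row.endswith(end))
    return original[:-1]


def rle_encode(text: str) -> list:
    """Collapse each run of identical characters into a (char, count) pair."""
    return [(ch, len(list(group))) for ch, group in groupby(text)]


def rle_decode(pairs: list) -> str:
    """Expand (char, count) pairs back into the original string."""
    return "".join(ch * count for ch, count in pairs)


transformed = bwt_forward("banana", end="$")
print(transformed)  # annb$aa

# The transform groups equal characters together, so there are fewer runs.
print(len(rle_encode("banana$")), len(rle_encode(transformed)))  # 7 5

# Both steps are reversible.
assert rle_decode(rle_encode(transformed)) == transformed
assert bwt_inverse(transformed, end="$") == "banana"
```

The transform does not shrink the text by itself. It only rearranges characters so that repeats sit next to each other, which is what gives the run-length step something to collapse.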

This is only an example of how the text functions in Jerome work. Python's built-in bz2 module is both many times faster and many times better at compressing than the above example.
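
For real compression work, a minimal sketch of the standard-library route looks like the following. The sample text here is a short repeated phrase chosen for illustration, not the generated text used in the example above, so the ratio it prints is not comparable to the table.

```python
import bz2

# Illustrative input only: a highly repetitive string.
text = "lorem ipsum dolor sit amet " * 1000
raw = text.encode("utf-8")

# bz2 works on bytes, so the text is encoded first.
compressed = bz2.compress(raw)

# Decompressing must give back exactly the original bytes.
assert bz2.decompress(compressed) == raw

ratio = round(len(compressed) / len(raw) * 100, 2)
print(f"{len(raw)} bytes -> {len(compressed)} bytes ({ratio}%)")
```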

Additional Documentation

Roadmap

See the open issues for a list of proposed features (and known issues).

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Add tests; we aim for 100% test coverage (see: Using Coverage)
  4. Commit your Changes (git commit -m 'Add some AmazingFeature')
  5. Push to the Branch (git push origin feature/AmazingFeature)
  6. Open a Pull Request

Cloning / Development setup

  1. Clone the repo and install
    git clone https://github.com/kajuberdut/Jerome.git
    cd Jerome
    pipenv install --dev
    
  2. Run tests
    pipenv shell
    py.test
    

For more about pipenv see: Pipenv GitHub

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Patrick Shechet - patrick.shechet@gmail.com

Project Link: https://github.com/kajuberdut/Jerome
