woolworm (Pre-Alpha State)
Hello Northwestern Digitization team (and anyone else who may be following along), and welcome to woolworm, your new (hopefully) one-stop shop for digitization. I have tried, to the best of my ability, to abstract away as many of the intricacies of image transformation in Python as possible. While we are working on this grant, I will also be working on build automation and a CLI so that it is even easier to use. The point of this repo is that, in case I die, someone else can keep developing it. Here is my current feature list; I am open to suggestions and requests, because I like this sort of thing:
Road to v0.1.0
- API
- Load image
- Deskew
- Intelligent document binarization/grayscale
- Tesseract OCR
- Standalone Ollama LLM OCR
- Marker Document Understanding LLM OCR
- HathiTrust (currently experimental in a standalone script)
Road to v1.0.0
- Pipelines
- Image processing
- OCR (do we need a pipeline for this? It is a single function)
- HathiTrust (migrated Brendan's Ruby script to Python)
- ???
- Profit
- CLI (To be done later)
- Figure out how the hell I publish a Python package
Automation, supercomputing interfaces, and remote directories will be handled in a different repository. This repository tracks one step of the data science process: data cleaning.
Prerequisites
You will want to familiarize yourself with the absolute basics of calling methods on objects. If you want to use any LLMs, you will need to install Ollama. Feel free to contact me if you need help setting up Ollama.
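If you are unsure whether Ollama is set up, a quick check like the one below may help. It is a minimal sketch and not part of woolworm: it assumes Ollama is serving on its default local address, http://localhost:11434, and the helper name ollama_is_running is made up for illustration.

```python
import urllib.error
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with 200 at base_url.

    The default base_url is an assumption: Ollama's standard local address.
    """
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

Calling ollama_is_running() before an LLM OCR step lets a script fail early with a clear message instead of partway through a batch.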
Quickstart
If you are extremely impatient, you can get started with two lines of code:
from woolworm import Woolworm
Woolworm.Pipelines.process_image("inputfilename.jpg", "outputfilename.jpg")
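To run the same call over a whole folder of scans, a small wrapper along these lines may be enough. This is a sketch, not part of woolworm: process_directory and its parameters are hypothetical names, and it only assumes that the callable you pass in takes an input path and an output path as strings, as Woolworm.Pipelines.process_image does above.

```python
from pathlib import Path
from typing import Callable, List


def process_directory(
    input_dir: str,
    output_dir: str,
    process: Callable[[str, str], object],
    pattern: str = "*.jpg",
) -> List[Path]:
    """Apply process(input_path, output_path) to every matching file in input_dir."""
    out_root = Path(output_dir)
    out_root.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(input_dir).glob(pattern)):
        dst = out_root / src.name  # keep the original file name
        process(str(src), str(dst))
        written.append(dst)
    return written
```

With woolworm installed, the call would presumably look like process_directory("scans", "cleaned", Woolworm.Pipelines.process_image), where "scans" and "cleaned" are placeholder folder names.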
Behind the scenes, process_image looks like this. You can find this code in the cookbook directory:
from woolworm import Woolworm
p = Woolworm()  # Create a Woolworm instance
f = "filename.jpg"
base_name = f.replace(".jpg", "")
# Step 1: Load original
img = p.load(f)
# Step 2: de-skew
img = p.deskew_with_hough(img)
# Step 3: This is kinda weird, and currently fine-tuned for use with NU's environmental impact statements.
# Long story short, the program uses some heuristics to detect whether the image is a diagram or mostly text.
# If the program thinks it is text, it will binarize; if it thinks it is a diagram, it will not.
img = p.binarize_or_gray(img)
p.save_image(img)
Sample output:
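To make Step 3 more concrete, here is a toy version of a text-versus-diagram decision. It is not woolworm's actual heuristic, which this README does not spell out; the function names, the mid-tone rule, and the thresholds (64, 192, 0.2, 128) are all assumptions chosen for illustration. The idea is that scanned text is mostly near-black ink on near-white paper, while diagrams and photographs tend to contain more mid-tone grays.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels, 0 (black) to 255 (white)


def midtone_fraction(img: Image, low: int = 64, high: int = 192) -> float:
    """Fraction of pixels that are neither clearly dark nor clearly light."""
    pixels = [p for row in img for p in row]
    if not pixels:
        return 0.0
    return sum(low <= p <= high for p in pixels) / len(pixels)


def binarize_or_gray_sketch(img: Image, max_midtones: float = 0.2, threshold: int = 128) -> Image:
    """Binarize pages that look like text; leave gray-heavy pages untouched."""
    if midtone_fraction(img) > max_midtones:
        return img  # probably a diagram or photo: keep grayscale
    # Mostly black-and-white content: snap every pixel to pure black or white.
    return [[255 if p >= threshold else 0 for p in row] for row in img]
```

A real implementation would work on image arrays rather than nested lists, but the shape of the decision (measure something about the page, then choose whether to binarize) is the same.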