woolworm (Pre-Alpha State)

Hello Northwestern Digitization team (and anyone else who may be following along), and welcome to woolworm, your new (hopefully) one-stop shop for digitization. I have tried to abstract away as many of the intricacies of image transformation in Python as I can. While we are working on this grant, I will also be working on build automation and a CLI so that it is even easier for you all to use. The point of this repo is that, in case I die, someone else can keep developing it. Here is my current feature list; I am open to suggestions and requests, because I like this sort of thing:

Road to v0.1.0

  • API
    • Load image
    • Deskew
    • Intelligent document binarization/grayscale
    • Tesseract OCR
    • Standalone Ollama LLM OCR
    • Marker Document Understanding LLM OCR
    • HathiTrust (currently experimental in a standalone script)

Road to v1.0.0

  • Pipelines
    • Image processing
    • OCR (do we need a pipeline for this? It is a single function)
    • HathiTrust (migrated Brendan's Ruby script to Python)
    • ???
    • Profit
  • CLI (To be done later)
  • Figure out how the hell I publish a Python package

Automation, supercomputer interfacing, and remote directories will be handled in a different repository. This repository tracks one step of the data science process: data cleaning.

Prerequisites

You will want to familiarize yourself with the absolute basics of calling methods on objects in Python. If you want to use any of the LLM features, you will need to install Ollama. Feel free to contact me if you need help setting up Ollama.
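If calling methods on objects is new to you, the idea fits in a few lines of plain Python. The `Scan` class below is purely hypothetical and is not part of woolworm. It only illustrates the pattern of creating an object and then calling its methods, which is the same pattern the Quickstart relies on.

```python
class Scan:
    """A toy stand-in for a scanned page (hypothetical, not a woolworm class)."""

    def __init__(self, filename):
        # Data stored on the object when it is created
        self.filename = filename
        self.steps = []

    def deskew(self):
        # A method is a function that belongs to the object;
        # this one only records that the step was requested
        self.steps.append("deskew")
        return self  # returning self lets calls be chained

    def describe(self):
        # Summarize the file name and the steps requested so far
        return f"{self.filename}: {', '.join(self.steps) or 'untouched'}"


page = Scan("page_001.jpg")  # create an object from the class
page.deskew()                # call a method on that object
print(page.describe())       # prints "page_001.jpg: deskew"
```

The object before the dot holds the data, and the name after the dot is the action you are asking it to perform.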

Quickstart

If you are extremely impatient, you can get started with two lines of code:

from woolworm import Woolworm

Woolworm.Pipelines.process_image("inputfilename.jpg", "outputfilename.jpg")

Behind the scenes, the pipeline looks like this. You can find this code in the cookbook directory:

from woolworm import Woolworm

p = Woolworm()  # Creates an instance of the Woolworm class

f = "filename.jpg"
base_name = f.replace(".jpg", "")  # File name without the extension

# Step 1: Load the original image
img = p.load(f)

# Step 2: De-skew
img = p.deskew_with_hough(img)

# Step 3: This is kinda weird, and currently fine-tuned for use with NU's environmental impact statements.
# Long story short, the program uses some heuristics to detect whether the image is a diagram or mostly text.
# If the program thinks it is text, it will binarize; if it thinks it is a diagram, it will not.
img = p.binarize_or_gray(img)

# Step 4: Save the result
p.save_image(img)
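This README does not spell out which heuristics `binarize_or_gray` uses, so the following is only a minimal sketch of one plausible approach and should not be read as woolworm's implementation. It assumes that text pages are almost entirely near-black or near-white, while diagrams contain many mid-tones. The function names, the plain list-of-rows image format, and the `low`, `high`, `cutoff`, and `threshold` values are all illustrative assumptions.

```python
def midtone_fraction(gray, low=64, high=192):
    """Fraction of pixels that are neither near-black nor near-white."""
    pixels = [value for row in gray for value in row]
    midtones = sum(1 for value in pixels if low <= value <= high)
    return midtones / len(pixels)


def binarize_or_gray_sketch(gray, cutoff=0.2, threshold=128):
    """Binarize pages that look like text; leave diagram-like pages alone.

    `gray` is a non-empty list of rows of 0-255 integers. The cutoff and
    threshold are illustrative guesses, not woolworm's tuned settings.
    """
    if midtone_fraction(gray) < cutoff:
        # Few mid-tones: treat as text and map every pixel to 0 or 255
        return [[255 if value >= threshold else 0 for value in row] for row in gray]
    # Many mid-tones: treat as a diagram and keep the grayscale values
    return [list(row) for row in gray]


text_like = [[250, 10, 245], [5, 240, 250]]       # almost pure black and white
diagram_like = [[100, 120, 140], [160, 90, 130]]  # mostly mid-tones

print(binarize_or_gray_sketch(text_like))     # [[255, 0, 255], [0, 255, 255]]
print(binarize_or_gray_sketch(diagram_like))  # unchanged grayscale values
```

A real heuristic would likely need more than one signal (for example edge density or layout), which may be why the step is described as fine-tuned for one collection.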

Sample output: Sample Output in a nicely formatted table

Download files

  • Source distribution: woolworm-0.0.6.tar.gz (6.4 kB)
    • SHA256: 1bf87ba92bc8aa7b85db791e1103b1f72924870e18ad38d674152dfeb036c1d6
    • MD5: d2b6b4328182921d774e9c13d3180377
    • BLAKE2b-256: f3699374e6ae5a1d0ad4bbe3803f2749b3ec3f66aedf660c037de9b8db242653
  • Built distribution: woolworm-0.0.6-py3-none-any.whl (6.7 kB, Python 3)
    • SHA256: 7c2e5f91811d706e8424a2ab3e222ea674420a5a3b0e737dd896402dd4a5094b
    • MD5: 7ab46d7659dd5312b5a7e05d01c0ac65
    • BLAKE2b-256: 5d1d7b76bd20aa188930c93e87b4d7bf7053f58120f722c37497c1762a9d660d

Both files were uploaded via twine/6.1.0 (CPython/3.13.7) using Trusted Publishing. Publisher: python-publish.yml on nulib-ds/woolworm.