Archival document management as a reusable Django app

grime

Grime is an archival document management platform, packaged as a reusable Django app. It ingests scanned documents, runs OCR and named-entity recognition (NER) pipelines over them, and lets you annotate pages with tagged regions.

Models

Model         Role
Document      A standalone archival document (PDF, scanned booklet, ledger).
DocumentPage  A single page of a Document; the unit of OCR processing.
Word          One OCR-extracted word on a DocumentPage, with bbox, confidence, optional human correction, and optional BIO NER label.
Tag           A manually drawn region on a Document or DocumentPage with a label and word-level subcomponents.
OCRPass       Audit record for one OCR run on a DocumentPage.
NERPass       Audit record for one NER run on a DocumentPage.
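
The "BIO NER label" on Word follows the usual begin/inside/outside tagging scheme: a B- tag opens an entity, matching I- tags continue it, and O marks words outside any entity. The sketch below shows how a sequence of labelled words could be collapsed into entity spans. It deliberately avoids Django: WordStub and its field names (text, ner_label) are hypothetical stand-ins, not grime's actual model fields.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class WordStub:
    """Hypothetical stand-in for a grime Word row; field names are assumed."""
    text: str
    ner_label: Optional[str] = None  # e.g. "B-PER", "I-PER", "O", or None


def group_entities(words: List[WordStub]) -> List[Tuple[str, str]]:
    """Collapse a BIO-labelled word sequence into (entity_type, text) spans."""
    entities: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Emit the open span (if any) and reset the accumulator.
        nonlocal current_type, current_tokens
        if current_type is not None:
            entities.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for word in words:
        label = word.ner_label or "O"
        prefix, _, entity_type = label.partition("-")
        if prefix == "B" and entity_type:
            flush()
            current_type, current_tokens = entity_type, [word.text]
        elif prefix == "I" and entity_type and entity_type == current_type:
            current_tokens.append(word.text)
        else:
            # "O", unlabelled, or a stray I- tag with no matching open span:
            # close whatever is open and skip the word.
            flush()
    flush()
    return entities
```

For example, words labelled B-PER, I-PER, unlabelled, B-LOC yield two spans: one PER entity made of the first two words and one LOC entity made of the last.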

Management commands

python manage.py ocr        --document 42 [--page N] [--textract] [--force] [--dry-run]
python manage.py ner        --document 42 [--page N] [--threshold 0.85] [--force] [--dry-run]
python manage.py match_tags --label "member entry" [--source-document 3] [--target-document 5] \
                            [--create-tags] [--force] [--min-score 0.5] [--tolerance 0.08]
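
For batch work it can help to script these commands rather than type them. The following is a minimal sketch that builds the argument lists for ocr and ner using only the flags shown above, then runs them in sequence with subprocess. The helper names (ocr_argv, ner_argv, run_pipeline) are hypothetical and not part of grime, and it assumes manage.py is in the current working directory.

```python
import subprocess
import sys
from typing import List, Optional


def ocr_argv(document: int, page: Optional[int] = None, textract: bool = False,
             force: bool = False, dry_run: bool = False) -> List[str]:
    """Build the argv for `manage.py ocr` from the documented flags."""
    argv = [sys.executable, "manage.py", "ocr", "--document", str(document)]
    if page is not None:
        argv += ["--page", str(page)]
    if textract:
        argv.append("--textract")
    if force:
        argv.append("--force")
    if dry_run:
        argv.append("--dry-run")
    return argv


def ner_argv(document: int, page: Optional[int] = None,
             threshold: Optional[float] = None, force: bool = False,
             dry_run: bool = False) -> List[str]:
    """Build the argv for `manage.py ner` from the documented flags."""
    argv = [sys.executable, "manage.py", "ner", "--document", str(document)]
    if page is not None:
        argv += ["--page", str(page)]
    if threshold is not None:
        argv += ["--threshold", str(threshold)]
    if force:
        argv.append("--force")
    if dry_run:
        argv.append("--dry-run")
    return argv


def run_pipeline(document: int, dry_run: bool = True) -> None:
    """Run OCR, then NER, for one document; stop if either command fails."""
    for argv in (ocr_argv(document, dry_run=dry_run),
                 ner_argv(document, dry_run=dry_run)):
        subprocess.run(argv, check=True)
```

Calling run_pipeline(42) would run both commands with --dry-run for document 42. Defaulting dry_run to True is a choice made for this sketch so that an exploratory call passes the documented --dry-run flag unless told otherwise.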

Quick start

pip install -e ".[dev]"
python manage.py migrate
python manage.py createsuperuser
python manage.py runserver
# then visit http://127.0.0.1:8000/admin/

Optional dependencies

Extra     Adds
ocr       Tesseract (pytesseract, opencv-python, numpy)
textract  AWS Textract via boto3
hf        HuggingFace historical NER (transformers, torch)
viz       match_tags --video rendering (imageio[ffmpeg])
dev       All of the above
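
To see which extras are usable in the current environment, one option is to probe for the Python packages each extra pulls in. This is a rough sketch, not a grime feature: the mapping below is transcribed from the table above using each package's import name (opencv-python imports as cv2), and the function names are hypothetical. It only checks that the Python modules can be found; it does not verify system binaries such as tesseract.

```python
from importlib.util import find_spec
from typing import Dict, List

# Import names for the packages listed under each extra (assumed mapping).
EXTRA_MODULES: Dict[str, List[str]] = {
    "ocr": ["pytesseract", "cv2", "numpy"],
    "textract": ["boto3"],
    "hf": ["transformers", "torch"],
    "viz": ["imageio"],
}


def missing_modules(extra: str) -> List[str]:
    """Return the modules for `extra` that cannot be found on sys.path."""
    return [name for name in EXTRA_MODULES[extra] if find_spec(name) is None]


def available_extras() -> Dict[str, bool]:
    """Map each extra to True when all of its modules can be located."""
    return {extra: not missing_modules(extra) for extra in EXTRA_MODULES}


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        status = "ok" if ok else "missing: " + ", ".join(missing_modules(extra))
        print(f"{extra:<9} {status}")
```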

System prerequisites for the ocr extra:

sudo apt install tesseract-ocr poppler-utils

Status

This is an initial scaffold. The admin loads and the management commands run end-to-end, but the embedded document viewer (templates/admin/grime/_document_viewer.html) is read-only. Bboxes and tags render on the page image; interactive editing (OCR correction, tag CRUD, NER label correction) depends on AJAX endpoints that have not been implemented yet.
