Local OCR + Markdown + RAG toolkit with optional Hugging Face endpoints.

DocAI Toolkit

Local OCR + Markdown + RAG with optional Hugging Face/custom endpoints. The distribution is published as docai-toolkit to avoid PyPI name collisions; the import package is docai_toolkit.

  • pdf_viewer_app.py: Tkinter UI to open PDFs, run OCR → Markdown, and “chat” via retrieval + generation.
  • docai_toolkit/: library for OCR (local Tesseract or remote endpoint), embedding/indexing (local or remote), and simple chat over FAISS.
  • Status: under active development; APIs and defaults may change as the AI ecosystem moves quickly.

Requirements

  • Python 3.9+
  • Runtime deps vary by script:
    • Viewer: PyPDF2, reportlab (for saving)
    • RAG scripts: langchain, langchain-community, transformers, accelerate, bitsandbytes, sentence_transformers
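Because the dependencies differ per script, it can help to check what is importable before launching one. The following is a minimal sketch, not part of the toolkit: the group labels and the helper name missing_modules are hypothetical, and the import names are assumed to match the packages listed above.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Import names for the dependency groups listed above.
# The group labels ("viewer", "rag") are illustrative only.
OPTIONAL_DEPS = {
    "viewer": ["PyPDF2", "reportlab"],
    "rag": [
        "langchain",
        "langchain_community",
        "transformers",
        "accelerate",
        "bitsandbytes",
        "sentence_transformers",
    ],
}


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


for group, names in OPTIONAL_DEPS.items():
    missing = missing_modules(names)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{group}: {status}")
```

find_spec only locates a module without importing it, so the check stays fast even for heavy packages such as transformers.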

Install everything:

pip install -r requirements.txt
# or editable install
pip install -e .

Usage

GUI Viewer

python pdf_viewer_app.py
  • Open: loads all pages of a PDF into the text area.
  • Save As: renders the text area content into a new PDF (requires reportlab).
  • OCR → Markdown: run OCR on a PDF and save the resulting Markdown to the configured output directory, using either local Tesseract or a remote Hugging Face/custom OCR endpoint.
  • Chat: build a quick FAISS index over a chosen Markdown file and query it with a selected HF model (remote endpoint or local HF pipeline).
  • Settings: set the HF token, optional custom endpoints (OCR/embeddings/LLM), model choices, and output directory. Settings persist to ~/.docai/config.json. Env vars (HF_TOKEN, HUGGINGFACEHUB_API_TOKEN, DOC_AI_HF_TOKEN, DOC_AI_OUTPUT_DIR) are read automatically.
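To illustrate how settings could persist to a JSON file such as ~/.docai/config.json, here is a minimal sketch. It is an assumption-laden illustration, not the toolkit's code: the Settings class, its field names, and the save_settings/load_settings helpers are hypothetical, and the real schema in docai_toolkit/config.py may differ.

```python
import json
from dataclasses import asdict, dataclass, fields
from pathlib import Path
from typing import Optional

DEFAULT_CONFIG_PATH = Path.home() / ".docai" / "config.json"


@dataclass
class Settings:
    # Hypothetical fields; the toolkit's actual config may use other names.
    hf_token: Optional[str] = None
    ocr_endpoint: Optional[str] = None
    embedding_model: str = "sentence-transformers/all-mpnet-base-v2"
    llm_model: str = "mistralai/Mistral-7B-Instruct-v0.1"
    output_dir: str = "outputs"


def save_settings(settings: Settings, path: Path = DEFAULT_CONFIG_PATH) -> None:
    """Write settings as JSON, creating the parent directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(asdict(settings), indent=2), encoding="utf-8")


def load_settings(path: Path = DEFAULT_CONFIG_PATH) -> Settings:
    """Load settings, falling back to defaults when the file is absent."""
    if not path.exists():
        return Settings()
    raw = json.loads(path.read_text(encoding="utf-8"))
    known = {f.name for f in fields(Settings)}
    # Ignore unknown keys so older or newer config files still load.
    return Settings(**{key: value for key, value in raw.items() if key in known})
```

Dropping unknown keys on load is one way to tolerate schema changes while the project is under active development.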

Hugging Face onboarding (fast path)

  1. Create a Hugging Face access token: https://huggingface.co/settings/tokens (choose “Read” or “Write” as needed).
  2. Export it so the app can auto-load it:
    export HF_TOKEN=your_token_here
    # or: export HUGGINGFACEHUB_API_TOKEN=your_token_here
    
  3. Pick models (examples):
    • OCR: point the OCR endpoint at a hosted OCR model (HF Inference API URL).
    • Embeddings: e.g., sentence-transformers/all-mpnet-base-v2 via Inference Endpoints (text-embeddings task) or local.
    • LLM: e.g., mistralai/Mistral-7B-Instruct-v0.1 via Inference Endpoints or local HF pipeline.
  4. Start the app, open Settings, and paste endpoints/models if you didn’t set env vars. Output dir can be set there as well.

Environment variables:

  • HF_TOKEN / HUGGINGFACEHUB_API_TOKEN / DOC_AI_HF_TOKEN: auth token, loaded automatically for LLM and embedding calls.
  • DOC_AI_OUTPUT_DIR: default output directory for OCR/Markdown.
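The lookup of these variables can be sketched as follows. The variable names come from the list above; the precedence order, the "outputs" fallback, and the helper names are assumptions made for illustration and may not match what the app does.

```python
import os
from typing import Dict, Mapping, Optional

# Assumed precedence: first non-empty variable wins.
TOKEN_VARS = ("HF_TOKEN", "HUGGINGFACEHUB_API_TOKEN", "DOC_AI_HF_TOKEN")


def resolve_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first non-empty token found in the environment, if any."""
    for name in TOKEN_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    return None


def resolve_output_dir(env: Mapping[str, str] = os.environ,
                       default: str = "outputs") -> str:
    """Return DOC_AI_OUTPUT_DIR, or an assumed default when it is unset."""
    return env.get("DOC_AI_OUTPUT_DIR", "").strip() or default


def auth_headers(token: Optional[str]) -> Dict[str, str]:
    """Build a bearer-token header for Hugging Face HTTP requests."""
    return {"Authorization": f"Bearer {token}"} if token else {}
```

Passing the environment as a mapping keeps the helpers easy to test without touching the real process environment.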

Docker

Build:

docker build -t docai-toolkit .

Run (GUI requires X/Wayland forwarding; for headless tasks, override CMD):

docker run --rm -v "$PWD":/data docai-toolkit python -m pytest -q
# or override the command to run batch OCR with a CLI script you add yourself

macOS GUI via XQuartz:

  1. Install/start XQuartz (brew install --cask xquartz; enable “Allow connections from network clients” in prefs and restart).
  2. Allow local clients: xhost +localhost
  3. Run:
docker run --rm -it \
  -e DISPLAY=host.docker.internal:0 \
  -v /tmp/.X11-unix:/tmp/.X11-unix \
  docai-toolkit

For day-to-day use, running natively is simpler; use the container when you need an isolated, reproducible environment.

Tests

Basic round-trip test for the viewer’s PDF writer:

pytest

reportlab must be installed for the test to run.

OCR + RAG (docai_toolkit/)

  • OCR: pluggable clients (RemoteOcrClient for HF/custom endpoints, TesseractOcrClient local fallback) that turn PDFs into Markdown (ocr/pipeline.py).
  • RAG: build a FAISS index from Markdown (rag/index.py), then chat using a chosen HF model (rag/chat.py).
  • Config: lightweight dataclasses in docai_toolkit/config.py for selecting providers/models; saved at ~/.docai/config.json.
  • Remote-friendly: use HF token + model ids by default; configs allow custom OCR/embedding/generation endpoints. FAISS runs locally for fast retrieval.
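The pluggable-client idea can be sketched with a small protocol. Everything below is hypothetical: the OcrClient protocol, its extract_pages method, the FakeOcrClient stand-in, and the Markdown layout are assumptions for illustration, not the interfaces of RemoteOcrClient, TesseractOcrClient, or ocr/pipeline.py.

```python
from pathlib import Path
from typing import List, Protocol


class OcrClient(Protocol):
    """Hypothetical interface: anything that returns one string per page."""

    def extract_pages(self, pdf_path: Path) -> List[str]:
        ...


class FakeOcrClient:
    """Stand-in client that returns canned text instead of running OCR."""

    def __init__(self, pages: List[str]) -> None:
        self._pages = pages

    def extract_pages(self, pdf_path: Path) -> List[str]:
        return list(self._pages)


def pages_to_markdown(title: str, pages: List[str]) -> str:
    """Join page texts under a title, with one heading per page."""
    parts = [f"# {title}"]
    for number, text in enumerate(pages, start=1):
        parts.append(f"## Page {number}\n\n{text.strip()}")
    return "\n\n".join(parts) + "\n"


def ocr_to_markdown(pdf_path: Path, output_dir: Path, client: OcrClient) -> Path:
    """Run the client on a PDF and write <stem>.md into output_dir."""
    output_dir.mkdir(parents=True, exist_ok=True)
    md_path = output_dir / f"{pdf_path.stem}.md"
    markdown = pages_to_markdown(pdf_path.stem, client.extract_pages(pdf_path))
    md_path.write_text(markdown, encoding="utf-8")
    return md_path
```

Because the pipeline depends only on the protocol, a local or remote backend can be swapped in without changing the Markdown-writing step, and a fake client makes the step testable without OCR installed.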

To experiment locally:

# OCR to Markdown (Tesseract fallback requires pytesseract + pdf2image installed)
python - <<'PY'
from pathlib import Path
from docai_toolkit.ocr import TesseractOcrClient, run_ocr_to_markdown
client = TesseractOcrClient()
md_path = run_ocr_to_markdown(Path("your.pdf"), Path("outputs"), client)
print("Saved:", md_path)
PY

# Build index + chat (requires sentence_transformers + transformers)
python - <<'PY'
from pathlib import Path
from docai_toolkit.rag import build_index_from_markdown, chat_over_corpus, load_index
index_path = Path("outputs/faiss_index")
db = build_index_from_markdown([Path("outputs/your.md")], persist_path=index_path)
print(chat_over_corpus(db, "What is this document about?", model_id="mistralai/Mistral-7B-Instruct-v0.1"))
# Later: db = load_index(index_path)
PY
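To make the retrieval step behind the chat concrete without installing any model, here is a toy sketch using only the standard library. It swaps in bag-of-words cosine similarity for the sentence embeddings and FAISS index the toolkit uses, so it shows the shape of "chunk, score, take the top k" rather than the toolkit's implementation; all function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def chunk_markdown(text: str, max_chars: int = 500) -> List[str]:
    """Split on blank lines, then pack paragraphs into chunks up to max_chars."""
    chunks: List[str] = []
    current = ""
    for para in (p.strip() for p in re.split(r"\n\s*\n", text)):
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def _vector(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: List[str], k: int = 3) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    q = _vector(query)
    scored = [(_cosine(q, _vector(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

In a real pipeline the retrieved chunks would be placed into the prompt sent to the generation model; dense embeddings generally retrieve paraphrases that word overlap misses.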

License

CC BY-NC-SA 4.0 (see LICENSE).
