Local OCR + Markdown + RAG toolkit with optional Hugging Face endpoints.

DocAI Toolkit

Local OCR + Markdown + RAG with optional Hugging Face/custom endpoints. The project was renamed to avoid PyPI name collisions: the package is published as docai-toolkit and imported as docai_toolkit.

  • pdf_viewer_app.py: Tkinter UI to open PDFs, run OCR → Markdown, and “chat” via retrieval + generation.
  • docai_toolkit/: library for OCR (local Tesseract or remote endpoint), embedding/indexing (local or remote), and simple chat over FAISS.
  • Status: under active development; APIs and defaults may change as the AI ecosystem moves quickly.

Requirements

  • Python 3.9+
  • Runtime deps vary by script:
    • Viewer: PyPDF2, reportlab (for saving)
    • RAG scripts: langchain, langchain-community, transformers, accelerate, bitsandbytes, sentence_transformers

Install everything:

pip install -r requirements.txt
# or editable install
pip install -e .

Usage

GUI Viewer

python pdf_viewer_app.py
  • Open: loads all pages of a PDF into the text area.
  • Save As: renders the text area content into a new PDF (requires reportlab).
  • OCR → Markdown: run OCR on a PDF and save Markdown to the configured output directory (local Tesseract or remote OCR endpoint via HF/custom).
  • Chat: build a quick FAISS index over a chosen Markdown file and query it with a selected HF model (remote endpoint or local HF pipeline).
  • Settings: set the HF token, optional custom endpoints (OCR/embeddings/LLM), model choices, and output directory. Settings persist to ~/.docai/config.json. The env vars HF_TOKEN, HUGGINGFACEHUB_API_TOKEN, DOC_AI_HF_TOKEN, and DOC_AI_OUTPUT_DIR are read automatically.
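
The persistence described in the Settings bullet could look roughly like the sketch below. It is an illustration only: the AppSettings class, its field names, and the save_settings/load_settings helpers are hypothetical and are not taken from docai_toolkit/config.py; only the idea of a dataclass saved as JSON comes from this README.

```python
import json
from dataclasses import asdict, dataclass, fields
from pathlib import Path
from typing import Optional


@dataclass
class AppSettings:
    """Hypothetical settings container; the toolkit's real fields may differ."""
    hf_token: Optional[str] = None
    llm_model: str = "mistralai/Mistral-7B-Instruct-v0.1"
    embedding_model: str = "sentence-transformers/all-mpnet-base-v2"
    output_dir: str = "outputs"


def save_settings(settings: AppSettings, path: Path) -> None:
    """Write the settings as JSON, creating the parent directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(asdict(settings), indent=2), encoding="utf-8")


def load_settings(path: Path) -> AppSettings:
    """Load settings from JSON; fall back to defaults if the file is missing."""
    if not path.exists():
        return AppSettings()
    raw = json.loads(path.read_text(encoding="utf-8"))
    # Ignore unknown keys so an older or newer config file still loads.
    known = {f.name for f in fields(AppSettings)}
    return AppSettings(**{k: v for k, v in raw.items() if k in known})
```

With these helpers, load_settings(Path.home() / ".docai" / "config.json") would return defaults on first run and the saved values afterwards.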

Hugging Face onboarding (fast path)

  1. Create a Hugging Face access token: https://huggingface.co/settings/tokens (choose “Read” or “Write” as needed).
  2. Export it so the app can auto-load it:
    export HF_TOKEN=your_token_here
    # or: export HUGGINGFACEHUB_API_TOKEN=your_token_here
    
  3. Pick models (examples):
    • OCR: point the OCR endpoint at a hosted OCR model (HF Inference API URL).
    • Embeddings: e.g., sentence-transformers/all-mpnet-base-v2 via Inference Endpoints (text-embeddings task) or local.
    • LLM: e.g., mistralai/Mistral-7B-Instruct-v0.1 via Inference Endpoints or local HF pipeline.
  4. Start the app, open Settings, and paste endpoints/models if you didn’t set env vars. Output dir can be set there as well.

Environment variables:

  • HF_TOKEN / HUGGINGFACEHUB_API_TOKEN / DOC_AI_HF_TOKEN: auth token (auto-loads into LLM + embeddings).
  • DOC_AI_OUTPUT_DIR: default output directory for OCR/Markdown.
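
As a rough illustration of how these variables might be resolved, here is a minimal sketch. The variable names come from the list above; the lookup order, the "outputs" default, and the helper names resolve_hf_token and resolve_output_dir are assumptions for this example, not the toolkit's documented behavior.

```python
import os
from typing import Mapping, Optional

# Names are from this README; the precedence order is an assumption.
TOKEN_VARS = ("HF_TOKEN", "HUGGINGFACEHUB_API_TOKEN", "DOC_AI_HF_TOKEN")


def resolve_hf_token(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the first non-empty token found in the environment, else None."""
    env = os.environ if env is None else env
    for name in TOKEN_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    return None


def resolve_output_dir(env: Optional[Mapping[str, str]] = None,
                       default: str = "outputs") -> str:
    """Return DOC_AI_OUTPUT_DIR if set and non-empty, else the given default."""
    env = os.environ if env is None else env
    return env.get("DOC_AI_OUTPUT_DIR") or default
```

Passing a plain dict as env makes the lookup easy to exercise without touching the real environment.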

Docker

Build:

docker build -t docai-toolkit .

Run (the GUI requires X/Wayland forwarding; for headless tasks, override the default command):

docker run --rm -v "$PWD":/data docai-toolkit python -m pytest -q
# or override the command to run batch OCR with your own script built on the library

macOS GUI via XQuartz:

  1. Install/start XQuartz (brew install --cask xquartz; enable “Allow connections from network clients” in prefs and restart).
  2. Allow local clients: xhost +localhost
  3. Run:
docker run --rm -it \
  -e DISPLAY=host.docker.internal:0 \
  -v /tmp/.X11-unix:/tmp/.X11-unix \
  docai-toolkit

For day-to-day use, running natively is simpler; use the container when you need an isolated, reproducible environment.

Tests

Basic round-trip test for the viewer’s PDF writer:

pytest

reportlab must be installed for the test to run.

OCR + RAG (docai_toolkit/)

  • OCR: pluggable clients (RemoteOcrClient for HF/custom endpoints, TesseractOcrClient local fallback) that turn PDFs into Markdown (ocr/pipeline.py).
  • RAG: build a FAISS index from Markdown (rag/index.py), then chat using a chosen HF model (rag/chat.py).
  • Config: lightweight dataclasses in docai_toolkit/config.py for selecting providers/models; saved at ~/.docai/config.json.
  • Remote-friendly: use HF token + model ids by default; configs allow custom OCR/embedding/generation endpoints. FAISS runs locally for fast retrieval.
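
The pluggable-client idea can be pictured with the sketch below. Everything in it is hypothetical: the OcrClient protocol, its extract_pages method, the FakeOcrClient stand-in, and the per-page Markdown layout are assumptions made for illustration and do not describe the actual interfaces in ocr/pipeline.py.

```python
from pathlib import Path
from typing import List, Protocol


class OcrClient(Protocol):
    """Hypothetical interface: return one text string per PDF page."""

    def extract_pages(self, pdf_path: Path) -> List[str]:
        ...


class FakeOcrClient:
    """Stand-in client that returns canned page text (useful in tests)."""

    def __init__(self, pages: List[str]) -> None:
        self._pages = list(pages)

    def extract_pages(self, pdf_path: Path) -> List[str]:
        return list(self._pages)


def pages_to_markdown(title: str, pages: List[str]) -> str:
    """Render page texts as Markdown with one second-level heading per page."""
    parts = [f"# {title}"]
    for number, text in enumerate(pages, start=1):
        parts.append(f"## Page {number}\n\n{text.strip()}")
    return "\n\n".join(parts) + "\n"


def ocr_to_markdown(pdf_path: Path, output_dir: Path, client: OcrClient) -> Path:
    """Run any client that satisfies OcrClient and save <stem>.md in output_dir."""
    output_dir.mkdir(parents=True, exist_ok=True)
    md_path = output_dir / f"{pdf_path.stem}.md"
    pages = client.extract_pages(pdf_path)
    md_path.write_text(pages_to_markdown(pdf_path.stem, pages), encoding="utf-8")
    return md_path
```

Because the pipeline function depends only on the protocol, a local or remote client could be swapped in without changing the Markdown step.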

To experiment locally:

# OCR to Markdown (Tesseract fallback requires pytesseract + pdf2image installed)
python - <<'PY'
from pathlib import Path
from docai_toolkit.ocr import TesseractOcrClient, run_ocr_to_markdown
client = TesseractOcrClient()
md_path = run_ocr_to_markdown(Path("your.pdf"), Path("outputs"), client)
print("Saved:", md_path)
PY

# Build index + chat (requires sentence_transformers + transformers)
python - <<'PY'
from pathlib import Path
from docai_toolkit.rag import build_index_from_markdown, chat_over_corpus, load_index
index_path = Path("outputs/faiss_index")
db = build_index_from_markdown([Path("outputs/your.md")], persist_path=index_path)
print(chat_over_corpus(db, "What is this document about?", model_id="mistralai/Mistral-7B-Instruct-v0.1"))
# Later: db = load_index(index_path)
PY
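
To make the retrieval step less opaque, the sketch below shows the general shape of "split Markdown into chunks, then rank chunks against a question". It is a deliberately simplified stand-in: it uses bag-of-words cosine similarity in pure Python instead of sentence embeddings and FAISS, and the function names (split_markdown, top_k) and the heading-based chunking rule are assumptions, not the toolkit's implementation.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def split_markdown(text: str) -> List[str]:
    """Split Markdown into chunks, starting a new chunk at each heading line."""
    chunks: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


def bag_of_words(text: str) -> Counter:
    """Toy 'embedding': lowercase token counts (real setups use dense vectors)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    query_vec = bag_of_words(query)
    scored = [(cosine(query_vec, bag_of_words(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

In a retrieval + generation flow, the top-ranked chunks would then be placed in the prompt sent to the chosen model.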

License

CC BY-NC-SA 4.0 (see LICENSE).
