Local OCR + Markdown + RAG toolkit with optional Hugging Face endpoints.
DocAI Toolkit
Local OCR + Markdown + RAG with optional Hugging Face/custom endpoints. The package was renamed to avoid PyPI name collisions: install it as docai-toolkit and import it as docai_toolkit.
- pdf_viewer_app.py: Tkinter UI to open PDFs, run OCR → Markdown, and “chat” via retrieval + generation.
- docai_toolkit/: library for OCR (local Tesseract or remote endpoint), embedding/indexing (local or remote), and simple chat over FAISS.
- Status: under active development; APIs and defaults may change as the AI ecosystem moves quickly.
Requirements
- Python 3.9+
- Runtime deps vary by script:
  - Viewer: PyPDF2, reportlab (for saving)
  - RAG scripts: langchain, langchain-community, transformers, accelerate, bitsandbytes, sentence_transformers
Install everything:
pip install -r requirements.txt
# or editable install
pip install -e .
Usage
GUI Viewer
python pdf_viewer_app.py
- Open: loads all pages of a PDF into the text area.
- Save As: renders the text area content into a new PDF (requires reportlab).
- OCR → Markdown: run OCR on a PDF and save Markdown to the configured output directory (local Tesseract or remote OCR endpoint via HF/custom).
- Chat: build a quick FAISS index over a chosen Markdown file and query it with a selected HF model (remote endpoint or local HF pipeline).
- Settings: set HF token, optional custom endpoints (OCR/embeddings/LLM), model choices, and output directory. Settings persist to ~/.docai/config.json. Env vars (HF_TOKEN, HUGGINGFACEHUB_API_TOKEN, DOC_AI_OUTPUT_DIR) are auto-read.
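The persistence described under Settings (a JSON file at ~/.docai/config.json, plus an output directory that can come from DOC_AI_OUTPUT_DIR) could look roughly like the sketch below. It is an illustration only, not the toolkit's actual code: the `ViewerSettings` dataclass, its field names, and the rule that the environment variable overrides the saved value are all assumptions.

```python
import json
import os
from dataclasses import asdict, dataclass, fields
from pathlib import Path
from typing import Mapping, Optional


@dataclass
class ViewerSettings:
    """Hypothetical settings container; field names are illustrative only."""
    hf_token: Optional[str] = None
    llm_model: str = "mistralai/Mistral-7B-Instruct-v0.1"
    output_dir: str = "outputs"


def save_settings(settings: ViewerSettings, path: Path) -> None:
    """Persist settings as JSON, creating the parent directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(asdict(settings), indent=2), encoding="utf-8")


def load_settings(path: Path, env: Mapping[str, str] = os.environ) -> ViewerSettings:
    """Load settings from JSON (defaults if the file is missing), then apply env overrides."""
    data = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    # Ignore unknown keys so an older or newer config file does not crash the loader.
    known = {f.name for f in fields(ViewerSettings)}
    settings = ViewerSettings(**{k: v for k, v in data.items() if k in known})
    # Assumed precedence: DOC_AI_OUTPUT_DIR wins over the saved value.
    override = env.get("DOC_AI_OUTPUT_DIR")
    if override:
        settings.output_dir = override
    return settings
```

Calling `load_settings(Path.home() / ".docai" / "config.json")` would then return defaults on first run and the saved values afterwards.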
Hugging Face onboarding (fast path)
- Create a Hugging Face access token: https://huggingface.co/settings/tokens (choose “Read” or “Write” as needed).
- Export it so the app can auto-load it:
export HF_TOKEN=your_token_here # or HUGGINGFACEHUB_API_TOKEN=your_token_here
- Pick models (examples):
  - OCR: point the OCR endpoint at a hosted OCR model (HF Inference API URL).
  - Embeddings: e.g., sentence-transformers/all-mpnet-base-v2 via Inference Endpoints (text-embeddings task) or local.
  - LLM: e.g., mistralai/Mistral-7B-Instruct-v0.1 via Inference Endpoints or local HF pipeline.
- Start the app, open Settings, and paste endpoints/models if you didn’t set env vars. Output dir can be set there as well.
Environment variables:
- HF_TOKEN / HUGGINGFACEHUB_API_TOKEN / DOC_AI_HF_TOKEN: auth token (auto-loads into LLM + embeddings).
- DOC_AI_OUTPUT_DIR: default output directory for OCR/Markdown.
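As a rough picture of how a token might be picked up from these variables, the sketch below checks each name in turn and builds a bearer-token header. The variable names come from the list above; the lookup order and the helper names `resolve_hf_token` and `auth_headers` are assumptions for illustration, not the toolkit's API.

```python
import os
from typing import Dict, Mapping, Optional

# Names taken from the list above; the order they are checked in is an assumption.
TOKEN_ENV_VARS = ("HF_TOKEN", "HUGGINGFACEHUB_API_TOKEN", "DOC_AI_HF_TOKEN")


def resolve_hf_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first non-empty token found, or None if none is set."""
    for name in TOKEN_ENV_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    return None


def auth_headers(token: Optional[str]) -> Dict[str, str]:
    """Build bearer-token HTTP headers; empty when no token is available."""
    return {"Authorization": f"Bearer {token}"} if token else {}
```

Passing the environment as a plain mapping keeps the lookup easy to test without touching the real process environment.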
Docker
Build:
docker build -t docai-toolkit .
Run (GUI requires X/Wayland forwarding; for headless tasks, override CMD):
docker run --rm -v "$PWD":/data docai-toolkit python -m pytest -q
# or override to run OCR in batch using the library CLI you add
macOS GUI via XQuartz:
- Install/start XQuartz (brew install --cask xquartz; enable “Allow connections from network clients” in prefs and restart).
- Allow local clients: xhost +localhost
- Run:
docker run --rm -it \
-e DISPLAY=host.docker.internal:0 \
-v /tmp/.X11-unix:/tmp/.X11-unix \
docai-toolkit
For day-to-day use, running natively is simpler; use the container when you need an isolated, reproducible environment.
Tests
Basic round-trip test for the viewer’s PDF writer:
pytest
reportlab must be installed for the test to run.
OCR + RAG (docai_toolkit/)
- OCR: pluggable clients (RemoteOcrClient for HF/custom endpoints, TesseractOcrClient local fallback) that turn PDFs into Markdown (ocr/pipeline.py).
- RAG: build a FAISS index from Markdown (rag/index.py), then chat using a chosen HF model (rag/chat.py).
- Config: lightweight dataclasses in docai_toolkit/config.py for selecting providers/models; saved at ~/.docai/config.json.
- Remote-friendly: use HF token + model ids by default; configs allow custom OCR/embedding/generation endpoints. FAISS runs locally for fast retrieval.
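To show what "pluggable clients" can mean in practice, here is a small sketch of an OCR pipeline that accepts any backend exposing one method. It does not reproduce the code in ocr/pipeline.py: the `OcrClient` protocol, its `extract_pages` method, the `FakeOcrClient` stand-in, and the per-page Markdown layout are all hypothetical.

```python
from pathlib import Path
from typing import List, Protocol


class OcrClient(Protocol):
    """Hypothetical interface: any object with this method can act as an OCR backend."""

    def extract_pages(self, pdf_path: Path) -> List[str]:
        ...


class FakeOcrClient:
    """Stand-in backend returning canned text, handy for tests without Tesseract."""

    def __init__(self, pages: List[str]) -> None:
        self._pages = pages

    def extract_pages(self, pdf_path: Path) -> List[str]:
        return list(self._pages)


def pages_to_markdown(title: str, pages: List[str]) -> str:
    """Join per-page text into one Markdown document with a heading per page."""
    parts = [f"# {title}"]
    for number, text in enumerate(pages, start=1):
        parts.append(f"## Page {number}\n\n{text.strip()}")
    return "\n\n".join(parts) + "\n"


def ocr_to_markdown(pdf_path: Path, out_dir: Path, client: OcrClient) -> Path:
    """Run the given client on a PDF and write the result as <stem>.md in out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    md_path = out_dir / f"{pdf_path.stem}.md"
    markdown = pages_to_markdown(pdf_path.stem, client.extract_pages(pdf_path))
    md_path.write_text(markdown, encoding="utf-8")
    return md_path
```

Because the pipeline only depends on the method signature, a remote client and a local Tesseract client could be swapped without changing the Markdown-writing code.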
To experiment locally:
# OCR to Markdown (Tesseract fallback requires pytesseract + pdf2image installed)
python - <<'PY'
from pathlib import Path
from docai_toolkit.ocr import TesseractOcrClient, run_ocr_to_markdown
client = TesseractOcrClient()
md_path = run_ocr_to_markdown(Path("your.pdf"), Path("outputs"), client)
print("Saved:", md_path)
PY
# Build index + chat (requires sentence_transformers + transformers)
python - <<'PY'
from pathlib import Path
from docai_toolkit.rag import build_index_from_markdown, chat_over_corpus, load_index
index_path = Path("outputs/faiss_index")
db = build_index_from_markdown([Path("outputs/your.md")], persist_path=index_path)
print(chat_over_corpus(db, "What is this document about?", model_id="mistralai/Mistral-7B-Instruct-v0.1"))
# Later: db = load_index(index_path)
PY
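The second snippet above hides three steps behind build_index_from_markdown and chat_over_corpus: splitting Markdown into chunks, ranking chunks against a question, and assembling a prompt for the model. The dependency-free toy below makes that flow concrete under loud assumptions: word-count vectors stand in for sentence embeddings, a brute-force cosine scan stands in for FAISS, and every function name is hypothetical rather than part of docai_toolkit.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def split_markdown(text: str) -> List[str]:
    """Split Markdown into chunks on blank lines, dropping empty chunks."""
    return [chunk.strip() for chunk in re.split(r"\n\s*\n", text) if chunk.strip()]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real setup would use an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(chunks: List[str], question: str, k: int = 2) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the question, best first."""
    q = embed(question)
    scored = [(cosine(embed(c), q), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def build_prompt(chunks: List[str], question: str, k: int = 2) -> str:
    """Assemble retrieved context and the question into a prompt for a text-generation model."""
    context = "\n\n".join(text for _, text in top_k(chunks, question, k))
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}\nAnswer:"
```

A real index differs mainly in scale and quality: dense embeddings capture meaning beyond shared words, and FAISS avoids scanning every chunk.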
License
CC BY-NC-SA 4.0 (see LICENSE).