Document intelligence API and Python client for PDF OCR, PII detection, LLM structuring, matching, and summarization.

Document Intelligence Platform

Production-ready document AI: PDF annotation, scanned-document PII detection (EasyOCR + Presidio), LLM PDF structuring, resume matching, and extractive summarization. Ship as a REST API, a Gradio upload GUI, or both via Docker.

Version: 1.0.2

Install from PyPI

pip install docintel-platform

# Full stack (OCR, LLM, jobs, auth, UI)
pip install "docintel-platform[all]"

Python client:

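from docintel import DocintelClient

# Sample inputs; replace with your own text
resume_text = "Python engineer with Flask, pytest, Docker, and NLP experience."
job_description = "Seeking Python developer with Flask, Docker, API, and NLP skills."

client = DocintelClient("http://127.0.0.1:5000", api_key="your-key")
result = client.match_resume(resume_text, job_description)  # match score and keywords
pdf_bytes = client.structure_pdf("scan.pdf", async_job=True)  # queues an async structuring job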

Publish a release to PyPI (maintainers): tag the release (for example, v1.0.0) and push, or run make publish-pypi with either TWINE_USERNAME / TWINE_PASSWORD set or PyPI trusted publishing configured in GitHub Actions.

Deploy in one command

No local Python setup required.

git clone https://github.com/baban9/document-intelligence-platform.git
cd document-intelligence-platform
make docker-up

Service	URL	Use case
Gradio GUI	http://127.0.0.1:7860	Upload PDFs, no code
REST API	http://127.0.0.1:5000	Integrations, curl, apps
Health	http://127.0.0.1:5000/health	Load balancer probe
API docs	http://127.0.0.1:5000/docs	Swagger UI (OpenAPI)
OpenAPI	http://127.0.0.1:5000/openapi.json	Machine-readable contract
Metrics	http://127.0.0.1:5000/metrics	Request counts and latency

First startup can take a few minutes while EasyOCR and Presidio models download inside the container.
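
Because the first startup can take a while, a script that depends on the API may want to wait for the health endpoint before sending work. The following is a minimal sketch using only the standard library; the helper names, the retry count, and the delay are illustrative assumptions, not part of the package.

```python
import http.client
import time
import urllib.request


def is_healthy(url, timeout=2.0):
    """Return True if GET `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or non-2xx response: not ready yet
        return False


def wait_for_health(url, attempts=60, delay=5.0, probe=is_healthy, sleep=time.sleep):
    """Probe `url` until it is healthy.

    Returns the 1-based attempt number that succeeded, or 0 if every
    attempt failed. `probe` and `sleep` are injectable for testing.
    """
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        sleep(delay)
    return 0
```

Calling wait_for_health("http://127.0.0.1:5000/health") after make docker-up blocks for up to attempts × delay seconds (five minutes with these defaults) and returns 0 if the API never answered.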

make docker-logs      # follow api + ui logs
make docker-down      # stop services

Optional overrides: copy .env.example to .env (ports, log level, worker count).

Gradio upload GUI

Open http://127.0.0.1:7860 after make docker-up (or make run-ui locally).

Tab	What it does
PDF regex annotate	Search by pattern, highlight or redact
Sensitive PDF (OCR + Presidio)	Scanned docs: OCR, detect PII, annotate boxes
PDF structure (LLM)	Scanned or messy PDFs to curated structured PDF
Resume matching	Score resume vs job description
Text summarization	Extractive summary with TextRank

The GUI calls the same REST API as external clients. Set DOCINTEL_API_URL if the API runs on a different host.

What you get

Capability	API	GUI
PDF regex search and annotation	`POST /v1/pdf/annotate`	PDF regex annotate tab
Scanned PDF PII detection	`POST /v1/pdf/detect-sensitive`	Sensitive PDF tab
LLM PDF structuring	`POST /v1/pdf/structure`	PDF structure tab
Presidio entity catalog	`GET /v1/pdf/entities`	-
Resume vs job matching	`POST /v1/match/resume`	Resume matching tab
Extractive summarization	`POST /v1/text/summarize`	Text summarization tab
Health and metrics	`GET /health`, `GET /metrics`	-

Why this exists

HR, compliance, and research teams often maintain separate tools:

a PDF highlighter or redaction script
a resume keyword matcher
a notebook for summarization

That split means duplicated config, no shared metrics, and broken workflows on scanned PDFs, where text extraction returns nothing. This platform unifies those flows behind one API and one upload GUI.

Problems it solves

HR and recruiting

Problem	Solution
Manual resume screening at scale	TF-IDF match score plus keyword overlap
Long ATS exports before phone screens	Extractive summary in seconds
Inconsistent reviewer shortlists	Same scoring logic every time

Compliance and legal

Problem	Solution
Regex search on digital contracts	`POST /v1/pdf/annotate`
Scanned contracts with no text layer	EasyOCR + Presidio on `POST /v1/pdf/detect-sensitive`
Redact SSN, email, phone before external share	Highlight or redact on exact bounding boxes
Audit trail	Structured JSON logs and `/metrics`

Research intake

Problem	Solution
Long reports need triage	TextRank summarization
Key terms buried in PDFs	Regex annotate or Presidio entity detection

Before vs after

Before	After
3 scripts, 3 configs	1 API + 1 GUI + 1 Docker deploy
Scanned PDFs fail regex tools	OCR fallback with Presidio PII boxes
Desktop-only redaction	Programmatic HTTP + downloadable output PDF
No observability	JSON logs, health check, metrics endpoint

Local development

git clone https://github.com/baban9/document-intelligence-platform.git
cd document-intelligence-platform
make setup
make setup-ocr    # EasyOCR + Presidio + spaCy en model
make setup-llm    # OpenAI client for PDF structuring
make setup-ui     # Gradio client

Terminal 1 (API):

make run
curl http://127.0.0.1:5000/health

Terminal 2 (GUI):

make run-ui
# open http://127.0.0.1:7860

Tests:

make test

Install extras:

pip install -e ".[dev]"        # tests
pip install -e ".[ocr]"        # scanned PDF pipeline
pip install -e ".[llm]"        # LLM PDF structuring
pip install -e ".[ui]"         # Gradio GUI
python -m spacy download en_core_web_sm

Architecture

Modular monolith: one Flask app, separate service modules, optional Gradio front end.

  Browser / curl                    Docker Compose
       |                                  |
       v                                  v
 +-----------+                    +-------+--------+
 |  Gradio   | -- HTTP :5000 -->  |  Flask API     |
 |  UI :7860 |                    |  (Gunicorn)    |
 +-----------+                    +-------+--------+
                                          |
            +-----------------------------+-----------------------------+
            |                             |                             |
      +-----v-----+               +-----v-----+               +-----v-----+
      |    PDF    |               |  Matching |               |  Summary  |
      |  service  |               |  service  |               |  service  |
      +-----------+               +-----------+               +-----------+
            |                             |                             |
      PyMuPDF regex               TF-IDF cosine                 TextRank graph
      EasyOCR (scanned)           keyword overlap               extractive output
      Presidio PII boxes
      LLM PDF structure

Decision records (in docs/adr/): modular monolith, OCR + Presidio

API reference

OpenAPI spec: GET /openapi.json | Interactive docs: GET /docs

Sensitive PDF detection (scanned + digital)

When native PDF text is empty, the service runs EasyOCR (English), analyzes the text with Microsoft Presidio, and returns a new PDF with highlights or redactions drawn on the detected bounding boxes. By default it also embeds an invisible text layer (add_text_layer) so the output stays searchable.

curl -X POST http://127.0.0.1:5000/v1/pdf/detect-sensitive \
  -F "file=@scanned_contract.pdf" \
  -F "action=Highlight" \
  -o marked_contract.pdf

JSON report with findings:

curl -X POST "http://127.0.0.1:5000/v1/pdf/detect-sensitive?format=json" \
  -F "file=@scanned_contract.pdf" \
  -F "action=Redact" \
  -F "entities=EMAIL_ADDRESS,PHONE_NUMBER,US_SSN,CREDIT_CARD,PERSON"

List Presidio entities (extend with custom recognizers):

curl http://127.0.0.1:5000/v1/pdf/entities

Field	Required	Description
`file`	Yes	PDF upload
`action`	No	`Highlight` (default), `Redact`, `Frame`, `Underline`, `Squiggly`, `Strikeout`
`entities`	No	Comma-separated Presidio types (default preset below)
`pattern`	No	Extra regex on top of Presidio
`force_ocr`	No	`true` to OCR every page
`add_text_layer`	No	`true` (default) adds searchable invisible text
`min_score`	No	Presidio confidence threshold (default `0.35`)
`async`	No	`true` queues the job (returns `202`); poll `GET /v1/jobs/<job_id>`
`callback_url`	No	Webhook URL when async job completes
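
The form fields above can also be sent from Python without extra dependencies. The sketch below builds (but does not send) a multipart/form-data request with the standard library; the function names are hypothetical, and only the endpoint path, the file field, and the Bearer header come from this README.

```python
import urllib.request
import uuid


def build_multipart(fields, filename, file_bytes):
    """Encode text fields plus one PDF under the `file` key as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    file_header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: application/pdf\r\n\r\n'
    )
    parts.append(file_header.encode() + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def detect_sensitive_request(base_url, pdf_name, pdf_bytes, api_key=None, **fields):
    """Build a POST request for /v1/pdf/detect-sensitive without sending it."""
    body, content_type = build_multipart(fields, pdf_name, pdf_bytes)
    headers = {"Content-Type": content_type}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/v1/pdf/detect-sensitive", data=body, headers=headers, method="POST"
    )
```

Passing the returned object to urllib.request.urlopen sends the upload; the response body can then be written to a file, as the curl -o examples do.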

Async mode:

curl -X POST "http://127.0.0.1:5000/v1/pdf/detect-sensitive?async=true" \
  -H "Authorization: Bearer your-key" \
  -F "file=@scanned_contract.pdf" \
  -F "action=Highlight"

Default Presidio entities: EMAIL_ADDRESS, PHONE_NUMBER, US_SSN, CREDIT_CARD, US_BANK_NUMBER, US_DRIVER_LICENSE, US_ITIN, US_PASSPORT, PERSON, LOCATION, DATE_TIME, IP_ADDRESS, IBAN_CODE, MEDICAL_LICENSE, URL.

LLM PDF structuring (scanned to curated PDF)

Turn unstructured or scanned PDFs into a clean digital PDF. EasyOCR extracts text when the native layer is missing. An OpenAI-compatible LLM cleans and structures the content, then the service returns a curated typeset PDF or a searchable layer on the original pages.

curl -X POST http://127.0.0.1:5000/v1/pdf/structure \
  -F "file=@scanned_notes.pdf" \
  -F "mode=curate" \
  -o structured_notes.pdf

Field	Required	Description
`file`	Yes	PDF upload
`mode`	No	`curate` (default, new typeset PDF) or `searchable` (invisible text on original pages)
`force_ocr`	No	`true` to OCR every page
`redact_before_llm`	No	`true` masks Presidio PII before text is sent to the LLM
`callback_url`	No	Webhook URL notified when an async job completes or fails
`async`	No	`true` queues the job in Redis (returns `202`); `false` waits in the request (default)
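
To illustrate what redact_before_llm implies, here is a rough sketch of masking detected spans before text leaves the service. It does not call Presidio; it assumes findings arrive as (start, end, entity_type) character offsets, which is an assumption about the data shape rather than the project's actual code.

```python
from typing import NamedTuple


class Finding(NamedTuple):
    """Hypothetical detection result: character offsets plus an entity label."""
    start: int
    end: int
    entity_type: str


def mask_findings(text, findings):
    """Replace each detected span with a <ENTITY_TYPE> placeholder."""
    masked = text
    last_start = len(text)
    # Work right to left so offsets of spans not yet processed stay valid
    for f in sorted(findings, key=lambda f: f.start, reverse=True):
        if f.end > last_start:
            continue  # overlaps a span that was already masked
        masked = masked[:f.start] + f"<{f.entity_type}>" + masked[f.end:]
        last_start = f.start
    return masked
```

With an email and a phone number detected, "Mail jane@example.com or call 555-0100." becomes "Mail <EMAIL_ADDRESS> or call <PHONE_NUMBER>.", so the LLM sees the structure of the sentence but not the values.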

Async mode (recommended for scanned PDFs):

# 1) Queue the job
curl -X POST "http://127.0.0.1:5000/v1/pdf/structure?async=true" \
  -F "file=@scanned_notes.pdf" \
  -F "mode=curate"

# 2) Poll until job_status is completed
curl http://127.0.0.1:5000/v1/jobs/<job_id>

# 3) Download from download_url in the poll response
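
The three steps above can be sketched in Python as a polling loop. The job_status and download_url field names come from this README; the "failed" status value, the polling interval, and the helper names are assumptions made for illustration.

```python
import json
import time
import urllib.request


def fetch_job(base_url, job_id):
    """GET /v1/jobs/<job_id> and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/v1/jobs/{job_id}", timeout=10) as resp:
        return json.load(resp)


def wait_for_job(base_url, job_id, fetch=fetch_job, interval=2.0, max_polls=150, sleep=time.sleep):
    """Poll a queued job until it completes and return its download_url."""
    for _ in range(max_polls):
        job = fetch(base_url, job_id)
        status = job.get("job_status")
        if status == "completed":
            return job.get("download_url")
        if status == "failed":  # assumed failure value; check the API docs
            raise RuntimeError(f"job {job_id} failed: {job}")
        sleep(interval)
    raise TimeoutError(f"job {job_id} did not complete after {max_polls} polls")
```

If auth is enabled, fetch_job would also need to send the Authorization header; a callback_url avoids polling altogether.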

Start Redis and the worker locally: make setup-jobs, then make run-worker in a second terminal. Docker Compose starts redis, api, and worker automatically.

Model used: OpenAI gpt-4o-mini by default (set via DOCINTEL_LLM_MODEL). The service uses the official OpenAI Python client, so it also works with any OpenAI-compatible endpoint if you set DOCINTEL_LLM_BASE_URL.

Install LLM extras:

pip install -e ".[ocr,llm,jobs]"

Get an OpenAI API key

Official guide: OpenAI API quickstart
Manage keys: platform.openai.com/api-keys

1. Create an account at platform.openai.com (or sign in).
2. Open API keys and click Create new secret key.
3. Copy the key once (it is shown only at creation time).
4. Add billing or credits on the OpenAI platform if required for your account.
5. Export the key before starting the API:

export DOCINTEL_LLM_API_KEY="sk-..."
export DOCINTEL_LLM_MODEL="gpt-4o-mini"   # optional; this is the default
make run

For Docker or persistent local use, copy .env.example to .env and set DOCINTEL_LLM_API_KEY there. Do not commit .env or share the key in git.

Optional: use another OpenAI-compatible provider by setting DOCINTEL_LLM_BASE_URL and the matching model name for that provider.
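
As an illustration of how the three DOCINTEL_LLM_* variables fit together, the sketch below resolves them into a small settings object. The variable names and the gpt-4o-mini default are from this README; the LLMSettings class and load_llm_settings function are hypothetical and may differ from how the service reads its configuration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LLMSettings:
    api_key: str
    model: str = "gpt-4o-mini"
    base_url: Optional[str] = None  # None means the provider's default endpoint


def load_llm_settings(env: Mapping[str, str] = os.environ) -> LLMSettings:
    """Read LLM settings from the environment, failing early if the key is missing."""
    api_key = env.get("DOCINTEL_LLM_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("DOCINTEL_LLM_API_KEY is not set")
    return LLMSettings(
        api_key=api_key,
        model=env.get("DOCINTEL_LLM_MODEL", "").strip() or "gpt-4o-mini",
        base_url=env.get("DOCINTEL_LLM_BASE_URL", "").strip() or None,
    )
```

The resulting api_key and base_url map directly onto the OpenAI(api_key=..., base_url=...) constructor of the official Python client, which is why switching providers only requires changing environment variables.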

PDF regex annotation

For digital PDFs with a text layer.

curl -X POST http://127.0.0.1:5000/v1/pdf/annotate \
  -F "file=@contract.pdf" \
  -F "pattern=CONFIDENTIAL" \
  -F "action=Redact" \
  -o redacted_contract.pdf

Action	Description
`Highlight`	Yellow highlight (default)
`Redact`	Black out matched text
`Frame`	Red bounding box
`Underline` / `Squiggly` / `Strikeout`	Text markup
`Remove`	Delete existing annotations

Optional: a pages field (comma-separated, zero-based page numbers) and ?format=json to get metadata plus a download URL.
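
To make the pattern and pages semantics concrete, here is a simplified sketch that searches already-extracted page text. The real service locates matches inside the PDF with PyMuPDF; this stand-in works on plain strings only, and its function names and result shape are assumptions.

```python
import re


def parse_pages(pages):
    """Turn a string such as "0,2,5" into [0, 2, 5] (zero-based page numbers)."""
    return [int(p) for p in pages.split(",") if p.strip()]


def find_matches(page_texts, pattern, pages=None):
    """Return regex hits per page, optionally limited to the given pages."""
    regex = re.compile(pattern)
    selected = range(len(page_texts)) if pages is None else parse_pages(pages)
    hits = []
    for page_no in selected:
        if not 0 <= page_no < len(page_texts):
            continue  # ignore page numbers outside the document
        for m in regex.finditer(page_texts[page_no]):
            hits.append({"page": page_no, "text": m.group(0), "start": m.start(), "end": m.end()})
    return hits
```

Each hit carries the page number and character offsets; an annotator would translate such hits into rectangles on the page before applying Highlight, Redact, or another action.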

Resume matching

curl -X POST http://127.0.0.1:5000/v1/match/resume \
  -H "Content-Type: application/json" \
  -d '{
    "resume": "Python engineer with Flask, pytest, Docker, and NLP experience.",
    "job_description": "Seeking Python developer with Flask, Docker, API, and NLP skills.",
    "top_keywords": 10
  }'

Example response:

{
  "status": "ok",
  "score": 42.15,
  "matched_keywords": ["python", "flask", "docker", "nlp"],
  "missing_keywords": ["developer", "api", "skills"]
}
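
For readers curious how a TF-IDF cosine score with keyword overlap can be computed, the following is a dependency-free sketch. It is not the service's implementation: the tokenizer, the tiny stopword list, and the smoothed IDF formula are assumptions, so its scores will not match the API's.

```python
import math
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "the", "with", "of", "for", "to", "in", "on"}  # illustrative only


def tokenize(text):
    return [t for t in re.findall(r"[a-z0-9+#]+", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}  # smoothed IDF
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def match_resume(resume, job_description, top_keywords=10):
    """Score a resume against a job description on a 0-100 scale."""
    resume_vec, job_vec = tfidf_vectors([resume, job_description])
    # Keywords are the highest-weighted job-description terms
    keywords = sorted(job_vec, key=lambda t: (-job_vec[t], t))[:top_keywords]
    return {
        "score": round(100 * cosine(resume_vec, job_vec), 2),
        "matched_keywords": [k for k in keywords if k in resume_vec],
        "missing_keywords": [k for k in keywords if k not in resume_vec],
    }
```

The cosine term rewards shared vocabulary weighted by rarity, while the keyword lists give reviewers a readable explanation of why the score is high or low.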

Text summarization

curl -X POST http://127.0.0.1:5000/v1/text/summarize \
  -H "Content-Type: application/json" \
  -d '{"text": "Your long document here...", "sentences": 3}'
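
As background on the TextRank approach, here is a compact, standard-library sketch of extractive summarization: sentences become graph nodes, word overlap becomes edge weight, and a PageRank-style iteration ranks them. The sentence splitter, damping factor, and iteration count are assumptions, and the service's own summarizer may differ.

```python
import math
import re


def split_sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]


def similarity(a, b):
    """Word-overlap similarity normalized by sentence lengths."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # avoids log(1) = 0 in the denominator
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank_summary(text, sentences=3, damping=0.85, iterations=50):
    """Return the top-ranked sentences, kept in their original order."""
    sents = split_sentences(text)
    n = len(sents)
    if n <= sentences:
        return sents
    weights = [[similarity(sents[i], sents[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping)
            + damping * sum(weights[j][i] / out_sum[j] * scores[j] for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    top = sorted(range(n), key=lambda i: (-scores[i], i))[:sentences]
    return [sents[i] for i in sorted(top)]
```

Because the output reuses whole sentences from the input, the summary cannot introduce facts that are absent from the source, which is the main appeal of extractive methods for triage.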

Metrics

curl http://127.0.0.1:5000/metrics

Returns request counts, error counts, average latency, and per-endpoint breakdown. Metrics are per Gunicorn worker; use WEB_CONCURRENCY=1 for OCR workloads (Docker default).
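
The per-worker caveat follows from how in-process counters work: each Gunicorn worker is a separate process with its own memory. A minimal sketch of such a collector is shown below; the class name, the snapshot keys, and the choice to count every 4xx and 5xx response as an error are assumptions, not the service's actual schema.

```python
import threading


class Metrics:
    """In-process counters; each worker process would hold its own copy."""

    def __init__(self):
        self._lock = threading.Lock()
        self._endpoints = {}

    def record(self, endpoint, status_code, elapsed_ms):
        with self._lock:
            s = self._endpoints.setdefault(endpoint, {"requests": 0, "errors": 0, "total_ms": 0.0})
            s["requests"] += 1
            s["total_ms"] += elapsed_ms
            if status_code >= 400:  # assumption: 4xx and 5xx both count as errors
                s["errors"] += 1

    def snapshot(self):
        with self._lock:
            stats = {ep: dict(s) for ep, s in self._endpoints.items()}
        requests = sum(s["requests"] for s in stats.values())
        total_ms = sum(s["total_ms"] for s in stats.values())
        return {
            "requests": requests,
            "errors": sum(s["errors"] for s in stats.values()),
            "avg_latency_ms": round(total_ms / requests, 2) if requests else 0.0,
            "endpoints": {
                ep: {
                    "requests": s["requests"],
                    "errors": s["errors"],
                    "avg_latency_ms": round(s["total_ms"] / s["requests"], 2),
                }
                for ep, s in stats.items()
            },
        }
```

With several workers, a scrape of /metrics would reach only one such object per request, so totals would look lower than reality; a single worker sidesteps that.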

Configuration

Variable	Default	Purpose
`DOCINTEL_HOST`	`127.0.0.1`	API bind address (`0.0.0.0` in Docker)
`DOCINTEL_PORT`	`5000`	API port
`DOCINTEL_UPLOAD_DIR`	`uploads`	PDF job storage
`DOCINTEL_LOG_LEVEL`	`INFO`	JSON log verbosity
`WEB_CONCURRENCY`	`1`	Gunicorn workers (keep at 1 for OCR)
`DOCINTEL_API_URL`	`http://127.0.0.1:5000`	Gradio UI backend URL
`DOCINTEL_LLM_API_KEY`	unset	OpenAI-compatible API key for `/v1/pdf/structure`
`DOCINTEL_LLM_MODEL`	`gpt-4o-mini`	Model name for structuring
`DOCINTEL_LLM_BASE_URL`	unset	Optional compatible API base URL
`DOCINTEL_API_KEYS`	unset	Comma-separated API keys (`Authorization: Bearer ...`)
`DOCINTEL_AUTH_REQUIRED`	`false`	Require auth on `/v1/*` when `true` or keys are set
`DOCINTEL_RATE_LIMIT_ENABLED`	`true`	Per-key rate limits via Redis
`DOCINTEL_OIDC_ISSUER`	unset	Optional OIDC issuer for JWT bearer tokens
`DOCINTEL_OIDC_AUDIENCE`	unset	Expected JWT audience
`DOCINTEL_OIDC_JWKS_URL`	unset	JWKS URL (defaults to `<issuer>/.well-known/jwks.json`)
`DOCINTEL_API_KEY`	unset	API key used by the Gradio UI client
`GRADIO_SERVER_NAME`	`127.0.0.1`	Gradio bind (`0.0.0.0` in Docker)
`GRADIO_SERVER_PORT`	`7860`	Gradio port

API authentication

Protect /v1/* routes with API keys and optional OIDC JWTs.

export DOCINTEL_API_KEYS="dev-key-1,dev-key-2"
export DOCINTEL_AUTH_REQUIRED=true

curl -H "Authorization: Bearer dev-key-1" \
  http://127.0.0.1:5000/v1/pdf/entities

OIDC (enterprise SSO tokens):

export DOCINTEL_OIDC_ISSUER="https://your-idp.example.com"
export DOCINTEL_OIDC_AUDIENCE="docintel-api"
pip install -e ".[auth]"

Send Authorization: Bearer <jwt> from your identity provider. API keys still work when both are configured.

Install auth extras: pip install -e ".[auth]" (Flask-Limiter + PyJWT).
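
The API-key half of this scheme can be sketched as follows. This is an illustrative check, not the service's middleware: the function names are hypothetical, OIDC JWT validation is left out, and the rule that auth applies when DOCINTEL_AUTH_REQUIRED is true or keys are set is taken from the configuration table.

```python
import hmac


def parse_api_keys(raw):
    """Split a DOCINTEL_API_KEYS-style value into a list of keys."""
    return [k.strip() for k in (raw or "").split(",") if k.strip()]


def bearer_token(authorization):
    """Extract the token from an 'Authorization: Bearer <token>' header value."""
    if not authorization:
        return None
    scheme, _, token = authorization.partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        return None
    return token.strip()


def is_authorized(authorization, api_keys, auth_required=False):
    """Allow the request if auth is off, or if the bearer token matches a key."""
    if not api_keys and not auth_required:
        return True  # auth is neither configured nor required
    token = bearer_token(authorization)
    if token is None:
        return False
    # Compare against every key in constant time to avoid timing leaks
    matches = [hmac.compare_digest(token.encode(), key.encode()) for key in api_keys]
    return any(matches)
```

A request handler would call is_authorized with the incoming header and respond with 401 when it returns False.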

Project layout

document-intelligence-platform/
  src/docintel/
    app.py                 Flask factory
    ui.py                  Gradio upload GUI
    wsgi.py                Gunicorn entry
    routes/                HTTP endpoints
    services/
      pdf/                 PyMuPDF, EasyOCR, Presidio
      matching/            TF-IDF resume scoring
      summary/             TextRank summarizer
    ops/                   JSON logging, metrics
  run.py                   Start API locally
  run_ui.py                Start Gradio locally
  Dockerfile               API + OCR stack image
  docker-compose.yml       api, ui, redis, worker services
  docs/adr/                Architecture decisions
  tests/                   pytest suite
  Makefile
  pyproject.toml

Makefile commands

Command	Description
`make setup`	venv + core package
`make setup-ocr`	EasyOCR + Presidio + spaCy model
`make setup-llm`	OpenAI client for PDF structuring
`make build-dist`	Build PyPI wheel and sdist
`make publish-pypi`	Upload to PyPI with twine
`make setup-ui`	Gradio GUI dependencies
`make run`	Start API (:5000)
`make run-ui`	Start Gradio (:7860)
`make test`	Run pytest
`make docker-up`	Build and start API + UI containers
`make docker-down`	Stop containers
`make docker-logs`	Tail all service logs

Roadmap

Milestone	Scope	Status
M1	Project scaffold, health endpoint	Done
M2	PDF regex annotation	Done
M3	Resume matching	Done
M4	Extractive summarization	Done
M5	Docker, logging, metrics	Done
M5+	OCR + Presidio scanned PDF pipeline	Done
M5+	Gradio upload GUI	Done
M8	LLM PDF structuring	Done
M6	Offline eval harness	Planned
M7	Production checklist	Planned

Details: docs/ROADMAP.md

Limits and notes

OCR requests are CPU-heavy; expect higher latency on scanned PDFs.
Presidio entity types are extensible; defaults cover common US PII.
First EasyOCR run downloads models (~100MB+).
LLM structuring sends page text (native or OCR-extracted) to your configured model provider; set redact_before_llm=true to mask Presidio PII first.
Not intended for real-time collaborative editing or generative long-form writing.

License

MIT. See LICENSE.

Built by Babandeep Singh. Open an issue for bugs or feature requests.

Project details

This version: 1.0.2 (released Jun 11, 2026)

Download files

Source Distribution

docintel_platform-1.0.2.tar.gz (57.6 kB), source, uploaded Jun 11, 2026

Built Distribution

docintel_platform-1.0.2-py3-none-any.whl (55.2 kB), Python 3, uploaded Jun 11, 2026

File details

Details for the file docintel_platform-1.0.2.tar.gz.

File metadata

Download URL: docintel_platform-1.0.2.tar.gz
Upload date: Jun 11, 2026
Size: 57.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.13

File hashes

Hashes for docintel_platform-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`9b21718b750c2d110420e49d95d9b6ec33829dd35c9e463ec01d227d2e808376`
MD5	`c4fed314c4db05bfe0f0f03390118eb6`
BLAKE2b-256	`4d14aa9ca925f4b064ab0619c9467d3e73d42678087c548e76109c451e887eb5`

File details

Details for the file docintel_platform-1.0.2-py3-none-any.whl.

File metadata

Download URL: docintel_platform-1.0.2-py3-none-any.whl
Upload date: Jun 11, 2026
Size: 55.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.13

File hashes

Hashes for docintel_platform-1.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9b7e805ac8250625af9b715737763094a0715e43a6e327a9d79cf6a84d55e502`
MD5	`c9cc97463df51e77afa28385dfdb863d`
BLAKE2b-256	`dfed1c6608ffeafa5078aa53ee43ae8aacf57052e4ad9cda3446d35d25839e77`
