
Document extraction API with multi-provider VLM support


DocEx-Serve

DocEx-Serve is a robust document extraction backend built with FastAPI and Docling. It provides a simple yet powerful API for converting documents (PDFs and more) into structured Markdown and table data, with multi-provider VLM support, batch processing, and multiple output formats.

Features

  • 📄 PDF Extraction - Convert PDFs to markdown with preserved structure
  • 🔍 OCR Support - Extract text from scanned documents
  • 📊 Table Extraction - Preserve table structure
  • 🖼️ Image Descriptions - AI-powered image descriptions via VLM
  • 🔄 Multi-Provider VLM - OpenAI, Groq, Anthropic, Google Gemini, Azure
  • 📦 Batch Processing - Process multiple PDFs in one request
  • 📋 Multiple Output Formats - Markdown, JSON, HTML, Plain Text
  • 📄 Page Numbers - Automatic page numbering in multi-page PDFs

Installation

Option 1: Install from PyPI (Recommended)

pip install docex-serve

Option 2: Install from Source

git clone https://github.com/ryyhan/docEx.git
cd docEx
pip install -r requirements.txt

Option 3: Docker

docker pull rehank25/docex-serve
docker run -p 8000:8000 rehank25/docex-serve

Quick Start

Start the Server

After pip install:

docex-server
# Or with options
docex-server --host 0.0.0.0 --port 8080

Using Python:

from docex_serve import start_server
start_server(port=8080)

For development (from source):

python3 main.py

Visit http://localhost:8000/docs for interactive API documentation.

Extract Your First Document

curl -X POST http://localhost:8000/api/v1/extract \
  -F "file=@document.pdf" \
  -F "ocr_enabled=true"
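The same request can be made from Python. This is a minimal sketch using the `requests` library; the helper names are illustrative, and the form fields mirror the parameters of `/api/v1/extract` described below.

```python
import requests

def form_fields(ocr_enabled=True, table_extraction_enabled=True, vlm_mode="none"):
    """Render the optional extraction parameters as form-data strings."""
    return {
        "ocr_enabled": str(ocr_enabled).lower(),
        "table_extraction_enabled": str(table_extraction_enabled).lower(),
        "vlm_mode": vlm_mode,
    }

def extract_document(path, base_url="http://localhost:8000", **options):
    """POST a document to /api/v1/extract and return the parsed JSON response."""
    with open(path, "rb") as f:
        resp = requests.post(
            f"{base_url}/api/v1/extract",
            files={"file": f},
            data=form_fields(**options),
        )
    resp.raise_for_status()
    return resp.json()
```

For example, `extract_document("document.pdf", ocr_enabled=False)` skips OCR for a digital-native PDF.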

Interactive API Documentation

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc

Key Endpoints

POST /api/v1/extract

Upload a file to extract its content.

Request:

  • file: The document file to upload (multipart/form-data).
  • ocr_enabled: (Optional) Enable OCR for scanned documents. Default: true. Set to false for faster processing of digital PDFs.
  • table_extraction_enabled: (Optional) Enable advanced table structure recognition. Default: true.
  • vlm_mode: (Optional) Enable image description. Options: none (default), local (uses SmolVLM), api (uses OpenAI GPT-4o).

Response:

{
  "markdown": "## Page 1\n\n# Document Title\n\nContent...\n\n---\n## Page 2\n\nMore content...",
  "tables": [
    {
      "data": [["Row 1 Col 1", "Row 1 Col 2"], ["Row 2 Col 1", "Row 2 Col 2"]],
      "headers": ["Header 1", "Header 2"]
    }
  ],
  "metadata": {
    "filename": "example.pdf",
    "page_count": 5
  }
}

POST /api/v1/extract-and-save

Same as /extract, but saves the resulting Markdown file to the server's storage directory.

Response:

{
  "message": "Extraction successful and file saved.",
  "saved_path": "/path/to/results/filename_timestamp.md",
  "extraction": { ... }
}

POST /api/v1/warmup

Triggers the download and loading of the OCR and table extraction models. Call this once at startup to avoid delays on the first request.

Response:

{
  "message": "Warmup completed successfully"
}
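The tables field in an extraction response pairs a headers list with row-major data. A small helper (a sketch; only the field names shown in the example response are assumed) can flatten each table into row dictionaries:

```python
def tables_to_dicts(extraction):
    """Convert each table in an extraction response into a list of row dicts."""
    result = []
    for table in extraction.get("tables", []):
        headers = table["headers"]
        result.append([dict(zip(headers, row)) for row in table["data"]])
    return result

sample = {
    "tables": [
        {
            "data": [["Row 1 Col 1", "Row 1 Col 2"], ["Row 2 Col 1", "Row 2 Col 2"]],
            "headers": ["Header 1", "Header 2"],
        }
    ]
}

rows = tables_to_dicts(sample)
# rows[0][0] == {"Header 1": "Row 1 Col 1", "Header 2": "Row 1 Col 2"}
```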

GET /health

Health check endpoint to verify the service is running.

Response:

{
  "status": "ok"
}

Performance Optimization

Docling uses powerful AI models for OCR and Table Extraction. These models are downloaded on the first run, which can take time.

  1. Warmup: Call POST /api/v1/warmup immediately after deployment to download models.
  2. Disable OCR: If you are processing digital-native PDFs (not scanned images), set ocr_enabled=false in your request to significantly speed up extraction.
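The warmup step can be scripted into a deployment. A sketch using the `requests` library; the helper name and the generous timeout are illustrative, only the endpoint itself comes from the API above:

```python
import requests

def warmup(base_url="http://localhost:8000", timeout=600):
    """Trigger model download/loading once at startup.

    The long timeout allows for the initial model download on a fresh deploy.
    """
    resp = requests.post(f"{base_url}/api/v1/warmup", timeout=timeout)
    resp.raise_for_status()
    return resp.json()
```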

Image Description (VLM)

You can enable image description to replace <!-- image --> tags with actual descriptions.

Modes

  1. Local (vlm_mode="local"):
    • Uses HuggingFaceTB/SmolVLM-256M-Instruct.
    • Pros: Free, private.
    • Cons: Requires ~1-2GB RAM, slower warmup.
  2. API (vlm_mode="api"):
    • Uses OpenAI GPT-4o.
    • Pros: Fast, high quality, no local model download.
    • Cons: Costs money, requires OPENAI_API_KEY.

Setup for API Mode

Set the OPENAI_API_KEY environment variable:

export OPENAI_API_KEY="sk-..."
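A hypothetical client-side guard (not part of DocEx) can fall back to a free mode when the key is missing, instead of sending a request that will fail:

```python
import os

def resolve_vlm_mode(requested="api"):
    """Downgrade 'api' mode to 'none' if no OpenAI key is configured."""
    if requested == "api" and not os.environ.get("OPENAI_API_KEY"):
        return "none"
    return requested
```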

Configuration

Configuration is managed via environment variables (or a .env file). Key settings include:

| Variable        | Description          | Default     |
|-----------------|----------------------|-------------|
| PROJECT_NAME    | Name of the project  | "DocEx API" |
| API_V1_STR      | API version prefix   | "/api/v1"   |
| DEBUG           | Enable debug mode    | False       |
| ALLOWED_ORIGINS | CORS allowed origins | ["*"]       |
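For example, a .env file overriding these settings might look like this (values are illustrative):

```
PROJECT_NAME="DocEx API"
API_V1_STR="/api/v1"
DEBUG=False
ALLOWED_ORIGINS=["*"]
```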

Project Structure

docEx/
├── app/
│   ├── api/            # API route definitions
│   ├── core/           # Core config and logging
│   ├── schemas/        # Pydantic models
│   ├── services/       # Business logic (Docling integration)
│   └── main.py         # FastAPI app factory
├── tests/              # Test suite
├── Dockerfile          # Docker build instructions
├── main.py             # Entry point for running the app
└── requirements.txt    # Project dependencies
