Document extraction API with multi-provider VLM support
DocEx API
DocEx is a document extraction backend built with FastAPI and Docling. Its API converts documents (PDFs, etc.) into structured Markdown and table data.
DocEx-Serve
A FastAPI-based document extraction service with multi-provider VLM support, batch processing, and multiple output formats.
Features
- 📄 PDF Extraction - Convert PDFs to markdown with preserved structure
- 🔍 OCR Support - Extract text from scanned documents
- 📊 Table Extraction - Preserve table structure
- 🖼️ Image Descriptions - AI-powered image descriptions via VLM
- 🔄 Multi-Provider VLM - OpenAI, Groq, Anthropic, Google Gemini, Azure
- 📦 Batch Processing - Process multiple PDFs in one request
- 📋 Multiple Output Formats - Markdown, JSON, HTML, Plain Text
- 📄 Page Numbers - Automatic page numbering in multi-page PDFs
Installation
Option 1: Install from PyPI (Recommended)
pip install docex-serve
Option 2: Install from Source
git clone https://github.com/ryyhan/docEx.git
cd docEx
pip install -r requirements.txt
Option 3: Docker
docker pull rehank25/docex-serve
docker run -p 8000:8000 rehank25/docex-serve
Quick Start
Start the Server
After pip install:
docex-server
# Or with options
docex-server --host 0.0.0.0 --port 8080
Using Python:
from docex_serve import start_server
start_server(port=8080)
For development (from source):
python3 main.py
Visit http://localhost:8000/docs for interactive API documentation (use your own port if you started the server with --port).
Extract Your First Document
curl -X POST http://localhost:8000/api/v1/extract \
-F "file=@document.pdf" \
-F "ocr_enabled=true"
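The same upload can be made from Python. The following is a minimal sketch that uses only the standard library; the helper names `build_multipart` and `extract_document` are illustrative and not part of docex-serve, and the `application/pdf` content type on the file part is an assumption.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as multipart/form-data.

    Returns a (body_bytes, content_type_header) tuple.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append((
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        ).encode("utf-8"))
    parts.append((
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"  # assumed content type
    ).encode("utf-8"))
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def extract_document(path, base_url="http://localhost:8000", ocr_enabled=True):
    """POST a file to /api/v1/extract and return the decoded JSON response."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart(
            {"ocr_enabled": "true" if ocr_enabled else "false"},
            "file",
            os.path.basename(path),
            fh.read(),
        )
    request = urllib.request.Request(
        f"{base_url}/api/v1/extract",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a server running locally, `extract_document("document.pdf")` would mirror the curl command above.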
Usage Examples
Basic Extraction
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
Key Endpoints
POST /api/v1/extract
Upload a file to extract its content.
Request:
- file: The document file to upload (multipart/form-data).
- ocr_enabled: (Optional) Enable OCR for scanned documents. Default: true. Set to false for faster processing of digital PDFs.
- table_extraction_enabled: (Optional) Enable advanced table structure recognition. Default: true.
- vlm_mode: (Optional) Enable image description. Options: none (default), local (uses SmolVLM), api (uses OpenAI GPT-4o).
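The optional parameters above can be assembled and validated on the client before sending. This is a small sketch; `build_form_fields` is a hypothetical helper, and it assumes the server accepts lowercase `true`/`false` strings as in the curl example.

```python
VALID_VLM_MODES = ("none", "local", "api")


def build_form_fields(ocr_enabled=True, table_extraction_enabled=True, vlm_mode="none"):
    """Return the optional form fields for /api/v1/extract as strings.

    Defaults follow the documented defaults; vlm_mode is checked against
    the documented options before anything is sent.
    """
    if vlm_mode not in VALID_VLM_MODES:
        raise ValueError(
            f"vlm_mode must be one of {VALID_VLM_MODES}, got {vlm_mode!r}"
        )
    return {
        "ocr_enabled": "true" if ocr_enabled else "false",
        "table_extraction_enabled": "true" if table_extraction_enabled else "false",
        "vlm_mode": vlm_mode,
    }
```

Rejecting an unknown mode locally avoids a round trip to the server for a request that is likely to fail validation.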
POST /api/v1/extract-and-save
Same as /extract, but saves the resulting Markdown file to the server's storage directory.
Response:
{
"message": "Extraction successful and file saved.",
"saved_path": "/path/to/results/filename_timestamp.md",
"extraction": { ... }
}
POST /api/v1/warmup
Triggers the download and loading of OCR and Table Extraction models. Call this once at startup to avoid delays on the first request.
Response:
{
"message": "Warmup completed successfully"
}
Example extraction response (markdown, tables, and metadata):
{
"markdown": "## Page 1\n\n# Document Title\n\nContent...\n\n---\n## Page 2\n\nMore content...",
"tables": [
{
"data": [["Row 1 Col 1", "Row 1 Col 2"], ["Row 2 Col 1", "Row 2 Col 2"]],
"headers": ["Header 1", "Header 2"]
}
],
"metadata": {
"filename": "example.pdf",
"page_count": 5
}
}
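Each entry in `tables` pairs a `headers` list with row lists in `data`, so rows can be turned into dictionaries keyed by header. The sketch below assumes the response shape shown in the example above; `tables_to_records` is an illustrative name, not part of the package.

```python
def tables_to_records(response):
    """Convert each table in an extraction response to a list of row dicts."""
    records = []
    for table in response.get("tables", []):
        headers = table.get("headers", [])
        # zip pairs each cell with the header at the same column index
        records.append([dict(zip(headers, row)) for row in table.get("data", [])])
    return records


# Sample data copied from the example response above
sample = {
    "markdown": "## Page 1\n\n# Document Title\n\nContent...",
    "tables": [
        {
            "data": [["Row 1 Col 1", "Row 1 Col 2"], ["Row 2 Col 1", "Row 2 Col 2"]],
            "headers": ["Header 1", "Header 2"],
        }
    ],
    "metadata": {"filename": "example.pdf", "page_count": 5},
}

first_row = tables_to_records(sample)[0][0]
print(first_row)  # {'Header 1': 'Row 1 Col 1', 'Header 2': 'Row 1 Col 2'}
```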
GET /health
Health check endpoint to verify the service is running.
Response:
{
"status": "ok"
}
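A deployment script can poll the health endpoint until the service answers. The following is a sketch using the standard library; the function names, retry count, and delay are assumptions chosen for illustration.

```python
import json
import time
import urllib.request


def probe_health(base_url="http://localhost:8000"):
    """Return True if GET /health responds with {"status": "ok"}."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as response:
            return json.load(response).get("status") == "ok"
    except (OSError, ValueError):
        # OSError covers connection and HTTP errors, ValueError covers bad JSON
        return False


def wait_until_healthy(probe=probe_health, attempts=10, delay=1.0, sleep=time.sleep):
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Passing `probe` and `sleep` as parameters keeps the retry loop testable without a running server.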
Performance Optimization
Docling uses powerful AI models for OCR and Table Extraction. These models are downloaded on the first run, which can take time.
- Warmup: Call POST /api/v1/warmup immediately after deployment to download models.
- Disable OCR: If you are processing digital-native PDFs (not scanned images), set ocr_enabled=false in your request to significantly speed up extraction.
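The warmup call can be scripted as part of a deployment step. This is a minimal sketch; `build_warmup_request` and `warm_up` are hypothetical helpers, and the 600-second timeout is an assumption to allow for model downloads, not a documented value.

```python
import json
import urllib.request


def build_warmup_request(base_url="http://localhost:8000"):
    """Build (but do not send) the POST request for /api/v1/warmup."""
    return urllib.request.Request(
        f"{base_url}/api/v1/warmup", data=b"", method="POST"
    )


def warm_up(base_url="http://localhost:8000", timeout=600):
    """Send the warmup request and return the decoded JSON response."""
    request = build_warmup_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```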
Image Description (VLM)
You can enable image description to replace <!-- image --> tags with actual descriptions.
Modes
- Local (vlm_mode="local"):
  - Uses HuggingFaceTB/SmolVLM-256M-Instruct.
  - Pros: Free, private.
  - Cons: Requires ~1-2GB RAM, slower warmup.
- API (vlm_mode="api"):
  - Uses OpenAI GPT-4o.
  - Pros: Fast, high quality, no local model download.
  - Cons: Costs money, requires OPENAI_API_KEY.
Setup for API Mode
Set the OPENAI_API_KEY environment variable:
export OPENAI_API_KEY="sk-..."
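A client can pick the mode based on whether an API key is available, and check afterwards whether any image placeholders were left undescribed. The fallback policy below (prefer `api` when `OPENAI_API_KEY` is set, otherwise `local`) is an assumption for illustration, not behavior of docex-serve itself; both function names are hypothetical.

```python
import os

IMAGE_PLACEHOLDER = "<!-- image -->"


def choose_vlm_mode(want_descriptions, env=None):
    """Pick a vlm_mode value: "none", "api" (key present), or "local"."""
    env = os.environ if env is None else env
    if not want_descriptions:
        return "none"
    return "api" if env.get("OPENAI_API_KEY") else "local"


def count_image_placeholders(markdown):
    """Count image tags that were not replaced with a description."""
    return markdown.count(IMAGE_PLACEHOLDER)
```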
Configuration
Configuration is managed via environment variables (or a .env file). Key settings include:
| Variable | Description | Default |
|---|---|---|
| PROJECT_NAME | Name of the project | "DocEx API" |
| API_V1_STR | API version prefix | "/api/v1" |
| DEBUG | Enable debug mode | False |
| ALLOWED_ORIGINS | CORS allowed origins | ["*"] |
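To illustrate how these variables and defaults fit together, here is a standard-library sketch of reading them from the environment. It is not the project's actual settings code; the `Settings` class, `load_settings` function, the accepted truthy strings for DEBUG, and the JSON-list format for ALLOWED_ORIGINS are all assumptions.

```python
import json
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    # Defaults taken from the table above
    project_name: str = "DocEx API"
    api_v1_str: str = "/api/v1"
    debug: bool = False
    allowed_origins: tuple = ("*",)


def load_settings(env=None):
    """Build Settings from environment variables, falling back to defaults."""
    env = os.environ if env is None else env
    defaults = Settings()
    origins = env.get("ALLOWED_ORIGINS")
    return Settings(
        project_name=env.get("PROJECT_NAME", defaults.project_name),
        api_v1_str=env.get("API_V1_STR", defaults.api_v1_str),
        # Assumed truthy spellings; anything else is treated as False
        debug=env.get("DEBUG", "false").strip().lower() in ("1", "true", "yes"),
        # Assumed to be a JSON list, e.g. '["https://example.com"]'
        allowed_origins=tuple(json.loads(origins)) if origins else defaults.allowed_origins,
    )
```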
Project Structure
docEx/
├── app/
│ ├── api/ # API route definitions
│ ├── core/ # Core config and logging
│ ├── schemas/ # Pydantic models
│ ├── services/ # Business logic (Docling integration)
│ └── main.py # FastAPI app factory
├── tests/ # Test suite
├── Dockerfile # Docker build instructions
├── main.py # Entry point for running the app
└── requirements.txt # Project dependencies