
DeepSeek Visor Agent

Production-ready wrapper for DeepSeek-OCR - Convert documents to structured data in 3 lines of code


Keywords: DeepSeek OCR, DeepSeek-OCR wrapper, document OCR, AI agent vision tool, LangChain OCR, LlamaIndex OCR


⚠️ GPU Requirements (CRITICAL)

NVIDIA GPU with Turing+ architecture required

| ✅ Supported | ❌ Not Supported |
|---|---|
| RTX 20/30/40 series (Turing/Ampere/Ada) | GTX 10 series (Pascal; no FlashAttention) |
| Tesla T4, A10, A100 | GTX 1080 Ti, GTX 1660 |
| Minimum: RTX 2060 (6GB VRAM) | CPU-only mode |
| Recommended: RTX 3090 (24GB VRAM) | AMD GPUs (ROCm) |

Why? DeepSeek-OCR requires FlashAttention 2.x, which only supports compute capability 7.5+ (Turing and newer).

No GPU? Join our hosted API waitlist (planned for future release).

📖 Detailed compatibility guide: GPU_COMPATIBILITY.md
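The compatibility rule above reduces to a compute-capability comparison: FlashAttention 2.x needs 7.5 (Turing) or newer. A minimal sketch of that check — the `(major, minor)` pair is what `torch.cuda.get_device_capability()` returns on a CUDA machine, and the helper name here is illustrative, not part of the package:

```python
def is_flash_attention_compatible(major: int, minor: int) -> bool:
    """Return True if a GPU's compute capability meets the
    FlashAttention 2.x minimum of 7.5 (Turing and newer)."""
    return (major, minor) >= (7, 5)

# Turing (RTX 20 series) is 7.5, Ampere (RTX 30 series) is 8.6;
# Pascal (GTX 10 series) is 6.1 and is rejected.
print(is_flash_attention_compatible(7, 5))  # True
print(is_flash_attention_compatible(6, 1))  # False
```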


🎯 What is This?

DeepSeek Visor Agent is a production-ready Python wrapper for DeepSeek-OCR, the state-of-the-art open-source OCR model by DeepSeek AI.

Built on DeepSeek-OCR, this wrapper makes document understanding effortless for AI agents by handling all the complexity:

  • Auto device detection (CUDA with Turing+ GPUs)
  • Automatic fallback (Gundam mode → Base mode → Tiny mode when OOM)
  • Structured output (Markdown + extracted fields)
  • Agent-ready (LangChain, LlamaIndex, Dify compatible)

⚡ Quick Start

Prerequisites

Before installation, ensure you have:

  1. NVIDIA GPU with Turing+ architecture (RTX 20/30/40 series, Tesla T4/A100)
  2. CUDA 11.8+ installed and configured
  3. Python 3.9+

Installation

Step 1: Install the package

pip install deepseek-visor-agent

Step 2: (First-time only) Model download

The first time you run the tool, it will automatically download the DeepSeek-OCR model (~6.2 GB) from HuggingFace:

from deepseek_visor_agent import VisionDocumentTool

# This will trigger model download on first run
tool = VisionDocumentTool()

The model will be cached in ~/.cache/huggingface/ and reused for subsequent runs.

Step 3: (Optional) Install FlashAttention for better performance

# For RTX GPUs with compute capability 7.5+
pip install flash-attn --no-build-isolation

Basic Usage

Process Images:

from deepseek_visor_agent import VisionDocumentTool

# Initialize the tool (auto-detects best device and model)
tool = VisionDocumentTool()

# Process a document image
result = tool.run("invoice.jpg")

print(result["fields"]["total"])  # "$199.00"
print(result["fields"]["date"])   # "2024-01-15"
print(result["document_type"])    # "invoice"

Process PDFs:

# PDF files work the same way - automatically converts pages to images
result = tool.run("contract.pdf")

print(f"Processed {result['pages']} pages")
print(result["markdown"])  # Multi-page PDFs have <--- Page Split ---> separators

# Process specific pages only
result = tool.run("long_document.pdf", pdf_start_page=0, pdf_end_page=2)

That's it! No configuration needed.
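Since multi-page results use the `<--- Page Split --->` separator shown above, per-page text can be recovered with a plain string split. A small sketch (the helper name is ours, not part of the package API):

```python
def split_pages(markdown: str) -> list[str]:
    """Split multi-page OCR markdown into a list of per-page strings."""
    return [page.strip() for page in markdown.split("<--- Page Split --->")]

doc = "# Page one\nHello\n<--- Page Split --->\n# Page two\nWorld"
pages = split_pages(doc)
print(len(pages))  # 2
```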

📖 Complete User Journey

Scenario 1: Standalone Python Script

Use Case: Extract invoice data for accounting automation

from deepseek_visor_agent import VisionDocumentTool
import json

# Initialize once
tool = VisionDocumentTool()

# Process multiple invoices
invoices = ["invoice1.jpg", "invoice2.pdf", "invoice3.png"]
results = []

for invoice_path in invoices:
    result = tool.run(invoice_path, document_type="invoice")
    results.append({
        "file": invoice_path,
        "total": result["fields"]["total"],
        "vendor": result["fields"]["vendor"],
        "date": result["fields"]["date"]
    })

# Export to JSON
with open("extracted_invoices.json", "w") as f:
    json.dump(results, f, indent=2)

print(f"Processed {len(results)} invoices")

Timeline:

  • First run: ~30 seconds (model download + first inference)
  • Subsequent runs: ~5-7 seconds per page
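When batching like this, a given invoice may be missing a field, and direct indexing would raise `KeyError` and abort the loop. Defensive access with `dict.get` keeps the batch running — a sketch (the helper is ours; which fields are present depends on the parser):

```python
def safe_fields(result: dict, names: list[str]) -> dict:
    """Pull the requested fields from an OCR result, defaulting
    missing ones to None instead of raising KeyError."""
    fields = result.get("fields", {})
    return {name: fields.get(name) for name in names}

# Example with a result that lacks "vendor"
partial = {"fields": {"total": "$199.00", "date": "2024-01-15"}}
print(safe_fields(partial, ["total", "vendor", "date"]))
```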

Scenario 2: LangChain AI Agent

Use Case: Build a chatbot that can answer questions about uploaded documents

from langchain.tools import tool
from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI
from deepseek_visor_agent import VisionDocumentTool

# Initialize OCR tool
ocr_tool = VisionDocumentTool()

@tool
def analyze_document(image_path: str) -> dict:
    """Analyze any document image and extract structured data"""
    return ocr_tool.run(image_path, document_type="auto")

# Create agent
llm = ChatOpenAI(model="gpt-4", temperature=0)
agent = initialize_agent(
    [analyze_document],
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# User interaction
response = agent.run("What is the total amount in invoice.jpg?")
print(response)
# Output: "The total amount in the invoice is $199.00, dated 2024-01-15 from Acme Corp."

User Flow:

  1. User uploads document image via chat interface
  2. Agent calls analyze_document tool
  3. DeepSeek-OCR extracts text + fields
  4. LLM interprets results and responds naturally

Scenario 3: Batch PDF Processing

Use Case: Process hundreds of multi-page contracts

from deepseek_visor_agent import VisionDocumentTool
from pathlib import Path
import json

tool = VisionDocumentTool()

# Find all PDFs
contracts_dir = Path("./contracts/")
pdf_files = list(contracts_dir.glob("*.pdf"))

results = []
for pdf_path in pdf_files:
    print(f"Processing {pdf_path.name}...")

    result = tool.run(
        str(pdf_path),
        document_type="contract",
        pdf_start_page=0,  # Process first 3 pages only
        pdf_end_page=2
    )

    results.append({
        "filename": pdf_path.name,
        "parties": result["fields"]["parties"],
        "effective_date": result["fields"]["effective_date"],
        "pages_processed": result["pages"]
    })

# Save results
with open("contracts_summary.json", "w") as f:
    json.dump(results, f, indent=2)

print(f"Processed {len(results)} contracts")

Performance: ~6-7 seconds per page on Tesla T4

Scenario 4: REST API for No-Code Platforms (Dify/Flowise)

Use Case: Integrate with Dify for visual workflow builder

See Dify Integration Guide for complete setup.

High-level flow:

  1. Deploy FastAPI wrapper (provided in examples)
  2. Configure Dify HTTP node with OCR endpoint
  3. Build visual workflow: Upload → OCR → Parse → Respond
  4. No Python code needed for end users

🔗 Integrations

LangChain

from langchain.tools import tool
from deepseek_visor_agent import VisionDocumentTool

ocr_tool = VisionDocumentTool()

@tool
def extract_invoice_data(image_path: str) -> dict:
    """Extract structured data from invoice images"""
    return ocr_tool.run(image_path, document_type="invoice")

# Use in your agent
from langchain.agents import initialize_agent, AgentType
from langchain_openai import OpenAI

tools = [extract_invoice_data]
agent = initialize_agent(tools, OpenAI(temperature=0), agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION)

response = agent.run("Extract the total from invoice.jpg")

LlamaIndex

from llama_index.tools import FunctionTool
from deepseek_visor_agent import VisionDocumentTool

tool = VisionDocumentTool()

def ocr_document(image_path: str) -> dict:
    """Process documents with OCR"""
    return tool.run(image_path)

llama_tool = FunctionTool.from_defaults(fn=ocr_document)

Dify / Flowise

See integration guide for REST API setup.

📊 Features

Automatic Device Management

The tool automatically detects your hardware and selects the optimal configuration:

| Hardware | Inference Mode | Memory Usage |
|---|---|---|
| RTX 4090 (24GB) | Gundam | ~10GB |
| RTX 3090 (24GB) | Base | ~6GB |
| RTX 2060 (6GB) | Tiny | ~3GB |
| CPU only | Not supported | - |

Automatic Fallback

If inference fails (OOM, CUDA errors), automatically falls back to lower-resolution modes:

Gundam mode (OOM) → Large mode → Base mode → Small mode → Tiny mode (Success!)
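The fallback chain can be pictured as a loop that walks the mode list from highest to lowest resolution and retries on failure. This is an illustrative sketch, not the package's actual implementation; `run_inference` stands in for the real inference call:

```python
MODES = ["gundam", "large", "base", "small", "tiny"]

def run_with_fallback(run_inference, image_path: str):
    """Try each inference mode in order, falling back to the next
    one on recoverable errors (e.g. CUDA out-of-memory)."""
    last_error = None
    for mode in MODES:
        try:
            return mode, run_inference(image_path, mode)
        except RuntimeError as exc:  # OOM / CUDA errors surface as RuntimeError
            last_error = exc
    raise RuntimeError("All inference modes failed") from last_error

# Demo: a fake backend that only has memory for "tiny"
def fake_inference(path, mode):
    if mode != "tiny":
        raise RuntimeError("CUDA out of memory")
    return {"markdown": "# ok"}

print(run_with_fallback(fake_inference, "invoice.jpg"))  # ('tiny', {'markdown': '# ok'})
```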

Supported Document Types

  • Invoices - Extracts total, date, vendor, line items
  • Contracts - Extracts parties, effective date, terms
  • PDF Documents - Multi-page PDFs with automatic page splitting
  • 🚧 Resumes - Coming soon
  • 🚧 Forms - Coming soon

PDF Support

Based on the official DeepSeek-OCR implementation:

  • ✅ Multi-page PDF processing
  • ✅ Automatic page-to-image conversion (PyMuPDF)
  • ✅ Configurable DPI (default: 144, same as official)
  • ✅ Page range selection
  • ✅ Same API as image processing
# Process entire PDF
result = tool.run("contract.pdf")

# Process specific pages (0-indexed)
result = tool.run("doc.pdf", pdf_start_page=0, pdf_end_page=2)

# Adjust quality
result = tool.run("scan.pdf", pdf_dpi=200)  # Higher DPI = better quality
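To see what the DPI knob means in pixels: PDF page geometry is measured in points (72 per inch), so the render scale is `dpi / 72`, and a US-Letter page (612 × 792 pt) at the default 144 DPI rasterizes to 1224 × 1584 px. A quick arithmetic check (the function name is ours):

```python
def page_pixels(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Convert a PDF page size in points (1 pt = 1/72 inch) to the
    pixel dimensions it rasterizes to at a given DPI."""
    scale = dpi / 72
    return round(width_pt * scale), round(height_pt * scale)

print(page_pixels(612, 792, 144))  # (1224, 1584)
print(page_pixels(612, 792, 200))  # (1700, 2200)
```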

Output Format

{
    "markdown": "# Invoice\n\nDate: 2024-01-15\n...",
    "fields": {
        "total": "$199.00",
        "date": "2024-01-15",
        "vendor": "Acme Corp"
    },
    "document_type": "invoice",
    "confidence": 0.95,
    "metadata": {
        "inference_mode": "tiny",
        "device": "cuda",
        "inference_time_ms": 1823
    },
    "pages": 1  # Number of pages processed (1 for images, N for PDFs)
}
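Downstream code can sanity-check a result against this shape before using it. A minimal validator over the top-level keys shown above (the helper is ours, not part of the package):

```python
REQUIRED_KEYS = {"markdown", "fields", "document_type", "confidence", "metadata", "pages"}

def validate_result(result: dict) -> list[str]:
    """Return the sorted list of top-level keys missing from an OCR
    result; an empty list means it matches the documented shape."""
    return sorted(REQUIRED_KEYS - result.keys())

print(validate_result({"markdown": "# Invoice", "fields": {}}))
# ['confidence', 'document_type', 'metadata', 'pages']
```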

⚡ Performance

GPU-Tested on Tesla T4 (16GB VRAM) - 2025-10-21

| Inference Mode | Inference Time | Test Environment | Notes |
|---|---|---|---|
| Tiny | 5.35s/page | Tesla T4, simple doc | Fastest; 64 tokens |
| Small | 6.53s/page | Tesla T4, simple doc | 100 tokens |
| Base | 6.77s/page | Tesla T4, simple doc | 256 tokens; most common |
| Large | 6.35s/page | Tesla T4, simple doc | 400 tokens |
| Gundam | 6.67s/page | Tesla T4, simple doc | Crop mode; 256+400 tokens |

⚠️ Note: Performance tested on simple text documents. Real-world complex documents (tables, images, forms) may vary.

📚 Documentation

🚀 Getting Started

🏗️ For Developers

🛣️ Roadmap

  • Core OCR engine with auto-fallback
  • Invoice parser
  • Contract parser (basic)
  • PDF support (via PyMuPDF, official DeepSeek-OCR method)
  • Resume parser
  • Multi-language support
  • Hosted API (Cloud version)
  • LlamaIndex native tool
  • Dify plugin

🤝 Contributing

We welcome contributions! Areas where help is needed:

  1. New parsers - Add support for new document types
  2. Testing - More test cases and edge cases
  3. Documentation - Improve guides and examples
  4. Performance - Optimization suggestions

Please submit issues or pull requests on GitHub.
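For the "new parsers" area, a contribution typically boils down to a small class that maps OCR markdown to structured fields. Everything below is hypothetical — the real parser interface lives in the source tree — and only illustrates the shape of the task:

```python
import re

class ReceiptParser:
    """Hypothetical parser: pull a few fields out of receipt markdown."""
    document_type = "receipt"

    def parse(self, markdown: str) -> dict:
        total = re.search(r"Total:\s*(\$[\d.,]+)", markdown)
        date = re.search(r"Date:\s*([\d-]+)", markdown)
        return {
            "total": total.group(1) if total else None,
            "date": date.group(1) if date else None,
        }

sample = "# Receipt\nDate: 2024-01-15\nTotal: $42.50"
print(ReceiptParser().parse(sample))  # {'total': '$42.50', 'date': '2024-01-15'}
```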

📖 Citation

Built on top of DeepSeek-OCR:

@misc{deepseek-ocr,
  author = {DeepSeek AI},
  title = {DeepSeek-OCR},
  year = {2025},
  url = {https://huggingface.co/deepseek-ai/DeepSeek-OCR}
}

📄 License

Apache License 2.0 - see LICENSE file for details.

🙏 Acknowledgments

  • DeepSeek AI team for the amazing OCR model
  • Hugging Face for model hosting
  • LangChain and LlamaIndex communities for inspiration

📬 Contact


Star ⭐ this repo if you find it useful!
