DeepSeek Visor Agent

Standard vision tool for AI agents - Convert documents to structured data in 3 lines of code

PyPI version License: Apache 2.0 Python 3.9+


⚠️ GPU Requirements (CRITICAL)

NVIDIA GPU with Turing+ architecture required

| ✅ Supported | ❌ Not Supported |
| --- | --- |
| RTX 20/30/40 series (Turing/Ampere/Ada) | GTX 10 series (Pascal - no FlashAttention) |
| Tesla T4, A10, A100 | GTX 1080 Ti, GTX 1660 |
| Minimum: RTX 2060 (6GB VRAM) | CPU-only mode |
| Recommended: RTX 3090 (24GB VRAM) | AMD GPUs (ROCm) |

Why? DeepSeek-OCR requires FlashAttention 2.x, which only supports compute capability 7.5+ (Turing and newer).
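The compute capability requirement above can be checked before installing anything heavy. The following is a minimal sketch, not part of this package: the names `MIN_CAPABILITY`, `meets_requirement`, and `detect_capability` are hypothetical, and the check assumes PyTorch is installed and uses its `torch.cuda.get_device_capability` call.

```python
from typing import Optional, Tuple

# Minimum compute capability (Turing), taken from the requirement stated above.
MIN_CAPABILITY = (7, 5)


def meets_requirement(
    capability: Tuple[int, int],
    minimum: Tuple[int, int] = MIN_CAPABILITY,
) -> bool:
    """Return True if a (major, minor) compute capability is at least `minimum`."""
    return tuple(capability) >= tuple(minimum)


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the compute capability of GPU 0, or None if it cannot be determined."""
    try:
        import torch  # imported lazily so the pure check works without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("No usable CUDA device found")
    elif meets_requirement(cap):
        print(f"Compute capability {cap[0]}.{cap[1]} meets the 7.5 minimum")
    else:
        print(f"Compute capability {cap[0]}.{cap[1]} is below the 7.5 minimum")
```

For reference, a Tesla T4 reports (7, 5) and a GTX 1080 Ti reports (6, 1), so the first passes this check and the second does not.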

No GPU? Join our hosted API waitlist (planned for future release).

📖 Detailed compatibility guide: GPU_COMPATIBILITY.md


🎯 What is This?

DeepSeek Visor Agent is a production-ready wrapper for DeepSeek-OCR that makes document understanding effortless for AI agents.

Instead of wrestling with GPU configurations, model variants, and raw markdown output, you get:

  • Auto device detection (CUDA with Turing+ GPUs)
  • Automatic fallback on OOM (Gundam → Large → Base → Small → Tiny mode)
  • Structured output (Markdown + extracted fields)
  • Agent-ready (LangChain, LlamaIndex, Dify compatible)

⚡ Quick Start

Installation

pip install deepseek-visor-agent

# Optional: For RTX GPUs with FlashAttention support
# (quotes keep shells such as zsh from expanding the square brackets)
pip install "deepseek-visor-agent[flash-attn]"

Basic Usage

from deepseek_visor_agent import VisionDocumentTool

# Initialize the tool (auto-detects best device and model)
tool = VisionDocumentTool()

# Process a document
result = tool.run("invoice.jpg")

print(result["fields"]["total"])  # "$199.00"
print(result["fields"]["date"])   # "2024-01-15"
print(result["document_type"])    # "invoice"

That's it! No configuration needed.

🔗 Integrations

LangChain

# Note: these import paths follow older LangChain releases;
# newer releases may expose the same names from different modules.
from langchain.agents import AgentType, initialize_agent
from langchain.llms import OpenAI
from langchain.tools import tool

from deepseek_visor_agent import VisionDocumentTool

# Create the OCR tool once and reuse it across agent calls
ocr_tool = VisionDocumentTool()

@tool
def extract_invoice_data(image_path: str) -> dict:
    """Extract structured data from invoice images"""
    return ocr_tool.run(image_path, document_type="invoice")

# Use in your agent
tools = [extract_invoice_data]
agent = initialize_agent(
    tools,
    OpenAI(temperature=0),
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
)

response = agent.run("Extract the total from invoice.jpg")

LlamaIndex

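# Note: this import path follows older LlamaIndex releases;
# newer releases may expose FunctionTool from a different module.
from llama_index.tools import FunctionTool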
from deepseek_visor_agent import VisionDocumentTool

tool = VisionDocumentTool()

def ocr_document(image_path: str) -> dict:
    """Process documents with OCR"""
    return tool.run(image_path)

llama_tool = FunctionTool.from_defaults(fn=ocr_document)

Dify / Flowise

See integration guide for REST API setup.

📊 Features

Automatic Device Management

The tool automatically detects your hardware and selects the optimal configuration:

| Hardware | Inference Mode | Memory Usage |
| --- | --- | --- |
| RTX 4090 (24GB) | Gundam | ~10GB |
| RTX 3090 (24GB) | Base | ~6GB |
| RTX 2060 (6GB) | Tiny | ~3GB |
| CPU only | Not Supported | - |

Automatic Fallback

If inference fails (OOM, CUDA errors), the tool automatically falls back to lower-resolution modes:

Gundam mode (OOM) → Large mode → Base mode → Small mode → Tiny mode (Success!)
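The fallback chain above can be sketched as a simple retry loop. This is an illustration of the idea only, not the package's actual implementation: `run_with_fallback` and the `infer` callable are hypothetical, and the sketch assumes that OOM and CUDA failures surface as `RuntimeError`.

```python
from typing import Callable, Sequence

# Mode order taken from the fallback chain above, highest resolution first.
FALLBACK_CHAIN = ("gundam", "large", "base", "small", "tiny")


def run_with_fallback(
    infer: Callable[[str], dict],
    start_mode: str = "gundam",
    chain: Sequence[str] = FALLBACK_CHAIN,
) -> dict:
    """Call `infer(mode)` from `start_mode` downward until one mode succeeds."""
    modes = list(chain)
    modes = modes[modes.index(start_mode):]  # ValueError if start_mode is unknown
    last_error = None
    for mode in modes:
        try:
            result = infer(mode)
        except RuntimeError as exc:  # assumption: OOM/CUDA errors are RuntimeErrors
            last_error = exc
            continue
        # Record which mode finally produced the result.
        result.setdefault("metadata", {})["inference_mode"] = mode
        return result
    raise RuntimeError(f"all modes failed: {modes}") from last_error
```

A caller would pass a function that runs the model in the given mode; if every mode fails, the last underlying error is chained to the raised exception so the original cause stays visible.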

Supported Document Types

  • Invoices - Extracts total, date, vendor, line items
  • Contracts - Extracts parties, effective date, terms
  • 🚧 Resumes - Coming soon
  • 🚧 Forms - Coming soon

Output Format

{
    "markdown": "# Invoice\n\nDate: 2024-01-15\n...",
    "fields": {
        "total": "$199.00",
        "date": "2024-01-15",
        "vendor": "Acme Corp"
    },
    "document_type": "invoice",
    "confidence": 0.95,
    "metadata": {
        "inference_mode": "tiny",
        "device": "cuda",
        "inference_time_ms": 1823
    }
}
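When consuming this output in an agent, it can help to describe its shape with types and to read fields defensively, since not every document will yield every field. The sketch below mirrors the example above; the `OcrResult` and `Metadata` types, the `summarize` and `needs_review` helpers, and the 0.8 confidence threshold are all hypothetical and not part of the package.

```python
from typing import Any, Dict, TypedDict


class Metadata(TypedDict):
    inference_mode: str
    device: str
    inference_time_ms: int


class OcrResult(TypedDict):
    markdown: str
    fields: Dict[str, Any]
    document_type: str
    confidence: float
    metadata: Metadata


def needs_review(result: OcrResult, min_confidence: float = 0.8) -> bool:
    """Flag results below a caller-chosen confidence threshold (0.8 is arbitrary)."""
    return result.get("confidence", 0.0) < min_confidence


def summarize(result: OcrResult) -> str:
    """Build a one-line summary, tolerating missing fields."""
    fields = result.get("fields", {})
    total = fields.get("total", "n/a")
    date = fields.get("date", "n/a")
    doc_type = result.get("document_type", "unknown")
    confidence = result.get("confidence", 0.0)
    return f"{doc_type}: total={total}, date={date} (confidence {confidence:.2f})"


example: OcrResult = {
    "markdown": "# Invoice\n\nDate: 2024-01-15\n...",
    "fields": {"total": "$199.00", "date": "2024-01-15", "vendor": "Acme Corp"},
    "document_type": "invoice",
    "confidence": 0.95,
    "metadata": {"inference_mode": "tiny", "device": "cuda", "inference_time_ms": 1823},
}
print(summarize(example))
```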

⚡ Performance

GPU-Tested on Tesla T4 (16GB VRAM) - 2025-10-21

| Inference Mode | Inference Time | Test Environment | Notes |
| --- | --- | --- | --- |
| Tiny | 5.35s/page | Tesla T4, Simple Doc | Fastest, 64 tokens |
| Small | 6.53s/page | Tesla T4, Simple Doc | 100 tokens |
| Base | 6.77s/page | Tesla T4, Simple Doc | 256 tokens, Most Common |
| Large | 6.35s/page | Tesla T4, Simple Doc | 400 tokens |
| Gundam | 6.67s/page | Tesla T4, Simple Doc | Crop mode, 256+400 tokens |

⚠️ Note: Performance was measured on simple text documents. Results on real-world complex documents (tables, images, forms) may vary.
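For rough capacity planning, the per-page timings above convert directly into throughput. The helper names below are hypothetical; the numbers are the Tesla T4 figures from the table, so the estimates carry the same simple-document caveat and assume pages are processed one at a time.

```python
# Seconds per page on a Tesla T4, copied from the performance table above.
SECONDS_PER_PAGE = {
    "tiny": 5.35,
    "small": 6.53,
    "base": 6.77,
    "large": 6.35,
    "gundam": 6.67,
}


def pages_per_minute(mode: str) -> float:
    """Throughput implied by the measured per-page latency."""
    return 60.0 / SECONDS_PER_PAGE[mode]


def estimated_seconds(mode: str, pages: int) -> float:
    """Sequential processing time for `pages` pages in the given mode."""
    return SECONDS_PER_PAGE[mode] * pages


for mode in SECONDS_PER_PAGE:
    print(f"{mode}: {pages_per_minute(mode):.1f} pages/min")
```

For example, 100 pages in Base mode works out to about 677 seconds, a little over 11 minutes.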

🛣️ Roadmap

  • Core OCR engine with auto-fallback
  • Invoice parser
  • Contract parser (basic)
  • PDF support (via pdf2image)
  • Resume parser
  • Multi-language support
  • Hosted API (Cloud version)
  • LlamaIndex native tool
  • Dify plugin

🤝 Contributing

We welcome contributions! Areas where help is needed:

  1. New parsers - Add support for new document types
  2. Testing - More test cases and edge cases
  3. Documentation - Improve guides and examples
  4. Performance - Optimization suggestions

Please submit issues or pull requests on GitHub.

📖 Citation

Built on top of DeepSeek-OCR:

@misc{deepseek-ocr,
  author = {DeepSeek AI},
  title = {DeepSeek-OCR},
  year = {2025},
  url = {https://huggingface.co/deepseek-ai/DeepSeek-OCR}
}

📄 License

Apache License 2.0 - see LICENSE file for details.

🙏 Acknowledgments

  • DeepSeek AI team for the amazing OCR model
  • Hugging Face for model hosting
  • LangChain and LlamaIndex communities for inspiration

Star ⭐ this repo if you find it useful!
