DeepSeek Visor Agent
Standard vision tool for AI agents - Convert documents to structured data in 3 lines of code
⚠️ GPU Requirements (CRITICAL)
NVIDIA GPU with Turing+ architecture required
| ✅ Supported | ❌ Not Supported |
|---|---|
| RTX 20/30/40 series (Turing/Ampere/Ada) | GTX 10 series (Pascal - no FlashAttention) |
| Tesla T4, A10, A100 | GTX 1080 Ti, GTX 1660 |
| Minimum: RTX 2060 (6GB VRAM) | CPU-only mode |
| Recommended: RTX 3090 (24GB VRAM) | AMD GPUs (ROCm) |
Why? DeepSeek-OCR requires FlashAttention 2.x, which only supports compute capability 7.5+ (Turing and newer).
No GPU? Join our hosted API waitlist (planned for future release).
📖 Detailed compatibility guide: GPU_COMPATIBILITY.md
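To check a machine against the compute capability 7.5 threshold stated above before installing, a minimal sketch along these lines may help. It is not part of this package: the helper names `meets_requirement` and `detect_capability` are hypothetical, and the only assumption is that PyTorch may or may not be installed.

```python
from typing import Optional, Tuple

# Threshold taken from the text above: compute capability 7.5 (Turing) or newer.
MIN_CAPABILITY: Tuple[int, int] = (7, 5)


def meets_requirement(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability is at least 7.5."""
    return tuple(capability) >= MIN_CAPABILITY


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the first CUDA device's compute capability, or None if unavailable."""
    try:
        import torch  # imported lazily so the helper still works without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("No usable CUDA device detected")
    else:
        status = "supported" if meets_requirement(cap) else "not supported"
        print(f"Compute capability {cap[0]}.{cap[1]}: {status}")
```

The comparison uses plain tuple ordering, so (7, 5) and (8, 6) pass while (6, 1) and (7, 0) do not.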
🎯 What is This?
DeepSeek Visor Agent is a production-ready wrapper for DeepSeek-OCR that makes document understanding effortless for AI agents.
Instead of wrestling with GPU configurations, model variants, and raw markdown output, you get:
- ✅ Auto device detection (CUDA with Turing+ GPUs)
- ✅ Automatic fallback (steps down from Gundam mode toward Tiny mode on OOM)
- ✅ Structured output (Markdown + extracted fields)
- ✅ Agent-ready (LangChain, LlamaIndex, Dify compatible)
⚡ Quick Start
Installation
pip install deepseek-visor-agent
# Optional: For RTX GPUs with FlashAttention support
pip install "deepseek-visor-agent[flash-attn]"
Basic Usage
from deepseek_visor_agent import VisionDocumentTool
# Initialize the tool (auto-detects best device and model)
tool = VisionDocumentTool()
# Process a document
result = tool.run("invoice.jpg")
print(result["fields"]["total"]) # "$199.00"
print(result["fields"]["date"]) # "2024-01-15"
print(result["document_type"]) # "invoice"
That's it! No configuration needed.
🔗 Integrations
LangChain
from langchain.tools import tool
from deepseek_visor_agent import VisionDocumentTool
ocr_tool = VisionDocumentTool()
@tool
def extract_invoice_data(image_path: str) -> dict:
    """Extract structured data from invoice images."""
    return ocr_tool.run(image_path, document_type="invoice")
# Use in your agent (legacy LangChain agent API; initialize_agent is deprecated in recent releases)
from langchain.agents import initialize_agent, AgentType
from langchain.llms import OpenAI
tools = [extract_invoice_data]
agent = initialize_agent(tools, OpenAI(temperature=0), agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION)
response = agent.run("Extract the total from invoice.jpg")
LlamaIndex
# On llama-index >= 0.10, import from llama_index.core.tools instead
from llama_index.tools import FunctionTool
from deepseek_visor_agent import VisionDocumentTool
tool = VisionDocumentTool()
def ocr_document(image_path: str) -> dict:
    """Process documents with OCR."""
    return tool.run(image_path)
llama_tool = FunctionTool.from_defaults(fn=ocr_document)
Dify / Flowise
See integration guide for REST API setup.
📊 Features
Automatic Device Management
The tool automatically detects your hardware and selects the optimal configuration:
| Hardware | Inference Mode | Memory Usage |
|---|---|---|
| RTX 4090 (24GB) | Gundam | ~10GB |
| RTX 3090 (24GB) | Base | ~6GB |
| RTX 2060 (6GB) | Tiny | ~3GB |
| CPU only | Not Supported | - |
Automatic Fallback
If inference fails (OOM, CUDA errors), the tool automatically falls back to lower-resolution modes:
Gundam mode (OOM) → Large mode → Base mode → Small mode → Tiny mode (Success!)
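The fallback chain above can be sketched as a simple retry loop. This is an illustration of the idea only, not the package's internal code: `run_with_fallback` and the `infer` callable are hypothetical names, and the choice of which exceptions to catch is an assumption (PyTorch reports CUDA out-of-memory conditions as `RuntimeError` or a subclass of it).

```python
from typing import Callable, Optional, Sequence

# Mode order copied from the chain above, highest resolution first.
FALLBACK_ORDER = ("gundam", "large", "base", "small", "tiny")


def run_with_fallback(
    infer: Callable[[str], dict],
    modes: Sequence[str] = FALLBACK_ORDER,
) -> dict:
    """Call infer(mode) for each mode in order and return the first success."""
    last_error: Optional[BaseException] = None
    for mode in modes:
        try:
            return infer(mode)
        except (MemoryError, RuntimeError) as exc:
            # Assumption: OOM and CUDA failures surface as one of these types.
            last_error = exc
    raise RuntimeError("all inference modes failed") from last_error


if __name__ == "__main__":
    def flaky_infer(mode: str) -> dict:
        """Stand-in for real inference: only the smallest mode fits in memory."""
        if mode != "tiny":
            raise RuntimeError("CUDA out of memory")
        return {"inference_mode": mode}

    print(run_with_fallback(flaky_infer))
```

Passing the inference function in as a callable keeps the retry policy separate from the model code, which also makes the policy easy to test with a stub.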
Supported Document Types
- ✅ Invoices - Extracts total, date, vendor, line items
- ✅ Contracts - Extracts parties, effective date, terms
- 🚧 Resumes - Coming soon
- 🚧 Forms - Coming soon
Output Format
{
  "markdown": "# Invoice\n\nDate: 2024-01-15\n...",
  "fields": {
    "total": "$199.00",
    "date": "2024-01-15",
    "vendor": "Acme Corp"
  },
  "document_type": "invoice",
  "confidence": 0.95,
  "metadata": {
    "inference_mode": "tiny",
    "device": "cuda",
    "inference_time_ms": 1823
  }
}
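Because the extracted fields arrive as strings, downstream code will usually want typed values. The sketch below shows one way to normalize an invoice result shaped like the example above. The helper name `parse_invoice_fields`, the `min_confidence` parameter, and its default of 0.8 are assumptions made for illustration, and the parsing assumes a `$`-prefixed total and an ISO-formatted date as in the sample.

```python
from datetime import date
from decimal import Decimal


def parse_invoice_fields(result: dict, min_confidence: float = 0.8) -> dict:
    """Convert string fields from a result dict into typed Python values."""
    fields = result.get("fields", {})
    total_text = fields.get("total")
    date_text = fields.get("date")
    return {
        "vendor": fields.get("vendor"),
        # Strip the currency symbol and thousands separators before parsing.
        "total": Decimal(total_text.replace("$", "").replace(",", "")) if total_text else None,
        "date": date.fromisoformat(date_text) if date_text else None,
        # Flag results below the (assumed) threshold for manual review.
        "needs_review": result.get("confidence", 0.0) < min_confidence,
    }


if __name__ == "__main__":
    sample = {
        "fields": {"total": "$199.00", "date": "2024-01-15", "vendor": "Acme Corp"},
        "document_type": "invoice",
        "confidence": 0.95,
    }
    print(parse_invoice_fields(sample))
```

Using `Decimal` rather than `float` avoids rounding surprises when totals are summed or compared.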
⚡ Performance
✅ GPU-Tested on Tesla T4 (16GB VRAM) - 2025-10-21
| Inference Mode | Inference Time | Test Environment | Notes |
|---|---|---|---|
| Tiny | 5.35s/page | Tesla T4, Simple Doc | Fastest, 64 tokens |
| Small | 6.53s/page | Tesla T4, Simple Doc | 100 tokens |
| Base | 6.77s/page | Tesla T4, Simple Doc | 256 tokens, Most Common |
| Large | 6.35s/page | Tesla T4, Simple Doc | 400 tokens |
| Gundam | 6.67s/page | Tesla T4, Simple Doc | Crop mode, 256+400 tokens |
⚠️ Note: Timings were measured on simple text documents. Results for complex real-world documents (tables, images, forms) may differ.
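For capacity planning, the per-page timings in the table convert directly into throughput. The sketch below does that arithmetic under the assumption of one GPU processing pages sequentially, ignoring model load time. Tiny mode at 5.35 s/page works out to roughly 670 pages per hour, and Base mode at 6.77 s/page to roughly 530.

```python
# Seconds per page, copied from the Tesla T4 table above.
SECONDS_PER_PAGE = {
    "tiny": 5.35,
    "small": 6.53,
    "base": 6.77,
    "large": 6.35,
    "gundam": 6.67,
}


def pages_per_hour(seconds_per_page: float) -> float:
    """Sequential throughput on a single GPU, ignoring startup cost."""
    return 3600.0 / seconds_per_page


if __name__ == "__main__":
    for mode, secs in SECONDS_PER_PAGE.items():
        print(f"{mode:>7}: {pages_per_hour(secs):6.0f} pages/hour")
```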
📚 Documentation
🚀 Getting Started
- 📚 Documentation Center - Complete documentation hub
- GPU Compatibility Guide
- Hardware Limitations
🏗️ For Developers
🛣️ Roadmap
- Core OCR engine with auto-fallback
- Invoice parser
- Contract parser (basic)
- PDF support (via pdf2image)
- Resume parser
- Multi-language support
- Hosted API (Cloud version)
- LlamaIndex native tool
- Dify plugin
🤝 Contributing
We welcome contributions! Areas where help is needed:
- New parsers - Add support for new document types
- Testing - More test cases and edge cases
- Documentation - Improve guides and examples
- Performance - Optimization suggestions
Please submit issues or pull requests on GitHub.
📖 Citation
Built on top of DeepSeek-OCR:
@misc{deepseek-ocr,
author = {DeepSeek AI},
title = {DeepSeek-OCR},
year = {2025},
url = {https://huggingface.co/deepseek-ai/DeepSeek-OCR}
}
📄 License
Apache License 2.0 - see LICENSE file for details.
🙏 Acknowledgments
- DeepSeek AI team for the amazing OCR model
- Hugging Face for model hosting
- LangChain and LlamaIndex communities for inspiration
📬 Contact
- GitHub Issues: Report bugs or request features
- Email: jack_ai@qq.com
Star ⭐ this repo if you find it useful!
File details
Details for the file deepseek_visor_agent-0.1.0.tar.gz.
File metadata
- Download URL: deepseek_visor_agent-0.1.0.tar.gz
- Upload date:
- Size: 18.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 80a77e3d65c28c822cb283e5035b7ccfc8431d1be2087e022476a136bd3d50a3 |
| MD5 | cfa533bc6d02bd656ba9f0958396af9f |
| BLAKE2b-256 | 71cf92bb474e884e4cf69b09e9c43573cbf5d96a7fd69ebf4d9e6deb85f199b1 |
File details
Details for the file deepseek_visor_agent-0.1.0-py3-none-any.whl.
File metadata
- Download URL: deepseek_visor_agent-0.1.0-py3-none-any.whl
- Upload date:
- Size: 15.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7539a1320dcad92454c8c12295ed3dc0e58144be560b7ea365ce821bc8763014 |
| MD5 | 599afd0adf89e16795005cee4b559a20 |
| BLAKE2b-256 | 36daa4aa5d859912119431918ad025265feabff201c663307a6bec0b29ca1e41 |