Enterprise OCR Platform with Modular Installation - Network & Process Intelligence

NetIntel-OCR (Network Intelligence OCR) v0.1.17.1

🚀 Enterprise OCR Platform with Modular Installation and Enhanced Version Display!

NetIntel-OCR is an enterprise-grade platform for extracting intelligence from technical documents. It automatically detects and processes network diagrams, flow diagrams, tables, and text - converting them into structured, searchable formats with optional knowledge graph representation and vector embeddings. With v0.1.17.1, it features modular installation reducing the base size from 2.5GB to 500MB, plus enhanced version display showing all system capabilities.

🎉 Version 0.1.17.1 introduces Modular Installation! Install only what you need - base OCR is just 500MB, with optional modules for KG, Vector, API, and more. The enhanced --version command shows exactly what's installed and available.

🎯 Key Capabilities

Network Intelligence Extraction

  • Automatic Network Detection: AI-powered identification of network diagrams in documents
  • Component Recognition: Identifies routers, switches, firewalls, servers, and other network elements
  • Connection Mapping: Traces and documents network paths and relationships
  • Security Architecture Analysis: Extracts security zones, DMZs, and trust boundaries

✨ Features

🆕 New in v0.1.17.1 - Modular Installation & Enhanced Version Display

📦 Modular Installation

  • Reduced Base Size: From 2.5GB to just 500MB for core OCR functionality
  • 7 Optional Modules: Choose what you need:
    • [kg] - Knowledge Graph with PyKEEN and FalkorDB (+1.5GB)
    • [vector] - Vector stores (Milvus, Qdrant, ChromaDB) (+300MB)
    • [api] - REST API server with FastAPI (+50MB)
    • [mcp] - Model Context Protocol server (+30MB)
    • [performance] - C++ optimizations and SIMD (+200MB)
    • [dev] - Development tools (pytest, black, ruff) (+100MB)
    • [all] - Everything included (2.5GB total)
  • Preset Configurations:
    • [production] - KG + Vector + API + Performance
    • [cloud] - Vector + API + MCP
  • Smart Detection: System automatically configures based on installed modules
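
The smart detection described above could work roughly as follows. This is a minimal sketch, not the package's actual logic: the mapping from extras to import names (`pykeen`, `falkordb`, `pymilvus`, `fastapi`) and the helper names `detect_modules` and `install_hints` are assumptions made for illustration.

```python
from importlib.util import find_spec

# Hypothetical mapping from optional extras to the import names that back them.
# These package names are assumptions, not NetIntel-OCR's real lookup table.
MODULE_PROBES = {
    "kg": ["pykeen", "falkordb"],
    "vector": ["pymilvus"],
    "api": ["fastapi"],
}


def detect_modules(probes=MODULE_PROBES, is_available=None):
    """Return {extra: bool}; an extra counts as installed if all its packages resolve."""
    if is_available is None:
        # find_spec returns None for a top-level package that is not installed.
        is_available = lambda name: find_spec(name) is not None
    return {
        extra: all(is_available(pkg) for pkg in pkgs)
        for extra, pkgs in probes.items()
    }


def install_hints(status, dist="netintel-ocr"):
    """Build one pip command per missing extra, in the style of the --version output."""
    return [f'pip install "{dist}[{extra}]"' for extra, ok in status.items() if not ok]
```

Calling `install_hints(detect_modules())` would then list a pip command for each extra whose backing packages are missing.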

📊 Enhanced Version Display

# Check what's installed and available
netintel-ocr --version

# Example output:
NetIntel-OCR v0.1.17.1
├── Core Components:
│   ├── C++ Core: ✓ v1.0.1
│   ├── AVX2: ✓
│   └── Platform: Linux x86_64
├── Installed Modules:
│   ├── [base] Core OCR: ✓ (always installed)
│   ├── [kg] Knowledge Graph: ✓ (pykeen 1.10.1)
│   └── [vector] Vector Store: ✗ (not installed)
├── Available for Install:
│   └── [vector]: pip install netintel-ocr[vector]
└── Active Features:
    ├── FalkorDB: ✓ (connected)
    └── Ollama: ✓ (connected)

🆕 New in v0.1.17 - Hierarchical CLI & Hybrid Knowledge Graph System

🎯 Hierarchical CLI Structure

  • 📁 8 Command Groups: Organized into intuitive categories for better discoverability
    • process - Document processing (pdf, batch, watch)
    • server - Server operations (api, mcp, worker, health)
    • db - Database management (query, merge, stats)
    • kg - Knowledge Graph (18+ commands)
    • model - Model management (list, set-default, ollama)
    • project - Project initialization (templates)
    • config - Configuration (profiles, templates, validation)
    • system - System utilities (check, diagnose, version)
  • 🔄 Breaking Change: netintel-ocr document.pdf → netintel-ocr process pdf document.pdf
  • 📋 Configuration Templates: 6 pre-built templates (minimal, development, staging, production, enterprise, cloud)
  • 👤 Profile Management: Multiple configuration profiles with easy switching
  • 🌐 Environment Variables: Complete configuration override capability

🧠 Knowledge Graph System (Now Default!)

  • 🧠 Knowledge Graph Construction: Automatically build graph representations from network diagrams, flow diagrams, tables, and text
  • 🗄️ FalkorDB Integration: Unified storage for both graph structure and KG embeddings
  • 🎯 PyKEEN Embeddings: Train knowledge graph embeddings with 8 supported models (TransE, RotatE, ComplEx, DistMult, ConvE, TuckER, HolE, RESCAL)
  • 🔍 Hybrid Retrieval: Combine graph traversal with vector similarity search for powerful queries
  • 📊 Query Intent Classification: Automatically route queries to optimal retrieval strategy (entity-centric, relational, topological, semantic, analytical, exploratory)
  • 🚀 4 Retrieval Strategies: Vector-first, graph-first, parallel (with RRF), and adaptive strategies
  • 🔄 Enhanced MiniRAG: Extended with FalkorDB storage adapter and 3 query modes (minirag_only, kg_embedding_only, hybrid)
  • 📈 Performance Metrics: 92% query accuracy, <150ms response time, 25% storage reduction
  • 🛠️ 18+ KG Commands: Including kg init, kg train-embeddings, kg hybrid-search, kg path-find, and more
  • 🐳 Production Ready: Docker Compose, Kubernetes manifests, REST API with health checks, and monitoring support
  • 📊 Benchmarking Suite: Performance testing tools for all retrieval strategies
  • 🎨 Visualization Tools: 2D/3D embedding visualization, clustering analysis, and statistics
  • ✅ Default Installation: In v0.1.17, KG features shipped with the standard installation; from v0.1.17.1 onward they are installed through the optional [kg] module
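
The parallel strategy above merges vector and graph results with RRF (reciprocal rank fusion). The sketch below shows the standard RRF formula, where each document scores the sum of 1 / (k + rank) over the rankings it appears in. The function name and the constant k=60 (a commonly used default) are assumptions; the package's own fusion code is not shown in this README.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list, best first.

    rankings: iterable of lists, each ordered from most to least relevant.
    k: smoothing constant; larger values flatten the gap between ranks.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Illustrative inputs only: IDs returned by a vector search and a graph traversal.
vector_hits = ["a", "b", "c"]
graph_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([vector_hits, graph_hits])
```

Documents found by both retrievers ("a" and "b" here) rise above those found by only one, which is the main reason RRF is a common choice for hybrid retrieval.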

🆕 New in v0.1.16.11 - Remote Server Support

  • Dynamic OLLAMA_HOST handling for remote deployments
  • Improved API endpoint resolution for better connectivity

🆕 New in v0.1.16 - Flow Diagrams and Prompt Customization

  • 📊 Flow Diagram Support: Full extraction and analysis of process flows, workflows, and decision trees
  • 🔄 Unified Diagram Detection: Automatically identifies network, flow, or hybrid diagrams
  • 🎯 Process Intelligence: Identifies bottlenecks, optimization opportunities, and critical paths
  • 📝 Customizable Prompts: Export, modify, and import all prompts for industry-specific needs
  • 🎨 Prompt Templates: Pre-built templates for security, compliance, cloud, and process optimization
  • 🔧 Runtime Overrides: Change prompts on-the-fly without editing files
  • 📈 Flow Mermaid Generation: Automatic conversion to flowchart TD/LR format
  • 🧠 Context-Aware Analysis: Reads 2 paragraphs before/after diagrams for accurate interpretation
  • 🔍 Type-Specific Processing: Different analysis for network vs flow diagrams
  • 🌐 Hybrid Diagram Support: Handles diagrams with both network and flow elements

Previous v0.1.15 - Milvus Vector Database Integration

  • 🚀 20-60x Faster Search: Sub-100ms query response with Milvus distributed architecture
  • 💾 70% Memory Reduction: Process 10x more documents with the same hardware
  • 🎯 Enterprise Scale: From standalone to distributed deployment without code changes
  • 🤖 Qwen3-8B Embeddings: Advanced 4096-dimensional embeddings via Ollama
  • 🔄 IVF_SQ8 Index: CPU-optimized scalar quantization for standard hardware
  • 📦 One-Command Setup: Automatic configuration with netintel-ocr --init
  • 🐳 Docker Compose Ready: Pre-configured stack with etcd, MinIO, and Milvus
  • ☸️ Kubernetes Support: Production-ready Helm charts for enterprise deployment
  • 🔧 OLLAMA_HOST Detection: Automatic discovery of Ollama embedding service

Previous v0.1.14 - High-Performance Deduplication with C++ Core

  • ⚡ 50-100x Performance Boost: C++ core with AVX2 SIMD and OpenMP parallelization
  • 🎯 Three-Level Deduplication: MD5 (exact), SimHash (fuzzy), CDC (content-level)
  • 📦 Zero-Compilation Install: Pre-compiled binary wheels for Linux/macOS/Windows
  • 🔍 Near-Duplicate Detection: SimHash with configurable Hamming distance threshold
  • 📊 Content-Defined Chunking: Remove repetitive blocks with 30-50% storage reduction
  • 🎨 Version Information: netintel-ocr --version shows C++ core status
  • 🔧 Automatic Fallback: Python implementation when C++ unavailable
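
The first two deduplication levels listed above (exact MD5 match, then SimHash with a Hamming distance threshold) can be illustrated in pure Python. This is a sketch under stated assumptions: whitespace tokenization, a 64-bit fingerprint derived from MD5 token hashes, and all function names are illustrative choices, not the C++ core's actual implementation.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Level 1: exact-duplicate key for a piece of text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def simhash(text: str) -> int:
    """Level 2: 64-bit SimHash fingerprint built from whitespace-separated tokens."""
    weights = [0] * 64
    for token in text.lower().split():
        # Use the first 8 bytes of the token's MD5 digest as a 64-bit hash.
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(64):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, threshold: int = 5) -> bool:
    """Treat two texts as near-duplicates when their fingerprints differ in few bits."""
    return hamming_distance(simhash(a), simhash(b)) <= threshold
```

A lower threshold makes detection stricter. The value 5 mirrors the --hamming-threshold 5 example elsewhere in this README and is not a recommendation.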

Previous v0.1.13 - Service-Oriented Architecture

  • 🌐 REST API Server: FastAPI-based server with full OpenAPI/Swagger documentation
  • 🤖 MCP Server: Model Context Protocol server for LLM integration
  • 📦 Multi-Scale Deployments: From single container to enterprise Kubernetes
  • 🚀 Flexible Worker Architecture: Embedded workers or Kubernetes Jobs

Previous v0.1.12 - Advanced Database Management

  • 🗄️ Centralized Database Management: Unified LanceDB with deduplication and MD5 checksums
  • 🔍 Advanced Query Engine: Vector similarity search with multi-field filtering and reranking
  • 📊 Multiple Output Formats: JSON, Markdown, and CSV output for queries
  • 🚀 Batch Processing Pipeline: Parallel PDF processing with progress tracking

Core Features

  • 🚀 Vector Database Integration (v0.1.7): Automatic generation of LanceDB-ready chunks and vector-optimized content
  • 🎯 Intelligent Hybrid Processing: Automatically detects and processes network diagrams as Mermaid.js, tables as JSON, text as markdown
  • 📄 PDF to Text Conversion: Convert PDFs to markdown files locally, no token costs
  • 🤖 Multi-Model Support (v0.1.4): Use different models for text and network processing for optimal performance
  • 📊 Table Extraction (v0.1.6-v0.1.10): Automatic detection and extraction of tables with smart ToC exclusion
  • 🖼️ Visual Understanding: Turn images and diagrams into detailed text descriptions
  • 🔌 Automatic Network Detection: No flags needed - network diagrams are detected and converted automatically
  • 🎨 Icons by Default: Font Awesome icons automatically added to network diagrams for better visualization
  • ⏱️ Smart Timeouts: Operations timeout gracefully with fallback to simpler methods
  • 📊 Diagram Types Supported: Network topology, architecture diagrams, data flow diagrams, security diagrams
  • 📁 MD5-Based Organization (v0.1.4): Each document stored in unique folder using MD5 checksum
  • 📝 Document Index (v0.1.4): Automatic index.md tracking all processed documents
  • 📈 Enhanced Metrics (v0.1.4): Comprehensive footer with processing details, errors, and configuration
  • ⚡ Optimized Processing: Processes up to 100 pages per run with detailed progress tracking
  • 🔧 Flexible Output: Unified markdown format with seamlessly embedded Mermaid diagrams and tables
  • 🔄 Checkpoint/Resume (v0.1.5): Resume interrupted processing from exact stopping point
  • 🔍 Vector Search Ready (v0.1.7): Pre-chunked content with minimal metadata for optimal vector search performance
  • 🔁 Vector Regeneration (v0.1.10): Regenerate vector files from existing markdown without reprocessing PDFs
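
The MD5-based organization listed above means a document's output folder can be derived from its contents alone. A minimal sketch of that idea follows. The helper names are hypothetical, and hashing the raw file bytes is an assumption about how the checksum is computed.

```python
import hashlib
from pathlib import Path


def file_md5(path, block_size=1 << 20):
    """Hash a file in blocks so large PDFs never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def output_dir_for(pdf_path, base="output"):
    """Return base/<md5_checksum>, the per-document folder described above."""
    return Path(base) / file_md5(pdf_path)
```

Because the folder name depends only on file contents, a renamed copy of the same PDF maps to the same folder, which is what makes the deduplication check cheap.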

💼 Use Cases

Network Documentation

  • Convert legacy network diagrams to modern formats
  • Extract network topology from vendor documentation
  • Audit and inventory network architectures

Security Analysis

  • Map security architecture from compliance documents
  • Extract firewall rules and network segmentation
  • Document data flow and trust boundaries

Infrastructure Planning

  • Analyze existing network designs
  • Extract capacity and redundancy information
  • Document interconnections and dependencies

📦 Requirements

  • Python 3.10+
  • Ollama installed and running locally or on a remote server

Installing Ollama and the Default Model

  1. Install Ollama
  2. Pull the default model:
ollama pull nanonets-ocr-s:latest

Using a Remote Ollama Server

By default, netintel-ocr connects to Ollama running on localhost. To use a remote Ollama server, set the OLLAMA_HOST environment variable:

# Connect to a remote Ollama server
export OLLAMA_HOST="http://192.168.1.100:11434"
netintel-ocr process pdf document.pdf

# Or run with the environment variable inline
OLLAMA_HOST="http://remote-server:11434" netintel-ocr process pdf document.pdf
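
For scripts that wrap the CLI, the same fallback can be mirrored in Python. This is only a sketch of one reasonable resolution order (environment variable first, then Ollama's standard local address on port 11434); how netintel-ocr itself normalizes the value is not documented here, and `resolve_ollama_host` is a hypothetical helper.

```python
import os

DEFAULT_OLLAMA_HOST = "http://localhost:11434"  # Ollama's standard local address


def resolve_ollama_host(env=None):
    """Return the Ollama base URL from OLLAMA_HOST, or the local default."""
    env = os.environ if env is None else env
    host = env.get("OLLAMA_HOST", "").strip()
    if not host:
        return DEFAULT_OLLAMA_HOST
    if "://" not in host:
        # Assumption: a bare host:port value means plain HTTP.
        host = "http://" + host
    return host.rstrip("/")
```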

Knowledge Graph Environment Variables (v0.1.17)

When using the hybrid Knowledge Graph system, configure these environment variables:

# LLM and Embedding Models (required - no defaults in code)
export MINIRAG_LLM="ollama/gemma3:4b-it-qat"              # Recommended LLM model
export MINIRAG_EMBEDDING="ollama/Qwen3-Embedding-8B"      # Recommended embedding model
export MINIRAG_EMBEDDING_DIM="4096"                       # Embedding dimensions

# External Ollama Server (required for KG system)
export OLLAMA_HOST="http://192.168.1.100:11434"          # External Ollama server

# FalkorDB Configuration
export FALKORDB_HOST="localhost"
export FALKORDB_PORT="6379"
export FALKORDB_GRAPH="netintel_kg"

# Milvus Configuration
export MILVUS_HOST="localhost"
export MILVUS_PORT="19530"
export MILVUS_COLLECTION="netintel_vectors"

# PyKEEN Configuration
export PYKEEN_MODEL="TransE"                             # Options: TransE, RotatE, ComplEx, DistMult, ConvE, TuckER, HolE, RESCAL
export PYKEEN_EMBEDDING_DIM="200"
export PYKEEN_EPOCHS="100"
export PYKEEN_BATCH_SIZE="128"

# Query Configuration
export KG_QUERY_MODE="hybrid"                            # Options: graph, embedding, hybrid
export KG_MAX_RESULTS="20"
export KG_MIN_CONFIDENCE="0.7"
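
When calling the KG system from your own scripts, it can help to read and validate these variables in one place. The sketch below is illustrative: the `KGSettings` class and `load_kg_settings` function are hypothetical, and the fallback values simply mirror the examples above. Whether the tool applies the same defaults internally is an assumption.

```python
import os
from dataclasses import dataclass

VALID_QUERY_MODES = {"graph", "embedding", "hybrid"}


@dataclass(frozen=True)
class KGSettings:
    falkordb_host: str
    falkordb_port: int
    falkordb_graph: str
    pykeen_model: str
    pykeen_epochs: int
    query_mode: str
    max_results: int
    min_confidence: float


def load_kg_settings(env=None) -> KGSettings:
    """Read KG settings from the environment and fail early on invalid values."""
    env = os.environ if env is None else env
    settings = KGSettings(
        falkordb_host=env.get("FALKORDB_HOST", "localhost"),
        falkordb_port=int(env.get("FALKORDB_PORT", "6379")),
        falkordb_graph=env.get("FALKORDB_GRAPH", "netintel_kg"),
        pykeen_model=env.get("PYKEEN_MODEL", "TransE"),
        pykeen_epochs=int(env.get("PYKEEN_EPOCHS", "100")),
        query_mode=env.get("KG_QUERY_MODE", "hybrid"),
        max_results=int(env.get("KG_MAX_RESULTS", "20")),
        min_confidence=float(env.get("KG_MIN_CONFIDENCE", "0.7")),
    )
    if settings.query_mode not in VALID_QUERY_MODES:
        raise ValueError(f"KG_QUERY_MODE must be one of {sorted(VALID_QUERY_MODES)}")
    if not 0.0 <= settings.min_confidence <= 1.0:
        raise ValueError("KG_MIN_CONFIDENCE must be between 0.0 and 1.0")
    return settings
```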

Installation

🆕 Modular Installation (v0.1.17.1)

Choose your installation based on needs:

# Minimal installation (500MB) - Core OCR only
pip install netintel-ocr

# With Knowledge Graph (2GB total) - Recommended
pip install "netintel-ocr[kg]"

# Production setup (2.3GB) - KG + Vector + API + Performance
pip install "netintel-ocr[production]"

# Cloud deployment (1.5GB) - Vector + API + MCP
pip install "netintel-ocr[cloud]"

# Everything (2.5GB) - All features
pip install "netintel-ocr[all]"

# Check what's installed
netintel-ocr --version

The package now uses Ollama for embeddings (default: qwen3-embedding:8b with 4096 dimensions), providing superior accuracy with optional Milvus integration.

Or install with uv:

uv tool install netintel-ocr

🚀 Quick Start - Choose Your Deployment Scale (NEW v0.1.15!)

Development Scale (1-50 users, up to 1M documents)

# Initialize development deployment (default)
netintel-ocr --init
# Automatically detects OLLAMA_HOST
# Generates Docker Compose with Milvus Standalone

# Start the stack
cd ~/.netintel-ocr
docker-compose up -d
# Milvus: http://localhost:19530
# API: http://localhost:8000
# MCP: http://localhost:8001

Production Scale (100+ users, 100M+ documents)

# Initialize production deployment
netintel-ocr --init --scale production

# Deploy with Kubernetes
helm install netintel-ocr ./helm \
  --namespace netintel-ocr \
  --create-namespace

# Or use Docker with full monitoring
docker-compose -f docker/docker-compose.large.yml up -d
# Grafana: http://localhost:3000

Usage

Quick Start Examples

# Process a network architecture document
netintel-ocr process pdf network-architecture.pdf

# Batch process with Knowledge Graph
netintel-ocr process batch /documents/ --parallel 4

# Query processed documents
netintel-ocr db query "firewall configuration"

# Start production server
netintel-ocr server all --api-port 8000 --mcp-port 8001

# Check system status
netintel-ocr system check

🆕 v0.1.17 Hierarchical CLI Usage

Process Documents (New Syntax!)

# NEW: Process a PDF document (KG enabled by default)
netintel-ocr process pdf document.pdf

# Process without Knowledge Graph
netintel-ocr process pdf document.pdf --no-kg

# Process specific pages
netintel-ocr process pdf document.pdf --start 5 --end 10

# Batch processing
netintel-ocr process batch /path/to/pdfs/

# Watch directory for new PDFs
netintel-ocr process watch /input/folder --pattern "*.pdf"

Server Operations

# Start all services (API + MCP)
netintel-ocr server all

# Start API server only
netintel-ocr server api --port 8000 --workers 4

# Start MCP server
netintel-ocr server mcp --port 8001

# Development server with hot reload
netintel-ocr server dev --reload

# Check health
netintel-ocr server health

Knowledge Graph Commands

# Initialize KG system
netintel-ocr kg init

# Process with KG
netintel-ocr kg process document.pdf

# Query the graph
netintel-ocr kg query "MATCH (n:NetworkDevice) RETURN n"

# Natural language query
netintel-ocr kg rag-query "What are the security vulnerabilities?"

# Train embeddings
netintel-ocr kg train-embeddings --model RotatE

# Find similar entities
netintel-ocr kg find-similar "Router-A"

# Visualize embeddings
netintel-ocr kg visualize --method tsne

Configuration Management

# Initialize configuration
netintel-ocr config init --template production

# Set configuration values
netintel-ocr config set server.api.port 8000
netintel-ocr config set models.default qwen2.5vl:7b

# Manage profiles
netintel-ocr config profile create production
netintel-ocr config profile use production

# Export environment variables
netintel-ocr config env export > .env

Legacy CLI Usage (Deprecated)

The old syntax still works but is deprecated:

# OLD SYNTAX (deprecated)
netintel-ocr document.pdf

# Use NEW SYNTAX instead:
netintel-ocr process pdf document.pdf

🆕 v0.1.15 Commands - Milvus Integration & Vector Search

# Initialize with Milvus (auto-detects OLLAMA_HOST)
netintel-ocr --init

# Check version and capabilities
netintel-ocr --version
netintel-ocr --version-json  # JSON output with Milvus status

# Process with Milvus vector storage (20-60x faster search)
netintel-ocr document.pdf --vector-db milvus

# Vector similarity search in Milvus
netintel-ocr --search "network topology" \
  --collection netintel_vectors \
  --limit 10

# Process with full deduplication (enhanced with Milvus)
netintel-ocr document.pdf --dedup-mode full

# Find near-duplicates using Milvus binary vectors
netintel-ocr --find-duplicates document.pdf \
  --hamming-threshold 5 \
  --use-milvus

# Show Milvus collection statistics
netintel-ocr --milvus-stats

# Configure advanced processing
netintel-ocr document.pdf \
  --embedding-model qwen3-embedding:8b \
  --index-type IVF_SQ8 \
  --dedup-mode full

v0.1.12 Commands - Database Management

# Query centralized database with advanced filtering
netintel-ocr --query "network security" \
  --centralized-db ./centralized.lancedb \
  --filters '{"source_type": "network_diagram"}' \
  --output-format json \
  --limit 10

# Merge documents to centralized database
netintel-ocr --merge-to-centralized \
  --output ./output \
  --centralized-db ./unified.lancedb \
  --dedup-strategy md5

# Batch process multiple PDFs with parallel processing
netintel-ocr --batch-ingest ./pdf_directory \
  --output ./batch_output \
  --parallel-workers 4 \
  --auto-merge

# Database management commands
netintel-ocr --db-stats ./centralized.lancedb
netintel-ocr --db-optimize ./centralized.lancedb --vacuum
netintel-ocr --db-export ./centralized.lancedb --format json

Cloud Workflow with S3/MinIO:

# Configure S3/MinIO storage
export S3_ENDPOINT=https://s3.amazonaws.com
export S3_BUCKET=netintel-documents
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret

# Process with cloud storage
netintel-ocr document.pdf --s3-sync --s3-bucket netintel-documents

# Batch process from cloud storage
netintel-ocr --batch-ingest s3://netintel-documents/pdfs/ \
  --output s3://netintel-documents/output/ \
  --parallel-workers 8

Multi-Model Processing (NEW v0.1.4!)

Use different Ollama models optimized for specific tasks:

# Use fast OCR model for text, powerful model for diagrams
netintel-ocr document.pdf --model nanonets-ocr-s --network-model qwen2.5vl

# Fast processing with lightweight models
netintel-ocr document.pdf --model moondream --network-model bakllava

# Heavy processing for complex network diagrams
netintel-ocr document.pdf --network-model cogvlm --timeout 120

Multi-Model Benefits:

  • 30-50% faster text extraction with OCR-optimized models
  • Better diagram understanding with vision-language models
  • Resource efficiency by using appropriate model sizes
  • Flexibility to experiment with different combinations

Recommended Model Combinations:

| Purpose | Text Model | Network Model | Speed |
|---------|------------|---------------|-------|
| Balanced (Default) | nanonets-ocr-s | qwen2.5vl | Medium |
| Fast Processing | moondream | bakllava | Fast |
| Maximum Accuracy | qwen2.5vl | cogvlm | Slow |
| Resource Limited | moondream | llava-phi3 | Fast |

Table Extraction (NEW v0.1.6!)

NetIntel-OCR now automatically detects and extracts tables from PDFs:

# Tables are extracted by default in hybrid mode
netintel-ocr document.pdf

# Use library-first extraction for faster processing
netintel-ocr document.pdf --table-method pdfplumber

# Use LLM for complex tables with merged cells
netintel-ocr document.pdf --table-method llm

# Save tables as separate JSON files
netintel-ocr document.pdf --save-table-json

# Disable table extraction for faster processing
netintel-ocr document.pdf --no-tables

Table Extraction Features:

  • Automatic Detection: Tables identified alongside network diagrams
  • Multiple Methods: Library-first (pdfplumber), LLM-enhanced, or hybrid
  • Complex Table Support: Handles merged cells, multi-row fields, nested headers
  • Structured Output: Tables converted to JSON with validation
  • Markdown Integration: Tables embedded in markdown with both rendered and JSON views

Vector Database Integration (NEW v0.1.7!)

NetIntel-OCR now automatically generates vector database files optimized for RAG applications:

# Vector generation is ON by default - creates LanceDB-ready chunks
netintel-ocr document.pdf

# Disable vector generation (v0.1.6 behavior)
netintel-ocr document.pdf --no-vector

# Customize chunking strategy
netintel-ocr document.pdf --chunk-size 512 --chunk-overlap 50

# Use semantic chunking (default) vs fixed-size
netintel-ocr document.pdf --chunk-strategy semantic

Vector Features:

  • Automatic Generation: Creates document-vector.md and chunks.jsonl by default
  • Content Filtering: Removes processing artifacts, keeps only source content
  • Minimal Metadata: Only source filename, page numbers, and indexed date
  • LanceDB Optimized: Pre-chunked JSONL ready for direct ingestion
  • Smart Chunking: Semantic boundaries respect document structure

Using with LanceDB:

import lancedb
import json

# Load chunks generated by NetIntel-OCR
with open("output/<md5>/lancedb/chunks.jsonl") as f:
    chunks = [json.loads(line) for line in f]

# Create LanceDB table - ready to use!
db = lancedb.connect("./my_lancedb")
table = db.create_table("documents", chunks)

# Search your documents (a text query needs an embedding function or a full-text index configured on the table)
results = table.search("network configuration").limit(5).to_list()

Performance Optimization

For faster processing of network diagrams, use the --fast-extraction flag:

# Fast extraction mode - reduces extraction time by 50-70%
netintel-ocr document.pdf --fast-extraction

# Combine with multi-model and timeout for best performance
netintel-ocr document.pdf --model nanonets-ocr-s --network-model bakllava --fast-extraction --timeout 30

Fast extraction benefits:

  • Detection: ~15 seconds (vs 30-60s standard)
  • Extraction: ~20 seconds (vs 30-60s standard)
  • Uses simplified prompts for quicker LLM responses
  • Automatic fallback if fast extraction fails

Command Line Options

Basic Options

  • --output, -o: Base output directory (default: "output", documents stored in output/<md5_checksum>/)
  • --model, -m: Ollama model for text extraction (default: "nanonets-ocr-s:latest")
  • --network-model: Separate model for network diagram processing (NEW v0.1.4)
  • --flow-model: Dedicated model for flow diagram processing (NEW v0.1.16.6, defaults to --network-model)
  • --keep-images, -k: Keep the intermediate image files (default: False)
  • --width, -w: Width to resize images to, 0 to skip resizing (default: 0)
  • --start, -s: Start page number (default: 0, processes from beginning)
  • --end, -e: End page number (default: 0, processes to end)
  • --resume: Resume processing from checkpoint if available (NEW v0.1.5)

Processing Mode Options

  • --text-only, -t: Skip network diagram detection for faster text-only processing
  • --network-only: Process only network diagrams, skip regular text pages

Network Diagram Options (applies to default mode)

  • --confidence, -c: Minimum confidence threshold for network diagram detection (0.0-1.0, default: 0.7)
  • --no-icons: Disable Font Awesome icons in Mermaid diagrams (icons are enabled by default)
  • --diagram-only: Only extract network diagrams without page text (by default, both are extracted)
  • --timeout: Timeout in seconds for each LLM operation (default: 60s, increase for complex diagrams)

Vector Database Options (NEW v0.1.7)

  • --no-vector: Disable vector generation (default: enabled)
  • --vector-format: Target vector DB format (default: lancedb, options: pinecone, weaviate, qdrant, chroma)
  • --chunk-size: Chunk size in tokens (default: 1000)
  • --chunk-overlap: Overlap between chunks (default: 100)
  • --chunk-strategy: Chunking strategy (default: semantic, options: fixed, sentence)
  • --embedding-metadata: Include extended metadata (reduces content space)

Examples

Basic Usage (with automatic network detection)

# DEFAULT: Automatic network diagram detection (with icons)
netintel-ocr document.pdf

# Process with custom settings
netintel-ocr document.pdf --confidence 0.8

# Increase timeout for complex diagrams
netintel-ocr document.pdf --timeout 120

# Text-only mode (faster, no detection)
netintel-ocr document.pdf --text-only

# Process specific pages
netintel-ocr document.pdf --start 1 --end 5

# Use a different Ollama model
netintel-ocr document.pdf --model qwen2.5vl:latest

Specialized Processing

# Process ONLY network diagrams (skip text pages)
netintel-ocr network-architecture.pdf --network-only

# Higher confidence threshold (stricter detection)
netintel-ocr document.pdf --confidence 0.9

# Disable icons if not needed
netintel-ocr document.pdf --no-icons

# Extract only diagrams without text (faster)
netintel-ocr document.pdf --diagram-only

# Faster text-only processing
netintel-ocr text-document.pdf --text-only

Process large documents in sections (max 100 pages per run):

# Process first 100 pages
netintel-ocr large-document.pdf --start 1 --end 100

# Process next section
netintel-ocr large-document.pdf --start 101 --end 200

# Process specific chapter (e.g., pages 50-100)
netintel-ocr large-document.pdf --start 50 --end 100

Checkpoint/Resume Capability (NEW v0.1.5)

The tool now supports automatic checkpoint saving and resume functionality for long documents:

How It Works

  • Automatic Saving: Processing state is saved after each page
  • Checkpoint Location: Stored in output/<md5>/.checkpoint/
  • Resume on Interruption: Use --resume to continue from where you left off
  • Page-Level Tracking: Each page is tracked individually
  • Smart Skip: Already processed pages are skipped when resuming
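
The page-level tracking described above can be pictured as a small state file that is rewritten after every page. The following sketch is an illustration only: the `PageCheckpoint` class, the `state.json` filename, and the JSON layout are hypothetical and do not describe the tool's real checkpoint format.

```python
import json
from pathlib import Path


class PageCheckpoint:
    """Tracks completed pages in <checkpoint_dir>/state.json (hypothetical layout)."""

    def __init__(self, checkpoint_dir):
        self.path = Path(checkpoint_dir) / "state.json"
        self.completed = set()
        if self.path.exists():
            self.completed = set(json.loads(self.path.read_text())["completed"])

    def mark_done(self, page):
        """Record a finished page and persist the state immediately."""
        self.completed.add(page)
        self.path.parent.mkdir(parents=True, exist_ok=True)
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps({"completed": sorted(self.completed)}))
        tmp.replace(self.path)  # rename so a crash never leaves a half-written file

    def remaining(self, total_pages):
        """Pages (1-based) that still need processing when resuming."""
        return [p for p in range(1, total_pages + 1) if p not in self.completed]
```

Writing to a temporary file and renaming it is a common way to keep the checkpoint readable even if the process is interrupted mid-write.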

Usage Examples

# Start processing a large document
netintel-ocr large-document.pdf

# If interrupted (Ctrl+C, power failure, etc.), resume processing
netintel-ocr large-document.pdf --resume

# Resume with different settings (completed pages are kept)
netintel-ocr large-document.pdf --resume --timeout 120 --network-model qwen2.5vl

Resume Information

When resuming, you'll see a summary like:

╔════════════════════════════════════════════════════════════╗
║                  RESUME CHECKPOINT FOUND                   ║
╠════════════════════════════════════════════════════════════╣
║ Previous Processing:                                       ║
║   • Pages completed: 45/100                                ║
║   • Network diagrams found: 5                              ║
║   • Regular pages: 40                                      ║
║   • Failed pages: 0                                        ║
║                                                            ║
║ Resume Information:                                        ║
║   • Will skip 45 already processed pages                   ║
║   • Will process 55 remaining pages                        ║
║   • Starting from page 46                                  ║
╚════════════════════════════════════════════════════════════╝

Benefits

  • No Lost Work: Never lose progress on long documents
  • Resource Efficient: Don't reprocess completed pages

Vector Regeneration (v0.1.10)

Regenerate Vector Files Without Reprocessing

Use --vector-regenerate to regenerate vector database files from existing markdown output:

# First time processing
netintel-ocr document.pdf

# Regenerate vectors with different chunk settings
netintel-ocr document.pdf --vector-regenerate --chunk-size 500 --chunk-overlap 100

# Change vector database format
netintel-ocr document.pdf --vector-regenerate --vector-format pinecone

# Use different chunking strategy
netintel-ocr document.pdf --vector-regenerate --chunk-strategy sentence

When to Use Vector Regeneration

  • Optimize chunk size: Adjust for better embedding performance
  • Change vector format: Switch between LanceDB, Pinecone, Weaviate, etc.
  • Update metadata: Add or remove extended metadata
  • Fix errors: Regenerate after fixing vector generation issues
  • Experiment: Try different strategies without re-OCR

Benefits

  • Flexible: Change settings when resuming
  • Automatic: No manual intervention needed

Processing Guidelines

Document Size Recommendations

| Document Size | Processing Strategy | Example |
|---------------|---------------------|---------|
| 1-50 pages | Single run | netintel-ocr doc.pdf |
| 51-100 pages | Single run or split | netintel-ocr doc.pdf |
| 101-300 pages | Process in 100-page sections | See examples below |
| 300+ pages | Process key sections only | Use specific page ranges |

Processing Large Documents

For a 250-page document:

# Section 1: Pages 1-100
netintel-ocr document.pdf --start 1 --end 100 -o output_section1

# Section 2: Pages 101-200
netintel-ocr document.pdf --start 101 --end 200 -o output_section2

# Section 3: Pages 201-250
netintel-ocr document.pdf --start 201 --end 250 -o output_section3
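
For documents of arbitrary length, the section boundaries can be computed rather than typed by hand. The helpers below are a hypothetical convenience script, not part of the tool. They only build command strings using the --start, --end, and -o flags shown above.

```python
def page_sections(total_pages, section_size=100):
    """Return inclusive (start, end) page ranges covering 1..total_pages."""
    if total_pages < 1 or section_size < 1:
        raise ValueError("total_pages and section_size must be positive")
    return [
        (start, min(start + section_size - 1, total_pages))
        for start in range(1, total_pages + 1, section_size)
    ]


def section_commands(pdf, total_pages, section_size=100):
    """Build one netintel-ocr command line per section."""
    return [
        f"netintel-ocr {pdf} --start {start} --end {end} -o output_section{index}"
        for index, (start, end) in enumerate(page_sections(total_pages, section_size), start=1)
    ]
```

For the 250-page example, `section_commands("document.pdf", 250)` yields the same three commands listed above.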

Network Diagram Detection (Now Default!)

NEW: Network diagram detection is now enabled by default! No flags needed.

netintel-ocr automatically (in order):

  1. Transcribes text content FIRST (guaranteed capture)
  2. Detects network diagrams in PDF pages
  3. Identifies components (routers, switches, firewalls, servers, databases, etc.)
  4. Extracts connections and relationships
  5. Converts to Mermaid.js format
  6. Combines BOTH the diagram AND the page's text content
  7. Embeds everything in unified markdown output
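
Step 5, converting detected components and connections to Mermaid.js, can be sketched as a simple text generator. The node shapes below are copied from the sample output in this README; the data structures, the `SHAPES` table, and the `to_mermaid` function are assumptions for illustration, not the tool's internal representation.

```python
# Mermaid node delimiters per component kind, matching the README's sample diagram.
SHAPES = {
    "router": ("([", "])"),
    "switch": ("[", "]"),
    "firewall": ("{{", "}}"),
    "server": ("[(", ")]"),
}


def to_mermaid(components, connections, direction="TB"):
    """Render components {id: (kind, label)} and connections [(src, dst)] as Mermaid."""
    lines = [f"graph {direction}"]
    for node_id, (kind, label) in components.items():
        opener, closer = SHAPES.get(kind, ("[", "]"))  # unknown kinds become plain boxes
        lines.append(f"    {node_id}{opener}{label}{closer}")
    for src, dst in connections:
        lines.append(f"    {src} --> {dst}")
    return "\n".join(lines)


diagram = to_mermaid(
    {"Router": ("router", "Main Router"), "FW": ("firewall", "Firewall")},
    [("Router", "FW")],
)
```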

Supported Network Components

  • 🔀 Routers and Switches
  • 🛡️ Firewalls
  • 🖥️ Servers and Workstations
  • 💾 Databases
  • ⚖️ Load Balancers
  • ☁️ Cloud Services
  • 📡 Wireless Access Points

Output Format

Network diagrams are saved as markdown with embedded Mermaid code:

# Page 5 - Network Diagram

**Type**: topology
**Detection Confidence**: 0.95
**Components**: 8 detected
**Connections**: 12 detected

## Diagram

```mermaid
graph TB
    Router([Main Router])
    Switch[Core Switch]
    FW{{Firewall}}
    Server1[(Web Server)]
    
    Router --> FW
    FW --> Switch
    Switch --> Server1

Page Text Content

This section describes the SD-WAN architecture with multiple branch offices connecting to headquarters through various transport methods including MPLS, broadband, and LTE connections. The solution provides path selection, application-aware routing, and centralized management...


## Output Structure (Enhanced v0.1.4)

All output is organized using MD5 checksums for unique document identification:

output/                                  # Base directory (configurable with --output)
├── index.md                             # Master index tracking all processed documents
├── 6c928950e6b73fffe316e0ad6bba3a67/    # MD5 checksum as folder name
│   ├── markdown/                        # All transcribed content
│   │   ├── page_001.md                  # Individual page (text or diagram)
│   │   ├── page_002.md
│   │   └── document.md                  # Complete merged document with footer metrics
│   ├── images/                          # Original page images (if --keep-images)
│   └── summary.md                       # Processing summary and statistics
└── 0611ca05dab284e943e3b00d3993d424/    # Another document's folder
    └── ...

Benefits:

  • Same document won't be processed twice (deduplication)
  • Easy to find previous processing results
  • index.md provides overview of all processed documents
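
As a rough illustration of how a checksum-named folder gives deduplication, the sketch below hashes a file with Python's standard `hashlib` and checks whether the matching folder already exists. The helper names `md5_of_file` and `output_dir_for` are assumptions made for this example, and the tool's real logic may differ.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def output_dir_for(pdf: Path, base: Path = Path("output")) -> tuple[Path, bool]:
    """Return (folder, already_processed) for a document.

    The folder name is the MD5 hex digest of the file contents, so the same
    bytes always map to the same folder regardless of the filename.
    """
    folder = base / md5_of_file(pdf)
    return folder, folder.exists()
```

Renaming a PDF does not change its folder, while editing a single byte does, which is what makes the checksum usable as a document identity.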

Index File (output/index.md)

Automatically tracks all processed documents:

```markdown
| Filename | Timestamp | MD5 Checksum | Folder | Processing Time |
|----------|-----------|--------------|--------|----------------|
| network.pdf | 2025-08-20 14:30:15 | `6c9289...` | [📁 6c9289...](./6c9289.../) | 2m 30s |
| manual.pdf | 2025-08-20 14:35:22 | `0611ca...` | [📁 0611ca...](./0611ca.../) | 1m 45s |
```

Enhanced Footer Metrics (NEW v0.1.4)

Every merged document includes comprehensive processing metrics:

  • Document Info: Source file, size, MD5 checksum, pages processed
  • Processing Details: Date/time, models used, processing time, mode
  • Quality Report: Errors, warnings, success metrics
  • Configuration: Settings used during processing

Processing Modes

Default: Hybrid Mode (Text-First)

  • Text-First Approach: ALWAYS transcribes text before attempting diagram detection
  • Guaranteed Content: Text is captured even if diagram processing fails
  • Automatic Detection: Every page is analyzed for network diagrams
  • Dual Content Extraction: Pages with diagrams include BOTH Mermaid diagram AND text content
  • Intelligent Processing: Network diagrams → Mermaid (with icons), Text → Markdown
  • Progress Tracking: Detailed step-by-step progress messages
  • Smart Timeouts: Operations time out after 60s with automatic fallback
  • Processing Time: 30-60 seconds per page
  • Best For: Most documents (mixed content)

Text-Only Mode (--text-only)

  • No Detection: Skip diagram detection for speed
  • Processing Time: 15-30 seconds per page
  • Best For: Documents with only text

Network-Only Mode (--network-only)

  • Diagram Focus: Process only network diagrams
  • Processing Time: 30-60 seconds per diagram
  • Best For: Network architecture documents
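
The per-page figures quoted for each mode translate directly into a run-time estimate. The helper below is a back-of-the-envelope sketch using only those quoted ranges. `estimate_minutes` and the mode keys are names chosen for this example, and real throughput depends on hardware and models.

```python
# Seconds per unit of work, taken from the mode descriptions above.
SECONDS_PER_UNIT = {
    "hybrid": (30, 60),        # per page
    "text-only": (15, 30),     # per page
    "network-only": (30, 60),  # per diagram, not per page
}


def estimate_minutes(units: int, mode: str = "hybrid") -> tuple[float, float]:
    """Return the (best-case, worst-case) run time in minutes."""
    low, high = SECONDS_PER_UNIT[mode]
    return units * low / 60, units * high / 60


# A 50-page document in the default hybrid mode: 25 to 50 minutes.
print(estimate_minutes(50))               # (25.0, 50.0)
# The same document with --text-only: 12.5 to 25 minutes.
print(estimate_minutes(50, "text-only"))  # (12.5, 25.0)
```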

Performance & Troubleshooting

If Processing is Slow or Stuck

The tool now includes detailed progress messages showing what's happening and which models are being used:

  Page 3: Processing...
    Transcribing page text (nanonets-ocr-s)... Done (12.3s)  <-- Text captured first!
    Checking for network diagram (qwen2.5vl)... Done (2.1s)
    Network diagram detected (confidence: 0.90)
    Type: topology
    Extracting components (qwen2.5vl)... Done (5.1s)
    Generating Mermaid diagram (qwen2.5vl)... Done (8.2s)
    Validating Mermaid syntax... Valid (0.1s)
    Writing to file... Done (0.1s)
    Total processing time: 27.9s

Important: Text is ALWAYS transcribed first, so even if diagram processing times out or fails, you'll still have the page content.

If an operation takes too long:

  • Default timeout: 60 seconds per operation
  • Adjust timeout: Use --timeout 120 for complex diagrams
  • Automatic fallback: If the LLM times out, the tool falls back to simpler methods
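
A timeout with fallback can be sketched with the standard library's `concurrent.futures`. This shows the general pattern only and is not the tool's implementation: `run_with_fallback` and its parameters are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    timeout: float = 60.0,
) -> T:
    """Run `primary`; if it exceeds `timeout` seconds, return `fallback()` instead."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        return fallback()
    finally:
        # Do not block on the slow call. Python cannot kill the worker thread,
        # so it keeps running in the background until it finishes on its own.
        pool.shutdown(wait=False)
```

Only a timeout triggers the fallback in this sketch. Any other exception raised by `primary` propagates to the caller.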

Common Issues and Fixes

Mermaid Syntax Errors (Robust Auto-Fix)

The tool uses a comprehensive validator to automatically fix Mermaid syntax issues:

Phase 1 - Basic Cleanup:

  • C-style comments (//) → Removed or converted to Mermaid comments (%%)
  • Curly braces in graph declarations → Removed
  • Invalid syntax elements → Cleaned

Phase 2 - Node ID Fixing:

  • Spaces in node IDs → Converted to underscores (e.g., Data Center → Data_Center)
  • Special characters → Replaced with safe alternatives
  • Duplicate node IDs → Automatically numbered (e.g., Server, Server2, Server3)

Phase 3 - Connection Fixing:

  • Updates all connections to use fixed node IDs
  • Preserves connection types and labels
  • Maintains directional flow

Phase 4 - Style Application:

  • Fixes class applications to use corrected node IDs
  • Preserves styling and visual attributes

Examples of Auto-Fixes:

  • subgraph_DMZ → subgraph DMZ
  • Data Center (HQ) → Data_Center_HQ (as node ID)
  • Parentheses in labels → Automatically quoted
  • Multiple Secure SD-WAN nodes → Secure_SD_WAN, Secure_SD_WAN2, etc.
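
The node ID and comment fixes can be approximated with a few regular expressions. The functions below are a simplified sketch written for this README, not the validator shipped with the tool. The names `safe_node_id`, `assign_node_ids`, and `fix_comment` are assumptions.

```python
import re
from collections import Counter


def safe_node_id(label: str) -> str:
    """Turn a free-form label into a Mermaid-safe node ID."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    node_id = re.sub(r"[^A-Za-z0-9]+", "_", label).strip("_")
    return node_id or "node"


def assign_node_ids(labels: list[str]) -> list[str]:
    """Sanitize labels and number repeats: Server, Server2, Server3, ..."""
    seen = Counter()
    ids = []
    for label in labels:
        base = safe_node_id(label)
        seen[base] += 1
        ids.append(base if seen[base] == 1 else f"{base}{seen[base]}")
    return ids


def fix_comment(line: str) -> str:
    """Convert a line-leading C-style // comment into a Mermaid %% comment."""
    return re.sub(r"^(\s*)//", r"\1%%", line)


print(safe_node_id("Data Center (HQ)"))                      # Data_Center_HQ
print(assign_node_ids(["Secure SD-WAN", "Secure SD-WAN"]))   # ['Secure_SD_WAN', 'Secure_SD_WAN2']
```

This sketch does not guard against a numbered ID colliding with a label that already ends in a digit, such as an existing `Server2`. A production validator would need to check for that.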

Centralized Database Management (NEW v0.1.12!)

NetIntel-OCR now supports unified database management with advanced query capabilities:

# Create unified database from per-document databases
netintel-ocr --merge-to-centralized --output ./documents --centralized-db ./unified.lancedb

# Query with advanced filtering and ranking
netintel-ocr --query "firewall configuration" \
  --centralized-db ./unified.lancedb \
  --filters '{"document_type": "network_diagram", "confidence": {"$gte": 0.8}}' \
  --rerank-strategy semantic \
  --output-format json \
  --limit 20

# Get database statistics and health
netintel-ocr --db-stats ./unified.lancedb
netintel-ocr --db-optimize ./unified.lancedb --vacuum --reindex

Key Features:

  • Deduplication: Automatic MD5-based duplicate detection
  • Multi-field Filtering: Query by source, type, confidence, date ranges
  • Reranking: Semantic, hybrid, and temporal reranking strategies
  • Export Formats: JSON, Markdown, CSV with customizable fields
  • Validation: Automatic schema validation and integrity checks
  • Statistics: Comprehensive database metrics and health monitoring
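
The `--filters` argument shown above uses a MongoDB-style JSON object. To make the semantics concrete, here is a minimal sketch of how such a filter could be evaluated against a record in Python. Only `$gte` appears in this README. The other operators, the `matches` function, and the sample records are assumptions for illustration.

```python
import json
from typing import Any

# Comparison operators; only "$gte" is confirmed by the example above.
_OPS = {
    "$gte": lambda value, bound: value >= bound,
    "$gt": lambda value, bound: value > bound,
    "$lte": lambda value, bound: value <= bound,
    "$lt": lambda value, bound: value < bound,
    "$eq": lambda value, bound: value == bound,
}


def matches(record: dict[str, Any], filters: dict[str, Any]) -> bool:
    """Return True when the record satisfies every field condition."""
    for field, condition in filters.items():
        if field not in record:
            return False
        value = record[field]
        if isinstance(condition, dict):
            # Operator form, e.g. {"$gte": 0.8}; all operators must hold.
            if not all(_OPS[op](value, bound) for op, bound in condition.items()):
                return False
        elif value != condition:
            # Plain value means equality.
            return False
    return True


filters = json.loads('{"document_type": "network_diagram", "confidence": {"$gte": 0.8}}')
records = [
    {"document_type": "network_diagram", "confidence": 0.95},
    {"document_type": "network_diagram", "confidence": 0.60},
    {"document_type": "text", "confidence": 0.99},
]
print([r for r in records if matches(r, filters)])
# [{'document_type': 'network_diagram', 'confidence': 0.95}]
```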

Enhanced Batch Processing (NEW v0.1.12!)

Process multiple PDFs efficiently with parallel processing and automatic merging:

# Batch process directory with parallel workers
netintel-ocr --batch-ingest ./pdf_directory \
  --output ./batch_output \
  --parallel-workers 6 \
  --checkpoint-interval 5 \
  --auto-merge \
  --s3-sync

# Resume interrupted batch processing
netintel-ocr --batch-ingest ./pdf_directory \
  --output ./batch_output \
  --resume-batch \
  --skip-existing

Performance Benefits:

  • Parallel Processing: Up to 8x faster with multiple workers
  • Progress Tracking: Real-time progress with ETA and throughput
  • Checkpoint Resume: Resume from interruption point
  • Memory Management: Intelligent worker allocation based on system resources
  • Auto-merge: Automatic centralized database updates
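
Parallel workers combined with checkpoint resume can be sketched with a thread pool and a small JSON file. This is an illustrative pattern under stated assumptions, not the tool's internals: `batch_process`, `load_done`, `save_done`, and the checkpoint file format are hypothetical.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Iterable


def load_done(checkpoint: Path) -> set[str]:
    """Read the set of already-processed paths, or an empty set on first run."""
    if checkpoint.exists():
        return set(json.loads(checkpoint.read_text()))
    return set()


def save_done(checkpoint: Path, done: set[str]) -> None:
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    tmp = checkpoint.with_suffix(".tmp")
    tmp.write_text(json.dumps(sorted(done)))
    os.replace(tmp, checkpoint)  # atomic when both paths are on one filesystem


def batch_process(
    pdfs: Iterable[Path],
    process: Callable[[Path], object],
    checkpoint: Path,
    workers: int = 6,
    checkpoint_interval: int = 5,
) -> set[str]:
    """Process pending PDFs in parallel, saving progress every few documents."""
    done = load_done(checkpoint)
    pending = [p for p in pdfs if str(p) not in done]  # resume: skip finished work
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = {pool.submit(process, p): p for p in pending}
            for count, future in enumerate(as_completed(futures), start=1):
                future.result()  # re-raise any worker error
                done.add(str(futures[future]))
                if count % checkpoint_interval == 0:
                    save_done(checkpoint, done)
    finally:
        save_done(checkpoint, done)  # persist progress even on failure
    return done
```

Calling `batch_process` a second time with the same checkpoint file processes only the documents that did not finish, which is the behavior the resume flags above describe.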

S3/MinIO Cloud Storage (NEW v0.1.12!)

Full cloud storage integration for distributed deployments:

# Configure cloud storage (placeholder values; substitute your own endpoint and credentials)
export S3_ENDPOINT=https://minio.company.com
export S3_BUCKET=netintel-docs
export AWS_ACCESS_KEY_ID=admin
export AWS_SECRET_ACCESS_KEY=password123

# Process with cloud sync
netintel-ocr document.pdf --s3-sync --s3-backup

# Batch process from cloud
netintel-ocr --batch-ingest s3://netintel-docs/input/ \
  --output s3://netintel-docs/output/ \
  --centralized-db s3://netintel-docs/unified.lancedb

Cloud Features:

  • Bi-directional Sync: Upload/download with versioning
  • Backup/Restore: Automatic backup with retention policies
  • Distributed Access: Multiple workers can access shared storage
  • Credentials Management: Support for AWS IAM, MinIO admin, environment variables
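
When credentials come from environment variables, it helps to fail early if one is missing. The sketch below reads the four variables used in the example above. The `S3Settings` class and `load_s3_settings` function are hypothetical helpers, not part of NetIntel-OCR.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED = ("S3_ENDPOINT", "S3_BUCKET", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


@dataclass(frozen=True)
class S3Settings:
    endpoint: str
    bucket: str
    access_key: str
    secret_key: str


def load_s3_settings(env: Mapping[str, str] = os.environ) -> S3Settings:
    """Collect the storage settings, naming every missing variable at once."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return S3Settings(
        endpoint=env["S3_ENDPOINT"],
        bucket=env["S3_BUCKET"],
        access_key=env["AWS_ACCESS_KEY_ID"],
        secret_key=env["AWS_SECRET_ACCESS_KEY"],
    )
```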

Advanced Embedding Management (NEW v0.1.12!)

Enhanced embedding generation with multiple providers and caching:

# Configure multiple embedding providers
netintel-ocr document.pdf \
  --embedding-provider openai \
  --embedding-model text-embedding-3-large \
  --embedding-cache-ttl 7200 \
  --batch-size 50

# Use local Ollama embeddings
netintel-ocr document.pdf \
  --embedding-provider ollama \
  --embedding-model mxbai-embed-large \
  --embedding-cache ./embeddings_cache

Embedding Features:

  • Multiple Providers: OpenAI, Ollama, HuggingFace support
  • Caching with TTL: Intelligent caching to avoid recomputation
  • Batch Processing: Efficient batch embedding generation
  • Model Management: Automatic model configuration and validation
  • Cost Optimization: Cache hits reduce API costs by up to 90%
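
Caching with a TTL can be illustrated with a small in-memory class. This is a sketch of the general technique, assuming a caller-supplied `embed` function. `EmbeddingCache` is a hypothetical name and says nothing about how the tool stores its cache on disk.

```python
import hashlib
import time
from typing import Callable


class EmbeddingCache:
    """In-memory cache that recomputes an embedding only after its TTL expires."""

    def __init__(
        self,
        embed: Callable[[str], list[float]],
        ttl_seconds: float = 7200.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._embed = embed
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, list[float]]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> list[float]:
        # Key on a hash of the text so long passages do not become dict keys.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        vector = self._embed(text)  # the only place the provider is called
        self._entries[key] = (now, vector)
        return vector
```

Every hit is one provider call avoided, so the `hits` and `misses` counters give a direct measure of how much recomputation the cache is saving.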

Recent Improvements

Version 0.1.12 (2025-08-21)

  • ✅ Centralized Database Management: Unified LanceDB with MD5 deduplication
  • ✅ Advanced Query Engine: Vector search with filtering, reranking, and multiple output formats
  • ✅ Batch Processing Pipeline: Parallel PDF processing with progress tracking and checkpoints
  • ✅ S3/MinIO Storage Backend: Cloud storage integration with bi-directional sync
  • ✅ Enhanced CLI Commands: --query, --merge-to-centralized, --batch-ingest, --db-stats, --db-optimize
  • ✅ Embedding Management: Multiple provider support with caching and TTL
  • ✅ Database Optimization: Validation, statistics, export, and backup capabilities

Version 0.1.11 (2025-08-21)

  • ✅ Docker Support: Complete Docker containerization with MinIO integration
  • ✅ Kubernetes Ready: Full Helm chart for production deployments
  • ✅ Project Initialization: --init command creates complete containerized environment
  • ✅ Configuration Management: YAML-based configuration with environment variable overrides
  • ✅ Query Interface Foundation: Query vector databases (enhanced in v0.1.12)
  • ✅ Centralized DB Foundation: Merge per-document databases (enhanced in v0.1.12)

Version 0.1.10 (2025-08-20)

  • ✅ Checkpoint/Resume: Automatic saving and resume capability for long documents
  • ✅ Page-Level Tracking: Individual page checkpoint tracking
  • ✅ Resume Summary: Clear display of resume status and remaining work
  • ✅ Atomic Saves: Checkpoint integrity with atomic file operations
  • ✅ Automatic Cleanup: Checkpoints removed after successful completion

Version 0.1.4 (2025-08-20)

  • ✅ Multi-Model Support: Use different models for text and network processing
  • ✅ MD5-Based Output: Unique folders per document using MD5 checksums
  • ✅ Document Index: Automatic index.md tracking all processed documents
  • ✅ Enhanced Footer: Comprehensive metrics in merged documents
  • ✅ Simplified Defaults: Output to output/ instead of timestamped folders
  • ✅ Model Progress Display: Shows which model is being used for each operation
  • ✅ Deduplication: Same document uses same output folder

Version 0.1.3

  • ✅ Hybrid Mode by Default: Automatic network diagram detection
  • ✅ Text-First Processing: Guarantees content capture before diagram extraction
  • ✅ Fast Extraction Mode: 50-70% faster processing option
  • ✅ Enhanced Error Recovery: Graceful fallbacks and timeout management

Version 0.1.0

  • ✅ Initial pypi.org Release
  • ✅ Fixed Mermaid syntax issues: Automatically handles parentheses in node labels
  • ✅ Improved component detection: Fixed issue with multiple types being listed
  • ✅ Enhanced error handling: Better fallback for malformed LLM responses
  • ✅ Automatic syntax correction: C-style comments and invalid syntax auto-fixed
  • ✅ Better type selection: Ensures components have single, specific types

Limitations

  • Maximum 100 pages per processing run: This limit ensures optimal processing time and prevents memory issues. For larger documents, use the --start and --end flags to process specific sections.
  • Network Detection Accuracy: Detection confidence varies based on diagram complexity and clarity. Adjust the --confidence threshold as needed.
  • Model Requirements: Network detection requires vision-capable models (e.g., nanonets-ocr-s, qwen2.5vl, llava)
  • Timeout Behavior: Operations that exceed the timeout will fall back to simpler processing methods
