A tool for extracting network intelligence from PDFs using OCR and AI.

NetIntel-OCR (Network Intelligence OCR)

🔍 Extract Network Intelligence from PDFs with AI-Powered OCR

NetIntel-OCR is a specialized tool designed for network engineers and security professionals to automatically extract, analyze, and convert network diagrams from PDF documents into structured formats. Using advanced OCR and AI technologies, it identifies network topologies, components, and connections, transforming them into actionable intelligence.

🎯 Key Capabilities

Network Intelligence Extraction

Automatic Network Detection: AI-powered identification of network diagrams in documents
Component Recognition: Identifies routers, switches, firewalls, servers, and other network elements
Connection Mapping: Traces and documents network paths and relationships
Security Architecture Analysis: Extracts security zones, DMZs, and trust boundaries

✨ Features

🎯 Intelligent Hybrid Processing: Automatically detects and processes network diagrams as Mermaid.js, text as markdown
📄 PDF to Text Conversion: Convert PDFs to markdown files locally, no token costs
🤖 Multimodal AI: Use the latest vision-language models supported by Ollama
🖼️ Visual Understanding: Turn images and diagrams into detailed text descriptions
🔌 Automatic Network Detection: No flags needed - network diagrams are detected and converted automatically
🎨 Icons by Default: Font Awesome icons automatically added to network diagrams for better visualization
⏱️ Smart Timeouts: Operations time out gracefully and fall back to simpler methods
📊 Diagram Types Supported: Network topology, architecture diagrams, data flow diagrams, security diagrams
⚡ Optimized Processing: Processes up to 100 pages per run with detailed progress tracking
🔧 Flexible Output: Unified markdown format with seamlessly embedded Mermaid diagrams

💼 Use Cases

Network Documentation

Convert legacy network diagrams to modern formats
Extract network topology from vendor documentation
Audit and inventory network architectures

Security Analysis

Map security architecture from compliance documents
Extract firewall rules and network segmentation
Document data flow and trust boundaries

Infrastructure Planning

Analyze existing network designs
Extract capacity and redundancy information
Document interconnections and dependencies

📦 Requirements

Python 3.10+
Ollama installed and running locally or on a remote server

Installing Ollama and the Default Model

Install Ollama
Pull the default model (ollama run downloads the model on first use, then starts an interactive session):

ollama run nanonets-ocr-s:latest

Using a Remote Ollama Server

By default, netintel-ocr connects to Ollama running on localhost. To use a remote Ollama server, set the OLLAMA_HOST environment variable:

# Connect to a remote Ollama server
export OLLAMA_HOST="http://192.168.1.100:11434"
netintel-ocr document.pdf

# Or run with the environment variable inline
OLLAMA_HOST="http://remote-server:11434" netintel-ocr document.pdf
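
If you drive netintel-ocr from a wrapper script, it can help to resolve the server address the same way before launching a run. The following is a minimal sketch, not the tool's own code: `resolve_ollama_host` is a hypothetical helper, and the scheme-defaulting rule is an assumption. The only facts taken as given are the `OLLAMA_HOST` variable described above and Ollama's standard local port, 11434.

```python
import os

# Ollama listens on port 11434 by default.
DEFAULT_OLLAMA_HOST = "http://localhost:11434"


def resolve_ollama_host(env=None):
    """Return the Ollama base URL, preferring OLLAMA_HOST when it is set.

    Hypothetical helper for wrapper scripts; netintel-ocr's internal
    resolution logic is not documented here.
    """
    env = os.environ if env is None else env
    host = (env.get("OLLAMA_HOST") or "").strip()
    if not host:
        return DEFAULT_OLLAMA_HOST
    # Assumption: treat a bare "host:port" value as plain HTTP.
    if "://" not in host:
        host = "http://" + host
    return host.rstrip("/")
```

Passing a plain dictionary instead of the real environment makes the helper easy to check without touching your shell settings.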

Installation

From PyPI

Install the published version using pip:

pip install netintel-ocr

or uv:

uv tool install netintel-ocr

Usage

Default Behavior (NEW!)

By default, netintel-ocr automatically detects network diagrams and processes them as Mermaid diagrams:

# Automatic hybrid mode - detects and converts network diagrams
netintel-ocr path/to/your/file.pdf

Text-Only Mode

For faster processing when you know the document contains only text:

netintel-ocr document.pdf --text-only

Performance Optimization (NEW!)

For faster processing of network diagrams, use the --fast-extraction flag:

# Fast extraction mode - reduces extraction time by 50-70%
netintel-ocr document.pdf --fast-extraction

# Combine with adjusted timeout for best performance
netintel-ocr document.pdf --fast-extraction --timeout 30

Fast extraction benefits:

Detection: ~15 seconds (vs 30-60s standard)
Extraction: ~20 seconds (vs 30-60s standard)
Uses simplified prompts for quicker LLM responses
Automatic fallback if fast extraction fails
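
The timeout-with-fallback behavior can be pictured with a small generic pattern. This is a sketch under stated assumptions, not netintel-ocr's implementation: `run_with_fallback`, `primary`, and `fallback` are hypothetical names, and the real tool may structure its fallback differently.

```python
import concurrent.futures


def run_with_fallback(primary, fallback, timeout_s):
    """Run primary(); if it times out or raises, return fallback() instead.

    Hypothetical illustration of a timeout-plus-fallback pattern.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(primary)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The primary step took too long; use the simpler method.
        return fallback()
    except Exception:
        # The primary step failed outright; use the simpler method.
        return fallback()
    finally:
        # Do not block on a worker that may still be running.
        executor.shutdown(wait=False)
```

Note that a timed-out worker thread is not cancelled in this sketch; it simply runs to completion in the background while the fallback result is returned.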

Command Line Options

Basic Options

--output, -o: Output directory (default: "output_YYYYMMDD_HHMMSS")
--model, -m: Ollama model to use (default: "nanonets-ocr-s:latest")
--keep-images, -k: Keep the intermediate image files (default: False)
--width, -w: Width to resize images to, 0 to skip resizing (default: 0)
--start, -s: Start page number (default: 0, processes from beginning)
--end, -e: End page number (default: 0, processes to end)

Processing Mode Options

--text-only, -t: Skip network diagram detection for faster text-only processing
--network-only: Process only network diagrams, skip regular text pages

Network Diagram Options (applies to default mode)

--confidence, -c: Minimum confidence threshold for network diagram detection (0.0-1.0, default: 0.7)
--no-icons: Disable Font Awesome icons in Mermaid diagrams (icons are enabled by default)
--diagram-only: Only extract network diagrams without page text (by default, both are extracted)
--timeout: Timeout in seconds for each LLM operation (default: 60s, increase for complex diagrams)

Examples

Basic Usage (with automatic network detection)

# DEFAULT: Automatic network diagram detection (with icons)
netintel-ocr document.pdf

# Process with custom settings
netintel-ocr document.pdf --confidence 0.8

# Increase timeout for complex diagrams
netintel-ocr document.pdf --timeout 120

# Text-only mode (faster, no detection)
netintel-ocr document.pdf --text-only

# Process specific pages
netintel-ocr document.pdf --start 1 --end 5

# Use a different Ollama model
netintel-ocr document.pdf --model qwen2.5vl:latest

Specialized Processing

# Process ONLY network diagrams (skip text pages)
netintel-ocr network-architecture.pdf --network-only

# Higher confidence threshold (stricter detection)
netintel-ocr document.pdf --confidence 0.9

# Disable icons if not needed
netintel-ocr document.pdf --no-icons

# Extract only diagrams without text (faster)
netintel-ocr document.pdf --diagram-only

# Faster text-only processing
netintel-ocr text-document.pdf --text-only

Process large documents in sections (max 100 pages per run):

# Process first 100 pages
netintel-ocr large-document.pdf --start 1 --end 100

# Process next section
netintel-ocr large-document.pdf --start 101 --end 200

# Process specific chapter (e.g., pages 50-100)
netintel-ocr large-document.pdf --start 50 --end 100

Processing Guidelines

Document Size Recommendations

| Document Size | Processing Strategy | Example |
|---|---|---|
| 1-50 pages | Single run | `netintel-ocr doc.pdf` |
| 51-100 pages | Single run or split | `netintel-ocr doc.pdf` |
| 101-300 pages | Process in 100-page sections | See examples below |
| 300+ pages | Process key sections only | Use specific page ranges |

Processing Large Documents

For a 250-page document:

# Section 1: Pages 1-100
netintel-ocr document.pdf --start 1 --end 100 -o output_section1

# Section 2: Pages 101-200
netintel-ocr document.pdf --start 101 --end 200 -o output_section2

# Section 3: Pages 201-250
netintel-ocr document.pdf --start 201 --end 250 -o output_section3
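
For longer documents, the section boundaries can be computed rather than typed by hand. The sketch below uses only the `--start`, `--end`, and `-o` flags shown above; `page_ranges` and `section_commands` are hypothetical helper names, and the total page count is assumed to be known in advance.

```python
def page_ranges(total_pages, chunk_size=100):
    """Split pages 1..total_pages into inclusive (start, end) ranges."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def section_commands(pdf_path, total_pages, chunk_size=100):
    """Build one netintel-ocr argument list per section of the document."""
    commands = []
    ranges = page_ranges(total_pages, chunk_size)
    for index, (start, end) in enumerate(ranges, start=1):
        commands.append([
            "netintel-ocr", pdf_path,
            "--start", str(start),
            "--end", str(end),
            "-o", f"output_section{index}",
        ])
    return commands


for command in section_commands("document.pdf", 250):
    print(" ".join(command))
```

Each argument list can be passed to `subprocess.run(command, check=True)` to run the sections one after another.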

Network Diagram Detection (Now Default!)

NEW: Network diagram detection is now enabled by default! No flags needed.

netintel-ocr automatically (in order):

1. Transcribes text content FIRST (guaranteed capture)
2. Detects network diagrams in PDF pages
3. Identifies components (routers, switches, firewalls, servers, databases, etc.)
4. Extracts connections and relationships
5. Converts to Mermaid.js format
6. Combines BOTH the diagram AND the page's text content
7. Embeds everything in unified markdown output
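
The order above can be sketched as a small text-first function. This is an illustration only: `process_page` and the `transcribe`, `detect`, and `to_mermaid` callables are hypothetical stand-ins for the model calls, and the 0.7 default mirrors the documented `--confidence` default.

```python
def process_page(page, transcribe, detect, to_mermaid, threshold=0.7):
    """Text-first processing: capture text, then attempt diagram extraction.

    All callables are hypothetical placeholders for the real model calls.
    """
    # Step 1: text is captured before any diagram work starts.
    result = {"text": transcribe(page), "mermaid": None}
    try:
        # Steps 2-5: detect a diagram and convert it if confident enough.
        confidence = detect(page)
        if confidence >= threshold:
            result["mermaid"] = to_mermaid(page)
    except Exception:
        # Diagram processing failed; the transcribed text is still returned.
        pass
    return result
```

Because the text is stored before the `try` block, a failure in detection or conversion cannot discard it.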

Supported Network Components

🔀 Routers and Switches
🛡️ Firewalls
🖥️ Servers and Workstations
💾 Databases
⚖️ Load Balancers
☁️ Cloud Services
📡 Wireless Access Points

Output Format

Network diagrams are saved as markdown with embedded Mermaid code:

# Page 5 - Network Diagram

**Type**: topology
**Detection Confidence**: 0.95
**Components**: 8 detected
**Connections**: 12 detected

## Diagram

```mermaid
graph TB
    Router([Main Router])
    Switch[Core Switch]
    FW{{Firewall}}
    Server1[(Web Server)]
    
    Router --> FW
    FW --> Switch
    Switch --> Server1
```

## Page Text Content

This section describes the SD-WAN architecture with multiple branch offices connecting to headquarters through various transport methods including MPLS, broadband, and LTE connections. The solution provides path selection, application-aware routing, and centralized management...


## Output Structure

All output is saved in a timestamped directory:

    output_YYYYMMDD_HHMMSS/
    ├── markdown/          # All transcribed content
    │   ├── page_001.md    # Individual page (text or diagram)
    │   ├── page_002.md
    │   └── document.md    # Complete merged document (named after PDF)
    ├── images/            # Original page images (if --keep-images)
    └── summary.md         # Processing summary and statistics

Note: The merged file is automatically named after your PDF file. For example: invoice.pdf → invoice.md
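
To reuse the generated diagrams elsewhere, the Mermaid blocks can be pulled back out of the page files. The following is a minimal sketch, assuming the output keeps diagrams in standard fenced `mermaid` blocks as in the Output Format example; `extract_mermaid` is a hypothetical helper, not part of netintel-ocr.

```python
import re

# Built at runtime so the fence marker never appears at the start of a line.
FENCE = "`" * 3

MERMAID_BLOCK = re.compile(
    rf"^{FENCE}mermaid[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_mermaid(markdown_text):
    """Return the body of every fenced mermaid block in the given markdown."""
    return [
        match.group(1).rstrip("\n")
        for match in MERMAID_BLOCK.finditer(markdown_text)
    ]
```

A page file can be read with `pathlib.Path("output_YYYYMMDD_HHMMSS/markdown/page_005.md").read_text(encoding="utf-8")` and passed straight to `extract_mermaid`, substituting your actual output directory name.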


## Processing Modes

### Default: Hybrid Mode (Text-First)
- **Text-First Approach**: ALWAYS transcribes text before attempting diagram detection
- **Guaranteed Content**: Text is captured even if diagram processing fails
- **Automatic Detection**: Every page is analyzed for network diagrams
- **Dual Content Extraction**: Pages with diagrams include BOTH Mermaid diagram AND text content
- **Intelligent Processing**: Network diagrams → Mermaid (with icons), Text → Markdown
- **Progress Tracking**: Detailed step-by-step progress messages
- **Smart Timeouts**: Operations time out after 60s (configurable with `--timeout`) with automatic fallback
- **Processing Time**: 30-60 seconds per page
- **Best For**: Most documents (mixed content)

### Text-Only Mode (`--text-only`)
- **No Detection**: Skip diagram detection for speed
- **Processing Time**: 15-30 seconds per page
- **Best For**: Documents with only text

### Network-Only Mode (`--network-only`)
- **Diagram Focus**: Process only network diagrams
- **Processing Time**: 30-60 seconds per diagram
- **Best For**: Network architecture documents
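
The per-page figures above translate directly into a rough run-time estimate. This sketch only multiplies the documented ranges; `estimate_runtime` and the mode keys are hypothetical names, and real timings depend on the model and hardware.

```python
# Per-unit time ranges in seconds, taken from the mode descriptions above.
SECONDS_PER_UNIT = {
    "hybrid": (30, 60),        # per page
    "text-only": (15, 30),     # per page
    "network-only": (30, 60),  # per diagram
}


def estimate_runtime(units, mode="hybrid"):
    """Return (low, high) estimated seconds for `units` pages or diagrams."""
    if units < 0:
        raise ValueError("units must be non-negative")
    low, high = SECONDS_PER_UNIT[mode]
    return units * low, units * high


low, high = estimate_runtime(100, "hybrid")
print(f"{low / 60:.0f}-{high / 60:.0f} minutes")  # prints "50-100 minutes"
```

By this arithmetic, a full 100-page run in the default mode takes roughly 50 to 100 minutes, which is one reason to use `--text-only` when a document has no diagrams.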

## Performance & Troubleshooting

### If Processing is Slow or Stuck

The tool now includes detailed progress messages to show what's happening:

    Page 3: Processing...
      Transcribing page text... Done (12.3s)  <-- Text captured first!
      Checking for network diagram... Done (2.1s)
      Network diagram detected (confidence: 0.90)
      Type: topology
      Extracting components... Done (5.1s)
      Generating Mermaid diagram... Done (8.2s)
      Validating Mermaid syntax... Valid (0.1s)
      Writing to file... Done (0.1s)
      Total processing time: 27.9s


**Important**: Text is ALWAYS transcribed first, so even if diagram processing times out or fails, you'll still have the page content.

If an operation takes too long:
- **Default timeout**: 60 seconds per operation
- **Adjust timeout**: Use `--timeout 120` for complex diagrams
- **Automatic fallback**: If the LLM times out, the tool falls back to simpler methods

### Common Issues and Fixes

#### Mermaid Syntax Errors (Robust Auto-Fix)
The tool uses a comprehensive validator to automatically fix Mermaid syntax issues:

**Phase 1 - Basic Cleanup:**
- C-style comments (`//`) → Removed or converted to Mermaid comments (`%%`)
- Curly braces in graph declarations → Removed
- Invalid syntax elements → Cleaned

**Phase 2 - Node ID Fixing:**
- Spaces in node IDs → Converted to underscores (e.g., `Data Center` → `Data_Center`)
- Special characters → Replaced with safe alternatives
- Duplicate node IDs → Automatically numbered (e.g., `Server`, `Server2`, `Server3`)

**Phase 3 - Connection Fixing:**
- Updates all connections to use fixed node IDs
- Preserves connection types and labels
- Maintains directional flow

**Phase 4 - Style Application:**
- Fixes class applications to use corrected node IDs
- Preserves styling and visual attributes

**Examples of Auto-Fixes:**
- `subgraph_DMZ` → `subgraph DMZ`
- `Data Center (HQ)` → `Data_Center_HQ` (as node ID)
- Parentheses in labels → Automatically quoted
- Multiple `Secure SD-WAN` nodes → `Secure_SD_WAN`, `Secure_SD_WAN2`, etc.
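
The comment and node ID fixes described above can be approximated in a few lines. This is a simplified sketch, not the validator that ships with netintel-ocr: `convert_comment`, `sanitize_id`, and `assign_ids` are hypothetical names, and the real tool handles more cases (labels, connections, styles).

```python
import re


def convert_comment(line):
    """Turn a line that starts with a C-style comment into a Mermaid comment."""
    stripped = line.lstrip()
    if stripped.startswith("//"):
        indent = line[: len(line) - len(stripped)]
        return indent + "%%" + stripped[2:]
    return line


def sanitize_id(label):
    """Replace spaces and special characters with underscores."""
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", label).strip("_")
    return cleaned or "node"


def assign_ids(labels):
    """Give every label a unique node ID, numbering duplicates from 2."""
    used = set()
    ids = []
    for label in labels:
        base = sanitize_id(label)
        candidate, suffix = base, 1
        while candidate in used:
            suffix += 1
            candidate = f"{base}{suffix}"
        used.add(candidate)
        ids.append(candidate)
    return ids


print(assign_ids(["Data Center (HQ)", "Secure SD-WAN", "Secure SD-WAN"]))
```

Tracking the used IDs in a set, rather than a per-label counter, avoids a collision when a diagram already contains a node literally named `Server2` alongside two `Server` nodes.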

## Recent Improvements

### Version 0.1.0
- ✅ **Initial PyPI release**
- ✅ **Fixed Mermaid syntax issues**: Automatically handles parentheses in node labels
- ✅ **Improved component detection**: Fixed issue with multiple types being listed
- ✅ **Enhanced error handling**: Better fallback for malformed LLM responses
- ✅ **Automatic syntax correction**: C-style comments and invalid syntax auto-fixed
- ✅ **Better type selection**: Ensures components have single, specific types

## Limitations

- **Maximum 100 pages per processing run**: This limit ensures optimal processing time and prevents memory issues. For larger documents, use the `--start` and `--end` flags to process specific sections.
- **Network Detection Accuracy**: Detection confidence varies based on diagram complexity and clarity. Adjust the `--confidence` threshold as needed.
- **Model Requirements**: Network detection requires vision-capable models (e.g., nanonets-ocr-s, qwen2.5vl, llava)
- **Timeout Behavior**: Operations that exceed the timeout will fall back to simpler processing methods


netintel_ocr-0.1.3-py3-none-any.whl (75.6 kB)

Uploaded Aug 19, 2025 Python 3

File details

Details for the file netintel_ocr-0.1.3-py3-none-any.whl.

File metadata

Download URL: netintel_ocr-0.1.3-py3-none-any.whl
Upload date: Aug 19, 2025
Size: 75.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.11

File hashes

Hashes for netintel_ocr-0.1.3-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | `67c6e14299f91ac010f6b0a01d92bb4a47f7b2c987c46c6f63b0bf30ee9f577a` |
| MD5 | `d71456fee4cd40e3a456e65eae4bbdf6` |
| BLAKE2b-256 | `0359e5d1c7dae42168e3d75bf6b9c0a4610b7854528de903daaac91e2087f9ba` |
