
inputless-ingestion

Document processing and knowledge extraction package for the Inputless Analytics SDK.

Purpose

This package provides comprehensive document processing capabilities including:

  • Multi-format file processing (PDF, DOCX, TXT, images)
  • OCR text extraction from scanned documents and images
  • NLP analysis and entity recognition
  • AI-powered knowledge extraction using LLMs
  • Integration with Neo4j graph database for document relationships

Installation

pip install inputless-ingestion

Installation with Poetry

cd packages/python-core/ingestion
poetry install

Dependencies

Required

  • Document Processing: PyPDF2, pdfplumber, python-docx, python-pptx, openpyxl, pandas
  • OCR: pytesseract, Pillow, opencv-python
  • NLP: spacy, nltk, textblob, scikit-learn
  • AI/LLM: openai, anthropic, langchain
  • Graph Database: neo4j
  • File Handling: chardet, aiofiles

Optional

  • easyocr: For advanced OCR (requires torch, not installed by default)
  • python-magic: For MIME type detection (requires system libmagic library)
  • inputless_graph: For graph database integration (optional dependency)
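
Code that relies on these optional extras should degrade gracefully when they are absent. A minimal sketch of that pattern (the helper name is illustrative, not part of the package API):

```python
def pick_ocr_provider() -> str:
    """Return "easyocr" when the optional dependency is importable,
    falling back to the Tesseract backend otherwise."""
    try:
        import easyocr  # noqa: F401  (optional; pulls in torch)
        return "easyocr"
    except ImportError:
        return "tesseract"

print(pick_ocr_provider())
```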

Usage

Basic Document Processing

from inputless_ingestion import DocumentProcessor

# Initialize processor
processor = DocumentProcessor()

# Process a document
document_data = await processor.process_file("document.pdf")

print(f"Extracted text: {document_data.text[:200]}...")
print(f"File metadata: {document_data.metadata}")

OCR Processing

from inputless_ingestion import OCREngine

# Initialize OCR engine
ocr_engine = OCREngine()

# Process image with OCR (async, consistent with the API Reference below)
result = await ocr_engine.process_image("scanned_document.jpg")
print(f"OCR text: {result.text}")

NLP Analysis

from inputless_ingestion import NLPProcessor

# Initialize NLP processor
nlp_processor = NLPProcessor()

# Extract entities
entities = await nlp_processor.extract_entities(document_text)
print(f"Found entities: {entities}")

# Extract topics
topics = await nlp_processor.extract_topics(document_text)
print(f"Document topics: {topics}")

AI Knowledge Extraction

from inputless_ingestion import AIKnowledgeExtractor

# Initialize AI extractor
ai_extractor = AIKnowledgeExtractor(llm_provider="openai")

# Extract structured knowledge
knowledge = await ai_extractor.extract_knowledge(document_text, "legal_document")
print(f"AI insights: {knowledge}")

Graph Integration

from inputless_ingestion import DocumentGraphIntegration
from inputless_graph import Neo4jRepository, Neo4jConfig

# Initialize graph integration
config = Neo4jConfig(uri="bolt://localhost:7687", user="neo4j", password="password")
neo4j_repo = Neo4jRepository(config)
graph_integration = DocumentGraphIntegration(neo4j_repo)

# Create document node in Neo4j
document_id = graph_integration.create_document_node(document_data)

# Create entity relationships
graph_integration.create_entity_nodes(document_id, entities)

# Find similar documents
similar_docs = graph_integration.find_similar_documents(document_id)

Features

Supported File Formats

  • PDF: Text extraction + OCR for scanned PDFs
  • Microsoft Word: .doc, .docx - Native text extraction
  • Rich Text: .rtf - Formatted text extraction
  • Plain Text: .txt - Direct text processing
  • Markdown: .md - Structured text with metadata
  • Excel: .xls, .xlsx - Cell data + formulas
  • CSV: .csv - Tabular data processing
  • PowerPoint: .ppt, .pptx - Slide content + notes
  • Images: .jpg, .png, .tiff, .bmp - OCR processing
  • HTML: .html, .htm - Web page content
  • XML: .xml - Structured data extraction
  • JSON: .json - Data structure analysis
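
The dispatch from file extension to extraction strategy can be sketched roughly as follows (the grouping and helper are illustrative; the package performs its own dispatch internally):

```python
from pathlib import Path

# Rough grouping of the supported extensions by extraction strategy.
EXTRACTOR_FOR_SUFFIX = {
    ".pdf": "pdf",
    ".doc": "word", ".docx": "word",
    ".rtf": "text", ".txt": "text", ".md": "text",
    ".csv": "text", ".html": "text", ".htm": "text",
    ".xml": "text", ".json": "text",
    ".xls": "spreadsheet", ".xlsx": "spreadsheet",
    ".ppt": "slides", ".pptx": "slides",
    ".jpg": "image", ".png": "image", ".tiff": "image", ".bmp": "image",
}

def classify(path: str) -> str:
    """Map a file path to the extraction strategy for its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTOR_FOR_SUFFIX:
        raise ValueError(f"Unsupported file type: {suffix}")
    return EXTRACTOR_FOR_SUFFIX[suffix]

print(classify("report.PDF"))  # pdf
```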

OCR Capabilities

  • Tesseract Integration: High-quality text recognition
  • Image Preprocessing: Automatic image enhancement for better OCR
  • Multi-language Support: Recognize text in multiple languages
  • Batch Processing: Process multiple images/documents
  • Confidence Scoring: OCR confidence metrics
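
Confidence scoring is typically used to discard unreliable results in a batch. A toy sketch of that filtering step, assuming results arrive as (text, confidence) pairs rather than the package's OCRResult models:

```python
def filter_by_confidence(results, min_confidence=0.7):
    """Split OCR output into kept results and a dropped count.

    `results` is a list of (text, confidence) pairs standing in for
    the package's OCRResult models.
    """
    kept = [(text, conf) for text, conf in results if conf >= min_confidence]
    return kept, len(results) - len(kept)

batch = [("Invoice #123", 0.95), ("?~ill3g1ble~?", 0.32), ("Total: $40", 0.81)]
kept, dropped = filter_by_confidence(batch)
print(kept)     # the two high-confidence results
print(dropped)  # 1
```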

NLP Features

  • Entity Recognition: Extract people, organizations, locations, dates
  • Topic Modeling: LDA-based topic extraction
  • Sentiment Analysis: Document sentiment classification
  • Keyword Extraction: TF-IDF based key term identification
  • Text Classification: Document type classification
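
The TF-IDF idea behind keyword extraction can be illustrated without the package (a toy scorer; the real implementation builds on scikit-learn and handles tokenization and stop words properly):

```python
import math
from collections import Counter

def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank terms in docs[doc_index] by TF-IDF against the corpus."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    tf = Counter(tokenized[doc_index])
    total = len(tokenized[doc_index])
    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

corpus = [
    "the contract defines payment terms",
    "the invoice lists payment due dates",
    "the meeting notes cover hiring plans",
]
print(tfidf_keywords(corpus, 0))
```

Terms that occur in every document ("the") score zero, while terms unique to one document rank highest.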

AI-Powered Analysis

  • LLM Integration: OpenAI GPT-4 and Anthropic Claude support
  • Knowledge Extraction: Structured information extraction
  • Document Summarization: AI-generated summaries
  • Insight Generation: Business-relevant insights
  • Relationship Discovery: Find connections between concepts

Graph Database Integration

  • Document Nodes: Store documents as graph nodes
  • Entity Relationships: Connect entities across documents
  • Similarity Graph: Find similar documents
  • Knowledge Graph: Build comprehensive knowledge bases
  • Graph RAG: LLM-powered graph queries
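
One common way to back a similarity lookup like find_similar_documents is a Cypher query that counts shared entities. A sketch of such a query (the Document/Entity labels and MENTIONS relationship are assumptions about the schema, not confirmed by the package):

```python
def similar_documents_query(limit: int = 10) -> str:
    """Build a Cypher query ranking documents by entities shared
    with a given document ($document_id is a query parameter)."""
    return (
        "MATCH (d:Document {id: $document_id})-[:MENTIONS]->(e:Entity)"
        "<-[:MENTIONS]-(other:Document) "
        "RETURN other.id AS document_id, count(e) AS shared_entities "
        "ORDER BY shared_entities DESC "
        f"LIMIT {int(limit)}"
    )

print(similar_documents_query(5))
```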

Module Structure

src/
├── __init__.py
├── file_processor.py      # Main processing orchestrator
├── ocr_engine.py         # OCR processing
├── nlp_processor.py      # NLP analysis
├── ai_extractor.py       # AI knowledge extraction
├── graph_integration.py  # Neo4j integration
└── extractors/           # Format-specific extractors
    ├── __init__.py
    ├── pdf_extractor.py
    ├── docx_extractor.py
    ├── image_extractor.py
    └── text_extractor.py

Configuration

Environment Variables

# OCR Configuration
TESSERACT_PATH=/usr/local/bin/tesseract
OCR_LANGUAGE=eng

# LLM Configuration
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key

# Neo4j Configuration
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=password
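
These variables can be read with plain os.getenv, falling back to the defaults shown above (the helper name is illustrative):

```python
import os

def load_ocr_config() -> dict:
    """Read the OCR environment variables, with the defaults above."""
    return {
        "tesseract_path": os.getenv("TESSERACT_PATH", "/usr/local/bin/tesseract"),
        "language": os.getenv("OCR_LANGUAGE", "eng"),
    }

print(load_ocr_config())
```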

Configuration File

# config.py
OCR_CONFIG = {
    'language': 'eng',
    'config': '--psm 6',
    'preprocessing': True
}

NLP_CONFIG = {
    'model': 'en_core_web_sm',
    'max_entities': 100,
    'topic_count': 5
}

LLM_CONFIG = {
    'provider': 'openai',
    'model': 'gpt-4',
    'temperature': 0.3,
    'max_tokens': 2000
}

Complete Examples

Full Document Processing Pipeline

import asyncio
from inputless_ingestion import (
    DocumentProcessor,
    NLPProcessor,
    AIKnowledgeExtractor,
    DocumentGraphIntegration
)

async def complete_pipeline(file_path: str):
    """Complete document processing pipeline with all features"""

    # 1. Process document (includes OCR, NLP, AI if enabled)
    processor = DocumentProcessor(
        enable_ocr=True,
        enable_nlp=True,
        enable_ai=True
    )
    document_data = await processor.process_file(
        file_path,
        extract_entities=True,
        extract_topics=True,
        extract_ai_insights=True
    )

    # 2. Additional NLP processing if needed
    nlp_processor = NLPProcessor()
    entities = await nlp_processor.extract_entities(document_data.text)
    topics = await nlp_processor.extract_topics(document_data.text)
    sentiment = await nlp_processor.analyze_sentiment(document_data.text)
    keywords = await nlp_processor.extract_keywords(document_data.text)

    # 3. AI knowledge extraction
    ai_extractor = AIKnowledgeExtractor(llm_provider="openai")
    ai_knowledge = await ai_extractor.extract_knowledge(
        document_data.text,
        document_type=document_data.metadata.file_type
    )
    summary = await ai_extractor.summarize(document_data.text)

    # 4. Store in graph database (optional)
    try:
        # Imported here so a missing inputless_graph raises ImportError
        # inside the try block rather than at module import time
        from inputless_graph import Neo4jRepository, Neo4jConfig

        config = Neo4jConfig(
            uri="bolt://localhost:7687",
            user="neo4j",
            password="password"
        )
        neo4j_repo = Neo4jRepository(config)
        graph_integration = DocumentGraphIntegration(neo4j_repo)
        
        document_id = graph_integration.create_document_node(document_data)
        graph_integration.create_entity_nodes(document_id, entities)
        graph_integration.create_topic_nodes(document_id, topics)
        
        similar_docs = graph_integration.find_similar_documents(document_id)
        
        neo4j_repo.close()
    except ImportError:
        print("Graph integration not available (inputless_graph not installed)")
    
    return {
        'document': document_data,
        'entities': entities,
        'topics': topics,
        'sentiment': sentiment,
        'keywords': keywords,
        'ai_knowledge': ai_knowledge,
        'summary': summary
    }

# Usage
result = asyncio.run(complete_pipeline("contract.pdf"))
print(f"Processed document with {len(result['entities'])} entities")

Processing Different File Types

import asyncio
from inputless_ingestion import DocumentProcessor

async def process_various_formats():
    processor = DocumentProcessor()
    
    # PDF document
    pdf_data = await processor.process_file("document.pdf")
    print(f"PDF: {len(pdf_data.text)} characters")
    
    # Word document
    docx_data = await processor.process_file("document.docx")
    print(f"DOCX: {len(docx_data.text)} characters")
    
    # Text file
    txt_data = await processor.process_file("document.txt")
    print(f"TXT: {len(txt_data.text)} characters")
    
    # Image with OCR
    image_data = await processor.process_file("scanned_document.jpg")
    print(f"Image OCR: {len(image_data.text)} characters")
    print(f"OCR Confidence: {image_data.ocr_confidence}")

asyncio.run(process_various_formats())

Error Handling

import asyncio
from inputless_ingestion import DocumentProcessor
from inputless_ingestion.exceptions import (
    UnsupportedFileTypeError,
    OCRProcessingError,
    NLPProcessingError,
    GraphIntegrationError,
    ProcessingError
)

async def safe_process(file_path: str):
    processor = DocumentProcessor()

    try:
        result = await processor.process_file(file_path)
        return result
    except UnsupportedFileTypeError as e:
        print(f"Unsupported file type: {e}")
    except OCRProcessingError as e:
        print(f"OCR failed: {e}")
    except NLPProcessingError as e:
        print(f"NLP processing failed: {e}")
    except GraphIntegrationError as e:
        print(f"Graph integration failed: {e}")
    except ProcessingError as e:
        print(f"Processing error: {e}")
    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except Exception as e:
        print(f"Unexpected error: {e}")

asyncio.run(safe_process("document.pdf"))

Performance Considerations

  • Async Processing: All I/O operations are asynchronous
  • Batch Operations: Process multiple documents efficiently
  • Caching: Cache processed results to avoid reprocessing
  • Memory Management: Stream large documents to avoid memory issues
  • GPU Acceleration: Use GPU for OCR when available
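
The caching point can be sketched as a content-hash lookup: hash the file bytes and reuse the stored result on a match. Here the "expensive step" is a stand-in (uppercasing text), not the real DocumentProcessor call:

```python
import hashlib
import os
import tempfile

_cache: dict = {}

def process_with_cache(path: str) -> str:
    """Return the cached result when the file's content hash is known."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest in _cache:
        return _cache[digest]  # cache hit: skip reprocessing
    with open(path, encoding="utf-8") as f:
        result = f.read().upper()  # stand-in for the expensive step
    _cache[digest] = result
    return result

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("hello")
    path = tmp.name
print(process_with_cache(path))  # computed: HELLO
print(process_with_cache(path))  # served from cache: HELLO
os.remove(path)
```

Keying on the content hash rather than the path also means a renamed but unchanged file still hits the cache.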

Testing

# Install dependencies
poetry install

# Run all tests
poetry run pytest tests/

# Run with coverage
poetry run pytest --cov=inputless_ingestion tests/

# Run specific test file
poetry run pytest tests/test_file_processor.py

# Run with verbose output
poetry run pytest tests/ -v

Test Coverage

The module includes comprehensive unit tests:

  • test_file_processor.py - Document processing tests
  • test_ocr_engine.py - OCR engine tests
  • test_nlp_processor.py - NLP processing tests
  • test_ai_extractor.py - AI extraction tests
  • test_graph_integration.py - Graph integration tests
  • test_extractors/ - Format-specific extractor tests

Distribution

This package is distributed via PyPI as inputless-ingestion.

# Install from PyPI
pip install inputless-ingestion

# Install development version
pip install git+https://github.com/vesperxs/InputlessSDK.git

API Reference

DocumentProcessor

Main orchestrator for document processing.

processor = DocumentProcessor(
    enable_ocr=True,              # Enable OCR processing
    enable_nlp=True,              # Enable NLP processing
    enable_ai=True,               # Enable AI extraction
    ocr_confidence_threshold=0.7  # OCR confidence threshold
)

# Process single file
document_data = await processor.process_file(
    file_path,
    extract_entities=True,
    extract_topics=True,
    extract_ai_insights=True
)

# Batch process
results = await processor.process_batch(
    file_paths,
    max_concurrent=5
)

OCREngine

OCR text extraction from images.

ocr = OCREngine(
    language="eng",           # Language code
    provider="tesseract",     # "tesseract", "easyocr", or "auto"
    enable_preprocessing=True  # Image preprocessing
)

result = await ocr.process_image(image_path)
result = await ocr.process_with_confidence(image_path, min_confidence=0.8)

NLPProcessor

Natural language processing and analysis.

nlp = NLPProcessor(
    model="en_core_web_sm",  # spaCy model
    max_entities=100,        # Max entities to extract
    num_topics=5             # Number of topics
)

entities = await nlp.extract_entities(text)
topics = await nlp.extract_topics(text, num_topics=5)
sentiment = await nlp.analyze_sentiment(text)
keywords = await nlp.extract_keywords(text, num_keywords=10)
classification = await nlp.classify_text(text)

AIKnowledgeExtractor

AI-powered knowledge extraction using LLMs.

ai = AIKnowledgeExtractor(
    llm_provider="openai",  # "openai" or "anthropic"
    model="gpt-4",           # Model name
    temperature=0.3          # Temperature for generation
)

knowledge = await ai.extract_knowledge(text, document_type="legal")
summary = await ai.summarize(text, max_length=200)
insights = await ai.extract_insights(text)

DocumentGraphIntegration

Neo4j graph database integration (optional).

from inputless_graph import Neo4jRepository, Neo4jConfig

config = Neo4jConfig(uri="bolt://localhost:7687", user="neo4j", password="pass")
repo = Neo4jRepository(config)
graph = DocumentGraphIntegration(repo)

document_id = graph.create_document_node(document_data)
entity_ids = graph.create_entity_nodes(document_id, entities)
topic_ids = graph.create_topic_nodes(document_id, topics)
similar = graph.find_similar_documents(document_id, limit=10)

Data Models

DocumentData

class DocumentData(BaseModel):
    text: str
    metadata: DocumentMetadata
    entities: Optional[List[Entity]] = None
    topics: Optional[List[Topic]] = None
    ai_insights: Optional[Dict[str, Any]] = None
    ocr_confidence: Optional[float] = None

DocumentMetadata

class DocumentMetadata(BaseModel):
    file_path: str
    file_type: str
    file_size: int
    mime_type: str
    encoding: Optional[str] = None
    title: Optional[str] = None
    author: Optional[str] = None
    created_date: Optional[str] = None
    page_count: Optional[int] = None

OCRResult

class OCRResult(BaseModel):
    text: str
    confidence: float
    language: str
    provider: str

Entity

class Entity(BaseModel):
    text: str
    label: str
    start: int
    end: int
    confidence: float

Topic

class Topic(BaseModel):
    topic_id: int
    keywords: List[str]
    weight: float

Version: 1.0.0
Status: ✅ Implementation Complete
Last Updated: January 2024
