
inputless-ingestion

Document processing and knowledge extraction package for the Inputless Analytics SDK.

Purpose

This package provides comprehensive document processing capabilities including:

  • Multi-format file processing (PDF, DOCX, TXT, images)
  • OCR text extraction from scanned documents and images
  • NLP analysis and entity recognition
  • AI-powered knowledge extraction using LLMs
  • Integration with Neo4j graph database for document relationships

Installation

pip install inputless-ingestion

Installation with Poetry

cd packages/python-core/ingestion
poetry install

Dependencies

Required

  • Document Processing: PyPDF2, pdfplumber, python-docx, python-pptx, openpyxl, pandas
  • OCR: pytesseract, Pillow, opencv-python
  • NLP: spacy, nltk, textblob, scikit-learn
  • AI/LLM: openai, anthropic, langchain
  • Graph Database: neo4j
  • File Handling: chardet, aiofiles

Optional

  • easyocr: For advanced OCR (requires torch, not installed by default)
  • python-magic: For MIME type detection (requires system libmagic library)
  • inputless_graph: For graph database integration (optional dependency)
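
Code that relies on these optional extras should degrade gracefully when they are absent. A minimal sketch of that pattern (the helper name is illustrative, not part of the package API):

```python
def pick_ocr_provider() -> str:
    """Return "easyocr" when the optional dependency is importable,
    falling back to the Tesseract backend otherwise."""
    try:
        import easyocr  # noqa: F401  (optional; pulls in torch)
        return "easyocr"
    except ImportError:
        return "tesseract"

print(pick_ocr_provider())
```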

Usage

Basic Document Processing

from inputless_ingestion import DocumentProcessor

# Initialize processor
processor = DocumentProcessor()

# Process a document
document_data = await processor.process_file("document.pdf")

print(f"Extracted text: {document_data.text[:200]}...")
print(f"File metadata: {document_data.metadata}")

OCR Processing

from inputless_ingestion import OCREngine

# Initialize OCR engine
ocr_engine = OCREngine()

# Process image with OCR (async, consistent with the API Reference below)
result = await ocr_engine.process_image("scanned_document.jpg")
print(f"OCR text: {result.text}")

NLP Analysis

from inputless_ingestion import NLPProcessor

# Initialize NLP processor
nlp_processor = NLPProcessor()

# Extract entities
entities = await nlp_processor.extract_entities(document_text)
print(f"Found entities: {entities}")

# Extract topics
topics = await nlp_processor.extract_topics(document_text)
print(f"Document topics: {topics}")

AI Knowledge Extraction

from inputless_ingestion import AIKnowledgeExtractor

# Initialize AI extractor
ai_extractor = AIKnowledgeExtractor(llm_provider="openai")

# Extract structured knowledge
knowledge = await ai_extractor.extract_knowledge(document_text, "legal_document")
print(f"AI insights: {knowledge}")

Graph Integration

from inputless_ingestion import DocumentGraphIntegration
from inputless_graph import Neo4jRepository, Neo4jConfig

# Initialize graph integration
config = Neo4jConfig(uri="bolt://localhost:7687", user="neo4j", password="password")
neo4j_repo = Neo4jRepository(config)
graph_integration = DocumentGraphIntegration(neo4j_repo)

# Create document node in Neo4j
document_id = graph_integration.create_document_node(document_data)

# Create entity relationships
graph_integration.create_entity_nodes(document_id, entities)

# Find similar documents
similar_docs = graph_integration.find_similar_documents(document_id)

Features

Supported File Formats

  • PDF: Text extraction + OCR for scanned PDFs
  • Microsoft Word: .doc, .docx - Native text extraction
  • Rich Text: .rtf - Formatted text extraction
  • Plain Text: .txt - Direct text processing
  • Markdown: .md - Structured text with metadata
  • Excel: .xls, .xlsx - Cell data + formulas
  • CSV: .csv - Tabular data processing
  • PowerPoint: .ppt, .pptx - Slide content + notes
  • Images: .jpg, .png, .tiff, .bmp - OCR processing
  • HTML: .html, .htm - Web page content
  • XML: .xml - Structured data extraction
  • JSON: .json - Data structure analysis
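
The dispatch from file extension to extraction strategy can be sketched roughly as follows (the grouping and helper are illustrative; the package performs its own dispatch internally):

```python
from pathlib import Path

# Rough grouping of the supported extensions by extraction strategy.
EXTRACTOR_FOR_SUFFIX = {
    ".pdf": "pdf",
    ".doc": "word", ".docx": "word",
    ".rtf": "text", ".txt": "text", ".md": "text",
    ".csv": "text", ".html": "text", ".htm": "text",
    ".xml": "text", ".json": "text",
    ".xls": "spreadsheet", ".xlsx": "spreadsheet",
    ".ppt": "slides", ".pptx": "slides",
    ".jpg": "image", ".png": "image", ".tiff": "image", ".bmp": "image",
}

def classify(path: str) -> str:
    """Map a file path to the extraction strategy for its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTOR_FOR_SUFFIX:
        raise ValueError(f"Unsupported file type: {suffix}")
    return EXTRACTOR_FOR_SUFFIX[suffix]

print(classify("report.PDF"))  # pdf
```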

OCR Capabilities

  • Tesseract Integration: High-quality text recognition
  • Image Preprocessing: Automatic image enhancement for better OCR
  • Multi-language Support: Recognize text in multiple languages
  • Batch Processing: Process multiple images/documents
  • Confidence Scoring: OCR confidence metrics
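
Confidence scoring is typically used to discard unreliable results in a batch. A toy sketch of that filtering step, assuming results arrive as (text, confidence) pairs rather than the package's OCRResult models:

```python
def filter_by_confidence(results, min_confidence=0.7):
    """Split OCR output into kept results and a dropped count.

    `results` is a list of (text, confidence) pairs standing in for
    the package's OCRResult models.
    """
    kept = [(text, conf) for text, conf in results if conf >= min_confidence]
    return kept, len(results) - len(kept)

batch = [("Invoice #123", 0.95), ("?~ill3g1ble~?", 0.32), ("Total: $40", 0.81)]
kept, dropped = filter_by_confidence(batch)
print(kept)     # the two high-confidence results
print(dropped)  # 1
```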

NLP Features

  • Entity Recognition: Extract people, organizations, locations, dates
  • Topic Modeling: LDA-based topic extraction
  • Sentiment Analysis: Document sentiment classification
  • Keyword Extraction: TF-IDF based key term identification
  • Text Classification: Document type classification
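
The TF-IDF idea behind keyword extraction can be illustrated without the package (a toy scorer; the real implementation builds on scikit-learn and handles tokenization and stop words properly):

```python
import math
from collections import Counter

def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank terms in docs[doc_index] by TF-IDF against the corpus."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    tf = Counter(tokenized[doc_index])
    total = len(tokenized[doc_index])
    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

corpus = [
    "the contract defines payment terms",
    "the invoice lists payment due dates",
    "the meeting notes cover hiring plans",
]
print(tfidf_keywords(corpus, 0))
```

Terms that occur in every document ("the") score zero, while terms unique to one document rank highest.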

AI-Powered Analysis

  • LLM Integration: OpenAI GPT-4 and Anthropic Claude support
  • Knowledge Extraction: Structured information extraction
  • Document Summarization: AI-generated summaries
  • Insight Generation: Business-relevant insights
  • Relationship Discovery: Find connections between concepts

Graph Database Integration

  • Document Nodes: Store documents as graph nodes
  • Entity Relationships: Connect entities across documents
  • Similarity Graph: Find similar documents
  • Knowledge Graph: Build comprehensive knowledge bases
  • Graph RAG: LLM-powered graph queries
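
One common way to back a similarity lookup like find_similar_documents is a Cypher query that counts shared entities. A sketch of such a query (the Document/Entity labels and MENTIONS relationship are assumptions about the schema, not confirmed by the package):

```python
def similar_documents_query(limit: int = 10) -> str:
    """Build a Cypher query ranking documents by entities shared
    with a given document ($document_id is a query parameter)."""
    return (
        "MATCH (d:Document {id: $document_id})-[:MENTIONS]->(e:Entity)"
        "<-[:MENTIONS]-(other:Document) "
        "RETURN other.id AS document_id, count(e) AS shared_entities "
        "ORDER BY shared_entities DESC "
        f"LIMIT {int(limit)}"
    )

print(similar_documents_query(5))
```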

Module Structure

src/
├── __init__.py
├── file_processor.py      # Main processing orchestrator
├── ocr_engine.py         # OCR processing
├── nlp_processor.py      # NLP analysis
├── ai_extractor.py       # AI knowledge extraction
├── graph_integration.py  # Neo4j integration
└── extractors/           # Format-specific extractors
    ├── __init__.py
    ├── pdf_extractor.py
    ├── docx_extractor.py
    ├── image_extractor.py
    └── text_extractor.py

Configuration

Environment Variables

# OCR Configuration
TESSERACT_PATH=/usr/local/bin/tesseract
OCR_LANGUAGE=eng

# LLM Configuration
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key

# Neo4j Configuration
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=password
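
These variables can be read with plain os.getenv, falling back to the defaults shown above (the helper name is illustrative):

```python
import os

def load_ocr_config() -> dict:
    """Read the OCR environment variables, with the defaults above."""
    return {
        "tesseract_path": os.getenv("TESSERACT_PATH", "/usr/local/bin/tesseract"),
        "language": os.getenv("OCR_LANGUAGE", "eng"),
    }

print(load_ocr_config())
```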

Configuration File

# config.py
OCR_CONFIG = {
    'language': 'eng',
    'config': '--psm 6',
    'preprocessing': True
}

NLP_CONFIG = {
    'model': 'en_core_web_sm',
    'max_entities': 100,
    'topic_count': 5
}

LLM_CONFIG = {
    'provider': 'openai',
    'model': 'gpt-4',
    'temperature': 0.3,
    'max_tokens': 2000
}

Complete Examples

Full Document Processing Pipeline

import asyncio
from inputless_ingestion import (
    DocumentProcessor,
    NLPProcessor,
    AIKnowledgeExtractor,
    DocumentGraphIntegration
)

async def complete_pipeline(file_path: str):
    """Complete document processing pipeline with all features"""

    # 1. Process document (includes OCR, NLP, AI if enabled)
    processor = DocumentProcessor(
        enable_ocr=True,
        enable_nlp=True,
        enable_ai=True
    )
    document_data = await processor.process_file(
        file_path,
        extract_entities=True,
        extract_topics=True,
        extract_ai_insights=True
    )

    # 2. Additional NLP processing if needed
    nlp_processor = NLPProcessor()
    entities = await nlp_processor.extract_entities(document_data.text)
    topics = await nlp_processor.extract_topics(document_data.text)
    sentiment = await nlp_processor.analyze_sentiment(document_data.text)
    keywords = await nlp_processor.extract_keywords(document_data.text)

    # 3. AI knowledge extraction
    ai_extractor = AIKnowledgeExtractor(llm_provider="openai")
    ai_knowledge = await ai_extractor.extract_knowledge(
        document_data.text,
        document_type=document_data.metadata.file_type
    )
    summary = await ai_extractor.summarize(document_data.text)

    # 4. Store in graph database (optional)
    try:
        # Imported here so a missing inputless_graph raises ImportError
        # inside the try block rather than at module import time
        from inputless_graph import Neo4jRepository, Neo4jConfig

        config = Neo4jConfig(
            uri="bolt://localhost:7687",
            user="neo4j",
            password="password"
        )
        neo4j_repo = Neo4jRepository(config)
        graph_integration = DocumentGraphIntegration(neo4j_repo)
        
        document_id = graph_integration.create_document_node(document_data)
        graph_integration.create_entity_nodes(document_id, entities)
        graph_integration.create_topic_nodes(document_id, topics)
        
        similar_docs = graph_integration.find_similar_documents(document_id)
        
        neo4j_repo.close()
    except ImportError:
        print("Graph integration not available (inputless_graph not installed)")
    
    return {
        'document': document_data,
        'entities': entities,
        'topics': topics,
        'sentiment': sentiment,
        'keywords': keywords,
        'ai_knowledge': ai_knowledge,
        'summary': summary
    }

# Usage
result = asyncio.run(complete_pipeline("contract.pdf"))
print(f"Processed document with {len(result['entities'])} entities")

Processing Different File Types

import asyncio
from inputless_ingestion import DocumentProcessor

async def process_various_formats():
    processor = DocumentProcessor()
    
    # PDF document
    pdf_data = await processor.process_file("document.pdf")
    print(f"PDF: {len(pdf_data.text)} characters")
    
    # Word document
    docx_data = await processor.process_file("document.docx")
    print(f"DOCX: {len(docx_data.text)} characters")
    
    # Text file
    txt_data = await processor.process_file("document.txt")
    print(f"TXT: {len(txt_data.text)} characters")
    
    # Image with OCR
    image_data = await processor.process_file("scanned_document.jpg")
    print(f"Image OCR: {len(image_data.text)} characters")
    print(f"OCR Confidence: {image_data.ocr_confidence}")

asyncio.run(process_various_formats())

Error Handling

import asyncio
from inputless_ingestion import DocumentProcessor
from inputless_ingestion.exceptions import (
    UnsupportedFileTypeError,
    OCRProcessingError,
    NLPProcessingError,
    GraphIntegrationError,
    ProcessingError
)

async def safe_process(file_path: str):
    processor = DocumentProcessor()

    try:
        result = await processor.process_file(file_path)
        return result
    except UnsupportedFileTypeError as e:
        print(f"Unsupported file type: {e}")
    except OCRProcessingError as e:
        print(f"OCR failed: {e}")
    except NLPProcessingError as e:
        print(f"NLP processing failed: {e}")
    except GraphIntegrationError as e:
        print(f"Graph integration failed: {e}")
    except ProcessingError as e:
        print(f"Processing error: {e}")
    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except Exception as e:
        print(f"Unexpected error: {e}")

asyncio.run(safe_process("document.pdf"))

Performance Considerations

  • Async Processing: All I/O operations are asynchronous
  • Batch Operations: Process multiple documents efficiently
  • Caching: Cache processed results to avoid reprocessing
  • Memory Management: Stream large documents to avoid memory issues
  • GPU Acceleration: Use GPU for OCR when available
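
The caching point can be sketched as a content-hash lookup: hash the file bytes and reuse the stored result on a match. Here the "expensive step" is a stand-in (uppercasing text), not the real DocumentProcessor call:

```python
import hashlib
import os
import tempfile

_cache: dict = {}

def process_with_cache(path: str) -> str:
    """Return the cached result when the file's content hash is known."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest in _cache:
        return _cache[digest]  # cache hit: skip reprocessing
    with open(path, encoding="utf-8") as f:
        result = f.read().upper()  # stand-in for the expensive step
    _cache[digest] = result
    return result

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("hello")
    path = tmp.name
print(process_with_cache(path))  # computed: HELLO
print(process_with_cache(path))  # served from cache: HELLO
os.remove(path)
```

Keying on the content hash rather than the path also means a renamed but unchanged file still hits the cache.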

Testing

# Install dependencies
poetry install

# Run all tests
poetry run pytest tests/

# Run with coverage
poetry run pytest --cov=inputless_ingestion tests/

# Run specific test file
poetry run pytest tests/test_file_processor.py

# Run with verbose output
poetry run pytest tests/ -v

Test Coverage

The module includes comprehensive unit tests:

  • test_file_processor.py - Document processing tests
  • test_ocr_engine.py - OCR engine tests
  • test_nlp_processor.py - NLP processing tests
  • test_ai_extractor.py - AI extraction tests
  • test_graph_integration.py - Graph integration tests
  • test_extractors/ - Format-specific extractor tests

Distribution

This package is distributed via PyPI as inputless-ingestion.

# Install from PyPI
pip install inputless-ingestion

# Install development version
pip install git+https://github.com/vesperxs/InputlessSDK.git

API Reference

DocumentProcessor

Main orchestrator for document processing.

processor = DocumentProcessor(
    enable_ocr=True,              # Enable OCR processing
    enable_nlp=True,              # Enable NLP processing
    enable_ai=True,               # Enable AI extraction
    ocr_confidence_threshold=0.7  # OCR confidence threshold
)

# Process single file
document_data = await processor.process_file(
    file_path,
    extract_entities=True,
    extract_topics=True,
    extract_ai_insights=True
)

# Batch process
results = await processor.process_batch(
    file_paths,
    max_concurrent=5
)

OCREngine

OCR text extraction from images.

ocr = OCREngine(
    language="eng",           # Language code
    provider="tesseract",     # "tesseract", "easyocr", or "auto"
    enable_preprocessing=True  # Image preprocessing
)

result = await ocr.process_image(image_path)
result = await ocr.process_with_confidence(image_path, min_confidence=0.8)

NLPProcessor

Natural language processing and analysis.

nlp = NLPProcessor(
    model="en_core_web_sm",  # spaCy model
    max_entities=100,        # Max entities to extract
    num_topics=5             # Number of topics
)

entities = await nlp.extract_entities(text)
topics = await nlp.extract_topics(text, num_topics=5)
sentiment = await nlp.analyze_sentiment(text)
keywords = await nlp.extract_keywords(text, num_keywords=10)
classification = await nlp.classify_text(text)

AIKnowledgeExtractor

AI-powered knowledge extraction using LLMs.

ai = AIKnowledgeExtractor(
    llm_provider="openai",  # "openai" or "anthropic"
    model="gpt-4",           # Model name
    temperature=0.3          # Temperature for generation
)

knowledge = await ai.extract_knowledge(text, document_type="legal")
summary = await ai.summarize(text, max_length=200)
insights = await ai.extract_insights(text)

DocumentGraphIntegration

Neo4j graph database integration (optional).

from inputless_graph import Neo4jRepository, Neo4jConfig

config = Neo4jConfig(uri="bolt://localhost:7687", user="neo4j", password="pass")
repo = Neo4jRepository(config)
graph = DocumentGraphIntegration(repo)

document_id = graph.create_document_node(document_data)
entity_ids = graph.create_entity_nodes(document_id, entities)
topic_ids = graph.create_topic_nodes(document_id, topics)
similar = graph.find_similar_documents(document_id, limit=10)

Data Models

DocumentData

class DocumentData(BaseModel):
    text: str
    metadata: DocumentMetadata
    entities: Optional[List[Entity]] = None
    topics: Optional[List[Topic]] = None
    ai_insights: Optional[Dict[str, Any]] = None
    ocr_confidence: Optional[float] = None

DocumentMetadata

class DocumentMetadata(BaseModel):
    file_path: str
    file_type: str
    file_size: int
    mime_type: str
    encoding: Optional[str] = None
    title: Optional[str] = None
    author: Optional[str] = None
    created_date: Optional[str] = None
    page_count: Optional[int] = None

OCRResult

class OCRResult(BaseModel):
    text: str
    confidence: float
    language: str
    provider: str

Entity

class Entity(BaseModel):
    text: str
    label: str
    start: int
    end: int
    confidence: float

Topic

class Topic(BaseModel):
    topic_id: int
    keywords: List[str]
    weight: float

Version: 1.0.0
Status: ✅ Implementation Complete
Last Updated: January 2024
