inputless-ingestion
Document processing and knowledge extraction package for the Inputless Analytics SDK.
Purpose
This package provides comprehensive document processing capabilities including:
- Multi-format file processing (PDF, DOCX, TXT, images)
- OCR text extraction from scanned documents and images
- NLP analysis and entity recognition
- AI-powered knowledge extraction using LLMs
- Integration with Neo4j graph database for document relationships
Installation
pip install inputless-ingestion
Installation with Poetry
cd packages/python-core/ingestion
poetry install
Dependencies
Required
- Document Processing: PyPDF2, pdfplumber, python-docx, python-pptx, openpyxl, pandas
- OCR: pytesseract, Pillow, opencv-python
- NLP: spacy, nltk, textblob, scikit-learn
- AI/LLM: openai, anthropic, langchain
- Graph Database: neo4j
- File Handling: chardet, aiofiles
Optional
- easyocr: For advanced OCR (requires torch, not installed by default)
- python-magic: For MIME type detection (requires system libmagic library)
- inputless_graph: For Neo4j graph database integration
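These optional packages are installed separately; for example (whether the wheel also exposes pip extras groups for them is not stated here):

# easyocr pulls in torch, which is a large download
pip install easyocr

# python-magic needs the system libmagic library first,
# e.g. apt-get install libmagic1 (Debian/Ubuntu) or brew install libmagic (macOS)
pip install python-magic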
Usage
Basic Document Processing
from inputless_ingestion import DocumentProcessor

# Initialize processor
processor = DocumentProcessor()

# Process a document (inside an async function)
document_data = await processor.process_file("document.pdf")
print(f"Extracted text: {document_data.text[:200]}...")
print(f"File metadata: {document_data.metadata}")
OCR Processing
from inputless_ingestion import OCREngine
# Initialize OCR engine
ocr_engine = OCREngine()
# Process an image with OCR (inside an async function)
text = await ocr_engine.process_image("scanned_document.jpg")
print(f"OCR text: {text}")
NLP Analysis
from inputless_ingestion import NLPProcessor
# Initialize NLP processor
nlp_processor = NLPProcessor()
# Extract entities (inside an async function)
entities = await nlp_processor.extract_entities(document_text)
print(f"Found entities: {entities}")

# Extract topics
topics = await nlp_processor.extract_topics(document_text)
print(f"Document topics: {topics}")
AI Knowledge Extraction
from inputless_ingestion import AIKnowledgeExtractor
# Initialize AI extractor
ai_extractor = AIKnowledgeExtractor(llm_provider="openai")
# Extract structured knowledge (inside an async function)
knowledge = await ai_extractor.extract_knowledge(document_text, "legal_document")
print(f"AI insights: {knowledge}")
Graph Integration
from inputless_ingestion import DocumentGraphIntegration
from inputless_graph import Neo4jRepository, Neo4jConfig

# Initialize graph integration
config = Neo4jConfig(uri="bolt://localhost:7687", user="neo4j", password="password")
neo4j_repo = Neo4jRepository(config)
graph_integration = DocumentGraphIntegration(neo4j_repo)
# Create document node in Neo4j
document_id = graph_integration.create_document_node(document_data)
# Create entity relationships
graph_integration.create_entity_nodes(document_id, entities)
# Find similar documents
similar_docs = graph_integration.find_similar_documents(document_id)
Features
Supported File Formats
- PDF: Text extraction + OCR for scanned PDFs
- Microsoft Word: .doc, .docx - Native text extraction
- Rich Text: .rtf - Formatted text extraction
- Plain Text: .txt - Direct text processing
- Markdown: .md - Structured text with metadata
- Excel: .xls, .xlsx - Cell data + formulas
- CSV: .csv - Tabular data processing
- PowerPoint: .ppt, .pptx - Slide content + notes
- Images: .jpg, .png, .tiff, .bmp - OCR processing
- HTML: .html, .htm - Web page content
- XML: .xml - Structured data extraction
- JSON: .json - Data structure analysis
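To route a mixed folder through one processor, the process_batch API from the API Reference below can be combined with an extension filter; the filter set here is an illustrative subset of the formats above:

import asyncio
from pathlib import Path
from inputless_ingestion import DocumentProcessor

# Illustrative subset of the supported extensions listed above
SUPPORTED = {".pdf", ".docx", ".txt", ".md", ".xlsx", ".csv", ".pptx", ".jpg", ".png"}

async def ingest_folder(folder: str):
    paths = [str(p) for p in Path(folder).iterdir() if p.suffix.lower() in SUPPORTED]
    processor = DocumentProcessor()
    # process_batch and max_concurrent are documented in the API Reference
    return await processor.process_batch(paths, max_concurrent=5)

results = asyncio.run(ingest_folder("./inbox"))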
OCR Capabilities
- Tesseract Integration: High-quality text recognition
- Image Preprocessing: Automatic image enhancement for better OCR
- Multi-language Support: Text recognition in multiple languages
- Batch Processing: Process multiple images or documents (see the sketch below)
- Confidence Scoring: OCR confidence metrics
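A sketch exercising the batch and confidence features above via the OCREngine calls from the API Reference; the asyncio.gather fan-out is an assumption, not a documented batch helper:

import asyncio
from inputless_ingestion import OCREngine

async def ocr_pages(paths, min_confidence=0.8):
    ocr = OCREngine(language="eng", enable_preprocessing=True)
    # process_with_confidence is documented in the API Reference below
    results = await asyncio.gather(
        *(ocr.process_with_confidence(p, min_confidence=min_confidence) for p in paths)
    )
    return dict(zip(paths, results))

pages = asyncio.run(ocr_pages(["page1.png", "page2.png"]))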
NLP Features
- Entity Recognition: Extract people, organizations, locations, dates
- Topic Modeling: LDA-based topic extraction
- Sentiment Analysis: Document sentiment classification
- Keyword Extraction: TF-IDF based key term identification
- Text Classification: Document type classification
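All of the analyses above map onto NLPProcessor methods shown in the API Reference; a single pass might look like this:

import asyncio
from inputless_ingestion import NLPProcessor

async def analyze(text: str):
    nlp = NLPProcessor(model="en_core_web_sm")
    return {
        "entities": await nlp.extract_entities(text),
        "topics": await nlp.extract_topics(text, num_topics=5),
        "sentiment": await nlp.analyze_sentiment(text),
        "keywords": await nlp.extract_keywords(text, num_keywords=10),
        "classification": await nlp.classify_text(text),
    }

report = asyncio.run(analyze("Acme Corp. signed the lease in Berlin on 2 May 2023."))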
AI-Powered Analysis
- LLM Integration: OpenAI GPT-4 and Anthropic Claude support
- Knowledge Extraction: Structured information extraction
- Document Summarization: AI-generated summaries
- Insight Generation: Business-relevant insights
- Relationship Discovery: Find connections between concepts
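For example, structured knowledge, a summary, and insights can be pulled in one pass with the AIKnowledgeExtractor methods from the API Reference (provider and model values as documented there):

import asyncio
from inputless_ingestion import AIKnowledgeExtractor

async def ai_report(text: str):
    ai = AIKnowledgeExtractor(llm_provider="openai", model="gpt-4", temperature=0.3)
    return {
        "knowledge": await ai.extract_knowledge(text, document_type="legal"),
        "summary": await ai.summarize(text, max_length=200),
        "insights": await ai.extract_insights(text),
    }

report = asyncio.run(ai_report("Full text of the document to analyze..."))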
Graph Database Integration
- Document Nodes: Store documents as graph nodes
- Entity Relationships: Connect entities across documents
- Similarity Graph: Find similar documents
- Knowledge Graph: Build comprehensive knowledge bases
- Graph RAG: LLM-powered graph queries
Module Structure
src/
├── __init__.py
├── file_processor.py       # Main processing orchestrator
├── ocr_engine.py           # OCR processing
├── nlp_processor.py        # NLP analysis
├── ai_extractor.py         # AI knowledge extraction
├── graph_integration.py    # Neo4j integration
└── extractors/             # Format-specific extractors
    ├── __init__.py
    ├── pdf_extractor.py
    ├── docx_extractor.py
    ├── image_extractor.py
    └── text_extractor.py
Configuration
Environment Variables
# OCR Configuration
TESSERACT_PATH=/usr/local/bin/tesseract
OCR_LANGUAGE=eng
# LLM Configuration
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
# Neo4j Configuration
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=password
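The Neo4j variables map directly onto Neo4jConfig, and the API keys are conventionally read from the environment by the OpenAI and Anthropic client libraries; whether this package also picks the Neo4j variables up automatically is not stated, so explicit wiring is the safe assumption:

import os
from inputless_graph import Neo4jRepository, Neo4jConfig

config = Neo4jConfig(
    uri=os.environ.get("NEO4J_URI", "bolt://localhost:7687"),
    user=os.environ.get("NEO4J_USER", "neo4j"),
    password=os.environ["NEO4J_PASSWORD"],
)
repo = Neo4jRepository(config)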
Configuration File
# config.py
OCR_CONFIG = {
    'language': 'eng',
    'config': '--psm 6',
    'preprocessing': True
}

NLP_CONFIG = {
    'model': 'en_core_web_sm',
    'max_entities': 100,
    'topic_count': 5
}

LLM_CONFIG = {
    'provider': 'openai',
    'model': 'gpt-4',
    'temperature': 0.3,
    'max_tokens': 2000
}
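How these dicts are consumed is not documented here; one way to wire them into the constructors from the API Reference, translating the keys whose names differ from the constructor arguments (this mapping is an assumption):

from inputless_ingestion import OCREngine, NLPProcessor, AIKnowledgeExtractor

ocr = OCREngine(
    language=OCR_CONFIG["language"],
    enable_preprocessing=OCR_CONFIG["preprocessing"],
)
nlp = NLPProcessor(
    model=NLP_CONFIG["model"],
    max_entities=NLP_CONFIG["max_entities"],
    num_topics=NLP_CONFIG["topic_count"],
)
ai = AIKnowledgeExtractor(
    llm_provider=LLM_CONFIG["provider"],
    model=LLM_CONFIG["model"],
    temperature=LLM_CONFIG["temperature"],
)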
Complete Examples
Full Document Processing Pipeline
import asyncio
from inputless_ingestion import (
    DocumentProcessor,
    NLPProcessor,
    AIKnowledgeExtractor,
    DocumentGraphIntegration
)

async def complete_pipeline(file_path: str):
    """Complete document processing pipeline with all features"""
    # 1. Process document (includes OCR, NLP, AI if enabled)
    processor = DocumentProcessor(
        enable_ocr=True,
        enable_nlp=True,
        enable_ai=True
    )
    document_data = await processor.process_file(
        file_path,
        extract_entities=True,
        extract_topics=True,
        extract_ai_insights=True
    )

    # 2. Additional NLP processing if needed
    nlp_processor = NLPProcessor()
    entities = await nlp_processor.extract_entities(document_data.text)
    topics = await nlp_processor.extract_topics(document_data.text)
    sentiment = await nlp_processor.analyze_sentiment(document_data.text)
    keywords = await nlp_processor.extract_keywords(document_data.text)

    # 3. AI knowledge extraction
    ai_extractor = AIKnowledgeExtractor(llm_provider="openai")
    ai_knowledge = await ai_extractor.extract_knowledge(
        document_data.text,
        document_type=document_data.metadata.file_type
    )
    summary = await ai_extractor.summarize(document_data.text)

    # 4. Store in graph database (optional; imported lazily so the
    #    pipeline still runs when inputless_graph is not installed)
    try:
        from inputless_graph import Neo4jRepository, Neo4jConfig

        config = Neo4jConfig(
            uri="bolt://localhost:7687",
            user="neo4j",
            password="password"
        )
        neo4j_repo = Neo4jRepository(config)
        graph_integration = DocumentGraphIntegration(neo4j_repo)
        document_id = graph_integration.create_document_node(document_data)
        graph_integration.create_entity_nodes(document_id, entities)
        graph_integration.create_topic_nodes(document_id, topics)
        similar_docs = graph_integration.find_similar_documents(document_id)
        neo4j_repo.close()
    except ImportError:
        print("Graph integration not available (inputless_graph not installed)")

    return {
        'document': document_data,
        'entities': entities,
        'topics': topics,
        'sentiment': sentiment,
        'keywords': keywords,
        'ai_knowledge': ai_knowledge,
        'summary': summary
    }

# Usage
result = asyncio.run(complete_pipeline("contract.pdf"))
print(f"Processed document with {len(result['entities'])} entities")
Processing Different File Types
import asyncio
from inputless_ingestion import DocumentProcessor
async def process_various_formats():
    processor = DocumentProcessor()

    # PDF document
    pdf_data = await processor.process_file("document.pdf")
    print(f"PDF: {len(pdf_data.text)} characters")

    # Word document
    docx_data = await processor.process_file("document.docx")
    print(f"DOCX: {len(docx_data.text)} characters")

    # Text file
    txt_data = await processor.process_file("document.txt")
    print(f"TXT: {len(txt_data.text)} characters")

    # Image with OCR
    image_data = await processor.process_file("scanned_document.jpg")
    print(f"Image OCR: {len(image_data.text)} characters")
    print(f"OCR Confidence: {image_data.ocr_confidence}")

asyncio.run(process_various_formats())
Error Handling
import asyncio
from inputless_ingestion import DocumentProcessor
from inputless_ingestion.exceptions import (
    UnsupportedFileTypeError,
    OCRProcessingError,
    NLPProcessingError,
    GraphIntegrationError,  # raised by DocumentGraphIntegration operations
    ProcessingError
)

async def safe_process(file_path: str):
    processor = DocumentProcessor()
    try:
        result = await processor.process_file(file_path)
        return result
    except UnsupportedFileTypeError as e:
        print(f"Unsupported file type: {e}")
    except OCRProcessingError as e:
        print(f"OCR failed: {e}")
    except NLPProcessingError as e:
        print(f"NLP processing failed: {e}")
    except ProcessingError as e:
        print(f"Processing error: {e}")
    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except Exception as e:
        print(f"Unexpected error: {e}")

asyncio.run(safe_process("document.pdf"))
Performance Considerations
- Async Processing: All I/O operations are asynchronous
- Batch Operations: Process multiple documents efficiently
- Caching: Cache processed results to avoid reprocessing (see the sketch below)
- Memory Management: Stream large documents to avoid memory issues
- GPU Acceleration: Use GPU for OCR when available
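The package's own caching behavior is not documented here, so a minimal external cache keyed by file content hash is one illustrative approach:

import asyncio
import hashlib
import pickle
from pathlib import Path
from inputless_ingestion import DocumentProcessor

CACHE_DIR = Path(".ingest_cache")
CACHE_DIR.mkdir(exist_ok=True)

async def process_cached(processor: DocumentProcessor, file_path: str):
    # Hash the file contents so edited files miss the cache and get reprocessed
    digest = hashlib.sha256(Path(file_path).read_bytes()).hexdigest()
    cache_file = CACHE_DIR / f"{digest}.pkl"
    if cache_file.exists():
        return pickle.loads(cache_file.read_bytes())
    result = await processor.process_file(file_path)
    cache_file.write_bytes(pickle.dumps(result))
    return result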
Testing
# Install dependencies
poetry install
# Run all tests
poetry run pytest tests/
# Run with coverage
poetry run pytest --cov=inputless_ingestion tests/
# Run specific test file
poetry run pytest tests/test_file_processor.py
# Run with verbose output
poetry run pytest tests/ -v
Test Coverage
The module includes comprehensive unit tests:
- test_file_processor.py - Document processing tests
- test_ocr_engine.py - OCR engine tests
- test_nlp_processor.py - NLP processing tests
- test_ai_extractor.py - AI extraction tests
- test_graph_integration.py - Graph integration tests
- test_extractors/ - Format-specific extractor tests
Distribution
This package is distributed via PyPI as inputless-ingestion.
# Install from PyPI
pip install inputless-ingestion
# Install development version
pip install git+https://github.com/vesperxs/InputlessSDK.git
API Reference
DocumentProcessor
Main orchestrator for document processing.
processor = DocumentProcessor(
    enable_ocr=True,                 # Enable OCR processing
    enable_nlp=True,                 # Enable NLP processing
    enable_ai=True,                  # Enable AI extraction
    ocr_confidence_threshold=0.7     # OCR confidence threshold
)

# Process a single file
document_data = await processor.process_file(
    file_path,
    extract_entities=True,
    extract_topics=True,
    extract_ai_insights=True
)

# Batch process
results = await processor.process_batch(
    file_paths,
    max_concurrent=5
)
OCREngine
OCR text extraction from images.
ocr = OCREngine(
    language="eng",               # Language code
    provider="tesseract",         # "tesseract", "easyocr", or "auto"
    enable_preprocessing=True     # Image preprocessing
)

result = await ocr.process_image(image_path)
result = await ocr.process_with_confidence(image_path, min_confidence=0.8)
NLPProcessor
Natural language processing and analysis.
nlp = NLPProcessor(
    model="en_core_web_sm",    # spaCy model
    max_entities=100,          # Max entities to extract
    num_topics=5               # Number of topics
)
entities = await nlp.extract_entities(text)
topics = await nlp.extract_topics(text, num_topics=5)
sentiment = await nlp.analyze_sentiment(text)
keywords = await nlp.extract_keywords(text, num_keywords=10)
classification = await nlp.classify_text(text)
AIKnowledgeExtractor
AI-powered knowledge extraction using LLMs.
ai = AIKnowledgeExtractor(
    llm_provider="openai",    # "openai" or "anthropic"
    model="gpt-4",            # Model name
    temperature=0.3           # Temperature for generation
)
knowledge = await ai.extract_knowledge(text, document_type="legal")
summary = await ai.summarize(text, max_length=200)
insights = await ai.extract_insights(text)
DocumentGraphIntegration
Neo4j graph database integration (optional).
from inputless_graph import Neo4jRepository, Neo4jConfig
config = Neo4jConfig(uri="bolt://localhost:7687", user="neo4j", password="pass")
repo = Neo4jRepository(config)
graph = DocumentGraphIntegration(repo)
document_id = graph.create_document_node(document_data)
entity_ids = graph.create_entity_nodes(document_id, entities)
topic_ids = graph.create_topic_nodes(document_id, topics)
similar = graph.find_similar_documents(document_id, limit=10)
Data Models
DocumentData
class DocumentData(BaseModel):
    text: str
    metadata: DocumentMetadata
    entities: Optional[List[Entity]] = None
    topics: Optional[List[Topic]] = None
    ai_insights: Optional[Dict[str, Any]] = None
    ocr_confidence: Optional[float] = None
DocumentMetadata
class DocumentMetadata(BaseModel):
    file_path: str
    file_type: str
    file_size: int
    mime_type: str
    encoding: Optional[str] = None
    title: Optional[str] = None
    author: Optional[str] = None
    created_date: Optional[str] = None
    page_count: Optional[int] = None
OCRResult
class OCRResult(BaseModel):
    text: str
    confidence: float
    language: str
    provider: str
Entity
class Entity(BaseModel):
    text: str
    label: str
    start: int
    end: int
    confidence: float
Topic
class Topic(BaseModel):
    topic_id: int
    keywords: List[str]
    weight: float
Version: 1.0.0
Status: ✅ Implementation Complete
Last Updated: January 2024