High-performance Python bindings for the Go-based Kodexa Document SDK with in-memory processing
Kodexa Document Python
High-performance Python bindings for the Go-based Kodexa Document SDK, built on CFFI. In-memory mode runs roughly 100x faster than file-based mode (about 1.2 ms versus 121 ms for document creation).
Overview
This package provides mature Python bindings for the Go-based Kodexa Document SDK. It uses CFFI (C Foreign Function Interface) to communicate with the Go library, offering full access to hierarchical document processing, advanced querying, and rich metadata management.
Key Highlights:
- Production Ready: 413+ comprehensive tests covering all functionality
- High Performance: ~100x faster with in-memory mode (1.19ms vs 121ms)
- Full Feature Set: Complete document manipulation, querying, and persistence
- Cross Platform: Linux, macOS (Intel/ARM), Windows, AWS Lambda
Features
Core Document Operations
- Document Creation: From text, JSON, KDDB files, or scratch
- In-Memory Processing: ~100x performance boost for temporary operations
- Context Managers: Automatic resource cleanup with `with` statements
- Multiple Formats: JSON export/import, KDDB persistence, dict conversion
Content Structure & Navigation
- Hierarchical Nodes: Document tree structure like DOM for web pages
- Content Operations: Rich text handling with content parts
- Tree Navigation: Parent/child relationships, sibling traversal, path queries
- Node Management: Create, modify, remove nodes with full hierarchy support
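To make the tree model concrete, here is a minimal pure-Python sketch of parent/child, sibling, and path navigation. The `Node` class and its method names are hypothetical illustrations and are not the kodexa_document API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass(eq=False)  # identity comparison, so sibling lookup never recurses
class Node:
    """Hypothetical stand-in for a content node; not the SDK's class."""
    node_type: str
    content: str = ""
    parent: Optional[Node] = field(default=None, repr=False)
    children: list[Node] = field(default_factory=list, repr=False)

    def add_child(self, node_type: str, content: str = "") -> Node:
        child = Node(node_type, content, parent=self)
        self.children.append(child)
        return child

    @property
    def index(self) -> int:
        """Position among siblings (0 for a root node)."""
        return self.parent.children.index(self) if self.parent else 0

    def next_sibling(self) -> Optional[Node]:
        if self.parent is None:
            return None
        siblings = self.parent.children
        i = self.index + 1
        return siblings[i] if i < len(siblings) else None

    def walk(self) -> Iterator[Node]:
        """Depth-first traversal, parents before children."""
        yield self
        for child in self.children:
            yield from child.walk()

    def path(self) -> str:
        """Slash-separated path of node types from the root."""
        parts = []
        node: Optional[Node] = self
        while node is not None:
            parts.append(node.node_type)
            node = node.parent
        return "/" + "/".join(reversed(parts))


root = Node("document", "My Document")
section = root.add_child("section", "Introduction")
first = section.add_child("paragraph", "First")
second = section.add_child("paragraph", "Second")
```

The same shape (a parent pointer plus an ordered child list) is what makes DOM-style traversal cheap in either direction.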
Advanced Querying
- Selector Language: XPath-like queries (`//paragraph[contains(@content, 'text')]`)
- Variable Support: Parameterized queries with variable substitution
- Performance Options: First-only results, relative queries from nodes
- Rich Filtering: Content-based, tag-based, and feature-based selection
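The XPath-like selector in this list can be read as "every paragraph node, at any depth, whose content contains the given text". The sketch below shows that meaning with plain dicts and a recursive generator. It is an illustration only: the toy tree and the `select_contains` helper are made up here and do not reproduce the SDK's selector engine.

```python
from typing import Iterator

# A toy tree: each node is a dict with a type, content, and children.
tree = {
    "type": "document", "content": "", "children": [
        {"type": "section", "content": "Intro", "children": [
            {"type": "paragraph", "content": "some text here", "children": []},
            {"type": "paragraph", "content": "nothing relevant", "children": []},
        ]},
        {"type": "paragraph", "content": "more text", "children": []},
    ],
}


def descendants(node: dict) -> Iterator[dict]:
    """Yield the node and everything below it, depth-first (the `//` axis)."""
    yield node
    for child in node["children"]:
        yield from descendants(child)


def select_contains(root: dict, node_type: str, needle: str,
                    first_only: bool = False) -> list[dict]:
    """Rough equivalent of //<node_type>[contains(@content, '<needle>')]."""
    matches = (n for n in descendants(root)
               if n["type"] == node_type and needle in n["content"])
    if first_only:
        # Stop at the first hit instead of scanning the whole tree.
        first = next(matches, None)
        return [] if first is None else [first]
    return list(matches)


hits = select_contains(tree, "paragraph", "text")
```

Passing the search text as an argument, rather than formatting it into a query string, is the same idea as the variable substitution listed above.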
Metadata & Annotations
- Features System: Key-value metadata with type organization
- Tagging: Content annotation with confidence scores and values
- Document Labels: Classification and categorization
- Mixins: Capability flags and behavior markers
- External Data: Arbitrary data storage with custom keys
- Processing Steps: Workflow tracking and validation rules
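As a rough mental model of features and tags, the sketch below keys features by a (type, name) pair and stores each tag with an optional confidence and value. `AnnotatedNode` and `Tag` are hypothetical names, and the 0 to 1 confidence range is an assumption; none of this describes the SDK's actual storage.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Tag:
    name: str
    confidence: Optional[float] = None
    value: Optional[str] = None


@dataclass
class AnnotatedNode:
    """Hypothetical container used only to illustrate the data shape."""
    content: str
    features: dict[tuple[str, str], Any] = field(default_factory=dict)
    tags: dict[str, Tag] = field(default_factory=dict)

    def add_feature(self, feature_type: str, name: str, value: Any) -> None:
        # Features are grouped by type, then addressed by name.
        self.features[(feature_type, name)] = value

    def features_of_type(self, feature_type: str) -> dict[str, Any]:
        return {n: v for (t, n), v in self.features.items() if t == feature_type}

    def tag(self, name: str, confidence: Optional[float] = None,
            value: Optional[str] = None) -> None:
        # Assumption: confidence is a probability-like score in [0, 1].
        if confidence is not None and not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        self.tags[name] = Tag(name, confidence, value)


node = AnnotatedNode("Important content")
node.add_feature("style", "emphasis", "bold")
node.add_feature("style", "size", 12)
node.add_feature("analysis", "length", len(node.content))
node.tag("important", confidence=0.95, value="key-point")
```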
Spatial & Geometric Operations
- Bounding Boxes: Position and dimension tracking
- Spatial Queries: Location-based content selection
- Coordinate Systems: Flexible positioning support
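A location-based query boils down to comparing bounding boxes. The sketch below is a generic illustration: the `BBox` type, its corner-based coordinates, and the page layout are all assumptions made for the example, not the SDK's own classes or coordinate system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Axis-aligned box with x1 <= x2 and y1 <= y2 (hypothetical type)."""
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def width(self) -> float:
        return self.x2 - self.x1

    @property
    def height(self) -> float:
        return self.y2 - self.y1

    def intersects(self, other: "BBox") -> bool:
        """True when the two boxes share any area."""
        return (self.x1 < other.x2 and other.x1 < self.x2
                and self.y1 < other.y2 and other.y1 < self.y2)

    def contains(self, other: "BBox") -> bool:
        """True when `other` lies entirely inside this box."""
        return (self.x1 <= other.x1 and other.x2 <= self.x2
                and self.y1 <= other.y1 and other.y2 <= self.y2)


# Made-up page layout: a header, a body paragraph, and a footer.
boxes = {
    "header": BBox(50, 20, 550, 60),
    "body": BBox(50, 100, 550, 600),
    "footer": BBox(50, 740, 550, 780),
}

# Spatial query: which items fall inside, or overlap, the top of the page?
top_region = BBox(0, 0, 600, 400)
inside_top = [name for name, box in boxes.items() if top_region.contains(box)]
overlapping_top = [name for name, box in boxes.items() if top_region.intersects(box)]
```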
Enterprise Features
- Extraction Engine: Advanced content extraction with taxonomies
- Validation Framework: Rule-based document validation
- Statistics: Comprehensive document metrics and analysis
- Error Handling: Comprehensive exception system with specific error types
- Memory Management: Automatic cleanup with finalizers
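Context managers and finalizers complement each other: the `with` block releases a resource promptly, and a finalizer acts as a safety net if the block is skipped. The sketch below shows that pattern with the standard library's `weakref.finalize`. The `Handle` class is hypothetical and only appends to a log; it does not show how the bindings release Go-side resources.

```python
import weakref


class Handle:
    """Hypothetical wrapper around a native resource (here just a log entry)."""

    def __init__(self, name: str, log: list[str]) -> None:
        self.name = name
        # weakref.finalize runs the callback at most once: on close(),
        # on garbage collection, or at interpreter exit.
        self._finalizer = weakref.finalize(self, log.append, f"released {name}")

    def close(self) -> None:
        self._finalizer()  # idempotent; later calls do nothing

    @property
    def closed(self) -> bool:
        return not self._finalizer.alive

    def __enter__(self) -> "Handle":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()


log: list[str] = []
with Handle("doc-1", log) as h:
    pass      # work with the resource here
h.close()     # safe to call again after the block has closed it
```

Because the callback holds no reference to the `Handle` itself, the object can still be garbage collected, at which point the finalizer fires if `close()` was never called.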
Installation
pip install kodexa-document
Quick Start
from kodexa_document import Document

# Create a high-performance in-memory document
with Document(inmemory=True) as doc:
    # Create document structure
    root = doc.create_node("document", "My Document")
    doc.content_node = root
    section = doc.create_node("section", "Introduction", parent=root)
    para = doc.create_node("paragraph", "Important content", parent=section)

    # Add rich metadata
    para.tag("important", confidence=0.95, value="key-point")
    para.add_feature("style", "emphasis", "bold")
    doc.add_label("technical-document")

    # Query with selectors
    important_nodes = doc.select("//paragraph[@tag='important']")
    all_content = doc.select("//*[contains(@content, 'content')]")

    # Export to different formats
    json_str = doc.to_json(indent=2)
    doc.save("output.kddb")

    print(f"Found {len(important_nodes)} important paragraphs")
Advanced Usage Examples
Document Processing Pipeline
from kodexa_document import Document
from kodexa_document.errors import DocumentError


def process_document(input_path, output_path):
    """Complete document processing pipeline."""
    with Document.from_kddb(input_path, inmemory=True) as doc:
        # Analyze structure
        all_nodes = doc.select("//*")
        paragraphs = doc.select("//paragraph")

        # Process content
        for i, para in enumerate(paragraphs):
            if len(para.content) > 100:  # Long paragraphs
                para.tag("detailed", confidence=0.8)
                para.add_feature("analysis", "length", len(para.content))
            if i == 0:  # First paragraph
                para.tag("introduction")

        # Add document metadata
        doc.set_metadata("processed", True)
        doc.set_metadata("node_count", len(all_nodes))
        doc.add_label("processed-document")

        # Save results
        doc.save(output_path)

        return {
            "uuid": doc.uuid,
            "nodes": len(all_nodes),
            "tagged": len(doc.get_all_tagged_nodes())
        }


# Process with error handling
try:
    result = process_document("input.kddb", "processed.kddb")
    print(f"Processed document {result['uuid']}: {result['nodes']} nodes")
except DocumentError as e:
    print(f"Processing failed: {e}")
Content Analysis and Extraction
from kodexa_document import Document

# Load and analyze document structure
with Document.from_text("Chapter 1\nIntroduction\nContent here",
                        separator="\n", inmemory=True) as doc:
    # Navigate document hierarchy
    root = doc.content_node
    children = root.get_children()

    # Rich querying
    headers = doc.select("//paragraph[1]")  # First paragraphs (likely headers)
    long_content = doc.select("//paragraph[string-length(@content) > 50]")

    # Feature analysis
    for node in children:
        node.add_feature("position", "index", node.index)
        if "Chapter" in node.content:
            node.tag("chapter-header")
            node.add_feature("structure", "type", "header")

    # Get comprehensive statistics
    stats = doc.get_statistics()
    tagged_nodes = doc.get_all_tagged_nodes()

    print(f"Document structure: {len(children)} top-level nodes")
    print(f"Tagged content: {len(tagged_nodes)} nodes")
    print(f"Statistics: {stats}")
Performance Comparison
import time

from kodexa_document import Document

# In-memory processing (recommended for temporary operations)
start = time.perf_counter()  # monotonic clock intended for timing short intervals
with Document(inmemory=True) as doc:
    root = doc.create_node("document", "Fast processing")
    doc.content_node = root
    for i in range(1000):
        doc.create_node("item", f"Item {i}", parent=root)
    nodes = doc.select("//*")
inmemory_time = time.perf_counter() - start

# File-based processing (for persistence)
start = time.perf_counter()
with Document(inmemory=False) as doc:
    root = doc.create_node("document", "Persistent processing")
    doc.content_node = root
    for i in range(1000):
        doc.create_node("item", f"Item {i}", parent=root)
    nodes = doc.select("//*")
file_time = time.perf_counter() - start

print(f"In-memory: {inmemory_time:.3f}s")
print(f"File-based: {file_time:.3f}s")
print(f"Performance improvement: {file_time/inmemory_time:.1f}x faster")
Loading Documents
The from_kddb method supports flexible loading modes:
from kodexa_document import Document

# Standard loading modes
doc = Document.from_kddb("input.kddb")                  # Detached copy (safe, default)
doc = Document.from_kddb("input.kddb", detached=False)  # In-place editing
doc = Document.from_kddb("input.kddb", inmemory=True)   # In-memory (~100x faster)

# Load from bytes (API responses, downloads, etc.)
with open("document.kddb", "rb") as f:
    kddb_bytes = f.read()
doc = Document.from_kddb(kddb_bytes, inmemory=True)

# Temporary files with auto-cleanup
doc = Document.from_kddb("temp.kddb", delete_on_close=True)
| Parameter | Default | Description |
|---|---|---|
| `detached` | `True` | Creates a working copy instead of editing the original |
| `inmemory` | `False` | Loads into memory for ~100x performance |
| `delete_on_close` | `False` | Auto-deletes the file when the document closes |
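The difference between a detached copy and in-place editing can be pictured with plain file operations. The sketch below is an assumption about the general idea, not a description of how `from_kddb` is implemented; `open_working_copy` is a hypothetical helper and the file contents are placeholder bytes.

```python
import os
import shutil
import tempfile


def open_working_copy(path: str, detached: bool = True) -> str:
    """Return the path to edit: a temp copy when detached, else the original."""
    if not detached:
        return path
    suffix = os.path.splitext(path)[1]
    fd, copy_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # mkstemp returns an open descriptor; only the path is needed
    shutil.copyfile(path, copy_path)
    return copy_path


# Demonstrate with a throwaway file standing in for a .kddb document.
fd, original = tempfile.mkstemp(suffix=".kddb")
with os.fdopen(fd, "wb") as f:
    f.write(b"original bytes")

working = open_working_copy(original)  # detached: edits go to the copy
with open(working, "ab") as f:
    f.write(b" + edits")

with open(original, "rb") as f:
    original_after = f.read()  # the original is unchanged

in_place = open_working_copy(original, detached=False)  # same path as original
```

With a detached copy, a failed or abandoned edit never leaves the source file half-written, which is why it makes sense as the safe default.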
Error Handling
from kodexa_document import Document
from kodexa_document.errors import DocumentError, DocumentNotFoundError

# Robust error handling: catch the most specific exceptions first
try:
    with Document.from_kddb("document.kddb", inmemory=True) as doc:
        # Process document
        nodes = doc.select("//paragraph")
        for node in nodes:
            node.tag("processed")

        # Validate results
        if not doc.uuid:
            raise DocumentError("Invalid document state")
except DocumentNotFoundError:
    print("Document file not found")
except DocumentError as e:
    print(f"Document processing error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
Architecture
Python Application
↓
CFFI Python Wrapper (413+ Tests)
↓
Go Shared Library (CGO)
↓
GORM Domain Layer
↓
SQLite Database (File/Memory)
Performance Modes:
- In-Memory SQLite: `:memory:` database for maximum speed
- File-Based SQLite: Persistent `.kddb` files for storage
- Hybrid Mode: Load from file, process in-memory, save back
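These three modes map onto standard SQLite usage. As a sketch of the hybrid pattern, the example below uses Python's built-in `sqlite3` module to copy a file database into `:memory:`, do the heavy work there, and write the result back. The `nodes` table and the file name are made up for illustration; the SDK's real schema and load path are not shown in this README.

```python
import os
import sqlite3
import tempfile

# Made-up file and schema for illustration only.
path = os.path.join(tempfile.mkdtemp(), "example.db")

# File-based: every commit is persisted to disk.
disk = sqlite3.connect(path)
disk.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, content TEXT)")
disk.execute("INSERT INTO nodes (content) VALUES ('from disk')")
disk.commit()

# Hybrid, step 1: load the file into an in-memory database.
mem = sqlite3.connect(":memory:")
disk.backup(mem)  # copies the source (disk) into the target (mem)

# Hybrid, step 2: do the heavy work in memory.
mem.executemany("INSERT INTO nodes (content) VALUES (?)",
                [(f"item {i}",) for i in range(1000)])
mem.commit()

# Hybrid, step 3: write the result back to the file in one pass.
mem.backup(disk)
row_count = disk.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]

mem.close()
disk.close()
```

The speed difference comes largely from avoiding a disk write on every commit; the cost is that nothing is durable until the final copy back.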
Requirements
- Python 3.12+
- cffi >= 1.14.0
- Go shared library (automatically bundled in wheel)
Platform Support
- Linux x86_64 - Primary development platform
- macOS x86_64 & ARM64 - Intel and Apple Silicon support
- Windows x86_64 - Full Windows compatibility
- AWS Lambda - Amazon Linux 2 optimization
Testing & Quality
- 413+ Comprehensive Tests covering all functionality
- 100% Feature Coverage - All advertised features are tested and working
- Error Path Testing - Comprehensive error handling validation
- Performance Testing - Memory usage and speed benchmarks
- Cross-Platform Testing - Validated on all supported platforms
# Run comprehensive test suite
cd lib/python
source ../../venv/bin/activate
python -m pytest tests/ -v
# Test categories
python -m pytest tests/test_document.py -v # Core document operations
python -m pytest tests/test_contentnode_features_tags.py -v # Features and tags
python -m pytest tests/test_contentnode_selectors.py -v # Query system
python -m pytest tests/test_extraction.py -v # Advanced extraction
Development Setup
# Quick setup from repository root
python3 -m venv venv
source venv/bin/activate
pip install cffi pytest
# Build Go library and Python bindings
cd lib/go && make linux # or: make darwin, make windows
cd ../python
# Test installation
python -c "from kodexa_document import Document; print('Success!')"
# Run tests
python -m pytest tests/ -v
Documentation
User Documentation
- USAGE.md - Comprehensive usage examples and best practices
- docs/API_REFERENCE.md - Complete API reference
Build Documentation
- docs/BUILD_SCRIPTS_GUIDE.md - Build automation guide
- build/docs/BUILD.md - Detailed build instructions
- build/docs/WINDOWS_SETUP.md - Windows development setup
Best Practices
- Use `inmemory=True` for temporary processing (~100x faster)
- Use context managers (`with` statements) for automatic cleanup
- Handle specific exceptions (DocumentError, DocumentNotFoundError)
- Structure documents hierarchically with proper parent-child relationships
- Leverage selectors for efficient document querying
- Use features and tags for rich content annotation
- Set meaningful metadata for document tracking and organization
Use Cases
- Document Processing Pipelines - ETL workflows for structured documents
- Content Analysis - Text mining, information extraction, document understanding
- Document Transformation - Format conversion, structure normalization
- Search and Indexing - Content indexing with rich metadata
- Validation and Quality - Document structure validation and quality assessment
- Machine Learning - Feature extraction for ML pipelines
- Enterprise Integration - High-performance document processing systems
Performance Characteristics
| Operation | In-Memory | File-Based | Improvement |
|---|---|---|---|
| Document Creation | ~1.2ms | ~121ms | 100x |
| Node Creation (1000 nodes) | ~15ms | ~1.5s | 100x |
| Selector Queries | ~2ms | ~45ms | 22x |
| Feature/Tag Operations | ~0.5ms | ~25ms | 50x |
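The improvement column is the file-based time divided by the in-memory time. The short calculation below reproduces it from the approximate timings in the table, after converting everything to milliseconds.

```python
# Timings from the table above, converted to milliseconds: (in-memory, file-based).
timings_ms = {
    "Document Creation": (1.2, 121.0),
    "Node Creation (1000 nodes)": (15.0, 1500.0),
    "Selector Queries": (2.0, 45.0),
    "Feature/Tag Operations": (0.5, 25.0),
}

# Speedup = file-based time / in-memory time.
speedups = {op: file_ms / mem_ms for op, (mem_ms, file_ms) in timings_ms.items()}

for op, factor in speedups.items():
    print(f"{op}: {factor:.1f}x")
```

Selector queries come out at 22.5x, which the table rounds down to 22x, and document creation works out to slightly over 100x.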
License
Same as the main Kodexa Document SDK.
Ready to get started? Check out USAGE.md for comprehensive examples and run the test suite to see all features in action!