High-performance Python bindings for the Go-based Kodexa Document SDK with in-memory processing

Kodexa Document Python

High-performance Python bindings for the Go-based Kodexa Document SDK using CFFI. Provides comprehensive document processing capabilities with ~100x performance improvement through in-memory operations.

Overview

This package provides mature Python bindings for the Go-based Kodexa Document SDK. It uses CFFI (C Foreign Function Interface) to communicate with the Go library, offering full access to hierarchical document processing, advanced querying, and rich metadata management.

Key Highlights:

Production Ready: 413+ comprehensive tests covering all functionality
High Performance: ~100x faster with in-memory mode (1.19ms vs 121ms)
Full Feature Set: Complete document manipulation, querying, and persistence
Cross Platform: Linux, macOS (Intel/ARM), Windows, AWS Lambda

Features

Core Document Operations

Document Creation: From text, JSON, KDDB files, or scratch
In-Memory Processing: ~100x performance boost for temporary operations
Context Managers: Automatic resource cleanup via `with` statements
Multiple Formats: JSON export/import, KDDB persistence, dict conversion

Content Structure & Navigation

Hierarchical Nodes: Document tree structure like DOM for web pages
Content Operations: Rich text handling with content parts
Tree Navigation: Parent/child relationships, sibling traversal, path queries
Node Management: Create, modify, remove nodes with full hierarchy support
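
The navigation ideas listed above (parent/child links, sibling traversal, path queries) can be illustrated with a small stand-alone tree. This is only a sketch: `SketchNode` and all of its methods are hypothetical names made up for illustration, and it does not use or mirror the actual Kodexa node class.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass(eq=False)  # identity-based equality avoids recursive comparisons
class SketchNode:
    """Hypothetical stand-in for a content node; not the Kodexa class."""

    node_type: str
    content: str = ""
    parent: Optional[SketchNode] = field(default=None, repr=False)
    children: list[SketchNode] = field(default_factory=list, repr=False)

    def add_child(self, node_type: str, content: str = "") -> SketchNode:
        """Create a child node and attach it to this node."""
        child = SketchNode(node_type, content, parent=self)
        self.children.append(child)
        return child

    def next_sibling(self) -> Optional[SketchNode]:
        """Return the node that follows this one under the same parent."""
        if self.parent is None:
            return None
        siblings = self.parent.children
        position = next(i for i, s in enumerate(siblings) if s is self)
        return siblings[position + 1] if position + 1 < len(siblings) else None

    def path(self) -> str:
        """Build a slash-separated path of node types from the root."""
        parts = []
        node: Optional[SketchNode] = self
        while node is not None:
            parts.append(node.node_type)
            node = node.parent
        return "/" + "/".join(reversed(parts))

    def walk(self) -> Iterator[SketchNode]:
        """Yield this node and all descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


root = SketchNode("document", "My Document")
section = root.add_child("section", "Introduction")
first = section.add_child("paragraph", "First")
second = section.add_child("paragraph", "Second")
print(second.path())
```

Identity comparison (`is`) is used for sibling lookup so that two nodes with the same type and content are still treated as distinct positions in the tree.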

Advanced Querying

Selector Language: XPath-like queries (//paragraph[contains(@content, 'text')])
Variable Support: Parameterized queries with variable substitution
Performance Options: First-only results, relative queries from nodes
Rich Filtering: Content-based, tag-based, and feature-based selection
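
For readers unfamiliar with XPath-style selection, the following sketch shows the general idea using Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. It is an analogy only: the XML layout, the `tag` attribute, and the `select_containing` helper are assumptions made for illustration, and the Kodexa selector language is a separate implementation whose syntax may differ.

```python
import xml.etree.ElementTree as ET

# A tiny XML tree standing in for a document hierarchy (illustrative only).
tree = ET.fromstring(
    "<document>"
    "<section>"
    "<paragraph tag='important'>Important content</paragraph>"
    "<paragraph>Other text</paragraph>"
    "</section>"
    "</document>"
)

# Attribute filter, similar in spirit to a tag-based selector.
important = tree.findall(".//paragraph[@tag='important']")

# "First only": find() stops at the first match and returns it (or None).
first_paragraph = tree.find(".//paragraph")

# Relative query: search starts from a node rather than the root.
section = tree.find("section")
in_section = section.findall("paragraph")


def select_containing(root, node_type, needle):
    """Content-based filter; ElementTree has no contains(), so filter in Python."""
    return [n for n in root.iter(node_type) if needle in (n.text or "")]


matches = select_containing(tree, "paragraph", "content")
print(len(important), len(in_section), len(matches))
```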

Metadata & Annotations

Features System: Key-value metadata with type organization
Tagging: Content annotation with confidence scores and values
Document Labels: Classification and categorization
Mixins: Capability flags and behavior markers
External Data: Arbitrary data storage with custom keys
Processing Steps: Workflow tracking and validation rules

Spatial & Geometric Operations

Bounding Boxes: Position and dimension tracking
Spatial Queries: Location-based content selection
Coordinate Systems: Flexible positioning support

Enterprise Features

Extraction Engine: Advanced content extraction with taxonomies
Validation Framework: Rule-based document validation
Statistics: Comprehensive document metrics and analysis
Error Handling: Comprehensive exception system with specific error types
Memory Management: Automatic cleanup with finalizers
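
Cleanup with finalizers, as mentioned in the last item above, is commonly built on `weakref.finalize` from the standard library. The sketch below shows that general pattern combined with a context manager. `SketchHandle`, `_release`, and `released` are hypothetical names; how the package itself frees native resources is not described here, so treat this as an assumption about one reasonable design.

```python
import weakref

released = []  # records released handle ids, for demonstration only


def _release(handle_id):
    """Stand-in for freeing a native resource."""
    released.append(handle_id)


class SketchHandle:
    """Hypothetical wrapper around a native handle; not the Kodexa class."""

    def __init__(self, handle_id):
        self.handle_id = handle_id
        # The finalizer runs at most once: on close(), when the object is
        # garbage collected, or at interpreter exit, whichever comes first.
        self._finalizer = weakref.finalize(self, _release, handle_id)

    def close(self):
        """Release the resource now; later calls do nothing."""
        self._finalizer()

    @property
    def closed(self):
        return not self._finalizer.alive

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not suppress exceptions


with SketchHandle(1) as handle:
    print(handle.closed)  # False while the block is active
handle.close()            # already released, so this call does nothing
print(released)           # [1]
```

Passing only `handle_id` to the finalizer, rather than `self`, matters: a finalizer that references its own object would keep that object alive forever.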

Installation

pip install kodexa-document

Quick Start

from kodexa_document import Document

# Create high-performance in-memory document
with Document(inmemory=True) as doc:
    # Create document structure
    root = doc.create_node("document", "My Document")
    doc.content_node = root

    section = doc.create_node("section", "Introduction", parent=root)
    para = doc.create_node("paragraph", "Important content", parent=section)

    # Add rich metadata
    para.tag("important", confidence=0.95, value="key-point")
    para.add_feature("style", "emphasis", "bold")
    doc.add_label("technical-document")

    # Query with selectors
    important_nodes = doc.select("//paragraph[@tag='important']")
    all_content = doc.select("//*[contains(@content, 'content')]")

    # Export to different formats
    json_str = doc.to_json(indent=2)
    doc.save("output.kddb")

print(f"Found {len(important_nodes)} important paragraphs")

Advanced Usage Examples

Document Processing Pipeline

from kodexa_document import Document
from kodexa_document.errors import DocumentError

def process_document(input_path, output_path):
    """Complete document processing pipeline."""
    with Document.from_kddb(input_path, inmemory=True) as doc:
        # Analyze structure
        all_nodes = doc.select("//*")
        paragraphs = doc.select("//paragraph")

        # Process content
        for i, para in enumerate(paragraphs):
            if len(para.content) > 100:  # Long paragraphs
                para.tag("detailed", confidence=0.8)
                para.add_feature("analysis", "length", len(para.content))

            if i == 0:  # First paragraph
                para.tag("introduction")

        # Add document metadata
        doc.set_metadata("processed", True)
        doc.set_metadata("node_count", len(all_nodes))
        doc.add_label("processed-document")

        # Save results
        doc.save(output_path)

        return {
            "uuid": doc.uuid,
            "nodes": len(all_nodes),
            "tagged": len(doc.get_all_tagged_nodes())
        }

# Process with error handling
try:
    result = process_document("input.kddb", "processed.kddb")
    print(f"Processed document {result['uuid']}: {result['nodes']} nodes")
except DocumentError as e:
    print(f"Processing failed: {e}")

Content Analysis and Extraction

from kodexa_document import Document

# Load and analyze document structure
with Document.from_text("Chapter 1\nIntroduction\nContent here",
                       separator="\n", inmemory=True) as doc:

    # Navigate document hierarchy
    root = doc.content_node
    children = root.get_children()

    # Rich querying
    headers = doc.select("//paragraph[1]")  # First paragraphs (likely headers)
    long_content = doc.select("//paragraph[string-length(@content) > 50]")

    # Feature analysis
    for node in children:
        node.add_feature("position", "index", node.index)
        if "Chapter" in node.content:
            node.tag("chapter-header")
            node.add_feature("structure", "type", "header")

    # Get comprehensive statistics
    stats = doc.get_statistics()
    tagged_nodes = doc.get_all_tagged_nodes()

    print(f"Document structure: {len(children)} top-level nodes")
    print(f"Tagged content: {len(tagged_nodes)} nodes")
    print(f"Statistics: {stats}")

Performance Comparison

import time

from kodexa_document import Document

# In-memory processing (recommended for temporary operations)
start = time.perf_counter()  # monotonic, high-resolution timer
with Document(inmemory=True) as doc:
    root = doc.create_node("document", "Fast processing")
    doc.content_node = root
    for i in range(1000):
        doc.create_node("item", f"Item {i}", parent=root)
    nodes = doc.select("//*")
inmemory_time = time.perf_counter() - start

# File-based processing (for persistence)
start = time.perf_counter()
with Document(inmemory=False) as doc:
    root = doc.create_node("document", "Persistent processing")
    doc.content_node = root
    for i in range(1000):
        doc.create_node("item", f"Item {i}", parent=root)
    nodes = doc.select("//*")
file_time = time.perf_counter() - start

print(f"In-memory: {inmemory_time:.3f}s")
print(f"File-based: {file_time:.3f}s")
print(f"Performance improvement: {file_time / inmemory_time:.1f}x faster")
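
The comparison above takes one measurement per mode, which can be noisy. A possible refinement is to repeat each workload and report the median. The helper below is a generic sketch using only the standard library; `measure` and `build_list` are hypothetical names, and the placeholder workload would need to be replaced with the document-building code to compare the two modes.

```python
import statistics
import time


def measure(func, repeats=5):
    """Return the median wall-clock duration of func() in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def build_list():
    # Placeholder workload; swap in the document-building code to compare modes.
    return [f"Item {i}" for i in range(1000)]


median_seconds = measure(build_list)
print(f"median: {median_seconds * 1000:.3f} ms")
```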

Loading Documents

The from_kddb method supports flexible loading modes:

from kodexa_document import Document

# Standard loading modes
doc = Document.from_kddb("input.kddb")  # Detached copy (safe, default)
doc = Document.from_kddb("input.kddb", detached=False)  # In-place editing
doc = Document.from_kddb("input.kddb", inmemory=True)  # ~100x performance boost

# Load from bytes (API responses, downloads, etc.)
with open("document.kddb", "rb") as f:
    kddb_bytes = f.read()
doc = Document.from_kddb(kddb_bytes, inmemory=True)

# Temporary files with auto-cleanup
doc = Document.from_kddb("temp.kddb", delete_on_close=True)

Parameter	Default	Description
`detached`	`True`	Creates working copy vs editing original
`inmemory`	`False`	Loads into memory for ~100x faster processing
`delete_on_close`	`False`	Auto-deletes file when document closes

Error Handling

from kodexa_document import Document
from kodexa_document.errors import DocumentError, DocumentNotFoundError

# Robust error handling
try:
    with Document.from_kddb("document.kddb", inmemory=True) as doc:
        # Process document
        nodes = doc.select("//paragraph")
        for node in nodes:
            node.tag("processed")

        # Validate results
        if not doc.uuid:
            raise DocumentError("Invalid document state")

except DocumentNotFoundError:
    print("Document file not found")
except DocumentError as e:
    print(f"Document processing error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
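
The order of the `except` clauses above matters if `DocumentNotFoundError` is a subclass of `DocumentError`. That relationship is an assumption here, suggested by the handler order but not confirmed. The stand-alone sketch below uses hypothetical stand-in classes to show why the more specific handler should come first.

```python
class SketchDocumentError(Exception):
    """Hypothetical base error (stand-in for DocumentError)."""


class SketchNotFoundError(SketchDocumentError):
    """Hypothetical subclass (stand-in for DocumentNotFoundError)."""


def classify(exc):
    """Raise exc and report which handler catches it, in the order used above."""
    try:
        raise exc
    except SketchNotFoundError:
        return "not-found"
    except SketchDocumentError:
        return "document-error"
    except Exception:
        return "unexpected"


for error in (SketchNotFoundError("missing"), SketchDocumentError("bad"), ValueError("other")):
    print(type(error).__name__, "->", classify(error))
```

If the base-class handler were listed first, it would also catch the subclass, and the "not-found" branch would never run.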

Architecture

Python Application
       ↓
CFFI Python Wrapper (413+ Tests)
       ↓
Go Shared Library (CGO)
       ↓
GORM Domain Layer
       ↓
SQLite Database (File/Memory)

Performance Modes:

In-Memory SQLite: :memory: database for maximum speed
File-Based SQLite: Persistent .kddb files for storage
Hybrid Mode: Load from file, process in-memory, save back
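
The hybrid mode described above (load from file, process in memory, save back) can be sketched at the SQLite level with Python's standard `sqlite3` module and its `Connection.backup` method. This illustrates the general technique only: the `nodes` table, the file name, and both helper functions are hypothetical, and the real KDDB schema and the Go library's loading logic are not shown.

```python
import os
import sqlite3
import tempfile


def load_into_memory(path):
    """Copy a file-based SQLite database into a new :memory: connection."""
    source = sqlite3.connect(path)
    memory = sqlite3.connect(":memory:")
    try:
        source.backup(memory)
    finally:
        source.close()
    return memory


def save_to_file(memory, path):
    """Write an in-memory database back to a file, replacing its contents."""
    target = sqlite3.connect(path)
    try:
        memory.backup(target)
    finally:
        target.close()


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "sketch.db")

    # File-based mode: create a small persistent database.
    disk = sqlite3.connect(path)
    disk.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, content TEXT)")
    disk.execute("INSERT INTO nodes (content) VALUES ('first')")
    disk.commit()
    disk.close()

    # Hybrid mode: load the file, do the heavy work in memory, save back.
    memory = load_into_memory(path)
    memory.executemany(
        "INSERT INTO nodes (content) VALUES (?)",
        [(f"Item {i}",) for i in range(1000)],
    )
    memory.commit()
    save_to_file(memory, path)
    memory.close()

    # Confirm the file now holds the original row plus the new ones.
    check = sqlite3.connect(path)
    row_count = check.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
    check.close()

print(row_count)  # 1001
```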

Requirements

Python 3.12+
cffi >= 1.14.0
Go shared library (automatically bundled in wheel)

Platform Support

Linux x86_64 - Primary development platform
macOS x86_64 & ARM64 - Intel and Apple Silicon support
Windows x86_64 - Full Windows compatibility
AWS Lambda - Amazon Linux 2 optimization

Testing & Quality

413+ Comprehensive Tests covering all functionality
100% Feature Coverage - All advertised features are tested and working
Error Path Testing - Comprehensive error handling validation
Performance Testing - Memory usage and speed benchmarks
Cross-Platform Testing - Validated on all supported platforms

# Run comprehensive test suite
cd lib/python
source ../../venv/bin/activate
python -m pytest tests/ -v

# Test categories
python -m pytest tests/test_document.py -v                    # Core document operations
python -m pytest tests/test_contentnode_features_tags.py -v  # Features and tags
python -m pytest tests/test_contentnode_selectors.py -v      # Query system
python -m pytest tests/test_extraction.py -v                 # Advanced extraction

Development Setup

# Quick setup from repository root
python3 -m venv venv
source venv/bin/activate
pip install cffi pytest

# Build Go library and Python bindings
cd lib/go && make linux  # or: make darwin, make windows
cd ../python

# Test installation
python -c "from kodexa_document import Document; print('Success!')"

# Run tests
python -m pytest tests/ -v

Documentation

User Documentation

USAGE.md - Comprehensive usage examples and best practices
docs/API_REFERENCE.md - Complete API reference

Build Documentation

docs/BUILD_SCRIPTS_GUIDE.md - Build automation guide
build/docs/BUILD.md - Detailed build instructions
build/docs/WINDOWS_SETUP.md - Windows development setup

Best Practices

Use inmemory=True for temporary processing (~100x faster)
Use context managers (with statements) for automatic cleanup
Handle specific exceptions (DocumentError, DocumentNotFoundError)
Structure documents hierarchically with proper parent-child relationships
Leverage selectors for efficient document querying
Use features and tags for rich content annotation
Set meaningful metadata for document tracking and organization

Use Cases

Document Processing Pipelines - ETL workflows for structured documents
Content Analysis - Text mining, information extraction, document understanding
Document Transformation - Format conversion, structure normalization
Search and Indexing - Content indexing with rich metadata
Validation and Quality - Document structure validation and quality assessment
Machine Learning - Feature extraction for ML pipelines
Enterprise Integration - High-performance document processing systems

Performance Characteristics

Operation	In-Memory	File-Based	Improvement
Document Creation	~1.2ms	~121ms	100x
Node Creation (1000 nodes)	~15ms	~1.5s	100x
Selector Queries	~2ms	~45ms	22x
Feature/Tag Operations	~0.5ms	~25ms	50x
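
The Improvement column is the file-based time divided by the in-memory time. As a quick arithmetic check, the sketch below recomputes the ratios from the approximate figures in the table; `timings_ms` simply restates those figures in milliseconds.

```python
# (in-memory ms, file-based ms) taken from the table above; 1.5s = 1500ms.
timings_ms = {
    "Document Creation": (1.2, 121.0),
    "Node Creation (1000 nodes)": (15.0, 1500.0),
    "Selector Queries": (2.0, 45.0),
    "Feature/Tag Operations": (0.5, 25.0),
}

speedups = {
    name: file_ms / memory_ms
    for name, (memory_ms, file_ms) in timings_ms.items()
}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```

The recomputed ratios are about 100.8x, 100.0x, 22.5x, and 50.0x, so the table's rounded values agree with its own timings.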

License

Same as the main Kodexa Document SDK.

Ready to get started? Check out USAGE.md for comprehensive examples and run the test suite to see all features in action!

Download files

Source Distribution

kodexa_document-8.0.0.dev20609788083.tar.gz (19.7 MB)

Uploaded Dec 31, 2025 Source

Built Distribution

kodexa_document-8.0.0.dev20609788083-py3-none-any.whl (19.7 MB)

Uploaded Dec 31, 2025 Python 3

File details

Details for the file kodexa_document-8.0.0.dev20609788083.tar.gz.

File metadata

Download URL: kodexa_document-8.0.0.dev20609788083.tar.gz
Upload date: Dec 31, 2025
Size: 19.7 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for kodexa_document-8.0.0.dev20609788083.tar.gz
Algorithm	Hash digest
SHA256	`b75205915560f05a3900f0e26143e58ab8e933fc2d06ed5c2300faee3d6658d1`
MD5	`c1d7d5cc69b985561c2ff0861b82dd1a`
BLAKE2b-256	`85e1afe0da35d599076256db7ff0eac1d1024fe1ee495fa2de6f8b3bef1298d3`

File details

Details for the file kodexa_document-8.0.0.dev20609788083-py3-none-any.whl.

File metadata

Download URL: kodexa_document-8.0.0.dev20609788083-py3-none-any.whl
Upload date: Dec 31, 2025
Size: 19.7 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for kodexa_document-8.0.0.dev20609788083-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b171e82acdfa45688c39e59e7e1e4cc7ea5842758979419f630d85890c1cd3a9`
MD5	`eb551cbf440dbd03713048f9cd157e8e`
BLAKE2b-256	`2cead7f7cf59a27d32e577fb7016f28c8781e7f1bb620de5ecb7e98ef2fd7787`
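
A downloaded file can be checked against the published SHA256 digest with the standard `hashlib` module. The helper names below are hypothetical, and the local file path in the usage note is an assumption about where the download was saved.

```python
import hashlib
import hmac

# SHA256 digest of the wheel as listed in the table above.
PUBLISHED_WHEEL_SHA256 = (
    "b171e82acdfa45688c39e59e7e1e4cc7ea5842758979419f630d85890c1cd3a9"
)


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare the file's SHA256 digest with an expected hex string."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())
```

For example, `matches_published_hash("kodexa_document-8.0.0.dev20609788083-py3-none-any.whl", PUBLISHED_WHEEL_SHA256)` should return `True` for an intact download saved under that name in the current directory.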

kodexa-document 8.0.0.dev20609788083
