Sema4.ai Document Intelligence library

Sema4AI Document Intelligence Core

A Python library for document intelligence operations including document extraction, data model management, layout processing, and content transformation.

Overview

The Document Intelligence Core provides a unified service layer for:

Document Processing: Extract and transform content from 30+ document formats
Data Model Management: Create and manage document schemas and business views
Layout Management: Generate translation schemas for mapping between different data formats
Content Extraction: Leverage Reducto AI for intelligent document parsing
Validation: Validate extracted content against quality rules

Installation

pip install sema4ai-docint

Getting Started

Basic Setup

from sema4ai.data import DataSource
from sema4ai_docint import build_di_service

# `datasource` is assumed to be an existing DataSource instance
# obtained from your Sema4.ai data setup; it is not created here.

# Build the document intelligence service
di_service = build_di_service(
    datasource=datasource,
    sema4_api_key="your-sema4-api-key",  # Optional: required for extraction operations
    disable_ssl_verification=False       # Optional: set True to skip SSL verification for the extraction client
)

When to Use API Keys

`sema4_api_key` Parameter

Required for: Document extraction operations
Used by: ExtractionService and DocumentService.ingest() operations
If not provided: Extraction service will be None, and document ingestion will fail

`disable_ssl_verification` Parameter

Useful for: Development environments or networks with SSL/proxy issues (the parameter itself is optional)
Used by: Reducto client connections to Sema4AI backend
Default: False (SSL verification enabled)
When to set True: Testing environments, behind corporate proxies, or SSL certificate issues
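
Both parameters are often better supplied from the environment than hard-coded. The following is a minimal sketch of that idea, not part of the library: the helper name `docint_settings_from_env` and the variable names `SEMA4_API_KEY` and `DOCINT_DISABLE_SSL_VERIFICATION` are hypothetical, and it assumes that passing `sema4_api_key=None` behaves the same as omitting the argument.

```python
import os


def docint_settings_from_env(env=None):
    """Collect keyword arguments for build_di_service() from environment variables.

    SEMA4_API_KEY and DOCINT_DISABLE_SSL_VERIFICATION are hypothetical names
    chosen for this sketch; the library does not define them.
    """
    env = os.environ if env is None else env
    # A missing or empty key becomes None, which leaves extraction disabled.
    api_key = env.get("SEMA4_API_KEY") or None
    # Only an explicit opt-in turns verification off; the default stays False.
    flag = env.get("DOCINT_DISABLE_SSL_VERIFICATION", "").strip().lower()
    return {
        "sema4_api_key": api_key,
        "disable_ssl_verification": flag in {"1", "true", "yes"},
    }


# Example with an explicit mapping instead of the real environment.
settings = docint_settings_from_env({"SEMA4_API_KEY": "example-key"})
print(settings)
```

The resulting dict could then be unpacked into the call shown in Basic Setup, for example `build_di_service(datasource=datasource, **settings)`.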

API Reference

DIService (Main Facade)

The DIService class provides access to all document intelligence operations through organized sub-services.

# Access sub-services
di_service.document     # Document operations
di_service.data_model   # Data model operations
di_service.layout       # Layout operations
di_service.extraction   # Extraction operations (if sema4_api_key provided)

DocumentService

Handles high-level document operations including ingestion, querying, and validation.

`ingest(file_name: str, data_model_name: str, layout_name: str) -> dict`

Ingest a document into the system using a specific data model and layout.

Parameters:

file_name (str): Name of the PDF file to process
data_model_name (str): Name of the data model to use
layout_name (str): Name of the document layout for processing

Returns: Dict containing the processed document and validation information

Requires: sema4_api_key (uses ExtractionService)

result = di_service.document.ingest(
    file_name="invoice.pdf",
    data_model_name="invoice_model",
    layout_name="standard_layout"
)

`query(document_id: str) -> dict`

Retrieve a document in data model format using business views.

Parameters:

document_id (str): Document ID to retrieve

Returns: Dict with document data organized by view names

document_data = di_service.document.query("doc_123")
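
Because the result is keyed by view name, it can be walked like any other dict. The sketch below only illustrates that pattern: the helper `summarize_views`, the view names, and the assumption that each view holds a list of row dicts are all hypothetical, since the exact value shape is not documented here.

```python
def summarize_views(document_data):
    """Return {view_name: row_count} for a query() result.

    Assumes each value is a list of row dicts; any other shape counts as one item.
    """
    summary = {}
    for view_name, rows in document_data.items():
        summary[view_name] = len(rows) if isinstance(rows, list) else 1
    return summary


# Hypothetical result shape, for illustration only.
sample = {
    "invoice_header": [{"invoice_number": "INV-001", "total": 120.0}],
    "invoice_lines": [
        {"description": "Widget", "amount": 100.0},
        {"description": "Shipping", "amount": 20.0},
    ],
}
print(summarize_views(sample))
```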

`validate(data_model_name: str, document_id: str) -> dict`

Validate a document against quality checks.

Parameters:

data_model_name (str): Name of the data model
document_id (str): ID of the document to validate

Returns: Validation results with overall status and rule outcomes

validation_result = di_service.document.validate("invoice_model", "doc_123")

DataModelService

Manages data models, schemas, and business view generation.

`generate_from_file(file_name: str) -> dict`

Generate a data model schema from an uploaded document.

Parameters:

file_name (str): Name of the file to analyze

Returns: Dict with generated schema and success message

Uses: AgentServerClient for AI-powered schema generation

schema_result = di_service.data_model.generate_from_file("sample_invoice.pdf")

`create_from_schema(name: str, description: str, json_schema_text: str, prompt: str = None, summary: str = None) -> dict`

Create a new data model from a JSON schema.

Parameters:

name (str): Name of the data model
description (str): Description of the data model
json_schema_text (str): JSON schema as string
prompt (str, optional): Custom prompt for the data model
summary (str, optional): Summary of the data model

Returns: Created data model as JSON

Uses: AgentServerClient for schema processing and summarization

data_model = di_service.data_model.create_from_schema(
    name="Invoice Model",
    description="Schema for processing invoices",
    json_schema_text='{"type": "object", "properties": {...}}',
    prompt="Extract invoice data accurately"
)
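
Since `json_schema_text` is a string, one convenient way to produce it is to write the schema as a Python dict and serialize it with `json.dumps`. The field names below are invented for illustration and are not prescribed by the library; only standard JSON Schema keywords are used.

```python
import json

# Hypothetical invoice schema; the field names are illustrative only.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "issue_date": {"type": "string", "format": "date"},
        "total": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
            },
        },
    },
    "required": ["invoice_number", "total"],
}

# Serialize to the string form expected by the json_schema_text parameter.
json_schema_text = json.dumps(invoice_schema, indent=2)

# Round-trip check: the string parses back to the same structure.
assert json.loads(json_schema_text) == invoice_schema
print(json_schema_text)
```

Building the schema as a dict first keeps quoting errors out of the string and lets the same object be reused elsewhere, for instance as the `extraction_schema` argument that accepts a dict.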

`create_business_views(data_model_name: str) -> dict`

Create SQL views for a data model in the database.

Parameters:

data_model_name (str): Name of the data model

Returns: Success message

views_result = di_service.data_model.create_business_views("invoice_model")

LayoutService

Handles document layout operations and translation schema generation.

`generate_translation_schema(data_model_name: str, layout_schema: str) -> dict`

Create translation rules to map layout schema to data model schema.

Parameters:

data_model_name (str): Name of the target data model
layout_schema (str): Source extraction schema as JSON string

Returns: Dict containing translation mapping rules

Uses: AgentServerClient for intelligent mapping generation

translation_schema = di_service.layout.generate_translation_schema(
    data_model_name="invoice_model",
    layout_schema='{"type": "object", "properties": {...}}'
)

ExtractionService

Provides document extraction capabilities using Reducto AI.

Note: Only available when sema4_api_key is provided to build_di_service().

`extract(file_path: Path, extraction_schema: Union[str, dict], data_model_prompt: str = None, extraction_config: dict = None, document_layout_prompt: str = None) -> dict`

Extract structured data from a document.

Parameters:

file_path (Path): Path to the document file
extraction_schema (str | dict): Schema defining what to extract
data_model_prompt (str, optional): Custom prompt for extraction
extraction_config (dict, optional): Reducto configuration options
document_layout_prompt (str, optional): Layout-specific prompt

Returns: Extracted data as dictionary

Requires: sema4_api_key

from pathlib import Path

# Only available if sema4_api_key was provided
if di_service.extraction:
    extracted_data = di_service.extraction.extract(
        file_path=Path("document.pdf"),
        extraction_schema={"type": "object", "properties": {...}},
        data_model_prompt="Extract key business information"
    )

`reducto` Property

Access to the underlying Reducto client for advanced operations.

# Direct access to Reducto client
if di_service.extraction:
    reducto_client = di_service.extraction.reducto

Usage Examples

Complete Document Processing Workflow

import json

from sema4ai.data import DataSource
from sema4ai_docint import build_di_service

# Setup (`datasource` is assumed to be an existing DataSource instance)
di_service = build_di_service(
    datasource=datasource,
    sema4_api_key="your-sema4-api-key"
)

# 1. Generate schema from sample document
schema_result = di_service.data_model.generate_from_file("sample.pdf")
print("Generated schema:", schema_result["schema"])

# 2. Create data model
data_model = di_service.data_model.create_from_schema(
    name="Contract Model",
    description="Schema for processing contracts",
    json_schema_text=json.dumps(schema_result["schema"])
)

# 3. Process a document
# (assumes the model created in step 2 is addressable as "contract_model")
result = di_service.document.ingest(
    file_name="contract.pdf",
    data_model_name="contract_model",
    layout_name="default"
)

# 4. Query the processed document
document_data = di_service.document.query(result["document"]["id"])

# 5. Validate the document
validation = di_service.document.validate(
    "contract_model",
    result["document"]["id"]
)

Working Without Extraction Service

# For operations that don't require document extraction
di_service = build_di_service(datasource=datasource)
# No sema4_api_key provided

# These operations still work:
# - Query existing documents
# - Create data models from existing schemas
# - Generate translation schemas
# - Validate documents

# This will be None:
assert di_service.extraction is None

# Document ingestion will fail without extraction service
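
When the service is built without a key, it can help to fail early with a clear message rather than discovering the missing extraction service during ingestion. The guard below is a sketch under stated assumptions: `require_extraction` and `ExtractionUnavailableError` are hypothetical names, not part of `sema4ai_docint`, and a `SimpleNamespace` stands in for a real `DIService` so the example runs on its own.

```python
from types import SimpleNamespace


class ExtractionUnavailableError(RuntimeError):
    """Hypothetical application-level error; not defined by sema4ai_docint."""


def require_extraction(di_service):
    """Return di_service.extraction, or raise if it is None."""
    extraction = getattr(di_service, "extraction", None)
    if extraction is None:
        raise ExtractionUnavailableError(
            "Extraction is unavailable: pass sema4_api_key to build_di_service()."
        )
    return extraction


# Stand-in for a service built without an API key.
service_without_key = SimpleNamespace(extraction=None)

try:
    require_extraction(service_without_key)
except ExtractionUnavailableError as exc:
    message = str(exc)
    print(message)
```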

Error Handling

The library defines custom exceptions for different service operations:

from sema4ai_docint.services import (
    DocumentServiceError,
    DataModelServiceError,
    LayoutServiceError,
    ExtractionServiceError
)

try:
    result = di_service.document.ingest(
        file_name="document.pdf",
        data_model_name="model",
        layout_name="layout"
    )
except DocumentServiceError as e:
    print(f"Document processing failed: {e}")
except ExtractionServiceError as e:
    print(f"Extraction failed: {e}")
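
The same try/except pattern extends to processing several files, where one failure should not stop the rest. The following sketch assumes nothing about the library beyond the keyword arguments of `ingest` shown above: `ingest_many` and `fake_ingest` are hypothetical helpers, and the stand-in raises `ValueError` so the example runs without the package installed. With the real service, `handled` could be narrowed to `(DocumentServiceError, ExtractionServiceError)`.

```python
def ingest_many(ingest, file_names, data_model_name, layout_name, handled=(Exception,)):
    """Call ingest() for each file, collecting results and error messages separately."""
    succeeded, failed = {}, {}
    for file_name in file_names:
        try:
            succeeded[file_name] = ingest(
                file_name=file_name,
                data_model_name=data_model_name,
                layout_name=layout_name,
            )
        except handled as exc:
            # Record the failure and continue with the next file.
            failed[file_name] = str(exc)
    return succeeded, failed


def fake_ingest(file_name, data_model_name, layout_name):
    """Stand-in for di_service.document.ingest, used only for this sketch."""
    if file_name.endswith(".txt"):
        raise ValueError(f"unsupported file: {file_name}")
    return {"file": file_name}


ok, errors = ingest_many(fake_ingest, ["a.pdf", "b.txt"], "model", "layout")
print(ok)
print(errors)
```

In real use the first argument would be `di_service.document.ingest`.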

Project details

Release history

Version	Release date
0.14.0 (this version)	Mar 19, 2026
0.14.0a2 (pre-release)	Mar 19, 2026
0.14.0a1 (pre-release)	Mar 18, 2026
0.13.2	Jan 28, 2026
0.13.1	Nov 27, 2025
0.12.0	Nov 6, 2025
0.12.0a2 (pre-release)	Nov 4, 2025
0.12.0a1 (pre-release)	Nov 3, 2025
0.11.4	Oct 30, 2025
0.11.3	Oct 28, 2025
0.11.2	Oct 14, 2025
0.11.1	Oct 9, 2025
0.11.0	Oct 3, 2025
0.10.1	Sep 30, 2025
0.10.0	Sep 30, 2025
0.9.0	Sep 26, 2025
0.8.5	Sep 24, 2025
0.8.4	Sep 22, 2025
0.8.2	Sep 18, 2025
0.8.1	Sep 18, 2025
0.1.0	Sep 30, 2025

Download files

Source Distribution

sema4ai_docint-0.14.0.tar.gz (1.3 MB)

Uploaded Mar 19, 2026 Source

Built Distribution

sema4ai_docint-0.14.0-py3-none-any.whl (169.5 kB)

Uploaded Mar 19, 2026 Python 3

File details

Details for the file sema4ai_docint-0.14.0.tar.gz.

File metadata

Download URL: sema4ai_docint-0.14.0.tar.gz
Upload date: Mar 19, 2026
Size: 1.3 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for sema4ai_docint-0.14.0.tar.gz
Algorithm	Hash digest
SHA256	`8d1ffa58a3054c064ea1dbf0b9e64bbe62e4f3a915c8142765850eb6d9fde6c9`
MD5	`9a568c10c8b5d5e6d4e240d996e9e61a`
BLAKE2b-256	`e438bebcb447ce7ac5caa8be5d692095d49925c2dba6057c870ef4b485f1ba09`

File details

Details for the file sema4ai_docint-0.14.0-py3-none-any.whl.

File metadata

Download URL: sema4ai_docint-0.14.0-py3-none-any.whl
Upload date: Mar 19, 2026
Size: 169.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for sema4ai_docint-0.14.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2943fd0ebb4be105ec8c2ace044e7d032ad09c3437b8237d71b63f4527e87ec6`
MD5	`cc65bd929852bd8cdb54aebd4b1101d7`
BLAKE2b-256	`5be2f37f63c69719998b8363b6c08e72afbd93d30d067736f1716aeb79dd0178`
