Sema4.ai Document Intelligence library
Sema4AI Document Intelligence Core
A Python library for document intelligence operations including document extraction, data model management, layout processing, and content transformation.
Overview
The Document Intelligence Core provides a unified service layer for:
- Document Processing: Extract and transform content from 30+ document formats
- Data Model Management: Create and manage document schemas and business views
- Layout Management: Generate translation schemas for mapping between different data formats
- Content Extraction: Leverage Reducto AI for intelligent document parsing
- Validation: Validate extracted content against quality rules
Installation
pip install sema4ai-docint
Getting Started
Basic Setup
from sema4ai.data import DataSource
from sema4ai_docint import build_di_service
# `datasource` is assumed to be an existing DataSource instance
# Build the document intelligence service
di_service = build_di_service(
    datasource=datasource,
    sema4_api_key="your-sema4-api-key",  # Optional: required for extraction operations
    disable_ssl_verification=False  # Optional: set to True to disable SSL verification for the extraction client
)
When to Use API Keys
sema4_api_key Parameter
- Required for: Document extraction operations
- Used by: ExtractionService and DocumentService.ingest() operations
- If not provided: Extraction service will be None, and document ingestion will fail
disable_ssl_verification Parameter
- Required for: Development environments or networks with SSL/proxy issues
- Used by: Reducto client connections to Sema4AI backend
- Default: False (SSL verification enabled)
- When to set True: Testing environments, behind corporate proxies, or SSL certificate issues
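One way to keep the key out of source code is to read both settings from the environment. The sketch below only builds the keyword arguments described above; the variable names SEMA4_API_KEY and DOCINT_DISABLE_SSL and the helper di_service_kwargs are hypothetical and are not defined by the library.

```python
import os
from typing import Any, Mapping, Optional


def di_service_kwargs(env: Optional[Mapping[str, str]] = None) -> dict[str, Any]:
    """Build keyword arguments for build_di_service() from environment variables.

    SEMA4_API_KEY and DOCINT_DISABLE_SSL are hypothetical names chosen for
    this sketch; the library does not define them.
    """
    env = os.environ if env is None else env
    kwargs: dict[str, Any] = {}

    api_key = env.get("SEMA4_API_KEY", "").strip()
    if api_key:
        # Only pass the key when it is set; without it, extraction is unavailable.
        kwargs["sema4_api_key"] = api_key

    # Keep SSL verification on unless it is explicitly disabled.
    flag = env.get("DOCINT_DISABLE_SSL", "").strip().lower()
    kwargs["disable_ssl_verification"] = flag in {"1", "true", "yes"}
    return kwargs
```

The result can then be unpacked into the call, for example build_di_service(datasource=datasource, **di_service_kwargs()).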
API Reference
DIService (Main Facade)
The DIService class provides access to all document intelligence operations through organized sub-services.
# Access sub-services
di_service.document # Document operations
di_service.data_model # Data model operations
di_service.layout # Layout operations
di_service.extraction # Extraction operations (if sema4_api_key provided)
DocumentService
Handles high-level document operations including ingestion, querying, and validation.
ingest(file_name: str, data_model_name: str, layout_name: str) -> dict
Ingest a document into the system using a specific data model and layout.
Parameters:
- file_name (str): Name of the PDF file to process
- data_model_name (str): Name of the data model to use
- layout_name (str): Name of the document layout for processing
Returns: Dict containing the processed document and validation information
Requires: sema4_api_key (uses ExtractionService)
result = di_service.document.ingest(
    file_name="invoice.pdf",
    data_model_name="invoice_model",
    layout_name="standard_layout"
)
query(document_id: str) -> dict
Retrieve a document in data model format using business views.
Parameters:
- document_id (str): Document ID to retrieve
Returns: Dict with document data organized by view names
document_data = di_service.document.query("doc_123")
validate(data_model_name: str, document_id: str) -> dict
Validate a document against quality checks.
Parameters:
- data_model_name (str): Name of the data model
- document_id (str): ID of the document to validate
Returns: Validation results with overall status and rule outcomes
validation_result = di_service.document.validate("invoice_model", "doc_123")
DataModelService
Manages data models, schemas, and business view generation.
generate_from_file(file_name: str) -> dict
Generate a data model schema from an uploaded document.
Parameters:
- file_name (str): Name of the file to analyze
Returns: Dict with generated schema and success message
Uses: AgentServerClient for AI-powered schema generation
schema_result = di_service.data_model.generate_from_file("sample_invoice.pdf")
create_from_schema(name: str, description: str, json_schema_text: str, prompt: str = None, summary: str = None) -> dict
Create a new data model from a JSON schema.
Parameters:
- name (str): Name of the data model
- description (str): Description of the data model
- json_schema_text (str): JSON schema as string
- prompt (str, optional): Custom prompt for the data model
- summary (str, optional): Summary of the data model
Returns: Created data model as JSON
Uses: AgentServerClient for schema processing and summarization
data_model = di_service.data_model.create_from_schema(
    name="Invoice Model",
    description="Schema for processing invoices",
    json_schema_text='{"type": "object", "properties": {...}}',
    prompt="Extract invoice data accurately"
)
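Because json_schema_text is passed as a string, a schema written as a Python dictionary has to be serialized first. The following is a minimal sketch using the standard json module; the invoice fields are invented for illustration and are not part of the library.

```python
import json

# Hypothetical invoice schema, written as standard JSON Schema.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
            },
        },
    },
    "required": ["invoice_number"],
}

# Serialize to the string form expected by the json_schema_text parameter.
json_schema_text = json.dumps(invoice_schema, indent=2)
```

Building the schema as a dictionary and serializing it avoids hand-written JSON strings, which are easy to break with a stray quote or trailing comma.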
create_business_views(data_model_name: str) -> dict
Create SQL views for a data model in the database.
Parameters:
- data_model_name (str): Name of the data model
Returns: Success message
views_result = di_service.data_model.create_business_views("invoice_model")
LayoutService
Handles document layout operations and translation schema generation.
generate_translation_schema(data_model_name: str, layout_schema: str) -> dict
Create translation rules to map layout schema to data model schema.
Parameters:
- data_model_name (str): Name of the target data model
- layout_schema (str): Source extraction schema as JSON string
Returns: Dict containing translation mapping rules
Uses: AgentServerClient for intelligent mapping generation
translation_schema = di_service.layout.generate_translation_schema(
    data_model_name="invoice_model",
    layout_schema='{"type": "object", "properties": {...}}'
)
ExtractionService
Provides document extraction capabilities using Reducto AI.
Note: Only available when sema4_api_key is provided to build_di_service().
extract(file_path: Path, extraction_schema: Union[str, dict], data_model_prompt: str = None, extraction_config: dict = None, document_layout_prompt: str = None) -> dict
Extract structured data from a document.
Parameters:
- file_path (Path): Path to the document file
- extraction_schema (str | dict): Schema defining what to extract
- data_model_prompt (str, optional): Custom prompt for extraction
- extraction_config (dict, optional): Reducto configuration options
- document_layout_prompt (str, optional): Layout-specific prompt
Returns: Extracted data as dictionary
Requires: sema4_api_key
from pathlib import Path
# Only available if sema4_api_key was provided
if di_service.extraction:
    extracted_data = di_service.extraction.extract(
        file_path=Path("document.pdf"),
        extraction_schema={"type": "object", "properties": {...}},
        data_model_prompt="Extract key business information"
    )
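Because extraction_schema accepts either a JSON string or a dict, it can help to parse and check the schema before calling extract(), so that malformed JSON fails early. The sketch below assumes nothing about the library beyond that parameter type; load_extraction_schema is a hypothetical helper and not part of the package.

```python
import json
from typing import Union


def load_extraction_schema(schema: Union[str, dict]) -> dict:
    """Return the schema as a dict, parsing it first when given as a JSON string."""
    parsed = json.loads(schema) if isinstance(schema, str) else schema
    if not isinstance(parsed, dict):
        # A JSON array, number, or string is not a usable object schema.
        raise ValueError("extraction schema must be a JSON object")
    return parsed
```

Invalid JSON raises json.JSONDecodeError, which is a subclass of ValueError, so a single except ValueError covers both failure modes.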
reducto Property
Access to the underlying Reducto client for advanced operations.
# Direct access to Reducto client
if di_service.extraction:
    reducto_client = di_service.extraction.reducto
Usage Examples
Complete Document Processing Workflow
import json
from sema4ai.data import DataSource
from sema4ai_docint import build_di_service
# Setup (`datasource` is assumed to be an existing DataSource instance)
di_service = build_di_service(
    datasource=datasource,
    sema4_api_key="your-sema4-api-key"
)
# 1. Generate schema from sample document
schema_result = di_service.data_model.generate_from_file("sample.pdf")
print("Generated schema:", schema_result["schema"])
# 2. Create data model
data_model = di_service.data_model.create_from_schema(
    name="Contract Model",
    description="Schema for processing contracts",
    json_schema_text=json.dumps(schema_result["schema"])
)
# 3. Process a document
# (assumes the model created above is addressable as "contract_model")
result = di_service.document.ingest(
    file_name="contract.pdf",
    data_model_name="contract_model",
    layout_name="default"
)
# 4. Query the processed document
document_data = di_service.document.query(result["document"]["id"])
# 5. Validate the document
validation = di_service.document.validate(
    "contract_model",
    result["document"]["id"]
)
Working Without Extraction Service
# For operations that don't require document extraction
di_service = build_di_service(datasource=datasource)
# No sema4_api_key provided
# These operations still work:
# - Query existing documents
# - Create data models from existing schemas
# - Generate translation schemas
# - Validate documents
# This will be None:
assert di_service.extraction is None
# Document ingestion will fail without extraction service
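Since di_service.extraction is None when no key is supplied, callers may prefer to fail fast with a clear message before attempting ingestion. The sketch below works under that assumption; require_extraction is a hypothetical helper, and the SimpleNamespace object only stands in for a service built without an API key.

```python
from types import SimpleNamespace
from typing import Any


def require_extraction(di_service: Any) -> Any:
    """Return the extraction sub-service, or raise if it was not configured."""
    extraction = getattr(di_service, "extraction", None)
    if extraction is None:
        raise RuntimeError(
            "Extraction is unavailable: pass sema4_api_key to build_di_service()"
        )
    return extraction


# Stand-in for a service built without an API key (illustration only).
service_without_key = SimpleNamespace(extraction=None)
try:
    require_extraction(service_without_key)
except RuntimeError as exc:
    print(exc)
```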
Error Handling
The library defines custom exceptions for different service operations:
from sema4ai_docint.services import (
    DocumentServiceError,
    DataModelServiceError,
    LayoutServiceError,
    ExtractionServiceError
)
try:
    result = di_service.document.ingest(
        file_name="document.pdf",
        data_model_name="model",
        layout_name="layout"
    )
except DocumentServiceError as e:
    print(f"Document processing failed: {e}")
except ExtractionServiceError as e:
    print(f"Extraction failed: {e}")
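When several files are processed in one run, the same pattern can be applied per file so that one failure does not stop the batch. The sketch below is a generic helper under that assumption: ingest_many is hypothetical, and the exception types to catch are passed in by the caller rather than imported here, so the block runs without the library installed.

```python
from typing import Callable, Iterable, Tuple, Type


def ingest_many(
    ingest_one: Callable[[str], dict],
    file_names: Iterable[str],
    handled: Tuple[Type[Exception], ...],
) -> Tuple[dict, dict]:
    """Ingest each file, collecting results and error messages separately."""
    results: dict = {}
    failures: dict = {}
    for name in file_names:
        try:
            results[name] = ingest_one(name)
        except handled as exc:
            # Record the failure and continue with the remaining files.
            failures[name] = str(exc)
    return results, failures
```

With the library, ingest_one could be a small function that calls di_service.document.ingest() with a fixed data model and layout, and handled could be (DocumentServiceError, ExtractionServiceError). Any other exception type still propagates to the caller.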