Official Python SDK for Cerevox
Cerevox - The Data Layer
Parse documents with enterprise-grade reliability
AI-powered • Highest Accuracy • Vector DB ready
Official Python SDK for Lexa - Parse documents into structured data
Perfect for: RAG applications, document analysis, data extraction, and vector database preparation
Installation
pip install cerevox
System Requirements:
- Python 3.9+
- API key from Cerevox
Quick Start
Get started in 30 seconds:
from cerevox import Lexa
# Parse a document
client = Lexa(api_key="your-api-key")
documents = client.parse(["document.pdf"])
print(f"Extracted {len(documents[0].content)} characters")
print(f"Found {len(documents[0].tables)} tables")
Async Processing (Recommended):
import asyncio
from cerevox import AsyncLexa
async def main():
    async with AsyncLexa(api_key="your-api-key") as client:
        documents = await client.parse(["document.pdf", "report.docx"])
        # Get chunks optimized for vector databases
        chunks = documents.get_all_text_chunks(target_size=500)
        print(f"Ready for embedding: {len(chunks)} chunks")

asyncio.run(main())
See It In Action
Document Processing Pipeline
Input Document → AI Processing → Structured Output → Vector Ready
Sample Output Structure
{
  "filename": "financial_report.pdf",
  "content": "Q4 financial results show...",
  "tables": [
    {
      "headers": ["Quarter", "Revenue", "Growth"],
      "rows": [["Q4", "$2.3M", "15%"]]
    }
  ],
  "metadata": {
    "pages": 12,
    "confidence": 0.998,
    "processing_time": 2.3
  },
  "chunks": [
    {
      "content": "Executive Summary: Q4 results...",
      "metadata": {"page": 1, "section": "summary"}
    }
  ]
}
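As a rough illustration of how this structure might be consumed, the sketch below loads the sample payload with the standard json module and pairs each table row with its headers. The helper name table_to_records is hypothetical and not part of the SDK; the only assumption is that the output has the shape shown in the sample above.

```python
import json

# Sample payload copied from the structure above.
SAMPLE = """
{
  "filename": "financial_report.pdf",
  "content": "Q4 financial results show...",
  "tables": [
    {"headers": ["Quarter", "Revenue", "Growth"],
     "rows": [["Q4", "$2.3M", "15%"]]}
  ],
  "metadata": {"pages": 12, "confidence": 0.998, "processing_time": 2.3},
  "chunks": [
    {"content": "Executive Summary: Q4 results...",
     "metadata": {"page": 1, "section": "summary"}}
  ]
}
"""


def table_to_records(table):
    """Pair each row with the table headers, giving one dict per row.

    Hypothetical helper for illustration; not an SDK function.
    """
    return [dict(zip(table["headers"], row)) for row in table["rows"]]


doc = json.loads(SAMPLE)
records = table_to_records(doc["tables"][0])
print(records)
```

Row dictionaries like these are often easier to filter or load into a DataFrame than parallel header and row lists.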
Features
Performance & Scale
- 10x Faster than traditional solutions
- Native Async Support with concurrent processing
- Enterprise-grade reliability with automatic retries
AI-Powered Extraction
- SOTA Accuracy with cutting-edge ML models
- Advanced Table Extraction preserving structure and formatting
- 12+ File Formats including PDF, DOCX, PPTX, HTML, and more
Integration Ready
- Vector Database Optimized chunks for RAG applications
- 7+ Cloud Storage integrations (S3, SharePoint, Google Drive, etc.), coming soon
- Framework Agnostic: works with Django, Flask, FastAPI
Developer Experience
- Intuitive API with full type hints and comprehensive examples
- Rich Metadata extraction including images, formatting, and structure
- Smart Search across documents and batches
Intelligent Vector Database Preparation
Engineered specifically for vector databases and RAG applications
Smart Chunking Features
- Structure-Aware: Preserves headers, paragraphs, code blocks, and logical document boundaries
- Precise Control: Configurable target sizes with tolerance for optimal embedding performance
- Format-Aware: Maintains markdown formatting, code syntax, and table structures
- Performance-First: Built-in async processing with no manual post-processing required
- Rich Context: Full document metadata for enhanced retrieval and search relevance
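To make the idea of structure-aware chunking with a target size and tolerance concrete, here is a minimal sketch that packs whole paragraphs into chunks. It is an illustration only, not the SDK's chunking algorithm; the function name chunk_paragraphs and the greedy packing strategy are assumptions made for this example.

```python
def chunk_paragraphs(text, target_size=500, tolerance=0.1):
    """Greedily pack paragraphs into chunks of roughly target_size characters.

    A chunk is closed once adding the next paragraph would push it past
    target_size * (1 + tolerance). Paragraphs are never split, so a single
    oversized paragraph becomes its own chunk.
    (Illustrative sketch; not the library's implementation.)
    """
    max_size = int(target_size * (1 + tolerance))
    chunks, current = [], ""
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_size:
            # The current chunk is full: emit it and start a new one.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Splitting only at paragraph boundaries keeps each chunk readable on its own, at the cost of some variation in chunk length.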
Quick Start Examples
from cerevox import AsyncLexa, chunk_markdown, chunk_text

# Method 1: Direct Vector DB Preparation (Recommended)
# (run inside an async function)
async with AsyncLexa() as client:
    documents = await client.parse(["document.pdf", "report.docx"])
    # Get optimized chunks for vector databases
    text_chunks = documents.get_all_text_chunks(
        target_size=500,        # Performant for most embedding models
        include_metadata=True   # Rich context for retrieval
    )
    markdown_chunks = documents.get_all_markdown_chunks(
        target_size=800,        # Larger chunks for formatted content
        tolerance=0.1           # ±10% size flexibility
    )

# Method 2: Standalone Chunking Functions
# markdown_content and plain_text are strings you supply
chunks = chunk_markdown(markdown_content, target_size=500)
chunks = chunk_text(plain_text, target_size=300)
Vector Database Integration
# generate_embedding, index, and collection are placeholders for your own
# embedding function, Pinecone index, and ChromaDB collection.
async with AsyncLexa() as client:
    documents = await client.parse(["doc.pdf"])
    chunks = documents.get_all_text_chunks(target_size=512)

    # Pinecone Integration
    for chunk in chunks:
        embedding = generate_embedding(chunk['content'])
        index.upsert([{
            'id': f"{chunk['document_filename']}_{chunk['chunk_index']}",
            'values': embedding,
            'metadata': chunk  # Includes filename, page, element_type, etc.
        }])

    # ChromaDB Integration
    collection.add(
        documents=[chunk['content'] for chunk in chunks],
        metadatas=[chunk for chunk in chunks],
        ids=[f"doc_{i}" for i in range(len(chunks))]
    )
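The loop above upserts one chunk per call. A common alternative is to build all records first and send them in batches. The sketch below shows that pattern without depending on any vector database client. The helpers build_records and batched are hypothetical, the embed callable stands in for your own embedding model, and the chunk keys are the ones used in the example above.

```python
def build_records(chunks, embed):
    """Turn chunk dicts into id/values/metadata records.

    `chunks` are dicts with the keys used above: 'content',
    'document_filename', and 'chunk_index'. `embed` is any callable that
    maps text to a list of floats (hypothetical; supply your own model).
    """
    records = []
    for chunk in chunks:
        records.append({
            "id": f"{chunk['document_filename']}_{chunk['chunk_index']}",
            "values": embed(chunk["content"]),
            # Keep the text out of the metadata to reduce payload size.
            "metadata": {k: v for k, v in chunk.items() if k != "content"},
        })
    return records


def batched(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each batch can then be passed to a single upsert call, for example `for batch in batched(records, 100): index.upsert(batch)`, where index is your own client object.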
Cloud Storage Integrations - Coming Soon!
Coming soon: connect and parse documents from 7+ cloud storage services. Just set up authentication on Cerevox:
async with AsyncLexa() as client:
    # Amazon S3
    s3_docs = await client.parse_s3_folder(
        bucket_name="my-bucket",
        folder_path="documents/"
    )
    # Microsoft SharePoint
    sharepoint_docs = await client.parse_sharepoint_folder(
        drive_id="drive-id",
        folder_id="folder-id"
    )
    # Also supports: Box, Dropbox, Google Drive, Salesforce, Sendme
Supported Services (Coming Soon):
- Amazon S3 - Bucket and folder parsing
- Box - Enterprise file management
- Dropbox - Personal and business accounts
- Google Drive - File and folder processing
- Microsoft SharePoint - Sites, drives, and folders
- Salesforce - CRM document processing
- Sendme - Secure file transfer integration
Examples
Explore comprehensive examples in the examples/ directory:
| Example | Description |
|---|---|
| lexa_examples.py | Complete SDK functionality demonstration |
| vector_db_preparation.py | Vector database chunking and integration patterns |
| async_examples.py | Advanced async processing and cloud integrations |
| document_examples.py | Document analysis and manipulation features |
| cloud_integrations.py | All cloud storage service integrations |
Run the Complete Demo
# Clone and explore
git clone https://github.com/CerevoxAI/cerevox-python.git
cd cerevox-python
export CEREVOX_API_KEY="your-api-key"
# Run comprehensive demos
python examples/async_examples.py # Async features
python examples/cloud_integrations.py # Cloud Integrations
python examples/document_examples.py # Document analysis
python examples/vector_db_preparation.py # Vector DB preparation
Advanced Examples
Content Analysis & Search
# Advanced document analysis
doc = documents[0]
# Extract statistics
stats = doc.get_statistics()
print(f"Characters: {stats['characters']}")
print(f"Words: {stats['words']}")
print(f"Sentences: {stats['sentences']}")
# Content search with metadata
matches = doc.search_content("revenue", include_metadata=True)
for match in matches:
    print(f"Found on page {match['page_number']}: {match['context']}")
# Batch analysis
similarity_matrix = documents.get_content_similarity_matrix()
key_phrases = documents.extract_key_phrases(top_n=10)
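For readers curious what a search result with context might look like, here is a minimal sketch of a case-insensitive substring search over page texts. It returns dicts with the 'page_number' and 'context' keys used in the loop above. The function search_pages and its context_chars parameter are hypothetical, and this is not how the SDK's search_content is implemented.

```python
def search_pages(pages, query, context_chars=30):
    """Case-insensitive substring search over a list of page texts.

    Returns one dict per hit with 'page_number' (1-based) and 'context',
    a slice of the page around the hit.
    (Illustrative sketch; not the library's implementation.)
    """
    needle = query.lower()
    if not needle:
        return []
    matches = []
    for page_number, text in enumerate(pages, start=1):
        haystack = text.lower()
        start = haystack.find(needle)
        while start != -1:
            lo = max(0, start - context_chars)
            hi = min(len(text), start + len(needle) + context_chars)
            matches.append({"page_number": page_number, "context": text[lo:hi]})
            # Continue searching after the current hit.
            start = haystack.find(needle, start + len(needle))
    return matches
```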
Table Extraction & Processing
# Extract and analyze tables
all_tables = documents.get_all_tables()
print(f"Found {len(all_tables)} tables across documents")
# Convert to pandas for analysis
df_tables = documents.to_pandas_tables()
for filename, tables in df_tables.items():
    print(f"{filename}: {len(tables)} tables")
    for table in tables:
        print(f"  Table shape: {table.shape}")
# Export tables to CSV
documents.export_tables_to_csv("exported_tables/")
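If you want control over file names or CSV dialect, tables in the headers/rows shape shown in the sample output can also be written with the standard csv module. The sketch below is an assumption-laden illustration: write_tables_to_csv and the table_N.csv naming scheme are hypothetical and do not describe what export_tables_to_csv does internally.

```python
import csv
from pathlib import Path


def write_tables_to_csv(tables, directory):
    """Write each {'headers': [...], 'rows': [...]} table to its own CSV file.

    Files are named table_0.csv, table_1.csv, ... (hypothetical naming).
    Returns the list of paths written.
    """
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, table in enumerate(tables):
        path = out_dir / f"table_{index}.csv"
        # newline="" lets the csv module control line endings.
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(table["headers"])
            writer.writerows(table["rows"])
        paths.append(path)
    return paths
```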
Performance Optimization
# Configure for high-performance processing
# Assumes ProcessingMode has been imported and large_file_list is a list of file paths
async with AsyncLexa(
    api_key="your-api-key",
    max_concurrent=20,  # Increase parallel processing
    timeout=120.0,      # Extended timeout for large files
    max_retries=5       # Enhanced error resilience
) as client:
    # Batch processing with progress tracking
    def progress_callback(status):
        print(f"{status.status} - Processing...")

    documents = await client.parse(
        files=large_file_list,
        mode=ProcessingMode.ADVANCED,
        progress_callback=progress_callback
    )
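To illustrate what a concurrency cap such as max_concurrent typically means, the sketch below limits the number of in-flight coroutines with asyncio.Semaphore. It is a generic asyncio pattern, not the SDK's internals; run_bounded, demo, and fake_parse are hypothetical names used only for this example.

```python
import asyncio


async def run_bounded(jobs, max_concurrent=10):
    """Run coroutine factories with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(job):
        # Each job waits here until a slot is free.
        async with semaphore:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


async def demo():
    """Run 8 fake jobs with a cap of 3 and record the peak concurrency."""
    active = 0
    peak = 0

    async def fake_parse():
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # Stand-in for network I/O.
        active -= 1
        return "done"

    results = await run_bounded([fake_parse] * 8, max_concurrent=3)
    return results, peak
```

Calling `asyncio.run(demo())` returns the eight results together with the observed peak, which never exceeds the cap.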
Documentation
For complete API documentation, visit:
- Full Documentation - Comprehensive guides and tutorials
- API Reference - Interactive API documentation
- Discord Community - Get help from the community
API Reference
AsyncLexa(api_key: [string], [options: [dict]])
The main async client for document processing with enterprise-grade reliability.
api_key
- Required
- Type: [string]
- Values: <your cerevox api key>
Your Cerevox API key obtained from Cerevox.
options
max_concurrent
- Optional
- Type: [int]
- Default: 10
Maximum number of concurrent processing jobs.
timeout
- Optional
- Type: [float]
- Default: 60.0
Request timeout in seconds for API calls.
max_retries
- Optional
- Type: [int]
- Default: 3
Maximum number of retry attempts for failed requests.
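As a generic illustration of how a retry limit like max_retries is commonly applied, the sketch below retries an awaitable operation with exponential backoff. The backoff schedule, the base_delay parameter, and the with_retries helper are assumptions for this example; the source does not state how the SDK spaces or counts its retries.

```python
import asyncio


async def with_retries(operation, max_retries=3, base_delay=0.5):
    """Await `operation()` and retry on failure with exponential backoff.

    Makes up to max_retries + 1 attempts in total and re-raises the last
    error once the retries are exhausted.
    (Illustrative sketch; not the library's implementation.)
    """
    for attempt in range(max_retries + 1):
        try:
            return await operation()
        except Exception:
            if attempt == max_retries:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            await asyncio.sleep(base_delay * (2 ** attempt))
```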
AsyncLexa Methods
parse(files: [list], [options: [dict]])
Parse documents from local files or file paths.
files
- Required
- Type: [list]<[string]>
- Values: ["path/to/file.pdf", "document.docx"]
List of file paths to parse.
options
progress_callback
- Optional
- Type: [function]
- Default: None
Callback function to track parsing progress. Receives status updates.
mode
- Optional
- Type: [string]
- Default: 'STANDARD'
- Values: 'STANDARD', 'ADVANCED'
Processing mode for document parsing.
parse_urls(urls: [list], [options: [dict]])
Parse documents from URLs.
urls
- Required
- Type: [list]<[string]>
- Values: ["https://example.com/doc.pdf"]
List of URLs pointing to documents to parse.
options
Same as parse() method options.
Document Object
Individual document with rich metadata and content access.
Properties
filename
- Type: [string]
- Description: Original filename of the document
file_type
- Type: [string]
- Description: Document type (e.g., 'pdf', 'docx', 'html')
page_count
- Type: [int]
- Description: Number of pages in the document
content
- Type: [string]
- Description: Plain text content of the document
elements
- Type: [list]<[dict]>
- Description: Structured document elements with metadata
tables
- Type: [list]<[dict]>
- Description: Extracted tables from the document
Methods
to_markdown()
- Returns: [string]
- Description: Convert document to formatted markdown
to_html()
- Returns: [string]
- Description: Convert document to HTML format
to_dict()
- Returns: [dict]
- Description: Convert document to dictionary format
search_content(query: [string], [options: [dict]])
Search for content within the document.
query
- Required
- Type: [string]
The search query string.
options
include_metadata
- Optional
- Type: [bool]
- Default: False
Include metadata in search results.
get_elements_by_page(page_number: [int])
- Returns: [list]<[dict]>
- Description: Get all elements from a specific page
page_number
- Required
- Type: [int]
- Values: 1, 2, 3...
Page number to retrieve elements from.
get_elements_by_type(element_type: [string])
- Returns: [list]<[dict]>
- Description: Filter elements by type
element_type
- Required
- Type: [string]
- Values: 'table', 'paragraph', 'header', etc.
Type of elements to retrieve.
get_statistics()
- Returns: [dict]
- Description: Get document statistics including character count, word count, etc.
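The statistics dictionary is shown elsewhere in this README with the keys 'characters', 'words', and 'sentences'. As a rough idea of how such counts can be derived from plain text, here is a small heuristic sketch. The function text_statistics is hypothetical, and the counting rules are assumptions; the SDK may count differently.

```python
import re


def text_statistics(text):
    """Count characters, words, and sentences with simple heuristics.

    Words are whitespace-separated tokens; sentences are non-empty spans
    between runs of '.', '!' or '?'.
    (Illustrative sketch; not the library's implementation.)
    """
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "characters": len(text),
        "words": len(words),
        "sentences": len(sentences),
    }
```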
DocumentBatch Object
Collection of documents with batch operations.
Properties
total_pages
- Type: [int]
- Description: Total pages across all documents in the batch
Methods
search_all(query: [string], [options: [dict]])
Search across all documents in the batch.
query
- Required
- Type: [string]
The search query string.
options
Same as Document search_content() options.
filter_by_type(file_type: [string])
- Returns: [list]<Document>
- Description: Filter documents by file type
file_type
- Required
- Type: [string]
- Values: 'pdf', 'docx', 'html', etc.
File type to filter by.
save_to_json(filepath: [string])
Save batch to JSON file.
filepath
- Required
- Type: [string]
Path where to save the JSON file.
to_combined_text()
- Returns: [string]
- Description: Combine all document content into single text string
to_combined_markdown()
- Returns: [string]
- Description: Combine all document content into single markdown string
to_combined_html()
- Returns: [string]
- Description: Combine all document content into single HTML string
get_all_text_chunks([options: [dict]])
Get optimized text chunks for vector databases.
options
target_size
- Optional
- Type: [int]
- Default: 500
Target size for each chunk in characters.
tolerance
- Optional
- Type: [float]
- Default: 0.1
- Values: 0.0 - 1.0
Size tolerance as a fraction of target_size (e.g., 0.1 = ±10%).
include_metadata
- Optional
- Type: [bool]
- Default: True
Include document metadata with each chunk.
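The relationship between target_size and tolerance is simple arithmetic: a tolerance of 0.1 around a target of 500 characters allows chunks of roughly 450 to 550 characters. The helper below, size_bounds, is a hypothetical illustration of that calculation and is not part of the SDK.

```python
def size_bounds(target_size=500, tolerance=0.1):
    """Return the (min, max) chunk size implied by a fractional tolerance."""
    if not 0.0 <= tolerance <= 1.0:
        raise ValueError("tolerance must be between 0.0 and 1.0")
    delta = round(target_size * tolerance)
    return target_size - delta, target_size + delta


print(size_bounds(500, 0.1))  # (450, 550)
print(size_bounds(800, 0.1))  # (720, 880)
```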
get_all_markdown_chunks([options: [dict]])
Get optimized markdown chunks for vector databases.
options
Same as get_all_text_chunks() plus:
preserve_tables
- Optional
- Type: [bool]
- Default: True
Keep table structures intact in chunks.
get_all_tables()
- Returns: [list]<[dict]>
- Description: Extract all tables from all documents
to_pandas_tables()
- Returns: [dict]
- Description: Convert all tables to pandas DataFrames, organized by filename
export_tables_to_csv(directory: [string])
Export all tables to CSV files.
directory
- Required
- Type: [string]
Directory path where CSV files will be saved.
Standalone Functions
chunk_text(text: [string], [options: [dict]])
Chunk plain text content for vector databases.
text
- Required
- Type: [string]
The text content to chunk.
options
target_size
- Optional
- Type: [int]
- Default: 500
Target size for each chunk in characters.
tolerance
- Optional
- Type: [float]
- Default: 0.1
Size tolerance as a fraction of target_size (e.g., 0.1 = ±10%).
chunk_markdown(markdown: [string], [options: [dict]])
Chunk markdown content while preserving structure.
markdown
- Required
- Type: [string]
The markdown content to chunk.
options
Same as chunk_text() plus:
preserve_tables
- Optional
- Type: [bool]
- Default: True
Keep table structures intact in chunks.
Error Handling & Configuration
Robust Error Handling
from cerevox import (
    LexaAuthError,
    LexaError,
    LexaJobFailedError,
    LexaTimeoutError
)

# client and files come from the surrounding async code
try:
    documents = await client.parse(files)
except LexaAuthError as e:
    print(f"Authentication failed: {e.message}")
except LexaJobFailedError as e:
    print(f"Job failed error: {e.message}")
except LexaTimeoutError as e:
    print(f"Timeout error: {e.message} (status: {e.status_code})")
except LexaError as e:
    print(f"General Lexa API error: {e.message}")
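The ordering of the except clauses above, with specific errors first and LexaError last, suggests that LexaError acts as a general base class, though the source does not say so explicitly. The sketch below uses stand-in classes (ParseError and its subclasses, all hypothetical) to show why that ordering matters in Python: the first matching clause wins, so a base-class clause placed first would shadow the specific ones.

```python
class ParseError(Exception):
    """Stand-in base class (plays the role of a general API error)."""


class ParseAuthError(ParseError):
    """Stand-in for an authentication failure."""


class ParseTimeoutError(ParseError):
    """Stand-in for a timeout."""


def classify(exc):
    """Mirror the except ordering above: specific classes first, base last."""
    try:
        raise exc
    except ParseAuthError:
        return "auth"
    except ParseTimeoutError:
        return "timeout"
    except ParseError:
        return "general"
```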
Migration Guide
From LlamaIndex
# Before (LlamaIndex)
documents = SimpleDirectoryReader('docs').load_data()
# After (Cerevox) - Better performance + async support
import glob

async with AsyncLexa() as client:
    documents = await client.parse(glob.glob('docs/*'))
    chunks = documents.get_all_text_chunks(target_size=500)
From Unstructured
# Before (Unstructured)
elements = partition_auto(filename="document.pdf")
# After (Cerevox) - More accurate tables + async support
async with AsyncLexa() as client:
    documents = await client.parse(["document.pdf"])
    elements = documents[0].elements  # Structured with rich metadata
From Amazon Textract
# Before (Textract) - Manual polling required
response = textract.start_document_text_detection(...)
# After (Cerevox) - Automatic polling + most accurate tables
async with AsyncLexa() as client:
    # Automatic polling, no manual loops needed
    documents = await client.parse(["document.pdf"])
Development and Testing
Setting up for Development
# Clone and install
git clone https://github.com/CerevoxAI/cerevox-python.git
cd cerevox-python/python-sdk
pip install -e .[dev]
# Run tests
pytest
# Run the advanced demo
export CEREVOX_API_KEY="your-api-key"
python examples/async_advanced.py
# Test async features
python -c "
import asyncio
from cerevox import AsyncLexa

async def test():
    async with AsyncLexa() as client:
        buckets = await client.list_s3_buckets()
        print(f'Found {len(buckets.buckets)} S3 buckets')

asyncio.run(test())
"
Contributing
We welcome contributions! Please see our Contributing Guide for details.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support & Community
| Resources | Get Help | Issues |
Changelog
See CHANGELOG.md for detailed release notes and migration guides.
⭐ Star us on GitHub if Cerevox helped your project!
Made with ❤️ by the Cerevox team
Happy Parsing
File details
Details for the file cerevox-0.1.0.tar.gz.
File metadata
- Download URL: cerevox-0.1.0.tar.gz
- Upload date:
- Size: 53.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f83bfbf3d72a2b78e70aba7f8d8f7191c1b25aa159e925a50d1ff290096ce5cd |
| MD5 | a6857fc24d0cbd76a6b1d19fed9b304b |
| BLAKE2b-256 | 6f15936391682363c4b0c48f02d555b2f465c9180e0123194e17eaec04dacfee |
File details
Details for the file cerevox-0.1.0-py3-none-any.whl.
File metadata
- Download URL: cerevox-0.1.0-py3-none-any.whl
- Upload date:
- Size: 46.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 84b85b7157ddf164e47c10859aedd097d313449b3090a1627aea7fff0d71e6e2 |
| MD5 | dafce928fda199b1eafa1a14e69c63f8 |
| BLAKE2b-256 | b92c676f4fc0755e2e27e04c14305f7527cae2171a740bfbb72f072082fce03e |