Official Python SDK for Cerevox

Cerevox - The Data Layer 🧠 ⚡

Parse documents with enterprise-grade reliability
AI-powered • Highest Accuracy • Vector DB ready

Official Python SDK for Lexa - Parse documents into structured data

🎯 Perfect for: RAG applications, document analysis, data extraction, and vector database preparation

📦 Installation

pip install cerevox

System Requirements:

Python 3.9+
API key from Cerevox

🚀 Quick Start

Get started in 30 seconds:

from cerevox import Lexa

# Parse a document
client = Lexa(api_key="your-api-key")
documents = client.parse(["document.pdf"])

print(f"Extracted {len(documents[0].content)} characters")
print(f"Found {len(documents[0].tables)} tables")

Async Processing (Recommended):

import asyncio
from cerevox import AsyncLexa

async def main():
    async with AsyncLexa(api_key="your-api-key") as client:
        documents = await client.parse(["document.pdf", "report.docx"])
        
        # Get chunks optimized for vector databases
        chunks = documents.get_all_text_chunks(target_size=500)
        print(f"Ready for embedding: {len(chunks)} chunks")

asyncio.run(main())

🎥 See It In Action

Document Processing Pipeline

📄 Input Document → 🧠 AI Processing → 📊 Structured Output → 🔍 Vector Ready

Sample Output Structure

{
  "filename": "financial_report.pdf",
  "content": "Q4 financial results show...",
  "tables": [
    {
      "headers": ["Quarter", "Revenue", "Growth"],
      "rows": [["Q4", "$2.3M", "15%"]]
    }
  ],
  "metadata": {
    "pages": 12,
    "confidence": 0.998,
    "processing_time": 2.3
  },
  "chunks": [
    {
      "content": "Executive Summary: Q4 results...",
      "metadata": {"page": 1, "section": "summary"}
    }
  ]
}
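
A structure like the one above can be consumed with the standard library alone. The sketch below assumes the output has been serialized to JSON with exactly the field names shown in the sample; `table_to_records` is a hypothetical helper and not part of the SDK.

```python
import json

# Trimmed copy of the sample output above, serialized as JSON.
SAMPLE = """
{
  "filename": "financial_report.pdf",
  "content": "Q4 financial results show...",
  "tables": [
    {"headers": ["Quarter", "Revenue", "Growth"],
     "rows": [["Q4", "$2.3M", "15%"]]}
  ],
  "metadata": {"pages": 12, "confidence": 0.998, "processing_time": 2.3},
  "chunks": [
    {"content": "Executive Summary: Q4 results...",
     "metadata": {"page": 1, "section": "summary"}}
  ]
}
"""


def table_to_records(table):
    """Pair each row with the table headers (hypothetical helper)."""
    return [dict(zip(table["headers"], row)) for row in table["rows"]]


doc = json.loads(SAMPLE)

# Flatten every table into a list of header -> cell dictionaries.
records = [rec for table in doc["tables"] for rec in table_to_records(table)]

# Collect the distinct pages that contributed at least one chunk.
pages_with_chunks = sorted({c["metadata"]["page"] for c in doc["chunks"]})

print(records)
print(pages_with_chunks)
```

Pairing rows with headers this way makes each table row self-describing, which is convenient when rows are later stored as metadata.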

✨ Features

🚀 Performance & Scale

10x Faster than traditional solutions
Native Async Support with concurrent processing
Enterprise-grade reliability with automatic retries

🧠 AI-Powered Extraction

SOTA Accuracy with cutting-edge ML models
Advanced Table Extraction preserving structure and formatting
12+ File Formats including PDF, DOCX, PPTX, HTML, and more

🔗 Integration Ready

Vector Database Optimized chunks for RAG applications
7+ Cloud Storage integrations (S3, SharePoint, Google Drive, etc.), coming soon
Framework Agnostic: works with Django, Flask, FastAPI

👨‍💻 Developer Experience

Intuitive API with full type hints and comprehensive examples
Rich Metadata extraction including images, formatting, and structure
Smart Search across documents and batches

🧩 Intelligent Vector Database Preparation

Engineered specifically for vector databases and RAG applications

🎯 Smart Chunking Features

Structure-Aware: Preserves headers, paragraphs, code blocks, and logical document boundaries
Precise Control: Configurable target sizes with tolerance for optimal embedding performance
Format-Aware: Maintains markdown formatting, code syntax, and table structures
Performance-First: Built-in async processing with no manual post-processing required
Rich Context: Full document metadata for enhanced retrieval and search relevance
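
To make the target size and tolerance concrete, here is a minimal sketch of paragraph-aware chunking. It is not the SDK's algorithm: `chunk_paragraphs` is a hypothetical helper that greedily packs blank-line-separated paragraphs and closes a chunk when adding another paragraph would exceed `target_size * (1 + tolerance)`.

```python
def chunk_paragraphs(text, target_size=500, tolerance=0.1):
    """Greedily pack paragraphs into chunks of roughly target_size characters.

    Hypothetical illustration only. A single paragraph longer than the
    upper bound is kept whole rather than split mid-sentence.
    """
    upper = int(target_size * (1 + tolerance))
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > upper:
            # Adding this paragraph would overshoot: close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "\n\n".join(f"Paragraph {i}: " + "x" * 180 for i in range(6))
chunks = chunk_paragraphs(sample, target_size=400, tolerance=0.1)
for i, chunk in enumerate(chunks):
    print(i, len(chunk))
```

Breaking only at paragraph boundaries keeps each chunk readable on its own, at the cost of chunk sizes that vary more than a fixed-width split would.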

🚀 Quick Start Examples

import asyncio

from cerevox import AsyncLexa, chunk_markdown, chunk_text

async def main():
    # 🎯 Method 1: Direct Vector DB Preparation (Recommended)
    # No api_key is passed here; the demo commands in this README export CEREVOX_API_KEY.
    async with AsyncLexa() as client:
        documents = await client.parse(["document.pdf", "report.docx"])

        # Get optimized chunks for vector databases
        text_chunks = documents.get_all_text_chunks(
            target_size=500,  # Performant for most embedding models
            include_metadata=True,  # Rich context for retrieval
        )
        markdown_chunks = documents.get_all_markdown_chunks(
            target_size=800,  # Larger chunks for formatted content
            tolerance=0.1,  # ±10% size flexibility
        )

    # 🔧 Method 2: Standalone Chunking Functions
    markdown_content = documents[0].to_markdown()
    plain_text = documents[0].content
    md_chunks = chunk_markdown(markdown_content, target_size=500)
    txt_chunks = chunk_text(plain_text, target_size=300)

asyncio.run(main())

🗄️ Vector Database Integration

# `generate_embedding`, `index` (Pinecone), and `collection` (ChromaDB) are
# placeholders that you create with your own embedding model and vector store.
async def index_documents():
    async with AsyncLexa() as client:
        documents = await client.parse(["doc.pdf"])
        chunks = documents.get_all_text_chunks(target_size=512)

    # Pinecone Integration: one upsert per chunk
    for chunk in chunks:
        embedding = generate_embedding(chunk['content'])
        index.upsert([{
            'id': f"{chunk['document_filename']}_{chunk['chunk_index']}",
            'values': embedding,
            'metadata': chunk  # Includes filename, page, element_type, etc.
        }])

    # ChromaDB Integration: add all chunks in one call, outside the loop
    collection.add(
        documents=[chunk['content'] for chunk in chunks],
        metadatas=[chunk for chunk in chunks],
        ids=[f"doc_{i}" for i in range(len(chunks))]
    )
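
Upserting one vector per request can be slow for large batches. The following sketch assumes chunk dictionaries carry `content`, `document_filename`, and `chunk_index` keys, as in the snippet above. It builds the records once and yields them in fixed-size batches. `fake_embedding` is a stand-in for a real embedding model, and `build_records` and `batched` are hypothetical helpers.

```python
import hashlib


def fake_embedding(text, dim=8):
    """Deterministic placeholder vector; swap in a real embedding model."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dim]]


def build_records(chunks):
    """Turn chunk dicts into {'id', 'values', 'metadata'} records."""
    records = []
    for chunk in chunks:
        # Design choice for this sketch: keep the text out of the metadata copy.
        metadata = {k: v for k, v in chunk.items() if k != "content"}
        records.append({
            "id": f"{chunk['document_filename']}_{chunk['chunk_index']}",
            "values": fake_embedding(chunk["content"]),
            "metadata": metadata,
        })
    return records


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


chunks = [
    {"content": f"chunk text {i}", "document_filename": "doc.pdf", "chunk_index": i}
    for i in range(5)
]
records = build_records(chunks)
batches = list(batched(records, size=2))
print([len(b) for b in batches])
```

Each batch can then be passed to a single upsert or add call, depending on what the chosen vector store accepts.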

☁️ Cloud Storage Integrations - Coming Soon!

Coming soon: connect to and parse documents from 7+ cloud storage services. Just set up authentication on Cerevox:

# Run inside an async function
async with AsyncLexa() as client:
    # Amazon S3
    s3_docs = await client.parse_s3_folder(
        bucket_name="my-bucket",
        folder_path="documents/"
    )

    # Microsoft SharePoint
    sharepoint_docs = await client.parse_sharepoint_folder(
        drive_id="drive-id",
        folder_id="folder-id"
    )

    # Also supports: Box, Dropbox, Google Drive, Salesforce, Sendme

Supported Services: Coming Soon!

🗄️ Amazon S3 - Bucket and folder parsing
📦 Box - Enterprise file management
💾 Dropbox - Personal and business accounts
📁 Google Drive - File and folder processing
🏢 Microsoft SharePoint - Sites, drives, and folders
🤝 Salesforce - CRM document processing
📤 Sendme - Secure file transfer integration

📋 Examples

Explore comprehensive examples in the examples/ directory:

Example	Description
`lexa_examples.py`	Complete SDK functionality demonstration
`vector_db_preparation.py`	Vector database chunking and integration patterns
`async_examples.py`	Advanced async processing and cloud integrations
`document_examples.py`	Document analysis and manipulation features
`cloud_integrations.py`	All cloud storage service integrations

🚀 Run the Complete Demo

# Clone and explore
git clone https://github.com/CerevoxAI/cerevox-python.git
cd cerevox-python

export CEREVOX_API_KEY="your-api-key"

# Run comprehensive demos
python examples/async_examples.py  # Async features
python examples/cloud_integrations.py  # Cloud integrations
python examples/document_examples.py  # Document analysis
python examples/vector_db_preparation.py  # Vector DB preparation

🧪 Advanced Examples

🔍 Content Analysis & Search

# Advanced document analysis
# `documents` is the batch returned by client.parse()
doc = documents[0]

# Extract statistics
stats = doc.get_statistics()
print(f"Characters: {stats['characters']}")
print(f"Words: {stats['words']}")
print(f"Sentences: {stats['sentences']}")

# Content search with metadata
matches = doc.search_content("revenue", include_metadata=True)

for match in matches:
    print(f"Found on page {match['page_number']}: {match['context']}")

# Batch analysis
similarity_matrix = documents.get_content_similarity_matrix()
key_phrases = documents.extract_key_phrases(top_n=10)

🗄️ Table Extraction & Processing

# Extract and analyze tables
all_tables = documents.get_all_tables()

print(f"Found {len(all_tables)} tables across documents")

# Convert to pandas for analysis
df_tables = documents.to_pandas_tables()

for filename, tables in df_tables.items():
    print(f"📄 {filename}: {len(tables)} tables")

    for table in tables:
        print(f" Table shape: {table.shape}")

# Export tables to CSV
documents.export_tables_to_csv("exported_tables/")
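
When only the raw table dictionaries are available, a table can also be written to CSV with the standard library. The sketch assumes each table has `headers` and `rows` keys, as in the sample output earlier in this README; `table_to_csv` is a hypothetical helper, not an SDK function.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table dict ({'headers': [...], 'rows': [[...]]}) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(table["headers"])
    writer.writerows(table["rows"])
    return buffer.getvalue()


table = {
    "headers": ["Quarter", "Revenue", "Growth"],
    "rows": [["Q4", "$2.3M", "15%"]],
}
csv_text = table_to_csv(table)
print(csv_text)
```

Using the `csv` module rather than joining strings by hand means cells that contain commas or quotes are escaped correctly.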

⚡ Performance Optimization

# Assumes ProcessingMode is exported from the top-level package
from cerevox import AsyncLexa, ProcessingMode

# `large_file_list` is your own list of file paths; run inside an async function.
# Configure for high-performance processing
async with AsyncLexa(
    api_key="your-api-key",
    max_concurrent=20,  # Increase parallel processing
    timeout=120.0,  # Extended timeout for large files
    max_retries=5  # Enhanced error resilience
) as client:

    # Batch processing with progress tracking
    def progress_callback(status):
        print(f"📊 {status.status} - Processing...")

    documents = await client.parse(
        files=large_file_list,
        mode=ProcessingMode.ADVANCED,
        progress_callback=progress_callback
    )

📚 Documentation

For complete API documentation, visit:

📖 Full Documentation - Comprehensive guides and tutorials
🔧 API Reference - Interactive API documentation
💬 Discord Community - Get help from the community

📋 API Reference

AsyncLexa(api_key: [string], [options: [dict]])

The main async client for document processing with enterprise-grade reliability.

api_key

Required
Type: [string]
Values: <your cerevox api key>

Your Cerevox API key obtained from Cerevox.

options

max_concurrent

Optional
Type: [int]
Default: 10

Maximum number of concurrent processing jobs.

timeout

Optional
Type: [float]
Default: 60.0

Request timeout in seconds for API calls.

max_retries

Optional
Type: [int]
Default: 3

Maximum number of retry attempts for failed requests.
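
These three options map onto familiar asyncio patterns. The sketch below is not the SDK's implementation. It uses a hypothetical `run_with_limits` helper and simulated jobs to show how a concurrency cap, a per-attempt timeout, and bounded retries are commonly combined.

```python
import asyncio


async def run_with_limits(jobs, max_concurrent=10, timeout=60.0, max_retries=3):
    """Run coroutine factories with a concurrency cap, timeout, and retries."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(job):
        async with semaphore:  # at most max_concurrent jobs run at once
            for attempt in range(max_retries + 1):
                try:
                    return await asyncio.wait_for(job(), timeout=timeout)
                except (asyncio.TimeoutError, ConnectionError):
                    if attempt == max_retries:
                        raise
                    # Exponential backoff, kept tiny for the demo.
                    await asyncio.sleep(0.01 * 2 ** attempt)

    return await asyncio.gather(*(run_one(job) for job in jobs))


attempts = {"count": 0}


async def flaky_job():
    """Simulated request that fails twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "parsed"


async def steady_job():
    """Simulated request that succeeds immediately."""
    return "parsed"


results = asyncio.run(
    run_with_limits([flaky_job, steady_job], max_concurrent=2, timeout=1.0)
)
print(results, attempts["count"])
```

Raising `max_concurrent` increases throughput until the service or network becomes the bottleneck, while a longer `timeout` mainly helps with large files.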

AsyncLexa Methods

parse(files: [list], [options: [dict]])

Parse documents from local files or file paths.

files

Required
Type: [list]<[string]>
Values: ["path/to/file.pdf", "document.docx"]

List of file paths to parse.

options

progress_callback

Optional
Type: [function]
Default: None

Callback function to track parsing progress. Receives status updates.

mode

Optional
Type: [string]
Default: 'STANDARD'
Values: 'STANDARD', 'ADVANCED'

Processing mode for document parsing.
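
A progress callback can be any callable. As a sketch, the tracker below records every status it receives. It assumes only that the status object exposes a `status` attribute, as used in the Performance Optimization example, and it simulates updates with `SimpleNamespace` instead of calling the API. The status strings are made up for the demo.

```python
from collections import Counter
from types import SimpleNamespace


class ProgressTracker:
    """Callable that records status updates (hypothetical helper)."""

    def __init__(self):
        self.history = []

    def __call__(self, status):
        self.history.append(status.status)
        print(f"update {len(self.history)}: {status.status}")

    def summary(self):
        """Count how many times each status value was seen."""
        return Counter(self.history)


tracker = ProgressTracker()

# Simulated updates; a real run would pass `progress_callback=tracker`.
for state in ["processing", "processing", "complete"]:
    tracker(SimpleNamespace(status=state))

print(tracker.summary())
```

Keeping the history on the object, rather than only printing, makes it possible to inspect or log the sequence of updates after parsing finishes.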

parse_urls(urls: [list], [options: [dict]])

Parse documents from URLs.

urls

Required
Type: [list]<[string]>
Values: ["https://example.com/doc.pdf"]

List of URLs pointing to documents to parse.

options

Same as parse() method options.

Document Object

Individual document with rich metadata and content access.

Properties

filename

Type: [string]
Description: Original filename of the document

file_type

Type: [string]
Description: Document type (e.g., 'pdf', 'docx', 'html')

page_count

Type: [int]
Description: Number of pages in the document

content

Type: [string]
Description: Plain text content of the document

elements

Type: [list]<[dict]>
Description: Structured document elements with metadata

tables

Type: [list]<[dict]>
Description: Extracted tables from the document

Methods

to_markdown()

Returns: [string]
Description: Convert document to formatted markdown

to_html()

Returns: [string]
Description: Convert document to HTML format

to_dict()

Returns: [dict]
Description: Convert document to dictionary format

search_content(query: [string], [options: [dict]])

Search for content within the document.

query

Required
Type: [string]

The search query string.

options

include_metadata

Optional
Type: [bool]
Default: False

Include metadata in search results.

get_elements_by_page(page_number: [int])

Returns: [list]<[dict]>
Description: Get all elements from a specific page

page_number

Required
Type: [int]
Values: 1, 2, 3...

Page number to retrieve elements from.

get_elements_by_type(element_type: [string])

Returns: [list]<[dict]>
Description: Filter elements by type

element_type

Required
Type: [string]
Values: 'table', 'paragraph', 'header', etc.

Type of elements to retrieve.

get_statistics()

Returns: [dict]
Description: Get document statistics including character count, word count, etc.
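
As a rough illustration of what such statistics contain, the sketch below computes character, word, and sentence counts from plain text. The key names mirror those used in the Content Analysis example. The counting rules (whitespace-delimited words, sentences ending in `.`, `!`, or `?`) are assumptions and may differ from the SDK's.

```python
import re


def basic_statistics(text):
    """Approximate counts for a plain-text document (hypothetical helper)."""
    # Split on runs of sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "characters": len(text),
        "words": len(text.split()),
        "sentences": len(sentences),
    }


stats = basic_statistics("Revenue grew 15%. Costs fell! Was it enough?")
print(stats)
```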

DocumentBatch Object

Collection of documents with batch operations.

Properties

total_pages

Type: [int]
Description: Total pages across all documents in the batch

Methods

search_all(query: [string], [options: [dict]])

Search across all documents in the batch.

query

Required
Type: [string]

The search query string.

options

Same as Document search_content() options.

filter_by_type(file_type: [string])

Returns: [list]<Document>
Description: Filter documents by file type

file_type

Required
Type: [string]
Values: 'pdf', 'docx', 'html', etc.

File type to filter by.

save_to_json(filepath: [string])

Save batch to JSON file.

filepath

Required
Type: [string]

Path where to save the JSON file.

to_combined_text()

Returns: [string]
Description: Combine all document content into single text string

to_combined_markdown()

Returns: [string]
Description: Combine all document content into single markdown string

to_combined_html()

Returns: [string]
Description: Combine all document content into single HTML string

get_all_text_chunks([options: [dict]])

Get optimized text chunks for vector databases.

options

target_size

Optional
Type: [int]
Default: 500

Target size for each chunk in characters.

tolerance

Optional
Type: [float]
Default: 0.1
Values: 0.0 - 1.0

Size tolerance as a percentage (e.g., 0.1 = ±10%).

include_metadata

Optional
Type: [bool]
Default: True

Include document metadata with each chunk.
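
Because the tolerance is relative to the target size, the accepted range follows directly from the two values. The helper below is a hypothetical illustration of that arithmetic, not an SDK function, and it does not show how strictly the SDK enforces the range.

```python
import math


def size_bounds(target_size=500, tolerance=0.1):
    """Return the (min, max) chunk length implied by a relative tolerance."""
    if not 0.0 <= tolerance <= 1.0:
        raise ValueError("tolerance must be between 0.0 and 1.0")
    # Round first so floating-point noise does not shift the integer bounds.
    lower = math.ceil(round(target_size * (1 - tolerance), 9))
    upper = math.floor(round(target_size * (1 + tolerance), 9))
    return lower, upper


print(size_bounds(500, 0.1))
print(size_bounds(800, 0.25))
```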

get_all_markdown_chunks([options: [dict]])

Get optimized markdown chunks for vector databases.

options

Same as get_all_text_chunks() plus:

preserve_tables

Optional
Type: [bool]
Default: True

Keep table structures intact in chunks.

get_all_tables()

Returns: [list]<[dict]>
Description: Extract all tables from all documents

to_pandas_tables()

Returns: [dict]
Description: Convert all tables to pandas DataFrames, organized by filename

export_tables_to_csv(directory: [string])

Export all tables to CSV files.

Standalone Functions

chunk_text(text: [string], [options: [dict]])

Chunk plain text content for vector databases.

text

Required
Type: [string]

The text content to chunk.

options

target_size

Optional
Type: [int]
Default: 500

Target size for each chunk in characters.

tolerance

Optional
Type: [float]
Default: 0.1

Size tolerance as a percentage.

chunk_markdown(markdown: [string], [options: [dict]])

Chunk markdown content while preserving structure.

markdown

Required
Type: [string]

The markdown content to chunk.

options

Same as chunk_text() plus:

preserve_tables

Optional
Type: [bool]
Default: True

Keep table structures intact in chunks.

🛡️ Error Handling & Configuration

Robust Error Handling

from cerevox import (
    LexaAuthError,
    LexaError,
    LexaJobFailedError,
    LexaTimeoutError
)

# Run inside an async function with an open client
try:
    documents = await client.parse(files)
except LexaAuthError as e:
    print(f"❌ Authentication failed: {e.message}")
except LexaJobFailedError as e:
    print(f"❌ Job failed error: {e.message}")
except LexaTimeoutError as e:
    print(f"❌ Timeout error: {e.message} (status: {e.status_code})")
except LexaError as e:
    print(f"❌ General Lexa API error: {e.message}")
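
The order of the `except` clauses matters: Python runs the first clause that matches, so the general `LexaError` has to come last. The sketch below uses stand-in classes to show this. It assumes the specific errors subclass `LexaError`, which the handler order above suggests but this README does not state.

```python
class LexaError(Exception):
    """Stand-in base error carrying a message (not the SDK class)."""

    def __init__(self, message):
        super().__init__(message)
        self.message = message


class LexaAuthError(LexaError):
    """Stand-in for an authentication failure."""


class LexaTimeoutError(LexaError):
    """Stand-in for a timeout."""


def classify(exc):
    """Raise the exception and return the label of the first matching handler."""
    try:
        raise exc
    except LexaAuthError:
        return "auth"
    except LexaTimeoutError:
        return "timeout"
    except LexaError:  # base class last, or it would shadow the others
        return "general"


labels = [
    classify(LexaAuthError("bad key")),
    classify(LexaTimeoutError("too slow")),
    classify(LexaError("something else")),
]
print(labels)
```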

🔄 Migration Guide

From LlamaIndex

# Before (LlamaIndex)
documents = SimpleDirectoryReader('docs').load_data()

# After (Cerevox) - Better performance + async support
import glob

async with AsyncLexa() as client:
    documents = await client.parse(glob.glob('docs/*'))
    chunks = documents.get_all_text_chunks(target_size=500)

From Unstructured

# Before (Unstructured)
from unstructured.partition.auto import partition

elements = partition(filename="document.pdf")

# After (Cerevox) - More accurate tables + async support
async with AsyncLexa()  as client:
	documents =  await client.parse(["document.pdf"])
	elements = documents[0].elements # Structured with rich metadata

From Amazon Textract

# Before (Textract) - Manual polling required
response = textract.start_document_text_detection(...)

# After (Cerevox) - Automatic polling + most accurate tables
async with AsyncLexa() as client:
    # Automatic polling, no manual loops needed
    documents = await client.parse(["document.pdf"])
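
For context, the manual polling that the client hides usually looks like the loop below. This is a generic sketch with a simulated job, not Textract or Cerevox code; `FakeJob`, `poll_until_done`, and the status strings are all hypothetical.

```python
import time


class FakeJob:
    """Simulated remote job that finishes after a few status checks."""

    def __init__(self, checks_needed=3):
        self._remaining = checks_needed

    def get_status(self):
        self._remaining -= 1
        return "SUCCEEDED" if self._remaining <= 0 else "IN_PROGRESS"


def poll_until_done(job, interval=0.01, max_wait=1.0):
    """Check job status at a fixed interval until it finishes or times out."""
    deadline = time.monotonic() + max_wait
    polls = 0
    while time.monotonic() < deadline:
        polls += 1
        status = job.get_status()
        if status != "IN_PROGRESS":
            return status, polls
        time.sleep(interval)
    raise TimeoutError(f"job still running after {max_wait} seconds")


status, polls = poll_until_done(FakeJob(checks_needed=3))
print(status, polls)
```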

🧪 Development and Testing

Setting up for Development

# Clone and install
git clone https://github.com/CerevoxAI/cerevox-python.git
cd cerevox-python/python-sdk
pip install -e ".[dev]"

# Run tests
pytest

# Run the advanced demo
export CEREVOX_API_KEY="your-api-key"
python examples/async_advanced.py

# Test async features
python -c "
import asyncio
from cerevox import AsyncLexa

async def test():
    async with AsyncLexa() as client:
        buckets = await client.list_s3_buckets()
        print(f'Found {len(buckets.buckets)} S3 buckets')

asyncio.run(test())
"

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support & Community

📖 Resources

💬 Get Help

🐛 Issues

🔄 Changelog

See CHANGELOG.md for detailed release notes and migration guides.

⭐ Star us on GitHub if Cerevox helped your project!
Made with ❤️ by the Cerevox team
Happy Parsing 🔍 ✨

Release history

0.2.0 - Oct 22, 2025
0.1.6 - Jul 1, 2025
0.1.5 - Jul 1, 2025
0.1.4 - Jul 1, 2025
0.1.3 - Jul 1, 2025
0.1.2 - Jun 16, 2025
0.1.1 - Jun 10, 2025
0.1.0 - Jun 9, 2025 (this version)

Download files

Source Distribution

cerevox-0.1.0.tar.gz (53.4 kB), uploaded Jun 9, 2025, Source

Built Distribution

cerevox-0.1.0-py3-none-any.whl (46.1 kB), uploaded Jun 9, 2025, Python 3

File details

Details for the file cerevox-0.1.0.tar.gz.

File metadata

Download URL: cerevox-0.1.0.tar.gz
Upload date: Jun 9, 2025
Size: 53.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for cerevox-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`f83bfbf3d72a2b78e70aba7f8d8f7191c1b25aa159e925a50d1ff290096ce5cd`
MD5	`a6857fc24d0cbd76a6b1d19fed9b304b`
BLAKE2b-256	`6f15936391682363c4b0c48f02d555b2f465c9180e0123194e17eaec04dacfee`

File details

Details for the file cerevox-0.1.0-py3-none-any.whl.

File metadata

Download URL: cerevox-0.1.0-py3-none-any.whl
Upload date: Jun 9, 2025
Size: 46.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for cerevox-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`84b85b7157ddf164e47c10859aedd097d313449b3090a1627aea7fff0d71e6e2`
MD5	`dafce928fda199b1eafa1a14e69c63f8`
BLAKE2b-256	`b92c676f4fc0755e2e27e04c14305f7527cae2171a740bfbb72f072082fce03e`
