Official Python SDK for Cerevox

Cerevox - The Data Layer

Parse documents with enterprise-grade reliability
AI-powered • Highest Accuracy • Vector DB ready


Official Python SDK for Lexa - Parse documents into structured data

Perfect for: RAG applications, document analysis, data extraction, and vector database preparation

Installation

pip install cerevox

System Requirements:

Quick Start

Get started in 30 seconds:

from cerevox import Lexa

# Parse a document
client = Lexa(api_key="your-api-key")
documents = client.parse(["document.pdf"])

print(f"Extracted {len(documents[0].content)} characters")
print(f"Found {len(documents[0].tables)} tables")

Async Processing (Recommended):

import asyncio
from cerevox import AsyncLexa

async def main():
    async with AsyncLexa(api_key="your-api-key") as client:
        documents = await client.parse(["document.pdf", "report.docx"])
        
        # Get chunks optimized for vector databases
        chunks = documents.get_all_text_chunks(target_size=500)
        print(f"Ready for embedding: {len(chunks)} chunks")

asyncio.run(main())

See It In Action

Document Processing Pipeline

Input Document → AI Processing → Structured Output → Vector Ready

Sample Output Structure

{
  "filename": "financial_report.pdf",
  "content": "Q4 financial results show...",
  "tables": [
    {
      "headers": ["Quarter", "Revenue", "Growth"],
      "rows": [["Q4", "$2.3M", "15%"]]
    }
  ],
  "metadata": {
    "pages": 12,
    "confidence": 0.998,
    "processing_time": 2.3
  },
  "chunks": [
    {
      "content": "Executive Summary: Q4 results...",
      "metadata": {"page": 1, "section": "summary"}
    }
  ]
}
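
The sample structure above can be read with plain dictionary access. The sketch below is a minimal illustration, assuming the parsed output has been serialized to a dict of exactly this shape; `table_to_records` is a hypothetical helper and not part of the SDK.

```python
from typing import Any

# A trimmed copy of the sample output shown above.
sample: dict[str, Any] = {
    "filename": "financial_report.pdf",
    "tables": [
        {
            "headers": ["Quarter", "Revenue", "Growth"],
            "rows": [["Q4", "$2.3M", "15%"]],
        }
    ],
}


def table_to_records(table: dict[str, Any]) -> list[dict[str, str]]:
    """Pair each row's cells with the table headers (hypothetical helper)."""
    headers = table["headers"]
    return [dict(zip(headers, row)) for row in table["rows"]]


records = table_to_records(sample["tables"][0])
print(records)
```

Turning rows into header-keyed records makes each table cell addressable by column name, which is convenient when a table is stored alongside text chunks.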

Features

Performance & Scale

  • 10x Faster than traditional solutions
  • Native Async Support with concurrent processing
  • Enterprise-grade reliability with automatic retries

AI-Powered Extraction

  • SOTA Accuracy with cutting-edge ML models
  • Advanced Table Extraction preserving structure and formatting
  • 12+ File Formats including PDF, DOCX, PPTX, HTML, and more

Integration Ready

  • Vector Database Optimized chunks for RAG applications
  • 7+ Cloud Storage integrations (S3, SharePoint, Google Drive, etc.)
  • Framework Agnostic works with Django, Flask, FastAPI

Developer Experience

  • Intuitive API with full type hints and comprehensive examples
  • Rich Metadata extraction including images, formatting, and structure
  • Smart Search across documents and batches

Intelligent Vector Database Preparation

Engineered specifically for vector databases and RAG applications

Smart Chunking Features

  • Structure-Aware: Preserves headers, paragraphs, code blocks, and logical document boundaries
  • Precise Control: Configurable target sizes with tolerance for optimal embedding performance
  • Format-Aware: Maintains markdown formatting, code syntax, and table structures
  • Performance-First: Built-in async processing with no manual post-processing required
  • Rich Context: Full document metadata for enhanced retrieval and search relevance
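
To make the idea of a target size with tolerance concrete, here is a small sketch of paragraph-aware chunking. It is not the SDK's algorithm, which this document does not describe; `chunk_paragraphs` is a hypothetical function that packs whole paragraphs greedily and never splits one.

```python
def chunk_paragraphs(text: str, target_size: int = 500, tolerance: float = 0.1) -> list[str]:
    """Greedy paragraph packing; an illustration only, not the SDK's algorithm."""
    # Upper bound on chunk length: the target plus the allowed tolerance.
    max_size = int(target_size * (1 + tolerance))
    chunks: list[str] = []
    current = ""
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = paragraph  # an oversized paragraph is kept whole
    if current:
        chunks.append(current)
    return chunks


text = "\n\n".join(["A" * 200, "B" * 200, "C" * 200, "D" * 100])
chunks = chunk_paragraphs(text, target_size=500, tolerance=0.1)
print([len(chunk) for chunk in chunks])
```

Splitting only at paragraph boundaries keeps each chunk readable on its own, at the cost of chunk sizes that vary below the upper bound.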

Quick Start Examples

import asyncio
from cerevox import AsyncLexa, chunk_markdown, chunk_text

# Method 1: Direct Vector DB Preparation (Recommended)
async def main():
    # No api_key is passed here; this assumes the client picks up CEREVOX_API_KEY
    async with AsyncLexa() as client:
        documents = await client.parse(["document.pdf", "report.docx"])

        # Get optimized chunks for vector databases
        text_chunks = documents.get_all_text_chunks(
            target_size=500,        # Performant for most embedding models
            include_metadata=True   # Rich context for retrieval
        )
        markdown_chunks = documents.get_all_markdown_chunks(
            target_size=800,  # Larger chunks for formatted content
            tolerance=0.1     # ±10% size flexibility
        )

asyncio.run(main())

# Method 2: Standalone Chunking Functions (no client required)
markdown_content = "# Title\n\nSome *formatted* text."  # placeholder input
plain_text = "Some plain text."                         # placeholder input
chunks = chunk_markdown(markdown_content, target_size=500)
chunks = chunk_text(plain_text, target_size=300)

Vector Database Integration

# generate_embedding, index (Pinecone), and collection (ChromaDB) are
# placeholders for objects you create with your own embedding model and clients.
async def index_documents():
    async with AsyncLexa() as client:
        documents = await client.parse(["doc.pdf"])
        chunks = documents.get_all_text_chunks(target_size=512)

    # Pinecone Integration: upsert one vector per chunk
    for chunk in chunks:
        embedding = generate_embedding(chunk['content'])
        index.upsert([{
            'id': f"{chunk['document_filename']}_{chunk['chunk_index']}",
            'values': embedding,
            'metadata': chunk  # Includes filename, page, element_type, etc.
        }])

    # ChromaDB Integration: add all chunks in one call, outside the loop
    collection.add(
        documents=[chunk['content'] for chunk in chunks],
        metadatas=[chunk for chunk in chunks],
        ids=[f"doc_{i}" for i in range(len(chunks))]
    )
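
Many vector stores restrict metadata to flat scalar values, so it can help to filter a chunk before sending it. The sketch below assumes chunk dictionaries with the `content`, `document_filename`, and `chunk_index` keys used above; `to_vector_record` and the nested `bbox` field are hypothetical and only serve the illustration.

```python
from typing import Any


def to_vector_record(chunk: dict[str, Any], embedding: list[float]) -> dict[str, Any]:
    """Build an upsert-style record with a stable id and flat metadata."""
    scalar_types = (str, int, float, bool)
    # Keep scalar fields only and leave the chunk text out of the metadata.
    metadata = {
        key: value
        for key, value in chunk.items()
        if key != "content" and isinstance(value, scalar_types)
    }
    return {
        "id": f"{chunk['document_filename']}_{chunk['chunk_index']}",
        "values": embedding,
        "metadata": metadata,
    }


# Hypothetical chunk; the keys follow the snippet above.
chunk = {
    "content": "Q4 financial results show...",
    "document_filename": "doc.pdf",
    "chunk_index": 0,
    "page": 1,
    "element_type": "paragraph",
    "bbox": {"x": 0, "y": 0},  # nested value, dropped from the metadata
}
record = to_vector_record(chunk, embedding=[0.1, 0.2, 0.3])
print(record["id"], record["metadata"])
```

Deriving the id from the filename and chunk index keeps ids stable across re-indexing, so a repeated upsert overwrites a chunk instead of duplicating it.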

Cloud Storage Integrations - Coming Soon!

Coming soon: connect to and parse documents from 7+ cloud storage services. Just set up authentication on Cerevox:

# Run inside an async function
async with AsyncLexa() as client:
    # Amazon S3
    s3_docs = await client.parse_s3_folder(
        bucket_name="my-bucket",
        folder_path="documents/"
    )

    # Microsoft SharePoint
    sharepoint_docs = await client.parse_sharepoint_folder(
        drive_id="drive-id",
        folder_id="folder-id"
    )

    # Also supports: Box, Dropbox, Google Drive, Salesforce, Sendme

Supported Services: Coming Soon!

  • Amazon S3 - Bucket and folder parsing
  • Box - Enterprise file management
  • Dropbox - Personal and business accounts
  • Google Drive - File and folder processing
  • Microsoft SharePoint - Sites, drives, and folders
  • Salesforce - CRM document processing
  • Sendme - Secure file transfer integration

Examples

Explore comprehensive examples in the examples/ directory:

Example                     Description
lexa_examples.py            Complete SDK functionality demonstration
vector_db_preparation.py    Vector database chunking and integration patterns
async_examples.py           Advanced async processing and cloud integrations
document_examples.py        Document analysis and manipulation features
cloud_integrations.py       All cloud storage service integrations

Run the Complete Demo

# Clone and explore
git clone https://github.com/CerevoxAI/cerevox-python.git

cd cerevox-python

export CEREVOX_API_KEY="your-api-key"

# Run comprehensive demos
python examples/async_examples.py  # Async features
python examples/cloud_integrations.py  # Cloud Integrations
python examples/document_examples.py  # Document analysis
python examples/vector_db_preparation.py  # Vector DB preparation

Advanced Examples

Content Analysis & Search

# Advanced document analysis
doc = documents[0]

# Extract statistics
stats = doc.get_statistics()
print(f"Characters: {stats['characters']}")
print(f"Words: {stats['words']}")
print(f"Sentences: {stats['sentences']}")

# Content search with metadata
matches = doc.search_content("revenue", include_metadata=True)

for match in matches:
    print(f"Found on page {match['page_number']}: {match['context']}")

# Batch analysis
similarity_matrix = documents.get_content_similarity_matrix()
key_phrases = documents.extract_key_phrases(top_n=10)

Table Extraction & Processing

# Extract and analyze tables
all_tables = documents.get_all_tables()

print(f"Found {len(all_tables)} tables across documents")

# Convert to pandas for analysis
df_tables = documents.to_pandas_tables()

for filename, tables in df_tables.items():
    print(f"{filename}: {len(tables)} tables")

    for table in tables:
        print(f" Table shape: {table.shape}")

# Export tables to CSV
documents.export_tables_to_csv("exported_tables/")

Performance Optimization

from cerevox import AsyncLexa, ProcessingMode  # ProcessingMode import path is assumed

# Configure for high-performance processing (run inside an async function)
async with AsyncLexa(
    api_key="your-api-key",
    max_concurrent=20,  # Increase parallel processing
    timeout=120.0,      # Extended timeout for large files
    max_retries=5       # Enhanced error resilience
) as client:

    # Batch processing with progress tracking
    def progress_callback(status):
        print(f"{status.status} - Processing...")

    documents = await client.parse(
        files=large_file_list,
        mode=ProcessingMode.ADVANCED,
        progress_callback=progress_callback
    )
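
A setting like `max_concurrent` caps how many jobs are in flight at once. This document does not say how the SDK enforces the cap; the sketch below shows one common approach using `asyncio.Semaphore`, with `process_all` and the fake sleep-based job as hypothetical stand-ins.

```python
import asyncio


async def process_all(items: list[str], max_concurrent: int = 3) -> tuple[list[str], int]:
    """Run fake jobs with at most max_concurrent in flight; return results and the observed peak."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def worker(item: str) -> str:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for a network call
            in_flight -= 1
            return item.upper()

    results = await asyncio.gather(*(worker(item) for item in items))
    return list(results), peak


results, peak = asyncio.run(
    process_all([f"file{i}.pdf" for i in range(10)], max_concurrent=3)
)
print(results[:2], peak)
```

Raising the cap increases throughput only until the remote service or the local network becomes the bottleneck, so a larger value is worth measuring rather than assuming.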

Documentation

For complete API documentation, visit:

API Reference

AsyncLexa(api_key: [string], [options: [dict]])

The main async client for document processing with enterprise-grade reliability.

api_key

  • Required
  • Type: [string]
  • Values: <your cerevox api key>

Your Cerevox API key obtained from Cerevox.

options

max_concurrent
  • Optional
  • Type: [int]
  • Default: 10

Maximum number of concurrent processing jobs.

timeout
  • Optional
  • Type: [float]
  • Default: 60.0

Request timeout in seconds for API calls.

max_retries
  • Optional
  • Type: [int]
  • Default: 3

Maximum number of retry attempts for failed requests.
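
For readers unfamiliar with retry settings, the sketch below shows how a `max_retries` limit typically behaves. Whether the SDK uses exponential backoff, and which errors it retries, is not stated in this document; `call_with_retries` and `flaky` are hypothetical names.

```python
import time


def call_with_retries(func, max_retries: int = 3, base_delay: float = 0.01):
    """Call func(); on an exception, retry up to max_retries times with exponential backoff."""
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted, surface the last error
            time.sleep(base_delay * (2 ** attempt))
            attempt += 1


calls = {"count": 0}


def flaky() -> str:
    """Fail twice, then succeed (simulates a transient error)."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = call_with_retries(flaky, max_retries=3)
print(result, calls["count"])
```

With `max_retries=3` a request is attempted at most four times in total: the initial call plus three retries.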

AsyncLexa Methods

parse(files: [list], [options: [dict]])

Parse documents from local files or file paths.

files
  • Required
  • Type: [list]<[string]>
  • Values: ["path/to/file.pdf", "document.docx"]

List of file paths to parse.

options
progress_callback
  • Optional
  • Type: [function]
  • Default: None

Callback function to track parsing progress. Receives status updates.

mode
  • Optional
  • Type: [string]
  • Default: 'STANDARD'
  • Values: 'STANDARD', 'ADVANCED'

Processing mode for document parsing.

parse_urls(urls: [list], [options: [dict]])

Parse documents from URLs.

urls
  • Required
  • Type: [list]<[string]>
  • Values: ["https://example.com/doc.pdf"]

List of URLs pointing to documents to parse.

options

Same as parse() method options.

Document Object

Individual document with rich metadata and content access.

Properties

filename
  • Type: [string]
  • Description: Original filename of the document
file_type
  • Type: [string]
  • Description: Document type (e.g., 'pdf', 'docx', 'html')
page_count
  • Type: [int]
  • Description: Number of pages in the document
content
  • Type: [string]
  • Description: Plain text content of the document
elements
  • Type: [list]<[dict]>
  • Description: Structured document elements with metadata
tables
  • Type: [list]<[dict]>
  • Description: Extracted tables from the document

Methods

to_markdown()
  • Returns: [string]
  • Description: Convert document to formatted markdown
to_html()
  • Returns: [string]
  • Description: Convert document to HTML format
to_dict()
  • Returns: [dict]
  • Description: Convert document to dictionary format
search_content(query: [string], [options: [dict]])

Search for content within the document.

query
  • Required
  • Type: [string]

The search query string.

options
include_metadata
  • Optional
  • Type: [bool]
  • Default: False

Include metadata in search results.
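
The shape of a search result with metadata can be pictured with a small stand-alone sketch. It assumes results carry `page_number` and `context` keys, as the search example earlier in this document suggests; `search_pages` is a hypothetical function and does not reflect how the SDK searches.

```python
def search_pages(pages: list[str], query: str, context_chars: int = 20) -> list[dict]:
    """Case-insensitive substring search over per-page text (illustrative only)."""
    matches = []
    needle = query.lower()
    for page_number, text in enumerate(pages, start=1):
        haystack = text.lower()
        start = haystack.find(needle)
        while start != -1:
            # Keep a little surrounding text so each hit is readable on its own.
            lo = max(0, start - context_chars)
            hi = min(len(text), start + len(query) + context_chars)
            matches.append({"page_number": page_number, "context": text[lo:hi]})
            start = haystack.find(needle, start + 1)
    return matches


pages = ["Executive summary.", "Q4 revenue grew 15%. Revenue per user was flat."]
matches = search_pages(pages, "revenue")
for match in matches:
    print(f"Found on page {match['page_number']}: {match['context']}")
```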

get_elements_by_page(page_number: [int])
  • Returns: [list]<[dict]>
  • Description: Get all elements from a specific page
page_number
  • Required
  • Type: [int]
  • Values: 1, 2, 3...

Page number to retrieve elements from.

get_elements_by_type(element_type: [string])
  • Returns: [list]<[dict]>
  • Description: Filter elements by type
element_type
  • Required
  • Type: [string]
  • Values: 'table', 'paragraph', 'header', etc.

Type of elements to retrieve.

get_statistics()
  • Returns: [dict]
  • Description: Get document statistics including character count, word count, etc.
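
As a rough picture of what such statistics contain, the sketch below computes the three keys used in the analysis example earlier (`characters`, `words`, `sentences`) from a plain string. The counting rules here are naive assumptions; the SDK's definitions of a word or sentence may differ.

```python
import re


def text_statistics(text: str) -> dict[str, int]:
    """Naive counts; the SDK's definitions of word and sentence may differ."""
    words = text.split()
    # Treat a run of terminal punctuation followed by whitespace or end of text as a sentence end.
    sentences = re.findall(r"[.!?]+(?:\s|$)", text)
    return {
        "characters": len(text),
        "words": len(words),
        "sentences": len(sentences),
    }


stats = text_statistics("Revenue grew 15%. Costs fell! Is that good?")
print(stats)
```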

DocumentBatch Object

Collection of documents with batch operations.

Properties

total_pages
  • Type: [int]
  • Description: Total pages across all documents in the batch

Methods

search_all(query: [string], [options: [dict]])

Search across all documents in the batch.

query
  • Required
  • Type: [string]

The search query string.

options

Same as Document search_content() options.

filter_by_type(file_type: [string])
  • Returns: [list]<Document>
  • Description: Filter documents by file type
file_type
  • Required
  • Type: [string]
  • Values: 'pdf', 'docx', 'html', etc.

File type to filter by.

save_to_json(filepath: [string])

Save batch to JSON file.

filepath
  • Required
  • Type: [string]

Path where to save the JSON file.

to_combined_text()
  • Returns: [string]
  • Description: Combine all document content into single text string
to_combined_markdown()
  • Returns: [string]
  • Description: Combine all document content into single markdown string
to_combined_html()
  • Returns: [string]
  • Description: Combine all document content into single HTML string
get_all_text_chunks([options: [dict]])

Get optimized text chunks for vector databases.

options

target_size
  • Optional
  • Type: [int]
  • Default: 500

Target size for each chunk in characters.

tolerance
  • Optional
  • Type: [float]
  • Default: 0.1
  • Values: 0.0 - 1.0

Size tolerance as a fraction of target_size (e.g., 0.1 = ±10%).

include_metadata
  • Optional
  • Type: [bool]
  • Default: True

Include document metadata with each chunk.

get_all_markdown_chunks([options: [dict]])

Get optimized markdown chunks for vector databases.

options

Same as get_all_text_chunks() plus:

preserve_tables
  • Optional
  • Type: [bool]
  • Default: True

Keep table structures intact in chunks.

get_all_tables()
  • Returns: [list]<[dict]>
  • Description: Extract all tables from all documents
to_pandas_tables()
  • Returns: [dict]
  • Description: Convert all tables to pandas DataFrames, organized by filename
export_tables_to_csv(directory: [string])

Export all tables to CSV files.

directory
  • Required
  • Type: [string]

Directory path where CSV files will be saved.
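
If you need the same result without the batch helper, a table in the headers/rows shape shown in the sample output can be written with the standard `csv` module. This is a sketch under that assumption; `write_table_csv` and the output file name are hypothetical, and the SDK's own file naming is not documented here.

```python
import csv
import tempfile
from pathlib import Path


def write_table_csv(table: dict, path: Path) -> None:
    """Write one headers/rows table to a CSV file."""
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(table["headers"])
        writer.writerows(table["rows"])


table = {"headers": ["Quarter", "Revenue", "Growth"], "rows": [["Q4", "$2.3M", "15%"]]}

with tempfile.TemporaryDirectory() as directory:
    out_path = Path(directory) / "financial_report_table_0.csv"  # hypothetical naming scheme
    write_table_csv(table, out_path)
    written = out_path.read_text(encoding="utf-8")

print(written)
```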

Standalone Functions

chunk_text(text: [string], [options: [dict]])

Chunk plain text content for vector databases.

text
  • Required
  • Type: [string]

The text content to chunk.

options
target_size
  • Optional
  • Type: [int]
  • Default: 500

Target size for each chunk in characters.

tolerance
  • Optional
  • Type: [float]
  • Default: 0.1

Size tolerance as a fraction of target_size (e.g., 0.1 = ±10%).

chunk_markdown(markdown: [string], [options: [dict]])

Chunk markdown content while preserving structure.

markdown
  • Required
  • Type: [string]

The markdown content to chunk.

options

Same as chunk_text() plus:

preserve_tables
  • Optional
  • Type: [bool]
  • Default: True

Keep table structures intact in chunks.

Error Handling & Configuration

Robust Error Handling

from cerevox import (
	LexaAuthError,
	LexaError,
	LexaJobFailedError,
	LexaTimeoutError
)

# Run inside an async function with an open client
try:
    documents = await client.parse(files)
except LexaAuthError as e:
    print(f"Authentication failed: {e.message}")

except LexaJobFailedError as e:
    print(f"Job failed error: {e.message}")

except LexaTimeoutError as e:
    print(f"Timeout error: {e.message} (status: {e.status_code})")

except LexaError as e:
    print(f"General Lexa API error: {e.message}")
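
The handler order above matters: specific exceptions are listed before the general `LexaError`. The sketch below assumes the specific errors subclass `LexaError`, which that ordering suggests but this document does not state. The classes are local stand-ins defined only for the demo, not imports from the SDK.

```python
class LexaError(Exception):
    """Stand-in base class (assumed hierarchy, defined locally for the demo)."""


class LexaAuthError(LexaError):
    """Stand-in for an authentication failure."""


class LexaTimeoutError(LexaError):
    """Stand-in for a timeout."""


def classify(exc: Exception) -> str:
    """Return which handler catches exc when specific handlers come first."""
    try:
        raise exc
    except LexaAuthError:
        return "auth"
    except LexaTimeoutError:
        return "timeout"
    except LexaError:
        return "general"


print(classify(LexaAuthError("bad key")), classify(LexaError("other")))
```

If the base-class handler came first, it would catch every subclass and the specific branches would never run.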

Migration Guide

From LlamaIndex

# Before (LlamaIndex)
documents = SimpleDirectoryReader('docs').load_data()

# After (Cerevox) - Better performance + async support
import glob

async with AsyncLexa() as client:
    documents = await client.parse(glob.glob('docs/*'))
    chunks = documents.get_all_text_chunks(target_size=500)

From Unstructured

# Before (Unstructured)
from unstructured.partition.auto import partition
elements = partition(filename="document.pdf")

# After (Cerevox) - More accurate tables + async support
async with AsyncLexa() as client:
    documents = await client.parse(["document.pdf"])
    elements = documents[0].elements  # Structured with rich metadata

From Amazon Textract

# Before (Textract) - Manual polling required
response = textract.start_document_text_detection(...)

# After (Cerevox) - Automatic polling + most accurate tables
async with AsyncLexa() as client:
    # Automatic polling, no manual loops needed
    documents = await client.parse(["document.pdf"])

Development and Testing

Setting up for Development

# Clone and install
git clone https://github.com/CerevoxAI/cerevox-python.git
cd cerevox-python/python-sdk
pip install -e ".[dev]"

# Run tests
pytest

# Run the advanced demo
export CEREVOX_API_KEY="your-api-key"
python examples/async_advanced.py

# Test async features
python -c "
import asyncio
from cerevox import AsyncLexa

async def test():
	async with AsyncLexa() as client:
		buckets = await client.list_s3_buckets()
		print(f'Found {len(buckets.buckets)} S3 buckets')

asyncio.run(test())
"

Contributing

We welcome contributions! Please see our Contributing Guide for details.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support & Community

Resources

Get Help

Issues

Changelog

See CHANGELOG.md for detailed release notes and migration guides.


Star us on GitHub if Cerevox helped your project!
Made with ❤️ by the Cerevox team
Happy Parsing
