Parsefy Python SDK
Official Python SDK for Parsefy - AI-powered document data extraction.
Extract structured data from PDF and DOCX documents using Pydantic models. Simply define your schema and let Parsefy handle the rest.
Installation
pip install parsefy
Quick Start
from parsefy import Parsefy
from pydantic import BaseModel, Field

# Initialize client (reads PARSEFY_API_KEY from environment)
client = Parsefy()

# Define your extraction schema
class Invoice(BaseModel):
    invoice_number: str = Field(description="The invoice number")
    date: str = Field(description="Invoice date in YYYY-MM-DD format")
    total: float = Field(description="Total amount")
    currency: str = Field(description="3-letter currency code")

# Extract data from a document
result = client.extract(file="invoice.pdf", schema=Invoice)

if result.error is None:
    print(f"Invoice #{result.data.invoice_number}")
    print(f"Total: {result.data.total} {result.data.currency}")
    print(f"Credits used: {result.metadata.credits}")
else:
    print(f"Error: {result.error.message}")
Features
- Type-safe extraction - Full type inference with Pydantic models
- Sync & async support - Both extract() and extract_async() methods
- Multiple input types - File paths, bytes, or file-like objects
- Detailed metadata - Processing time, token usage, and credits consumed
- Client-side validation - File type, size, and existence checks before upload
Authentication
Set your API key via environment variable:
export PARSEFY_API_KEY=pk_your_api_key
Or pass it directly:
client = Parsefy(api_key="pk_your_api_key")
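If you prefer to fail early with a clear message when the key is missing, a small helper along these lines may be useful. This is only a sketch: `load_api_key` is a hypothetical name, not part of the SDK, and how the SDK itself reacts to a missing key is not covered here.

```python
import os


def load_api_key(env_var: str = "PARSEFY_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before creating the client")
    return key


# Hypothetical usage with the documented constructor argument:
# client = Parsefy(api_key=load_api_key())
```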
Usage Examples
Basic Extraction
from parsefy import Parsefy
from pydantic import BaseModel, Field

client = Parsefy()

class Person(BaseModel):
    name: str = Field(description="Full name of the person")
    email: str = Field(description="Email address")
    phone: str | None = Field(default=None, description="Phone number if present")

result = client.extract(file="contact.pdf", schema=Person)

if result.error is None:
    print(result.data.name)
    print(result.data.email)
Complex Schemas
from parsefy import Parsefy
from pydantic import BaseModel, Field

client = Parsefy()

class LineItem(BaseModel):
    description: str = Field(description="Item description")
    quantity: int = Field(description="Quantity ordered")
    unit_price: float = Field(description="Price per unit")
    total: float = Field(description="Line total")

class Invoice(BaseModel):
    invoice_number: str = Field(description="Invoice number")
    vendor: str = Field(description="Vendor company name")
    date: str = Field(description="Invoice date (YYYY-MM-DD)")
    line_items: list[LineItem] = Field(description="List of items on the invoice")
    subtotal: float = Field(description="Subtotal before tax")
    tax: float = Field(description="Tax amount")
    total: float = Field(description="Total amount due")

result = client.extract(file="invoice.pdf", schema=Invoice)

if result.error is None:
    for item in result.data.line_items:
        print(f"{item.description}: {item.quantity} x ${item.unit_price}")
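Extracted numbers can be sanity-checked before you rely on them. The sketch below assumes the subtotal equals the sum of the line totals and that the total equals subtotal plus tax; real invoices with discounts or shipping may break that assumption. `totals_are_consistent` is a hypothetical helper, not part of the SDK, and it works on any object with the same attribute names as the Invoice schema above.

```python
import math


def totals_are_consistent(invoice, tolerance: float = 0.01) -> bool:
    """Check line totals against subtotal, and subtotal + tax against total."""
    line_sum = sum(item.total for item in invoice.line_items)
    subtotal_ok = math.isclose(line_sum, invoice.subtotal, abs_tol=tolerance)
    total_ok = math.isclose(invoice.subtotal + invoice.tax, invoice.total, abs_tol=tolerance)
    return subtotal_ok and total_ok


# Hypothetical usage after a successful extraction:
# if result.error is None and not totals_are_consistent(result.data):
#     print("Totals do not add up; review this invoice manually")
```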
Async Usage
import asyncio
from parsefy import Parsefy
from pydantic import BaseModel, Field

class Receipt(BaseModel):
    store_name: str = Field(description="Name of the store")
    total: float = Field(description="Total amount paid")

async def process_receipts():
    async with Parsefy() as client:
        tasks = [
            client.extract_async(file=f"receipt_{i}.pdf", schema=Receipt)
            for i in range(1, 4)
        ]
        results = await asyncio.gather(*tasks)

        for i, result in enumerate(results, 1):
            if result.error is None:
                print(f"Receipt {i}: {result.data.store_name} - ${result.data.total}")

asyncio.run(process_receipts())
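With many files, launching every request at once may run into rate limits. One common pattern is to cap concurrency with asyncio.Semaphore. The sketch below is generic: `gather_limited` is a hypothetical helper, not an SDK function, and the limit of 5 is an arbitrary default rather than a documented Parsefy quota.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, TypeVar

T = TypeVar("T")


async def gather_limited(
    factories: Iterable[Callable[[], Awaitable[T]]],
    limit: int = 5,
) -> list[T]:
    """Run the coroutines produced by `factories`, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def run(factory: Callable[[], Awaitable[T]]) -> T:
        # Each coroutine is only created and awaited once a slot is free.
        async with semaphore:
            return await factory()

    # gather() returns results in the same order as its inputs.
    return await asyncio.gather(*(run(f) for f in factories))


# Hypothetical usage inside `async with Parsefy() as client:`
# results = await gather_limited(
#     [lambda p=p: client.extract_async(file=p, schema=Receipt) for p in paths],
#     limit=3,
# )
```

Passing zero-argument callables rather than coroutine objects keeps each request from being created until the semaphore admits it.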
Different Input Types
from parsefy import Parsefy
from pydantic import BaseModel
from pathlib import Path

client = Parsefy()

class Document(BaseModel):
    title: str
    content: str

# From file path string
result = client.extract(file="document.pdf", schema=Document)

# From Path object
result = client.extract(file=Path("document.pdf"), schema=Document)

# From bytes
with open("document.pdf", "rb") as f:
    file_bytes = f.read()
result = client.extract(file=file_bytes, schema=Document)

# From file object
with open("document.pdf", "rb") as f:
    result = client.extract(file=f, schema=Document)
Error Handling
from parsefy import Parsefy, APIError, ValidationError
from pydantic import BaseModel

client = Parsefy()

class Invoice(BaseModel):
    number: str
    total: float

try:
    result = client.extract(file="invoice.pdf", schema=Invoice)

    if result.error is None:
        print(result.data)
    else:
        # Extraction-level error (API returned 200 but extraction failed)
        print(f"Extraction failed: {result.error.code}")
        print(f"Message: {result.error.message}")

except ValidationError as e:
    # Client-side validation error (file not found, wrong type, etc.)
    print(f"Validation error: {e.message}")

except APIError as e:
    # HTTP error from API (401, 429, 500, etc.)
    print(f"API error {e.status_code}: {e.message}")
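Some HTTP errors are transient, so you may want to retry them. Here is a minimal sketch of exponential backoff around any callable. `call_with_retry` is a hypothetical helper, not part of the SDK, and treating 429 and 500 as retryable is an assumption rather than documented Parsefy behavior.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    is_retryable: Callable[[Exception], bool],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying with exponential backoff while `is_retryable(exc)` is true."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_attempt = attempt == attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            # Delay doubles each time: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Hypothetical usage with the names from the example above:
# result = call_with_retry(
#     lambda: client.extract(file="invoice.pdf", schema=Invoice),
#     is_retryable=lambda e: isinstance(e, APIError) and e.status_code in (429, 500),
# )
```

Note that this only covers raised exceptions; extraction-level errors reported in result.error still need the check shown in the example.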
API Reference
Parsefy Client
client = Parsefy(
    api_key: str | None = None,  # API key (or set PARSEFY_API_KEY env var)
    timeout: float = 60.0,       # Request timeout in seconds
)
extract() / extract_async()
result = client.extract(
    file: str | Path | bytes | BinaryIO,  # Document to extract from
    schema: type[T],                      # Pydantic model class
) -> ExtractResult[T]
ExtractResult[T]
| Field | Type | Description |
|---|---|---|
| data | T \| None | Extracted data (or None on error) |
| metadata | ExtractionMetadata | Processing metadata |
| error | APIErrorDetail \| None | Error details (or None on success) |
ExtractionMetadata
| Field | Type | Description |
|---|---|---|
| processing_time_ms | int | Processing time in milliseconds |
| input_tokens | int | Input tokens used |
| output_tokens | int | Output tokens generated |
| credits | int | Credits consumed (1 credit = 1 page) |
| fallback_triggered | bool | Whether fallback model was used |
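When processing batches, it can help to total these fields across results. The sketch below uses a stand-in dataclass that mirrors the documented field names so it runs without the SDK; `MetadataLike` and `summarize_usage` are hypothetical names, not part of Parsefy.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class MetadataLike:
    """Stand-in with the documented ExtractionMetadata fields; not the SDK's own class."""
    processing_time_ms: int
    input_tokens: int
    output_tokens: int
    credits: int
    fallback_triggered: bool


def summarize_usage(items: Iterable[MetadataLike]) -> dict[str, int]:
    """Total credits, tokens, processing time, and fallback count for a batch."""
    summary = {"credits": 0, "tokens": 0, "processing_time_ms": 0, "fallbacks": 0}
    for m in items:
        summary["credits"] += m.credits
        summary["tokens"] += m.input_tokens + m.output_tokens
        summary["processing_time_ms"] += m.processing_time_ms
        summary["fallbacks"] += int(m.fallback_triggered)
    return summary


# Hypothetical usage with real results, which expose the same attribute names:
# print(summarize_usage(r.metadata for r in results if r.error is None))
```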
Supported File Types
- PDF (.pdf)
- Microsoft Word (.docx)
Maximum file size: 10MB
Requirements
- Python 3.10+
- Pydantic 2.0+
- httpx 0.25+
License
MIT License - see LICENSE for details.