Python SDK for the FlexOrch API — process documents, build LLM-ready datasets
flexorch-sdk
Python SDK for the FlexOrch API.
FlexOrch turns unstructured documents (PDF, DOCX, invoices, emails…) into clean, structured, LLM-ready datasets — with automatic PII detection and masking, quality scoring, and multiple export formats.
Install
pip install flexorch-sdk
Requires Python 3.10+. The only dependency is httpx.
Quick start
from flexorch_sdk import FlexOrchClient
client = FlexOrchClient("fx_your_key_here")
# Upload a document and wait for the pipeline to finish
job = client.process("contract.pdf", locale="tr").wait()
print(job.quality_grade) # "A"
print(job.quality_score) # 0.91
# Download the resulting dataset
dataset = job.dataset()
dataset.export("jsonl", path="output.jsonl")
Auth
Pass your API key directly or set the FLEXORCH_API_KEY environment variable:
export FLEXORCH_API_KEY=fx_...
from flexorch_sdk import FlexOrchClient
client = FlexOrchClient() # reads FLEXORCH_API_KEY automatically
Get your API key from app.flexorch.com → Settings.
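The lookup described above can be sketched in a few lines. This is an illustration of the behaviour, not the SDK's internal code: `resolve_api_key` is a hypothetical helper, and the assumption that an explicitly passed key takes precedence over the environment variable is not confirmed by this README.

```python
import os


def resolve_api_key(explicit_key=None, env=None):
    """Return the explicit key if given, else FLEXORCH_API_KEY from the environment.

    Hypothetical helper for illustration; not part of flexorch_sdk.
    """
    env = os.environ if env is None else env
    key = explicit_key or env.get("FLEXORCH_API_KEY")
    if not key:
        raise ValueError("No API key: pass one explicitly or set FLEXORCH_API_KEY")
    return key
```

Failing early with a clear message is usually easier to debug than a 401 returned by the first request.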
Supported input formats
| Category | Formats |
|---|---|
| Documents | PDF (text + scanned), DOCX, TXT |
| Spreadsheets | XLSX |
| Email | EML, MSG |
| E-invoices | XML/UBL (Peppol, GİB TR), FatturaPA (IT), XRechnung (DE), ZUGFeRD/Factur-X |
| Images | JPG, PNG, TIFF (OCR) |
| Web | HTML, HTM |
Export formats
json · jsonl · csv · parquet · md · xml · xlsx · rag
dataset.export("jsonl", path="output.jsonl") # write to file
raw = dataset.export("parquet") # return bytes
The rag format produces LlamaIndex/LangChain-compatible chunks with metadata.
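A jsonl export can be read back with the standard library alone, since the format is one JSON value per line. The sketch below makes no assumption about the record fields, which this README does not specify; `read_jsonl` is a hypothetical helper, not an SDK function.

```python
import json


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, `rows = list(read_jsonl("output.jsonl"))` loads the whole file into memory, while iterating over the generator directly keeps memory use flat for large exports.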
Processing
Single file
job = client.process("invoice.pdf", locale="de").wait()
locale is an IETF language tag used to activate the right PII detectors (tr, de, en, fr, it, nl, es, pl; und = all).
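A mistyped locale is easy to miss, so it can help to check the value before uploading. The sketch below uses only the codes listed above and assumes that list is exhaustive; `check_locale` is a hypothetical helper. Whether the API also accepts region subtags such as `de-AT` or uppercase codes is not stated here, so the helper accepts only the bare lowercase codes.

```python
# Codes taken from the list above; assumed (not confirmed) to be exhaustive.
SUPPORTED_LOCALES = frozenset({"tr", "de", "en", "fr", "it", "nl", "es", "pl", "und"})


def check_locale(locale):
    """Return locale unchanged if it is one of the listed codes, else raise ValueError."""
    if locale not in SUPPORTED_LOCALES:
        raise ValueError(
            f"Unsupported locale {locale!r}; expected one of {sorted(SUPPORTED_LOCALES)}"
        )
    return locale
```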
Batch
jobs = client.process_many(["a.pdf", "b.pdf", "c.pdf"], locale="und")
for job in jobs:
    job.wait()
    print(job.quality_grade, job.quality_score)
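In a batch, one failed document should usually not abort the rest. A minimal sketch of that pattern follows, assuming only that each job exposes `wait()` as in the loop above. `wait_all` is a hypothetical helper, and it catches `Exception` broadly instead of the SDK's specific error classes so that the example stays self-contained.

```python
def wait_all(jobs):
    """Wait on every job and return (finished, failed) without stopping at the first error.

    `failed` is a list of (job, exception) pairs.
    """
    finished, failed = [], []
    for job in jobs:
        try:
            job.wait()
        except Exception as exc:  # narrow this to the SDK's error classes in real code
            failed.append((job, exc))
        else:
            finished.append(job)
    return finished, failed
```

The jobs still wait one after another here; the helper only changes what happens when one of them raises.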
From S3
# Register a connector once; store conn.id for reuse
conn = client.connectors.create(
    "Production S3", "s3",
    {
        "bucket": "my-bucket",
        "region": "eu-central-1",
        "access_key_id": "AKIA...",
        "secret_access_key": "...",
    },
)
# Verify connectivity
result = client.connectors.test(conn.id)
print(result.success, result.latency_ms) # True, 38
# Process files from S3
jobs = client.process_from_s3(conn.id, ["invoices/inv-001.pdf", "invoices/inv-002.pdf"])
for job in jobs:
    job.wait()
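Hard-coded credentials are fine for a demo, but in practice the connector config is usually built from the environment. A sketch, assuming the standard AWS variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; `s3_config_from_env` is a hypothetical helper, and the dictionary keys are the ones used in the connector example above.

```python
import os

_REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def s3_config_from_env(bucket, region, env=None):
    """Build the connector config dict with credentials read from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in _REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "bucket": bucket,
        "region": region,
        "access_key_id": env["AWS_ACCESS_KEY_ID"],
        "secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }
```

The returned dict can then be passed as the third argument to `client.connectors.create`, in place of the literal shown above.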
Job polling
Job.wait() blocks until the pipeline completes or times out.
job = client.process("large-report.pdf").wait(
    timeout=600,      # seconds before TimeoutError (default: 300)
    poll_interval=5,  # polling interval in seconds (default: 2)
)
print(job.status) # "completed"
print(job.quality_grade) # "A" | "B" | "C" | "D"
print(job.quality_score) # 0.0 – 1.0
print(job.has_dataset) # True
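The timeout and poll_interval parameters follow the usual polling pattern: check, sleep, repeat until a deadline. The loop below is a generic illustration of that pattern, not the SDK's actual implementation. `poll_until` and its `check` callback are hypothetical, and the function raises Python's built-in `TimeoutError`.

```python
import time


def poll_until(check, timeout=300.0, poll_interval=2.0):
    """Call check() repeatedly until it returns a non-None value.

    Sleeps poll_interval seconds between calls and raises the built-in
    TimeoutError once `timeout` seconds have elapsed without a result.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = check()
        if result is not None:
            return result
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"no result after {timeout} seconds")
        # Never sleep past the deadline.
        time.sleep(min(poll_interval, remaining))
```

`time.monotonic()` is used for the deadline so that system clock adjustments cannot shorten or extend the wait.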
Dataset operations
ds = job.dataset() # fetch dataset linked to this job
ds = client.datasets.get("dataset-id")
print(ds.name) # "contract-2024-q1"
print(ds.row_count) # 142
print(ds.available_formats) # ["json", "jsonl", "csv", "parquet"]
# Download locally
ds.export("jsonl", path="output.jsonl")
# Push directly to S3
push = ds.export_to_s3(conn.id, "jsonl", prefix="processed/datasets/")
print(push["s3_key"]) # "processed/datasets/contract-2024-q1.jsonl"
print(push["size_bytes"]) # 84320
# Semantic indexing (Pro+)
ds.index()
status = ds.index_status() # {"status": "ready", "chunks_indexed": 48}
Semantic search (Pro+)
results = client.search(
    "payment terms net 30",
    top_k=10,
    filters={
        "document_type": "invoice",
        "language": "de",
        "quality_grade": "A",
        "pii_masked": True,
    },
)
for r in results:
    print(f"{r.score:.3f} [{r.dataset_id}] {r.text[:120]}")
Resources
# Jobs
jobs = client.jobs.list(page=1, page_size=20)
job = client.jobs.get("job-id")
# Datasets
datasets = client.datasets.list()
ds = client.datasets.get("dataset-id")
# Usage
usage = client.usage.current()
print(f"{usage.credits_used} / {usage.credits_limit} credits used")
print(f"Plan: {usage.plan} — resets {usage.reset_at}")
# Webhooks
client.webhooks.register("https://your-server.com/hook", events=["dataset.ready"])
client.webhooks.list()
client.webhooks.delete("webhook-id")
# Connectors
client.connectors.create("name", "s3", {...})
client.connectors.list()
client.connectors.get("connector-id")
client.connectors.test("connector-id")
client.connectors.delete("connector-id")
Error handling
from flexorch_sdk import (
    FlexOrchClient,
    AuthError,        # 401 — invalid or missing API key
    QuotaError,       # 402 — credit limit reached or trial expired
    RateLimitError,   # 429 — too many requests; has .retry_after (seconds)
    NotFoundError,    # 404
    ValidationError,  # 422 — bad request parameters
    ServerError,      # 5xx
    JobFailedError,   # pipeline failed; has .job_id and .failure_reason
    TimeoutError,     # Job.wait() exceeded timeout; has .job_id
)
try:
    job = client.process("doc.pdf").wait(timeout=120)
except AuthError:
    print("Invalid API key — check FLEXORCH_API_KEY")
except QuotaError as e:
    print(f"Out of credits — reset at {e.reset_at}")
except JobFailedError as e:
    print(f"Pipeline failed for job {e.job_id}: {e.failure_reason}")
except TimeoutError as e:
    print(f"Job {e.job_id} still running after timeout — poll manually")
The SDK automatically retries 429 and 5xx responses with exponential backoff (up to 3 attempts by default).
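Exponential backoff means the delay doubles after each failed attempt. The sketch below illustrates the idea only and is not the SDK's retry code. `retry_with_backoff`, `TransientError`, the 0.5 second base delay, and the reading of "3 attempts" as three calls in total are all assumptions made for the example.

```python
import time


class TransientError(Exception):
    """Stand-in for a retryable failure such as a 429 or 5xx response."""


def retry_with_backoff(call, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run call(); on TransientError wait base_delay * 2**attempt and try again.

    Makes at most max_attempts calls and re-raises the last error when they run out.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

With the defaults this waits 0.5 s after the first failure and 1.0 s after the second. The `sleep` parameter exists only so the delays can be replaced in tests.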
Configuration
client = FlexOrchClient(
    api_key="fx_...",
    base_url="https://api.flexorch.com/v1",  # override for self-hosted
    timeout=60.0,                            # HTTP timeout per request in seconds
    max_retries=5,                           # retry attempts for transient errors
)
Context manager
with FlexOrchClient() as client:
    job = client.process("report.pdf").wait()
    job.dataset().export("jsonl", path="report.jsonl")
# HTTP connection pool released automatically
Examples
See examples/ for runnable scripts:
| File | Description |
|---|---|
| basic_process.py | Process a single document and export as JSONL |
| batch_process.py | Process multiple files with error handling |
| s3_import.py | Import from S3, process, export results back to S3 |
Development
git clone https://github.com/flexorch/flexorch-sdk
cd flexorch-sdk
pip install -e ".[dev]"
pytest
Tests use respx to mock httpx — no network calls, no API key needed.
Links
- Platform
- API reference
- flexorch-audit — open-source PII detection library
License