Python SDK for the FlexOrch API — process documents, build LLM-ready datasets

flexorch-sdk

Python SDK for the FlexOrch API.

FlexOrch turns unstructured documents (PDF, DOCX, invoices, emails…) into clean, structured, LLM-ready datasets — with automatic PII detection and masking, quality scoring, and multiple export formats.


Install

pip install flexorch-sdk

Requires Python 3.10+. The only dependency is httpx.


Quick start

from flexorch_sdk import FlexOrchClient

client = FlexOrchClient("fx_your_key_here")

# Upload a document and wait for the pipeline to finish
job = client.process("contract.pdf", locale="tr").wait()

print(job.quality_grade)   # "A"
print(job.quality_score)   # 0.91

# Download the resulting dataset
dataset = job.dataset()
dataset.export("jsonl", path="output.jsonl")

Auth

Pass your API key directly or set the FLEXORCH_API_KEY environment variable:

export FLEXORCH_API_KEY=fx_...

from flexorch_sdk import FlexOrchClient

client = FlexOrchClient()   # reads FLEXORCH_API_KEY automatically

Get your API key from app.flexorch.com → Settings.
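
The lookup described above can be sketched with the standard library alone. This is a minimal illustration, not the SDK's actual code: the helper name resolve_api_key and the rule that an explicit key wins over the environment variable are assumptions.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(explicit: Optional[str] = None,
                    env: Mapping[str, str] = os.environ) -> str:
    """Return the explicit key if given, else FLEXORCH_API_KEY from env.

    Hypothetical helper: the precedence (argument first, then environment)
    is an assumption about how such a client would typically behave.
    """
    key = explicit or env.get("FLEXORCH_API_KEY")
    if not key:
        raise ValueError("No API key: pass one or set FLEXORCH_API_KEY")
    return key
```

Failing early with a clear message when neither source provides a key tends to be easier to debug than a 401 from the first request.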


Supported input formats

Category      Formats
Documents     PDF (text + scanned), DOCX, TXT
Spreadsheets  XLSX
Email         EML, MSG
E-invoices    XML/UBL (Peppol, GİB TR), FatturaPA (IT), XRechnung (DE), ZUGFeRD/Factur-X
Images        JPG, PNG, TIFF (OCR)
Web           HTML, HTM

Export formats

json · jsonl · csv · parquet · md · xml · xlsx · rag

dataset.export("jsonl", path="output.jsonl")   # write to file
raw = dataset.export("parquet")                # return bytes

The rag format produces LlamaIndex/LangChain-compatible chunks with metadata.
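
To inspect a jsonl export after downloading it, a few lines of standard-library Python are enough. The sketch below assumes only that the file holds one JSON object per line; the helper name read_jsonl is hypothetical, and the fields inside each record depend on your dataset.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

For example, `rows = list(read_jsonl("output.jsonl"))` loads the whole export into memory, while iterating the generator directly keeps memory use flat for large files.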


Processing

Single file

job = client.process("invoice.pdf", locale="de").wait()

locale is an IETF language tag that selects which PII detectors run. Supported values: tr, de, en, fr, it, nl, es, pl, or und to run all of them.

Batch

jobs = client.process_many(["a.pdf", "b.pdf", "c.pdf"], locale="und")
for job in jobs:
    job.wait()
    print(job.quality_grade, job.quality_score)
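
In the loop above, the first job that raises stops the whole batch. One way to keep going is to collect failures and report them at the end. The sketch below is SDK-agnostic: wait_all is a hypothetical helper that works with any object exposing wait(), and it catches Exception broadly only for illustration.

```python
from typing import Any, Iterable, List, Tuple


def wait_all(jobs: Iterable[Any]) -> Tuple[List[Any], List[Tuple[Any, Exception]]]:
    """Wait on each job; collect failures instead of stopping at the first."""
    done: List[Any] = []
    failed: List[Tuple[Any, Exception]] = []
    for job in jobs:
        try:
            job.wait()
            done.append(job)
        except Exception as exc:  # in real code, narrow this to the SDK's errors
            failed.append((job, exc))
    return done, failed
```

With the SDK, you would probably narrow the except clause to JobFailedError and TimeoutError, so that authentication or quota errors still abort the run.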

From S3

# Register a connector once; store conn.id for reuse
conn = client.connectors.create(
    "Production S3", "s3",
    {
        "bucket": "my-bucket",
        "region": "eu-central-1",
        "access_key_id": "AKIA...",
        "secret_access_key": "...",
    },
)

# Verify connectivity
result = client.connectors.test(conn.id)
print(result.success, result.latency_ms)   # True, 38

# Process files from S3
jobs = client.process_from_s3(conn.id, ["invoices/inv-001.pdf", "invoices/inv-002.pdf"])
for job in jobs:
    job.wait()

Job polling

Job.wait() blocks until the pipeline completes, polling every poll_interval seconds, and raises TimeoutError if timeout seconds pass first.

job = client.process("large-report.pdf").wait(
    timeout=600,       # seconds before TimeoutError (default: 300)
    poll_interval=5,   # polling interval in seconds (default: 2)
)

print(job.status)        # "completed"
print(job.quality_grade) # "A" | "B" | "C" | "D"
print(job.quality_score) # 0.0 – 1.0
print(job.has_dataset)   # True
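
The general shape of a blocking wait like this is a poll loop with a deadline. The following is a minimal sketch of that pattern, not the SDK's implementation: poll_until_done and fetch_status are hypothetical names, the "failed" and "processing" status strings are assumptions (only "completed" appears above), and it raises Python's built-in TimeoutError.

```python
import time
from typing import Callable


def poll_until_done(fetch_status: Callable[[], str],
                    timeout: float = 300.0,
                    poll_interval: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> str:
    """Call fetch_status() until it returns a terminal state or time runs out."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in ("completed", "failed"):  # assumed terminal states
            return status
        # Give up if the next poll would land after the deadline.
        if clock() + poll_interval > deadline:
            raise TimeoutError(f"still {status!r} after {timeout}s")
        sleep(poll_interval)
```

Using a monotonic clock for the deadline avoids surprises when the system clock is adjusted mid-wait. Injecting sleep and clock keeps the loop testable without real delays.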

Dataset operations

ds = job.dataset()          # fetch dataset linked to this job
ds = client.datasets.get("dataset-id")

print(ds.name)              # "contract-2024-q1"
print(ds.row_count)         # 142
print(ds.available_formats) # ["json", "jsonl", "csv", "parquet"]

# Download locally
ds.export("jsonl", path="output.jsonl")

# Push directly to S3
push = ds.export_to_s3(conn.id, "jsonl", prefix="processed/datasets/")
print(push["s3_key"])       # "processed/datasets/contract-2024-q1.jsonl"
print(push["size_bytes"])   # 84320

# Semantic indexing (Pro+)
ds.index()
status = ds.index_status()  # {"status": "ready", "chunks_indexed": 48}

Semantic search (Pro+)

results = client.search(
    "payment terms net 30",
    top_k=10,
    filters={
        "document_type": "invoice",
        "language": "de",
        "quality_grade": "A",
        "pii_masked": True,
    },
)

for r in results:
    print(f"{r.score:.3f}  [{r.dataset_id}]  {r.text[:120]}")

Resources

# Jobs
jobs = client.jobs.list(page=1, page_size=20)
job  = client.jobs.get("job-id")

# Datasets
datasets = client.datasets.list()
ds       = client.datasets.get("dataset-id")

# Usage
usage = client.usage.current()
print(f"{usage.credits_used} / {usage.credits_limit} credits used")
print(f"Plan: {usage.plan}  —  resets {usage.reset_at}")

# Webhooks
client.webhooks.register("https://your-server.com/hook", events=["dataset.ready"])
client.webhooks.list()
client.webhooks.delete("webhook-id")

# Connectors
client.connectors.create("name", "s3", {...})
client.connectors.list()
client.connectors.get("connector-id")
client.connectors.test("connector-id")
client.connectors.delete("connector-id")

Error handling

from flexorch_sdk import (
    FlexOrchClient,
    AuthError,       # 401 — invalid or missing API key
    QuotaError,      # 402 — credit limit reached or trial expired
    RateLimitError,  # 429 — too many requests; has .retry_after (seconds)
    NotFoundError,   # 404
    ValidationError, # 422 — bad request parameters
    ServerError,     # 5xx
    JobFailedError,  # pipeline failed; has .job_id and .failure_reason
    TimeoutError,    # Job.wait() exceeded timeout; has .job_id (shadows the built-in TimeoutError here)
)

try:
    job = client.process("doc.pdf").wait(timeout=120)
except AuthError:
    print("Invalid API key — check FLEXORCH_API_KEY")
except QuotaError as e:
    print(f"Out of credits — reset at {e.reset_at}")
except JobFailedError as e:
    print(f"Pipeline failed for job {e.job_id}: {e.failure_reason}")
except TimeoutError as e:
    print(f"Job {e.job_id} still running after timeout — poll manually")

The SDK automatically retries 429 and 5xx responses with exponential backoff (up to 3 attempts by default).
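
For readers unfamiliar with the pattern, retrying with exponential backoff can be sketched as below. This illustrates the technique only. The SDK's real delays are not documented here, so the base delay of 0.5 s, the 8 s cap, honoring a retry-after value when present, and reading "3 attempts" as three total tries are all assumptions. The names call_with_retries, backoff_delay, and send are hypothetical.

```python
import time
from typing import Callable, Optional, Tuple

# Status codes treated as transient: 429 plus every 5xx.
RETRYABLE = {429} | set(range(500, 600))


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Delay after failed attempt number `attempt` (1-based), doubling each time."""
    return min(cap, base * 2 ** (attempt - 1))


def call_with_retries(send: Callable[[], Tuple[int, Optional[float]]],
                      max_attempts: int = 3,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send() until it returns a non-retryable status or attempts run out.

    send() returns (status_code, retry_after_seconds_or_None).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status = 0
    for attempt in range(1, max_attempts + 1):
        status, retry_after = send()
        if status not in RETRYABLE or attempt == max_attempts:
            break
        # Prefer the server's hint when it provides one.
        sleep(retry_after if retry_after is not None else backoff_delay(attempt))
    return status
```

Production implementations usually add random jitter to each delay so that many clients do not retry in lockstep. It is left out here to keep the delays deterministic.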


Configuration

client = FlexOrchClient(
    api_key="fx_...",
    base_url="https://api.flexorch.com/v1",  # override for self-hosted
    timeout=60.0,       # HTTP timeout per request in seconds
    max_retries=5,      # retry attempts for transient errors
)

Context manager

with FlexOrchClient() as client:
    job = client.process("report.pdf").wait()
    job.dataset().export("jsonl", path="report.jsonl")
# HTTP connection pool released automatically
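
The with statement guarantees cleanup even when the body raises. The sketch below shows the mechanism with a stand-in class. PooledClient and its close() method are hypothetical and say nothing about how FlexOrchClient is implemented internally.

```python
class PooledClient:
    """Hypothetical stand-in for a client that owns a connection pool."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        # A real client would release its HTTP connections here.
        self.closed = True

    def __enter__(self) -> "PooledClient":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.close()
        return False  # propagate any exception raised in the with body
```

Returning False from __exit__ means errors from the body still reach the caller. The context manager only ensures that close() runs first.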

Examples

See examples/ for runnable scripts:

File              Description
basic_process.py  Process a single document and export as JSONL
batch_process.py  Process multiple files with error handling
s3_import.py      Import from S3, process, export results back to S3

Development

git clone https://github.com/flexorch/flexorch-sdk
cd flexorch-sdk
pip install -e ".[dev]"
pytest

Tests use respx to mock httpx — no network calls, no API key needed.


License

MIT
