Python SDK for the SymageDocs synthetic data API

SymageDocs Python SDK

Generate synthetic documents, identities, and tabular datasets for testing, ML training, and compliance.

Installation

pip install symagedocs

For progress bars during long jobs:

pip install symagedocs[progress]

Quick Start

from symagedocs import Client

client = Client(api_key="sk_live_...")

# List available forms
forms = client.forms.list()
for f in forms:
    print(f"{f.id}: {f.name} ({f.credit_cost} credits)")

# Generate 100 W-2 documents
# JSON ground truth and CSV are always included in the bundle — no need to request them.
job = client.generate.create(
    "irs_w2_2025",
    quantity=100,
    output_formats=["pdf_typed"],
    # Augmentation knobs (STAX-1524). `degradation_profile` affects credit cost —
    # `scanned`/`faxed` add 20%, `photographed` 30%, `mixed` 25% (`clean` = no surcharge).
    # `coherence_mode` controls cross-form identity correlation in multi-form jobs.
    degradation_profile="scanned",
    coherence_mode="coherent",
)
result = client.generate.wait(job.job_id)  # polls until complete
client.generate.download(job.job_id, "bundle", "./w2_documents.zip")

# Per-item training data via the unified jobs API (STAX-1600)
job = client.generate.create(
    form_id="irs_w2_2025",
    quantity=10,
    output_formats=["pdf_typed", "bio"],
    idempotency_key="my-retry-safe-key",
)
client.generate.wait(job.job_id)
for example in client.generate.iter_training_examples(job.job_id, format="bio"):
    print(example.item_id, len(example.bio.tokens))

# Generate tabular data from a description
schema = client.tabular.parse("name, age, SSN, city, state, annual income")
tab_job = client.tabular.generate(columns=schema.columns, quantity=5000)
client.tabular.wait(tab_job.job_id)
client.tabular.download(tab_job.job_id, "csv", "./dataset.csv")

# Check credit balance
balance = client.account.balance()
print(f"Credits used: {balance.credits_used}")

Authentication

Get your API key at symagedocs.ai/account?tab=api.

# Pass directly
client = Client(api_key="sk_live_...")

# Or set environment variable
# export SYMAGEDOCS_API_KEY=sk_live_...
client = Client()  # reads from env
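
The two options above amount to a simple precedence rule. A minimal sketch, assuming an explicit argument takes priority over the environment variable; the SDK's actual lookup order and error type are not documented here, and `resolve_api_key` is a hypothetical helper, not part of the SDK:

```python
import os

ENV_VAR = "SYMAGEDOCS_API_KEY"  # variable name taken from the section above


def resolve_api_key(api_key=None, environ=None):
    """Hypothetical helper: return the explicit key, else the env value."""
    environ = os.environ if environ is None else environ
    key = api_key or environ.get(ENV_VAR)
    if not key:
        # Assumed failure mode; the SDK may raise a different exception.
        raise ValueError(f"No API key: pass api_key=... or set {ENV_VAR}")
    return key
```

Passing `environ` explicitly keeps the helper easy to test without touching the real process environment.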

Async Support

import asyncio
from symagedocs import AsyncClient

# `async with` and `await` are only valid inside a coroutine.
async def main():
    async with AsyncClient(api_key="sk_live_...") as client:
        forms = await client.forms.list()
        job = await client.generate.create("irs_w2_2025", quantity=10)
        result = await client.generate.wait(job.job_id)

asyncio.run(main())

Configuration

client = Client(
    api_key="sk_live_...",
    base_url="https://symagedocs.ai",  # API base URL; override for a custom server
    timeout=30.0,                       # request timeout (seconds)
    max_retries=3,                      # retry on 429/5xx
)
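
`max_retries` bounds how many times a request is retried after a 429 or 5xx response. The SDK's actual backoff schedule is not documented here, so the following is only a generic sketch of bounded retry with exponential backoff. `call_with_retries`, `send`, and the delay values are hypothetical.

```python
import time


def call_with_retries(send, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or retries run out.

    send is any zero-argument callable that returns an HTTP status code.
    """
    attempt = 0
    while True:
        status = send()
        retryable = status == 429 or 500 <= status <= 599
        if not retryable or attempt >= max_retries:
            return status
        # Assumed schedule: the delay doubles after every failed attempt.
        sleep(base_delay * (2 ** attempt))
        attempt += 1
```

With `max_retries=3` the request is sent at most four times: the initial attempt plus three retries.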

Method Reference

Forms

Method	Description
`forms.list(category=None)`	List available forms, optionally filtered by category
`forms.get(form_id)`	Get detailed form info including field definitions

Generation

Method	Description
`generate.create(form_id=None, *, form_ids=None, quantity=1, output_formats=["pdf_typed"], config=None, seed=None, webhook_url=None, ink_color=None, ink_color_distribution=None, writer_consistency=None, degradation_profile=None, coherence_mode=None, idempotency_key=None)`	Create an async generation job. Pass either `form_id` (single form) or `form_ids` (coherent multi-form generation across the same identity). `ink_color_distribution` (when set) must be a per-color weight map summing to exactly `100`. `degradation_profile` and `coherence_mode` are typed kwargs over what used to live inside `config={...}` — see the augmentation knobs section. `idempotency_key` (STAX-1600) attaches an `Idempotency-Key` header so retries within 24 hours return the original `job_id` and don't double-charge. The deprecated `realism_level` API field is intentionally not exposed; call the REST API directly if you need it.
`generate.list_jobs(limit=50, cursor=None, status=None)`	List generation jobs (cursor-paginated)
`generate.get_job(job_id)`	Get full job status and progress
`generate.list_downloads(job_id)`	List per-artifact presigned download URLs for a completed job
`generate.download(job_id, format, path)`	Download job output to a local file. Allowed for terminal-but-not-completed jobs (CANCELED / FAILED / EXPIRED) so partial output is recoverable (STAX-1600).
`generate.wait(job_id, poll_interval=3.0)`	Poll until job completes or fails
`generate.cancel(job_id)`	Cancel a running job (STAX-1600). Idempotent. Items rendered before the cancel was observed remain downloadable via `download(format="bundle")`.
`generate.list_items(job_id, limit=50, cursor=None)`	List per-item records for a job (STAX-1600 — moved from `batches.list_items`). Cursor-paginated; each item carries its presigned download URLs.
`generate.download_item(job_id, item_id)`	Presigned S3 URLs for one item's files (STAX-1600 — moved from `batches.download_urls`).
`generate.get_bio_labels(job_id, item_id)`	Client-side helper: fetches the item's `_bio.json` sidecar and returns a parsed `BioDataset` (STAX-1600).
`generate.get_word_annotations(job_id, item_id)`	Client-side helper: fetches the item's `_words.json` sidecar and returns parsed `WordAnnotations` (STAX-1600).
`generate.iter_training_examples(job_id, format="bio")`	Client-side helper: iterates all items, yielding training examples in the chosen format (`"bio"` (default), `"funsd"`, `"donut"`) (STAX-1600).
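
Because `ink_color_distribution` must sum to exactly `100`, a local check before calling `generate.create` can catch a bad weight map early. A minimal sketch; `check_ink_distribution` is a hypothetical helper rather than an SDK function, and the color names in the usage note are placeholders because the accepted keys are not listed here.

```python
def check_ink_distribution(weights):
    """Raise ValueError unless the per-color weights sum to exactly 100."""
    total = sum(weights.values())
    if total != 100:
        raise ValueError(f"weights sum to {total}, expected exactly 100")
    return weights
```

For example, `check_ink_distribution({"black": 70, "blue": 30})` passes, while `{"black": 70, "blue": 20}` raises.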

client.generation alias (STAX-1600). client.generation and client.generate reference the same resource — use whichever name you prefer. The historical client.batches.* namespace was removed in this change; per-item training-example access lives on client.generate.* (or client.generation.*).
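
`list_jobs` and `list_items` are cursor-paginated. The shape of a page object is not shown in this README, so the sketch below assumes each page exposes `items` and `next_cursor` (both hypothetical attribute names) and uses a stand-in fetch function instead of the real client.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional


@dataclass
class Page:
    """Assumed page shape; the real SDK's field names may differ."""
    items: List[str] = field(default_factory=list)
    next_cursor: Optional[str] = None


def iter_all(fetch: Callable[[Optional[str]], Page]) -> Iterator[str]:
    """Yield every item by following cursors until none is returned."""
    cursor = None
    while True:
        page = fetch(cursor)
        yield from page.items
        cursor = page.next_cursor
        if cursor is None:
            break


# Stand-in for something like:
#   lambda c: client.generate.list_items(job_id, cursor=c)
FAKE_PAGES = {None: Page(["a", "b"], "c1"), "c1": Page(["c"], None)}
all_items = list(iter_all(FAKE_PAGES.get))
```

The first call passes `cursor=None`, matching the default in the method signatures above.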

Identities

Method	Description
`identities.generate(quantity=1, config=None, seed=None)`	Generate raw synthetic identities as JSON

Tabular

Method	Description
`tabular.parse(prompt)`	Convert natural language to a column schema (LLM-powered)
`tabular.generate(columns, quantity=100, output_formats=["csv"], seed=None)`	Create a tabular generation job
`tabular.status(job_id)`	Get tabular job progress and ETA
`tabular.download(job_id, format, path)`	Download tabular output to a local file
`tabular.wait(job_id, poll_interval=2.0)`	Poll until tabular job completes or fails
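
The `wait` helpers poll at `poll_interval` until the job completes or fails. A rough sketch of such a loop with an added overall timeout; the status strings, the `timeout` parameter, and `wait_for` itself are assumptions for illustration, not the SDK's implementation.

```python
import time

# Assumed status names; the API's actual values may be spelled differently.
TERMINAL = {"completed", "failed", "canceled", "expired"}


def wait_for(get_status, poll_interval=2.0, timeout=600.0,
             sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns a terminal status or time runs out."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout}s")
        sleep(poll_interval)
```

Injecting `sleep` and `clock` lets the loop be exercised in tests without real delays.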

Account

Method	Description
`account.balance()`	Get credit balance (`credits_used`, `credits_allocated`)
`account.usage(days=30)`	Get usage summary for the specified period

Pricing

The pricing endpoints are public/unauthenticated on the backend, but the SDK still requires an API key at construction time for consistency; the auth header is sent and ignored by these routes.

Method	Description
`pricing.rates()`	Get the current credit rate constants (CSV per-row rate, PDF base + surcharge bands, multipliers, …)
`pricing.estimate(*, field_count, output_formats, record_count, degradation_profile=None)`	Estimate the credit cost of a hypothetical job before submitting it

Health

Method	Description
`client.health()`	Lightweight reachability probe (`GET /api/v1/health`). Returns the parsed JSON body. Works on both `Client` and `AsyncClient`.

Augmentation knobs (STAX-1524)

Two of the most-used keys in the freeform config={...} dict on generate.create are also exposed as typed kwargs:

degradation_profile: Literal["clean", "scanned", "faxed", "photographed", "mixed"] | None
coherence_mode: Literal["coherent", "shuffled", "random"] | None

Why bother? Two reasons:

1. degradation_profile affects credit cost. Non-clean profiles need extra rendering work (rasterization, noise, paper warp), so the billing engine applies a multiplier: scanned/faxed are billed at 1.2×, mixed at 1.25×, and photographed at 1.3×. A typo in the freeform config={...} form (a misspelled key, for instance) can silently fall back to the default 1.0× multiplier: you don't get the degradation you asked for, and nothing flags the mistake until you notice the missing artifacts, if you notice at all. The typed kwarg form catches such typos at type-check time.
2. Pre-flight validation. The Literal types flag unknown values at edit time in any IDE that supports type checking. The backend also rejects unknown values for both knobs with a 400, so even untyped callers get a fast failure, but the typed form catches the mistake before the network round-trip.
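
To make the multipliers concrete: a job with a base cost of 100 credits would come to 120 credits with `scanned` or `faxed`, 125 with `mixed`, and 130 with `photographed`. A small sketch of that arithmetic follows. The multipliers come from the text above; the base cost is made up, and how the billing engine rounds is not stated here, so the helper does no rounding. `pricing.estimate` remains the authoritative source for real numbers.

```python
from decimal import Decimal

# Multipliers as stated above; "clean" carries no surcharge.
MULTIPLIERS = {
    "clean": Decimal("1.0"),
    "scanned": Decimal("1.2"),
    "faxed": Decimal("1.2"),
    "mixed": Decimal("1.25"),
    "photographed": Decimal("1.3"),
}


def surcharged_cost(base_credits, profile="clean"):
    """Return base_credits scaled by the profile's multiplier (no rounding)."""
    if profile not in MULTIPLIERS:
        raise ValueError(f"unknown degradation_profile: {profile!r}")
    return Decimal(base_credits) * MULTIPLIERS[profile]
```

`Decimal` is used so that 1.2 and 1.3 are represented exactly rather than as binary floats.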

The SDK exports the canonical value tuples too:

from symagedocs import DEGRADATION_PROFILES, COHERENCE_MODES

assert "scanned" in DEGRADATION_PROFILES
assert "coherent" in COHERENCE_MODES

If you pass a value via both forms (e.g. config={"degradation_profile": "X"} AND degradation_profile="Y"), the value in config wins and a RuntimeWarning is emitted so the conflict isn't silent.

# Typed kwarg form — recommended.
job = client.generate.create(
    "irs_w2_2025",
    quantity=100,
    degradation_profile="scanned",   # billed at 1.2× — see above
    coherence_mode="coherent",
)

# Equivalent freeform form — still supported, but typos cost money.
job = client.generate.create(
    "irs_w2_2025",
    quantity=100,
    config={"degradation_profile": "scanned", "coherence_mode": "coherent"},
)
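
The precedence rule described above (the `config` value wins and a `RuntimeWarning` is emitted) can be pictured with a stand-in merge function. This illustrates the documented behavior and is not the SDK's code; `merge_knob` is hypothetical, and warning only when the two values differ is an assumption.

```python
import warnings


def merge_knob(name, kwarg_value, config):
    """Resolve one knob given a typed kwarg and a freeform config dict."""
    config = dict(config or {})
    if kwarg_value is None:
        return config.get(name)
    if name in config and config[name] != kwarg_value:
        # Documented rule: the config value wins, but not silently.
        warnings.warn(
            f"{name} given twice; using config value {config[name]!r}",
            RuntimeWarning,
            stacklevel=2,
        )
        return config[name]
    return kwarg_value
```

Running a test suite with `-W error::RuntimeWarning` would turn such a conflict into a hard failure.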

Error Handling

The SDK raises typed exceptions for API errors and retries automatically on 429 and 5xx:

from symagedocs import Client, AuthenticationError, RateLimitError, NotFoundError

client = Client()  # reads SYMAGEDOCS_API_KEY from the environment

try:
    forms = client.forms.list()
except AuthenticationError:
    print("Invalid API key")
except RateLimitError:
    print("Still rate limited after the SDK's automatic retries")
except NotFoundError:
    print("Resource not found")

All error classes:

Exception	HTTP Code	Description
`SymageDocsError`	—	Base exception for all SDK errors
`AuthenticationError`	401	Invalid or revoked API key
`PermissionDeniedError`	403	Key missing required scope
`NotFoundError`	404	Resource not found
`ValidationError`	400	Invalid request parameters
`InsufficientCreditsError`	402	Not enough credits for the operation
`ConflictError`	409	Resource in unexpected state (e.g., downloading from a job that is still running)
`RateLimitError`	429	Rate limit exceeded (SDK retries automatically)
`ServerError`	5xx	Server-side error (SDK retries automatically)
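
The table reads as a status-code dispatch. The sketch below re-creates it with local stand-in classes so that it runs without the SDK installed. The real classes live in `symagedocs`; `error_for_status` is hypothetical, and the fallback to the base class for unlisted 4xx codes is an assumption.

```python
class SymageDocsError(Exception):
    """Local stand-in for the SDK's base exception."""


# One stand-in subclass per row of the table above.
_NAMES = {
    400: "ValidationError", 401: "AuthenticationError",
    402: "InsufficientCreditsError", 403: "PermissionDeniedError",
    404: "NotFoundError", 409: "ConflictError", 429: "RateLimitError",
}
_BY_STATUS = {
    code: type(name, (SymageDocsError,), {}) for code, name in _NAMES.items()
}
ServerError = type("ServerError", (SymageDocsError,), {})


def error_for_status(status):
    """Return the exception class for an HTTP status, or None on success."""
    if status in _BY_STATUS:
        return _BY_STATUS[status]
    if 500 <= status <= 599:
        return ServerError
    if status >= 400:
        return SymageDocsError  # assumed fallback for unlisted 4xx codes
    return None
```

Since every class derives from `SymageDocsError`, a single `except SymageDocsError` clause is enough when the caller does not need to distinguish cases.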

Examples

See examples/ in the downloaded SDK for complete working scripts:

list_forms.py — Browse available forms and credit costs
generate_w2s.py — Full pipeline: create job, wait, download PDF + JSON
tabular_dataset.py — Parse NL description, generate 5k rows, download CSV
train_kie_model.py — Create batch with NIST3 labels, iterate training examples with BIO labels and spatial annotations

Documentation

API User Manual — long-form guide with worked examples
API Explorer — interactive Swagger UI
API Reference — three-panel ReDoc reference

License

MIT

Release history

Version	Release date
1.0.4	May 15, 2026
1.0.3	May 12, 2026
1.0.2 (this version)	May 8, 2026
1.0.1	Apr 24, 2026
1.0.0	Mar 25, 2026

Download files

Source Distribution

symagedocs-1.0.2.tar.gz (35.5 kB)

Uploaded May 8, 2026 Source

Built Distribution

symagedocs-1.0.2-py3-none-any.whl (31.4 kB)

Uploaded May 8, 2026 Python 3

File details

Details for the file symagedocs-1.0.2.tar.gz.

File metadata

Download URL: symagedocs-1.0.2.tar.gz
Upload date: May 8, 2026
Size: 35.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for symagedocs-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`39a907edaabc4808a7ac2bc82c71a7adf369e25800dfbfe01319001d2fb9d6f3`
MD5	`4d2fb457daef7c225ee16bd5061a0a84`
BLAKE2b-256	`a6573a21e0d9b1deab50635c9154708bd38ec14392a7c5e1046898e41ed250d9`

File details

Details for the file symagedocs-1.0.2-py3-none-any.whl.

File metadata

Download URL: symagedocs-1.0.2-py3-none-any.whl
Upload date: May 8, 2026
Size: 31.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for symagedocs-1.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9a7f2b7188fe178a384efdbc0874082eac81838e345b2ce55191272e29a24639`
MD5	`1ce9215f26fe02febb63f60e3a0a74e8`
BLAKE2b-256	`72d11c67dd0d8859a7a648ae90d26f29fea877aba92b1e24921473d060da9eaf`
