Python SDK for the SymageDocs synthetic data API

SymageDocs Python SDK

Generate synthetic documents, identities, and tabular datasets for testing, ML training, and compliance.

Installation

pip install symagedocs

For progress bars during long jobs:

pip install symagedocs[progress]

Quick Start

from symagedocs import Client

client = Client(api_key="sk_live_...")

# List available forms
forms = client.forms.list()
for f in forms:
    print(f"{f.id}: {f.name} ({f.credit_cost} credits)")

# Generate 100 W-2 documents
# JSON ground truth and CSV are always included in the bundle — no need to request them.
job = client.generate.create(
    "irs_w2_2025",
    quantity=100,
    output_formats=["pdf_typed"],
    # Augmentation knobs (STAX-1524). `degradation_profile` affects credit cost —
    # `scanned`/`faxed` add 20%, `photographed` 30%, `mixed` 25% (`clean` = no surcharge).
    # `coherence_mode` controls cross-form identity correlation in multi-form jobs.
    degradation_profile="scanned",
    coherence_mode="coherent",
)
result = client.generate.wait(job.job_id)  # polls until complete
client.generate.download(job.job_id, "bundle", "./w2_documents.zip")

# Per-item training data via the unified jobs API (STAX-1600)
job = client.generate.create(
    form_id="irs_w2_2025",
    quantity=10,
    output_formats=["pdf_typed", "bio"],
    idempotency_key="my-retry-safe-key",
)
client.generate.wait(job.job_id)
for example in client.generate.iter_training_examples(job.job_id, format="bio"):
    print(example.item_id, len(example.bio.tokens))

# Generate tabular data from a description
schema = client.tabular.parse("name, age, SSN, city, state, annual income")
tab_job = client.tabular.generate(columns=schema.columns, quantity=5000)
client.tabular.wait(tab_job.job_id)
client.tabular.download(tab_job.job_id, "csv", "./dataset.csv")

# Check credit balance
balance = client.account.balance()
print(f"Credits used: {balance.credits_used}")
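The second job in the Quick Start passes a fixed idempotency_key. Below is a minimal sketch of one way to derive such a key from the request parameters, so that a retried call with identical arguments reuses the same key. The helper name make_idempotency_key and the hashing scheme are assumptions for illustration; they are not part of the SDK.

```python
import hashlib
import json


def make_idempotency_key(form_id: str, quantity: int, output_formats: list[str]) -> str:
    """Hypothetical helper: derive a stable key from the job parameters."""
    # Sort the formats and the JSON keys so that equivalent requests
    # serialize to the same bytes regardless of argument order.
    payload = json.dumps(
        {
            "form_id": form_id,
            "quantity": quantity,
            "output_formats": sorted(output_formats),
        },
        sort_keys=True,
    )
    # Truncate the hex digest to keep the key short (length is arbitrary here).
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


key_a = make_idempotency_key("irs_w2_2025", 10, ["pdf_typed", "bio"])
key_b = make_idempotency_key("irs_w2_2025", 10, ["bio", "pdf_typed"])
print(key_a == key_b)  # True
```

The resulting string could be passed as idempotency_key to generate.create. A parameter-derived key means two intentionally separate but identical jobs submitted within the 24-hour window would be treated as one, so mix in a run identifier if that is not what you want.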

Authentication

Get your API key at symagedocs.ai/account?tab=api.

# Pass directly
client = Client(api_key="sk_live_...")

# Or set environment variable
# export SYMAGEDOCS_API_KEY=sk_live_...
client = Client()  # reads from env
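If you want to fail fast with a clear message before constructing a client, the lookup can be sketched with the standard library alone. The helper name resolve_api_key is hypothetical, and the assumption that an explicit key takes priority over the environment variable is ours; only the variable name SYMAGEDOCS_API_KEY comes from this README.

```python
import os
from typing import Optional


def resolve_api_key(explicit: Optional[str] = None) -> str:
    """Hypothetical helper: return the key to use, or raise if none is set."""
    # Assumption: an explicitly passed key wins over the environment.
    key = explicit or os.environ.get("SYMAGEDOCS_API_KEY")
    if not key:
        raise RuntimeError("Set SYMAGEDOCS_API_KEY or pass api_key explicitly")
    return key
```

The returned value could then be handed to Client(api_key=...).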

Async Support

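import asyncio

from symagedocs import AsyncClient


async def main():
    # "async with" is only valid inside a coroutine, so wrap the calls in main().
    async with AsyncClient(api_key="sk_live_...") as client:
        forms = await client.forms.list()
        job = await client.generate.create("irs_w2_2025", quantity=10)
        result = await client.generate.wait(job.job_id)


asyncio.run(main())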

Configuration

client = Client(
    api_key="sk_live_...",
    base_url="https://symagedocs.ai",  # custom server
    timeout=30.0,                       # request timeout (seconds)
    max_retries=3,                      # retry on 429/5xx
)
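This README does not describe the SDK's retry schedule. As an illustration of what "retry on 429/5xx" with max_retries=3 typically means, here is a minimal sketch using exponential backoff with jitter. The function call_with_retries, the base_delay parameter, and the backoff formula are all assumptions and may differ from what the SDK does internally.

```python
import random
import time
from typing import Callable, Tuple

# Status codes treated as transient: 429 plus the whole 5xx range.
RETRYABLE = {429} | set(range(500, 600))


def call_with_retries(
    send: Callable[[], Tuple[int, str]],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-retryable status or retries run out."""
    attempt = 0
    while True:
        status, body = send()
        if status not in RETRYABLE or attempt >= max_retries:
            return status, body
        # Exponential backoff: the delay doubles each attempt, scaled by random jitter.
        sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
        attempt += 1


# Stand-in for a flaky endpoint: two transient failures, then success.
responses = iter([(503, "busy"), (429, "slow down"), (200, "ok")])
delays = []
print(call_with_retries(lambda: next(responses), sleep=delays.append))  # (200, 'ok')
print(len(delays))  # 2
```

Injecting sleep as a parameter keeps the sketch testable without real waiting.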

Method Reference

Forms

| Method | Description |
| --- | --- |
| forms.list(category=None) | List available forms, optionally filtered by category |
| forms.get(form_id) | Get detailed form info including field definitions |

Generation

| Method | Description |
| --- | --- |
| generate.create(form_id=None, *, form_ids=None, quantity=1, output_formats=["pdf_typed"], config=None, seed=None, webhook_url=None, ink_color=None, ink_color_distribution=None, writer_consistency=None, degradation_profile=None, coherence_mode=None, idempotency_key=None) | Create an async generation job. Pass either form_id (single form) or form_ids (coherent multi-form generation across the same identity). ink_color_distribution (when set) must be a per-color weight map summing to exactly 100. degradation_profile and coherence_mode are typed kwargs over what used to live inside config={...}; see the augmentation knobs section. idempotency_key (STAX-1600) attaches an Idempotency-Key header so retries within 24 hours return the original job_id and don't double-charge. The deprecated realism_level API field is intentionally not exposed; call the REST API directly if you need it. |
| generate.list_jobs(limit=50, cursor=None, status=None) | List generation jobs (cursor-paginated) |
| generate.get_job(job_id) | Get full job status and progress |
| generate.list_downloads(job_id) | List per-artifact presigned download URLs for a completed job |
| generate.download(job_id, format, path) | Download job output to a local file. Also allowed for terminal-but-not-completed jobs (CANCELED / FAILED / EXPIRED), so partial output is recoverable (STAX-1600). |
| generate.wait(job_id, poll_interval=3.0) | Poll until job completes or fails |
| generate.cancel(job_id) | Cancel a running job (STAX-1600). Idempotent. Items rendered before the cancel was observed remain downloadable via download(format="bundle"). |
| generate.list_items(job_id, limit=50, cursor=None) | List per-item records for a job (STAX-1600; moved from batches.list_items). Cursor-paginated; each item carries its presigned download URLs. |
| generate.download_item(job_id, item_id) | Presigned S3 URLs for one item's files (STAX-1600; moved from batches.download_urls). |
| generate.get_bio_labels(job_id, item_id) | Client-side helper: fetches the item's _bio.json sidecar and returns a parsed BioDataset (STAX-1600). |
| generate.get_word_annotations(job_id, item_id) | Client-side helper: fetches the item's _words.json sidecar and returns parsed WordAnnotations (STAX-1600). |
| generate.iter_training_examples(job_id, format="bio") | Client-side helper: iterates all items, yielding training examples in the chosen format ("bio" (default), "funsd", "donut") (STAX-1600). |
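The "summing to exactly 100" rule for ink_color_distribution is easy to check locally before submitting a job. A minimal sketch follows. The helper name, the color names "black" and "blue", and the assumption that weights are non-negative integers are all illustrative; this README does not list the accepted colors or the weight type.

```python
from typing import Mapping


def validate_ink_color_distribution(weights: Mapping[str, int]) -> None:
    """Raise ValueError unless the per-color weights sum to exactly 100."""
    # Assumption: negative weights are never meaningful.
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total != 100:
        raise ValueError(f"weights sum to {total}, expected exactly 100")


# Hypothetical color names; a 70/30 split passes silently.
validate_ink_color_distribution({"black": 70, "blue": 30})
```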

client.generation alias (STAX-1600). client.generation and client.generate reference the same resource — use whichever name you prefer. The historical client.batches.* namespace was removed in this change; per-item training-example access lives on client.generate.* (or client.generation.*).
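Several of the methods above (list_jobs, list_items) are cursor-paginated. The general pattern for walking every page can be sketched as follows. The page shape used here, a (records, next_cursor) pair with None marking the last page, is an assumption; this README does not show the SDK's actual response fields, so adapt fetch_page to whatever the list call returns.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Page = Tuple[List[dict], Optional[str]]


def iter_all(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[dict]:
    """Yield every record by following cursors until none is returned."""
    cursor: Optional[str] = None
    while True:
        records, cursor = fetch_page(cursor)
        yield from records
        if cursor is None:
            break


# Stand-in for a paginated endpoint: three pages keyed by cursor.
PAGES = {
    None: ([{"item_id": "a"}, {"item_id": "b"}], "page2"),
    "page2": ([{"item_id": "c"}], "page3"),
    "page3": ([], None),
}

print([r["item_id"] for r in iter_all(lambda c: PAGES[c])])  # ['a', 'b', 'c']
```

In real use, fetch_page would wrap a call such as generate.list_items(job_id, cursor=cursor) and unpack its result into the pair expected here.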

Identities

| Method | Description |
| --- | --- |
| identities.generate(quantity=1, config=None, seed=None) | Generate raw synthetic identities as JSON |

Tabular

| Method | Description |
| --- | --- |
| tabular.parse(prompt) | Convert natural language to a column schema (LLM-powered) |
| tabular.generate(columns, quantity=100, output_formats=["csv"], seed=None) | Create a tabular generation job |
| tabular.status(job_id) | Get tabular job progress and ETA |
| tabular.download(job_id, format, path) | Download tabular output to a local file |
| tabular.wait(job_id, poll_interval=2.0) | Poll until tabular job completes or fails |
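The wait helpers (generate.wait and tabular.wait) poll at a fixed interval, and no timeout parameter is listed for them. If you need an upper bound on waiting, a polling loop with a deadline can be sketched like this. The function wait_for, its timeout parameter, and the status strings "COMPLETED" and "RUNNING" are assumptions; only CANCELED, FAILED, and EXPIRED appear in this README.

```python
import time
from typing import Callable

# Assumed terminal states; "COMPLETED" is a guess at the success status name.
TERMINAL = {"COMPLETED", "FAILED", "CANCELED", "EXPIRED"}


def wait_for(
    get_status: Callable[[], str],
    poll_interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() until a terminal state, or raise TimeoutError."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout} seconds")
        sleep(poll_interval)


# Stand-in status source: two in-progress polls, then completion.
statuses = iter(["RUNNING", "RUNNING", "COMPLETED"])
print(wait_for(lambda: next(statuses), sleep=lambda seconds: None))  # COMPLETED
```

In real use, get_status would wrap a call such as tabular.status(job_id) or generate.get_job(job_id) and return its status field.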

Account

| Method | Description |
| --- | --- |
| account.balance() | Get credit balance (credits_used, credits_allocated) |
| account.usage(days=30) | Get usage summary for the specified period |

Pricing

The pricing endpoints are public/unauthenticated on the backend, but the SDK still requires an API key at construction time for consistency; the auth header is sent and ignored by these routes.

| Method | Description |
| --- | --- |
| pricing.rates() | Get the current credit rate constants (CSV per-row rate, PDF base + surcharge bands, multipliers, …) |
| pricing.estimate(*, field_count, output_formats, record_count, degradation_profile=None) | Estimate the credit cost of a hypothetical job before submitting it |

Health

| Method | Description |
| --- | --- |
| client.health() | Lightweight reachability probe (GET /api/v1/health). Returns the parsed JSON body. Works on both Client and AsyncClient. |

Augmentation knobs (STAX-1524)

Two of the most-used keys in the freeform config={...} dict on generate.create are also exposed as typed kwargs:

  • degradation_profile: Literal["clean", "scanned", "faxed", "photographed", "mixed"] | None
  • coherence_mode: Literal["coherent", "shuffled", "random"] | None

Why bother? Two reasons:

  1. degradation_profile affects credit cost. Non-clean profiles need extra rendering work (rasterization, noise, paper warp), so the billing engine applies a multiplier: scanned/faxed are billed at 1.2×, mixed at 1.25×, and photographed at 1.3×. A typo on the freeform config={...} form silently falls back to the default 1.0× multiplier — meaning you don't get the degradation you asked for AND the typo isn't caught until you notice the artifacts (or don't). The typed kwarg form catches typos at type-check time.
  2. Pre-flight validation. The Literal types fence off unknown values at edit time in any IDE that supports type checking. The backend also rejects unknown values with 400 for both knobs, so even untyped callers get a fast failure — but the typed form catches the mistake before the network round-trip.
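To make the multipliers in point 1 concrete, here is a small worked sketch of the arithmetic. The helper surcharged_cost and the 1000-credit base figure are hypothetical, and the sketch assumes the multiplier applies to the whole base cost with no rounding; for an authoritative figure, use pricing.estimate.

```python
from decimal import Decimal

# Multipliers as stated in this section.
MULTIPLIERS = {
    "clean": Decimal("1.0"),
    "scanned": Decimal("1.2"),
    "faxed": Decimal("1.2"),
    "mixed": Decimal("1.25"),
    "photographed": Decimal("1.3"),
}


def surcharged_cost(base_credits: int, profile: str) -> Decimal:
    """Hypothetical helper: apply the profile multiplier to a base cost."""
    if profile not in MULTIPLIERS:
        # Fail loudly instead of silently falling back to the 1.0x default.
        raise ValueError(f"unknown degradation_profile: {profile!r}")
    return base_credits * MULTIPLIERS[profile]


for profile in MULTIPLIERS:
    print(profile, surcharged_cost(1000, profile))
```

Under these assumptions, a job that costs 1000 credits at clean would cost 1200 at scanned or faxed, 1250 at mixed, and 1300 at photographed. Decimal is used so the products are exact rather than subject to binary floating-point rounding.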

The SDK exports the canonical value tuples too:

from symagedocs import DEGRADATION_PROFILES, COHERENCE_MODES

assert "scanned" in DEGRADATION_PROFILES
assert "coherent" in COHERENCE_MODES

If you pass a value via both forms (e.g. config={"degradation_profile": "X"} AND degradation_profile="Y"), the value in config wins and a RuntimeWarning is emitted so the conflict isn't silent.
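The precedence rule above can be sketched with the standard warnings module. The helper resolve_knob is hypothetical and is not the SDK's implementation; in particular, warning only when the two values differ is an assumption.

```python
import warnings
from typing import Any, Dict, Optional


def resolve_knob(
    name: str, config: Optional[Dict[str, Any]], kwarg: Optional[str]
) -> Optional[str]:
    """Hypothetical helper: pick the effective value for one knob."""
    from_config = (config or {}).get(name)
    # Assumption: warn only when both forms are set and they disagree.
    if from_config is not None and kwarg is not None and from_config != kwarg:
        warnings.warn(
            f"{name} set in both config ({from_config!r}) and kwarg ({kwarg!r}); "
            "using the config value",
            RuntimeWarning,
            stacklevel=2,
        )
    # The config value wins when present; otherwise fall back to the kwarg.
    return from_config if from_config is not None else kwarg


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = resolve_knob(
        "degradation_profile", {"degradation_profile": "faxed"}, "scanned"
    )

print(value)  # faxed
print(caught[0].category.__name__)  # RuntimeWarning
```

Running your test suite with warnings promoted to errors (for example python -W error::RuntimeWarning) is one way to surface such conflicts early.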

# Typed kwarg form — recommended.
job = client.generate.create(
    "irs_w2_2025",
    quantity=100,
    degradation_profile="scanned",   # billed at 1.2× — see above
    coherence_mode="coherent",
)

# Equivalent freeform form — still supported, but typos cost money.
job = client.generate.create(
    "irs_w2_2025",
    quantity=100,
    config={"degradation_profile": "scanned", "coherence_mode": "coherent"},
)

Error Handling

The SDK raises typed exceptions for API errors and retries automatically on 429 and 5xx:

from symagedocs import Client, AuthenticationError, RateLimitError, NotFoundError

client = Client()  # reads SYMAGEDOCS_API_KEY from the environment

try:
    forms = client.forms.list()
except AuthenticationError:
    print("Invalid API key")
except RateLimitError:
    # Raised only after the SDK's automatic retries are exhausted.
    print("Too many requests, even after automatic retries")
except NotFoundError:
    print("Resource not found")

All error classes:

| Exception | HTTP Code | Description |
| --- | --- | --- |
| SymageDocsError | n/a | Base exception for all SDK errors |
| AuthenticationError | 401 | Invalid or revoked API key |
| PermissionDeniedError | 403 | Key missing required scope |
| NotFoundError | 404 | Resource not found |
| ValidationError | 400 | Invalid request parameters |
| InsufficientCreditsError | 402 | Not enough credits for the operation |
| ConflictError | 409 | Resource in unexpected state (e.g., downloading incomplete job) |
| RateLimitError | 429 | Rate limit exceeded (SDK retries automatically) |
| ServerError | 5xx | Server-side error (SDK retries automatically) |
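The table maps HTTP status codes to exception classes. The sketch below reproduces that mapping with local stand-in classes so it runs without the SDK installed. The classes defined here only mirror the SDK's names, and the fallback to the base class for unlisted codes is an assumption.

```python
class SymageDocsError(Exception):
    """Local stand-in for the SDK's base exception."""


class ValidationError(SymageDocsError): pass
class AuthenticationError(SymageDocsError): pass
class InsufficientCreditsError(SymageDocsError): pass
class PermissionDeniedError(SymageDocsError): pass
class NotFoundError(SymageDocsError): pass
class ConflictError(SymageDocsError): pass
class RateLimitError(SymageDocsError): pass
class ServerError(SymageDocsError): pass


# Status codes exactly as listed in the table above.
STATUS_TO_ERROR = {
    400: ValidationError,
    401: AuthenticationError,
    402: InsufficientCreditsError,
    403: PermissionDeniedError,
    404: NotFoundError,
    409: ConflictError,
    429: RateLimitError,
}


def error_for_status(status: int) -> type:
    """Return the exception class that corresponds to an HTTP status."""
    if status in STATUS_TO_ERROR:
        return STATUS_TO_ERROR[status]
    if 500 <= status <= 599:
        return ServerError
    # Assumption: any other error status maps to the base class.
    return SymageDocsError


print(error_for_status(404).__name__)  # NotFoundError
print(error_for_status(503).__name__)  # ServerError
```

Because every class derives from SymageDocsError, a single except SymageDocsError clause catches all of them, while narrower clauses can be listed first for specific handling.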

Examples

See examples/ in the downloaded SDK for complete working scripts:

  • list_forms.py — Browse available forms and credit costs
  • generate_w2s.py — Full pipeline: create job, wait, download PDF + JSON
  • tabular_dataset.py — Parse NL description, generate 5k rows, download CSV
  • train_kie_model.py — Create batch with NIST3 labels, iterate training examples with BIO labels and spatial annotations

Documentation

License

MIT
