Python SDK for the SymageDocs synthetic data API

SymageDocs Python SDK

Generate synthetic documents, identities, and tabular datasets for testing, ML training, and compliance.

Installation

pip install symagedocs

For progress bars during long jobs:

pip install "symagedocs[progress]"

Quick Start

from symagedocs import Client

client = Client(api_key="sk_live_...")

# List available forms
forms = client.forms.list()
for f in forms:
    print(f"{f.id}: {f.name} ({f.credit_cost} credits)")

# Generate 100 W-2 documents
# JSON ground truth and CSV are always included in the bundle — no need to request them.
job = client.generate.create(
    "irs_w2_2025",
    quantity=100,
    output_formats=["pdf_typed"],
    # Augmentation knobs. `degradation_profile` affects credit cost —
    # `scanned`/`faxed` add 20%, `photographed` 30%, `mixed` 25% (`clean` = no surcharge).
    # `coherence_mode` controls cross-form identity correlation in multi-form jobs.
    degradation_profile="scanned",
    coherence_mode="coherent",
)
result = client.generate.wait(job.job_id)  # polls until complete
client.generate.download(job.job_id, "bundle", "./w2_documents.zip")

# Per-item training data
job = client.generate.create(
    form_id="irs_w2_2025",
    quantity=10,
    output_formats=["pdf_typed", "bio"],
    idempotency_key="my-retry-safe-key",
)
client.generate.wait(job.job_id)
for example in client.generate.iter_training_examples(job.job_id, format="bio"):
    print(example.item_id, len(example.bio.tokens))

# Generate tabular data from a description
schema = client.tabular.parse("name, age, SSN, city, state, annual income")
tab_job = client.tabular.generate(columns=schema.columns, quantity=5000)
client.tabular.wait(tab_job.job_id)
client.tabular.download(tab_job.job_id, "csv", "./dataset.csv")

# Check credit balance
balance = client.account.balance()
print(f"Credits used: {balance.credits_used}")
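The quick start passes a hand-written idempotency_key. For scripts that may be re-run after a crash, one option is to derive the key from the request parameters, so the same request always maps to the same key. The following is a minimal sketch using only the standard library; the helper name derive_idempotency_key and the "job-" prefix are hypothetical and not part of the SDK, and any length or format limits the API places on keys are not covered here.

```python
import hashlib
import json
from typing import List


def derive_idempotency_key(form_id: str, quantity: int, output_formats: List[str]) -> str:
    """Return a stable key for a given set of request parameters (hypothetical helper)."""
    # Sort keys and formats so logically identical requests serialize identically.
    payload = json.dumps(
        {"form_id": form_id, "quantity": quantity, "output_formats": sorted(output_formats)},
        sort_keys=True,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"job-{digest[:32]}"


key = derive_idempotency_key("irs_w2_2025", 10, ["pdf_typed", "bio"])
```

Because the key depends only on the parameters, two intentionally separate jobs with identical parameters would share a key; mixing in a run-specific value (a batch name, for instance) avoids that.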

Authentication

Get your API key at symagedocs.ai/account?tab=api.

# Pass directly
client = Client(api_key="sk_live_...")

# Or set environment variable
# export SYMAGEDOCS_API_KEY=sk_live_...
client = Client()  # reads from env
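If you want a clear failure before any client is constructed, the lookup order above (explicit key first, then the environment) can be mirrored in a small helper. This is a sketch under assumptions: resolve_api_key is a hypothetical name, and how Client() itself reacts to a missing variable is not documented here.

```python
import os
from typing import Optional

ENV_VAR = "SYMAGEDOCS_API_KEY"  # variable name taken from the README above


def resolve_api_key(explicit: Optional[str] = None) -> str:
    """Prefer an explicit key, then the environment; fail early if neither is set."""
    key = explicit or os.environ.get(ENV_VAR)
    if not key:
        raise RuntimeError(f"No API key: pass one explicitly or set {ENV_VAR}")
    return key
```

The returned value can then be passed as Client(api_key=...).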

Async Support

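import asyncio
from symagedocs import AsyncClient

async def main():
    # "async with" is only valid inside a coroutine, so the calls live in main().
    async with AsyncClient(api_key="sk_live_...") as client:
        forms = await client.forms.list()
        job = await client.generate.create("irs_w2_2025", quantity=10)
        result = await client.generate.wait(job.job_id)

asyncio.run(main())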

Configuration

client = Client(
    api_key="sk_live_...",
    base_url="https://symagedocs.ai",  # custom server
    timeout=30.0,                       # request timeout (seconds)
    max_retries=3,                      # retry on 429/5xx
)

Method Reference

Forms

| Method | Description |
| --- | --- |
| forms.list(category=None) | List available forms, optionally filtered by category |
| forms.get(form_id) | Get detailed form info including field definitions |

Generation

| Method | Description |
| --- | --- |
| generate.create(form_id=None, *, form_ids=None, quantity=1, output_formats=["pdf_typed"], config=None, seed=None, webhook_url=None, ink_color=None, ink_color_distribution=None, writer_consistency=None, degradation_profile=None, coherence_mode=None, idempotency_key=None) | Create an async generation job. See the notes below this table. |
| generate.list_jobs(limit=50, cursor=None, status=None) | List generation jobs (cursor-paginated) |
| generate.get_job(job_id) | Get full job status and progress |
| generate.list_downloads(job_id) | List per-artifact presigned download URLs for a completed job |
| generate.download(job_id, format, path) | Download job output to a local file. Also allowed for jobs that ended without completing (CANCELED / FAILED / EXPIRED), so partial output is recoverable. |
| generate.wait(job_id, poll_interval=3.0) | Poll until job completes or fails |
| generate.cancel(job_id) | Cancel a running job. Idempotent. Items rendered before the cancel was observed remain downloadable via download(format="bundle"). |
| generate.list_items(job_id, limit=50, cursor=None) | List per-item records for a job. Cursor-paginated; each item carries its presigned download URLs. |
| generate.download_item(job_id, item_id) | Presigned S3 URLs for one item's files. |
| generate.get_bio_labels(job_id, item_id) | Client-side helper: fetches the item's _bio.json sidecar and returns a parsed BioDataset. |
| generate.get_word_annotations(job_id, item_id) | Client-side helper: fetches the item's _words.json sidecar and returns parsed WordAnnotations. |
| generate.iter_training_examples(job_id, format="bio") | Client-side helper: iterates all items, yielding training examples in the chosen format: "bio" (default), "funsd", or "donut". |

Notes on generate.create:

  • Pass either form_id (single form) or form_ids (coherent multi-form generation across the same identity).
  • ink_color_distribution, when set, must be a per-color weight map summing to exactly 100.
  • degradation_profile and coherence_mode are typed kwargs for keys that can also be passed inside config={...}; see the Augmentation knobs section.
  • idempotency_key attaches an Idempotency-Key header so that retries within 24 hours return the original job_id and don't double-charge.
  • The deprecated realism_level API field is intentionally not exposed; call the REST API directly if you need it.
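The "summing to exactly 100" rule for ink_color_distribution is easy to get wrong when weights are edited by hand. A minimal pre-flight check might look like the sketch below. The function name is hypothetical, the color names are placeholders rather than a list of accepted values, and the non-negativity check is an assumption that goes beyond what is stated above.

```python
from typing import Dict


def validate_ink_color_distribution(weights: Dict[str, int]) -> Dict[str, int]:
    """Raise ValueError unless the weights are non-negative and sum to exactly 100."""
    if not weights:
        raise ValueError("distribution must not be empty")
    for color, weight in weights.items():
        if weight < 0:  # assumption: negative weights are never meaningful
            raise ValueError(f"negative weight for {color!r}: {weight}")
    total = sum(weights.values())
    if total != 100:
        raise ValueError(f"weights sum to {total}, expected exactly 100")
    return weights


# Placeholder color names; check the API for the accepted set.
distribution = validate_ink_color_distribution({"black": 70, "blue": 30})
```

Integer weights keep the sum exact; with floats, rounding error could make a seemingly correct map fail an exact comparison.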

client.generation is an alias for client.generate. Both names reference the same resource, so use whichever you prefer.
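Several generation methods (list_jobs, list_items) are cursor-paginated. The general pattern for draining such an endpoint is sketched below against an in-memory stand-in, so it runs without network access. The (items, next_cursor) page shape and the names iter_all and fake_fetch are assumptions for illustration; the SDK's actual response objects may expose these fields differently.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Page = Tuple[List[str], Optional[str]]  # (items, next_cursor); this shape is an assumption


def iter_all(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[str]:
    """Yield every item by following cursors until no next cursor is returned."""
    cursor: Optional[str] = None
    while True:
        items, cursor = fetch_page(cursor)
        yield from items
        if cursor is None:
            break


# Stand-in for a paginated endpoint: two pages linked by the cursor "c1".
_FAKE_PAGES = {None: (["job_a", "job_b"], "c1"), "c1": (["job_c"], None)}


def fake_fetch(cursor: Optional[str]) -> Page:
    return _FAKE_PAGES[cursor]


all_jobs = list(iter_all(fake_fetch))
```

With a real client, fetch_page would wrap a call such as generate.list_jobs(cursor=cursor) and unpack whatever item and cursor fields the response provides.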

Identities

| Method | Description |
| --- | --- |
| identities.generate(quantity=1, config=None, seed=None) | Generate raw synthetic identities as JSON |

Tabular

| Method | Description |
| --- | --- |
| tabular.parse(prompt) | Convert natural language to a column schema (LLM-powered) |
| tabular.generate(columns, quantity=100, output_formats=["csv"], seed=None) | Create a tabular generation job |
| tabular.status(job_id) | Get tabular job progress and ETA |
| tabular.download(job_id, format, path) | Download tabular output to a local file |
| tabular.wait(job_id, poll_interval=2.0) | Poll until tabular job completes or fails |
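The wait methods take a poll_interval, but the signatures listed here show no overall timeout. If you need a hard deadline, a polling loop built on a status call might look like the sketch below. It is not the SDK's implementation: wait_until_terminal is hypothetical, CANCELED / FAILED / EXPIRED come from this README, and the names "COMPLETED", "QUEUED", and "RUNNING" are assumed.

```python
import time
from typing import Callable

# CANCELED / FAILED / EXPIRED appear in the README; "COMPLETED" is an assumed name.
TERMINAL_STATUSES = {"COMPLETED", "CANCELED", "FAILED", "EXPIRED"}


def wait_until_terminal(
    get_status: Callable[[], str],
    poll_interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns a terminal status or the deadline would pass."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        # Give up if sleeping one more interval would overshoot the deadline.
        if time.monotonic() + poll_interval > deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} seconds")
        sleep(poll_interval)


# Simulated job that finishes on the third poll; sleep is stubbed so this runs instantly.
_statuses = iter(["QUEUED", "RUNNING", "COMPLETED"])
final = wait_until_terminal(lambda: next(_statuses), sleep=lambda _seconds: None)
```

In real use, get_status would wrap a call such as tabular.status(job_id) or generate.get_job(job_id) and return its status field.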

Account

| Method | Description |
| --- | --- |
| account.balance() | Get credit balance (credits_used, credits_allocated) |
| account.usage(days=30) | Get usage summary for the specified period |

Pricing

The pricing endpoints are public/unauthenticated on the backend, but the SDK still requires an API key at construction time for consistency; the auth header is sent and ignored by these routes.

| Method | Description |
| --- | --- |
| pricing.rates() | Get the current credit rate constants (CSV per-row rate, PDF base + surcharge bands, multipliers, …) |
| pricing.estimate(*, field_count, output_formats, record_count, degradation_profile=None) | Estimate the credit cost of a hypothetical job before submitting it |

Health

| Method | Description |
| --- | --- |
| client.health() | Lightweight reachability probe (GET /api/v1/health). Returns the parsed JSON body. Works on both Client and AsyncClient. |

Augmentation knobs

Two of the most-used keys in the freeform config={...} dict on generate.create are also exposed as typed kwargs:

  • degradation_profile: Literal["clean", "scanned", "faxed", "photographed", "mixed"] | None
  • coherence_mode: Literal["coherent", "shuffled", "random"] | None

Why bother? Two reasons:

  1. degradation_profile affects credit cost. Non-clean profiles need extra rendering work (rasterization, noise, paper warp), so the billing engine applies a multiplier: scanned and faxed are billed at 1.2×, mixed at 1.25×, and photographed at 1.3×. A misspelled key in the freeform config={...} dict silently falls back to the default 1.0× multiplier, so you don't get the degradation you asked for, and nothing flags the mistake until you notice the missing artifacts (or don't). The typed kwarg form catches such typos at type-check time.
  2. Pre-flight validation. The Literal types flag unknown values at edit time in any IDE that supports type checking. The backend also rejects unknown values for both knobs with a 400, so even untyped callers get a fast failure, but the typed form catches the mistake before the network round-trip.
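The multipliers in item 1 can be turned into a quick local sanity check. The sketch below only restates the figures given above; the function name and the 100-credit base cost are hypothetical, and it does not model rounding or any other surcharge the billing engine may apply, so pricing.estimate remains the authoritative source.

```python
from decimal import Decimal

# Multipliers as stated above; "clean" carries no surcharge.
DEGRADATION_MULTIPLIERS = {
    "clean": Decimal("1.0"),
    "scanned": Decimal("1.2"),
    "faxed": Decimal("1.2"),
    "mixed": Decimal("1.25"),
    "photographed": Decimal("1.3"),
}


def apply_degradation_multiplier(base_credits: Decimal, profile: str) -> Decimal:
    """Scale a base credit cost by the profile multiplier; unknown profiles raise."""
    try:
        return base_credits * DEGRADATION_MULTIPLIERS[profile]
    except KeyError:
        raise ValueError(f"unknown degradation_profile: {profile!r}") from None


# A hypothetical job whose base cost is 100 credits.
for profile in DEGRADATION_MULTIPLIERS:
    print(profile, apply_degradation_multiplier(Decimal(100), profile))
```

Decimal is used instead of float so that factors such as 1.2 are represented exactly.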

The SDK exports the canonical value tuples too:

from symagedocs import DEGRADATION_PROFILES, COHERENCE_MODES

assert "scanned" in DEGRADATION_PROFILES
assert "coherent" in COHERENCE_MODES

If you pass a value via both forms (e.g. config={"degradation_profile": "X"} AND degradation_profile="Y"), the value in config wins and a RuntimeWarning is emitted so the conflict isn't silent.

# Typed kwarg form — recommended.
job = client.generate.create(
    "irs_w2_2025",
    quantity=100,
    degradation_profile="scanned",   # billed at 1.2× — see above
    coherence_mode="coherent",
)

# Equivalent freeform form — still supported, but typos cost money.
job = client.generate.create(
    "irs_w2_2025",
    quantity=100,
    config={"degradation_profile": "scanned", "coherence_mode": "coherent"},
)
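The precedence rule for conflicting values (the config entry wins and a RuntimeWarning is emitted) can be illustrated with a small standalone function. This is a sketch of the documented behavior, not the SDK's source: merge_knob is a hypothetical name, and since this README does not say whether a warning is raised when both forms carry the same value, the sketch stays silent in that case.

```python
import warnings
from typing import Any, Dict, Optional


def merge_knob(config: Optional[Dict[str, Any]], name: str, kwarg_value: Optional[str]) -> Optional[str]:
    """Resolve one knob: the config value wins, and a conflict emits RuntimeWarning."""
    config_value = (config or {}).get(name)
    if config_value is not None and kwarg_value is not None and config_value != kwarg_value:
        warnings.warn(
            f"{name} given in config ({config_value!r}) and as kwarg ({kwarg_value!r}); using config",
            RuntimeWarning,
            stacklevel=2,
        )
    return config_value if config_value is not None else kwarg_value


# Record warnings so the conflict is observable in this example.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    resolved = merge_knob({"degradation_profile": "faxed"}, "degradation_profile", "scanned")
```

To make such conflicts fail loudly in a test suite, RuntimeWarning can be promoted to an error with warnings.simplefilter("error", RuntimeWarning).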

Error Handling

The SDK raises typed exceptions for API errors and retries automatically on 429 and 5xx:

from symagedocs import Client, AuthenticationError, RateLimitError, NotFoundError

try:
    forms = client.forms.list()
except AuthenticationError:
    print("Invalid API key")
except RateLimitError:
    print("Too many requests — SDK retries automatically")
except NotFoundError:
    print("Resource not found")

All error classes:

| Exception | HTTP Code | Description |
| --- | --- | --- |
| SymageDocsError | | Base exception for all SDK errors |
| AuthenticationError | 401 | Invalid or revoked API key |
| PermissionDeniedError | 403 | Key missing required scope |
| NotFoundError | 404 | Resource not found |
| ValidationError | 400 | Invalid request parameters |
| InsufficientCreditsError | 402 | Not enough credits for the operation |
| ConflictError | 409 | Resource in unexpected state (e.g., downloading incomplete job) |
| RateLimitError | 429 | Rate limit exceeded (SDK retries automatically) |
| ServerError | 5xx | Server-side error (SDK retries automatically) |
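For logging or for testing your own handlers, the table above can be expressed as a lookup. The sketch below works on exception names as strings so it runs without the SDK installed. Both function names are hypothetical, and falling back to SymageDocsError for unlisted status codes is an assumption.

```python
# Status-to-exception names copied from the table above.
STATUS_TO_EXCEPTION = {
    400: "ValidationError",
    401: "AuthenticationError",
    402: "InsufficientCreditsError",
    403: "PermissionDeniedError",
    404: "NotFoundError",
    409: "ConflictError",
    429: "RateLimitError",
}


def exception_name_for(status: int) -> str:
    """Return the documented exception name for an HTTP status code."""
    if status in STATUS_TO_EXCEPTION:
        return STATUS_TO_EXCEPTION[status]
    if 500 <= status <= 599:
        return "ServerError"
    return "SymageDocsError"  # assumption: the base class covers anything unlisted


def is_retried_automatically(status: int) -> bool:
    """True for the statuses the README says the SDK retries: 429 and 5xx."""
    return status == 429 or 500 <= status <= 599
```

Since only 429 and 5xx are retried, a RateLimitError or ServerError that reaches your code presumably means the retries were exhausted, while the other exceptions surface on the first response.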

Examples

See examples/ in the downloaded SDK for complete working scripts:

  • list_forms.py — Browse available forms and credit costs
  • generate_w2s.py — Full pipeline: create job, wait, download PDF + JSON
  • tabular_dataset.py — Parse NL description, generate 5k rows, download CSV
  • train_kie_model.py — Create job with NIST3 labels, iterate training examples with BIO labels and spatial annotations

License

MIT
