Python SDK for the SymageDocs synthetic data API
SymageDocs Python SDK
Generate synthetic documents, identities, and tabular datasets for testing, ML training, and compliance.
Installation
pip install symagedocs
For progress bars during long jobs:
pip install symagedocs[progress]
Quick Start
from symagedocs import Client
client = Client(api_key="sk_live_...")
# List available forms
forms = client.forms.list()
for f in forms:
    print(f"{f.id}: {f.name} ({f.credit_cost} credits)")
# Generate 100 W-2 documents
# JSON ground truth and CSV are always included in the bundle — no need to request them.
job = client.generate.create(
    "irs_w2_2025",
    quantity=100,
    output_formats=["pdf_typed"],
    # Augmentation knobs. `degradation_profile` affects credit cost:
    # `scanned`/`faxed` add 20%, `photographed` 30%, `mixed` 25% (`clean` = no surcharge).
    # `coherence_mode` controls cross-form identity correlation in multi-form jobs.
    degradation_profile="scanned",
    coherence_mode="coherent",
)
result = client.generate.wait(job.job_id) # polls until complete
client.generate.download(job.job_id, "bundle", "./w2_documents.zip")
# Per-item training data
job = client.generate.create(
    form_id="irs_w2_2025",
    quantity=10,
    output_formats=["pdf_typed", "bio"],
    idempotency_key="my-retry-safe-key",
)
client.generate.wait(job.job_id)
for example in client.generate.iter_training_examples(job.job_id, format="bio"):
    print(example.item_id, len(example.bio.tokens))
# Generate tabular data from a description
schema = client.tabular.parse("name, age, SSN, city, state, annual income")
tab_job = client.tabular.generate(columns=schema.columns, quantity=5000)
client.tabular.wait(tab_job.job_id)
client.tabular.download(tab_job.job_id, "csv", "./dataset.csv")
# Check credit balance
balance = client.account.balance()
print(f"Credits used: {balance.credits_used}")
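The `idempotency_key` passed to `generate.create` in the Quick Start is a fixed string. One way to make retries safe without hand-picking keys is to derive the key from the request parameters, so the same logical request always produces the same key. The sketch below is only an illustration: `derive_idempotency_key` is a hypothetical helper, not part of the SDK, and the key format is an assumption.

```python
import hashlib
import json


def derive_idempotency_key(form_id: str, quantity: int, output_formats: list[str]) -> str:
    """Build a stable key from request parameters (hypothetical helper, not an SDK function)."""
    # Canonical JSON: sorted keys and sorted formats, so logically equal
    # requests always hash to the same key.
    payload = json.dumps(
        {"form_id": form_id, "quantity": quantity, "output_formats": sorted(output_formats)},
        sort_keys=True,
        separators=(",", ":"),
    )
    return "job-" + hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


key_a = derive_idempotency_key("irs_w2_2025", 10, ["pdf_typed", "bio"])
key_b = derive_idempotency_key("irs_w2_2025", 10, ["bio", "pdf_typed"])
print(key_a == key_b)  # same request, same key
```

A deterministic key also means that deliberately submitting the same request again inside the retry window would return the original job, so mix in a run identifier if repeated jobs are intended.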
Authentication
Get your API key at symagedocs.ai/account?tab=api.
# Pass directly
client = Client(api_key="sk_live_...")
# Or set environment variable
# export SYMAGEDOCS_API_KEY=sk_live_...
client = Client() # reads from env
Async Support
import asyncio
from symagedocs import AsyncClient

async def main():
    async with AsyncClient(api_key="sk_live_...") as client:
        forms = await client.forms.list()
        job = await client.generate.create("irs_w2_2025", quantity=10)
        result = await client.generate.wait(job.job_id)

asyncio.run(main())
Configuration
client = Client(
    api_key="sk_live_...",
    base_url="https://symagedocs.ai",  # custom server
    timeout=30.0,                      # request timeout (seconds)
    max_retries=3,                     # retry on 429/5xx
)
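To make the `max_retries` setting concrete, here is a minimal sketch of what retrying on 429/5xx typically looks like. It is not the SDK's implementation: `HTTPStatusError`, `call_with_retries`, and the backoff schedule are all assumptions made for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Statuses treated as transient, matching the "retry on 429/5xx" comment.
RETRYABLE_STATUSES = {429} | set(range(500, 600))


class HTTPStatusError(Exception):
    """Hypothetical stand-in for an HTTP failure carrying a status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_retries(
    send: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send(), retrying transient failures up to max_retries extra attempts."""
    for attempt in range(max_retries + 1):
        try:
            return send()
        except HTTPStatusError as exc:
            # Non-retryable status, or retries exhausted: surface the error.
            if exc.status not in RETRYABLE_STATUSES or attempt == max_retries:
                raise
            # Exponential backoff plus a little jitter (assumed schedule).
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise AssertionError("unreachable")
```

With `max_retries=3`, a request is attempted at most four times; client errors such as 404 are raised on the first failure.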
Method Reference
Forms
| Method | Description |
|---|---|
| `forms.list(category=None)` | List available forms, optionally filtered by category |
| `forms.get(form_id)` | Get detailed form info including field definitions |
Generation
| Method | Description |
|---|---|
| `generate.create(form_id=None, *, form_ids=None, quantity=1, output_formats=["pdf_typed"], config=None, seed=None, webhook_url=None, ink_color=None, ink_color_distribution=None, writer_consistency=None, degradation_profile=None, coherence_mode=None, idempotency_key=None)` | Create an async generation job. Pass either `form_id` (single form) or `form_ids` (coherent multi-form generation across the same identity). `ink_color_distribution`, when set, must be a per-color weight map summing to exactly 100. `degradation_profile` and `coherence_mode` are typed kwargs over what used to live inside `config={...}`; see the Augmentation knobs section. `idempotency_key` attaches an `Idempotency-Key` header so retries within 24 hours return the original `job_id` and don't double-charge. The deprecated `realism_level` API field is intentionally not exposed; call the REST API directly if you need it. |
| `generate.list_jobs(limit=50, cursor=None, status=None)` | List generation jobs (cursor-paginated) |
| `generate.get_job(job_id)` | Get full job status and progress |
| `generate.list_downloads(job_id)` | List per-artifact presigned download URLs for a completed job |
| `generate.download(job_id, format, path)` | Download job output to a local file. Also allowed for terminal-but-not-completed jobs (CANCELED / FAILED / EXPIRED) so partial output is recoverable. |
| `generate.wait(job_id, poll_interval=3.0)` | Poll until the job completes or fails |
| `generate.cancel(job_id)` | Cancel a running job. Idempotent. Items rendered before the cancel was observed remain downloadable via `download(format="bundle")`. |
| `generate.list_items(job_id, limit=50, cursor=None)` | List per-item records for a job. Cursor-paginated; each item carries its presigned download URLs. |
| `generate.download_item(job_id, item_id)` | Presigned S3 URLs for one item's files. |
| `generate.get_bio_labels(job_id, item_id)` | Client-side helper: fetches the item's `_bio.json` sidecar and returns a parsed `BioDataset`. |
| `generate.get_word_annotations(job_id, item_id)` | Client-side helper: fetches the item's `_words.json` sidecar and returns parsed `WordAnnotations`. |
| `generate.iter_training_examples(job_id, format="bio")` | Client-side helper: iterates all items, yielding training examples in the chosen format (`"bio"` (default), `"funsd"`, `"donut"`). |
`client.generation` alias. `client.generation` and `client.generate` reference the same resource; use whichever name you prefer.
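Both `generate.list_jobs` and `generate.list_items` are cursor-paginated. The sketch below shows the usual loop for draining a cursor-paginated endpoint. The page shape (`items` plus `next_cursor`) is an assumption, since the response fields are not documented here, and a fake backend stands in for the API so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Optional


@dataclass
class Page:
    """Assumed page shape: a list of items plus the cursor for the next page."""

    items: list
    next_cursor: Optional[str]


def iter_all(fetch_page: Callable[[Optional[str]], Page]) -> Iterator:
    """Yield every item across pages until no further cursor is returned."""
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page.items
        if not page.next_cursor:
            break
        cursor = page.next_cursor


# Fake three-page backend standing in for generate.list_jobs.
_PAGES = {
    None: Page(["job_1", "job_2"], "c1"),
    "c1": Page(["job_3"], "c2"),
    "c2": Page(["job_4"], None),
}

all_jobs = list(iter_all(lambda cursor: _PAGES[cursor]))
print(all_jobs)
```

If the real response exposes fields with these names, the same helper could be driven by something like `iter_all(lambda c: client.generate.list_jobs(limit=50, cursor=c))`; check the API reference for the actual field names first.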
Identities
| Method | Description |
|---|---|
| `identities.generate(quantity=1, config=None, seed=None)` | Generate raw synthetic identities as JSON |
Tabular
| Method | Description |
|---|---|
| `tabular.parse(prompt)` | Convert natural language to a column schema (LLM-powered) |
| `tabular.generate(columns, quantity=100, output_formats=["csv"], seed=None)` | Create a tabular generation job |
| `tabular.status(job_id)` | Get tabular job progress and ETA |
| `tabular.download(job_id, format, path)` | Download tabular output to a local file |
| `tabular.wait(job_id, poll_interval=2.0)` | Poll until the tabular job completes or fails |
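The `wait` helpers poll at a fixed interval. If you need an overall deadline on top of that, a polling loop along the following lines is one option. This is a sketch, not SDK code: `wait_for_terminal` is a hypothetical helper, and the status strings other than CANCELED / FAILED / EXPIRED (which appear in the Generation table) are assumptions.

```python
import time
from typing import Callable

# COMPLETED is an assumed spelling; the other three appear in the Generation table.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELED", "EXPIRED"}


def wait_for_terminal(
    get_status: Callable[[], str],
    poll_interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() until it returns a terminal status or the deadline passes."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} seconds")
        sleep(poll_interval)


# Simulated job that finishes on the third poll (QUEUED/RUNNING are made-up names).
_statuses = iter(["QUEUED", "RUNNING", "COMPLETED"])
final_status = wait_for_terminal(lambda: next(_statuses), sleep=lambda seconds: None)
print(final_status)
```

Injecting `sleep` and `clock` keeps the loop testable without real delays.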
Account
| Method | Description |
|---|---|
| `account.balance()` | Get credit balance (`credits_used`, `credits_allocated`) |
| `account.usage(days=30)` | Get usage summary for the specified period |
Pricing
The pricing endpoints are public/unauthenticated on the backend, but the SDK still requires an API key at construction time for consistency; the auth header is sent and ignored by these routes.
| Method | Description |
|---|---|
| `pricing.rates()` | Get the current credit rate constants (CSV per-row rate, PDF base + surcharge bands, multipliers, …) |
| `pricing.estimate(*, field_count, output_formats, record_count, degradation_profile=None)` | Estimate the credit cost of a hypothetical job before submitting it |
Health
| Method | Description |
|---|---|
| `client.health()` | Lightweight reachability probe (`GET /api/v1/health`). Returns the parsed JSON body. Works on both `Client` and `AsyncClient`. |
Augmentation knobs
Two of the most-used keys in the freeform `config={...}` dict on `generate.create` are also exposed as typed kwargs:
- `degradation_profile: Literal["clean", "scanned", "faxed", "photographed", "mixed"] | None`
- `coherence_mode: Literal["coherent", "shuffled", "random"] | None`
Why bother? Two reasons:
1. `degradation_profile` affects credit cost. Non-`clean` profiles need extra rendering work (rasterization, noise, paper warp), so the billing engine applies a multiplier: `scanned`/`faxed` are billed at 1.2×, `mixed` at 1.25×, and `photographed` at 1.3×. A typo in the freeform `config={...}` form silently falls back to the default 1.0× multiplier, meaning you don't get the degradation you asked for and the typo isn't caught until you notice the artifacts (or don't). The typed kwarg form catches typos at type-check time.
2. Pre-flight validation. The `Literal` types fence off unknown values at edit time in any IDE that supports type checking. The backend also rejects unknown values with `400` for both knobs, so even untyped callers get a fast failure, but the typed form catches the mistake before the network round-trip.
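As a worked example of the multipliers above, the following sketch applies each surcharge to a base cost. The base figure of 100 credits is made up, `estimated_credits` is a hypothetical helper, and real billing may round or combine rates differently; `pricing.estimate` remains the authoritative source.

```python
# Surcharge per profile in percent, taken from the multipliers listed above.
DEGRADATION_SURCHARGE_PERCENT = {
    "clean": 0,
    "scanned": 20,
    "faxed": 20,
    "mixed": 25,
    "photographed": 30,
}


def estimated_credits(base_credits: float, degradation_profile: str = "clean") -> float:
    """Apply the degradation surcharge to a base cost (illustrative only)."""
    if degradation_profile not in DEGRADATION_SURCHARGE_PERCENT:
        # Fail loudly instead of silently billing at the clean rate.
        raise ValueError(f"unknown degradation_profile: {degradation_profile!r}")
    percent = 100 + DEGRADATION_SURCHARGE_PERCENT[degradation_profile]
    return base_credits * percent / 100


for profile in DEGRADATION_SURCHARGE_PERCENT:
    print(profile, estimated_credits(100, profile))
```

Integer percentages are used rather than float multipliers so the arithmetic stays exact for whole-credit inputs.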
The SDK exports the canonical value tuples too:
from symagedocs import DEGRADATION_PROFILES, COHERENCE_MODES
assert "scanned" in DEGRADATION_PROFILES
assert "coherent" in COHERENCE_MODES
If you pass a value via both forms (e.g. config={"degradation_profile": "X"} AND degradation_profile="Y"), the value in config wins and a
RuntimeWarning is emitted so the conflict isn't silent.
# Typed kwarg form (recommended).
job = client.generate.create(
    "irs_w2_2025",
    quantity=100,
    degradation_profile="scanned",  # billed at 1.2× (see above)
    coherence_mode="coherent",
)
# Equivalent freeform form: still supported, but typos cost money.
job = client.generate.create(
    "irs_w2_2025",
    quantity=100,
    config={"degradation_profile": "scanned", "coherence_mode": "coherent"},
)
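The precedence rule described above (the value in `config` wins and a `RuntimeWarning` is emitted) can be pictured with a small sketch. `resolve_knob` is a hypothetical function, not the SDK's internals, and warning only when the two values differ is an assumption.

```python
import warnings
from typing import Optional


def resolve_knob(name: str, kwarg_value: Optional[str], config: Optional[dict]) -> Optional[str]:
    """Pick the effective value for a knob given both calling forms (illustrative)."""
    config_value = (config or {}).get(name)
    if config_value is not None and kwarg_value is not None and config_value != kwarg_value:
        # Conflict: make it visible rather than silently choosing one.
        warnings.warn(
            f"{name} passed twice; config value {config_value!r} "
            f"overrides kwarg value {kwarg_value!r}",
            RuntimeWarning,
            stacklevel=2,
        )
    # The config value takes precedence whenever it is present.
    return config_value if config_value is not None else kwarg_value


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    effective = resolve_knob("degradation_profile", "scanned", {"degradation_profile": "faxed"})

print(effective, len(caught))
```

In practice the simplest way to avoid the warning is to pick one form per call.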
Error Handling
The SDK raises typed exceptions for API errors and retries automatically on 429 and 5xx:
from symagedocs import Client, AuthenticationError, RateLimitError, NotFoundError
client = Client()  # reads SYMAGEDOCS_API_KEY from the environment
try:
    forms = client.forms.list()
except AuthenticationError:
    print("Invalid API key")
except RateLimitError:
    print("Too many requests (SDK retries automatically)")
except NotFoundError:
    print("Resource not found")
All error classes:
| Exception | HTTP Code | Description |
|---|---|---|
| `SymageDocsError` | n/a | Base exception for all SDK errors |
| `AuthenticationError` | 401 | Invalid or revoked API key |
| `PermissionDeniedError` | 403 | Key missing required scope |
| `NotFoundError` | 404 | Resource not found |
| `ValidationError` | 400 | Invalid request parameters |
| `InsufficientCreditsError` | 402 | Not enough credits for the operation |
| `ConflictError` | 409 | Resource in unexpected state (e.g., downloading incomplete job) |
| `RateLimitError` | 429 | Rate limit exceeded (SDK retries automatically) |
| `ServerError` | 5xx | Server-side error (SDK retries automatically) |
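The table above amounts to a status-code dispatch. The sketch below mirrors it with local stand-in classes so it runs without the SDK installed; the class hierarchy (every error deriving directly from the base) and the `error_for_status` helper are assumptions, not the SDK's actual code.

```python
class SymageDocsError(Exception):
    """Local stand-in for the SDK's base exception."""


class ValidationError(SymageDocsError): ...
class AuthenticationError(SymageDocsError): ...
class InsufficientCreditsError(SymageDocsError): ...
class PermissionDeniedError(SymageDocsError): ...
class NotFoundError(SymageDocsError): ...
class ConflictError(SymageDocsError): ...
class RateLimitError(SymageDocsError): ...
class ServerError(SymageDocsError): ...


# Exact-match statuses from the table.
STATUS_TO_ERROR = {
    400: ValidationError,
    401: AuthenticationError,
    402: InsufficientCreditsError,
    403: PermissionDeniedError,
    404: NotFoundError,
    409: ConflictError,
    429: RateLimitError,
}


def error_for_status(status: int) -> type:
    """Return the exception class matching an HTTP status (illustrative)."""
    if status in STATUS_TO_ERROR:
        return STATUS_TO_ERROR[status]
    if 500 <= status <= 599:
        return ServerError
    # Anything unlisted falls back to the base class.
    return SymageDocsError


print(error_for_status(402).__name__)
```

Because every class derives from the base, a single `except SymageDocsError` clause is enough to catch any of them in this sketch.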
Examples
See examples/ in the downloaded SDK for complete working scripts:
- `list_forms.py`: browse available forms and credit costs
- `generate_w2s.py`: full pipeline (create job, wait, download PDF + JSON)
- `tabular_dataset.py`: parse NL description, generate 5k rows, download CSV
- `train_kie_model.py`: create job with NIST3 labels, iterate training examples with BIO labels and spatial annotations
Documentation
- API User Manual — long-form guide with worked examples
- API Explorer — interactive Swagger UI
- API Reference — three-panel ReDoc reference
License
MIT