Generate synthetic spoken transcript variants for structured values and alphanumeric sequences.
Synthetic spoken transcript variants for structured entity extraction.
The Problem
Voice AI systems do not fail only because they miss words. They fail because structured values are spoken in messy, inconsistent, human ways.
ZIP code:
94101
nine four one zero one
nine four one oh one
ninety four one oh one
SSN:
900-12-3456
nine zero zero, one two, three four five six
nine zero zero dash one two dash three four five six
Credit card:
4242 4242 4242 4242
four two four two, four two four two, four two four two, four two four two
Real phone transcripts are private, expensive, sparse, and slow to annotate. SpokenForms lets teams generate synthetic direct-answer data before production transcripts exist, then validate every transcript back to the intended canonical value.
What It Builds
canonical structured value
-> verbalization patterns
-> phone-call-style transcript candidates
-> deterministic consistency validation
-> JSONL / CSV / Parquet dataset
SpokenForms is built around a LingVarBench-style three-stage pipeline:
| Stage | Job |
|---|---|
| Value Generator | Creates canonical values for an entity schema. |
| Transcript Generator | Applies reusable and entity-specific spoken patterns. |
| Consistency Checker | Keeps only transcripts recoverable to the intended value. |
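The three stages can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the package's implementation: the function names (`generate_values`, `generate_candidates`, `recover_digits`, `build`), the record keys, and the use of a 5-digit ZIP code as the entity are all hypothetical.

```python
import random

# Hypothetical word/digit maps for this sketch; not taken from spokenforms.
WORD_TO_DIGIT = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
DIGIT_TO_WORD = {d: w for w, d in WORD_TO_DIGIT.items() if w != "oh"}


def generate_values(n: int, seed: int = 0) -> list[str]:
    """Stage 1: create canonical values (here, random 5-digit strings)."""
    rng = random.Random(seed)
    return ["".join(rng.choice("0123456789") for _ in range(5)) for _ in range(n)]


def generate_candidates(value: str) -> list[tuple[str, str]]:
    """Stage 2: apply spoken patterns to a canonical value."""
    spoken = " ".join(DIGIT_TO_WORD[d] for d in value)
    return [
        ("direct_and_simple", spoken),
        ("filler_words", f"Um, it is {spoken}."),
        ("formal", f"The ZIP code is {spoken}."),
    ]


def recover_digits(transcript: str) -> str:
    """Map digit words back to digits, ignoring punctuation and other words."""
    tokens = transcript.lower().replace(",", " ").replace(".", " ").split()
    return "".join(WORD_TO_DIGIT[t] for t in tokens if t in WORD_TO_DIGIT)


def build(n: int) -> list[dict[str, str]]:
    """Stage 3: keep only transcripts that recover the intended value."""
    records = []
    for value in generate_values(n):
        for pattern, transcript in generate_candidates(value):
            if recover_digits(transcript) == value:
                records.append(
                    {"value": value, "pattern": pattern, "transcript": transcript}
                )
    return records


for record in build(2):
    print(record)
```

The checker here is deliberately deterministic: a transcript survives only if a fixed normalizer maps it back to the exact canonical string, so a candidate that drops or alters a digit is discarded rather than repaired.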
Install
uv add spokenforms
For local development:
uv sync --all-extras --dev
The mock provider works offline and needs no network access or credentials. The OpenAI provider is still on the roadmap; once you use it, put its credentials in .env.
Quick Start
uv run spokenforms init --output demo
cd demo
uv run spokenforms build \
  --config config.yaml \
  --entity ssn \
  --provider mock \
  --num-values 3 \
  --target-per-pattern 2 \
  --output-dir runs/demo_ssn
Expected output:
manifest.json
config.resolved.yaml
values.jsonl
candidates.jsonl
validated.jsonl
dataset.jsonl
dataset.csv
dataset.parquet
stats.json
stats.md
logs.jsonl
Generated Examples
These examples were generated from this package with:
uv run spokenforms build --entity ssn --provider mock --num-values 1 --target-per-pattern 1
uv run spokenforms build --entity credit_card_number --provider mock --num-values 1 --target-per-pattern 1
Synthetic SSN
| Ground truth | Pattern | Generated transcript |
|---|---|---|
| 900-12-3456 | direct_and_simple | nine zero zero, one two, three four five six |
| 900-12-3456 | filler_words | Um, it is nine zero zero, one two, three four five six. |
| 900-12-3456 | formal | The Social Security number is nine zero zero, one two, three four five six. |
All records are tagged:
{
  "synthetic_sensitive_value": true,
  "sensitive_type": "ssn",
  "real_world_safe": true,
  "generation_mode": "reserved_or_invalid"
}
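A downstream consumer may want to confirm these tags before using a record. The sketch below assumes the four tags appear as top-level keys on each JSONL line, which the README does not state; `is_tagged_safe` is a hypothetical helper, not part of the package. The expected mode per entity is taken from the Safety Model table.

```python
import json

# Tag values every sensitive synthetic record is expected to carry.
REQUIRED_TAGS = {"synthetic_sensitive_value": True, "real_world_safe": True}

# Default generation mode per sensitive entity, as listed in the Safety Model.
EXPECTED_MODES = {
    "ssn": "reserved_or_invalid",
    "credit_card_number": "payment_test_numbers",
}


def is_tagged_safe(record: dict) -> bool:
    """Return True only if the record carries the full set of safety tags."""
    if any(record.get(key) is not value for key, value in REQUIRED_TAGS.items()):
        return False
    expected_mode = EXPECTED_MODES.get(record.get("sensitive_type"))
    return expected_mode is not None and record.get("generation_mode") == expected_mode


# Assumed record shape: tags stored as top-level keys of one JSONL line.
line = (
    '{"synthetic_sensitive_value": true, "sensitive_type": "ssn", '
    '"real_world_safe": true, "generation_mode": "reserved_or_invalid"}'
)
print(is_tagged_safe(json.loads(line)))
```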
Payment-Test Card Number
| Ground truth | Pattern | Generated transcript |
|---|---|---|
| 4242424242424242 | direct_and_simple | four two four two, four two four two, four two four two, four two four two |
| 4242424242424242 | filler_words | Um, it is four two four two, four two four two, four two four two, four two four two. |
| 4242424242424242 | card_correction | one two four, sorry, four two four two, four two four two, four two four two, four two four two. |
All card records are tagged:
{
  "synthetic_sensitive_value": true,
  "sensitive_type": "credit_card_number",
  "real_world_safe": true,
  "generation_mode": "payment_test_numbers"
}
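Test card numbers such as 4242424242424242 still satisfy the Luhn checksum, so a recovered card value can be sanity-checked independently of the transcript. The README does not say whether SpokenForms runs a Luhn check itself; `luhn_valid` below is a hypothetical helper shown only to illustrate the checksum.

```python
def luhn_valid(number: str) -> bool:
    """Check a card number against the Luhn checksum, ignoring spaces and dashes."""
    digits = [int(ch) for ch in number if ch in "0123456789"]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("4242 4242 4242 4242"))  # True
print(luhn_valid("4242 4242 4242 4241"))  # False
```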
Generated Stats
SSN run:
total records: 16
total patterns: 16
validation pass rate: 100.00%
sensitive synthetic records: 16
Credit-card run:
total records: 21
total patterns: 21
validation pass rate: 100.00%
sensitive synthetic records: 21
Built-In Entities
confirmation_code account_number member_id
claim_id policy_number zip_code
date_of_birth full_name phone_number
ssn credit_card_number boolean_answer
enum_answer multi_select_answer pain_rating
respiratory_issues hearing_issues
Pattern Inventory
General patterns:
direct_and_simple filler_words hesitation correction repetition
formal casual polite confident uncertain
confirmation digit_by_digit grouped_two grouped_four nato_letters
Sensitive entity patterns:
ssn_grouped_3_2_4
ssn_digit_by_digit
ssn_with_dashes
ssn_correction
ssn_repetition_for_confirmation
card_grouped_4_4_4_4
card_digit_by_digit
card_with_spaces
card_last_four_repetition
card_correction
card_issuer_style_grouping
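The grouped patterns amount to splitting the digits into fixed-size chunks and reading each chunk aloud. A minimal sketch follows, assuming comma-separated groups as in the generated examples; `spoken_groups` is a hypothetical name and not the package's pattern API.

```python
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spoken_groups(value: str, sizes: list[int]) -> str:
    """Read a value digit by digit, pausing (with a comma) between groups."""
    digits = [ch for ch in value if ch in "0123456789"]  # drop dashes and spaces
    if sum(sizes) != len(digits):
        raise ValueError(f"group sizes {sizes} do not cover {len(digits)} digits")
    groups = []
    start = 0
    for size in sizes:
        chunk = digits[start:start + size]
        groups.append(" ".join(DIGIT_WORDS[int(d)] for d in chunk))
        start += size
    return ", ".join(groups)


# In the spirit of ssn_grouped_3_2_4
print(spoken_groups("900-12-3456", [3, 2, 4]))
# In the spirit of card_grouped_4_4_4_4
print(spoken_groups("4242 4242 4242 4242", [4, 4, 4, 4]))
```

Keeping the group sizes as data rather than code means a new grouping style only needs a new list of sizes.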
Safety Model
SpokenForms is synthetic-first.
| Entity | Default mode | Real-world data? |
|---|---|---|
| ssn | reserved_or_invalid | No |
| credit_card_number | payment_test_numbers | No |
Unsafe generation flags are reserved in the CLI but intentionally rejected in v0.1:
spokenforms build --entity ssn --allow-potentially-real-sensitive-values
# exits with an error
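For context on the reserved_or_invalid mode: the Social Security Administration never issues numbers with area 000, 666, or 900 through 999, group 00, or serial 0000, which is why a value such as 900-12-3456 cannot belong to a real person. The README does not describe how SpokenForms selects its values, so the check below is only a sketch of that public rule; `is_never_issued_ssn` is a hypothetical helper.

```python
import re

SSN_PATTERN = re.compile(r"([0-9]{3})-([0-9]{2})-([0-9]{4})")


def is_never_issued_ssn(ssn: str) -> bool:
    """Return True if the SSN falls in a range the SSA does not assign."""
    match = SSN_PATTERN.fullmatch(ssn)
    if match is None:
        raise ValueError(f"not in NNN-NN-NNNN form: {ssn!r}")
    area, group, serial = match.groups()
    return (
        area in {"000", "666"}
        or int(area) >= 900
        or group == "00"
        or serial == "0000"
    )


print(is_never_issued_ssn("900-12-3456"))  # True
```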
CLI
spokenforms init --output demo
spokenforms build \
  --config config.yaml \
  --entity credit_card_number \
  --provider mock \
  --num-values 3 \
  --target-per-pattern 2 \
  --output-dir runs/cards
spokenforms stats runs/cards/dataset.jsonl
Python API
from spokenforms.config import apply_cli_overrides, default_config
from spokenforms.generation import run_pipeline
from spokenforms.models import ProviderName
from spokenforms.providers import create_provider

config = apply_cli_overrides(
    default_config(),
    provider=ProviderName.MOCK,
    num_values=2,
    target_per_pattern=2,
    output_dir=None,
)
provider = create_provider(ProviderName.MOCK, "mock")
result = run_pipeline("readme-demo", "ssn", config, provider)

print(result.records[0].transcript)
Test Cases
The repository ships with checks for:
| Area | Covered behavior |
|---|---|
| CLI | init, build, stats, unsafe flag rejection |
| Generation | mock value generation, pattern application, balancing |
| Validation | deterministic consistency checks |
| Normalization | numbers, alphanumeric values, SSN, card numbers, dates, names, enums, booleans |
| Safety | synthetic-only sensitive policy enforcement |
| Storage | JSONL, CSV, Parquet, manifest, stats |
| Packaging | Ruff, mypy strict mode, coverage, wheel/sdist build |
Run the full suite:
uv run ruff check .
uv run ruff format --check .
uv run mypy src tests
uv run pytest
uv build
Current local verification:
ruff: passing
format: passing
mypy: passing
pytest: 5 passed, 94.92% coverage
build: wheel and sdist generated
Release Slugs
| Release | Slug | Theme |
|---|---|---|
| 0.1.0 | safe-synthetic-seed | Offline mock generation, typed package, synthetic SSN/card guardrails |
| 0.2.0 | provider-lift | OpenAI provider, richer prompt contracts, retry/cache hardening |
| 0.3.0 | bench-runner | Evaluation harnesses, train/validation/test workflows, extraction prompt baselines |
| 1.0.0 | voice-data-foundry | Stable API for production synthetic transcript generation |
Roadmap
- OpenAI provider implementation.
- User-defined entity and pattern YAML loading.
- Richer consistency checking for correction-style transcripts.
- DSPy/SIMBA prompt optimization hooks.
- More locale-specific readout styles.
- Larger generated example gallery.
Project Layout
src/spokenforms/
  cli.py
  config.py
  models.py
  providers/
  generation/
  patterns/
  entities/
  normalizers/
  validators/
  storage/
  stats/
tests/
examples/
License
MIT. See LICENSE.