Generate synthetic spoken transcript variants for structured values and alphanumeric sequences.

SpokenForms

Synthetic spoken transcript variants for structured entity extraction.


The Problem

Voice AI systems fail not only because they miss words, but also because people speak structured values in messy, inconsistent ways.

ZIP code:
  94101
  nine four one zero one
  nine four one oh one
  ninety four one oh one

SSN:
  900-12-3456
  nine zero zero, one two, three four five six
  nine zero zero dash one two dash three four five six

Credit card:
  4242 4242 4242 4242
  four two four two, four two four two, four two four two, four two four two

Real phone transcripts are private, expensive, sparse, and slow to annotate. SpokenForms lets teams generate synthetic direct-answer data before production transcripts exist, then validate every transcript back to the intended canonical value.
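
To make "validate every transcript back to the intended canonical value" concrete, here is a minimal sketch of a digit normalizer that handles the ZIP and SSN readouts listed above. The function name spoken_to_digits and its word tables are illustrative assumptions, not the package's actual normalizer, and teens such as "fourteen" are left out to keep it short.

```python
import re

# Word-to-digit tables. "oh" and "o" are common spoken stand-ins for zero.
UNITS = {
    "zero": "0", "oh": "0", "o": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
TENS = {
    "twenty": "2", "thirty": "3", "forty": "4", "fifty": "5",
    "sixty": "6", "seventy": "7", "eighty": "8", "ninety": "9",
}
ZERO_WORDS = ("zero", "oh", "o")


def spoken_to_digits(transcript: str) -> str:
    """Collapse a spoken digit sequence into a plain digit string (hypothetical helper)."""
    tokens = re.findall(r"[a-z]+", transcript.lower())
    digits = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in UNITS:
            digits.append(UNITS[token])
        elif token in TENS:
            following = tokens[i + 1] if i + 1 < len(tokens) else None
            if following in UNITS and following not in ZERO_WORDS:
                # "ninety four" is read as the two digits 9 and 4.
                digits.append(TENS[token] + UNITS[following])
                i += 1
            else:
                # A bare tens word such as "ninety" is read as 9 and 0.
                digits.append(TENS[token] + "0")
        # Any other token ("dash", "um", "it", "is") is ignored.
        i += 1
    return "".join(digits)


print(spoken_to_digits("ninety four one oh one"))
```

All three ZIP readouts above collapse to the same digit string under this sketch, which is the property a consistency check relies on.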

What It Builds

canonical structured value
  -> verbalization patterns
  -> phone-call-style transcript candidates
  -> deterministic consistency validation
  -> JSONL / CSV / Parquet dataset

SpokenForms is built around a LingVarBench-style three-stage pipeline:

Stage                 Job
Value Generator       Creates canonical values for an entity schema.
Transcript Generator  Applies reusable and entity-specific spoken patterns.
Consistency Checker   Keeps only transcripts recoverable to the intended value.
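
The three stages can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the package's implementation: the function names, the record keys, and the five-digit value shape are assumptions, and the "truncated" pattern is a deliberately lossy, hypothetical pattern added so the checker has something to reject.

```python
import random

DIGIT_WORDS = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
WORD_TO_DIGIT = {word: str(i) for i, word in enumerate(DIGIT_WORDS)}


def generate_values(count: int, seed: int = 0) -> list[str]:
    """Stage 1: create canonical values (here, five-digit strings)."""
    rng = random.Random(seed)
    return ["".join(rng.choice("0123456789") for _ in range(5)) for _ in range(count)]


def spell(value: str) -> str:
    """Read a digit string out one digit at a time."""
    return " ".join(DIGIT_WORDS[int(ch)] for ch in value)


# Stage 2 patterns. "truncated" drops the last digit on purpose (hypothetical).
PATTERNS = {
    "direct_and_simple": lambda v: spell(v),
    "filler_words": lambda v: f"Um, it is {spell(v)}.",
    "truncated": lambda v: spell(v[:-1]),
}


def generate_candidates(values: list[str]) -> list[dict]:
    """Stage 2: apply every pattern to every canonical value."""
    return [
        {"value": v, "pattern": name, "transcript": render(v)}
        for v in values
        for name, render in PATTERNS.items()
    ]


def recover(transcript: str) -> str:
    """Map a transcript back to digits, ignoring non-digit words."""
    words = transcript.lower().replace(",", " ").replace(".", " ").split()
    return "".join(WORD_TO_DIGIT[w] for w in words if w in WORD_TO_DIGIT)


def keep_consistent(candidates: list[dict]) -> list[dict]:
    """Stage 3: keep only transcripts that recover to the intended value."""
    return [c for c in candidates if recover(c["transcript"]) == c["value"]]


candidates = generate_candidates(generate_values(3))
validated = keep_consistent(candidates)
print(f"{len(validated)} of {len(candidates)} candidates kept")
```

Because the check is a deterministic round trip rather than a model judgment, a rejected candidate can always be traced to a concrete mismatch between the recovered and intended values.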

Install

uv add spokenforms

For local development:

uv sync --all-extras --dev

The OpenAI provider is still on the roadmap; once it lands, put credentials in .env. The mock provider works offline and needs no network access.

Quick Start

uv run spokenforms init --output demo
cd demo

uv run spokenforms build \
  --config config.yaml \
  --entity ssn \
  --provider mock \
  --num-values 3 \
  --target-per-pattern 2 \
  --output-dir runs/demo_ssn

Expected output:

manifest.json
config.resolved.yaml
values.jsonl
candidates.jsonl
validated.jsonl
dataset.jsonl
dataset.csv
dataset.parquet
stats.json
stats.md
logs.jsonl
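
The JSONL files can be inspected with the standard library alone. The sketch below reads one JSON object per line and tallies records by a field. The field name "pattern" used in the usage note is an assumption about the record schema, which this README does not spell out.

```python
import json
from collections import Counter
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def count_by(records, field):
    """Tally records by a field, using "<missing>" when the field is absent."""
    return Counter(str(record.get(field, "<missing>")) for record in records)
```

After a build, something like count_by(load_jsonl("runs/demo_ssn/dataset.jsonl"), "pattern") would show how many records each pattern produced, provided the records carry such a field.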

Generated Examples

These examples were generated from this package with:

uv run spokenforms build --entity ssn --provider mock --num-values 1 --target-per-pattern 1
uv run spokenforms build --entity credit_card_number --provider mock --num-values 1 --target-per-pattern 1

Synthetic SSN

Ground truth  Pattern            Generated transcript
900-12-3456   direct_and_simple  nine zero zero, one two, three four five six
900-12-3456   filler_words       Um, it is nine zero zero, one two, three four five six.
900-12-3456   formal             The Social Security number is nine zero zero, one two, three four five six.

All SSN records are tagged:

{
  "synthetic_sensitive_value": true,
  "sensitive_type": "ssn",
  "real_world_safe": true,
  "generation_mode": "reserved_or_invalid"
}

Payment-Test Card Number

Ground truth      Pattern            Generated transcript
4242424242424242  direct_and_simple  four two four two, four two four two, four two four two, four two four two
4242424242424242  filler_words       Um, it is four two four two, four two four two, four two four two, four two four two.
4242424242424242  card_correction    one two four, sorry, four two four two, four two four two, four two four two, four two four two.

All card records are tagged:

{
  "synthetic_sensitive_value": true,
  "sensitive_type": "credit_card_number",
  "real_world_safe": true,
  "generation_mode": "payment_test_numbers"
}
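
Downstream code may want to refuse any sensitive record that lacks these tags. The following sketch checks the four keys shown above. The helper name tag_problems is hypothetical, and it assumes the tags sit at the top level of each record, which the README does not confirm.

```python
# Tag values taken from the two JSON snippets above.
REQUIRED_TAGS = {"synthetic_sensitive_value": True, "real_world_safe": True}
EXPECTED_MODE = {
    "ssn": "reserved_or_invalid",
    "credit_card_number": "payment_test_numbers",
}


def tag_problems(record: dict) -> list[str]:
    """Return human-readable problems; an empty list means the tags look right."""
    problems = []
    for key, expected in REQUIRED_TAGS.items():
        if record.get(key) is not expected:
            problems.append(f"{key} should be {expected!r}, got {record.get(key)!r}")
    sensitive_type = record.get("sensitive_type")
    mode = record.get("generation_mode")
    if sensitive_type not in EXPECTED_MODE:
        problems.append(f"unknown sensitive_type {sensitive_type!r}")
    elif mode != EXPECTED_MODE[sensitive_type]:
        problems.append(
            f"generation_mode should be {EXPECTED_MODE[sensitive_type]!r}, got {mode!r}"
        )
    return problems


card_record = {
    "synthetic_sensitive_value": True,
    "sensitive_type": "credit_card_number",
    "real_world_safe": True,
    "generation_mode": "payment_test_numbers",
}
print(tag_problems(card_record))
```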

Generated Stats

SSN run:
  total records: 16
  total patterns: 16
  validation pass rate: 100.00%
  sensitive synthetic records: 16

Credit-card run:
  total records: 21
  total patterns: 21
  validation pass rate: 100.00%
  sensitive synthetic records: 21

Built-In Entities

confirmation_code       account_number          member_id
claim_id                policy_number           zip_code
date_of_birth           full_name               phone_number
ssn                     credit_card_number      boolean_answer
enum_answer             multi_select_answer     pain_rating
respiratory_issues      hearing_issues

Pattern Inventory

General patterns:

direct_and_simple  filler_words  hesitation  correction  repetition
formal             casual        polite      confident   uncertain
confirmation       digit_by_digit grouped_two grouped_four nato_letters

Sensitive entity patterns:

ssn_grouped_3_2_4
ssn_digit_by_digit
ssn_with_dashes
ssn_correction
ssn_repetition_for_confirmation

card_grouped_4_4_4_4
card_digit_by_digit
card_with_spaces
card_last_four_repetition
card_correction
card_issuer_style_grouping
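
The grouping patterns differ mainly in where the pauses fall. As a rough illustration, assuming a pattern is just a function from a canonical value to a spoken string, the grouped, digit-by-digit, and with-dashes styles might look like this. The function names are hypothetical and the package's own pattern code may be organized differently.

```python
DIGITS = "0123456789"
DIGIT_WORDS = ("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine")


def speak_digits(digits: str) -> str:
    """Read digits one at a time, skipping separators such as "-" or " "."""
    return " ".join(DIGIT_WORDS[int(ch)] for ch in digits if ch in DIGITS)


def speak_grouped(value: str, groups: tuple[int, ...]) -> str:
    """Read digits in fixed-size groups separated by a comma pause."""
    digits = "".join(ch for ch in value if ch in DIGITS)
    if sum(groups) != len(digits):
        raise ValueError(f"groups {groups} do not cover {len(digits)} digits")
    spoken, start = [], 0
    for size in groups:
        spoken.append(speak_digits(digits[start:start + size]))
        start += size
    return ", ".join(spoken)


def speak_with_dashes(value: str) -> str:
    """Read a dash-separated value, saying "dash" at each separator."""
    return " dash ".join(speak_digits(part) for part in value.split("-"))


print(speak_grouped("900-12-3456", (3, 2, 4)))
print(speak_with_dashes("900-12-3456"))
print(speak_grouped("4242 4242 4242 4242", (4, 4, 4, 4)))
```

Keeping the group sizes as data, rather than writing one function per pattern, means a new readout style is a new tuple instead of new code.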

Safety Model

SpokenForms is synthetic-first.

Entity              Default mode          Real-world data?
ssn                 reserved_or_invalid   No
credit_card_number  payment_test_numbers  No

Unsafe generation flags are reserved in the CLI but intentionally rejected in v0.1:

spokenforms build --entity ssn --allow-potentially-real-sensitive-values
# exits with an error
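
For context on the reserved_or_invalid mode: the Social Security Administration does not issue numbers with area 000, 666, or 900 through 999, with group 00, or with serial 0000. The sketch below checks a value against those public rules. Whether SpokenForms applies exactly these rules is an assumption, and is_never_issued_ssn is a hypothetical helper, not part of the package.

```python
import re

# Canonical shape used in this README: AAA-GG-SSSS (area, group, serial).
SSN_SHAPE = re.compile(r"([0-9]{3})-([0-9]{2})-([0-9]{4})")


def is_never_issued_ssn(value: str) -> bool:
    """True when the value falls in a range that is never issued as an SSN."""
    match = SSN_SHAPE.fullmatch(value)
    if match is None:
        raise ValueError(f"not in AAA-GG-SSSS form: {value!r}")
    area, group, serial = match.groups()
    return (
        area in ("000", "666")
        or area.startswith("9")  # areas 900 through 999
        or group == "00"
        or serial == "0000"
    )


print(is_never_issued_ssn("900-12-3456"))
```

The example value used throughout this README, 900-12-3456, falls in the 900 through 999 area range.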

CLI

spokenforms init --output demo

spokenforms build \
  --config config.yaml \
  --entity credit_card_number \
  --provider mock \
  --num-values 3 \
  --target-per-pattern 2 \
  --output-dir runs/cards

spokenforms stats runs/cards/dataset.jsonl
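
When several entities need the same settings, the build command can be assembled in a script. This sketch only builds and prints the argument lists, using the flags shown above. The helper build_command and the runs/<entity> output layout are illustrative choices, not something the CLI requires.

```python
import shlex


def build_command(
    entity: str,
    output_dir: str,
    *,
    config: str = "config.yaml",
    provider: str = "mock",
    num_values: int = 3,
    target_per_pattern: int = 2,
) -> list[str]:
    """Assemble a `spokenforms build` argument list (hypothetical helper)."""
    return [
        "spokenforms", "build",
        "--config", config,
        "--entity", entity,
        "--provider", provider,
        "--num-values", str(num_values),
        "--target-per-pattern", str(target_per_pattern),
        "--output-dir", output_dir,
    ]


for entity in ("ssn", "credit_card_number"):
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(build_command(entity, f"runs/{entity}")))
```

Each list can be passed to subprocess.run(command, check=True) to execute the build, assuming spokenforms is on the PATH.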

Python API

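from spokenforms.config import apply_cli_overrides, default_config
from spokenforms.generation import run_pipeline
from spokenforms.models import ProviderName
from spokenforms.providers import create_provider

# Start from the default config and override the same settings the CLI flags set.
config = apply_cli_overrides(
    default_config(),
    provider=ProviderName.MOCK,
    num_values=2,
    target_per_pattern=2,
    output_dir=None,
)

# The mock provider runs offline, so no credentials are needed.
provider = create_provider(ProviderName.MOCK, "mock")

# Run the pipeline for the built-in ssn entity.
result = run_pipeline("readme-demo", "ssn", config, provider)

# Print the transcript of the first generated record.
print(result.records[0].transcript)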

Test Cases

The repository ships with checks for:

Area           Covered behavior
CLI            init, build, stats, unsafe flag rejection
Generation     mock value generation, pattern application, balancing
Validation     deterministic consistency checks
Normalization  numbers, alphanumeric values, SSN, card numbers, dates, names, enums, booleans
Safety         synthetic-only sensitive policy enforcement
Storage        JSONL, CSV, Parquet, manifest, stats
Packaging      Ruff, mypy strict mode, coverage, wheel/sdist build

Run the full suite:

uv run ruff check .
uv run ruff format --check .
uv run mypy src tests
uv run pytest
uv build

Current local verification:

ruff:   passing
format: passing
mypy:   passing
pytest: 5 passed, 94.92% coverage
build:  wheel and sdist generated

Release Slugs

Release  Slug                 Theme
0.1.0    safe-synthetic-seed  Offline mock generation, typed package, synthetic SSN/card guardrails
0.2.0    provider-lift        OpenAI provider, richer prompt contracts, retry/cache hardening
0.3.0    bench-runner         Evaluation harnesses, train/validation/test workflows, extraction prompt baselines
1.0.0    voice-data-foundry   Stable API for production synthetic transcript generation

Roadmap

  • OpenAI provider implementation.
  • User-defined entity and pattern YAML loading.
  • Richer consistency checking for correction-style transcripts.
  • DSPy/SIMBA prompt optimization hooks.
  • More locale-specific readout styles.
  • Larger generated example gallery.

Project Layout

src/spokenforms/
  cli.py
  config.py
  models.py
  providers/
  generation/
  patterns/
  entities/
  normalizers/
  validators/
  storage/
  stats/
tests/
examples/

License

MIT. See LICENSE.
