spokenforms

Generate synthetic spoken transcript variants for structured values and alphanumeric sequences.

SpokenForms

Synthetic spoken transcript variants for structured entity extraction.

The Problem

Voice AI systems do not fail only because they miss words. They also fail because structured values are spoken in messy, inconsistent, human ways.

ZIP code:
  94101
  nine four one zero one
  nine four one oh one
  ninety four one oh one

SSN:
  900-12-3456
  nine zero zero, one two, three four five six
  nine zero zero dash one two dash three four five six

Credit card:
  4242 4242 4242 4242
  four two four two, four two four two, four two four two, four two four two

Real phone transcripts are private, expensive, sparse, and slow to annotate. SpokenForms lets teams generate synthetic direct-answer data before production transcripts exist, then validate every transcript back to the intended canonical value.
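As a rough illustration of why one canonical value fans out into several spoken forms, the sketch below enumerates digit-by-digit readings of a ZIP code. The `DIGIT_WORDS` table and `spoken_digit_variants` are hypothetical names for this example, not part of the spokenforms API, and the sketch covers only the "zero" versus "oh" choice.

```python
from itertools import product

# Hypothetical word table for this sketch; not taken from the spokenforms source.
DIGIT_WORDS = {
    "0": ("zero", "oh"),
    "1": ("one",),
    "2": ("two",),
    "3": ("three",),
    "4": ("four",),
    "5": ("five",),
    "6": ("six",),
    "7": ("seven",),
    "8": ("eight",),
    "9": ("nine",),
}


def spoken_digit_variants(value: str) -> list[str]:
    """Return every digit-by-digit reading of the digits found in `value`."""
    # Separators such as dashes or spaces are skipped.
    digits = [ch for ch in value if ch in DIGIT_WORDS]
    choices = [DIGIT_WORDS[d] for d in digits]
    return [" ".join(words) for words in product(*choices)]


print(spoken_digit_variants("94101"))
# ['nine four one zero one', 'nine four one oh one']
```

Readings such as "ninety four one oh one" would need an extra number-grouping rule, which this sketch leaves out.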

What It Builds

canonical structured value
  -> verbalization patterns
  -> phone-call-style transcript candidates
  -> deterministic consistency validation
  -> JSONL / CSV / Parquet dataset

SpokenForms is built around a LingVarBench-style three-stage pipeline:

Stage	Job
Value Generator	Creates canonical values for an entity schema.
Transcript Generator	Applies reusable and entity-specific spoken patterns.
Consistency Checker	Keeps only transcripts recoverable to the intended value.
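The three stages in the table can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the package's implementation: `generate_values`, `generate_transcripts`, `recover_digits`, and `keep_consistent` are hypothetical names, the entity is a five-digit ZIP code, and the last candidate is deliberately unrecoverable so the checker has something to reject.

```python
import random

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]
WORD_TO_DIGIT = {word: str(i) for i, word in enumerate(DIGIT_WORDS)}
WORD_TO_DIGIT["oh"] = "0"


def generate_values(count: int, seed: int = 0) -> list[str]:
    """Stage 1 (value generator): seeded random five-digit values."""
    rng = random.Random(seed)
    return ["".join(rng.choice("0123456789") for _ in range(5))
            for _ in range(count)]


def generate_transcripts(value: str) -> list[str]:
    """Stage 2 (transcript generator): apply a few spoken patterns."""
    spoken = " ".join(DIGIT_WORDS[int(d)] for d in value)
    return [
        spoken,                       # direct_and_simple
        f"Um, it is {spoken}.",       # filler_words
        f"The ZIP code is {spoken}.",  # formal
        "I am not sure, sorry.",      # deliberately unrecoverable
    ]


def recover_digits(transcript: str) -> str:
    """Map digit words back to digits, ignoring every other token."""
    tokens = transcript.lower().replace(",", " ").replace(".", " ").split()
    return "".join(WORD_TO_DIGIT[t] for t in tokens if t in WORD_TO_DIGIT)


def keep_consistent(value: str, candidates: list[str]) -> list[str]:
    """Stage 3 (consistency checker): keep transcripts that recover `value`."""
    return [c for c in candidates if recover_digits(c) == value]


for value in generate_values(2):
    kept = keep_consistent(value, generate_transcripts(value))
    print(value, len(kept), "of 4 candidates kept")
```

The check is deterministic: a transcript stays in the dataset only if normalizing it yields exactly the canonical value it was generated from.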

Install

uv add spokenforms

For local development:

uv sync --all-extras --dev

The OpenAI provider is planned for a later release. When you use it, put credentials in .env. The mock provider works offline and does not need network access.

Quick Start

uv run spokenforms init --output demo
cd demo

uv run spokenforms build \
  --config config.yaml \
  --entity ssn \
  --provider mock \
  --num-values 3 \
  --target-per-pattern 2 \
  --output-dir runs/demo_ssn

Expected output files in runs/demo_ssn:

manifest.json
config.resolved.yaml
values.jsonl
candidates.jsonl
validated.jsonl
dataset.jsonl
dataset.csv
dataset.parquet
stats.json
stats.md
logs.jsonl
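To inspect a run, the JSONL outputs can be read with the standard library alone. The sketch below assumes only that each non-empty line of `dataset.jsonl` is one JSON object, which is what the JSONL format implies; the record fields are not documented here, so the example just prints whatever keys it finds. `load_jsonl` is a hypothetical helper, not part of the package.

```python
import json
from pathlib import Path


def load_jsonl(path: Path) -> list[dict]:
    """Read one JSON object per non-empty line."""
    records = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Path taken from the Quick Start command above.
    dataset = Path("runs/demo_ssn/dataset.jsonl")
    if dataset.exists():
        records = load_jsonl(dataset)
        print(len(records), "records")
        if records:
            print("fields:", sorted(records[0].keys()))
```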

Generated Examples

These examples were generated from this package with:

uv run spokenforms build --entity ssn --provider mock --num-values 1 --target-per-pattern 1
uv run spokenforms build --entity credit_card_number --provider mock --num-values 1 --target-per-pattern 1

Synthetic SSN

Ground truth	Pattern	Generated transcript
`900-12-3456`	`direct_and_simple`	`nine zero zero, one two, three four five six`
`900-12-3456`	`filler_words`	`Um, it is nine zero zero, one two, three four five six.`
`900-12-3456`	`formal`	`The Social Security number is nine zero zero, one two, three four five six.`

All SSN records are tagged:

{
  "synthetic_sensitive_value": true,
  "sensitive_type": "ssn",
  "real_world_safe": true,
  "generation_mode": "reserved_or_invalid"
}
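A downstream consumer may want to refuse any record that lacks these tags. The sketch below is one way to do that; it assumes the four tags sit at the top level of each record, which this README does not confirm, and `has_safety_tags` is a hypothetical helper.

```python
# Tag values copied from the SSN example above.
REQUIRED_SSN_TAGS = {
    "synthetic_sensitive_value": True,
    "sensitive_type": "ssn",
    "real_world_safe": True,
    "generation_mode": "reserved_or_invalid",
}


def has_safety_tags(record: dict, required: dict = REQUIRED_SSN_TAGS) -> bool:
    """Return True only if every required tag is present with the expected value."""
    return all(record.get(key) == value for key, value in required.items())


tagged = dict(REQUIRED_SSN_TAGS, transcript="nine zero zero, one two, three four five six")
untagged = {"transcript": "nine zero zero, one two, three four five six"}

print(has_safety_tags(tagged))    # True
print(has_safety_tags(untagged))  # False
```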

Payment-Test Card Number

Ground truth	Pattern	Generated transcript
`4242424242424242`	`direct_and_simple`	`four two four two, four two four two, four two four two, four two four two`
`4242424242424242`	`filler_words`	`Um, it is four two four two, four two four two, four two four two, four two four two.`
`4242424242424242`	`card_correction`	`one two four, sorry, four two four two, four two four two, four two four two, four two four two.`

All card records are tagged:

{
  "synthetic_sensitive_value": true,
  "sensitive_type": "credit_card_number",
  "real_world_safe": true,
  "generation_mode": "payment_test_numbers"
}
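The `card_correction` row above shows why a checker cannot simply collect every digit word: the false start "one two four" would corrupt the result. One possible approach, sketched below, is to keep only the text after the last correction marker before normalizing. The marker list and function names are hypothetical, and the sketch assumes the speaker restarts the whole value after correcting, as in the example row. It is not a description of how the package validates corrections.

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]
WORD_TO_DIGIT = {word: str(i) for i, word in enumerate(DIGIT_WORDS)}
WORD_TO_DIGIT["oh"] = "0"

# Hypothetical correction markers for this sketch.
CORRECTION_MARKERS = ("sorry", "i mean", "no wait")


def text_after_last_correction(transcript: str) -> str:
    """Drop everything up to and including the last correction marker."""
    lowered = transcript.lower()
    cut = 0
    for marker in CORRECTION_MARKERS:
        index = lowered.rfind(marker)
        if index != -1:
            cut = max(cut, index + len(marker))
    return lowered[cut:]


def recover_digits(text: str) -> str:
    """Map digit words back to digits, ignoring every other token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return "".join(WORD_TO_DIGIT[t] for t in tokens if t in WORD_TO_DIGIT)


def matches_card(transcript: str, canonical: str) -> bool:
    """Check the corrected reading against the canonical card number."""
    return recover_digits(text_after_last_correction(transcript)) == canonical


transcript = ("one two four, sorry, four two four two, four two four two, "
              "four two four two, four two four two.")
print(matches_card(transcript, "4242424242424242"))        # True
print(recover_digits(transcript) == "4242424242424242")    # False
```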

Generated Stats

SSN run:
  total records: 16
  total patterns: 16
  validation pass rate: 100.00%
  sensitive synthetic records: 16

Credit-card run:
  total records: 21
  total patterns: 21
  validation pass rate: 100.00%
  sensitive synthetic records: 21
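Numbers of this shape can be derived from a list of records. The sketch below shows one plausible computation; the field names `pattern` and `valid` are assumptions for the example and may not match what `stats.json` or the dataset actually contain.

```python
def summarize(records: list[dict]) -> dict:
    """Count records, distinct patterns, and the share that passed validation."""
    total = len(records)
    passed = sum(1 for record in records if record["valid"])
    patterns = {record["pattern"] for record in records}
    rate = f"{100 * passed / total:.2f}%" if total else "n/a"
    return {
        "total records": total,
        "total patterns": len(patterns),
        "validation pass rate": rate,
    }


sample = [
    {"pattern": "direct_and_simple", "valid": True},
    {"pattern": "filler_words", "valid": True},
]
print(summarize(sample))
# {'total records': 2, 'total patterns': 2, 'validation pass rate': '100.00%'}
```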

Built-In Entities

confirmation_code       account_number          member_id
claim_id                policy_number           zip_code
date_of_birth           full_name               phone_number
ssn                     credit_card_number      boolean_answer
enum_answer             multi_select_answer     pain_rating
respiratory_issues      hearing_issues

Pattern Inventory

General patterns:

direct_and_simple  filler_words  hesitation  correction  repetition
formal             casual        polite      confident   uncertain
confirmation       digit_by_digit grouped_two grouped_four nato_letters

Sensitive entity patterns:

ssn_grouped_3_2_4
ssn_digit_by_digit
ssn_with_dashes
ssn_correction
ssn_repetition_for_confirmation

card_grouped_4_4_4_4
card_digit_by_digit
card_with_spaces
card_last_four_repetition
card_correction
card_issuer_style_grouping
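The grouped patterns differ mainly in where the pauses fall. As a sketch of the idea behind names like `ssn_grouped_3_2_4` and `card_grouped_4_4_4_4`, the function below reads digits in fixed-size groups separated by commas. `speak_grouped` is a hypothetical helper, not the package's pattern implementation.

```python
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def speak_grouped(value: str, group_sizes: tuple[int, ...]) -> str:
    """Read the digits of `value` in groups, with a comma between groups."""
    # Dashes and spaces in the canonical value are ignored.
    digits = [ch for ch in value if ch in "0123456789"]
    if sum(group_sizes) != len(digits):
        raise ValueError("group sizes must cover every digit exactly once")
    groups = []
    start = 0
    for size in group_sizes:
        chunk = digits[start:start + size]
        groups.append(" ".join(DIGIT_WORDS[int(d)] for d in chunk))
        start += size
    return ", ".join(groups)


print(speak_grouped("900-12-3456", (3, 2, 4)))
# nine zero zero, one two, three four five six
print(speak_grouped("4242 4242 4242 4242", (4, 4, 4, 4)))
# four two four two, four two four two, four two four two, four two four two
```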

Safety Model

SpokenForms is synthetic-first.

Entity	Default mode	Real-world data?
`ssn`	`reserved_or_invalid`	No
`credit_card_number`	`payment_test_numbers`	No

Unsafe generation flags are reserved in the CLI but intentionally rejected in v0.1:

spokenforms build --entity ssn --allow-potentially-real-sensitive-values
# exits with an error
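A "reserved but rejected" flag can be modeled as an option the parser accepts and then refuses. The sketch below uses `argparse` from the standard library; whether the real CLI is built on `argparse` is not stated here, and `build_parser` and `parse_args` are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="spokenforms-sketch")
    parser.add_argument("--entity", required=True)
    # The flag is declared so it parses, but it is never honored.
    parser.add_argument(
        "--allow-potentially-real-sensitive-values",
        action="store_true",
    )
    return parser


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.allow_potentially_real_sensitive_values:
        # parser.error prints a usage message and exits with status 2.
        parser.error("unsafe generation is not supported in this version")
    return args


print(parse_args(["--entity", "ssn"]).entity)  # ssn
```

Declaring the flag keeps the command-line surface stable for later versions while making the current behavior explicit.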

CLI

spokenforms init --output demo

spokenforms build \
  --config config.yaml \
  --entity credit_card_number \
  --provider mock \
  --num-values 3 \
  --target-per-pattern 2 \
  --output-dir runs/cards

spokenforms stats runs/cards/dataset.jsonl

Python API

from spokenforms.config import apply_cli_overrides, default_config
from spokenforms.generation import run_pipeline
from spokenforms.models import ProviderName
from spokenforms.providers import create_provider

# Start from the default config and override the settings the CLI flags control.
config = apply_cli_overrides(
    default_config(),
    provider=ProviderName.MOCK,
    num_values=2,
    target_per_pattern=2,
    output_dir=None,
)
# The mock provider works offline.
provider = create_provider(ProviderName.MOCK, "mock")
# Run the pipeline for the built-in ssn entity.
result = run_pipeline("readme-demo", "ssn", config, provider)

# Print the transcript of the first generated record.
print(result.records[0].transcript)

Test Cases

The repository ships with checks for:

Area	Covered behavior
CLI	`init`, `build`, `stats`, unsafe flag rejection
Generation	mock value generation, pattern application, balancing
Validation	deterministic consistency checks
Normalization	numbers, alphanumeric values, SSN, card numbers, dates, names, enums, booleans
Safety	synthetic-only sensitive policy enforcement
Storage	JSONL, CSV, Parquet, manifest, stats
Packaging	Ruff, mypy strict mode, coverage, wheel/sdist build

Run the full suite:

uv run ruff check .
uv run ruff format --check .
uv run mypy src tests
uv run pytest
uv build

Current local verification:

ruff:   passing
format: passing
mypy:   passing
pytest: 5 passed, 94.92% coverage
build:  wheel and sdist generated

Release Slugs

Release	Slug	Theme
`0.1.0`	`safe-synthetic-seed`	Offline mock generation, typed package, synthetic SSN/card guardrails
`0.2.0`	`provider-lift`	OpenAI provider, richer prompt contracts, retry/cache hardening
`0.3.0`	`bench-runner`	Evaluation harnesses, train/validation/test workflows, extraction prompt baselines
`1.0.0`	`voice-data-foundry`	Stable API for production synthetic transcript generation

Roadmap

- OpenAI provider implementation.
- User-defined entity and pattern YAML loading.
- Richer consistency checking for correction-style transcripts.
- DSPy/SIMBA prompt optimization hooks.
- More locale-specific readout styles.
- Larger generated example gallery.

Project Layout

src/spokenforms/
  cli.py
  config.py
  models.py
  providers/
  generation/
  patterns/
  entities/
  normalizers/
  validators/
  storage/
  stats/
tests/
examples/

License

MIT. See LICENSE.

Project details

This version

0.1.0

May 26, 2026

Download files

Source Distribution

spokenforms-0.1.0.tar.gz (203.8 kB)

Uploaded May 26, 2026 Source

Built Distribution

spokenforms-0.1.0-py3-none-any.whl (38.2 kB)

Uploaded May 26, 2026 Python 3

File details

Details for the file spokenforms-0.1.0.tar.gz.

File metadata

Download URL: spokenforms-0.1.0.tar.gz
Upload date: May 26, 2026
Size: 203.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for spokenforms-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`d02bb874fabc3d455d241c0e7f0901a4e6e05f67b9cb2e7a9b7d14276038825a`
MD5	`1177c4708b581a160823a2cf0fd135c5`
BLAKE2b-256	`73453e7f4d27b469156468cc2530d3ecb9bcf173ca41c588a1e266a30dec656f`

Provenance

The following attestation bundles were made for spokenforms-0.1.0.tar.gz:

Publisher: publish.yml on spokenforms/spokenforms

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: spokenforms-0.1.0.tar.gz
- Subject digest: d02bb874fabc3d455d241c0e7f0901a4e6e05f67b9cb2e7a9b7d14276038825a
- Sigstore transparency entry: 1631665462
- Sigstore integration time: May 26, 2026
Source repository:
- Permalink: spokenforms/spokenforms@ec4d448366ba7b36117e6d8120b641e5c3d02fcf
- Branch / Tag: refs/heads/main
- Owner: https://github.com/spokenforms
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ec4d448366ba7b36117e6d8120b641e5c3d02fcf
- Trigger Event: workflow_dispatch

File details

Details for the file spokenforms-0.1.0-py3-none-any.whl.

File metadata

Download URL: spokenforms-0.1.0-py3-none-any.whl
Upload date: May 26, 2026
Size: 38.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for spokenforms-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ec648cd6719919e19d9c0dce7230f230b6b5134933551eebbe3edcec085283df`
MD5	`31d9f8b50272226ec5222be8f6ebd37c`
BLAKE2b-256	`8ae664263650abff065ec1ee2706a340f94e4ac958b98871a60fb722d7127970`

Provenance

The following attestation bundles were made for spokenforms-0.1.0-py3-none-any.whl:

Publisher: publish.yml on spokenforms/spokenforms

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: spokenforms-0.1.0-py3-none-any.whl
- Subject digest: ec648cd6719919e19d9c0dce7230f230b6b5134933551eebbe3edcec085283df
- Sigstore transparency entry: 1631665471
- Sigstore integration time: May 26, 2026
Source repository:
- Permalink: spokenforms/spokenforms@ec4d448366ba7b36117e6d8120b641e5c3d02fcf
- Branch / Tag: refs/heads/main
- Owner: https://github.com/spokenforms
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ec4d448366ba7b36117e6d8120b641e5c3d02fcf
- Trigger Event: workflow_dispatch
