
corpusgen

Language-agnostic framework for generating and evaluating speech corpora with maximal phoneme coverage.

corpusgen helps you build phonetically-balanced text corpora for speech synthesis (TTS), speech recognition (ASR), and clinical speech assessment — in any language.

Features

  • Evaluate any text corpus for phoneme, diphone, or triphone coverage

  • PHOIBLE integration — phoneme inventories for 2,186 languages (3,020 inventories)

  • Grapheme-to-phoneme via espeak-ng for 100+ languages

  • Espeak ↔ PHOIBLE mapping — seamless bridge between G2P and phonological databases

  • Structured reports — three verbosity levels, JSON export, JSON-LD-EX compatibility

  • 40-language test suite — validated across 12 language families

  • 6 selection algorithms for corpus optimization:

    • Greedy Set Cover — ln(n)+1 approximation, the standard workhorse
    • CELF — lazy-evaluation speedup; identical results to Greedy, up to 700× faster
    • Stochastic Greedy — (1-1/e-ε) approximation, scales to massive corpora
    • ILP — exact optimal solutions via Integer Linear Programming (ground truth)
    • Distribution-Aware — KL-divergence minimization for frequency matching
    • NSGA-II — multi-objective Pareto optimization (coverage × cost × distribution)
  • Phoneme weighting — uniform, frequency-inverse, and linguistic class strategies

  • Phon-CTG generation framework — orchestrated corpus generation with pluggable backends:

    • Repository backend — select from sentence pools (pre-phonemized, raw text, or HuggingFace datasets)
    • LLM API backend — generate targeted sentences via OpenAI/Anthropic/Ollama (BYO API key)
    • Local model backend — HuggingFace transformers with CUDA auto-detect and 4-bit/8-bit quantization
  • Phon-DATG — inference-time logit steering for phonetically-targeted local generation

  • Phon-RL — PPO-based policy fine-tuning with composite phonetic reward (custom implementation, no trl dependency)
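The Greedy Set Cover approach at the heart of the first selector can be illustrated in a few lines. This is a generic sketch of the set-cover idea, not the library's implementation; the function and variable names are invented for illustration:

```python
def greedy_set_cover(candidates, targets):
    """Repeatedly pick the item covering the most still-uncovered targets.

    candidates: dict mapping item id -> set of units (e.g. phonemes) it contains
    targets: set of units we want covered
    Returns (selected item ids, targets left uncovered).
    """
    uncovered = set(targets)
    selected = []
    while uncovered:
        # Marginal gain = how many uncovered targets this candidate adds
        best = max(candidates, key=lambda c: len(candidates[c] & uncovered))
        gain = candidates[best] & uncovered
        if not gain:
            break  # no remaining candidate covers anything new
        selected.append(best)
        uncovered -= gain
    return selected, uncovered

sents = {
    "fox": {"f", "ɒ", "k", "s"},
    "dog": {"d", "ɒ", "g"},
    "cat": {"k", "æ", "t"},
}
picked, missing = greedy_set_cover(sents, {"f", "ɒ", "k", "s", "d", "g"})
# picked == ["fox", "dog"]; missing == set()
```

CELF and Stochastic Greedy are optimizations of this same loop: CELF avoids re-scoring candidates whose gain cannot have improved, and Stochastic Greedy scores only a random subsample per iteration.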

Coming Soon

  • CLI — command-line interface for all operations

Prerequisites

espeak-ng (required)

corpusgen uses espeak-ng for grapheme-to-phoneme conversion. Install it before using corpusgen.

Windows
  1. Download the latest .msi installer from espeak-ng releases
  2. Run the installer (default path: C:\Program Files\eSpeak NG\)
  3. Set the environment variable so Python can find the shared library:
[Environment]::SetEnvironmentVariable("PHONEMIZER_ESPEAK_LIBRARY", "C:\Program Files\eSpeak NG\libespeak-ng.dll", "User")
  4. Restart your terminal and verify:
espeak-ng --version
macOS
brew install espeak-ng
Linux (Ubuntu/Debian)
sudo apt-get update && sudo apt-get install -y espeak-ng
Docker / CI
RUN apt-get update && apt-get install -y espeak-ng && rm -rf /var/lib/apt/lists/*

PHOIBLE data (recommended)

To use PHOIBLE phoneme inventories (2,186 languages), download the data on first use:

from corpusgen.inventory import PhoibleDataset
PhoibleDataset().download()  # cached at ~/.corpusgen/phoible.csv (~24 MB)

This only needs to be done once.

Installation

From PyPI

pip install corpusgen

Development setup

git clone https://github.com/jemsbhai/corpusgen.git
cd corpusgen
poetry install
poetry run pytest

With local model support (GPU recommended)

For Phon-RL training and Phon-DATG logit steering with local models:

# 1. Install corpusgen with local model dependencies
poetry install --with local

# 2. IMPORTANT: Replace CPU torch with CUDA torch for GPU acceleration.
#    The default Poetry install pulls CPU-only torch from PyPI.
#    For NVIDIA GPUs (CUDA 12.1):
pip install torch --index-url https://download.pytorch.org/whl/cu121 --force-reinstall

# Verify GPU is available:
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"N/A\"}')"

Note: Check pytorch.org/get-started for the correct CUDA version matching your driver. Common options: cu118, cu121, cu124.

Quick Start

Evaluate a corpus for phoneme coverage

import corpusgen

report = corpusgen.evaluate(
    ["The quick brown fox jumps over the lazy dog.",
     "She sells seashells by the seashore.",
     "Pack my box with five dozen liquor jugs."],
    language="en-us",
    target_phonemes="phoible",
)

print(report.render())
print(report.coverage)           # 0.65
print(report.missing_phonemes)   # {'ʒ', 'ð', 'θ', ...}

Select optimal sentences from a candidate pool

import corpusgen

candidates = [
    "The quick brown fox jumps over the lazy dog.",
    "She sells seashells by the seashore.",
    "Peter Piper picked a peck of pickled peppers.",
    "How much wood would a woodchuck chuck?",
    "To be or not to be, that is the question.",
]

result = corpusgen.select_sentences(
    candidates,
    language="en-us",
    algorithm="greedy",  # or "celf", "stochastic", "ilp", "distribution", "nsga2"
)

print(f"Selected {result.num_selected} of {len(candidates)} sentences")
print(f"Coverage: {result.coverage:.1%}")

Generate a corpus from a sentence pool (Phon-CTG + Repository)

from corpusgen.generate.phon_ctg.targets import PhoneticTargetInventory
from corpusgen.generate.phon_ctg.scorer import PhoneticScorer
from corpusgen.generate.phon_ctg.loop import GenerationLoop, StoppingCriteria
from corpusgen.generate.backends.repository import RepositoryBackend
from corpusgen.g2p.manager import G2PManager

# 1. Phonemize a sentence pool
g2p = G2PManager()
sentences = ["The cat sat on the mat.", "Big dogs bark loudly.", ...]
results = g2p.phonemize_batch(sentences, language="en-us")
pool = [
    {"text": s, "phonemes": r.phonemes}
    for s, r in zip(sentences, results) if r.phonemes
]

# 2. Set up targets, scorer, and backend
targets = PhoneticTargetInventory(
    target_phonemes=["p", "b", "t", "d", "k", "g"],
    unit="phoneme",
)
scorer = PhoneticScorer(targets=targets, coverage_weight=1.0)
backend = RepositoryBackend(pool=pool)

# 3. Run the generation loop
loop = GenerationLoop(
    backend=backend,
    targets=targets,
    scorer=scorer,
    stopping_criteria=StoppingCriteria(
        target_coverage=0.9,
        max_sentences=20,
    ),
)
result = loop.run()

print(f"Generated {result.num_generated} sentences, coverage: {result.coverage:.1%}")

Generate with an LLM API (Phon-CTG + LLM)

from corpusgen.generate.backends.llm_api import LLMBackend

# Requires: poetry install --with llm
# Set your API key: export OPENAI_API_KEY=...
backend = LLMBackend(
    model="gpt-4o-mini",
    language="en-us",
)

# Use with the same GenerationLoop as above
loop = GenerationLoop(
    backend=backend,
    targets=targets,
    scorer=scorer,
    stopping_criteria=StoppingCriteria(target_coverage=0.9),
)
result = loop.run()

Fine-tune a model with phonetic reward (Phon-RL)

from corpusgen.generate.phon_ctg.targets import PhoneticTargetInventory
from corpusgen.generate.phon_rl.reward import PhoneticReward
from corpusgen.generate.phon_rl.trainer import PhonRLTrainer, TrainingConfig

# Requires: poetry install --with local

# 1. Define targets and reward
targets = PhoneticTargetInventory(
    target_phonemes=["p", "b", "t", "d", "k"],
    unit="phoneme",
)
reward = PhoneticReward(targets=targets, coverage_weight=1.0)

# 2. Configure PPO training
config = TrainingConfig(
    model_name="gpt2",
    num_steps=100,
    learning_rate=1e-5,
    kl_coeff=0.1,
    use_peft=True,     # LoRA for parameter-efficient training
    peft_r=8,
    peft_alpha=16,
    device=None,        # auto-detect GPU
)

# 3. Train with dynamic prompts that adapt to coverage gaps
def make_prompt(targets):
    missing = targets.next_targets(5)
    return f"Write a sentence using these sounds: {', '.join(missing)}"

trainer = PhonRLTrainer(reward=reward, config=config)
result = trainer.train(prompt_fn=make_prompt)

print(f"Final coverage: {result.final_coverage:.1%}")
trainer.save_checkpoint("./phon_rl_checkpoint")

Use PHOIBLE inventories directly

from corpusgen import get_inventory

inv = get_inventory("en-us")
print(inv.language_name)          # 'English'
print(inv.consonants)             # ['p', 'b', 't', 'd', 'k', ...]
print(inv.vowels)                 # ['iː', 'ɪ', 'ɛ', 'æ', ...]

# Query by distinctive features
nasals = inv.segments_with_feature("nasal", "+")
print([s.phoneme for s in nasals])  # ['m', 'n', 'ŋ']

Evaluate with diphone or triphone coverage

import corpusgen

report = corpusgen.evaluate(
    ["The quick brown fox jumps."],
    language="en-us",
    target_phonemes="phoible",
    unit="diphone",
)
print(f"Diphone coverage: {report.coverage:.1%}")

Export reports

import corpusgen

report = corpusgen.evaluate(
    ["The quick brown fox."],
    language="en-us",
    target_phonemes="phoible",
)

# JSON
print(report.to_json(indent=2))

# JSON-LD (linked data)
doc = report.to_jsonld_ex()

# Human-readable at different verbosity levels
from corpusgen.evaluate.report import Verbosity
print(report.render(verbosity=Verbosity.MINIMAL))
print(report.render(verbosity=Verbosity.NORMAL))
print(report.render(verbosity=Verbosity.VERBOSE))

Architecture

corpusgen/
├── g2p/                  # Grapheme-to-phoneme conversion
│   ├── manager.py        # G2PManager — multi-backend G2P (espeak-ng)
│   └── result.py         # G2PResult — phonemes, diphones, triphones
├── coverage/
│   └── tracker.py        # CoverageTracker — phoneme/diphone/triphone tracking
├── evaluate/
│   ├── evaluate.py       # evaluate() — top-level API
│   └── report.py         # EvaluationReport, Verbosity
├── inventory/
│   ├── models.py         # Segment (38 features), Inventory
│   ├── phoible.py        # PhoibleDataset — PHOIBLE loader/cache/query
│   └── mapping.py        # EspeakMapping — espeak ↔ ISO 639-3
├── select/
│   ├── greedy.py         # GreedySelector
│   ├── celf.py           # CELFSelector (lazy evaluation)
│   ├── stochastic.py     # StochasticGreedySelector
│   ├── ilp.py            # ILPSelector (exact, optional: pulp)
│   ├── distribution.py   # DistributionAwareSelector (KL-divergence)
│   └── nsga2.py          # NSGA2Selector (Pareto, optional: pymoo)
├── weights/              # Phoneme weighting strategies
├── generate/
│   ├── phon_ctg/         # Orchestration framework
│   │   ├── targets.py    # PhoneticTargetInventory
│   │   ├── scorer.py     # PhoneticScorer (coverage + phonotactic + fluency)
│   │   ├── constraints.py # PhonotacticConstraint ABC + N-gram model
│   │   └── loop.py       # GenerationLoop + StoppingCriteria
│   ├── phon_rl/          # RL-based guidance (PPO)
│   │   ├── reward.py     # PhoneticReward (composite, hierarchical)
│   │   ├── trainer.py    # PhonRLTrainer (custom PPO, no trl)
│   │   ├── policy.py     # PhonRLStrategy (GuidanceStrategy wrapper)
│   │   └── value_head.py # ValueHead (nn.Module for GAE)
│   ├── phon_datg/        # Inference-time logit steering
│   │   ├── attribute_words.py  # Vocabulary phonemization + index
│   │   ├── modulator.py  # Additive logit modulation
│   │   └── graph.py      # DATGStrategy (GuidanceStrategy)
│   ├── guidance.py       # GuidanceStrategy ABC
│   └── backends/         # Pluggable generation engines
│       ├── repository.py # Sentence pool selection
│       ├── llm_api.py    # Multi-provider LLM API (litellm)
│       └── local.py      # HuggingFace transformers + quantization

Language Support

corpusgen supports any language available in both espeak-ng and PHOIBLE:

  • G2P (espeak-ng): 100+ languages
  • Inventories (PHOIBLE): 2,186 languages, 3,020 inventories, 8 sources
  • Tested across: 40 languages, 12 language families, 10+ scripts

The espeak-to-PHOIBLE mapping covers 85+ languages with automatic macrolanguage resolution (e.g., ms → Standard Malay, sw → Swahili).
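The macrolanguage fallback can be pictured as a lookup with a representative-language table. The sketch below is purely illustrative of the idea (the table contents and function name are hypothetical, not the library's actual mapping data):

```python
# ISO 639-3 macrolanguage -> representative individual language.
# Entries here are illustrative examples, not the library's tables.
MACRO_TO_INDIVIDUAL = {
    "msa": "zsm",  # Malay (macrolanguage) -> Standard Malay
    "swa": "swh",  # Swahili (macrolanguage) -> Swahili (individual)
}

def resolve(iso639_3: str) -> str:
    """Resolve a macrolanguage code to an individual language;
    pass individual-language codes through unchanged."""
    return MACRO_TO_INDIVIDUAL.get(iso639_3, iso639_3)
```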

Reproducibility

For reproducible results across machines:

  1. Pin the corpusgen version in your dependency file
  2. Pin the espeak-ng version: record the output of espeak-ng --version in your experiment logs
  3. Commit poetry.lock: it pins all transitive dependencies
  4. Record the PHOIBLE version: note the download date of ~/.corpusgen/phoible.csv
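One lightweight way to act on the checklist above is to write a provenance record next to your experiment outputs. A minimal sketch (the dictionary keys are hypothetical; adapt them to your own logging setup):

```python
import datetime
import json
import subprocess
import sys

# Capture toolchain versions alongside experiment outputs.
try:
    espeak_version = subprocess.run(
        ["espeak-ng", "--version"], capture_output=True, text=True
    ).stdout.strip()
except FileNotFoundError:
    espeak_version = "not installed"

provenance = {
    "python": sys.version.split()[0],
    "espeak_ng": espeak_version,
    "recorded_at": datetime.date.today().isoformat(),
}
print(json.dumps(provenance, indent=2))
```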

Citation

If you use corpusgen in your research, please cite:

@software{corpusgen2025,
  title={corpusgen: Language-Agnostic Speech Corpus Generation with Maximal Phoneme Coverage},
  author={Syed, Muntaser},
  year={2025},
  url={https://github.com/jemsbhai/corpusgen}
}

License

Apache 2.0 — see LICENSE.
