corpusgen
Language-agnostic framework for generating and evaluating speech corpora with maximal phoneme coverage.
corpusgen helps you build phonetically balanced text corpora for speech synthesis (TTS), speech recognition (ASR), and clinical speech assessment — in any language.
Features
- Evaluate any text corpus for phoneme, diphone, or triphone coverage
- PHOIBLE integration — phoneme inventories for 2,186 languages (3,020 inventories)
- Grapheme-to-phoneme via espeak-ng for 100+ languages
- Espeak ↔ PHOIBLE mapping — seamless bridge between G2P and phonological databases
- Structured reports — three verbosity levels, JSON export, JSON-LD-EX compatibility
- 40-language test suite — validated across 12 language families
- 6 selection algorithms for corpus optimization:
  - Greedy Set Cover — ln(n)+1 approximation, the standard workhorse
  - CELF — lazy evaluation speedup, identical results up to 700× faster
  - Stochastic Greedy — (1-1/e-ε) approximation, scales to massive corpora
  - ILP — exact optimal solutions via Integer Linear Programming (ground truth)
  - Distribution-Aware — KL-divergence minimization for frequency matching
  - NSGA-II — multi-objective Pareto optimization (coverage × cost × distribution)
- Phoneme weighting — uniform, frequency-inverse, and linguistic class strategies
- Phon-CTG generation framework — orchestrated corpus generation with pluggable backends:
  - Repository backend — select from sentence pools (pre-phonemized, raw text, or HuggingFace datasets)
  - LLM API backend — generate targeted sentences via OpenAI/Anthropic/Ollama (BYO API key)
  - Local model backend — HuggingFace transformers with CUDA auto-detect and 4-bit/8-bit quantization
- Phon-DATG — inference-time logit steering for phonetically targeted local generation
- Phon-RL — PPO-based policy fine-tuning with a composite phonetic reward (custom implementation, no trl dependency)
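To make the frequency-inverse weighting idea concrete, here is a minimal generic sketch (not corpusgen's internal implementation): rare phonemes get larger weights so covering them counts for more.

```python
from collections import Counter

def inverse_frequency_weights(phoneme_sequences):
    """Weight each phoneme by the inverse of its corpus frequency,
    so rare phonemes contribute more to a weighted coverage score.
    Generic sketch -- not corpusgen's internal implementation."""
    counts = Counter(p for seq in phoneme_sequences for p in seq)
    total = sum(counts.values())
    return {p: total / c for p, c in counts.items()}

weights = inverse_frequency_weights([["p", "a", "t"], ["p", "a", "k"]])
# 't' and 'k' occur once, 'p' and 'a' twice, so the rare phonemes weigh more
assert weights["t"] > weights["p"]
```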
Coming Soon
- CLI — command-line interface for all operations
Prerequisites
espeak-ng (required)
corpusgen uses espeak-ng for grapheme-to-phoneme conversion. Install it before using corpusgen.
Windows
- Download the latest `.msi` installer from the espeak-ng releases page
- Run the installer (default path: `C:\Program Files\eSpeak NG\`)
- Set the environment variable so Python can find the shared library:

```powershell
[Environment]::SetEnvironmentVariable("PHONEMIZER_ESPEAK_LIBRARY", "C:\Program Files\eSpeak NG\libespeak-ng.dll", "User")
```

- Restart your terminal and verify:

```
espeak-ng --version
```
macOS

```bash
brew install espeak-ng
```

Linux (Ubuntu/Debian)

```bash
sudo apt-get update && sudo apt-get install -y espeak-ng
```

Docker / CI

```dockerfile
RUN apt-get update && apt-get install -y espeak-ng && rm -rf /var/lib/apt/lists/*
```
PHOIBLE data (recommended)
To use PHOIBLE phoneme inventories (2,186 languages), download the data on first use:
```python
from corpusgen.inventory import PhoibleDataset

PhoibleDataset().download()  # cached at ~/.corpusgen/phoible.csv (~24 MB)
```
This only needs to be done once.
Installation
From PyPI
```bash
pip install corpusgen
```

Development setup

```bash
git clone https://github.com/jemsbhai/corpusgen.git
cd corpusgen
poetry install
poetry run pytest
```
With local model support (GPU recommended)
For Phon-RL training and Phon-DATG logit steering with local models:
```bash
# 1. Install corpusgen with local model dependencies
poetry install --with local

# 2. IMPORTANT: Replace CPU torch with CUDA torch for GPU acceleration.
#    The default Poetry install pulls CPU-only torch from PyPI.
#    For NVIDIA GPUs (CUDA 12.1):
pip install torch --index-url https://download.pytorch.org/whl/cu121 --force-reinstall

# Verify GPU is available:
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"N/A\"}')"
```

Note: Check pytorch.org/get-started for the correct CUDA version matching your driver. Common options: cu118, cu121, cu124.
Quick Start
Evaluate a corpus for phoneme coverage
```python
import corpusgen

report = corpusgen.evaluate(
    [
        "The quick brown fox jumps over the lazy dog.",
        "She sells seashells by the seashore.",
        "Pack my box with five dozen liquor jugs.",
    ],
    language="en-us",
    target_phonemes="phoible",
)

print(report.render())
print(report.coverage)          # 0.65
print(report.missing_phonemes)  # {'ʒ', 'ð', 'θ', ...}
```
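As a rough illustration of what the coverage number means (a simplified sketch, not corpusgen's exact formula): it is the fraction of the target inventory attested at least once in the corpus.

```python
def coverage(attested, target):
    """Fraction of target phonemes attested at least once in the corpus."""
    target = set(target)
    return len(target & set(attested)) / len(target)

# 3 of 5 target phonemes attested
print(coverage({"p", "t", "k"}, {"p", "t", "k", "ʒ", "ð"}))  # 0.6
```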
Select optimal sentences from a candidate pool
```python
import corpusgen

candidates = [
    "The quick brown fox jumps over the lazy dog.",
    "She sells seashells by the seashore.",
    "Peter Piper picked a peck of pickled peppers.",
    "How much wood would a woodchuck chuck?",
    "To be or not to be, that is the question.",
]

result = corpusgen.select_sentences(
    candidates,
    language="en-us",
    algorithm="greedy",  # or "celf", "stochastic", "ilp", "distribution", "nsga2"
)

print(f"Selected {result.num_selected} of {len(candidates)} sentences")
print(f"Coverage: {result.coverage:.1%}")
```
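Under the hood, greedy selection repeatedly picks the sentence that adds the most still-uncovered units. A self-contained sketch of the algorithm (independent of corpusgen's GreedySelector, which tracks more state than this):

```python
def greedy_select(candidates):
    """candidates: list of (sentence, phoneme_set). Repeatedly pick the
    sentence covering the most still-uncovered phonemes (ln(n)+1 approx)."""
    covered, selected = set(), []
    remaining = list(candidates)
    while remaining:
        best = max(remaining, key=lambda c: len(c[1] - covered))
        if not (best[1] - covered):
            break  # no remaining candidate adds new phonemes
        selected.append(best[0])
        covered |= best[1]
        remaining.remove(best)
    return selected, covered

sents, cov = greedy_select([
    ("s1", {"p", "a", "t"}),
    ("s2", {"p", "a"}),      # subset of s1's phonemes, never selected
    ("s3", {"k", "i"}),
])
```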
Generate a corpus from a sentence pool (Phon-CTG + Repository)
```python
from corpusgen.generate.phon_ctg.targets import PhoneticTargetInventory
from corpusgen.generate.phon_ctg.scorer import PhoneticScorer
from corpusgen.generate.phon_ctg.loop import GenerationLoop, StoppingCriteria
from corpusgen.generate.backends.repository import RepositoryBackend
from corpusgen.g2p.manager import G2PManager

# 1. Phonemize a sentence pool
g2p = G2PManager()
sentences = ["The cat sat on the mat.", "Big dogs bark loudly.", ...]
results = g2p.phonemize_batch(sentences, language="en-us")
pool = [
    {"text": s, "phonemes": r.phonemes}
    for s, r in zip(sentences, results) if r.phonemes
]

# 2. Set up targets, scorer, and backend
targets = PhoneticTargetInventory(
    target_phonemes=["p", "b", "t", "d", "k", "g"],
    unit="phoneme",
)
scorer = PhoneticScorer(targets=targets, coverage_weight=1.0)
backend = RepositoryBackend(pool=pool)

# 3. Run the generation loop
loop = GenerationLoop(
    backend=backend,
    targets=targets,
    scorer=scorer,
    stopping_criteria=StoppingCriteria(
        target_coverage=0.9,
        max_sentences=20,
    ),
)
result = loop.run()
print(f"Generated {result.num_generated} sentences, coverage: {result.coverage:.1%}")
```
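Conceptually, the loop pulls candidates from the backend, keeps those that add uncovered targets, and stops at the coverage threshold or the sentence cap. A simplified sketch of that control flow (not GenerationLoop's actual implementation, which also applies the scorer's phonotactic and fluency terms):

```python
def run_loop(pool, targets, target_coverage=0.9, max_sentences=20):
    """pool: iterable of (text, phoneme_set); targets: set of target units.
    Accept a candidate only if it adds uncovered targets; stop when the
    coverage threshold or the sentence cap is reached. Simplified sketch."""
    covered, accepted = set(), []
    for text, phonemes in pool:
        gain = (phonemes & targets) - covered
        if not gain:
            continue  # adds nothing new, skip
        accepted.append(text)
        covered |= gain
        if len(covered) / len(targets) >= target_coverage or len(accepted) >= max_sentences:
            break
    return accepted, len(covered) / len(targets)

accepted, cov = run_loop(
    [("A", {"p", "b"}), ("B", {"p"}), ("C", {"t", "d", "k"})],
    targets={"p", "b", "t", "d", "k", "g"},
    target_coverage=0.8,
)
```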
Generate with an LLM API (Phon-CTG + LLM)
```python
from corpusgen.generate.backends.llm_api import LLMBackend

# Requires: poetry install --with llm
# Set your API key: export OPENAI_API_KEY=...
backend = LLMBackend(
    model="gpt-4o-mini",
    language="en-us",
)

# Use with the same GenerationLoop as above
loop = GenerationLoop(
    backend=backend,
    targets=targets,
    scorer=scorer,
    stopping_criteria=StoppingCriteria(target_coverage=0.9),
)
result = loop.run()
```
Fine-tune a model with phonetic reward (Phon-RL)
```python
from corpusgen.generate.phon_ctg.targets import PhoneticTargetInventory
from corpusgen.generate.phon_rl.reward import PhoneticReward
from corpusgen.generate.phon_rl.trainer import PhonRLTrainer, TrainingConfig

# Requires: poetry install --with local

# 1. Define targets and reward
targets = PhoneticTargetInventory(
    target_phonemes=["p", "b", "t", "d", "k"],
    unit="phoneme",
)
reward = PhoneticReward(targets=targets, coverage_weight=1.0)

# 2. Configure PPO training
config = TrainingConfig(
    model_name="gpt2",
    num_steps=100,
    learning_rate=1e-5,
    kl_coeff=0.1,
    use_peft=True,  # LoRA for parameter-efficient training
    peft_r=8,
    peft_alpha=16,
    device=None,  # auto-detect GPU
)

# 3. Train with dynamic prompts that adapt to coverage gaps
def make_prompt(targets):
    missing = targets.next_targets(5)
    return f"Write a sentence using these sounds: {', '.join(missing)}"

trainer = PhonRLTrainer(reward=reward, config=config)
result = trainer.train(prompt_fn=make_prompt)
print(f"Final coverage: {result.final_coverage:.1%}")
trainer.save_checkpoint("./phon_rl_checkpoint")
```
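The reward signal driving PPO is, at its core, how much a generated sentence closes the remaining coverage gap. A generic sketch of that idea (corpusgen's PhoneticReward composes more terms than this single coverage component):

```python
def phonetic_reward(phonemes, targets, covered, coverage_weight=1.0):
    """Reward = fraction of still-missing target phonemes newly covered
    by this sentence, scaled by coverage_weight. Generic sketch."""
    missing = set(targets) - set(covered)
    if not missing:
        return 0.0  # nothing left to cover
    return coverage_weight * len(set(phonemes) & missing) / len(missing)

r = phonetic_reward(["p", "a", "t"], targets={"p", "b", "t", "d", "k"}, covered={"p"})
# missing = {b, t, d, k}; the sentence newly covers {t}, so reward = 0.25
```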
Use PHOIBLE inventories directly
```python
from corpusgen import get_inventory

inv = get_inventory("en-us")
print(inv.language_name)  # 'English'
print(inv.consonants)     # ['p', 'b', 't', 'd', 'k', ...]
print(inv.vowels)         # ['iː', 'ɪ', 'ɛ', 'æ', ...]

# Query by distinctive features
nasals = inv.segments_with_feature("nasal", "+")
print([s.phoneme for s in nasals])  # ['m', 'n', 'ŋ']
```
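A feature query like this boils down to filtering segments by a feature-value pair over PHOIBLE's distinctive-feature vectors. A minimal illustration with a toy feature table (the feature values here are illustrative, not PHOIBLE's full 38-feature encoding):

```python
# Toy feature table: phoneme -> {feature: value}
segments = {
    "m": {"nasal": "+", "labial": "+"},
    "n": {"nasal": "+", "coronal": "+"},
    "p": {"nasal": "-", "labial": "+"},
}

def with_feature(segments, feature, value):
    """Return phonemes whose feature matrix has feature == value."""
    return [s for s, feats in segments.items() if feats.get(feature) == value]

print(with_feature(segments, "nasal", "+"))  # ['m', 'n']
```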
Evaluate with diphone or triphone coverage
```python
import corpusgen

report = corpusgen.evaluate(
    ["The quick brown fox jumps."],
    language="en-us",
    target_phonemes="phoible",
    unit="diphone",
)
print(f"Diphone coverage: {report.coverage:.1%}")
```
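Diphones are adjacent phoneme pairs, so the target space grows roughly quadratically with inventory size; extracting them is a sliding window over the phoneme sequence (a generic sketch, not corpusgen's G2PResult internals):

```python
def diphones(phonemes):
    """Adjacent phoneme pairs, e.g. ['k', 'æ', 't'] -> [('k','æ'), ('æ','t')]."""
    return list(zip(phonemes, phonemes[1:]))

print(diphones(["k", "æ", "t"]))  # [('k', 'æ'), ('æ', 't')]
```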
Export reports
```python
import corpusgen

report = corpusgen.evaluate(
    ["The quick brown fox."],
    language="en-us",
    target_phonemes="phoible",
)

# JSON
print(report.to_json(indent=2))

# JSON-LD (linked data)
doc = report.to_jsonld_ex()

# Human-readable at different verbosity levels
from corpusgen.evaluate.report import Verbosity
print(report.render(verbosity=Verbosity.MINIMAL))
print(report.render(verbosity=Verbosity.NORMAL))
print(report.render(verbosity=Verbosity.VERBOSE))
```
Architecture
```
corpusgen/
├── g2p/                    # Grapheme-to-phoneme conversion
│   ├── manager.py          # G2PManager — multi-backend G2P (espeak-ng)
│   └── result.py           # G2PResult — phonemes, diphones, triphones
├── coverage/
│   └── tracker.py          # CoverageTracker — phoneme/diphone/triphone tracking
├── evaluate/
│   ├── evaluate.py         # evaluate() — top-level API
│   └── report.py           # EvaluationReport, Verbosity
├── inventory/
│   ├── models.py           # Segment (38 features), Inventory
│   ├── phoible.py          # PhoibleDataset — PHOIBLE loader/cache/query
│   └── mapping.py          # EspeakMapping — espeak ↔ ISO 639-3
├── select/
│   ├── greedy.py           # GreedySelector
│   ├── celf.py             # CELFSelector (lazy evaluation)
│   ├── stochastic.py       # StochasticGreedySelector
│   ├── ilp.py              # ILPSelector (exact, optional: pulp)
│   ├── distribution.py     # DistributionAwareSelector (KL-divergence)
│   └── nsga2.py            # NSGA2Selector (Pareto, optional: pymoo)
├── weights/                # Phoneme weighting strategies
├── generate/
│   ├── phon_ctg/           # Orchestration framework
│   │   ├── targets.py      # PhoneticTargetInventory
│   │   ├── scorer.py       # PhoneticScorer (coverage + phonotactic + fluency)
│   │   ├── constraints.py  # PhonotacticConstraint ABC + N-gram model
│   │   └── loop.py         # GenerationLoop + StoppingCriteria
│   ├── phon_rl/            # RL-based guidance (PPO)
│   │   ├── reward.py       # PhoneticReward (composite, hierarchical)
│   │   ├── trainer.py      # PhonRLTrainer (custom PPO, no trl)
│   │   ├── policy.py       # PhonRLStrategy (GuidanceStrategy wrapper)
│   │   └── value_head.py   # ValueHead (nn.Module for GAE)
│   ├── phon_datg/          # Inference-time logit steering
│   │   ├── attribute_words.py  # Vocabulary phonemization + index
│   │   ├── modulator.py    # Additive logit modulation
│   │   └── graph.py        # DATGStrategy (GuidanceStrategy)
│   ├── guidance.py         # GuidanceStrategy ABC
│   └── backends/           # Pluggable generation engines
│       ├── repository.py   # Sentence pool selection
│       ├── llm_api.py      # Multi-provider LLM API (litellm)
│       └── local.py        # HuggingFace transformers + quantization
```
Language Support
corpusgen supports any language available in both espeak-ng and PHOIBLE:
- G2P (espeak-ng): 100+ languages
- Inventories (PHOIBLE): 2,186 languages, 3,020 inventories, 8 sources
- Tested across: 40 languages, 12 language families, 10+ scripts
The espeak-to-PHOIBLE mapping covers 85+ languages with automatic macrolanguage resolution (e.g., ms → Standard Malay, sw → Swahili).
Reproducibility
For reproducible results across machines:
- Pin the corpusgen version in your dependency file
- Pin the espeak-ng version: record the output of `espeak-ng --version` in experiment logs
- Use `poetry.lock`: pins all transitive dependencies
- Record the PHOIBLE version: note the download date of `~/.corpusgen/phoible.csv`
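One way to make the PHOIBLE snapshot verifiable rather than date-based is to log a checksum of the cached file; a hedged sketch (the cache path follows the download note above, and `file_sha256` is a hypothetical helper, not part of corpusgen):

```python
import hashlib
from pathlib import Path

def file_sha256(path):
    """SHA-256 fingerprint of a data file, suitable for experiment logs."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# e.g. log file_sha256(Path.home() / ".corpusgen" / "phoible.csv")
# alongside the corpusgen and espeak-ng versions
```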
Citation
If you use corpusgen in your research, please cite:
```bibtex
@software{corpusgen2025,
  title={corpusgen: Language-Agnostic Speech Corpus Generation with Maximal Phoneme Coverage},
  author={Syed, Muntaser},
  year={2025},
  url={https://github.com/jemsbhai/corpusgen}
}
```
License
Apache 2.0 — see LICENSE.