Vocabulous
A bootstrapping language detection system that builds high-quality dictionaries from noisy training data.
Overview
Vocabulous addresses a common challenge in NLP: building reliable language detection systems when you only have noisy, potentially mislabeled training data. Traditional approaches either require clean, manually curated datasets or sophisticated neural networks. Vocabulous takes a different approach by using iterative dictionary building and progressive data cleaning to bootstrap accurate language detection from imperfect data.
Key Features
- Bootstrapping from Noisy Data: Starts with potentially mislabeled training data and iteratively improves
- Dictionary-Based Detection: Uses word frequency dictionaries for fast, interpretable language detection
- Progressive Data Cleaning: Removes ambiguous and mislabeled samples across training cycles
- Multi-Script Support: Handles both Latin and Arabic scripts with appropriate text normalization
- Configurable Training: Adjustable confidence thresholds and confidence margin parameters for different scenarios
- Model Persistence: Save and load trained models for reuse
Installation
```bash
uv pip install vocabulous
```
Development Installation
```bash
git clone https://github.com/omarkamali/vocabulous.git
cd vocabulous
uv pip install -e ".[dev]"
```
Quick Start
```python
from vocabulous import Vocabulous

# Initialize model
model = Vocabulous()

# Prepare training data (list of dicts with 'text' and 'lang' keys)
train_data = [
    {'text': 'Hello world', 'lang': 'en'},
    {'text': 'Bonjour le monde', 'lang': 'fr'},
    {'text': 'Hola mundo', 'lang': 'es'},
    # ... more training examples
]

# Evaluation data for monitoring training progress
eval_data = [
    {'text': 'Good morning', 'lang': 'en'},
    {'text': 'Bon matin', 'lang': 'fr'},
    {'text': 'Buenos días', 'lang': 'es'},
]

# Train the model (enable parallel cleaning/tokenization when working at scale)
model, report = model.train(
    train_data=train_data,
    eval_data=eval_data,
    cycles=3,
    base_confidence=0.5,
    confidence_margin=0.3,
    clean_workers=4,  # optional multiprocessing for text cleaning
    token_workers=4,  # optional multiprocessing for sentence tokenization
)

# Use for language detection
scores = model._score_sentence("Hello there")
print(scores)  # {'en': 1.0}

# Save the model
model.save('my_model.json')

# Load later
loaded_model = Vocabulous.load('my_model.json')
```
Methodology
The Bootstrapping Approach
Vocabulous implements a novel bootstrapping methodology for language detection:
- Initial Dictionary Building: Creates word-language frequency dictionaries from all training data
- Scoring & Evaluation: Scores evaluation data to measure current model performance
- Data Cleaning: Removes training samples that contradict the current dictionaries
- Iteration: Repeats the process with cleaned data to progressively improve quality
Why This Works
The approach is based on several key insights:
- Majority Signal: Even noisy datasets typically contain more correct than incorrect labels
- Word Uniqueness: Many words are language-specific and provide strong signals
- Progressive Refinement: Each iteration removes the most problematic samples first
- Convergence: The process naturally converges when no more samples can be confidently removed
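The cycle described above can be sketched as a short loop. This is a hypothetical illustration of the idea, not the library's internal code; the function names and scoring formula here are simplified assumptions:

```python
from collections import Counter, defaultdict

def build_dictionaries(samples):
    """Count how often each word appears under each language label."""
    word_lang_freq = defaultdict(Counter)
    for sample in samples:
        for word in sample["text"].lower().split():
            word_lang_freq[word][sample["lang"]] += 1
    return word_lang_freq

def score_sentence(word_lang_freq, text):
    """Score a sentence by normalized per-language word votes."""
    votes = Counter()
    for word in text.lower().split():
        for lang, count in word_lang_freq.get(word, {}).items():
            votes[lang] += count
    total = sum(votes.values())
    return {lang: n / total for lang, n in votes.items()} if total else {}

def bootstrap(samples, cycles=3, base_confidence=0.5):
    """Iteratively rebuild dictionaries and drop contradicting samples."""
    for _ in range(cycles):
        dicts = build_dictionaries(samples)
        kept = []
        for sample in samples:
            scores = score_sentence(dicts, sample["text"])
            # Keep a sample only if the dictionaries agree with its label.
            if scores.get(sample["lang"], 0.0) >= base_confidence:
                kept.append(sample)
        if len(kept) == len(samples):
            break  # converged: nothing more to remove
        samples = kept
    return build_dictionaries(samples), samples
```

Because the majority of labels are correct, the dictionaries are dominated by true word-language associations from the first pass, so mislabeled samples score poorly against them and are filtered out on the next cycle.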
Training Parameters
- cycles: Number of training iterations (default: 2)
- base_confidence: Minimum score threshold for keeping samples (0-1)
- confidence_margin: Minimum difference between top two language scores (0-1)
Higher values make the filtering more aggressive, while lower values are more permissive.
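To illustrate how the two thresholds interact (a hypothetical sketch, not the library's internal filtering code), a sample survives only when its label's score clears base_confidence and the top score leads the runner-up by at least confidence_margin:

```python
def keep_sample(scores, label, base_confidence=0.5, confidence_margin=0.3):
    """Return True if the labeled language is a confident winner."""
    if scores.get(label, 0.0) < base_confidence:
        return False  # label's score is too low
    ranked = sorted(scores.values(), reverse=True)
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    return ranked[0] - runner_up >= confidence_margin

# A clear winner passes; a near-tie between languages is filtered out.
print(keep_sample({"en": 0.8, "fr": 0.2}, "en"))    # True
print(keep_sample({"en": 0.55, "fr": 0.45}, "en"))  # False
```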
Use Cases
1. Bootstrapping Language Detection
Scenario: You have a large dataset of multilingual text with potentially noisy language labels.
```python
# Start with noisy data
noisy_data = [
    {'text': 'Hello world', 'lang': 'en'},
    {'text': 'Bonjour', 'lang': 'en'},  # Mislabeled!
    {'text': 'Hello', 'lang': 'fr'},  # Mislabeled!
    {'text': 'Comment ça va?', 'lang': 'fr'},
    # ... thousands more with ~10% label noise
]

model = Vocabulous()
model, report = model.train(noisy_data, eval_data, cycles=3)

# The model learns to ignore mislabeled samples
print(f"Dictionary size: {len(model.word_lang_freq)}")
print(f"Final accuracy: {report['cycle_reports'][-1]['accuracy']:.3f}")
```
2. Data Cleaning Pipeline
Scenario: Clean a noisy multilingual dataset before using it for other NLP tasks.
```python
# Train model on subset of data
model, _ = model.train(sample_data, eval_data)

# Clean the full dataset
cleaned_dataset = model.clean(full_noisy_dataset)

# Now use cleaned_dataset for training other models
print(f"Kept {len(cleaned_dataset)}/{len(full_noisy_dataset)} samples")
```
3. Incremental Learning
Scenario: Continuously improve language detection as new data becomes available.
```python
# Initial training
model, _ = model.train(initial_data, eval_data)

# Later, integrate new data
model, updated_report = model.train(
    new_data + initial_data,  # Combine old and new
    eval_data,
    cycles=2,
)
```
4. Cross-Domain Adaptation
Scenario: Adapt a model trained on one domain (e.g., news) to another (e.g., social media).
```python
# Train on news data
news_model = Vocabulous()
news_model, _ = news_model.train(news_data, news_eval)

# Adapt to social media by combining datasets
adapted_model = Vocabulous()
adapted_model, _ = adapted_model.train(
    social_media_data + news_data,
    social_media_eval,
    cycles=3,
    base_confidence=0.3,  # Lower threshold for noisy social media text
)
```
Advanced Usage
Custom Text Preprocessing
```python
# Subclass to customize text cleaning
class CustomVocabulous(Vocabulous):
    def _clean_text(self, text):
        # Add custom preprocessing
        text = super()._clean_text(text)
        # Your custom logic here
        return text
```
Training Monitoring
```python
model, report = model.train(train_data, eval_data, cycles=5)

# Analyze training progress
for i, cycle_report in enumerate(report['cycle_reports']):
    print(f"Cycle {i+1}:")
    print(f"  Accuracy: {cycle_report['accuracy']:.3f}")
    print(f"  F1 Score: {cycle_report['f1']:.3f}")
    print(f"  Samples removed: {cycle_report['removed_samples']}")
    print(f"  Confidence Margin: {cycle_report['confidence_margin']:.3f}")
```
Scoring backends
- Default: swifter-accelerated Pandas apply via model._score(...).
- Alternatives (experimental): vectorized, numba, and sparse backends exist and are used in benchmarks.
Planned API to switch backends:
```python
model.set_scoring_mode("vectorized")  # or: "apply", "numba", "sparse", "auto"
scored = model._score(df)
```
Until switching is wired end-to-end, call experimental methods directly:
```python
# Vectorized scoring for a text Series -> Series[dict]
scores_vec = model._score_vectorized(df["text"])  # experimental API
df = df.copy()
df["scores"] = scores_vec

# Numba-backed scoring (if numba installed) -> Series[dict]
scores_numba = model._score_numba(df["text"])  # experimental API

# Note: _score(...) remains the default swifter-apply path as of 0.1.2
```
Confidence Scoring
```python
# Get detailed scores for a sentence
scores = model._score_sentence("Hello world")
# Returns: {'en': 0.75, 'fr': 0.25}

# For datasets
scored_df = model._score(test_data)
print(scored_df[['text', 'scores', 'lang']])
```
Performance Tips
Memory Optimization
```python
# For large datasets, disable training data storage
model = Vocabulous(store_training_data=False)
```
Speed Optimization
```python
# Use fewer cycles for faster training
model, _ = model.train(data, eval_data, cycles=1)

# Lower confidence margin for less aggressive filtering
model, _ = model.train(data, eval_data, confidence_margin=0.1)

# Enable multiprocessing for cleaning/tokenization on large corpora
model, _ = model.train(
    data,
    eval_data,
    cycles=1,
    clean_workers=4,
    token_workers=4,
)
```
Quality Optimization
```python
# More cycles for higher quality
model, _ = model.train(data, eval_data, cycles=5)

# Higher confidence threshold for cleaner dictionaries
model, _ = model.train(data, eval_data, base_confidence=0.7)
```
Evaluation Metrics
Benchmarks
- Full results: See benchmark.md for the complete output and methodology.
- Parallel training report: parallel_training_report.md documents sequential vs multiprocessing runs using clean_workers/token_workers.
- Highlights:
- 20k rows clean+score: ~454k rows/s
- Apply vs Vectorized: ~450–500k rows/s on 5k–50k rows
- Longer sentences reduce throughput (len=50 ~190k rows/s)
- Dictionary size (50→5000 per language): near ~450–480k rows/s
- Large-n vectorized batched (100k): ~403k rows/s
- Large-n compare (200k, dict=1000, len=20): ~224k–225k rows/s across modes
- Parallel training (200k synthetic sentences with 10k word vocabulary): Sequential 393 s vs parallel (4×4 workers) 120 s → 3.26× speedup with identical dictionaries/predictions.
Run locally with uv:
```bash
uv run python benchmarks/benchmark_vocabulous.py | tee benchmarks/benchmark_output.txt
uv run python benchmarks/train_parallel_compare.py --rows 200000 --clean-workers 4 --token-workers 4
```
Classification Performance
Vocabulous provides comprehensive evaluation metrics:
- Accuracy: Overall classification accuracy
- Precision/Recall/F1: Per-language and macro-averaged metrics
- Confusion Score: Measures how often languages are confused with each other
- Confidence Margin: Average difference between top two language scores (higher = more confident)
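The confidence margin metric, for example, could be computed along these lines (an illustrative sketch; the library's exact formula may differ):

```python
def confidence_margin(score_dicts):
    """Average gap between the top two language scores across samples."""
    margins = []
    for scores in score_dicts:
        ranked = sorted(scores.values(), reverse=True)
        top = ranked[0] if ranked else 0.0
        runner_up = ranked[1] if len(ranked) > 1 else 0.0
        margins.append(top - runner_up)
    return sum(margins) / len(margins) if margins else 0.0

# Gaps of 0.8 and 0.2 average out to 0.5
print(round(confidence_margin([{"en": 0.9, "fr": 0.1},
                               {"en": 0.6, "es": 0.4}]), 3))  # 0.5
```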
Limitations
- Vocabulary-Based: Works best with languages that have distinct vocabularies
- Training Data Size: Requires sufficient training data for each language
- Script Mixing: May struggle with code-switched text within sentences
- Short Text: Performance degrades on very short texts (1-2 words)
API Reference
Core Classes
Vocabulous(store_training_data=False)
Main class for language detection and training.
Parameters:
- store_training_data (bool): Whether to store training data internally
Methods
train(train_df, eval_df, cycles=2, base_confidence=0.5, confidence_margin=0.5)
Train the model on provided data.
Parameters:
- train_df: Training data (list of dicts or DataFrame)
- eval_df: Evaluation data (list of dicts or DataFrame)
- cycles (int): Number of training cycles
- base_confidence (float): Minimum confidence threshold
- confidence_margin (float): Minimum score difference threshold
Returns:
- (model, report): Updated model and training report
clean(dataset)
Clean a dataset by filtering confident predictions.
Parameters:
- dataset: DataFrame with 'text' and 'lang' columns
Returns:
- DataFrame with confident predictions only
save(path) / load(path)
Save/load model to/from JSON file.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
Development Setup
```bash
git clone https://github.com/omarkamali/vocabulous.git
cd vocabulous
pip install -e ".[dev]"
pytest tests/
```
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
This project is supported by Omneity Labs, a research lab focused on building NLP and generative AI models for low-resource languages and techniques for cultural alignment.
Contributors
Citation
If you use Vocabulous in your research, please cite:
```bibtex
@software{vocabulous2025,
  title={Vocabulous: Bootstrapping Language Detection from Noisy \& Ambiguous Data},
  author={Omar Kamali},
  year={2025},
  url={https://github.com/omarkamali/vocabulous},
  note={Project developed under Omneity Labs}
}
```
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: GitHub README