Vocabulous
A bootstrapping language detection system that builds high-quality dictionaries from noisy training data.
Overview
Vocabulous addresses a common challenge in NLP: building reliable language detection systems when you only have noisy, potentially mislabeled training data. Traditional approaches either require clean, manually curated datasets or sophisticated neural networks. Vocabulous takes a different approach by using iterative dictionary building and progressive data cleaning to bootstrap accurate language detection from imperfect data.
Key Features
- Bootstrapping from Noisy Data: Starts with potentially mislabeled training data and iteratively improves
- Dictionary-Based Detection: Uses word frequency dictionaries for fast, interpretable language detection
- Progressive Data Cleaning: Removes ambiguous and mislabeled samples across training cycles
- Multi-Script Support: Handles both Latin and Arabic scripts with appropriate text normalization
- Configurable Training: Adjustable confidence thresholds and confidence margin parameters for different scenarios
- Model Persistence: Save and load trained models for reuse
Installation
```bash
uv pip install vocabulous
```
Development Installation
```bash
git clone https://github.com/omarkamali/vocabulous.git
cd vocabulous
uv pip install -e ".[dev]"
```
Quick Start
```python
from vocabulous import Vocabulous

# Initialize model
model = Vocabulous()

# Prepare training data (list of dicts with 'text' and 'lang' keys)
train_data = [
    {'text': 'Hello world', 'lang': 'en'},
    {'text': 'Bonjour le monde', 'lang': 'fr'},
    {'text': 'Hola mundo', 'lang': 'es'},
    # ... more training examples
]

# Evaluation data for monitoring training progress
eval_data = [
    {'text': 'Good morning', 'lang': 'en'},
    {'text': 'Bon matin', 'lang': 'fr'},
    {'text': 'Buenos días', 'lang': 'es'},
]

# Train the model (enable parallel cleaning/tokenization when working at scale)
model, report = model.train(
    train_data=train_data,
    eval_data=eval_data,
    cycles=3,
    base_confidence=0.5,
    confidence_margin=0.3,
    clean_workers=4,  # optional multiprocessing for text cleaning
    token_workers=4,  # optional multiprocessing for sentence tokenization
)

# Use for language detection
scores = model._score_sentence("Hello there")
print(scores)  # {'en': 1.0}

# Save the model
model.save('my_model.json')

# Load later
loaded_model = Vocabulous.load('my_model.json')
```
Methodology
The Bootstrapping Approach
Vocabulous implements a novel bootstrapping methodology for language detection:
- Initial Dictionary Building: Creates word-language frequency dictionaries from all training data
- Scoring & Evaluation: Scores evaluation data to measure current model performance
- Data Cleaning: Removes training samples that contradict the current dictionaries
- Iteration: Repeats the process with cleaned data to progressively improve quality
Why This Works
The approach is based on several key insights:
- Majority Signal: Even noisy datasets typically contain more correct than incorrect labels
- Word Uniqueness: Many words are language-specific and provide strong signals
- Progressive Refinement: Each iteration removes the most problematic samples first
- Convergence: The process naturally converges when no more samples can be confidently removed
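The cycle described above can be sketched as a short loop. This is a hypothetical illustration of the idea, not the library's internal code; the function names and scoring formula here are simplified assumptions:

```python
from collections import Counter, defaultdict

def build_dictionaries(samples):
    """Count how often each word appears under each language label."""
    word_lang_freq = defaultdict(Counter)
    for sample in samples:
        for word in sample["text"].lower().split():
            word_lang_freq[word][sample["lang"]] += 1
    return word_lang_freq

def score_sentence(word_lang_freq, text):
    """Score a sentence by normalized per-language word votes."""
    votes = Counter()
    for word in text.lower().split():
        for lang, count in word_lang_freq.get(word, {}).items():
            votes[lang] += count
    total = sum(votes.values())
    return {lang: n / total for lang, n in votes.items()} if total else {}

def bootstrap(samples, cycles=3, base_confidence=0.5):
    """Iteratively rebuild dictionaries and drop contradicting samples."""
    for _ in range(cycles):
        dicts = build_dictionaries(samples)
        kept = []
        for sample in samples:
            scores = score_sentence(dicts, sample["text"])
            # Keep a sample only if the dictionaries agree with its label.
            if scores.get(sample["lang"], 0.0) >= base_confidence:
                kept.append(sample)
        if len(kept) == len(samples):
            break  # converged: nothing more to remove
        samples = kept
    return build_dictionaries(samples), samples
```

Because the majority of labels are correct, the dictionaries are dominated by true word-language associations from the first pass, so mislabeled samples score poorly against them and are filtered out on the next cycle.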
Training Parameters
- cycles: Number of training iterations (default: 2)
- base_confidence: Minimum score threshold for keeping samples (0-1)
- confidence_margin: Minimum difference between top two language scores (0-1)
Higher values make the filtering more aggressive, while lower values are more permissive.
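To illustrate how the two thresholds interact (a hypothetical sketch, not the library's internal filtering code), a sample survives only when its label's score clears base_confidence and the top score leads the runner-up by at least confidence_margin:

```python
def keep_sample(scores, label, base_confidence=0.5, confidence_margin=0.3):
    """Return True if the labeled language is a confident winner."""
    if scores.get(label, 0.0) < base_confidence:
        return False  # label's score is too low
    ranked = sorted(scores.values(), reverse=True)
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    return ranked[0] - runner_up >= confidence_margin

# A clear winner passes; a near-tie between languages is filtered out.
print(keep_sample({"en": 0.8, "fr": 0.2}, "en"))    # True
print(keep_sample({"en": 0.55, "fr": 0.45}, "en"))  # False
```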
Use Cases
1. Bootstrapping Language Detection
Scenario: You have a large dataset of multilingual text with potentially noisy language labels.
```python
# Start with noisy data
noisy_data = [
    {'text': 'Hello world', 'lang': 'en'},
    {'text': 'Bonjour', 'lang': 'en'},  # Mislabeled!
    {'text': 'Hello', 'lang': 'fr'},  # Mislabeled!
    {'text': 'Comment ça va?', 'lang': 'fr'},
    # ... thousands more with ~10% label noise
]

model = Vocabulous()
model, report = model.train(noisy_data, eval_data, cycles=3)

# The model learns to ignore mislabeled samples
print(f"Dictionary size: {len(model.word_lang_freq)}")
print(f"Final accuracy: {report['cycle_reports'][-1]['accuracy']:.3f}")
```
2. Data Cleaning Pipeline
Scenario: Clean a noisy multilingual dataset before using it for other NLP tasks.
```python
# Train model on subset of data
model, _ = model.train(sample_data, eval_data)

# Clean the full dataset
cleaned_dataset = model.clean(full_noisy_dataset)

# Now use cleaned_dataset for training other models
print(f"Kept {len(cleaned_dataset)}/{len(full_noisy_dataset)} samples")
```
3. Incremental Learning
Scenario: Continuously improve language detection as new data becomes available.
```python
# Initial training
model, _ = model.train(initial_data, eval_data)

# Later, integrate new data
model, updated_report = model.train(
    new_data + initial_data,  # Combine old and new
    eval_data,
    cycles=2,
)
```
4. Cross-Domain Adaptation
Scenario: Adapt a model trained on one domain (e.g., news) to another (e.g., social media).
```python
# Train on news data
news_model = Vocabulous()
news_model, _ = news_model.train(news_data, news_eval)

# Adapt to social media by combining datasets
adapted_model = Vocabulous()
adapted_model, _ = adapted_model.train(
    social_media_data + news_data,
    social_media_eval,
    cycles=3,
    base_confidence=0.3,  # Lower threshold for noisy social media text
)
```
Advanced Usage
Custom Text Preprocessing
```python
# Subclass to customize text cleaning
class CustomVocabulous(Vocabulous):
    def _clean_text(self, text):
        # Add custom preprocessing
        text = super()._clean_text(text)
        # Your custom logic here
        return text
```
Training Monitoring
```python
model, report = model.train(train_data, eval_data, cycles=5)

# Analyze training progress
for i, cycle_report in enumerate(report['cycle_reports']):
    print(f"Cycle {i+1}:")
    print(f"  Accuracy: {cycle_report['accuracy']:.3f}")
    print(f"  F1 Score: {cycle_report['f1']:.3f}")
    print(f"  Samples removed: {cycle_report['removed_samples']}")
    print(f"  Confidence Margin: {cycle_report['confidence_margin']:.3f}")
```
Scoring backends
- Default: swifter-accelerated Pandas apply via model._score(...).
- Alternatives (experimental): vectorized, numba, and sparse backends exist and are used in benchmarks.
Planned API to switch backends:
```python
model.set_scoring_mode("vectorized")  # or: "apply", "numba", "sparse", "auto"
scored = model._score(df)
```
Until switching is wired end-to-end, call experimental methods directly:
```python
# Vectorized scoring for a text Series -> Series[dict]
scores_vec = model._score_vectorized(df["text"])  # experimental API
df = df.copy()
df["scores"] = scores_vec

# Numba-backed scoring (if numba installed) -> Series[dict]
scores_numba = model._score_numba(df["text"])  # experimental API

# Note: _score(...) remains the default swifter-apply path as of 0.1.2
```
Confidence Scoring
```python
# Get detailed scores for a sentence
scores = model._score_sentence("Hello world")
# Returns: {'en': 0.75, 'fr': 0.25}

# For datasets
scored_df = model._score(test_data)
print(scored_df[['text', 'scores', 'lang']])
```
Performance Tips
Memory Optimization
```python
# For large datasets, disable training data storage
model = Vocabulous(store_training_data=False)
```
Speed Optimization
```python
# Use fewer cycles for faster training
model, _ = model.train(data, eval_data, cycles=1)

# Lower confidence margin for less aggressive filtering
model, _ = model.train(data, eval_data, confidence_margin=0.1)

# Enable multiprocessing for cleaning/tokenization on large corpora
model, _ = model.train(
    data,
    eval_data,
    cycles=1,
    clean_workers=4,
    token_workers=4,
)
```
Quality Optimization
```python
# More cycles for higher quality
model, _ = model.train(data, eval_data, cycles=5)

# Higher confidence threshold for cleaner dictionaries
model, _ = model.train(data, eval_data, base_confidence=0.7)
```
Evaluation Metrics
Benchmarks
- Full results: See benchmark.md for the complete output and methodology.
- Parallel training report: parallel_training_report.md documents sequential vs multiprocessing runs using clean_workers/token_workers.
- Highlights:
- 20k rows clean+score: ~454k rows/s
- Apply vs Vectorized: ~450–500k rows/s on 5k–50k rows
- Longer sentences reduce throughput (len=50 ~190k rows/s)
- Dictionary size (50→5000 per language): near ~450–480k rows/s
- Large-n vectorized batched (100k): ~403k rows/s
- Large-n compare (200k, dict=1000, len=20): ~224k–225k rows/s across modes
- Parallel training (200k synthetic sentences with 10k word vocabulary): Sequential 393 s vs parallel (4×4 workers) 120 s → 3.26× speedup with identical dictionaries/predictions.
Run locally with uv:
```bash
uv run python benchmarks/benchmark_vocabulous.py | tee benchmarks/benchmark_output.txt
uv run python benchmarks/train_parallel_compare.py --rows 200000 --clean-workers 4 --token-workers 4
```
Classification Performance
Vocabulous provides comprehensive evaluation metrics:
- Accuracy: Overall classification accuracy
- Precision/Recall/F1: Per-language and macro-averaged metrics
- Confusion Score: Measures how often languages are confused with each other
- Confidence Margin: Average difference between top two language scores (higher = more confident)
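The confidence margin metric, for example, could be computed along these lines (an illustrative sketch; the library's exact formula may differ):

```python
def confidence_margin(score_dicts):
    """Average gap between the top two language scores across samples."""
    margins = []
    for scores in score_dicts:
        ranked = sorted(scores.values(), reverse=True)
        top = ranked[0] if ranked else 0.0
        runner_up = ranked[1] if len(ranked) > 1 else 0.0
        margins.append(top - runner_up)
    return sum(margins) / len(margins) if margins else 0.0

# Gaps of 0.8 and 0.2 average out to 0.5
print(round(confidence_margin([{"en": 0.9, "fr": 0.1},
                               {"en": 0.6, "es": 0.4}]), 3))  # 0.5
```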
Limitations
- Vocabulary-Based: Works best with languages that have distinct vocabularies
- Training Data Size: Requires sufficient training data for each language
- Script Mixing: May struggle with code-switched text within sentences
- Short Text: Performance degrades on very short texts (1-2 words)
API Reference
Core Classes
Vocabulous(store_training_data=False)
Main class for language detection and training.
Parameters:
- store_training_data (bool): Whether to store training data internally
Methods
train(train_df, eval_df, cycles=2, base_confidence=0.5, confidence_margin=0.5)
Train the model on provided data.
Parameters:
- train_df: Training data (list of dicts or DataFrame)
- eval_df: Evaluation data (list of dicts or DataFrame)
- cycles (int): Number of training cycles
- base_confidence (float): Minimum confidence threshold
- confidence_margin (float): Minimum score difference threshold
Returns:
- (model, report): Updated model and training report
clean(dataset)
Clean a dataset by filtering confident predictions.
Parameters:
- dataset: DataFrame with 'text' and 'lang' columns
Returns:
- DataFrame with confident predictions only
save(path) / load(path)
Save/load model to/from JSON file.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
Development Setup
```bash
git clone https://github.com/omarkamali/vocabulous.git
cd vocabulous
pip install -e ".[dev]"
pytest tests/
```
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
This project is supported by Omneity Labs, a research lab focused on building NLP and generative AI models for low-resource languages and techniques for cultural alignment.
Contributors
Citation
If you use Vocabulous in your research, please cite:
```bibtex
@software{vocabulous2025,
  title={Vocabulous: Bootstrapping Language Detection from Noisy \& Ambiguous Data},
  author={Omar Kamali},
  year={2025},
  url={https://github.com/omarkamali/vocabulous},
  note={Project developed under Omneity Labs}
}
```
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: GitHub README