Language Quality Toolkit for Low-Resource Languages
LangQuality - Language Quality Toolkit for Low-Resource Languages
A modular, extensible Python toolkit for analyzing the quality of text and audio datasets for low-resource languages. LangQuality helps researchers and developers ensure high-quality datasets for training NLP models (ASR, machine translation, language models) across diverse languages.
✨ Key Features
🌍 Multi-Language Support via Language Packs
- Language-agnostic architecture: Works with any language through configurable Language Packs
- Pre-built packs: Fongbe, French, English, and more
- Easy customization: Create your own Language Pack in minutes
- Community-driven: Share and discover Language Packs from the community
🔍 Comprehensive Quality Analysis
- Structural Analysis: Sentence length distribution, outlier detection, statistical metrics
- Linguistic Analysis: Readability scores, lexical complexity, morphological features
- Diversity Analysis: Vocabulary richness (TTR), n-gram distributions, duplicate detection
- Domain Analysis: Thematic balance, under/over-represented categories
- Gender Bias Detection: Gender representation, stereotype detection, balance metrics
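As a rough illustration of the diversity metrics listed above, the sketch below computes a type-token ratio (TTR) and finds duplicate sentences using only the standard library. The function names and the whitespace tokenizer are assumptions made for this example; they are not LangQuality's API, and a real Language Pack would supply its own tokenization.

```python
from collections import Counter


def tokenize(sentence):
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return sentence.lower().split()


def type_token_ratio(sentences):
    """Return unique tokens divided by total tokens (0.0 for empty input)."""
    tokens = [tok for s in sentences for tok in tokenize(s)]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def find_duplicates(sentences):
    """Return normalized sentences that occur more than once, with counts."""
    counts = Counter(" ".join(tokenize(s)) for s in sentences)
    return {text: n for text, n in counts.items() if n > 1}


sample = ["The cat sleeps", "the cat sleeps", "A dog barks loudly"]
print(type_token_ratio(sample))  # 7 unique tokens / 10 total -> 0.7
print(find_duplicates(sample))   # {'the cat sleeps': 2}
```

TTR falls as a corpus grows, so it is most meaningful when comparing datasets of similar size.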
🔌 Extensible Plugin System
- Custom analyzers: Add your own analysis modules without modifying core code
- Automatic discovery: Drop plugins into a directory and they're automatically loaded
- Language-specific analyzers: Create analyzers tailored to specific languages
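Directory-based plugin discovery of this kind is commonly built on `importlib`. The sketch below shows one way it could work: it imports every `.py` file in a directory and keeps the classes that expose an `analyze()` method. The function name `discover_analyzers` and the duck-typing rule are assumptions for illustration; LangQuality's actual discovery mechanism may differ.

```python
import importlib.util
import inspect
import tempfile
from pathlib import Path


def discover_analyzers(plugin_dir):
    """Import each .py file in plugin_dir and collect classes with analyze()."""
    found = {}
    for path in sorted(Path(plugin_dir).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        for name, obj in inspect.getmembers(module, inspect.isclass):
            # Keep only classes defined in this file that look like analyzers.
            defined_here = obj.__module__ == module.__name__
            if defined_here and callable(getattr(obj, "analyze", None)):
                found[name] = obj
    return found


# Demo: write a throwaway plugin file, then discover it.
plugin_source = (
    "class ToneAnalyzer:\n"
    "    def analyze(self, sentences):\n"
    "        return {'count': len(sentences)}\n"
    "\n"
    "class Helper:\n"
    "    pass\n"
)
with tempfile.TemporaryDirectory() as plugin_dir:
    Path(plugin_dir, "tone_plugin.py").write_text(plugin_source, encoding="utf-8")
    found = discover_analyzers(plugin_dir)

print(sorted(found))  # ['ToneAnalyzer']
```

Importing a plugin executes its code, so a plugins directory should only contain files you trust.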
📊 Rich Output Formats
- Interactive Dashboard: HTML visualizations built with Plotly
- Actionable Recommendations: Prioritized suggestions based on best practices
- Multiple Exports: JSON, CSV, PDF reports, execution logs
- Per-sentence annotations: Quality scores and flags for each sentence
🚀 Quick Start
Installation
# Install from PyPI
pip install langquality
# Install with all optional dependencies
pip install "langquality[all]"
# Download language models (if using spaCy-based packs)
python -m spacy download fr_core_news_md # For French
python -m spacy download en_core_web_md # For English
Basic Usage
Analyze a dataset with a specific language:
# Analyze Fongbe data
langquality analyze --input data/fongbe_sentences --output results --language fon
# Analyze French data
langquality analyze --input data/french_sentences --output results --language fra
# Analyze English data
langquality analyze --input data/english_sentences --output results --language eng
View Results
# Open the interactive dashboard (macOS; use xdg-open on Linux)
open results/dashboard.html
Python API
from langquality.pipeline import PipelineController
from langquality.language_packs import LanguagePackManager
from langquality.data import GenericDataLoader
# Load a language pack
pack_manager = LanguagePackManager()
language_pack = pack_manager.load_language_pack("fon")
# Load your data
loader = GenericDataLoader(language_pack)
sentences = loader.load_from_csv("data/sentences.csv")
# Run analysis
controller = PipelineController(language_pack)
results = controller.run(sentences)
# Access results
print(f"Total sentences: {results.structural.total_sentences}")
print(f"Average readability: {results.linguistic.avg_readability_score}")
📦 Language Packs
Language Packs are self-contained configurations that adapt LangQuality to specific languages. Each pack includes:
- Language-specific configuration (tokenization, thresholds, etc.)
- Linguistic resources (lexicons, stopwords, gender terms, etc.)
- Optional custom analyzers
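To make the three parts of a pack concrete, here is a hypothetical in-memory view of one. The class `LanguagePackSketch`, its field names, and the placeholder resource values are all invented for this example and do not reflect LangQuality's real classes or on-disk layout.

```python
from dataclasses import dataclass, field


@dataclass
class LanguagePackSketch:
    """Hypothetical view of a pack; not LangQuality's actual class."""

    code: str
    config: dict = field(default_factory=dict)     # tokenization, thresholds, ...
    resources: dict = field(default_factory=dict)  # lexicons, stopwords, ...

    def missing_resources(self, required):
        """Return the required resource names this pack does not provide."""
        return [name for name in required if name not in self.resources]


pack = LanguagePackSketch(
    code="fon",
    config={"thresholds": {"structural": {"min_words": 5, "max_words": 15}}},
    resources={"stopwords": {"placeholder_a", "placeholder_b"}},  # dummy values
)
print(pack.missing_resources(["stopwords", "gender_terms"]))  # ['gender_terms']
```

A check like `missing_resources` is one way an analyzer could be skipped cleanly when a minimal pack lacks the resources it needs.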
Available Language Packs
| Language | Code | Status | Resources |
|---|---|---|---|
| Fongbe | fon | ✅ Stable | Full (lexicon, gender terms, ASR vocabulary) |
| French | fra | ✅ Stable | Full (lexicon, stopwords, gender terms, professions) |
| English | eng | ✅ Stable | Full (lexicon, stopwords, gender terms, professions) |
| Minangkabau | min | 🚧 Minimal | Basic configuration only |
| Your Language | xxx | 💡 Create one! | See Language Pack Guide |
Managing Language Packs
# List installed packs
langquality pack list
# Show pack details
langquality pack info fon
# Create a new pack template
langquality pack create <language_code>
# Validate a pack
langquality pack validate path/to/pack
Creating Your Own Language Pack
Creating a Language Pack for your language is straightforward:
- Generate a template:
  langquality pack create <your_language_code>
- Configure the pack: Edit config.yaml with language-specific settings
- Add resources (optional): Add lexicons, stopwords, or other linguistic resources
- Test it:
  langquality pack validate path/to/your_pack
  langquality analyze --input test_data --output results --language <your_language_code>
See the Language Pack Guide for detailed instructions.
📖 Documentation
- Quickstart Guide: Get up and running in 5 minutes
- User Guide: Comprehensive usage documentation
- Language Pack Guide: Create and customize Language Packs
- Developer Guide: Extend LangQuality
- API Reference: Complete API documentation
- FAQ: Frequently asked questions
- Migration Guide: Migrating from fongbe-data-quality
🎯 Use Cases
LangQuality is designed for researchers and developers working with low-resource languages:
- ASR Dataset Preparation: Ensure text quality before audio recording
- Machine Translation: Validate parallel corpora quality
- Language Model Training: Assess dataset diversity and balance
- Corpus Linguistics: Analyze linguistic properties of text collections
- Data Curation: Filter and improve existing datasets
🔧 Advanced Features
Custom Configuration
Override default thresholds and settings:
langquality analyze --input data --output results --language fon --config my_config.yaml
Example configuration:
thresholds:
  structural:
    min_words: 5
    max_words: 15
  diversity:
    target_ttr: 0.65
  gender:
    target_ratio: [0.45, 0.55]
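The following sketch shows how thresholds like these might be applied. It mirrors the example configuration as a plain Python dict so that it runs without a YAML parser. The helper names are hypothetical, and reading `target_ratio` as an inclusive interval on the feminine share of gendered mentions is an assumption; the toolkit's own interpretation may differ.

```python
config = {
    "thresholds": {
        "structural": {"min_words": 5, "max_words": 15},
        "diversity": {"target_ttr": 0.65},
        "gender": {"target_ratio": [0.45, 0.55]},
    }
}


def length_flag(sentence, structural):
    """Classify a sentence by whitespace word count against the thresholds."""
    n_words = len(sentence.split())
    if n_words < structural["min_words"]:
        return "too_short"
    if n_words > structural["max_words"]:
        return "too_long"
    return "ok"


def ratio_in_range(feminine, masculine, target_ratio):
    """True if the feminine share lies in [low, high] (assumed semantics)."""
    total = feminine + masculine
    if total == 0:
        return True  # nothing gendered to balance
    low, high = target_ratio
    return low <= feminine / total <= high


thresholds = config["thresholds"]
print(length_flag("Too short here", thresholds["structural"]))         # too_short
print(ratio_in_range(48, 52, thresholds["gender"]["target_ratio"]))    # True
print(ratio_in_range(30, 70, thresholds["gender"]["target_ratio"]))    # False
```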
Custom Analyzers
Create custom analyzers for specialized analysis:
from langquality.analyzers import Analyzer

class ToneAnalyzer(Analyzer):
    """Analyze tone and sentiment of sentences."""

    def analyze(self, sentences):
        # Your analysis logic goes here
        metrics = {}  # placeholder result
        return metrics

    def get_requirements(self):
        return ["tone_lexicon"]  # Required resources
Place your analyzer in the plugins directory and it will be automatically discovered.
See Creating Analyzers for details.
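For a version that runs without LangQuality installed, the sketch below fills in the analysis logic against a stand-in base class. The `Analyzer` defined here, its `resources` constructor argument, the lexicon format (word to signed score), and the returned metric names are all assumptions made for illustration; consult the real `langquality.analyzers.Analyzer` for the actual interface.

```python
from abc import ABC, abstractmethod


class Analyzer(ABC):
    """Stand-in base class (assumed interface, not the real one)."""

    def __init__(self, resources=None):
        self.resources = resources or {}

    @abstractmethod
    def analyze(self, sentences):
        """Return a dict of metrics for the given sentences."""

    def get_requirements(self):
        return []


class ToneAnalyzer(Analyzer):
    """Count sentences by the sign of their summed lexicon scores."""

    def analyze(self, sentences):
        lexicon = self.resources.get("tone_lexicon", {})
        scores = [
            sum(lexicon.get(word, 0) for word in sentence.lower().split())
            for sentence in sentences
        ]
        positive = sum(1 for s in scores if s > 0)
        negative = sum(1 for s in scores if s < 0)
        return {
            "total": len(scores),
            "positive": positive,
            "negative": negative,
            "neutral": len(scores) - positive - negative,
        }

    def get_requirements(self):
        return ["tone_lexicon"]  # Required resources


analyzer = ToneAnalyzer({"tone_lexicon": {"good": 1, "bad": -1}})
metrics = analyzer.analyze(["A good day", "A bad day", "A day"])
print(metrics)  # {'total': 3, 'positive': 1, 'negative': 1, 'neutral': 1}
```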
🤝 Contributing
We welcome contributions from the community, including:
- 🌍 Creating a Language Pack for your language
- 🔧 Adding new analyzers or features
- 📝 Improving documentation
- 🐛 Reporting bugs or issues
- 💡 Suggesting enhancements
Please see our Contributing Guide for:
- Code of Conduct
- Development setup
- Contribution workflow
- Coding standards
- Testing requirements
Quick Contribution Steps
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-feature
- Make your changes and add tests
- Ensure tests pass: pytest
- Commit: git commit -m 'Add amazing feature'
- Push: git push origin feature/amazing-feature
- Open a Pull Request
👥 Community
Join our community to get help, share ideas, and collaborate:
- GitHub Discussions: Ask questions, share ideas, showcase your Language Packs
- Issue Tracker: Report bugs, request features
- Documentation: Comprehensive guides and API reference
- Contributing Guide: Learn how to contribute
- Code of Conduct: Our community standards
Support Channels
- 💬 Questions: Use GitHub Discussions Q&A
- 🐛 Bug Reports: Open an issue
- 💡 Feature Requests: Open an issue
- 🌍 Language Pack Submissions: Use our Language Pack template
📊 Project Status
LangQuality is actively maintained and under continuous development. See our CHANGELOG for recent updates and our Roadmap for planned features.
Current version: 1.0.0 (Stable)
📜 License
LangQuality is released under the MIT License. You are free to use, modify, and distribute this software for any purpose, including commercial applications.
🙏 Acknowledgments
LangQuality evolved from the Fongbe Data Quality Pipeline, originally developed to support dataset creation for Fongbe, a low-resource language from Benin. We're grateful to:
- The linguistic community working on African language preservation and NLP development
- Contributors who have created Language Packs and shared their expertise
- The open-source NLP community for tools and libraries that make this work possible
📚 Citation
If you use LangQuality in your research, please cite:
@software{langquality_toolkit,
title={LangQuality: Language Quality Toolkit for Low-Resource Languages},
author={LangQuality Community},
year={2024},
url={https://github.com/langquality/langquality},
version={1.0.0}
}
🔗 Related Projects
- Common Voice: Crowdsourced voice dataset
- FLORES: Multilingual translation benchmark
- Masakhane: African NLP community
Made with ❤️ for low-resource language communities worldwide