Language Quality Toolkit for Low-Resource Languages

Maintainers

laleye

LangQuality - Language Quality Toolkit for Low-Resource Languages

A modular, extensible Python toolkit for analyzing the quality of text and audio datasets for low-resource languages. LangQuality helps researchers and developers ensure high-quality datasets for training NLP models (ASR, machine translation, language models) across diverse languages.

✨ Key Features

🌍 Multi-Language Support via Language Packs

Language-agnostic architecture: Works with any language through configurable Language Packs
Pre-built packs: Fongbe, French, English, and more
Easy customization: Create your own Language Pack in minutes
Community-driven: Share and discover Language Packs from the community

🔍 Comprehensive Quality Analysis

Structural Analysis: Sentence length distribution, outlier detection, statistical metrics
Linguistic Analysis: Readability scores, lexical complexity, morphological features
Diversity Analysis: Vocabulary richness (TTR), n-gram distributions, duplicate detection
Domain Analysis: Thematic balance, under/over-represented categories
Gender Bias Detection: Gender representation, stereotype detection, balance metrics
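
As a rough illustration of two of the diversity metrics listed above, the sketch below computes a type-token ratio (TTR) and finds exact duplicates using only the standard library. It is not LangQuality's implementation: the function names `type_token_ratio` and `find_duplicates` and the lowercase whitespace tokenization are assumptions made for this example.

```python
from collections import Counter


def type_token_ratio(sentences):
    """Return unique tokens / total tokens over all sentences (0.0 if empty)."""
    # Assumption: lowercase whitespace tokenization. A real Language Pack
    # would likely supply language-specific tokenization instead.
    tokens = [tok for s in sentences for tok in s.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def find_duplicates(sentences):
    """Return sentences seen more than once after case/whitespace normalization."""
    counts = Counter(" ".join(s.lower().split()) for s in sentences)
    return sorted(s for s, n in counts.items() if n > 1)
```

A TTR close to 1.0 means almost every token is new, while a low value points to repetitive vocabulary. Normalizing before counting lets near-identical lines (extra spaces, different casing) show up as duplicates.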

🔌 Extensible Plugin System

Custom analyzers: Add your own analysis modules without modifying core code
Automatic discovery: Drop plugins into a directory and they're automatically loaded
Language-specific analyzers: Create analyzers tailored to specific languages

📊 Rich Output Formats

Interactive Dashboard: HTML visualizations built with Plotly
Actionable Recommendations: Prioritized suggestions based on best practices
Multiple Exports: JSON, CSV, PDF reports, execution logs
Per-sentence annotations: Quality scores and flags for each sentence

🚀 Quick Start

Installation

# Install from PyPI
pip install langquality

# Install with all optional dependencies (quotes keep shells such as zsh from expanding the brackets)
pip install "langquality[all]"

# Download language models (if using spaCy-based packs)
python -m spacy download fr_core_news_md  # For French
python -m spacy download en_core_web_md   # For English

Basic Usage

Analyze a dataset with a specific language:

# Analyze Fongbe data
langquality analyze --input data/fongbe_sentences --output results --language fon

# Analyze French data
langquality analyze --input data/french_sentences --output results --language fra

# Analyze English data
langquality analyze --input data/english_sentences --output results --language eng

View Results

# Open the interactive dashboard (macOS; use xdg-open on Linux or start on Windows)
open results/dashboard.html

Python API

from langquality.pipeline import PipelineController
from langquality.language_packs import LanguagePackManager
from langquality.data import GenericDataLoader

# Load a language pack
pack_manager = LanguagePackManager()
language_pack = pack_manager.load_language_pack("fon")

# Load your data
loader = GenericDataLoader(language_pack)
sentences = loader.load_from_csv("data/sentences.csv")

# Run analysis
controller = PipelineController(language_pack)
results = controller.run(sentences)

# Access results
print(f"Total sentences: {results.structural.total_sentences}")
print(f"Average readability: {results.linguistic.avg_readability_score}")

📦 Language Packs

Language Packs are self-contained configurations that adapt LangQuality to specific languages. Each pack includes:

Language-specific configuration (tokenization, thresholds, etc.)
Linguistic resources (lexicons, stopwords, gender terms, etc.)
Optional custom analyzers

Available Language Packs

Language	Code	Status	Resources
Fongbe	`fon`	✅ Stable	Full (lexicon, gender terms, ASR vocabulary)
French	`fra`	✅ Stable	Full (lexicon, stopwords, gender terms, professions)
English	`eng`	✅ Stable	Full (lexicon, stopwords, gender terms, professions)
Minangkabau	`min`	🚧 Minimal	Basic configuration only
Your Language	`xxx`	💡 Create one!	See Language Pack Guide

Managing Language Packs

# List installed packs
langquality pack list

# Show pack details
langquality pack info fon

# Create a new pack template
langquality pack create <language_code>

# Validate a pack
langquality pack validate path/to/pack

Creating Your Own Language Pack

Creating a Language Pack for your language is straightforward:

Generate a template:

langquality pack create <your_language_code>

Configure the pack: Edit config.yaml with language-specific settings
Add resources (optional): Add lexicons, stopwords, or other linguistic resources

Test it:

langquality pack validate path/to/your_pack
langquality analyze --input test_data --output results --language <your_language_code>

See the Language Pack Guide for detailed instructions.

📖 Documentation

Quickstart Guide: Get up and running in 5 minutes
User Guide: Comprehensive usage documentation
- Installation
- Analyzing Data
Language Pack Guide: Create and customize Language Packs
Developer Guide: Extend LangQuality
API Reference: Complete API documentation
FAQ: Frequently asked questions
Migration Guide: Migrating from fongbe-data-quality

🎯 Use Cases

LangQuality is designed for researchers and developers working with low-resource languages:

ASR Dataset Preparation: Ensure text quality before audio recording
Machine Translation: Validate parallel corpora quality
Language Model Training: Assess dataset diversity and balance
Corpus Linguistics: Analyze linguistic properties of text collections
Data Curation: Filter and improve existing datasets

🔧 Advanced Features

Custom Configuration

Override default thresholds and settings:

langquality analyze --input data --output results --language fon --config my_config.yaml

Example configuration:

thresholds:
  structural:
    min_words: 5
    max_words: 15
  diversity:
    target_ttr: 0.65
  gender:
    target_ratio: [0.45, 0.55]
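
To make the role of these thresholds concrete, here is a minimal sketch of how such values could be applied to individual sentences. It does not use LangQuality itself: the dict simply mirrors the example configuration, the helper names `flag_length` and `gender_ratio_in_range` are hypothetical, and reading `target_ratio` as an acceptable range for the share of one gender is an assumption.

```python
# Mirrors the example configuration as a plain dict (loading the YAML file
# itself is left out to keep the sketch dependency-free).
THRESHOLDS = {
    "structural": {"min_words": 5, "max_words": 15},
    "diversity": {"target_ttr": 0.65},
    "gender": {"target_ratio": [0.45, 0.55]},
}


def flag_length(sentence, thresholds=THRESHOLDS):
    """Return 'too_short', 'too_long', or None based on whitespace word count."""
    limits = thresholds["structural"]
    n_words = len(sentence.split())
    if n_words < limits["min_words"]:
        return "too_short"
    if n_words > limits["max_words"]:
        return "too_long"
    return None


def gender_ratio_in_range(feminine_count, masculine_count, thresholds=THRESHOLDS):
    """Check whether the feminine share of gendered mentions is in the target range."""
    # Assumption: target_ratio is a [low, high] range for one gender's share.
    total = feminine_count + masculine_count
    if total == 0:
        return True  # no gendered mentions, so nothing to balance
    low, high = thresholds["gender"]["target_ratio"]
    return low <= feminine_count / total <= high
```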

Custom Analyzers

Create custom analyzers for specialized analysis:

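from langquality.analyzers import Analyzer

class ToneAnalyzer(Analyzer):
    """Analyze tone and sentiment of sentences."""

    def analyze(self, sentences):
        # Your analysis logic goes here. As a placeholder, report the
        # sentence count; the return type LangQuality expects may differ.
        metrics = {"total_sentences": len(sentences)}
        return metrics

    def get_requirements(self):
        return ["tone_lexicon"]  # Required resources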

Place your analyzer in the plugins directory and it will be automatically discovered.

See Creating Analyzers for details.
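
For readers curious how directory-based plugin discovery can work in general, the sketch below shows one common approach using `importlib`. It is an assumption-laden illustration, not LangQuality's actual loader: the `Analyzer` class here is a stand-in for `langquality.analyzers.Analyzer`, and `discover_analyzers` is a hypothetical helper.

```python
import importlib.util
import inspect
from pathlib import Path


class Analyzer:
    """Stand-in base class for this sketch (not langquality.analyzers.Analyzer)."""

    def analyze(self, sentences):
        raise NotImplementedError


def discover_analyzers(plugin_dir, base_class=Analyzer):
    """Import every .py file in plugin_dir and collect subclasses of base_class."""
    found = {}
    for path in sorted(Path(plugin_dir).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        # Expose the base class so plugin files in this sketch can subclass
        # it without importing anything themselves.
        module.Analyzer = base_class
        spec.loader.exec_module(module)
        for name, obj in inspect.getmembers(module, inspect.isclass):
            if issubclass(obj, base_class) and obj is not base_class:
                found[name] = obj
    return found
```

Calling `discover_analyzers("plugins")` on a directory containing a file that defines `class ToneAnalyzer(Analyzer)` would return a dict mapping `"ToneAnalyzer"` to that class. Importing arbitrary files executes their code, so a loader like this should only be pointed at directories you trust.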

🤝 Contributing

We welcome contributions from the community! You can help by:

🌍 Creating a Language Pack for your language
🔧 Adding new analyzers or features
📝 Improving documentation
🐛 Reporting bugs or issues
💡 Suggesting enhancements

Please see our Contributing Guide for:

Code of Conduct
Development setup
Contribution workflow
Coding standards
Testing requirements

Quick Contribution Steps

Fork the repository
Create a feature branch: git checkout -b feature/amazing-feature
Make your changes and add tests
Ensure tests pass: pytest
Commit: git commit -m 'Add amazing feature'
Push: git push origin feature/amazing-feature
Open a Pull Request

👥 Community

Join our community to get help, share ideas, and collaborate:

GitHub Discussions: Ask questions, share ideas, showcase your Language Packs
Issue Tracker: Report bugs, request features
Documentation: Comprehensive guides and API reference
Contributing Guide: Learn how to contribute
Code of Conduct: Our community standards

Support Channels

💬 Questions: Use GitHub Discussions Q&A
🐛 Bug Reports: Open an issue
💡 Feature Requests: Open an issue
🌍 Language Pack Submissions: Use our Language Pack template

📊 Project Status

LangQuality is actively maintained and under continuous development. See our CHANGELOG for recent updates and our Roadmap for planned features.

Current version: 1.0.1 (Stable)

📜 License

LangQuality is released under the MIT License. You are free to use, modify, and distribute this software for any purpose, including commercial applications.

🙏 Acknowledgments

LangQuality evolved from the Fongbe Data Quality Pipeline, originally developed to support dataset creation for Fongbe, a low-resource language from Benin. We're grateful to:

The linguistic community working on African language preservation and NLP development
Contributors who have created Language Packs and shared their expertise
The open-source NLP community for tools and libraries that make this work possible

📚 Citation

If you use LangQuality in your research, please cite:

@software{langquality_toolkit,
  title={LangQuality: Language Quality Toolkit for Low-Resource Languages},
  author={LangQuality Community},
  year={2024},
  url={https://github.com/langquality/langquality},
  version={1.0.0}
}

🔗 Related Projects

Common Voice: Crowdsourced voice dataset
FLORES: Multilingual translation benchmark
Masakhane: African NLP community

Made with ❤️ for low-resource language communities worldwide

Release history

1.0.1 (this version): Nov 25, 2025
1.0.0: Nov 25, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

langquality-1.0.1.tar.gz (139.2 kB view details)

Uploaded Nov 25, 2025 Source

Built Distribution

langquality-1.0.1-py3-none-any.whl (148.5 kB view details)

Uploaded Nov 25, 2025 Python 3

File details

Details for the file langquality-1.0.1.tar.gz.

File metadata

Download URL: langquality-1.0.1.tar.gz
Upload date: Nov 25, 2025
Size: 139.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for langquality-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`6dcda6992967da3a9ba6bc0d78f17890f4b0d0f6186dfccee2ae725b33f5bb5b`
MD5	`3c096d049900752ec5dc40b45988b5d1`
BLAKE2b-256	`1e1f125a6d9eb3dff92a87d8af270992e38158d34b1a4870923ba7292a333ca6`

See more details on using hashes here.
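
One way to check a downloaded file against the SHA256 digest listed above is with Python's standard `hashlib` module. This is a minimal sketch; the helper names `sha256_of_file` and `matches_expected` are illustrative and not part of LangQuality or PyPI tooling.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, with the source distribution saved in the current directory, `matches_expected("langquality-1.0.1.tar.gz", "6dcda6992967da3a9ba6bc0d78f17890f4b0d0f6186dfccee2ae725b33f5bb5b")` should return `True` for an intact download.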

Provenance

The following attestation bundles were made for langquality-1.0.1.tar.gz:

Publisher: release.yml on laleye/langquality

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: langquality-1.0.1.tar.gz
- Subject digest: 6dcda6992967da3a9ba6bc0d78f17890f4b0d0f6186dfccee2ae725b33f5bb5b
- Sigstore transparency entry: 724742024
- Sigstore integration time: Nov 25, 2025
Source repository:
- Permalink: laleye/langquality@04a45dac226e760149074ce58cfc70c77f71e1d3
- Branch / Tag: refs/tags/v1.0.1
- Owner: https://github.com/laleye
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@04a45dac226e760149074ce58cfc70c77f71e1d3
- Trigger Event: release

File details

Details for the file langquality-1.0.1-py3-none-any.whl.

File metadata

Download URL: langquality-1.0.1-py3-none-any.whl
Upload date: Nov 25, 2025
Size: 148.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for langquality-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e848a0be40ab0b639ef1633f40405e73fa2a94d4f678e5382af5a907f2665835`
MD5	`e99ac1afee6649754fec175507bcb25d`
BLAKE2b-256	`1b1f90da93785d3ed2f9e1a2cf3a333078e71d9252d2feb4c5a14ee7c4665f9c`

See more details on using hashes here.

Provenance

The following attestation bundles were made for langquality-1.0.1-py3-none-any.whl:

Publisher: release.yml on laleye/langquality

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: langquality-1.0.1-py3-none-any.whl
- Subject digest: e848a0be40ab0b639ef1633f40405e73fa2a94d4f678e5382af5a907f2665835
- Sigstore transparency entry: 724742045
- Sigstore integration time: Nov 25, 2025
Source repository:
- Permalink: laleye/langquality@04a45dac226e760149074ce58cfc70c77f71e1d3
- Branch / Tag: refs/tags/v1.0.1
- Owner: https://github.com/laleye
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@04a45dac226e760149074ce58cfc70c77f71e1d3
- Trigger Event: release
