Language Quality Toolkit for Low-Resource Languages

LangQuality - Language Quality Toolkit for Low-Resource Languages

Requires Python 3.8+ | License: MIT

A modular, extensible Python toolkit for analyzing the quality of text and audio datasets for low-resource languages. LangQuality helps researchers and developers ensure high-quality datasets for training NLP models (ASR, machine translation, language models) across diverse languages.

✨ Key Features

🌍 Multi-Language Support via Language Packs

  • Language-agnostic architecture: Works with any language through configurable Language Packs
  • Pre-built packs: Fongbe, French, English, and more
  • Easy customization: Create your own Language Pack in minutes
  • Community-driven: Share and discover Language Packs from the community

🔍 Comprehensive Quality Analysis

  • Structural Analysis: Sentence length distribution, outlier detection, statistical metrics
  • Linguistic Analysis: Readability scores, lexical complexity, morphological features
  • Diversity Analysis: Vocabulary richness (TTR), n-gram distributions, duplicate detection
  • Domain Analysis: Thematic balance, under/over-represented categories
  • Gender Bias Detection: Gender representation, stereotype detection, balance metrics
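
As a rough illustration of two of the diversity metrics listed above, the sketch below computes a type-token ratio (TTR) and finds duplicate sentences in plain Python. It assumes whitespace tokenization and lowercasing, which may not match how LangQuality tokenizes text; the function names are hypothetical and not part of the LangQuality API.

```python
from collections import Counter


def type_token_ratio(sentences):
    """Return unique tokens / total tokens over whitespace-split, lowercased text."""
    tokens = [tok.lower() for s in sentences for tok in s.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def find_duplicates(sentences):
    """Return sentences that occur more than once after trimming and lowercasing."""
    counts = Counter(s.strip().lower() for s in sentences)
    return sorted(s for s, n in counts.items() if n > 1)


sample = ["the cat sat", "the dog sat", "The cat sat"]
ttr = type_token_ratio(sample)          # 4 unique tokens out of 9
duplicates = find_duplicates(sample)    # case-insensitive match
print(f"TTR: {ttr:.2f}, duplicates: {duplicates}")
```

A low TTR or a long duplicate list usually signals a repetitive dataset, which is the kind of issue the Diversity Analysis is meant to surface.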

🔌 Extensible Plugin System

  • Custom analyzers: Add your own analysis modules without modifying core code
  • Automatic discovery: Drop plugins into a directory and they're automatically loaded
  • Language-specific analyzers: Create analyzers tailored to specific languages

📊 Rich Output Formats

  • Interactive Dashboard: Beautiful HTML visualizations with Plotly
  • Actionable Recommendations: Prioritized suggestions based on best practices
  • Multiple Exports: JSON, CSV, PDF reports, execution logs
  • Per-sentence annotations: Quality scores and flags for each sentence

🚀 Quick Start

Installation

# Install from PyPI
pip install langquality

# Install with all optional dependencies
pip install "langquality[all]"  # quotes keep shells such as zsh from expanding the brackets

# Download language models (if using spaCy-based packs)
python -m spacy download fr_core_news_md  # For French
python -m spacy download en_core_web_md   # For English

Basic Usage

Analyze a dataset with a specific language:

# Analyze Fongbe data
langquality analyze --input data/fongbe_sentences --output results --language fon

# Analyze French data
langquality analyze --input data/french_sentences --output results --language fra

# Analyze English data
langquality analyze --input data/english_sentences --output results --language eng
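
To analyze several datasets in one go, the commands above can be generated from Python. This is a minimal sketch that only uses the `--input`, `--output`, and `--language` flags shown above; the helper name `build_analyze_command` and the per-language `results/<code>` layout are assumptions made for illustration.

```python
import shlex


def build_analyze_command(input_dir, output_dir, language):
    """Build the argument list for `langquality analyze` using the flags shown above."""
    return [
        "langquality", "analyze",
        "--input", input_dir,
        "--output", output_dir,
        "--language", language,
    ]


# Hypothetical layout: one input directory per language code.
datasets = {"fon": "data/fongbe_sentences", "fra": "data/french_sentences"}
commands = [
    build_analyze_command(path, f"results/{code}", code)
    for code, path in datasets.items()
]
for cmd in commands:
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems when directory names contain spaces.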

View Results

# Open the interactive dashboard (macOS; use xdg-open on Linux or start on Windows)
open results/dashboard.html

Python API

from langquality.pipeline import PipelineController
from langquality.language_packs import LanguagePackManager
from langquality.data import GenericDataLoader

# Load a language pack
pack_manager = LanguagePackManager()
language_pack = pack_manager.load_language_pack("fon")

# Load your data
loader = GenericDataLoader(language_pack)
sentences = loader.load_from_csv("data/sentences.csv")

# Run analysis
controller = PipelineController(language_pack)
results = controller.run(sentences)

# Access results
print(f"Total sentences: {results.structural.total_sentences}")
print(f"Average readability: {results.linguistic.avg_readability_score}")

📦 Language Packs

Language Packs are self-contained configurations that adapt LangQuality to specific languages. Each pack includes:

  • Language-specific configuration (tokenization, thresholds, etc.)
  • Linguistic resources (lexicons, stopwords, gender terms, etc.)
  • Optional custom analyzers

Available Language Packs

Language      | Code | Status        | Resources
Fongbe        | fon  | ✅ Stable     | Full (lexicon, gender terms, ASR vocabulary)
French        | fra  | ✅ Stable     | Full (lexicon, stopwords, gender terms, professions)
English       | eng  | ✅ Stable     | Full (lexicon, stopwords, gender terms, professions)
Minangkabau   | min  | 🚧 Minimal    | Basic configuration only
Your Language | xxx  | 💡 Create one! | See Language Pack Guide

Managing Language Packs

# List installed packs
langquality pack list

# Show pack details
langquality pack info fon

# Create a new pack template
langquality pack create <language_code>

# Validate a pack
langquality pack validate path/to/pack

Creating Your Own Language Pack

Creating a Language Pack for your language is straightforward:

  1. Generate a template:

    langquality pack create <your_language_code>
    
  2. Configure the pack: Edit config.yaml with language-specific settings

  3. Add resources (optional): Add lexicons, stopwords, or other linguistic resources

  4. Test it:

    langquality pack validate path/to/your_pack
    langquality analyze --input test_data --output results --language <your_language_code>
    

See the Language Pack Guide for detailed instructions.

📖 Documentation

🎯 Use Cases

LangQuality is designed for researchers and developers working with low-resource languages:

  • ASR Dataset Preparation: Ensure text quality before audio recording
  • Machine Translation: Validate parallel corpora quality
  • Language Model Training: Assess dataset diversity and balance
  • Corpus Linguistics: Analyze linguistic properties of text collections
  • Data Curation: Filter and improve existing datasets

🔧 Advanced Features

Custom Configuration

Override default thresholds and settings:

langquality analyze --input data --output results --language fon --config my_config.yaml

Example configuration:

thresholds:
  structural:
    min_words: 5
    max_words: 15
  diversity:
    target_ttr: 0.65
  gender:
    target_ratio: [0.45, 0.55]
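
To make the thresholds more concrete, here is a sketch of how such values could be applied. The interpretation is an assumption: `min_words`/`max_words` are treated as bounds on a sentence's whitespace-separated word count, and `target_ratio` as an acceptable range for one group's share of gendered mentions. LangQuality's own checks may differ, and the function names are hypothetical.

```python
# Mirrors the example configuration above as a plain dict.
thresholds = {
    "structural": {"min_words": 5, "max_words": 15},
    "diversity": {"target_ttr": 0.65},
    "gender": {"target_ratio": [0.45, 0.55]},
}


def length_flag(sentence, structural):
    """Classify a sentence as 'too_short', 'too_long', or 'ok' by word count."""
    n = len(sentence.split())
    if n < structural["min_words"]:
        return "too_short"
    if n > structural["max_words"]:
        return "too_long"
    return "ok"


def ratio_in_target(count_a, count_b, target_ratio):
    """Check whether count_a's share of the total lies inside the target range."""
    total = count_a + count_b
    if total == 0:
        return False
    low, high = target_ratio
    return low <= count_a / total <= high


flags = [
    length_flag(s, thresholds["structural"])
    for s in ["Too short.", "This sentence has exactly six words."]
]
balanced = ratio_in_target(48, 52, thresholds["gender"]["target_ratio"])
print(flags, balanced)
```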

Custom Analyzers

Create custom analyzers for specialized analysis:

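from langquality.analyzers import Analyzer


class ToneAnalyzer(Analyzer):
    """Analyze tone and sentiment of sentences."""

    def analyze(self, sentences):
        # Placeholder: replace with your analysis logic.
        # The dict returned here is illustrative; check the Analyzer
        # base class for the result type it expects.
        metrics = {"total_sentences": len(sentences)}
        return metrics

    def get_requirements(self):
        # Names of the Language Pack resources this analyzer needs.
        return ["tone_lexicon"]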

Place your analyzer in the plugins directory and it will be automatically discovered.
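
Directory-based discovery of this kind can be sketched with the standard library alone. The example below is not LangQuality's actual loader: `discover_analyzers` is a hypothetical helper that imports every `.py` file in a directory and keeps classes defining an `analyze` method, and the demo plugin skips the `Analyzer` base class so the snippet runs without LangQuality installed.

```python
import importlib.util
import inspect
import tempfile
from pathlib import Path


def discover_analyzers(plugin_dir):
    """Import every .py file in plugin_dir and collect classes that define analyze()."""
    found = {}
    for path in sorted(Path(plugin_dir).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        for name, cls in inspect.getmembers(module, inspect.isclass):
            # Keep only classes defined in this file that look like analyzers.
            defined_here = cls.__module__ == module.__name__
            if defined_here and callable(getattr(cls, "analyze", None)):
                found[name] = cls
    return found


# Demo: write a throwaway plugin file, then discover and run it.
plugin_source = '''
class ToneAnalyzer:
    def analyze(self, sentences):
        return {"total_sentences": len(sentences)}
'''
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "tone_analyzer.py").write_text(plugin_source)
    analyzers = discover_analyzers(tmp)
    result = analyzers["ToneAnalyzer"]().analyze(["a", "b"])

print(sorted(analyzers), result)
```

Because discovery executes every file it finds, a plugins directory should only contain code you trust.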

See Creating Analyzers for details.

🤝 Contributing

We welcome contributions from the community. You can help by:

  • 🌍 Creating a Language Pack for your language
  • 🔧 Adding new analyzers or features
  • 📝 Improving documentation
  • 🐛 Reporting bugs or issues
  • 💡 Suggesting enhancements

Please see our Contributing Guide for:

  • Code of Conduct
  • Development setup
  • Contribution workflow
  • Coding standards
  • Testing requirements

Quick Contribution Steps

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make your changes and add tests
  4. Ensure tests pass: pytest
  5. Commit: git commit -m 'Add amazing feature'
  6. Push: git push origin feature/amazing-feature
  7. Open a Pull Request

👥 Community

Join our community to get help, share ideas, and collaborate:

Support Channels

📊 Project Status

LangQuality is actively maintained and under continuous development. See our CHANGELOG for recent updates and our Roadmap for planned features.

Current version: 1.0.0 (Stable)

📜 License

LangQuality is released under the MIT License. You are free to use, modify, and distribute this software for any purpose, including commercial applications, provided the copyright and license notice is retained in copies of the software.

🙏 Acknowledgments

LangQuality evolved from the Fongbe Data Quality Pipeline, originally developed to support dataset creation for Fongbe, a low-resource language from Benin. We're grateful to:

  • The linguistic community working on African language preservation and NLP development
  • Contributors who have created Language Packs and shared their expertise
  • The open-source NLP community for tools and libraries that make this work possible

📚 Citation

If you use LangQuality in your research, please cite:

@software{langquality_toolkit,
  title={LangQuality: Language Quality Toolkit for Low-Resource Languages},
  author={LangQuality Community},
  year={2024},
  url={https://github.com/langquality/langquality},
  version={1.0.0}
}

🔗 Related Projects


Made with ❤️ for low-resource language communities worldwide

