Language Quality Toolkit for Low-Resource Languages

LangQuality - Language Quality Toolkit for Low-Resource Languages

License: MIT · Python 3.8+

A modular, extensible Python toolkit for analyzing the quality of text and audio datasets for low-resource languages. LangQuality helps researchers and developers assess and improve the datasets they use to train NLP models (ASR, machine translation, language models) across diverse languages.

✨ Key Features

🌍 Multi-Language Support via Language Packs

  • Language-agnostic architecture: Works with any language through configurable Language Packs
  • Pre-built packs: Fongbe, French, English, and more
  • Easy customization: Create your own Language Pack in minutes
  • Community-driven: Share and discover Language Packs from the community

🔍 Comprehensive Quality Analysis

  • Structural Analysis: Sentence length distribution, outlier detection, statistical metrics
  • Linguistic Analysis: Readability scores, lexical complexity, morphological features
  • Diversity Analysis: Vocabulary richness (TTR), n-gram distributions, duplicate detection
  • Domain Analysis: Thematic balance, under/over-represented categories
  • Gender Bias Detection: Gender representation, stereotype detection, balance metrics
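As an illustration of the diversity metrics listed above, here is a minimal sketch of a type-token ratio (TTR) and an exact-duplicate check. It does not use LangQuality's own analyzers: the function names `type_token_ratio` and `find_duplicates` are hypothetical, and whitespace tokenization with lowercasing is an assumption that a real Language Pack would likely replace.

```python
from collections import Counter


def type_token_ratio(sentences):
    """Return unique tokens / total tokens (assumes whitespace tokenization)."""
    tokens = [tok.lower() for s in sentences for tok in s.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def find_duplicates(sentences):
    """Return normalized sentences that occur more than once."""
    # Normalization here is an assumption: lowercase and collapse whitespace.
    counts = Counter(" ".join(s.lower().split()) for s in sentences)
    return [s for s, n in counts.items() if n > 1]


corpus = ["the cat sat", "the dog sat", "The cat  sat"]
print(f"TTR: {type_token_ratio(corpus):.2f}")
print(f"Duplicates: {find_duplicates(corpus)}")
```

A low TTR or a long duplicate list suggests the dataset repeats itself and may need more varied sentences.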

🔌 Extensible Plugin System

  • Custom analyzers: Add your own analysis modules without modifying core code
  • Automatic discovery: Drop plugins into a directory and they're automatically loaded
  • Language-specific analyzers: Create analyzers tailored to specific languages
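To make the idea of automatic discovery concrete, the sketch below shows one common way a plugin directory can be scanned with the standard library. It is not LangQuality's actual loader: `discover_analyzers` is a hypothetical helper, and treating any class with an `analyze` method as an analyzer is an assumption made to keep the example self-contained.

```python
import importlib.util
import inspect
import tempfile
from pathlib import Path


def discover_analyzers(plugin_dir):
    """Import each .py file in plugin_dir and collect analyzer-like classes."""
    found = {}
    for path in sorted(Path(plugin_dir).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        for name, obj in inspect.getmembers(module, inspect.isclass):
            # Keep only classes defined in this file that expose analyze().
            defined_here = obj.__module__ == module.__name__
            if defined_here and callable(getattr(obj, "analyze", None)):
                found[name] = obj
    return found


# Usage: write a throwaway plugin file, then discover it.
with tempfile.TemporaryDirectory() as plugin_dir:
    Path(plugin_dir, "tone_plugin.py").write_text(
        "class ToneAnalyzer:\n"
        "    def analyze(self, sentences):\n"
        "        return {'count': len(sentences)}\n"
    )
    analyzers = discover_analyzers(plugin_dir)

print(sorted(analyzers))
```

A real loader would more likely check for a subclass of the toolkit's analyzer base class than rely on duck typing.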

📊 Rich Output Formats

  • Interactive Dashboard: Beautiful HTML visualizations with Plotly
  • Actionable Recommendations: Prioritized suggestions based on best practices
  • Multiple Exports: JSON, CSV, PDF reports, execution logs
  • Per-sentence annotations: Quality scores and flags for each sentence

🚀 Quick Start

Installation

# Install from PyPI
pip install langquality

# Install with all optional dependencies (quotes keep shells such as zsh from expanding the brackets)
pip install "langquality[all]"

# Download language models (if using spaCy-based packs)
python -m spacy download fr_core_news_md  # For French
python -m spacy download en_core_web_md   # For English

Basic Usage

Analyze a dataset with a specific language:

# Analyze Fongbe data
langquality analyze --input data/fongbe_sentences --output results --language fon

# Analyze French data
langquality analyze --input data/french_sentences --output results --language fra

# Analyze English data
langquality analyze --input data/english_sentences --output results --language eng

View Results

# Open the interactive dashboard (macOS; on Linux use xdg-open, on Windows use start)
open results/dashboard.html

Python API

from langquality.pipeline import PipelineController
from langquality.language_packs import LanguagePackManager
from langquality.data import GenericDataLoader

# Load a language pack
pack_manager = LanguagePackManager()
language_pack = pack_manager.load_language_pack("fon")

# Load your data
loader = GenericDataLoader(language_pack)
sentences = loader.load_from_csv("data/sentences.csv")

# Run analysis
controller = PipelineController(language_pack)
results = controller.run(sentences)

# Access results
print(f"Total sentences: {results.structural.total_sentences}")
print(f"Average readability: {results.linguistic.avg_readability_score}")

📦 Language Packs

Language Packs are self-contained configurations that adapt LangQuality to specific languages. Each pack includes:

  • Language-specific configuration (tokenization, thresholds, etc.)
  • Linguistic resources (lexicons, stopwords, gender terms, etc.)
  • Optional custom analyzers

Available Language Packs

Language      | Code | Status      | Resources
Fongbe        | fon  | ✅ Stable   | Full (lexicon, gender terms, ASR vocabulary)
French        | fra  | ✅ Stable   | Full (lexicon, stopwords, gender terms, professions)
English       | eng  | ✅ Stable   | Full (lexicon, stopwords, gender terms, professions)
Minangkabau   | min  | 🚧 Minimal  | Basic configuration only
Your Language | xxx  | 💡 Create one! | See Language Pack Guide

Managing Language Packs

# List installed packs
langquality pack list

# Show pack details
langquality pack info fon

# Create a new pack template
langquality pack create <language_code>

# Validate a pack
langquality pack validate path/to/pack

Creating Your Own Language Pack

Creating a Language Pack for your language is straightforward:

  1. Generate a template:

    langquality pack create <your_language_code>
    
  2. Configure the pack: Edit config.yaml with language-specific settings

  3. Add resources (optional): Add lexicons, stopwords, or other linguistic resources

  4. Test it:

    langquality pack validate path/to/your_pack
    langquality analyze --input test_data --output results --language <your_language_code>
    

See the Language Pack Guide for detailed instructions.

🎯 Use Cases

LangQuality is designed for researchers and developers working with low-resource languages:

  • ASR Dataset Preparation: Ensure text quality before audio recording
  • Machine Translation: Validate parallel corpora quality
  • Language Model Training: Assess dataset diversity and balance
  • Corpus Linguistics: Analyze linguistic properties of text collections
  • Data Curation: Filter and improve existing datasets

🔧 Advanced Features

Custom Configuration

Override default thresholds and settings:

langquality analyze --input data --output results --language fon --config my_config.yaml

Example configuration:

thresholds:
  structural:
    min_words: 5
    max_words: 15
  diversity:
    target_ttr: 0.65
  gender:
    target_ratio: [0.45, 0.55]
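The README does not spell out how each threshold is applied, so the following is only a sketch under stated assumptions: that `min_words` and `max_words` bound the word count of a sentence, and that `target_ratio` is an acceptable interval for a gender-representation ratio. The helpers `flag_by_length` and `ratio_in_range` are hypothetical and not part of LangQuality.

```python
# Mirrors the example configuration as a plain dict (no YAML parser needed).
thresholds = {
    "structural": {"min_words": 5, "max_words": 15},
    "diversity": {"target_ttr": 0.65},
    "gender": {"target_ratio": [0.45, 0.55]},
}


def flag_by_length(sentences, min_words, max_words):
    """Label each sentence by word count (assumed meaning of the thresholds)."""
    flagged = []
    for sentence in sentences:
        n_words = len(sentence.split())
        if n_words < min_words:
            flag = "too_short"
        elif n_words > max_words:
            flag = "too_long"
        else:
            flag = "ok"
        flagged.append((sentence, n_words, flag))
    return flagged


def ratio_in_range(ratio, target_range):
    """Check that a ratio falls inside the inclusive [low, high] interval."""
    low, high = target_range
    return low <= ratio <= high


samples = ["Too short.", "This sentence has exactly six words."]
for sentence, n_words, flag in flag_by_length(samples, **thresholds["structural"]):
    print(f"{flag:9} {n_words:2d} words: {sentence}")
print(ratio_in_range(0.50, thresholds["gender"]["target_ratio"]))
```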

Custom Analyzers

Create custom analyzers for specialized analysis:

from langquality.analyzers import Analyzer

class ToneAnalyzer(Analyzer):
    """Analyze tone and sentiment of sentences."""
    
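    def analyze(self, sentences):
        # Your analysis logic goes here; populate and return the metrics.
        metrics = {}  # placeholder so the example runs as written
        return metrics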
    
    def get_requirements(self):
        return ["tone_lexicon"]  # Required resources

Place your analyzer in the plugins directory and it will be automatically discovered.

See Creating Analyzers for details.
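For a fuller picture of what an analyzer body might look like, here is a self-contained sketch of a lexicon-based tone counter. The `Analyzer` class below is a stand-in defined only for this example, not LangQuality's real base class. The constructor argument and the shape of the returned dictionary are also assumptions, since the README does not specify how resources are injected or what `analyze` must return.

```python
from abc import ABC, abstractmethod


class Analyzer(ABC):
    """Stand-in for langquality.analyzers.Analyzer (assumed interface)."""

    @abstractmethod
    def analyze(self, sentences):
        """Return metrics computed over the given sentences."""

    def get_requirements(self):
        return []


class ToneAnalyzer(Analyzer):
    """Count tone labels using a word-to-label lexicon."""

    def __init__(self, tone_lexicon):
        # Hypothetical resource: maps a lowercase word to a tone label.
        self.tone_lexicon = tone_lexicon

    def analyze(self, sentences):
        counts = {}
        for sentence in sentences:
            for word in sentence.lower().split():
                label = self.tone_lexicon.get(word)
                if label is not None:
                    counts[label] = counts.get(label, 0) + 1
        return {"tone_counts": counts, "total_sentences": len(sentences)}

    def get_requirements(self):
        return ["tone_lexicon"]  # Required resources


analyzer = ToneAnalyzer({"good": "positive", "bad": "negative"})
result = analyzer.analyze(["Good day", "bad luck and bad weather"])
print(result)
```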

🤝 Contributing

We welcome contributions from the community. You can help by:

  • 🌍 Creating a Language Pack for your language
  • 🔧 Adding new analyzers or features
  • 📝 Improving documentation
  • 🐛 Reporting bugs or issues
  • 💡 Suggesting enhancements

Please see our Contributing Guide for:

  • Code of Conduct
  • Development setup
  • Contribution workflow
  • Coding standards
  • Testing requirements

Quick Contribution Steps

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make your changes and add tests
  4. Ensure tests pass: pytest
  5. Commit: git commit -m 'Add amazing feature'
  6. Push: git push origin feature/amazing-feature
  7. Open a Pull Request

📊 Project Status

LangQuality is actively maintained and under continuous development. See our CHANGELOG for recent updates and our Roadmap for planned features.

Current version: 1.0.0 (Stable)

📜 License

LangQuality is released under the MIT License. You are free to use, modify, and distribute this software for any purpose, including commercial applications.

🙏 Acknowledgments

LangQuality evolved from the Fongbe Data Quality Pipeline, originally developed to support dataset creation for Fongbe, a low-resource language from Benin. We're grateful to:

  • The linguistic community working on African language preservation and NLP development
  • Contributors who have created Language Packs and shared their expertise
  • The open-source NLP community for tools and libraries that make this work possible

📚 Citation

If you use LangQuality in your research, please cite:

@software{langquality_toolkit,
  title={LangQuality: Language Quality Toolkit for Low-Resource Languages},
  author={LangQuality Community},
  year={2024},
  url={https://github.com/langquality/langquality},
  version={1.0.0}
}

Made with ❤️ for low-resource language communities worldwide
