
K'Cho Linguistic Toolkit

A comprehensive toolkit for K'Cho language processing with collocation extraction, morphological analysis, and corpus processing.

Based on linguistic research by George Bedell and Kee Shein Mang (2012), this toolkit was developed by Hung Om, an enthusiastic K'Cho speaker and independent developer, to provide essential tools for working with K'Cho, a Kuki-Chin language spoken by 10,000-20,000 people in southern Chin State, Myanmar.

🎯 What This Toolkit Does

This is a single, integrated package that provides:

  • ✅ Collocation Extraction - Extract meaningful word combinations using multiple association measures
  • ✅ Morphological Analysis - Analyze K'Cho word structure (stems, affixes, particles)
  • ✅ Text Normalization - Clean and normalize K'Cho text for analysis
  • ✅ Corpus Building - Create annotated datasets with quality control
  • ✅ Lexicon Management - Build and manage digital K'Cho dictionaries
  • ✅ Data Export - Export to standard formats (JSON, CoNLL-U, CSV)
  • ✅ Evaluation Tools - Evaluate collocation extraction quality
  • ✅ Parallel Corpus Processing - Process aligned K'Cho-English texts
  • ✅ ML-Ready Output - Prepare data for machine learning training

🚀 Quick Start

Installation

# Install in development mode
pip install -e .

# Or install from PyPI (when published)
pip install kcho-linguistic-toolkit

Basic Usage

from kcho import CollocationExtractor, KChoSystem

# Initialize the system
system = KChoSystem()

# Extract collocations from corpus
extractor = CollocationExtractor()
corpus = ["Om noh Yong am paapai pe ci", "Ak'hmó lùum ci"]
results = extractor.extract(corpus)

# Use advanced defaultdict functionality
pos_patterns = system.corpus.analyze_pos_patterns()
word_contexts = extractor.analyze_word_contexts(corpus)

Command Line Interface

# Run collocation extraction
python -m kcho.create_gold_standard --corpus data/sample_corpus.txt --output gold_standard.txt

# Use the main CLI
kcho analyze --corpus data/sample_corpus.txt --output results/

📦 Installation

Install the package in development mode:

# Clone the repository
git clone https://github.com/HungOm/kcho-linguistic-toolkit.git
cd kcho-linguistic-toolkit

# Install in development mode
pip install -e .

# Verify installation
python -c "from kcho import CollocationExtractor; print('✅ Installation successful!')"

๐Ÿ“ Project Structure

The toolkit is organized following Python packaging best practices:

KchoLinguisticToolkit/
├── kcho/                           # Main package
│   ├── __init__.py                 # Package initialization
│   ├── collocation.py              # Collocation extraction
│   ├── kcho_system.py              # Core system
│   ├── normalize.py                # Text normalization
│   ├── evaluation.py               # Evaluation utilities
│   ├── export.py                   # Export functions
│   ├── eng_kcho_parallel_extractor.py
│   ├── export_training_csv.py
│   ├── create_gold_standard.py     # Gold standard helper
│   ├── kcho_app.py                 # CLI entry point
│   └── data/                       # Package data
│       ├── linguistic_data.json
│       └── word_frequency_top_1000.csv
├── examples/                       # Example scripts
│   └── defaultdict_usage.py
├── data/                           # External data (not in package)
│   ├── README.md                   # Data documentation
│   ├── sample_corpus.txt           # Small, keep in git
│   ├── gold_standard_collocations.txt
│   ├── bible_versions/             # Large, .gitignored
│   ├── parallel_corpora/           # Medium, .gitignored
│   └── research_outputs/           # Generated, .gitignored
├── .gitignore                      # Comprehensive ignore rules
├── pyproject.toml                  # Package configuration
└── README.md                       # This file

🌟 Key Features

1. Collocation Extraction

Advanced collocation extraction with multiple association measures:

  • PMI (Pointwise Mutual Information) - Classical measure for word association
  • NPMI (Normalized PMI) - Bounded [-1, 1] variant for cross-corpus comparison
  • t-score - Statistical significance testing
  • Dice Coefficient - Symmetric association measure
  • Log-likelihood Ratio (G²) - Asymptotic significance testing

from kcho import CollocationExtractor

extractor = CollocationExtractor()
results = extractor.extract(corpus)

# Group by POS patterns using defaultdict
pos_groups = extractor.group_collocations_by_pos_pattern(corpus)

# Analyze word contexts
contexts = extractor.analyze_word_contexts(corpus, context_window=3)
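For readers who want the arithmetic behind the first two measures, here is a minimal, self-contained sketch in plain Python. It is illustrative only, not the toolkit's internal implementation (`bigram_scores` is a made-up name, and both unigram and bigram probabilities use the token count as denominator for simplicity):

```python
import math
from collections import Counter

def bigram_scores(tokens):
    """Score adjacent word pairs with PMI and NPMI from raw counts."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), c in bigrams.items():
        p_xy = c / n
        p_x, p_y = unigrams[w1] / n, unigrams[w2] / n
        pmi = math.log2(p_xy / (p_x * p_y))
        npmi = pmi / -math.log2(p_xy)  # rescales PMI into [-1, 1]
        scores[(w1, w2)] = (pmi, npmi)
    return scores

scores = bigram_scores("om noh yong am paapai pe ci om noh yong".split())
pmi, npmi = scores[("om", "noh")]  # "om" and "noh" always co-occur, so NPMI = 1.0
```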

2. Morphological Analysis

Based on K'Cho linguistic research, the toolkit understands:

  • Applicative Suffix (-na/-nák)

    • luum-na = "play with"
    • Automatically detects and analyzes
  • Agreement Particles (ka, na, a)

  • Postpositions (noh, ah, am, on)

  • Tense Markers (ci, khai)

Example:

sentence = toolkit.analyze("Ak'hmó noh k'khìm luum-na ci")
# Automatically identifies: subject + postposition + instrument + verb-APPL + tense
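This kind of identification can be approximated with a few token-level rules. The sketch below is hypothetical: the particle lists come from this README, and `tag_token` is not part of the toolkit's API:

```python
# Particle inventories taken from the feature list above (illustrative subset).
POSTPOSITIONS = {"noh", "ah", "am", "on"}
TENSE_MARKERS = {"ci", "khai"}
APPLICATIVE = ("-na", "-nák")

def tag_token(token):
    """Assign a coarse morphological label to one K'Cho token."""
    if token in POSTPOSITIONS:
        return "POSTP"
    if token in TENSE_MARKERS:
        return "TENSE"
    for suffix in APPLICATIVE:
        if token.endswith(suffix):
            return f"VERB+APPL({suffix})"
    return "STEM"

tags = [tag_token(t) for t in "Ak'hmó noh k'khìm luum-na ci".split()]
# → ['STEM', 'POSTP', 'STEM', 'VERB+APPL(-na)', 'TENSE']
```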

3. Text Validation

Automatically detects K'cho text with confidence scoring:

is_kcho, confidence, metrics = toolkit.validate("Om noh Yong am paapai pe ci")
# Returns: (True, 0.875, {...detailed metrics...})

Validation Features:

  • Character set validation
  • K'cho marker detection (postpositions, particles)
  • Pattern matching for K'cho structures
  • Confidence scoring (0-100%)
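One simple way such a confidence score can be built is a marker-density heuristic: count how many tokens are known K'cho function words. This is an illustrative sketch, not the toolkit's actual scoring function (`marker_confidence` and the scaling factor are assumptions):

```python
# Postpositions, tense markers, and agreement particles listed in this README.
KCHO_MARKERS = {"noh", "ah", "am", "on", "ci", "khai", "ka", "na", "a"}

def marker_confidence(text):
    """Toy confidence score: fraction of tokens that are known K'cho markers,
    scaled by 2 so a modest density of markers already scores highly."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in KCHO_MARKERS)
    return min(1.0, 2 * hits / len(tokens))

conf = marker_confidence("Om noh Yong am paapai pe ci")  # 3 markers in 7 tokens
```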

4. Corpus Building

Build clean, annotated K'cho datasets:

# Add with automatic analysis
toolkit.add_to_corpus(
    "Om noh Yong am paapai pe ci",
    translation="Om gave Yong flowers"
)

# Get statistics
stats = toolkit.corpus_stats()
# Returns: total_sentences, vocabulary_size, POS distribution, etc.

# Create ML splits
splits = toolkit.corpus.create_splits(train_ratio=0.8)
# Returns: {'train': [...], 'dev': [...], 'test': [...]}
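A deterministic split of this kind can be sketched in a few lines of plain Python. This is an assumption about how such a method might work, not the toolkit's actual `create_splits` (which may differ in shuffling and ratio handling):

```python
import random

def make_splits(sentences, train_ratio=0.8, seed=42):
    """Shuffle deterministically, then carve off train; dev and test
    share the remainder as evenly as possible."""
    items = list(sentences)
    random.Random(seed).shuffle(items)  # fixed seed keeps splits reproducible
    n_train = int(len(items) * train_ratio)
    n_dev = (len(items) - n_train) // 2
    return {
        "train": items[:n_train],
        "dev": items[n_train:n_train + n_dev],
        "test": items[n_train + n_dev:],
    }

splits = make_splits([f"sent{i}" for i in range(10)])
# → 8 train, 1 dev, 1 test, no overlap
```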

5. Lexicon Management

SQLite-based dictionary with full search:

from kcho import LexiconEntry

# Add words
entry = LexiconEntry(
    headword="paapai",
    pos="N",
    gloss_en="flower",
    gloss_my="ပန်း",  # Myanmar translation
    examples=["Om noh Yong am paapai pe ci"]
)
toolkit.lexicon.add_entry(entry)

# Search
results = toolkit.search_lexicon("flower")

# Get frequency list
top_words = toolkit.lexicon.get_frequency_list(100)

6. Data Export

Export to multiple standard formats:

# JSON (for ML training)
toolkit.corpus.export_json("corpus.json")

# CoNLL-U (for linguistic research)
toolkit.corpus.export_conllu("corpus.conllu")

# CSV (for spreadsheet analysis)
toolkit.corpus.export_csv("corpus.csv")

# Or export everything at once
toolkit.export_all()
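For reference, a CoNLL-U sentence block is ten tab-separated columns per token, with `_` for unannotated fields and `#` comment lines for metadata. A minimal hand-rolled writer (illustrative only, not the toolkit's exporter) looks like this:

```python
def to_conllu(sentence_id, tokens):
    """Render (form, upos) pairs as one minimal CoNLL-U sentence block."""
    lines = [
        f"# sent_id = {sentence_id}",
        f"# text = {' '.join(form for form, _ in tokens)}",
    ]
    for i, (form, upos) in enumerate(tokens, 1):
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        cols = [str(i), form, "_", upos, "_", "_", "_", "_", "_", "_"]
        lines.append("\t".join(cols))
    return "\n".join(lines) + "\n"

block = to_conllu(1, [("Om", "PROPN"), ("noh", "ADP"), ("pe", "VERB"), ("ci", "PART")])
```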

📊 Use Cases

Machine Translation Training

# Build parallel corpus
for kcho, english in parallel_sentences:
    toolkit.add_to_corpus(kcho, translation=english)

# Create splits
splits = toolkit.corpus.create_splits()

# Export for training
for split_name, sentences in splits.items():
    data = [{'source': s.text, 'target': s.translation} for s in sentences]
    # Use with Hugging Face, Fairseq, etc.
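Those records are commonly written out as JSON Lines, one translation pair per line, which most sequence-to-sequence training loaders accept directly. A minimal sketch (the `write_jsonl` helper is illustrative, not part of the toolkit):

```python
import json

def write_jsonl(pairs, path):
    """Write (kcho, english) pairs as JSON Lines for MT training."""
    with open(path, "w", encoding="utf-8") as f:
        for src, tgt in pairs:
            # ensure_ascii=False keeps K'Cho diacritics readable in the file
            f.write(json.dumps({"source": src, "target": tgt}, ensure_ascii=False) + "\n")

write_jsonl([("Om noh Yong am paapai pe ci", "Om gave Yong flowers")], "train.jsonl")
```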

Linguistic Research

# Analyze corpus
stats = toolkit.corpus_stats()
print(f"POS distribution: {stats['pos_distribution']}")

# Study verb paradigms
paradigm = toolkit.get_verb_forms('lùum')
# Returns complete conjugation tables

# Export to CoNLL-U for dependency parsing research
toolkit.corpus.export_conllu("research_corpus.conllu")

Dictionary Application Backend

# Search API
results = toolkit.search_lexicon(query)

# Morphological analysis API
analysis = toolkit.analyze(user_input)

# Validation API
is_valid, confidence, _ = toolkit.validate(user_text)

๐Ÿ“ File Structure

The toolkit creates this organized structure:

your_project/
├── kcho_lexicon.db          # SQLite dictionary
├── corpus/                  # Raw corpus data
├── exports/                 # Exported datasets
│   ├── corpus_*.json
│   ├── corpus_*.conllu
│   ├── corpus_*.csv
│   └── lexicon_*.json
└── reports/                 # Quality reports
    └── report_*.json

🎓 Examples

See kcho_examples.py for 8 complete examples:

  1. Basic Analysis - Analyze K'cho sentences
  2. Build Corpus - Create annotated corpus
  3. Validate Text - Detect K'cho text
  4. Lexicon Management - Work with dictionary
  5. Verb Paradigms - Generate conjugation tables
  6. Data Export - Export to different formats
  7. Quality Control - Validate corpus quality
  8. ML Preparation - Prepare training data

Run examples:

python kcho_examples.py

📊 Data Organization

The toolkit includes several types of data:

Package Data (included in installation)

  • kcho/data/linguistic_data.json - Core linguistic knowledge base
  • kcho/data/word_frequency_top_1000.csv - High-frequency word list

External Data (not in package)

  • data/sample_corpus.txt - Small sample corpus for testing
  • data/gold_standard_collocations.txt - Gold standard annotations
  • data/bible_versions/ - Bible translations (public domain, large files)
  • data/parallel_corpora/ - Aligned parallel texts
  • data/research_outputs/ - Generated analysis results

Note: Large data files are not included in the package to keep it lightweight. See data/README.md for details on data sources and copyright information.

🔬 Based on Research

This toolkit implements findings from:

  • Bedell, G. & Mang, K. S. (2012). "The Applicative Suffix -na in K'cho"
  • Jordan, M. (1969). "Chin Dictionary and Grammar"
  • K'cho linguistic research on verb stem alternation and morphology

🎯 What You Can Build

With this toolkit, you can create:

  1. K'cho-English Machine Translation

    • Generate parallel corpus
    • Export in ML-ready format
    • Train transformer models
  2. K'cho Dictionary App

    • SQLite backend ready
    • Full-text search
    • Multi-lingual support
  3. Text Analysis Tools

    • Morphological analyzer
    • Grammar checker
    • Spell checker (with lexicon validation)
  4. Linguistic Research Tools

    • Annotated corpus
    • Statistical analysis
    • Pattern discovery
  5. Language Learning Apps

    • Verb conjugation practice
    • Example sentence database
    • Vocabulary lists by frequency

📈 Data Quality

Built-in quality control:

  • ✅ Text validation with confidence scoring
  • ✅ Morphological validation (checks grammatical structure)
  • ✅ Character set validation (ensures K'cho characters)
  • ✅ Quality reports (identifies issues in corpus)

Example:

quality = toolkit.corpus.quality_report()
print(f"Validated: {quality['validated_sentences']}/{quality['total_sentences']}")
print(f"Avg confidence: {quality['avg_confidence']:.2%}")

🚦 Project Status

Status: Production Ready ✅

  • ✅ Core features complete
  • ✅ Fully documented
  • ✅ Example code provided
  • ✅ Based on peer-reviewed research
  • ✅ No external dependencies

๐Ÿค Contributing

To extend the toolkit:

  1. Add vocabulary: Extend KchoConfig.VERB_STEMS
  2. Add patterns: Update validation patterns
  3. Add languages: Add more gloss languages to LexiconEntry
  4. Report issues: Document any K'cho linguistic features not yet handled

๐Ÿ“ Citation

If you use this toolkit in research, please cite:

@misc{kcho_toolkit_2025,
  title={K'cho Language Toolkit: A Unified Package for K'cho Language Processing},
  author={Based on research by Bedell, George and Mang, Kee Shein},
  year={2025},
  note={Linguistic analysis based on "The Applicative Suffix -na in K'cho" (2012)}
}

โš ๏ธ Important Notes

  • K'cho has no standard orthography - this toolkit handles common variants
  • The toolkit focuses on Mindat Township dialect (southern Chin State)
  • Based on research from early 2000s - contemporary usage may vary
  • Speaker population: approximately 10,000-20,000

🔮 Future Enhancements

Potential additions (not yet implemented):

  • Audio processing (speech recognition/synthesis)
  • Neural morphological analyzer
  • Automatic tokenization improvements
  • More comprehensive verb stem database
  • Integration with existing Chin language tools

📞 Support

For K'cho linguistic questions, refer to:

  • Published papers by George Bedell and Kee Shein Mang
  • Jordan's Chin Dictionary and Grammar (1969)
  • K'cho community language documentation

📄 License

This toolkit is provided for K'cho language research, documentation, and preservation.


Version: 1.0.0
Language: K'cho (Kuki-Chin family)
Region: Mindat Township, Southern Chin State, Myanmar
Speakers: ~10,000-20,000



"Preserving K'cho for future generations through technology"
