K'Cho Linguistic Toolkit
A comprehensive toolkit for K'Cho language processing with collocation extraction, morphological analysis, and corpus processing.
Based on linguistic research by George Bedell and Kee Shein Mang (2012), this toolkit was developed by Hung Om, an enthusiastic K'Cho speaker and independent developer, to provide essential tools for working with K'Cho, a Kuki-Chin language spoken by roughly 10,000-20,000 people in southern Chin State, Myanmar.
🎯 What This Toolkit Does
This is a single, integrated package that provides:
- ✅ Collocation Extraction - Extract meaningful word combinations using multiple association measures
- ✅ Morphological Analysis - Analyze K'Cho word structure (stems, affixes, particles)
- ✅ Text Normalization - Clean and normalize K'Cho text for analysis
- ✅ Corpus Building - Create annotated datasets with quality control
- ✅ Lexicon Management - Build and manage digital K'Cho dictionaries
- ✅ Data Export - Export to standard formats (JSON, CoNLL-U, CSV)
- ✅ Evaluation Tools - Evaluate collocation extraction quality
- ✅ Parallel Corpus Processing - Process aligned K'Cho-English texts
- ✅ ML-Ready Output - Prepare data for machine learning training
🚀 Quick Start
Installation
# Install in development mode
pip install -e .
# Or install from PyPI (when published)
pip install kcho-linguistic-toolkit
Basic Usage
from kcho import CollocationExtractor, KChoSystem
# Initialize the system
system = KChoSystem()
# Extract collocations from corpus
extractor = CollocationExtractor()
corpus = ["Om noh Yong am paapai pe ci", "Ak'hmó lùum ci"]
results = extractor.extract(corpus)
# Use advanced defaultdict functionality
pos_patterns = system.corpus.analyze_pos_patterns()
word_contexts = extractor.analyze_word_contexts(corpus)
Command Line Interface
# Create a gold standard from the sample corpus
python -m kcho.create_gold_standard --corpus data/sample_corpus.txt --output gold_standard.txt
# Use the main CLI
kcho analyze --corpus data/sample_corpus.txt --output results/
📦 Installation
Install the package in development mode:
# Clone the repository
git clone https://github.com/HungOm/kcho-linguistic-toolkit.git
cd kcho-linguistic-toolkit
# Install in development mode
pip install -e .
# Verify installation
python -c "from kcho import CollocationExtractor; print('✅ Installation successful!')"
📁 Project Structure
The toolkit is organized following Python packaging best practices:
KchoLinguisticToolkit/
├── kcho/                          # Main package
│   ├── __init__.py                # Package initialization
│   ├── collocation.py             # Collocation extraction
│   ├── kcho_system.py             # Core system
│   ├── normalize.py               # Text normalization
│   ├── evaluation.py              # Evaluation utilities
│   ├── export.py                  # Export functions
│   ├── eng_kcho_parallel_extractor.py
│   ├── export_training_csv.py
│   ├── create_gold_standard.py    # Gold standard helper
│   ├── kcho_app.py                # CLI entry point
│   └── data/                      # Package data
│       ├── linguistic_data.json
│       └── word_frequency_top_1000.csv
├── examples/                      # Example scripts
│   └── defaultdict_usage.py
├── data/                          # External data (not in package)
│   ├── README.md                  # Data documentation
│   ├── sample_corpus.txt          # Small, keep in git
│   ├── gold_standard_collocations.txt
│   ├── bible_versions/            # Large, .gitignored
│   ├── parallel_corpora/          # Medium, .gitignored
│   └── research_outputs/          # Generated, .gitignored
├── .gitignore                     # Comprehensive ignore rules
├── pyproject.toml                 # Package configuration
└── README.md                      # This file
🔍 Key Features
1. Collocation Extraction
Advanced collocation extraction with multiple association measures:
- PMI (Pointwise Mutual Information) - Classical measure for word association
- NPMI (Normalized PMI) - Variant bounded to [-1, 1] for easier comparison
- t-score - Statistical significance testing
- Dice Coefficient - Symmetric association measure
- Log-likelihood Ratio (G²) - Asymptotic significance testing
from kcho import CollocationExtractor
extractor = CollocationExtractor()
results = extractor.extract(corpus)
# Group by POS patterns using defaultdict
pos_groups = extractor.group_collocations_by_pos_pattern(corpus)
# Analyze word contexts
contexts = extractor.analyze_word_contexts(corpus, context_window=3)
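The association measures above can be illustrated with a small standalone sketch. This computes PMI and NPMI directly from adjacent-pair counts; it is illustrative only, not the toolkit's own `CollocationExtractor` implementation:

```python
import math
from collections import Counter

def pmi_scores(sentences, min_count=1):
    """Score adjacent word pairs with PMI and NPMI from raw counts."""
    unigrams, bigrams = Counter(), Counter()
    total = 0
    for sent in sentences:
        tokens = sent.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs only
        total += len(tokens)
    scores = {}
    for (w1, w2), n12 in bigrams.items():
        if n12 < min_count:
            continue
        p12 = n12 / total
        p1, p2 = unigrams[w1] / total, unigrams[w2] / total
        pmi = math.log2(p12 / (p1 * p2))
        npmi = pmi / -math.log2(p12)  # normalized into [-1, 1]
        scores[(w1, w2)] = (pmi, npmi)
    return scores

corpus = ["Om noh Yong am paapai pe ci", "Ak'hmó lùum ci"]
scores = pmi_scores(corpus)
```

Pairs that always co-occur (like "noh Yong" in this tiny corpus) reach the NPMI ceiling of 1.0, which is why NPMI is convenient for ranking across corpora of different sizes.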
2. Morphological Analysis
Based on K'Cho linguistic research, the toolkit understands:
- Applicative Suffix (-na/-nák) - e.g. luum-na "play with"; automatically detected and analyzed
- Agreement Particles (ka, na, a)
- Postpositions (noh, ah, am, on)
- Tense Markers (ci, khai)
Example:
sentence = toolkit.analyze("Ak'hmó noh k'khìm luum-na ci")
# Automatically identifies: subject + postposition + instrument + verb-APPL + tense
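As a rough illustration of this kind of closed-class tagging, here is a standalone sketch whose marker inventories come from the lists above; a real analyzer would draw on the toolkit's bundled linguistic data, so treat this as a sketch, not the toolkit's method:

```python
# Marker inventories taken from the lists in this README (illustrative).
POSTPOSITIONS = {"noh", "ah", "am", "on"}
TENSE_MARKERS = {"ci", "khai"}
AGREEMENT_PARTICLES = {"ka", "na", "a"}

def tag_tokens(sentence):
    """Tag closed-class K'cho markers; everything else is left as WORD."""
    tags = []
    for tok in sentence.split():
        if tok in POSTPOSITIONS:
            tags.append((tok, "POSTP"))
        elif tok in TENSE_MARKERS:
            tags.append((tok, "TENSE"))
        elif tok in AGREEMENT_PARTICLES:
            tags.append((tok, "AGR"))
        elif tok.endswith("-na") or tok.endswith("-nák"):
            tags.append((tok, "V-APPL"))  # applicative suffix -na/-nák
        else:
            tags.append((tok, "WORD"))
    return tags

tags = tag_tokens("Ak'hmó noh k'khìm luum-na ci")
```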
3. Text Validation
Automatically detects K'cho text with confidence scoring:
is_kcho, confidence, metrics = toolkit.validate("Om noh Yong am paapai pe ci")
# Returns: (True, 0.875, {...detailed metrics...})
Validation Features:
- Character set validation
- K'cho marker detection (postpositions, particles)
- Pattern matching for K'cho structures
- Confidence scoring (0-100%)
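The confidence score can be pictured with a toy marker-density scorer. The marker set, scaling, and 0.5 threshold below are illustrative assumptions, not the toolkit's actual heuristics:

```python
# Closed-class markers from this README's lists (illustrative subset).
MARKERS = {"noh", "ah", "am", "on", "ci", "khai", "ka", "na", "a"}

def validate_kcho(text):
    """Score text by density of K'cho closed-class markers."""
    tokens = text.split()
    if not tokens:
        return False, 0.0
    hits = sum(1 for t in tokens if t in MARKERS)
    confidence = min(1.0, 2 * hits / len(tokens))  # scale marker density
    return confidence >= 0.5, confidence

is_kcho, conf = validate_kcho("Om noh Yong am paapai pe ci")
```

A production validator would combine this with character-set checks and structural pattern matching, as the feature list above describes.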
4. Corpus Building
Build clean, annotated K'cho datasets:
# Add with automatic analysis
toolkit.add_to_corpus(
    "Om noh Yong am paapai pe ci",
    translation="Om gave Yong flowers"
)
# Get statistics
stats = toolkit.corpus_stats()
# Returns: total_sentences, vocabulary_size, POS distribution, etc.
# Create ML splits
splits = toolkit.corpus.create_splits(train_ratio=0.8)
# Returns: {'train': [...], 'dev': [...], 'test': [...]}
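The split step can be sketched as a deterministic shuffle-and-slice; the ratio handling and fixed seed below are assumptions for illustration, not necessarily how `create_splits` works internally:

```python
import random

def create_splits(sentences, train_ratio=0.8, dev_ratio=0.1, seed=0):
    """Split a corpus into train/dev/test with a reproducible shuffle."""
    rng = random.Random(seed)  # fixed seed so splits are reproducible
    items = list(sentences)
    rng.shuffle(items)
    n_train = int(len(items) * train_ratio)
    n_dev = int(len(items) * dev_ratio)
    return {
        "train": items[:n_train],
        "dev": items[n_train:n_train + n_dev],
        "test": items[n_train + n_dev:],
    }

splits = create_splits([f"sent {i}" for i in range(10)])
```

Shuffling before slicing matters: corpora are often ordered by source text, and a plain slice would put all of one document in the test set.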
5. Lexicon Management
SQLite-based dictionary with full search:
from kcho import LexiconEntry
# Add words
entry = LexiconEntry(
    headword="paapai",
    pos="N",
    gloss_en="flower",
    gloss_my="ပန်း",  # Myanmar translation
    examples=["Om noh Yong am paapai pe ci"]
)
toolkit.lexicon.add_entry(entry)
# Search
results = toolkit.search_lexicon("flower")
# Get frequency list
top_words = toolkit.lexicon.get_frequency_list(100)
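A minimal in-memory sketch of such an SQLite-backed lexicon; the table schema and column names here are illustrative, not the toolkit's actual schema:

```python
import sqlite3

# Create an in-memory lexicon table (use a file path for persistence).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE lexicon (
    headword TEXT PRIMARY KEY,
    pos      TEXT,
    gloss_en TEXT,
    gloss_my TEXT)""")

# Parameter substitution (?) avoids SQL injection and quoting issues.
conn.execute("INSERT INTO lexicon VALUES (?, ?, ?, ?)",
             ("paapai", "N", "flower", "ပန်း"))

# Search by English gloss.
row = conn.execute(
    "SELECT headword FROM lexicon WHERE gloss_en = ?", ("flower",)
).fetchone()
```

SQLite is a good fit here because the whole dictionary lives in one file, needs no server, and still supports indexed full-text search via the FTS5 extension.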
6. Data Export
Export to multiple standard formats:
# JSON (for ML training)
toolkit.corpus.export_json("corpus.json")
# CoNLL-U (for linguistic research)
toolkit.corpus.export_conllu("corpus.conllu")
# CSV (for spreadsheet analysis)
toolkit.corpus.export_csv("corpus.csv")
# Or export everything at once
toolkit.export_all()
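The CoNLL-U format itself is simple: ten tab-separated columns per token, with `_` for unfilled fields, preceded by comment lines. A minimal writer, independent of the toolkit's exporter:

```python
def to_conllu(sent_id, tokens):
    """Emit one sentence in CoNLL-U with only ID and FORM filled in."""
    lines = [f"# sent_id = {sent_id}", f"# text = {' '.join(tokens)}"]
    for i, tok in enumerate(tokens, start=1):
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        cols = [str(i), tok] + ["_"] * 8
        lines.append("\t".join(cols))
    return "\n".join(lines) + "\n\n"  # blank line terminates the sentence

block = to_conllu("kcho-1", "Om noh Yong am paapai pe ci".split())
```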
📚 Use Cases
Machine Translation Training
# Build parallel corpus
for kcho, english in parallel_sentences:
    toolkit.add_to_corpus(kcho, translation=english)

# Create splits
splits = toolkit.corpus.create_splits()

# Export for training
for split_name, sentences in splits.items():
    data = [{'source': s.text, 'target': s.translation} for s in sentences]
    # Use with Hugging Face, Fairseq, etc.
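For frameworks that consume JSON Lines (e.g. Hugging Face `datasets`), the pairs can be serialized one record per line; the `source`/`target` field names below are illustrative, not fixed by the toolkit:

```python
import json

# Serialize parallel pairs as JSON Lines: one JSON object per line.
pairs = [("Om noh Yong am paapai pe ci", "Om gave Yong flowers")]
lines = [json.dumps({"source": k, "target": e}, ensure_ascii=False)
         for k, e in pairs]
jsonl = "\n".join(lines)
```

`ensure_ascii=False` keeps K'cho diacritics and Burmese glosses readable in the output file instead of escaping them to `\uXXXX` sequences.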
Linguistic Research
# Analyze corpus
stats = toolkit.corpus_stats()
print(f"POS distribution: {stats['pos_distribution']}")
# Study verb paradigms
paradigm = toolkit.get_verb_forms('lùum')
# Returns complete conjugation tables
# Export to CoNLL-U for dependency parsing research
toolkit.corpus.export_conllu("research_corpus.conllu")
Dictionary Application Backend
# Search API
results = toolkit.search_lexicon(query)
# Morphological analysis API
analysis = toolkit.analyze(user_input)
# Validation API
is_valid, confidence, _ = toolkit.validate(user_text)
📂 File Structure
The toolkit creates this organized structure:
your_project/
├── kcho_lexicon.db        # SQLite dictionary
├── corpus/                # Raw corpus data
├── exports/               # Exported datasets
│   ├── corpus_*.json
│   ├── corpus_*.conllu
│   ├── corpus_*.csv
│   └── lexicon_*.json
└── reports/               # Quality reports
    └── report_*.json
📝 Examples
See kcho_examples.py for 8 complete examples:
- Basic Analysis - Analyze K'cho sentences
- Build Corpus - Create annotated corpus
- Validate Text - Detect K'cho text
- Lexicon Management - Work with dictionary
- Verb Paradigms - Generate conjugation tables
- Data Export - Export to different formats
- Quality Control - Validate corpus quality
- ML Preparation - Prepare training data
Run examples:
python kcho_examples.py
📖 Documentation
- KCHO_TOOLKIT_DOCS.md - Complete API reference and usage guide
- kcho_examples.py - 8 practical examples
- kcho_toolkit.py - Main source code (well-documented)
📊 Data Organization
The toolkit includes several types of data:
Package Data (included in installation)
- kcho/data/linguistic_data.json - Core linguistic knowledge base
- kcho/data/word_frequency_top_1000.csv - High-frequency word list
External Data (not in package)
- data/sample_corpus.txt - Small sample corpus for testing
- data/gold_standard_collocations.txt - Gold standard annotations
- data/bible_versions/ - Bible translations (public domain, large files)
- data/parallel_corpora/ - Aligned parallel texts
- data/research_outputs/ - Generated analysis results
Note: Large data files are not included in the package to keep it lightweight. See data/README.md for details on data sources and copyright information.
🔬 Based on Research
This toolkit implements findings from:
- Bedell, G. & Mang, K. S. (2012). "The Applicative Suffix -na in K'cho"
- Jordan, M. (1969). "Chin Dictionary and Grammar"
- K'cho linguistic research on verb stem alternation and morphology
🎯 What You Can Build
With this toolkit, you can create:
- K'cho-English Machine Translation
  - Generate parallel corpus
  - Export in ML-ready format
  - Train transformer models
- K'cho Dictionary App
  - SQLite backend ready
  - Full-text search
  - Multi-lingual support
- Text Analysis Tools
  - Morphological analyzer
  - Grammar checker
  - Spell checker (with lexicon validation)
- Linguistic Research Tools
  - Annotated corpus
  - Statistical analysis
  - Pattern discovery
- Language Learning Apps
  - Verb conjugation practice
  - Example sentence database
  - Vocabulary lists by frequency
📈 Data Quality
Built-in quality control:
- ✅ Text validation with confidence scoring
- ✅ Morphological validation (checks grammatical structure)
- ✅ Character set validation (ensures K'cho characters)
- ✅ Quality reports (identifies issues in corpus)
Example:
quality = toolkit.corpus.quality_report()
print(f"Validated: {quality['validated_sentences']}/{quality['total_sentences']}")
print(f"Avg confidence: {quality['avg_confidence']:.2%}")
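The report fields printed above can be computed with a simple aggregate; this sketch assumes per-sentence `validated` and `confidence` values and is not the toolkit's own implementation:

```python
def quality_report(sentences):
    """Summarize validation coverage and mean confidence for a corpus."""
    validated = [s for s in sentences if s["validated"]]
    avg = (sum(s["confidence"] for s in sentences) / len(sentences)
           if sentences else 0.0)
    return {
        "total_sentences": len(sentences),
        "validated_sentences": len(validated),
        "avg_confidence": avg,
    }

report = quality_report([
    {"validated": True, "confidence": 0.9},
    {"validated": False, "confidence": 0.4},
])
```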
🚦 Project Status
Status: Production Ready ✅
- ✅ Core features complete
- ✅ Fully documented
- ✅ Example code provided
- ✅ Based on peer-reviewed research
- ✅ No external dependencies
🤝 Contributing
To extend the toolkit:
- Add vocabulary: Extend KchoConfig.VERB_STEMS
- Add patterns: Update validation patterns
- Add languages: Add more gloss languages to LexiconEntry
- Report issues: Document any K'cho linguistic features not yet handled
📄 Citation
If you use this toolkit in research, please cite:
@misc{kcho_toolkit_2025,
  title={K'cho Language Toolkit: A Unified Package for K'cho Language Processing},
  author={Based on research by Bedell, George and Mang, Kee Shein},
  year={2025},
  note={Linguistic analysis based on "The Applicative Suffix -na in K'cho" (2012)}
}
⚠️ Important Notes
- K'cho has no standard orthography - this toolkit handles common variants
- The toolkit focuses on Mindat Township dialect (southern Chin State)
- Based on research from early 2000s - contemporary usage may vary
- Speaker population: approximately 10,000-20,000
🔮 Future Enhancements
Potential additions (not yet implemented):
- Audio processing (speech recognition/synthesis)
- Neural morphological analyzer
- Automatic tokenization improvements
- More comprehensive verb stem database
- Integration with existing Chin language tools
📞 Support
For K'cho linguistic questions, refer to:
- Published papers by George Bedell and Kee Shein Mang
- Jordan's Chin Dictionary and Grammar (1969)
- K'cho community language documentation
📜 License
This toolkit is provided for K'cho language research, documentation, and preservation.
Version: 1.0.0
Language: K'cho (Kuki-Chin family)
Region: Mindat Township, Southern Chin State, Myanmar
Speakers: ~10,000-20,000
"Preserving K'cho for future generations through technology"