Natural Language Processing for Zomi Language (Zopau)
Zomi NLP
Natural Language Processing toolkit for the Zomi language (Zopau).
Features
- 🔤 Tokenization - Smart tokenization with clitic splitting, reduplication handling, and compound word support
- 🏷️ POS Tagging - Rule-based part-of-speech tagging with 600+ lexicon entries
- 📖 Lemmatization - Morphological lemmatization with clitic removal and affix stripping
- 🌲 Dependency Parsing - Grammatical structure analysis with Zomi-specific rules
- 📍 Named Entity Recognition - Entity extraction for PERSON, LOCATION, GPE, DATE, NUMERIC
- 🔬 Morphological Analysis - Morpheme segmentation and feature extraction
- 🔌 Pluggable Backends - Use native Zomi, spaCy, or Stanza backends
- 📊 CoNLL-U Export - Standard 10-column and extended 16-column formats
- 🚀 Production Ready - CI/CD, type hints, comprehensive testing
Coming Soon (v0.5.0+)
- 🔤 Word Sense Disambiguation - Context-aware meaning disambiguation
- 📚 Sense Lexicon - Word sense inventory with examples
- 📈 Statistical Disambiguation - Frequency-based sense prediction
- 🏷️ Sense Tagger - Automatic sense annotation
- 🔧 Nominalizer Detector - Rule-based -na suffix detection with stem alternation handling
Requirements
- Python 3.9 or higher
- pip (latest version recommended)
Dependencies
Zomi NLP can use spaCy or Stanza as optional backends alongside its native pipeline. If both are installed, it prefers Stanza (more accurate) and falls back to spaCy (faster) if needed.
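The preference order described above can be illustrated with a small standalone sketch. It is not the library's own selection code: the names `PREFERRED_BACKENDS`, `is_installed`, and `pick_backend` are hypothetical, and the only assumption is that a backend counts as available when its top-level module can be found.

```python
from importlib.util import find_spec
from typing import Callable, Optional

# Preference order from the text above: Stanza first, then spaCy.
# (Hypothetical constant, not part of zomi_nlp.)
PREFERRED_BACKENDS = ("stanza", "spacy")


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located in this environment."""
    return find_spec(module_name) is not None


def pick_backend(
    available: Callable[[str], bool] = is_installed,
) -> Optional[str]:
    """Return the first available backend in preference order, or None."""
    for name in PREFERRED_BACKENDS:
        if available(name):
            return name
    return None
```

Calling `pick_backend()` with no arguments inspects the current environment. Passing a custom `available` callable makes the preference logic easy to exercise without installing either package.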
Installation Options
Minimal Installation (Native Only)
pip install zomi-nlp
With spaCy (Recommended for Speed)
pip install 'zomi-nlp[spacy]'
python -m spacy download en_core_web_sm
With Stanza (Recommended for Accuracy)
pip install 'zomi-nlp[stanza]'
Full Installation (Both Backends)
pip install 'zomi-nlp[full]'
Quick Start
from zomi_nlp import load
# Load the pipeline (auto-selects best available backend)
nlp = load()
# Process text
text = "Tuni an ka ne hi."
doc = nlp(text)
# Access tokens
for token in doc:
    print(f"{token.text}\t{token.pos_}\t{token.lemma_}\t{token.ent_type_ or 'N/A'}")
# Output:
# Tuni DATE tuni DATE
# an NOUN an N/A
# ka PRON ka N/A
# ne VERB ne N/A
# hi PART hi N/A
# . PUNCT . N/A
Native Pipeline Components
Zomi NLP v0.4.0 introduces a complete native pipeline with no external dependencies:
| Component | Description |
|---|---|
| ZomiTokenizer | Clitic splitting, reduplication, compound words, punctuation |
| ZomiPOSTagger | Rule-based POS tagging with 600+ lexicon entries |
| ZomiLemmatizer | Morphological lemmatization with irregular form handling |
| ZomiDependencyParser | Zomi-specific dependency relations (nsubj, obj, case, etc.) |
| ZomiNER | Named entity recognition for 6+ entity types |
| ZomiMorphologicalAnalyzer | Morpheme segmentation and feature extraction |
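To give a feel for what lexicon-driven, rule-based tagging involves, here is a toy sketch. It is not the `ZomiPOSTagger` implementation: `TOY_LEXICON` and `tag_tokens` are hypothetical names, the five entries are copied from the Quick Start output above, and the fallback tag `X` for unknown words is an assumption.

```python
from typing import Dict, List, Tuple

# Toy lexicon copied from the Quick Start output; the real lexicon is
# described as having 600+ entries. (Hypothetical name, not a zomi_nlp object.)
TOY_LEXICON: Dict[str, str] = {
    "tuni": "DATE",
    "an": "NOUN",
    "ka": "PRON",
    "ne": "VERB",
    "hi": "PART",
}

PUNCTUATION = set(".,;:!?")


def tag_tokens(tokens: List[str], default: str = "X") -> List[Tuple[str, str]]:
    """Tag each token by punctuation rule first, then by lexicon lookup."""
    tagged = []
    for token in tokens:
        if token and all(ch in PUNCTUATION for ch in token):
            tag = "PUNCT"
        else:
            # Lookup is case-insensitive; unknown words get the default tag.
            tag = TOY_LEXICON.get(token.lower(), default)
        tagged.append((token, tag))
    return tagged
```

For example, `tag_tokens(["Tuni", "an", "ka", "ne", "hi", "."])` reproduces the POS column of the Quick Start output. A real tagger would add contextual rules on top of the lookup.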
CoNLL-U Export
from zomi_nlp import load
nlp = load()
doc = nlp("Ka pai ve.")
# Print a subset of the CoNLL-U columns (FORM, LEMMA, UPOS, HEAD, DEPREL)
for token in doc:
    print(f"{token.text}\t{token.lemma_}\t{token.pos_}\t{token.head}\t{token.dep_}")
# Full CoNLL-U column order: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
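For reference, a full 10-column CoNLL-U line can be assembled along the following lines. This is a standalone sketch rather than the library's exporter: `Row` and `to_conllu` are hypothetical names, the head indices and relations in the usage note are made up for illustration, and columns the sketch has no data for are filled with `_`.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Row:
    """Hypothetical stand-in for one analysed token (not a zomi_nlp class)."""
    form: str
    lemma: str
    upos: str
    head: int    # 1-based index of the head token, 0 for the root
    deprel: str


def to_conllu(rows: Iterable[Row]) -> str:
    """Render one sentence as 10-column CoNLL-U, ending with a blank line."""
    lines: List[str] = []
    for idx, row in enumerate(rows, start=1):
        columns = [
            str(idx),       # ID
            row.form,       # FORM
            row.lemma,      # LEMMA
            row.upos,       # UPOS
            "_",            # XPOS (not provided in this sketch)
            "_",            # FEATS
            str(row.head),  # HEAD
            row.deprel,     # DEPREL
            "_",            # DEPS
            "_",            # MISC
        ]
        lines.append("\t".join(columns))
    return "\n".join(lines) + "\n\n"
```

As a usage example, `to_conllu([Row("ka", "ka", "PRON", 2, "nsubj"), Row("pai", "pai", "VERB", 0, "root")])` yields two tab-separated lines followed by the blank line that terminates a CoNLL-U sentence.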
Configuration
from zomi_nlp import ZomiConfig, ZomiPipeline
# Use native Zomi pipeline (default, no dependencies)
config = ZomiConfig(parser_backend="native")
nlp = ZomiPipeline(config)
# Use spaCy for speed
config = ZomiConfig(parser_backend="spacy")
nlp = ZomiPipeline(config)
# Use Stanza for accuracy
config = ZomiConfig(parser_backend="stanza")
nlp = ZomiPipeline(config)
# Auto-select best available
config = ZomiConfig(parser_backend="auto")
nlp = ZomiPipeline(config)
CLI Usage
# Check installation status
zomi-nlp --check
# Diagnose issues
zomi-nlp --doctor
# Process text directly
zomi-nlp "Tuni ka pai hi."
# Output:
# Tuni DATE tuni
# ka PRON ka
# pai VERB pai
# hi PART hi
# . PUNCT .
Checking Installation
from zomi_nlp import check_installation
# Check what's installed
check_installation()
# Get status as dict
status = check_installation(verbose=False)
print(status)
Troubleshooting
"stanza not installed" Warning
If you see warnings about stanza, you have two options:
- Install stanza (better accuracy):
pip install stanza
- Use spaCy instead (change your config):
config = ZomiConfig(parser_backend="spacy")
"No backend available" Error
Install at least one backend:
pip install 'zomi-nlp[full]'
Getting None Values for POS Tags
This happens when no backend is available and the library falls back to a simple tokenizer. Install spaCy or Stanza for full functionality.
Development
# Clone repository
git clone https://github.com/ZomiCommunity/zomi-nlp.git
cd zomi-nlp
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/
# Run linting
ruff check zomi_nlp/
# Format code
black zomi_nlp/ tests/
Roadmap
| Version | Features | Status |
|---|---|---|
| v0.1.0 | Core architecture + spaCy/Stanza adapters | ✅ Released |
| v0.2.0 | spaCy/Stanza backends | ✅ Released |
| v0.3.0 | ZomiRuleBasedParser | ✅ Released |
| v0.4.0 | Complete native pipeline | ✅ Current |
| v0.5.0 | Word embeddings, sense disambiguation | 🔜 Planned |
| v0.6.0 | ML-based components | 🔜 Planned |
| v1.0.0 | Production ready | 🔜 Planned |
Planned Features for v0.5.0
- ZomiWordSenseDisambiguator - Context-aware meaning disambiguation
- ZOMI_SENSE_LEXICON - Word sense inventory with examples
- StatisticalDisambiguator - Frequency-based sense prediction
- ZomiSenseTagger - Automatic sense annotation
- ZomiNominalizerDetector - Rule-based -na suffix detection with stem alternation handling (e.g., pia → piakna, um → upna)
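Since the nominalizer detector is not released yet, the following is only a rough sketch of how -na detection with stem alternation might look. Everything here is hypothetical (`ALTERNATING_STEMS`, `detect_nominalizer`), and the alternation table contains only the two examples given above.

```python
from typing import Dict, Optional, Tuple

SUFFIX = "na"

# Alternating stems, limited to the two examples in the list above:
# pia -> piak(na), um -> up(na). A real table would be much larger.
ALTERNATING_STEMS: Dict[str, str] = {
    "piak": "pia",
    "up": "um",
}


def detect_nominalizer(word: str) -> Optional[Tuple[str, str]]:
    """Return (base_stem, suffix) if the word ends in -na, else None."""
    lowered = word.lower()
    # The bare suffix on its own is not treated as a nominalized form.
    if not lowered.endswith(SUFFIX) or len(lowered) <= len(SUFFIX):
        return None
    stem = lowered[: -len(SUFFIX)]
    # Map an alternated stem back to its base form when it is known;
    # otherwise keep the stripped stem unchanged.
    return ALTERNATING_STEMS.get(stem, stem), SUFFIX
```

With this sketch, `detect_nominalizer("piakna")` maps back to the stem `pia` and `detect_nominalizer("upna")` to `um`. Note that it flags every word ending in `na`, so a practical detector would need a lexicon or POS check to filter false positives.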
Contributing
Contributions welcome! See CONTRIBUTING for guidelines.
License
Apache License 2.0
Citation
@software{zomi_nlp_2026,
title={Zomi NLP: Natural Language Processing for Zomi Language},
author={Zomi NLP Community},
year={2026},
url={https://github.com/ZomiCommunity/zomi-nlp}
}
Acknowledgments
- Built with ❤️ for the Zomi community
- Uses spaCy and Stanza as backends
- Inspired by the Universal Dependencies framework