Meitei Senter
A lightweight sentence boundary detector for Meitei Mayek (Manipuri) text.
Features
- 🚀 Lightweight - Only ~1MB model, minimal dependencies
- 🎯 Accurate - 94.7% F-Score on Meitei text
- 🔧 Easy to use - Simple Python API and CLI
- ⚡ Fast - Optimized for quick inference
Installation
```shell
pip install meitei-senter
```
Optional: spaCy Backend (for higher accuracy)
```shell
pip install meitei-senter[spacy]
```
Quick Start
Python API
```python
from meitei_senter import MeiteiSentenceSplitter

# Initialize the splitter
splitter = MeiteiSentenceSplitter()

# Split text into sentences
text = "ꯆꯦꯔꯣꯀꯤ ꯑꯁꯤ ꯑꯣꯀ꯭ꯂꯥꯍꯣꯃꯥꯒꯤ ꯁꯍꯔꯅꯤ ꯫ ꯃꯁꯤ ꯌꯥꯝꯅ ꯆꯥꯎꯏ ꯫"
sentences = splitter.split_sentences(text)

for i, sent in enumerate(sentences, 1):
    print(f"{i}. {sent}")
```
Output:
```text
1. ꯆꯦꯔꯣꯀꯤ ꯑꯁꯤ ꯑꯣꯀ꯭ꯂꯥꯍꯣꯃꯥꯒꯤ ꯁꯍꯔꯅꯤ ꯫
2. ꯃꯁꯤ ꯌꯥꯝꯅ ꯆꯥꯎꯏ ꯫
```
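Conceptually, the default (non-neural) backend splits on the Meitei Mayek full stop, the cheikhei (꯫, U+ABEB). As a rough illustration only — a simplification, not the package's actual implementation — the idea can be sketched in a few lines of pure Python:

```python
import re

CHEIKHEI = "\uABEB"  # ꯫, the Meitei Mayek full stop (cheikhei)

def naive_split(text: str) -> list[str]:
    """Split after each cheikhei, keeping the mark attached to its sentence."""
    parts = re.split(rf"(?<={CHEIKHEI})\s*", text)
    return [p.strip() for p in parts if p.strip()]
```

The real splitter handles normalization and edge cases beyond this; treat the sketch only as a mental model of the delimiter backend.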
Command Line
```shell
# Interactive mode
meitei-senter --interactive

# Direct text input
meitei-senter --text "ꯆꯦꯔꯣꯀꯤ ꯑꯁꯤ ꯑꯣꯀ꯭ꯂꯥꯍꯣꯃꯥꯒꯤ ꯁꯍꯔꯅꯤ ꯫ ꯃꯁꯤ ꯌꯥꯝꯅ ꯆꯥꯎꯏ ꯫"

# Show version
meitei-senter --version
```
Advanced Usage
Using the Convenient Loader
```python
from meitei_senter import load_splitter

# Load with the default (delimiter-based) backend
splitter = load_splitter()

# Or with the spaCy backend (requires the spacy extra)
splitter = load_splitter(use_spacy=True)

sentences = splitter.split_sentences("Your Meitei text here ꯫")
```
Using Neural Network Mode
```python
from meitei_senter import MeiteiSentenceSplitter

# Enable neural mode for context-aware splitting
splitter = MeiteiSentenceSplitter(use_neural=True)
sentences = splitter.split_sentences(text)
```
Direct Callable Interface
```python
from meitei_senter import MeiteiSentenceSplitter

splitter = MeiteiSentenceSplitter()

# Call the splitter directly
sentences = splitter("ꯆꯦꯔꯣꯀꯤ ꯑꯁꯤ... ꯫ ꯃꯁꯤ ꯌꯥꯝꯅ ꯆꯥꯎꯏ ꯫")
```
With spaCy (Custom Tokenizer)
```python
import os

import spacy

from meitei_senter import MeiteiTokenizer, get_model_path

# Get the path to the bundled SentencePiece model
model_path = os.path.join(get_model_path(), 'meitei_tokenizer.model')

# Create a blank spaCy pipeline with the custom tokenizer
nlp = spacy.blank("xx")
nlp.tokenizer = MeiteiTokenizer(model_path, nlp.vocab)

doc = nlp("ꯆꯦꯔꯣꯀꯤ ꯑꯁꯤ ꯑꯣꯀ꯭ꯂꯥꯍꯣꯃꯥꯒꯤ ꯁꯍꯔꯅꯤ ꯫")
print([token.text for token in doc])
# Output: ['ꯆꯦ', 'ꯔꯣ', 'ꯀꯤ', 'ꯑꯁꯤ', 'ꯑꯣꯀ꯭ꯂꯥꯍꯣꯃꯥ', 'ꯒꯤ', 'ꯁꯍꯔ', 'ꯅꯤ', '꯫']
```
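The custom tokenizer only tokenizes; a spaCy pipeline still needs a component to mark sentence boundaries. One option — an illustration using plain spaCy, not part of this package — is the built-in sentencizer configured for the cheikhei:

```python
import spacy

# Blank multilingual pipeline; swap in MeiteiTokenizer as shown above if desired
nlp = spacy.blank("xx")
nlp.add_pipe("sentencizer", config={"punct_chars": ["꯫"]})

doc = nlp("ꯆꯦꯔꯣꯀꯤ ꯑꯁꯤ ꯑꯣꯀ꯭ꯂꯥꯍꯣꯃꯥꯒꯤ ꯁꯍꯔꯅꯤ ꯫ ꯃꯁꯤ ꯌꯥꯝꯅ ꯆꯥꯎꯏ ꯫")
print([sent.text for sent in doc.sents])
```

This relies on spaCy's default whitespace-driven tokenization; with the bundled MeiteiTokenizer installed as above, the sentencizer works on its tokens instead.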
🌐 REST API Server
Start the FastAPI server for HTTP-based sentence splitting:
Start Server
```shell
# Using the CLI
meitei-senter-server --port 8000

# Or with uvicorn directly
uvicorn meitei_senter.server:app --host 0.0.0.0 --port 8000

# With auto-reload for development
meitei-senter-server --port 8000 --reload
```
API Endpoints (all POST except /docs)
| Endpoint | Method | Description |
|---|---|---|
| `/` | POST | API info |
| `/health` | POST | Health check |
| `/split` | POST | Split text into sentences |
| `/tokenize` | POST | Tokenize text |
| `/docs` | GET | Swagger UI |
Example Requests
POST /split
```shell
curl -X POST "http://localhost:8000/split" \
  -H "Content-Type: application/json" \
  -d '{"text": "ꯆꯦꯔꯣꯀꯤ ꯑꯁꯤ ꯑꯣꯀ꯭ꯂꯥꯍꯣꯃꯥꯒꯤ ꯁꯍꯔꯅꯤ ꯫ ꯃꯁꯤ ꯌꯥꯝꯅ ꯆꯥꯎꯏ ꯫"}'
```
Response:
```json
{
  "sentences": ["ꯆꯦꯔꯣꯀꯤ ꯑꯁꯤ ꯑꯣꯀ꯭ꯂꯥꯍꯣꯃꯥꯒꯤ ꯁꯍꯔꯅꯤ꯫", "ꯃꯁꯤ ꯌꯥꯝꯅ ꯆꯥꯎꯏ꯫"],
  "count": 2
}
```
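From Python, the `/split` endpoint can be called with nothing but the standard library. The helper below is a sketch: the URL is the default from the examples above, and a running server is assumed.

```python
import json
import urllib.request

def build_split_request(text: str,
                        url: str = "http://localhost:8000/split"):
    """Build a POST request matching the /split endpoint's JSON contract."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )

def split_remote(text: str,
                 url: str = "http://localhost:8000/split") -> list[str]:
    """POST text to a running meitei-senter server; return its sentences."""
    with urllib.request.urlopen(build_split_request(text, url)) as resp:
        return json.loads(resp.read().decode("utf-8"))["sentences"]
```

For production use you would add timeouts and error handling; this only shows the request/response shape.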
POST /tokenize
```shell
curl -X POST "http://localhost:8000/tokenize" \
  -H "Content-Type: application/json" \
  -d '{"text": "ꯆꯦꯔꯣꯀꯤ ꯑꯁꯤ"}'
```
Response:
```json
{
  "tokens": ["▁ꯆꯦ", "ꯔꯣ", "ꯀꯤ", "▁ꯑꯁꯤ"],
  "token_ids": [460, 390, 42, 3],
  "count": 4
}
```
POST /health
```shell
curl -X POST "http://localhost:8000/health"
```
Response:
```json
{"status": "ok", "version": "1.1.0", "model_loaded": true}
```
📊 Model Details
| Feature | Specification |
|---|---|
| Model Size | ~1 MB |
| Tokenizer | SentencePiece (Unigram, 8K vocab) |
| Architecture | CNN (HashEmbedCNN) |
| F-Score | 94.71% |
| Precision | 93.94% |
| Recall | 95.49% |
📂 Repository Structure
```text
mni_tokenizer/
├── meitei_senter/              # Main package
│   ├── __init__.py             # Package exports
│   ├── cli.py                  # Command-line interface
│   ├── model.py                # PyTorch model & splitter
│   ├── tokenizer.py            # spaCy tokenizer
│   ├── meitei_tokenizer.model  # SentencePiece model
│   ├── meitei_senter.pth       # PyTorch weights
│   └── meitei_senter.json      # Model config
├── pyproject.toml              # Build configuration
└── README.md                   # This file
```
API Reference
MeiteiSentenceSplitter
Main class for sentence splitting.
```python
MeiteiSentenceSplitter(
    pth_path: str = None,     # Path to PyTorch model
    spm_path: str = None,     # Path to SentencePiece model
    config_path: str = None,  # Path to config JSON
    use_neural: bool = False  # Enable neural network mode
)
```
Methods:
| Method | Description |
|---|---|
| `split_sentences(text)` | Split text into a list of sentences |
| `tokenize(text)` | Tokenize text into pieces and IDs |
| `__call__(text)` | Direct callable interface |
MeiteiTokenizer
spaCy-compatible tokenizer using SentencePiece.
```python
MeiteiTokenizer(model_path: str, vocab: spacy.Vocab)
```
load_splitter
Convenience function to load a pre-configured splitter.
```python
load_splitter(use_spacy: bool = False)
```
🔧 Development
```shell
# Clone the repository
git clone https://github.com/Okramjimmy/mni_tokenizer.git
cd mni_tokenizer

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Build the package
python -m build

# Upload to PyPI
twine upload dist/*
```
📜 License
MIT License - see LICENSE for details.
📚 Citation
If you use this in your research, please cite:
```bibtex
@software{meitei_senter,
  author = {Okram Jimmy},
  title  = {Meitei Senter: Sentence Boundary Detection for Meitei Mayek},
  year   = {2024},
  url    = {https://github.com/Okramjimmy/mni_tokenizer}
}
```
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
📧 Contact
- Author: Okram Jimmy
- Email: okramjimmy@gmail.com
- GitHub: @Okramjimmy
File details
Details for the file meitei_senter-1.1.2.tar.gz.
File metadata
- Download URL: meitei_senter-1.1.2.tar.gz
- Size: 1.2 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `08dc6d1dc69b565a72f072c0948362b075aeb6868eb18b4a330c4cd982948e04` |
| MD5 | `9550ecdc719e4350f18ba8629b726026` |
| BLAKE2b-256 | `a5b16a9b80ce4466c5d2cb693e6666efd868dd94ae4c8abee4faed10510f99e0` |
File details
Details for the file meitei_senter-1.1.2-py3-none-any.whl.
File metadata
- Download URL: meitei_senter-1.1.2-py3-none-any.whl
- Size: 1.2 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `21134126e876e7c8ea46b73f4f461ce61852b31109cbc805ff80ed0874a0dc21` |
| MD5 | `adfa94d1e0077613e1de4c8bc8b66a29` |
| BLAKE2b-256 | `91e98c36eb7baffcd17a82cdae775af1d6bd3a81bb0c3ddc77f8237702a4eed3` |