Unsupervised morpheme discovery for Uralic languages using the IMDP algorithm
SampoNLP
Unsupervised Morpheme Discovery for Uralic Languages
SampoNLP is a high-performance library for unsupervised morpheme discovery from raw text corpora. It implements the Iterative Morpheme Decomposition with Positional Priors (IMDP) algorithm, specifically designed for morphologically rich languages such as Finnish, Estonian, and Hungarian.
The library uses a Rust-accelerated core for efficient computation, wrapped in a user-friendly Python API.
🌟 Features
- ✨ Unsupervised Learning: No annotated data required
- 🚀 High Performance: Rust-powered core with Python bindings via PyO3
- 🔬 Linguistically Motivated: Incorporates positional priors for roots vs. affixes
- 🌍 Multi-Language Support: Pre-configured for Finnish, Estonian, Hungarian, and general Uralic languages
- 📊 Automatic Thresholding: Uses Otsu's method for intelligent morpheme filtering
- 🔄 Iterative Refinement: Converges to stable morpheme representations
📦 Installation
From PyPI (recommended)
pip install samponlp
From source
git clone https://github.com/yourusername/samponlp.git
cd samponlp
pip install maturin
maturin develop --release
🚀 Quick Start
Basic Usage
from samponlp import MorphemeCleaner
# Initialize the cleaner for Estonian
cleaner = MorphemeCleaner(
    language='estonian',
    min_length=1,
    min_type_support=3,
    max_iterations=100,
    convergence_threshold=1e-7
)
# Process morphemes from a file
results = cleaner.process(
    raw_morphemes_path='data/estonian_morphemes.txt',
    output_dir='results/estonian_output'
)
print(f"Found {results.morpheme_count} atomic morphemes")
print(f"Discarded {len(results.discarded)} tokens")
Analyzing Results
# Access cleaned morphemes
for morpheme in results.morphemes[:10]:
    print(morpheme)
# Check discarded tokens with reasons
for token, reason in results.discarded[:5]:
    print(f"{token}: {reason}")
# Examine final scores ('ház' is a Hungarian example; use a morpheme present in your own data)
print(results.final_scores['ház'])  # e.g. 0.334
📚 Supported Languages
SampoNLP comes with pre-configured settings for:
- 🇫🇮 Finnish (language='finnish')
- 🇪🇪 Estonian (language='estonian')
- 🇭🇺 Hungarian (language='hungarian')
- 🌐 General Uralic (language='uralic')
Each language has customized:
- Alphabet validation patterns
- Single-character morpheme whitelists
- Language-specific filtering rules
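To make the three kinds of per-language settings more concrete, here is a minimal sketch of what alphabet validation, a single-character whitelist, and a type-support cutoff could look like. The alphabet pattern, the whitelist contents, the reason strings, and the `filter_candidates` helper are all hypothetical and are not taken from SampoNLP's actual configuration. "Type support" is assumed here to mean the number of distinct word types a candidate was observed in.

```python
import re

# Hypothetical Estonian settings for illustration only; SampoNLP's real
# per-language configuration may differ.
ALPHABET = re.compile(r"[a-zõäöüšž]+")
SINGLE_CHAR_WHITELIST = {"a", "e", "i", "u", "d", "t", "s"}


def filter_candidates(type_support, min_length=1, min_type_support=3):
    """Split candidates into kept morphemes and (token, reason) discards.

    type_support maps each candidate to the number of distinct word types
    it was observed in (an assumed definition of "type support").
    """
    kept, discarded = [], []
    for token, support in type_support.items():
        if not ALPHABET.fullmatch(token):
            discarded.append((token, "invalid characters"))
        elif len(token) < min_length:
            discarded.append((token, "too short"))
        elif len(token) == 1 and token not in SINGLE_CHAR_WHITELIST:
            discarded.append((token, "single character not whitelisted"))
        elif support < min_type_support:
            discarded.append((token, "low type support"))
        else:
            kept.append(token)
    return kept, discarded


candidates = {"maja": 12, "de": 40, "x": 9, "q2": 5, "lõpp": 1, "s": 7}
kept, discarded = filter_candidates(candidates)
print(kept)
for token, reason in discarded:
    print(f"{token}: {reason}")
```

In this toy input, "maja", "de", and "s" are kept, while "x", "q2", and "lõpp" are discarded for the whitelist, alphabet, and type-support checks respectively.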
🔬 Algorithm Overview
SampoNLP implements the IMDP (Iterative Morpheme Decomposition with Positional Priors) algorithm:
- Initial Filtering: Removes noise based on alphabet, type-support, and heuristics
- Iterative Scoring: Uses dynamic programming to find optimal morpheme decompositions
- Positional Priors: Applies different rules for roots (can split anywhere) vs. affixes (edge-only splits)
- Automatic Thresholding: Employs Otsu's method to separate atomic from composite morphemes
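The dynamic-programming step listed above can be illustrated with a generic segmentation routine. This is a sketch under stated assumptions, not IMDP's actual scoring: it assumes each known morpheme has a score in (0, 1] and rates a decomposition by the sum of the log-scores of its parts. The `best_decomposition` function and its `max_parts` argument are hypothetical; `max_parts` is only a crude stand-in for a positional restriction, since how IMDP constrains affix splits is not specified here.

```python
import math


def best_decomposition(token, scores, max_parts=None):
    """Return (parts, log_score) for the best split of token into two or
    more known pieces, or None if no such split exists.

    scores maps known morphemes to a score in (0, 1]; a decomposition is
    rated by the sum of the log-scores of its parts (an assumption).
    """
    n = len(token)
    limit = n if max_parts is None else min(max_parts, n)
    # best[k][i] holds the best (log_score, parts) that covers token[:i]
    # using exactly k pieces, or None if that is impossible.
    best = [[None] * (n + 1) for _ in range(limit + 1)]
    best[0][0] = (0.0, [])
    for k in range(1, limit + 1):
        for end in range(1, n + 1):
            for start in range(end):
                prev = best[k - 1][start]
                piece = token[start:end]
                if prev is None or piece not in scores:
                    continue
                cand = (prev[0] + math.log(scores[piece]), prev[1] + [piece])
                if best[k][end] is None or cand[0] > best[k][end][0]:
                    best[k][end] = cand
    # Ignore the trivial one-piece "split" (k == 1).
    options = [best[k][n] for k in range(2, limit + 1) if best[k][n] is not None]
    if not options:
        return None
    log_score, parts = max(options, key=lambda option: option[0])
    return parts, log_score


# Toy scores, invented for illustration.
scores = {"maja": 0.9, "de": 0.8, "s": 0.7, "ma": 0.2, "ja": 0.3, "des": 0.1}
print(best_decomposition("majades", scores))
print(best_decomposition("majades", scores, max_parts=2))
```

With these toy scores, the unrestricted call prefers the three-part split maja + de + s, while limiting the split to two parts yields maja + des.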
For a detailed description of the algorithm, see our paper (link coming soon).
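Otsu's method, used in the thresholding step, picks the cut point that maximizes the between-class variance of a one-dimensional distribution. The following is a minimal pure-Python sketch of the standard technique applied to a list of scores. The `otsu_threshold` function, the bin count, and the sample scores are illustrative assumptions, and the sketch does not assume which side of the threshold SampoNLP treats as atomic.

```python
def otsu_threshold(values, bins=64):
    """Return the cut that maximizes between-class variance (Otsu's method)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo
    width = (hi - lo) / bins
    # Build a fixed-width histogram of the values.
    hist = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        hist[idx] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    total = len(values)
    total_sum = sum(c * h for c, h in zip(centers, hist))
    best_var, best_cut = -1.0, lo
    weight_low, sum_low = 0, 0.0
    for i in range(bins - 1):
        weight_low += hist[i]
        sum_low += centers[i] * hist[i]
        weight_high = total - weight_low
        if weight_low == 0 or weight_high == 0:
            continue
        mean_low = sum_low / weight_low
        mean_high = (total_sum - sum_low) / weight_high
        # Proportional to the between-class variance for this cut.
        between = weight_low * weight_high * (mean_low - mean_high) ** 2
        if between > best_var:
            best_var = between
            best_cut = lo + (i + 1) * width  # upper edge of bin i
    return best_cut


# Invented, clearly bimodal scores for illustration.
sample_scores = [0.05, 0.08, 0.10, 0.12, 0.15, 0.80, 0.85, 0.90, 0.95]
threshold = otsu_threshold(sample_scores)
low_group = [s for s in sample_scores if s < threshold]
high_group = [s for s in sample_scores if s >= threshold]
print(threshold, low_group, high_group)
```

For this sample the threshold falls in the gap between 0.15 and 0.80, so the two clusters are separated without a hand-tuned cutoff.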
📖 Documentation
Comprehensive documentation is available in the docs/ folder:
- Usage Guide - Detailed examples and API reference
- Algorithm Details - Mathematical formulation
- Contributing Guide - How to contribute
🛠️ Development
Building from Source
# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Clone the repository
git clone https://github.com/yourusername/samponlp.git
cd samponlp
# Build with maturin
pip install maturin
maturin develop --release
# Run tests
pytest tests/
Running the Pipeline
python run_pipeline.py
📊 Performance
On a typical corpus of 50,000 morpheme candidates:
- Processing time: ~2-5 minutes
- Memory usage: ~500MB
- Convergence: Usually within 20-50 iterations
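To show how an iteration cap and a convergence threshold (such as the `max_iterations` and `convergence_threshold` arguments in the Quick Start) can interact, here is a generic fixed-point loop. It assumes convergence is measured as the largest absolute per-morpheme score change between iterations, which may not match SampoNLP's internal criterion. The `iterate_until_stable` helper and the toy update rule are hypothetical.

```python
def iterate_until_stable(scores, update, max_iterations=100,
                         convergence_threshold=1e-7):
    """Apply update repeatedly until the largest score change is tiny.

    Returns (scores, iterations_used, converged).
    """
    for iteration in range(1, max_iterations + 1):
        new_scores = update(scores)
        max_change = max(abs(new_scores[m] - scores[m]) for m in scores)
        scores = new_scores
        if max_change < convergence_threshold:
            return scores, iteration, True
    return scores, max_iterations, False


# Toy update rule: move every score halfway toward a fixed target.
targets = {"maja": 0.2, "de": 0.6}


def toy_update(scores):
    return {m: 0.5 * s + 0.5 * targets[m] for m, s in scores.items()}


final, used, converged = iterate_until_stable({"maja": 1.0, "de": 1.0}, toy_update)
print(used, converged, final)
```

Because the toy rule halves the remaining error on every pass, it stops well before the iteration cap; a real scoring update may converge faster or slower.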
📝 Citation
If you use SampoNLP in your research, please cite:
@article{samponlp2025,
title={SampoNLP: Unsupervised Morpheme Discovery for Uralic Languages},
author={Your Name},
journal={Journal Name},
year={2025}
}
📄 License
SampoNLP is released under the Apache 2.0 License.
🤝 Contributing
Contributions are welcome! Please see our Contributing Guide for details.
💖 Support
If you find SampoNLP useful, please consider:
- ⭐ Starring the repository
- 📢 Sharing it with colleagues
- 💬 Providing feedback via issues
- 🙏 Sponsoring the project
🙏 Acknowledgments
This project grew out of the need for morphological analysis tools in computational linguistics research on Uralic languages.
📬 Contact
- Issues: GitHub Issues
- Email: your.email@example.com
- Website: your-website.com
Made with ❤️ for the Uralic NLP community
File details
Details for the file samponlp-0.3.0-cp313-cp313-macosx_11_0_arm64.whl.
File metadata
- Download URL: samponlp-0.3.0-cp313-cp313-macosx_11_0_arm64.whl
- Upload date:
- Size: 233.1 kB
- Tags: CPython 3.13, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b0611f9f6cd838504a1843d92e32b568a4cbc12f5149702980d6f94dffa65cc0 |
| MD5 | 3ac2deaf827ac42d5a1378827de16d04 |
| BLAKE2b-256 | b2abc6bb5bb5d0c20336994d40e65b0dd01ad0e511e34ae289a87066248ce524 |