Unsupervised morpheme discovery for Uralic languages using the IMDP algorithm

SampoNLP

Unsupervised Morpheme Discovery for Uralic Languages

Requires Python 3.8+.

SampoNLP is a high-performance library for unsupervised morpheme discovery from raw text corpora. It implements the Iterative Morpheme Decomposition with Positional Priors (IMDP) algorithm, specifically designed for morphologically rich languages such as Finnish, Estonian, and Hungarian.

The library uses a Rust-accelerated core for efficient computation, wrapped in a user-friendly Python API.

🌟 Features

  • Unsupervised Learning: No annotated data required
  • 🚀 High Performance: Rust-powered core with Python bindings via PyO3
  • 🔬 Linguistically Motivated: Incorporates positional priors for roots vs. affixes
  • 🌍 Multi-Language Support: Pre-configured for Finnish, Estonian, Hungarian, and general Uralic languages
  • 📊 Automatic Thresholding: Uses Otsu's method for intelligent morpheme filtering
  • 🔄 Iterative Refinement: Converges to stable morpheme representations

📦 Installation

From PyPI (recommended)

pip install samponlp

From source

git clone https://github.com/yourusername/samponlp.git
cd samponlp
pip install maturin
maturin develop --release
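
After either installation route, one quick way to confirm that the package is visible to your interpreter is to query its installed metadata. The sketch below uses only the standard library's `importlib.metadata` (available from Python 3.8); the helper name `installed_version` is ours for illustration and is not part of SampoNLP.

```python
from importlib import metadata


def installed_version(package="samponlp"):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("samponlp is not installed in this environment")
    else:
        print(f"samponlp {version} is installed")
```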

🚀 Quick Start

Basic Usage

from samponlp import MorphemeCleaner

# Initialize the cleaner for Estonian
cleaner = MorphemeCleaner(
    language='estonian',
    min_length=1,
    min_type_support=3,
    max_iterations=100,
    convergence_threshold=1e-7
)

# Process morphemes from a file
results = cleaner.process(
    raw_morphemes_path='data/estonian_morphemes.txt',
    output_dir='results/estonian_output'
)

print(f"Found {results.morpheme_count} atomic morphemes")
print(f"Discarded {len(results.discarded)} tokens")

Analyzing Results

# Access cleaned morphemes
for morpheme in results.morphemes[:10]:
    print(morpheme)

# Check discarded tokens with reasons
for token, reason in results.discarded[:5]:
    print(f"{token}: {reason}")

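# Examine the final score of a single morpheme
# ('ház' is a Hungarian example; use a morpheme from your own results)
print(results.final_scores['ház'])  # e.g. 0.334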

📚 Supported Languages

SampoNLP comes with pre-configured settings for:

  • 🇫🇮 Finnish (language='finnish')
  • 🇪🇪 Estonian (language='estonian')
  • 🇭🇺 Hungarian (language='hungarian')
  • 🌐 General Uralic (language='uralic')

Each language has customized:

  • Alphabet validation patterns
  • Single-character morpheme whitelists
  • Language-specific filtering rules
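
To make the idea of per-language settings concrete, here is a minimal sketch of how alphabet validation and a single-character whitelist could be combined. Everything in it is an assumption for illustration: the `LANGUAGE_RULES` table, the character classes, the whitelisted letters, and the function `is_valid_candidate` are not taken from SampoNLP's actual configuration.

```python
import re

# Hypothetical per-language rules; SampoNLP's real settings may differ.
LANGUAGE_RULES = {
    "finnish": {
        "alphabet": re.compile(r"[a-zåäöšž]+"),
        "single_char_ok": {"a", "ä", "i", "n", "t"},
    },
    "estonian": {
        "alphabet": re.compile(r"[a-zõäöüšž]+"),
        "single_char_ok": {"d", "i", "s", "t"},
    },
    "hungarian": {
        "alphabet": re.compile(r"[a-záéíóöőúüű]+"),
        "single_char_ok": {"a", "e", "i", "k", "t"},
    },
}


def is_valid_candidate(token, language):
    """Accept a token only if it fits the alphabet and single-character rules."""
    rules = LANGUAGE_RULES[language]
    token = token.lower()
    # Reject tokens containing digits, punctuation, or foreign characters.
    if not rules["alphabet"].fullmatch(token):
        return False
    # One-letter tokens are kept only when explicitly whitelisted.
    if len(token) == 1 and token not in rules["single_char_ok"]:
        return False
    return True
```

With these illustrative rules, `is_valid_candidate("ház", "hungarian")` passes the alphabet check, while a token such as `"x1"` is rejected for any language because of the digit.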

🔬 Algorithm Overview

SampoNLP implements the IMDP (Iterative Morpheme Decomposition with Positional Priors) algorithm:

  1. Initial Filtering: Removes noise based on alphabet, type-support, and heuristics
  2. Iterative Scoring: Uses dynamic programming to find optimal morpheme decompositions
  3. Positional Priors: Applies different rules for roots (can split anywhere) vs. affixes (edge-only splits)
  4. Automatic Thresholding: Employs Otsu's method to separate atomic from composite morphemes
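
The dynamic-programming step can be illustrated with a generic segmentation routine. This is a sketch under stated assumptions, not the IMDP implementation: it assumes each known candidate has a positive score, defines the best decomposition as the split into two or more known pieces with the highest product of scores, and does not model the positional priors for roots and affixes. The function name `best_decomposition` and the example scores are hypothetical.

```python
import math


def best_decomposition(token, scores):
    """Return (log_score, parts) for the best split of `token` into at least
    two pieces found in `scores`, or None if no such split exists."""
    n = len(token)
    # best[i] holds the best (log_score, parts) for the prefix token[:i].
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            piece = token[start:end]
            # Skip unreachable prefixes, unknown pieces, and the trivial
            # "split" that is just the whole token.
            if best[start] is None or piece == token or piece not in scores:
                continue
            candidate = best[start][0] + math.log(scores[piece])
            if best[end] is None or candidate > best[end][0]:
                best[end] = (candidate, best[start][1] + [piece])
    return best[n]


if __name__ == "__main__":
    # Made-up scores for illustration only.
    demo_scores = {"talo": 0.9, "ssa": 0.8, "talossa": 0.5, "t": 0.1}
    print(best_decomposition("talossa", demo_scores))
```

A token that can be explained well by smaller pieces is a candidate composite; one for which the routine returns `None` has no decomposition into known pieces at all.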

For a detailed description of the algorithm, see our paper (link coming soon).
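
Until the paper is available, the thresholding step can be pictured with a textbook version of Otsu's method applied to a one-dimensional list of scores. The sketch below is a plain-Python illustration and makes its own assumptions: the function name `otsu_threshold`, the default of 64 histogram bins, and the choice of returning a bin edge are ours, and the library's Rust core may differ in all of these.

```python
def otsu_threshold(values, bins=64):
    """Return the cut point that maximizes between-class variance."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo  # All values identical: nothing to separate.
    width = (hi - lo) / bins

    # Build a fixed-width histogram of the values.
    hist = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        hist[idx] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]

    total = len(values)
    total_sum = sum(c * h for c, h in zip(centers, hist))

    best_var, best_t = -1.0, lo
    w0, sum0 = 0, 0.0
    for i in range(bins - 1):
        # Class 0 is bins 0..i, class 1 is the rest.
        w0 += hist[i]
        sum0 += centers[i] * hist[i]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0 = sum0 / w0
        m1 = (total_sum - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var = var_between
            best_t = lo + (i + 1) * width  # Upper edge of bin i.
    return best_t
```

Given scores that fall into two clear groups, the returned value lies between the groups, so no hand-tuned cutoff is needed.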

📖 Documentation

Comprehensive documentation is available in the docs/ folder.

🛠️ Development

Building from Source

# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Clone the repository
git clone https://github.com/yourusername/samponlp.git
cd samponlp

# Build with maturin
pip install maturin
maturin develop --release

# Run tests
pytest tests/

Running the Pipeline

python run_pipeline.py

📊 Performance

On a typical corpus of 50,000 morpheme candidates:

  • Processing time: ~2-5 minutes
  • Memory usage: ~500MB
  • Convergence: Usually within 20-50 iterations
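
The iteration counts above relate to the `max_iterations` and `convergence_threshold` parameters shown in the Quick Start. As a rough illustration of how such a stopping rule typically works, the sketch below repeats an update until the largest score change drops below the threshold. The function `iterate_until_stable` and the toy update rule are hypothetical; the actual score update used by the library is not shown here.

```python
def iterate_until_stable(scores, update, max_iterations=100,
                         convergence_threshold=1e-7):
    """Apply `update` repeatedly until scores change by less than the threshold.

    Returns (final_scores, iterations_used, converged).
    """
    for iteration in range(1, max_iterations + 1):
        new_scores = update(scores)
        # Largest absolute change of any single score in this round.
        delta = max(
            (abs(new_scores[key] - scores[key]) for key in scores),
            default=0.0,
        )
        scores = new_scores
        if delta < convergence_threshold:
            return scores, iteration, True
    return scores, max_iterations, False


if __name__ == "__main__":
    # Toy update: every score moves halfway toward 0.5 each round.
    def toy_update(current):
        return {key: 0.5 * (value + 0.5) for key, value in current.items()}

    final, used, converged = iterate_until_stable({"a": 1.0, "b": 0.0}, toy_update)
    print(used, converged)
```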

📝 Citation

If you use SampoNLP in your research, please cite:

@article{samponlp2025,
  title={SampoNLP: Unsupervised Morpheme Discovery for Uralic Languages},
  author={Your Name},
  journal={Journal Name},
  year={2025}
}

📄 License

SampoNLP is released under the Apache 2.0 License.

🤝 Contributing

Contributions are welcome! Please see our Contributing Guide for details.

💖 Support

If you find SampoNLP useful, please consider:

  • ⭐ Starring the repository
  • 📢 Sharing it with colleagues
  • 💬 Providing feedback via issues
  • 🙏 Sponsoring the project

🙏 Acknowledgments

This project was inspired by morphological analysis needs in computational linguistics research for Uralic languages.

📬 Contact


Made with ❤️ for the Uralic NLP community
