Unsupervised morpheme discovery for Uralic languages using the IMDP algorithm

SampoNLP

Unsupervised Morpheme Discovery for Uralic Languages

Requires Python 3.8+.

SampoNLP is a high-performance library for unsupervised morpheme discovery from raw text corpora. It implements the Iterative Morpheme Decomposition with Positional Priors (IMDP) algorithm, specifically designed for morphologically rich languages such as Finnish, Estonian, and Hungarian.

The library uses a Rust-accelerated core for efficient computation, wrapped in a user-friendly Python API.

🌟 Features

  • Unsupervised Learning: No annotated data required
  • 🚀 High Performance: Rust-powered core with Python bindings via PyO3
  • 🔬 Linguistically Motivated: Incorporates positional priors for roots vs. affixes
  • 🌍 Multi-Language Support: Pre-configured for Finnish, Estonian, Hungarian, and general Uralic languages
  • 📊 Automatic Thresholding: Uses Otsu's method for intelligent morpheme filtering
  • 🔄 Iterative Refinement: Converges to stable morpheme representations

📦 Installation

From PyPI (recommended)

pip install samponlp

From source

git clone https://github.com/yourusername/samponlp.git
cd samponlp
pip install maturin
maturin develop --release

🚀 Quick Start

Basic Usage

from samponlp import MorphemeCleaner

# Initialize the cleaner for Estonian
cleaner = MorphemeCleaner(
    language='estonian',
    min_length=1,
    min_type_support=3,
    max_iterations=100,
    convergence_threshold=1e-7
)

# Process morphemes from a file
results = cleaner.process(
    raw_morphemes_path='data/estonian_morphemes.txt',
    output_dir='results/estonian_output'
)

print(f"Found {results.morpheme_count} atomic morphemes")
print(f"Discarded {len(results.discarded)} tokens")
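
The `max_iterations` and `convergence_threshold` arguments suggest a fixed-point stopping rule. The sketch below shows one common way such a rule works, assuming convergence is measured as the largest absolute change in any score between iterations. The function `iterate_until_stable` and the toy update step are hypothetical and are not part of the SampoNLP API.

```python
from typing import Callable, Dict, Tuple

Scores = Dict[str, float]


def iterate_until_stable(
    scores: Scores,
    update: Callable[[Scores], Scores],
    max_iterations: int = 100,
    convergence_threshold: float = 1e-7,
) -> Tuple[Scores, int]:
    """Apply `update` until scores stop changing or the iteration cap is hit.

    Assumption: "converged" means the largest absolute per-key change
    is below `convergence_threshold`.
    """
    for iteration in range(1, max_iterations + 1):
        new_scores = update(scores)
        delta = max(
            (abs(new_scores[key] - scores[key]) for key in scores),
            default=0.0,
        )
        scores = new_scores
        if delta < convergence_threshold:
            return scores, iteration
    return scores, max_iterations


def toy_update(scores: Scores) -> Scores:
    # Toy update for illustration only: move each score halfway toward 1.0.
    return {key: value + 0.5 * (1.0 - value) for key, value in scores.items()}


final_scores, iterations_used = iterate_until_stable(
    {"maja": 0.2, "d": 0.9}, toy_update
)
print(f"Stopped after {iterations_used} iterations")
```

A tighter threshold trades extra iterations for more stable scores; the iteration cap keeps a run bounded if the scores oscillate instead of settling.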

Analyzing Results

# Access cleaned morphemes
for morpheme in results.morphemes[:10]:
    print(morpheme)

# Check discarded tokens with reasons
for token, reason in results.discarded[:5]:
    print(f"{token}: {reason}")

# Examine final scores
print(results.final_scores['maja'])  # e.g. 0.334 (illustrative value for an Estonian morpheme)
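
To see why tokens were dropped, it can help to count them by reason. The sketch below assumes `results.discarded` is a sequence of `(token, reason)` pairs, as the loop above implies. The helper `summarize_discards` and the reason strings in the sample data are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def summarize_discards(discarded: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Count discarded tokens per reason, most frequent reason first."""
    counts = Counter(reason for _token, reason in discarded)
    return counts.most_common()


# Stand-in data; in practice pass results.discarded instead.
sample_discarded = [
    ("123", "alphabet"),
    ("xq", "type_support"),
    ("a1", "alphabet"),
]

for reason, count in summarize_discards(sample_discarded):
    print(f"{reason}: {count}")
```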

📚 Supported Languages

SampoNLP comes with pre-configured settings for:

  • 🇫🇮 Finnish (language='finnish')
  • 🇪🇪 Estonian (language='estonian')
  • 🇭🇺 Hungarian (language='hungarian')
  • 🌐 General Uralic (language='uralic')

Each language has customized:

  • Alphabet validation patterns
  • Single-character morpheme whitelists
  • Language-specific filtering rules
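
As a rough illustration of how alphabet patterns and single-character whitelists can work together, here is a minimal sketch. The `LANGUAGE_RULES` table, the character classes, the whitelist contents, and `is_valid_candidate` are all assumptions made for this example; the settings shipped with SampoNLP may differ.

```python
import re

# Hypothetical per-language rules, for illustration only.
LANGUAGE_RULES = {
    "estonian": {
        "alphabet": re.compile(r"[a-zšžõäöü]+"),
        "single_char_whitelist": {"a", "e", "i", "u", "d", "t", "s"},
    },
    "hungarian": {
        "alphabet": re.compile(r"[a-záéíóöőúüű]+"),
        "single_char_whitelist": {"a", "e", "k", "t"},
    },
}


def is_valid_candidate(token: str, language: str, min_length: int = 1) -> bool:
    """Return True if the token passes length, alphabet, and whitelist checks."""
    rules = LANGUAGE_RULES[language]
    token = token.lower()
    if len(token) < min_length:
        return False
    # The whole token must consist of characters from the language's alphabet.
    if rules["alphabet"].fullmatch(token) is None:
        return False
    # Single characters are kept only if explicitly whitelisted.
    if len(token) == 1 and token not in rules["single_char_whitelist"]:
        return False
    return True


print(is_valid_candidate("ház", "hungarian"))
print(is_valid_candidate("ház", "estonian"))
```

Under these assumed rules the same token can be valid for one language and rejected for another, which is why the `language` argument matters even before any scoring happens.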

🔬 Algorithm Overview

SampoNLP implements the IMDP (Iterative Morpheme Decomposition with Positional Priors) algorithm:

  1. Initial Filtering: Removes noise based on alphabet, type-support, and heuristics
  2. Iterative Scoring: Uses dynamic programming to find optimal morpheme decompositions
  3. Positional Priors: Applies different rules for roots (can split anywhere) vs. affixes (edge-only splits)
  4. Automatic Thresholding: Employs Otsu's method to separate atomic from composite morphemes
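
To make step 2 concrete, the sketch below uses dynamic programming to find the highest-scoring split of a candidate into two or more known pieces, where the score of a split is the product of its pieces' scores. This is a generic segmentation DP and not the IMDP implementation: the function `best_decomposition`, the scoring rule, and the example scores are assumptions, and the positional priors from step 3 are not modeled.

```python
import math
from typing import Dict, List, Optional, Tuple


def best_decomposition(
    word: str, scores: Dict[str, float]
) -> Optional[Tuple[float, List[str]]]:
    """Return (log_score, pieces) for the best split of `word` into >= 2 pieces.

    Returns None if no split into pieces with positive scores exists.
    """
    n = len(word)
    # best[i] holds the best (log_score, pieces) covering word[:i].
    best: List[Optional[Tuple[float, List[str]]]] = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            previous = best[start]
            piece = word[start:end]
            # Skip unreachable prefixes and the trivial one-piece "split".
            if previous is None or piece == word:
                continue
            piece_score = scores.get(piece, 0.0)
            if piece_score <= 0.0:
                continue
            candidate = previous[0] + math.log(piece_score)
            current = best[end]
            if current is None or candidate > current[0]:
                best[end] = (candidate, previous[1] + [piece])
    return best[n]


# Made-up scores for illustration.
example_scores = {"maja": 0.9, "de": 0.6, "d": 0.5, "e": 0.3, "ma": 0.2, "ja": 0.4}
result = best_decomposition("majade", example_scores)
if result is not None:
    log_score, pieces = result
    print(pieces, round(math.exp(log_score), 3))
```

Working in log space turns the product into a sum, which keeps the recurrence simple and avoids underflow on long candidates.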

For a detailed description of the algorithm, see our paper (link coming soon).
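
Until the paper is available, the thresholding step can be illustrated independently. Otsu's method picks the cut point that maximizes the between-class variance of a histogram. The sketch below is a plain-Python version applied to a list of scores; `otsu_threshold`, the bin count, and the sample scores are assumptions for this example, and the Rust core may compute its threshold differently.

```python
from typing import List, Sequence


def otsu_threshold(values: Sequence[float], bins: int = 64) -> float:
    """Return the histogram cut that maximizes between-class variance."""
    if not values:
        raise ValueError("values must not be empty")
    low, high = min(values), max(values)
    if high == low:
        return low
    width = (high - low) / bins
    histogram: List[int] = [0] * bins
    for value in values:
        index = min(int((value - low) / width), bins - 1)
        histogram[index] += 1

    centers = [low + (i + 0.5) * width for i in range(bins)]
    total_count = len(values)
    total_sum = sum(c * h for c, h in zip(centers, histogram))

    best_variance, best_cut = -1.0, low
    count_below, sum_below = 0, 0.0
    for i in range(bins - 1):
        count_below += histogram[i]
        sum_below += centers[i] * histogram[i]
        count_above = total_count - count_below
        if count_below == 0 or count_above == 0:
            continue
        mean_below = sum_below / count_below
        mean_above = (total_sum - sum_below) / count_above
        # Proportional to the between-class variance for this cut.
        variance = count_below * count_above * (mean_below - mean_above) ** 2
        if variance > best_variance:
            best_variance = variance
            best_cut = low + (i + 1) * width
    return best_cut


# Made-up bimodal scores for illustration.
sample_scores = [0.05, 0.08, 0.10, 0.12, 0.85, 0.90, 0.92, 0.95]
threshold = otsu_threshold(sample_scores)
below = [s for s in sample_scores if s < threshold]
above = [s for s in sample_scores if s >= threshold]
print(len(below), len(above))
```

Because the threshold is derived from the score distribution itself, no manually tuned cutoff is needed. The sketch makes no assumption about which side of the threshold corresponds to atomic morphemes.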

📖 Documentation

Comprehensive documentation is available in the docs/ folder.

🛠️ Development

Building from Source

# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Clone the repository
git clone https://github.com/yourusername/samponlp.git
cd samponlp

# Build with maturin
pip install maturin
maturin develop --release

# Run tests
pytest tests/

Running the Pipeline

python run_pipeline.py

📊 Performance

On a typical corpus of 50,000 morpheme candidates:

  • Processing time: ~2-5 minutes
  • Memory usage: ~500MB
  • Convergence: Usually within 20-50 iterations

📝 Citation

If you use SampoNLP in your research, please cite:

@article{samponlp2025,
  title={SampoNLP: Unsupervised Morpheme Discovery for Uralic Languages},
  author={Your Name},
  journal={Journal Name},
  year={2025}
}

📄 License

SampoNLP is released under the Apache 2.0 License.

🤝 Contributing

Contributions are welcome! Please see our Contributing Guide for details.

💖 Support

If you find SampoNLP useful, please consider:

  • ⭐ Starring the repository
  • 📢 Sharing it with colleagues
  • 💬 Providing feedback via issues
  • 🙏 Sponsoring the project

🙏 Acknowledgments

This project grew out of the need for morphological analysis tools in computational linguistics research on Uralic languages.

📬 Contact


Made with ❤️ for the Uralic NLP community
