marathi-shabda

Deterministic, offline Marathi word analysis library (shabda = word in Marathi)


What is marathi-shabda?

marathi-shabda is a production-quality Python library for analyzing Marathi words. It provides:

  1. Lemma (stem) extraction from inflected Marathi words
  2. Dictionary lookup (Marathi ↔ English) with meanings
  3. Morphological analysis (रूप परिचय) including POS, vibhakti, and kāl detection

Why "pratham" (प्रथम)?

Pratham means "first" in Marathi. This library provides the first step in Marathi text analysis: understanding individual words before tackling sentences or documents.


Motivation

Tooling for Marathi lags behind what is available for other Indian languages. Existing solutions either:

  • Require network access (API-based)
  • Hallucinate meanings (LLM-based)
  • Lack linguistic grounding (pure ML)

marathi-shabda is different:

  • Offline-first: No network, no API keys
  • Dictionary-backed: Authoritative meanings, no hallucinations
  • Explainable: Shows reasoning for every decision
  • Honest about limitations: Surfaces ambiguity instead of hiding it

What It Does

✅ Supported Features

  • Lemma extraction: पाण्यावर → पाणी (water)
  • Vibhakti detection: Identifies case markers (तृतीया, सप्तमी, संबंध, etc.)
  • Dictionary lookup: Marathi → English meanings
  • POS tagging: Conservative noun/verb/adjective classification
  • Kāl inference: Basic tense detection for verbs
  • Roman input: Accepts romanized Marathi (e.g., pani → पाणी)
  • Stem alternations: Handles oblique forms (पाण्य → पाणी)
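
Before Roman input can be converted, a word has to be recognized as Roman rather than Devanagari. The sketch below shows one way that check could be made, using the Unicode Devanagari block (U+0900 to U+097F). The function names are hypothetical and are not part of the marathi-shabda API; the library's own normalization may work differently.

```python
def looks_devanagari(word: str) -> bool:
    """Return True if every character lies in the Devanagari block (U+0900 to U+097F)."""
    return bool(word) and all("\u0900" <= ch <= "\u097f" for ch in word)


def looks_roman(word: str) -> bool:
    """Return True if the word consists only of ASCII letters."""
    return bool(word) and word.isascii() and word.isalpha()


print(looks_devanagari("पाणी"))  # True
print(looks_roman("pani"))       # True
```

A word that passes neither check (mixed scripts, digits, punctuation) would need its own handling, which this sketch leaves open.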

❌ Explicit Non-Goals

This library does NOT:

  • Parse sentences or multi-word phrases
  • Claim grammatical correctness in all contexts
  • Infer semantics beyond dictionary meanings
  • Require network access
  • Use machine learning (as of v0.1.0)

Installation

pip install marathi-shabda

Requirements: Python 3.8+, no external dependencies


Quick Start

1. Lemma Extraction

from marathi_shabda import get_lemma

result = get_lemma("पाण्यावर")
print(result.lemma)              # पाणी
print(result.confidence)         # 0.9
print(result.detected_vibhakti)  # VibhaktiType.SAPTAMI (सप्तमी)
print(result.explanation)        # "Detected सप्तमी vibhakti"

2. Dictionary Lookup

from marathi_shabda import lookup_word

result = lookup_word("पाणी")
print(result.english_meanings)   # ['water']
print(result.found)              # True

# Also works with Roman input
result = lookup_word("pani")
print(result.lemma)              # पाणी

3. Morphological Analysis

from marathi_shabda import analyze_word

result = analyze_word("मुलाने")
print(result.lemma)      # मुल
print(result.pos)        # POSTag.NOUN
print(result.vibhakti)   # VibhaktiType.TRUTIYA (तृतीया)
print(result.confidence) # 0.9
print(result.explanation)
# "Detected तृतीया vibhakti; Inferred noun"

How It Works

Architecture

Input Word
   ↓
Normalization (Roman → Devanagari)
   ↓
Dictionary Check (exact match?)
   ↓
Vibhakti Detection (longest-first)
   ↓
Stem Alternations (पाण्य → पाणी)
   ↓
Dictionary Validation (lemma exists?)
   ↓
POS & Kāl Inference
   ↓
Result with Confidence
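
The flow above can be illustrated with a small, self-contained sketch. It is not the library's implementation: the dictionary, suffix table, alternation rules, and every name prefixed with toy_ or TOY_ are invented for illustration, and the normalization and POS/kāl steps are left out. Only the confidence values (1.0, 0.9, 0.0) follow the tiers documented in this README.

```python
from dataclasses import dataclass
from typing import Optional

# Toy data, invented for this sketch; the real library ships a SQLite dictionary.
TOY_DICTIONARY = {"पाणी": ["water"], "घर": ["house"]}
# (suffix, vibhakti label) pairs; a tiny hypothetical subset.
TOY_SUFFIXES = [("वर", "सप्तमी"), ("ात", "सप्तमी"), ("ने", "तृतीया")]
# (oblique stem ending, dictionary form ending); both endings begin with a virama.
TOY_ALTERNATIONS = [("्या", "ी"), ("्य", "ी")]


@dataclass
class ToyResult:
    lemma: Optional[str]
    confidence: float
    vibhakti: Optional[str]
    explanation: str


def toy_get_lemma(word: str) -> ToyResult:
    # Dictionary check: an exact match needs no further analysis.
    if word in TOY_DICTIONARY:
        return ToyResult(word, 1.0, None, "Exact dictionary match")

    # Vibhakti detection: try longer suffixes before shorter ones.
    by_length = sorted(TOY_SUFFIXES, key=lambda pair: len(pair[0]), reverse=True)
    for suffix, label in by_length:
        if not word.endswith(suffix) or len(word) == len(suffix):
            continue
        stem = word[: -len(suffix)]

        # Stem alternations: rules only propose candidate lemmas.
        candidates = [stem]
        for oblique, direct in TOY_ALTERNATIONS:
            if stem.endswith(oblique):
                candidates.append(stem[: -len(oblique)] + direct)

        # Dictionary validation: a candidate counts only if it is a known lemma.
        for candidate in candidates:
            if candidate in TOY_DICTIONARY:
                return ToyResult(candidate, 0.9, label, f"Detected {label} vibhakti")

    return ToyResult(None, 0.0, None, "Word not in dictionary")


result = toy_get_lemma("पाण्यावर")
print(result.lemma, result.confidence, result.explanation)
```

The point of the structure is that the suffix and alternation rules never decide anything on their own; a result is returned only when the dictionary confirms a candidate.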

Key Principles

  1. Dictionary-first validation: Rules generate candidates, dictionary decides truth
  2. Longest-match-first: Tries longer suffixes first, so मध्ये is detected before the shorter ये it ends with
  3. Conservative inference: Returns UNKNOWN when uncertain
  4. Explainable decisions: Every result includes reasoning
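
To see why the longest-match-first principle matters, the sketch below compares a naive scan in list order with a longest-first scan. Both functions are hypothetical helpers written for this example, not library functions, and the two-item suffix list is only a demonstration.

```python
from typing import List, Optional


def detect_suffix_naive(word: str, suffixes: List[str]) -> Optional[str]:
    """Return the first suffix, in list order, that the word ends with."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            return suffix
    return None


def detect_suffix_longest_first(word: str, suffixes: List[str]) -> Optional[str]:
    """Return the longest suffix that the word ends with."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return suffix
    return None


suffixes = ["ये", "मध्ये"]
word = "घरामध्ये"
print(detect_suffix_naive(word, suffixes))          # ये
print(detect_suffix_longest_first(word, suffixes))  # मध्ये
```

Because मध्ये itself ends in ये, the naive scan strips too little and leaves a stem that no dictionary lookup could validate.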

Confidence & Ambiguity

Confidence Scores

  • 1.0: Exact dictionary match
  • 0.9: Vibhakti detected, lemma validated
  • 0.7: Ambiguous (multiple possible lemmas)
  • 0.0: Word not in dictionary
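
The tiers above can be read as a simple decision function. The sketch below is an assumption about how such a mapping might look; the function name and its two inputs (an exact-match flag and the list of dictionary-validated lemmas) are hypothetical and do not describe the library's internals.

```python
from typing import List


def toy_confidence(exact_match: bool, validated_lemmas: List[str]) -> float:
    """Map an analysis outcome to one of the documented confidence tiers."""
    if exact_match:
        return 1.0  # exact dictionary match
    if len(validated_lemmas) == 1:
        return 0.9  # vibhakti detected, single lemma validated
    if len(validated_lemmas) > 1:
        return 0.7  # ambiguous: several lemmas validated
    return 0.0      # nothing validated: word not in dictionary


print(toy_confidence(False, ["पाणी"]))        # 0.9
print(toy_confidence(False, ["घर", "घरात"]))  # 0.7
```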

Handling Ambiguity

result = get_lemma("घरात")
if result.ambiguous:
    print(f"Multiple interpretations: {result.candidates}")
    # ['घर', 'घरात']  # Could be noun or compound

Philosophy: We surface ambiguity instead of making false claims.
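
Calling code can follow the same philosophy by refusing to pick a lemma when a result is ambiguous or below a chosen confidence threshold. The helper below is a hypothetical sketch: it relies only on the lemma, confidence, and ambiguous attributes shown in this README's examples, and the SimpleNamespace objects are stand-ins for real results so the snippet runs without the package installed.

```python
from types import SimpleNamespace
from typing import Optional


def choose_lemma(result, min_confidence: float = 0.9) -> Optional[str]:
    """Return the lemma only when the result is unambiguous and confident enough."""
    if getattr(result, "ambiguous", False):
        return None  # defer: more than one interpretation
    if result.confidence < min_confidence:
        return None  # defer: not confident enough
    return result.lemma


# Stand-in results, shaped like the objects in the examples above.
clear = SimpleNamespace(lemma="पाणी", confidence=0.9, ambiguous=False)
unsure = SimpleNamespace(lemma="घर", confidence=0.7, ambiguous=True)

print(choose_lemma(clear))   # पाणी
print(choose_lemma(unsure))  # None
```

The default threshold of 0.9 is an arbitrary choice for the example; an application that tolerates more uncertainty could lower it.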


Offline Guarantee

marathi-shabda works completely offline:

  • ✅ No network requests
  • ✅ No API keys
  • ✅ No telemetry
  • ✅ Bundled SQLite database
  • ✅ Pure Python (stdlib only)
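
A bundled SQLite file can be queried with nothing but the standard library, which is what makes a stdlib-only, offline design workable. The sketch below uses an in-memory database to show the idea. The table name, column names, and sample rows are hypothetical; this README does not describe the schema of the bundled database.

```python
import sqlite3
from typing import List


def build_toy_db() -> sqlite3.Connection:
    """Create an in-memory dictionary table with a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entries (marathi TEXT NOT NULL, english TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO entries (marathi, english) VALUES (?, ?)",
        [("पाणी", "water"), ("घर", "house"), ("घर", "home")],
    )
    conn.commit()
    return conn


def toy_lookup(conn: sqlite3.Connection, word: str) -> List[str]:
    """Return all English meanings stored for a Marathi word, in insertion order."""
    rows = conn.execute(
        "SELECT english FROM entries WHERE marathi = ? ORDER BY rowid",
        (word,),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_toy_db()
print(toy_lookup(conn, "पाणी"))  # ['water']
print(toy_lookup(conn, "झाड"))   # []
```

An empty list for an unknown word mirrors the library's stance of reporting "not found" rather than guessing a meaning.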

Perfect for:

  • Privacy-sensitive applications
  • Offline environments
  • Embedded systems
  • Research reproducibility

Limitations

Current Limitations (v0.1.0)

  • Single words only: No sentence parsing
  • Conservative POS tagging: Limited to obvious cases
  • Basic kāl detection: Only common verb patterns
  • No semantic analysis: Dictionary meanings only
  • Limited verb conjugation: Focus on nouns/vibhakti

Known Edge Cases

  • Compound words may not split correctly
  • Rare vibhaktis may not be detected
  • Ambiguous forms return multiple candidates
  • Roman transliteration is approximate

We document limitations honestly. If you encounter issues, please report them!


Future Roadmap

v0.2.0 (Planned)

  • Extended database schema (POS, gender, number)
  • Improved verb conjugation analysis
  • Compound word splitting
  • Performance optimizations

v0.3.0 (Planned)

  • Optional SLM integration for ambiguity resolution
  • Sentence-level analysis (experimental)
  • Batch processing API

Long-term

  • Hybrid rule-based + ML approach
  • Community-contributed dictionary expansions
  • Web API (optional deployment)

Command-Line Interface

# Extract lemma
marathi-shabda lemma पाण्यावर

# Dictionary lookup
marathi-shabda lookup पाणी

# Full analysis
marathi-shabda analyze मुलाने

Contributing

We welcome contributions! See CONTRIBUTING.md for:

  • How to add vibhakti rules
  • How to improve transliteration
  • Code style guidelines
  • Testing requirements

License

MIT License - see LICENSE for details


Acknowledgments

  • Marathi language scholars and grammarians
  • Open-source NLP community
  • Contributors and testers

Citation

If you use marathi-shabda in research, please cite:

@software{marathi_shabda,
  title = {marathi-shabda: Deterministic Marathi Word Analysis},
  author = {Marathi Pratham Contributors},
  year = {2026},
  url = {https://github.com/yourusername/marathi-shabda}
}


Philosophy: When unsure, defer. When confident, explain why.

Built with respect for the Marathi language and its speakers. 🙏
