Deterministic, offline Marathi word analysis library (shabda = word in Marathi)

marathi-shabda

PyPI version Python 3.8+ License: MIT


What is marathi-shabda?

marathi-shabda is a production-quality Python library for analyzing Marathi words. It provides:

  1. Lemma (stem) extraction from inflected Marathi words
  2. Dictionary lookup (Marathi ↔ English) with meanings
  3. Morphological analysis (रूप परिचय) including POS, vibhakti, and kāl detection

Why "pratham" (प्रथम)?

Pratham means "first" in Marathi. This library provides the first step in Marathi text analysis: understanding individual words before tackling sentences or documents.


Motivation

Tooling for Marathi lags behind what is available for other Indian languages. Existing solutions tend to have at least one of these drawbacks:

  • Require network access (API-based)
  • Hallucinate meanings (LLM-based)
  • Lack linguistic grounding (pure ML)

marathi-shabda is different:

  • Offline-first: No network, no API keys
  • Dictionary-backed: Authoritative meanings, no hallucinations
  • Explainable: Shows reasoning for every decision
  • Honest about limitations: Surfaces ambiguity instead of hiding it

What It Does

✅ Supported Features

  • Lemma extraction: पाण्यावर → पाणी (water)
  • Vibhakti detection: Identifies case markers (तृतीया, सप्तमी, संबंध, etc.)
  • Dictionary lookup: Marathi → English meanings
  • POS tagging: Conservative noun/verb/adjective classification
  • Kāl inference: Basic tense detection for verbs
  • Roman input: Accepts romanized Marathi (e.g., pani → पाणी)
  • Stem alternations: Handles oblique forms (पाण्य → पाणी)

❌ Explicit Non-Goals

This library does NOT:

  • Parse sentences or multi-word phrases
  • Claim grammatical correctness in all contexts
  • Infer semantics beyond dictionary meanings
  • Require network access
  • Use machine learning (v0.1.0)

Installation

pip install marathi-shabda

Requirements: Python 3.8+, no external dependencies


Quick Start

1. Lemma Extraction

from marathi_shabda import get_lemma

result = get_lemma("पाण्यावर")
print(result.lemma)              # पाणी
print(result.confidence)         # 0.9
print(result.detected_vibhakti)  # VibhaktiType.SAPTAMI (सप्तमी)
print(result.explanation)        # "Detected सप्तमी vibhakti"

2. Dictionary Lookup

from marathi_shabda import lookup_word

result = lookup_word("पाणी")
print(result.english_meanings)   # ['water']
print(result.found)              # True

# Also works with Roman input
result = lookup_word("pani")
print(result.lemma)              # पाणी
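
Accepting Roman input implies first deciding which script a word is written in. The sketch below shows one way to make that decision using only the standard library. The function names are hypothetical and are not part of the marathi-shabda API; the library's actual normalization step may work differently.

```python
import unicodedata


def is_devanagari(word: str) -> bool:
    """Return True if the word is non-empty and every character is in the Devanagari block."""
    return bool(word) and all("\u0900" <= ch <= "\u097F" for ch in word)


def detect_script(word: str) -> str:
    """Classify a single word as 'devanagari', 'roman', or 'unknown' (hypothetical helper)."""
    # NFC normalization gives equivalent inputs a single canonical form.
    word = unicodedata.normalize("NFC", word.strip())
    if is_devanagari(word):
        return "devanagari"
    if word.isascii() and word.isalpha():
        return "roman"
    # Mixed-script, empty, or punctuated input is reported rather than guessed at.
    return "unknown"
```

Only words classified as "roman" would need transliteration; anything "unknown" is better rejected than guessed at, in keeping with the conservative approach described above.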

3. Morphological Analysis

from marathi_shabda import analyze_word

result = analyze_word("मुलाने")
print(result.lemma)      # मुल
print(result.pos)        # POSTag.NOUN
print(result.vibhakti)   # VibhaktiType.TRUTIYA (तृतीया)
print(result.confidence) # 0.9
print(result.explanation)
# "Detected तृतीया vibhakti; Inferred noun"

How It Works

Architecture

Input Word
   ↓
Normalization (Roman → Devanagari)
   ↓
Dictionary Check (exact match?)
   ↓
Vibhakti Detection (longest-first)
   ↓
Stem Alternations (पाण्य → पाणी)
   ↓
Dictionary Validation (lemma exists?)
   ↓
POS & Kāl Inference
   ↓
Result with Confidence
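
The pipeline above can be sketched end to end with toy data. Everything in this block is an assumption made for illustration: the three-entry dictionary, the two suffixes, the single alternation rule, and the function names. It is not the library's implementation, only the same sequence of steps in miniature.

```python
from typing import Dict, List, Optional, Tuple

# Toy data, assumed for illustration only.
DICTIONARY: Dict[str, List[str]] = {"पाणी": ["water"], "घर": ["house"]}
SUFFIXES: List[str] = ["ावर", "ात"]                 # listed longest first
ALTERNATIONS: List[Tuple[str, str]] = [("्य", "ी")]  # oblique ending -> base ending


def candidate_stems(stem: str) -> List[str]:
    """Return the stem itself plus any forms produced by the alternation rules."""
    candidates = [stem]
    for old, new in ALTERNATIONS:
        if stem.endswith(old):
            candidates.append(stem[: -len(old)] + new)
    return candidates


def toy_lemma(word: str) -> Tuple[Optional[str], float]:
    """Return (lemma, confidence) for a Devanagari word using the toy data above."""
    # Step 1: exact dictionary match.
    if word in DICTIONARY:
        return word, 1.0
    # Step 2: strip a suffix, then generate candidate stems.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            for candidate in candidate_stems(word[: -len(suffix)]):
                # Step 3: only a dictionary hit turns a candidate into a lemma.
                if candidate in DICTIONARY:
                    return candidate, 0.9
    # Nothing validated: report failure instead of guessing.
    return None, 0.0
```

With this toy data, `toy_lemma("पाण्यावर")` strips ावर, rewrites पाण्य to पाणी, and accepts it because पाणी is in the dictionary.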

Key Principles

  1. Dictionary-first validation: Rules generate candidates, dictionary decides truth
  2. Longest-match-first: Detects मध्ये before ये
  3. Conservative inference: Returns UNKNOWN when uncertain
  4. Explainable decisions: Every result includes reasoning
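
Longest-match-first matters because a short suffix can be the tail of a longer one. A minimal sketch, assuming a hypothetical two-entry suffix list and a hypothetical helper name (neither comes from the library):

```python
from typing import List, Optional

# Deliberately listed shortest first to show that input order does not matter.
SUFFIXES: List[str] = ["ये", "मध्ये"]


def longest_suffix(word: str, suffixes: Optional[List[str]] = None) -> Optional[str]:
    """Return the longest suffix that ends the word and leaves a non-empty stem."""
    if suffixes is None:
        suffixes = SUFFIXES
    for suffix in sorted(suffixes, key=len, reverse=True):
        if len(word) > len(suffix) and word.endswith(suffix):
            return suffix
    return None
```

For घरामध्ये both suffixes match the end of the word. Trying ये first would leave the meaningless stem घरामध्, while trying मध्ये first leaves घरा.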

Confidence & Ambiguity

Confidence Scores

  • 1.0: Exact dictionary match
  • 0.9: Vibhakti detected, lemma validated
  • 0.7: Ambiguous (multiple possible lemmas)
  • 0.0: Word not in dictionary
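
The score table can be read as a small decision function. The sketch below is one possible encoding, assuming the analysis has already produced an exact-match flag and a list of dictionary-validated lemmas. The function name and parameters are hypothetical, and how the library computes scores internally is not stated here.

```python
from typing import Sequence


def confidence_for(exact_match: bool, validated_lemmas: Sequence[str]) -> float:
    """Map an analysis outcome to the documented confidence scores."""
    if exact_match:
        return 1.0  # exact dictionary match
    if len(validated_lemmas) == 1:
        return 0.9  # vibhakti detected, single lemma validated
    if len(validated_lemmas) > 1:
        return 0.7  # ambiguous: multiple possible lemmas
    return 0.0      # word not in dictionary
```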

Handling Ambiguity

from marathi_shabda import get_lemma

result = get_lemma("घरात")
if result.ambiguous:
    # More than one lemma passed dictionary validation
    print(f"Multiple interpretations: {result.candidates}")
    # ['घर', 'घरात']  # Could be noun or compound

Philosophy: We surface ambiguity instead of making false claims.
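
Callers can apply the same policy on their side by declining to act on ambiguous or low-confidence results. In the sketch below, `LemmaResult` is a stand-in dataclass that mirrors only the fields shown in this README. It is not the library's own class, and `accept_lemma` and its threshold are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LemmaResult:
    """Stand-in for a lemma result; mirrors fields shown in the examples above."""
    lemma: Optional[str]
    confidence: float
    ambiguous: bool = False
    candidates: List[str] = field(default_factory=list)


def accept_lemma(result: LemmaResult, min_confidence: float = 0.9) -> Optional[str]:
    """Return the lemma only when it is unambiguous and confident enough."""
    if result.ambiguous or result.confidence < min_confidence:
        return None  # defer to a human or a later processing stage
    return result.lemma
```

Returning `None` leaves the decision to the caller, who can still inspect `result.candidates` when a best guess is acceptable.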


Offline Guarantee

marathi-shabda works completely offline:

  • ✅ No network requests
  • ✅ No API keys
  • ✅ No telemetry
  • ✅ Bundled SQLite database
  • ✅ Pure Python (stdlib only)

Perfect for:

  • Privacy-sensitive applications
  • Offline environments
  • Embedded systems
  • Research reproducibility
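
A dictionary lookup against a bundled SQLite database needs nothing beyond the standard-library `sqlite3` module. The sketch below uses an in-memory database with an assumed table name, assumed columns, and assumed rows. The schema of the database shipped with marathi-shabda is not documented here and may differ.

```python
import sqlite3
from typing import List


def open_toy_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical schema and a few rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entries (marathi TEXT NOT NULL, english TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO entries (marathi, english) VALUES (?, ?)",
        [("पाणी", "water"), ("घर", "house"), ("घर", "home")],
    )
    conn.commit()
    return conn


def meanings(conn: sqlite3.Connection, word: str) -> List[str]:
    """Return all English meanings stored for a Marathi word, in insertion order."""
    rows = conn.execute(
        "SELECT english FROM entries WHERE marathi = ? ORDER BY rowid",
        (word,),
    ).fetchall()
    return [row[0] for row in rows]
```

A database file on disk can be opened read-only with `sqlite3.connect("file:<path>?mode=ro", uri=True)`, which guards the bundled data against accidental writes.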

Limitations

Current Limitations (v0.1.0)

  • Single words only: No sentence parsing
  • Conservative POS tagging: Limited to obvious cases
  • Basic kāl detection: Only common verb patterns
  • No semantic analysis: Dictionary meanings only
  • Limited verb conjugation: Focus on nouns/vibhakti

Known Edge Cases

  • Compound words may not split correctly
  • Rare vibhaktis may not be detected
  • Ambiguous forms return multiple candidates
  • Roman transliteration is approximate

We document limitations honestly. If you encounter issues, please report them!


Future Roadmap

v0.2.0 (Planned)

  • Extended database schema (POS, gender, number)
  • Improved verb conjugation analysis
  • Compound word splitting
  • Performance optimizations

v0.3.0 (Planned)

  • Optional SLM integration for ambiguity resolution
  • Sentence-level analysis (experimental)
  • Batch processing API

Long-term

  • Hybrid rule-based + ML approach
  • Community-contributed dictionary expansions
  • Web API (optional deployment)

Command-Line Interface

# Extract lemma
marathi-shabda lemma पाण्यावर

# Dictionary lookup
marathi-shabda lookup पाणी

# Full analysis
marathi-shabda analyze मुलाने
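
A three-subcommand interface like this one maps naturally onto `argparse` subparsers. The sketch below shows only the argument parsing. It is an assumption about how such a CLI could be structured, not the library's actual entry point, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with lemma, lookup, and analyze subcommands, each taking one word."""
    parser = argparse.ArgumentParser(prog="marathi-shabda")
    subparsers = parser.add_subparsers(dest="command", required=True)
    commands = [
        ("lemma", "Extract the lemma of a word"),
        ("lookup", "Look up a word in the dictionary"),
        ("analyze", "Run full morphological analysis on a word"),
    ]
    for name, help_text in commands:
        sub = subparsers.add_parser(name, help=help_text)
        sub.add_argument("word", help="a single Marathi word (Devanagari or Roman)")
    return parser
```

Parsing `["lemma", "पाण्यावर"]` yields a namespace whose `command` and `word` attributes can then be dispatched to the matching library function.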

Contributing

We welcome your feedback and suggestions! The core codebase is maintained by the project owners, but we encourage the community to get involved.

How You Can Help

  • Use the library in your projects and applications
  • Report issues if you encounter bugs or unexpected behavior
  • Suggest enhancements for vibhakti rules, transliteration, or new features
  • Share use cases to help us understand real-world applications
  • Provide linguistic feedback on Marathi grammar rules and edge cases

Suggesting Improvements

If you have ideas for improvement:

  1. Open an issue on GitHub describing your suggestion
  2. Provide examples of words or patterns that should be handled better
  3. Share linguistic references if applicable (grammar rules, scholarly sources)

We review all suggestions and incorporate valuable feedback into future releases.

Usage Terms

This library is freely available for use under the MIT License. You can:

  • ✅ Use it in personal and commercial projects
  • ✅ Modify it for your own needs
  • ✅ Distribute it with your applications

The project maintainers reserve the right to manage contributions and maintain ownership of the core codebase.

For detailed guidelines, see CONTRIBUTING.md.


License

Free for Educational & Training Use

This software is licensed under CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International) for non-commercial use.

You can freely use this library for:

  • ✅ Educational institutions and training programs
  • ✅ Academic research and publications
  • ✅ Personal learning and experimentation
  • ✅ Non-profit organizations
  • ✅ Student projects and assignments

You cannot use it for:

  • ❌ Commercial software products or services
  • ❌ Business applications or internal tools
  • ❌ Selling or monetizing the software
  • ❌ SaaS or API services for profit

Commercial Licensing

For commercial use, please contact us for a commercial license.

We offer flexible commercial licensing options for businesses and enterprises.

See LICENSE for full legal details.


Contributors

  • Prathmesh Santosh Choudhari (@iampratham29)
  • Vedangi Deepak Deshpande
  • Siddhant Akash Bobde

Acknowledgments

  • @vinodnimbalkar - For valuable open-source contributions to the Marathi language ecosystem
  • Marathi language scholars and grammarians
  • Open-source NLP community
  • All contributors and testers

Citation

If you use marathi-shabda in research, please cite:

@software{marathi_shabda,
  title = {marathi-shabda: Deterministic Marathi Word Analysis},
  author = {Choudhari, Prathmesh Santosh and Deshpande, Vedangi Deepak and Bobde, Siddhant Akash},
  year = {2026},
  url = {https://github.com/iampratham29/marathi-shabda}
}

Philosophy: When unsure, defer. When confident, explain why.

Built with respect for the Marathi language and its speakers. 🙏
