A universal hub-and-spoke morphological representation converter for Sanskrit.

Sanskrit Morph Converter

License: GPL v3 | Python 3.8+

A Python engine for unifying, standardizing, and converting Sanskrit morphological tags across multiple computational paradigms.

In Sanskrit computational linguistics, tools such as the Sanskrit Heritage engine, Samsaadhanii, neural models like ByT5, and baseline grammars like Svarupa output morphological analyses in different formats and vocabularies. sanskrit-morph-converter provides a centralized, pivot-based architecture that translates these tagsets into a unified Canonical Representation.

Installation

Install the package directly from PyPI:

pip install sanskrit-morph-converter

Python API Usage

You can import the converter directly into your Python scripts to process strings or JSON outputs from various platforms. The core .convert() method takes a source platform, a target platform, and the raw input.

from sanskrit_morph_converter.converter import RepresentationConverter

# Initialize the converter (automatically loads the compiled mapping TSVs)
converter = RepresentationConverter()

Example 1: Converting ByT5 Output to Canonical

ByT5 outputs are flat strings: underscores separate the surface form, the stem, and the feature block, and pipes separate the individual features. The converter parses these into standard Canonical properties.

byt5_raw = "devam_deva_Case=Acc|Gender=Masc|Number=Sing"

# Convert ByT5 to Canonical
canonical_tags = converter.convert('ByT5', 'Canonical', byt5_raw)
print(canonical_tags)
# Output: [{'input': 'देवम्', 'stem': 'देव', 'root': '', 'morph': 'Case=Accusative|Gender=Masculine|Number=Singular'}]
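
The layout of the ByT5 string in this example (surface form, stem, then pipe-separated Feature=Value pairs, all joined by underscores) can be illustrated with a small standalone sketch. The function split_byt5_tag is a hypothetical helper written for this illustration and is not part of the package; it assumes exactly the three-field layout of the example, and the library's own adapter may handle more cases.

```python
def split_byt5_tag(raw: str) -> dict:
    """Split a 'form_stem_Feat=Val|Feat=Val' string into its parts.

    Hypothetical helper for illustration only; assumes the three-field
    layout shown in the ByT5 example.
    """
    # Split on the first two underscores only, so any underscore that
    # appears later in the feature block is left untouched.
    form, stem, feats = raw.split("_", 2)
    # Each feature is a 'Name=Value' pair; pairs are separated by '|'.
    features = dict(pair.split("=", 1) for pair in feats.split("|") if pair)
    return {"form": form, "stem": stem, "features": features}


parsed = split_byt5_tag("devam_deva_Case=Acc|Gender=Masc|Number=Sing")
print(parsed["features"])
# {'Case': 'Acc', 'Gender': 'Masc', 'Number': 'Sing'}
```

This sketch only separates the fields. Expanding abbreviations such as Acc to Accusative and transliterating to Devanagari are left to the converter itself.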

Example 2: Converting Sanskrit Heritage (SH) to DCS

The Sanskrit Heritage engine returns nested JSON dictionaries. You can pass the JSON string directly to convert it to another format, such as DCS.

sh_raw = """{
    "input": "गच्छति", 
    "status": "Success", 
    "morph": [{"word": "गच्छति", "root": "गम्", "inflectional_morphs": ["pr. [1] ac. sg. 3"]}]
}"""

# Convert SH to DCS
dcs_tags = converter.convert('SH', 'DCS', sh_raw, output_format='string')
print(dcs_tags)
# Output (Example): ['gacchati\tgam\tMood=Ind|Number=Sing|Person=3|Tense=Pres']
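
To see what the nested SH structure contains before handing it to the converter, it can be inspected with the standard json module. The sketch below assumes only the keys visible in the example (morph, word, root, inflectional_morphs); list_sh_analyses is a hypothetical helper, not a function of this package.

```python
import json

sh_raw = """{
    "input": "गच्छति",
    "status": "Success",
    "morph": [{"word": "गच्छति", "root": "गम्", "inflectional_morphs": ["pr. [1] ac. sg. 3"]}]
}"""


def list_sh_analyses(raw: str) -> list:
    """Return (word, root, tag) triples from an SH-style JSON string.

    Hypothetical helper; it relies only on the keys shown in the example.
    """
    data = json.loads(raw)
    triples = []
    # One entry per analysed word; each entry may carry several tags.
    for entry in data.get("morph", []):
        for tag in entry.get("inflectional_morphs", []):
            triples.append((entry.get("word", ""), entry.get("root", ""), tag))
    return triples


analyses = list_sh_analyses(sh_raw)
for word, root, tag in analyses:
    print(word, root, tag, sep="\t")
```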

Command Line Interface (CLI)

The package includes a built-in CLI for batch processing files or testing quick strings directly from your terminal.

Convert a single string:

smc convert ByT5 Canonical -i "devam_deva_Case=Acc|Gender=Masc|Number=Sing"

Process an entire file and save the output:

smc convert SH Canonical -f data/sh_analysis.tsv -o data/canonical_results.tsv

Change the output script (e.g., to WX or IAST):

smc convert ByT5 SH -i "devam_deva_Case=Acc|Gender=Masc|Number=Sing" --script WX
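
When the CLI is driven from a script, the same commands can be assembled as an argument list and run with subprocess. The sketch below uses only the flags shown above (-i, -f, -o, --script); build_smc_command is a hypothetical wrapper, and the call is skipped when smc is not found on the PATH.

```python
import shutil
import subprocess


def build_smc_command(source, target, text=None, in_file=None,
                      out_file=None, script=None):
    """Assemble an argument list for `smc convert`.

    Hypothetical wrapper; it only uses flags shown in the CLI examples.
    """
    cmd = ["smc", "convert", source, target]
    if text is not None:
        cmd += ["-i", text]
    if in_file is not None:
        cmd += ["-f", in_file]
    if out_file is not None:
        cmd += ["-o", out_file]
    if script is not None:
        cmd += ["--script", script]
    return cmd


cmd = build_smc_command(
    "ByT5", "SH",
    text="devam_deva_Case=Acc|Gender=Masc|Number=Sing",
    script="WX",
)
print(cmd)

# Run only if the CLI is actually installed on this machine.
if shutil.which("smc"):
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string means the pipe characters in the tag are never interpreted by a shell, so no extra quoting is needed.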

Architecture

This library operates on a flexible, three-stage pipeline: Adapters (to read the source format), a Mapper (to route to a mathematical Pivot), and a Converter (to format the target platform output).
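
The benefit of routing through a pivot is that each platform needs only one table to the pivot and one back, rather than a separate table for every pair of platforms. The toy sketch below illustrates that idea. Everything in it is hypothetical: the "Toy" platform, the pivot labels, and the function name are invented for illustration and do not reflect the package's internal classes or data.

```python
# Hypothetical tag tables for illustration; the real mappings live in the
# package's compiled TSV files.
TO_PIVOT = {
    "ByT5": {"Case=Acc": "case:accusative", "Number=Sing": "number:singular"},
    "Toy": {"acc.": "case:accusative", "sg.": "number:singular"},
}

# Invert each table to obtain the pivot -> platform direction.
FROM_PIVOT = {
    platform: {pivot: tag for tag, pivot in table.items()}
    for platform, table in TO_PIVOT.items()
}


def convert_tags(source: str, target: str, tags: list) -> list:
    """Route tags source -> pivot -> target, with no direct pair table."""
    pivot = [TO_PIVOT[source][tag] for tag in tags]   # read and map to pivot
    return [FROM_PIVOT[target][p] for p in pivot]     # format for the target


print(convert_tags("ByT5", "Toy", ["Case=Acc", "Number=Sing"]))
# ['acc.', 'sg.']
```

Adding a further platform to this sketch means adding one entry to TO_PIVOT; conversions to and from every existing platform then work without further tables.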

The Google Sheets Integration

To ensure this tool remains accessible to linguists and researchers who may not write code, the mapping vocabulary is not hardcoded. Instead, tag standardizations and lexical exceptions (like pronouns and causatives) are maintained collaboratively in a Master Google Sheet.

When linguistic rules are updated in the sheet, you can use the built-in compiler to fetch the latest data and rebuild the internal .tsv files (pivot_mapping.tsv, normalization.tsv, etc.) without altering the Python engine.

To fetch the latest mappings from the Google Sheet:

sanskrit-morph update

(Note: The pre-compiled .tsv files are already bundled with the PyPI package, so standard users do not need to run the compiler to use the tool.)
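
Keeping the vocabulary in .tsv files means the engine can treat it as plain data. As a rough sketch of that idea, a two-column tab-separated table can be read into a lookup dictionary with the standard csv module. The column layout here (source tag, canonical tag) is an assumption made for illustration; the actual columns of pivot_mapping.tsv and normalization.tsv are not described in this document, and load_mapping is a hypothetical helper.

```python
import csv
import io

# Hypothetical two-column layout: source tag, canonical tag.
SAMPLE_TSV = "Acc\tAccusative\nMasc\tMasculine\nSing\tSingular\n"


def load_mapping(handle) -> dict:
    """Read a two-column TSV into a {source_tag: canonical_tag} dict."""
    reader = csv.reader(handle, delimiter="\t")
    # Skip blank or malformed rows that have fewer than two columns.
    return {row[0]: row[1] for row in reader if len(row) >= 2}


mapping = load_mapping(io.StringIO(SAMPLE_TSV))
print(mapping["Acc"])
# Accusative
```

With a real file, the same function would be called on an open file handle, for example open(path, encoding="utf-8", newline="").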

📜 License

This project is licensed under the GNU General Public License v3; see the LICENSE file for details.

