sanskrit-tokenizer

Tokenize Sanskrit text with sandhi splitting for Information Retrieval.

pip install .

Quick start

from sanskrit_tokenizer import tokenize

tokenize("devālaya")
# ['deva', 'ālaya']

tokenize("धर्म योग")
# ['dharma', 'yoga']

tokenize("dharmakṣetre kurukṣetre")
# ['dharmakṣa', 'itre', 'kurukṣa', 'itre']

tokenize() normalizes the input to IAST, splits it on whitespace and punctuation, then applies reverse sandhi rules. It accepts both Devanagari and IAST input.

Sandhi splitting

from sanskrit_tokenizer.sandhi import split_sandhi

split_sandhi("devālaya")   # savarṇa-dīrgha: ā → a + ā
# ['deva', 'ālaya']

split_sandhi("dharma")     # no junction found
# ['dharma']

A rule-based engine covering vowel sandhi (savarṇa-dīrgha, guṇa, vṛddhi, yaṇ, ayādi), consonant sandhi (voicing, nasals, t-combinations), and visarga sandhi. It uses a longest-match heuristic when splits are ambiguous.
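To make the idea of reverse sandhi rules concrete, here is a minimal standalone sketch. It is not the package's implementation. It covers only a few vowel rules, validates candidate splits against a hypothetical toy lexicon (LEXICON), and reads "longest match" as "prefer the longest left-hand part"; all three are assumptions made for illustration, and the names split_sandhi_sketch, REVERSE_RULES and LEXICON do not come from the package.

```python
import unicodedata

# Surface vowel -> possible (left ending, right beginning) pairs.
# Illustrative subset only: savarṇa-dīrgha for ā/ī and guṇa for e/o.
REVERSE_RULES = {
    "ā": [("a", "a"), ("a", "ā"), ("ā", "a"), ("ā", "ā")],
    "ī": [("i", "i"), ("i", "ī"), ("ī", "i"), ("ī", "ī")],
    "e": [("a", "i"), ("ā", "i")],
    "o": [("a", "u"), ("ā", "u")],
}

# Hypothetical toy lexicon used to accept or reject candidate splits.
LEXICON = {"deva", "ālaya", "mahā", "indra", "sūrya", "udaya", "dharma"}


def split_sandhi_sketch(word, lexicon=LEXICON):
    """Return [left, right] for the first-ranked valid split, else [word]."""
    # Work on precomposed characters so each IAST vowel is one code point.
    word = unicodedata.normalize("NFC", word)
    if word in lexicon:
        return [word]

    candidates = []
    for i, ch in enumerate(word):
        for left_end, right_start in REVERSE_RULES.get(ch, []):
            left = word[:i] + left_end
            right = right_start + word[i + 1:]
            # Keep a split only when both halves are known words.
            if left in lexicon and right in lexicon:
                candidates.append((left, right))

    if not candidates:
        return [word]  # no junction found

    # Assumed tie-break: prefer the candidate with the longest left part.
    left, right = max(candidates, key=lambda pair: len(pair[0]))
    return [left, right]


print(split_sandhi_sketch("devālaya"))
print(split_sandhi_sketch("mahendra"))
print(split_sandhi_sketch("dharma"))
```

A lexicon check is one common way to prune spurious splits; a purely rule-based splitter without one will accept any position where a rule could apply.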

Transliteration

from sanskrit_tokenizer.transliterate import (
    devanagari_to_iast,
    iast_to_devanagari,
    is_devanagari,
)

devanagari_to_iast("भगवद्गीता")
# 'bhagavadgītā'

iast_to_devanagari("rāmāyaṇam")
# 'रामायणम्'

is_devanagari("धर्म")
# True
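For readers curious how Devanagari-to-IAST conversion works in principle, the sketch below handles the core mechanics: the inherent "a" after a consonant, vowel signs that replace it, and the virama that suppresses it. It is a standalone illustration with a deliberately tiny character table, not the package's code. The function names carry a _sketch suffix to mark them as hypothetical, and treating a string as Devanagari when any character falls in the U+0900–U+097F block is an assumption.

```python
# Tiny illustrative tables; a real transliterator needs the full alphabet.
CONSONANTS = {
    "ध": "dh", "र": "r", "म": "m", "य": "y",
    "ग": "g", "द": "d", "व": "v", "ल": "l",
}
VOWEL_SIGNS = {
    "\u093E": "ā",  # sign aa
    "\u093F": "i",  # sign i
    "\u0940": "ī",  # sign ii
    "\u0941": "u",  # sign u
    "\u0942": "ū",  # sign uu
    "\u0947": "e",  # sign e
    "\u094B": "o",  # sign o
}
INDEPENDENT_VOWELS = {"अ": "a", "आ": "ā", "इ": "i", "उ": "u"}
VIRAMA = "\u094D"  # suppresses the inherent vowel


def is_devanagari_sketch(text):
    """True if any character lies in the Devanagari Unicode block."""
    return any("\u0900" <= ch <= "\u097F" for ch in text)


def devanagari_to_iast_sketch(text):
    out = []
    pending_a = False  # True right after a consonant awaiting its vowel
    for ch in text:
        if pending_a and ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])  # sign replaces the inherent 'a'
            pending_a = False
            continue
        if pending_a and ch == VIRAMA:
            pending_a = False  # virama removes the inherent 'a'
            continue
        if pending_a:
            out.append("a")  # nothing overrode it, so emit the inherent 'a'
            pending_a = False
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            pending_a = True
        elif ch in INDEPENDENT_VOWELS:
            out.append(INDEPENDENT_VOWELS[ch])
        else:
            out.append(ch)  # pass through spaces and unknown characters
    if pending_a:
        out.append("a")
    return "".join(out)


print(devanagari_to_iast_sketch("धर्म योग"))
print(is_devanagari_sketch("dharma"))
```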

Word-level tokenization

from sanskrit_tokenizer.tokenizer import tokenize_words

tokenize_words("devālaya namaḥ")
# ['devālaya', 'namaḥ']

tokenize_words() splits on whitespace and punctuation only; it does not apply sandhi splitting.
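Splitting on whitespace and punctuation can be sketched without the package as follows. This is an assumption about the general technique, not the package's code, and tokenize_words_sketch is a hypothetical name. It classifies characters by Unicode category rather than using the regex \w class, because Python's \w does not match Devanagari combining marks such as the virama and would break words apart.

```python
import unicodedata


def tokenize_words_sketch(text):
    """Split on whitespace and any Unicode punctuation (category P*)."""
    tokens = []
    current = []
    for ch in text:
        is_boundary = ch.isspace() or unicodedata.category(ch).startswith("P")
        if is_boundary:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            # Letters, digits and combining marks all stay inside the token.
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


print(tokenize_words_sketch("devālaya namaḥ"))
print(tokenize_words_sketch("धर्म। योग"))
```

The danda (।) falls under Unicode punctuation, so it acts as a boundary here just like a comma or full stop.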

CLI

sanskrit-tokenize "devālaya"
# deva
# ālaya

echo "धर्म योग" | sanskrit-tokenize
# dharma
# yoga

sanskrit-tokenize --no-sandhi "devālaya"
# devālaya

sanskrit-tokenize -s " " "dharma yoga"
# dharma yoga

Options:

  • --no-sandhi: word-level only, skip sandhi splitting
  • -s SEP, --separator SEP: output separator (default: newline)
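The command-line behaviour shown above (a positional argument or standard input, --no-sandhi, and -s/--separator) could be wired up with argparse roughly as follows. This is a sketch under assumptions, not the package's entry point: tokenize_stub is a hypothetical stand-in that only splits on whitespace, and the run and build_parser helpers are invented for the example.

```python
import argparse
import sys


def tokenize_stub(text, sandhi=True):
    # Hypothetical stand-in for the real tokenizer: whitespace split only.
    # The sandhi flag is accepted but ignored in this sketch.
    return text.split()


def build_parser():
    parser = argparse.ArgumentParser(prog="sanskrit-tokenize")
    parser.add_argument(
        "text", nargs="?",
        help="text to tokenize; read from standard input when omitted",
    )
    parser.add_argument(
        "--no-sandhi", action="store_true",
        help="word-level only, skip sandhi splitting",
    )
    parser.add_argument(
        "-s", "--separator", default="\n",
        help="output separator (default: newline)",
    )
    return parser


def run(argv, stdin=None):
    """Parse argv, tokenize, and return the joined output string."""
    args = build_parser().parse_args(argv)
    if args.text is not None:
        text = args.text
    else:
        text = (stdin or sys.stdin).read()
    tokens = tokenize_stub(text, sandhi=not args.no_sandhi)
    return args.separator.join(tokens)


def main():
    # Entry point a console script could call.
    print(run(sys.argv[1:]))
```

Returning the joined string from run() instead of printing inside it keeps the argument handling easy to test without capturing standard output.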

License

MIT © Hemanth.HM
