Sanskrit tokenizer with sandhi splitting for Information Retrieval.
sanskrit-tokenizer
Tokenize Sanskrit text with sandhi splitting for Information Retrieval.
pip install .
Quick start
from sanskrit_tokenizer import tokenize
tokenize("devālaya")
# ['deva', 'ālaya']
tokenize("धर्म योग")
# ['dharma', 'yoga']
tokenize("dharmakṣetre kurukṣetre")
# ['dharmakṣa', 'itre', 'kurukṣa', 'itre']
tokenize() normalizes to IAST, splits on whitespace and punctuation, then applies reverse sandhi rules. Accepts both Devanagari and IAST.
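The three stages described above can be pictured as a simple composition. The following is a minimal sketch, not the package's actual implementation: `make_tokenizer` and its three parameters are hypothetical names, and the stand-in stages at the bottom are toys that only make the example runnable.

```python
import re
from typing import Callable, Iterable, List


def make_tokenizer(
    normalize: Callable[[str], str],
    split_words: Callable[[str], Iterable[str]],
    split_junctions: Callable[[str], List[str]],
) -> Callable[[str], List[str]]:
    """Compose normalize -> word split -> junction split (hypothetical helper)."""

    def run(text: str) -> List[str]:
        tokens: List[str] = []
        # Stage 1 normalizes the whole input, stage 2 yields words,
        # stage 3 may turn each word into one or more tokens.
        for word in split_words(normalize(text)):
            tokens.extend(split_junctions(word))
        return tokens

    return run


# Toy stand-ins (assumptions for illustration only): lowercasing instead of
# IAST normalization, and "_" marking a junction instead of real sandhi rules.
toy = make_tokenizer(
    normalize=str.lower,
    split_words=lambda s: re.findall(r"\w+", s),
    split_junctions=lambda w: w.split("_"),
)

print(toy("Deva_Alaya, namah"))  # ['deva', 'alaya', 'namah']
```

Keeping the stages separate in this way is what allows a word-level mode to reuse the first two stages and skip the third.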
Sandhi splitting
from sanskrit_tokenizer.sandhi import split_sandhi
split_sandhi("devālaya") # savarṇa-dīrgha: ā → a + ā
# ['deva', 'ālaya']
split_sandhi("dharma") # no junction found
# ['dharma']
Rule-based engine covering vowel sandhi (savarṇa-dīrgha, guṇa, vṛddhi, yaṇ, ayādi), consonant sandhi (voicing, nasals, t-combinations), and visarga sandhi. Uses a longest-match heuristic when splits are ambiguous.
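To make the idea of a reverse sandhi rule concrete, here is a minimal sketch of undoing savarṇa-dīrgha only. It is an illustration under stated assumptions, not the engine shipped in the package: the function name `split_savarna_dirgha`, the tiny `LEXICON`, and the reading of "longest match" as "prefer the longest left-hand part" are all hypothetical.

```python
import unicodedata
from typing import List, Optional, Set, Tuple

# A long vowel may be the merger of two similar vowels (short or long).
LONG_TO_PARTS = {"ā": ("a", "ā"), "ī": ("i", "ī"), "ū": ("u", "ū")}


def split_savarna_dirgha(word: str, lexicon: Set[str]) -> List[str]:
    """Try to undo one savarṇa-dīrgha junction, validated against a lexicon."""
    # Work on precomposed characters so "ā" is a single code point.
    word = unicodedata.normalize("NFC", word)
    best: Optional[Tuple[str, str]] = None
    for i, ch in enumerate(word):
        parts = LONG_TO_PARTS.get(ch)
        if parts is None:
            continue
        for left_vowel in parts:
            for right_vowel in parts:
                left = word[:i] + left_vowel
                right = right_vowel + word[i + 1:]
                if left in lexicon and right in lexicon:
                    # Assumed tie-break: keep the longest left-hand part.
                    if best is None or len(left) > len(best[0]):
                        best = (left, right)
    return list(best) if best else [word]


# Hypothetical mini-lexicon, just enough for the README examples.
LEXICON = {"deva", "ālaya", "dharma"}

print(split_savarna_dirgha("devālaya", LEXICON))  # ['deva', 'ālaya']
print(split_savarna_dirgha("dharma", LEXICON))    # ['dharma']
```

A lexicon check is one way to reject spurious splits; a purely rule-based engine without a lexicon would need some other filter.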
Transliteration
from sanskrit_tokenizer.transliterate import (
    devanagari_to_iast,
    iast_to_devanagari,
    is_devanagari,
)
devanagari_to_iast("भगवद्गीता")
# 'bhagavadgītā'
iast_to_devanagari("rāmāyaṇam")
# 'रामायणम्'
is_devanagari("धर्म")
# True
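For readers curious about what Devanagari-to-IAST conversion involves, the sketch below shows the core bookkeeping: each consonant carries an inherent "a" unless a vowel sign or virama follows. It is a toy under stated assumptions, covering only a handful of characters; `to_iast_sketch` and `looks_devanagari` are hypothetical names and do not describe how the package's own functions are written.

```python
CONSONANTS = {"ग": "g", "द": "d", "ध": "dh", "म": "m",
              "य": "y", "र": "r", "ल": "l", "व": "v"}
VOWELS = {"अ": "a", "आ": "ā", "इ": "i", "ई": "ī",
          "उ": "u", "ऊ": "ū", "ए": "e", "ओ": "o"}
# Dependent vowel signs (matras), written as escapes because they are
# combining characters.
MATRAS = {"\u093e": "ā", "\u093f": "i", "\u0940": "ī", "\u0941": "u",
          "\u0942": "ū", "\u0947": "e", "\u094b": "o"}
VIRAMA = "\u094d"  # suppresses the inherent vowel


def looks_devanagari(text: str) -> bool:
    """True if any character lies in the Devanagari block U+0900..U+097F."""
    return any("\u0900" <= ch <= "\u097f" for ch in text)


def to_iast_sketch(text: str) -> str:
    out = []
    pending_a = False  # previous consonant still owes its inherent 'a'
    for ch in text:
        if ch in CONSONANTS:
            if pending_a:
                out.append("a")
            out.append(CONSONANTS[ch])
            pending_a = True
        elif ch in MATRAS:
            out.append(MATRAS[ch])  # the sign replaces the inherent 'a'
            pending_a = False
        elif ch == VIRAMA:
            pending_a = False  # bare consonant, no vowel emitted
        else:
            if pending_a:
                out.append("a")
                pending_a = False
            out.append(VOWELS.get(ch, ch))  # unknown characters pass through
    if pending_a:
        out.append("a")
    return "".join(out)


print(to_iast_sketch("धर्म योग"))  # dharma yoga
print(looks_devanagari("धर्म"))     # True
```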
Word-level tokenization
from sanskrit_tokenizer.tokenizer import tokenize_words
tokenize_words("devālaya namaḥ")
# ['devālaya', 'namaḥ']
tokenize_words() splits on whitespace and punctuation only; it does not apply sandhi splitting.
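Splitting on whitespace and punctuation can be approximated with a single regular expression. This is a sketch only: `split_words` is a hypothetical name, and the exact punctuation set (including the danda and double danda) is an assumption, since the README does not list which characters the package treats as separators.

```python
import re

# Whitespace, common ASCII punctuation, danda (U+0964), double danda (U+0965).
_SEPARATORS = re.compile(r"[\s.,;:!?()\[\]|\u0964\u0965]+")


def split_words(text: str) -> list:
    """Split on separator runs and drop the empty strings at the edges."""
    return [token for token in _SEPARATORS.split(text) if token]


print(split_words("devālaya namaḥ"))  # ['devālaya', 'namaḥ']
```

Splitting on an explicit separator set, rather than matching `\w+`, keeps Devanagari vowel signs attached to their consonants, because Python's `\w` does not match combining marks.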
CLI
sanskrit-tokenize "devālaya"
# deva
# ālaya
echo "धर्म योग" | sanskrit-tokenize
# dharma
# yoga
sanskrit-tokenize --no-sandhi "devālaya"
# devālaya
sanskrit-tokenize -s " " "dharma yoga"
# dharma yoga
- --no-sandhi: word-level only, skip sandhi splitting
- --separator SEP (-s SEP): output separator (default: newline)
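A command-line front end with these options could be wired up with `argparse` roughly as follows. This is a sketch, not the package's entry point: `build_parser` and `render` are hypothetical names, and the two tokenizer callables are passed in so the example stays self-contained.

```python
import argparse
import io
import sys
from typing import Callable, List, Optional, TextIO


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="sanskrit-tokenize")
    parser.add_argument("text", nargs="?",
                        help="input text; read from stdin when omitted")
    parser.add_argument("--no-sandhi", action="store_true",
                        help="word-level only, skip sandhi splitting")
    parser.add_argument("-s", "--separator", default="\n", metavar="SEP",
                        help="output separator (default: newline)")
    return parser


def render(argv: List[str],
           words_only: Callable[[str], List[str]],
           with_sandhi: Callable[[str], List[str]],
           stdin: Optional[TextIO] = None) -> str:
    """Parse argv, pick a tokenizer, and return the joined output."""
    args = build_parser().parse_args(argv)
    if args.text is not None:
        text = args.text
    else:
        text = (stdin if stdin is not None else sys.stdin).read()
    tokenizer = words_only if args.no_sandhi else with_sandhi
    return args.separator.join(tokenizer(text))


# Toy tokenizers (assumptions): "+" marks a junction in place of real sandhi.
def toy_words(text: str) -> List[str]:
    return text.split()


def toy_sandhi(text: str) -> List[str]:
    return [part for word in text.split() for part in word.split("+")]


print(render(["-s", " ", "dharma yoga"], toy_words, toy_sandhi))  # dharma yoga
print(render([], toy_words, toy_sandhi, stdin=io.StringIO("a+b")))
```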
License
MIT © Hemanth.HM
File details
Details for the file sanskrit_tokenizer-0.1.0.tar.gz.
File metadata
- Download URL: sanskrit_tokenizer-0.1.0.tar.gz
- Upload date:
- Size: 16.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3b8844fd98e1d4f936aa9c5118f498ab1c107d10184bb53183b650d633c03ca7 |
| MD5 | 689075498a29552a1040e2449d737aa8 |
| BLAKE2b-256 | f10b5733bddafa613206764db53ba96c39a261ce8252aac1ad026ef0586e0e16 |
File details
Details for the file sanskrit_tokenizer-0.1.0-py3-none-any.whl.
File metadata
- Download URL: sanskrit_tokenizer-0.1.0-py3-none-any.whl
- Upload date:
- Size: 16.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 94392c78196a51bc9eecbdfa2c573ff49509209e6208f56dbb2d2905cc2da652 |
| MD5 | da5d3606c35e4933d7e73d941dc129ad |
| BLAKE2b-256 | 005a5ecf06f3cc4f1c21b11b479ccf24f6ed72d2624b5e9184c4c7ad0d041dfc |