# Vedika - Sanskrit NLP Toolkit

Vedika is a comprehensive toolkit for Sanskrit text processing, offering deep learning-based tools for sandhi splitting and joining, text normalization, sentence splitting, syllabification, and tokenization.
## Features

- **Sandhi Processing**
  - Split compound Sanskrit words using attention-based neural networks
  - Join Sanskrit words with proper sandhi rules
  - Support for beam search to get multiple suggestions
- **Text Processing**
  - Syllabification
  - Tokenization
  - Sentence splitting
  - Text normalization
## Installation

```bash
# Install from PyPI
pip install vedika
```
## Requirements
- Python >= 3.8
- PyTorch >= 1.9.0
- NumPy >= 1.19.0
- Pandas >= 1.3.0
- tqdm >= 4.62.0
- regex >= 2021.8.3
## Quick Start

### Sandhi Splitting
```python
from vedika import SanskritSplit

# Initialize splitter
splitter = SanskritSplit()

# Split a single word
result = splitter.split("रामायणम्")
print(result['split'])  # Output: राम + अयन + अम्

# Batch processing
words = ["रामायणम्", "गीतागोविन्दम्"]
results = splitter.split_batch(words)
for result in results:
    print(f"{result['input']} → {result['split']}")
```
### Sandhi Joining
```python
from vedika import SandhiJoiner

# Initialize joiner
joiner = SandhiJoiner()

# Join split words
result = joiner.join("राम+अस्ति")
print(result)  # Output: रामास्ति

# Batch processing
texts = ["राम+अस्ति", "गच्छ+अमि"]
results = joiner.join_batch(texts)
print(results)  # ['रामास्ति', 'गच्छामि']
```
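For comparison with the neural joiner, the examples above can also be handled by a classical sandhi rule. Below is a hedged sketch of one such rule (savarṇa-dīrgha: a/ā + a/ā → ā) applied to the same `+`-delimited input format; the function name `join_a_sandhi` and the rule coverage are illustrative only, not part of Vedika's API.

```python
A, AA = "\u0905", "\u0906"   # independent vowels अ, आ
AA_MATRA = "\u093e"          # dependent ā sign ा

def join_a_sandhi(text):
    """Apply the savarna-dirgha rule (a/ā + a/ā → ā) to a 'left+right' pair.
    A bare Devanagari consonant carries the inherent vowel 'a'."""
    left, right = text.split("+")
    last = left[-1]
    ends_in_a = 0x0915 <= ord(last) <= 0x0939  # bare consonant → inherent 'a'
    ends_in_aa = last == AA_MATRA
    if (ends_in_a or ends_in_aa) and right[0] in (A, AA):
        stem = left[:-1] if ends_in_aa else left
        return stem + AA_MATRA + right[1:]
    return left + right  # no rule matched: plain concatenation

print(join_a_sandhi("राम+अस्ति"))  # रामास्ति
print(join_a_sandhi("गच्छ+अमि"))  # गच्छामि
```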
## Advanced Usage

### Beam Search for Multiple Suggestions
```python
# Get multiple suggestions with beam search
result = splitter.split("रामायणम्", beam_size=3)
print(f"Best split: {result['split']}")
print(f"Confidence: {result['confidence']}")
print("Alternatives:")
for alt in result['alternatives']:
    print(f"- {alt['split']} (confidence: {alt['confidence']})")
```
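Beam search itself is a generic decoding strategy: at each step, every partial hypothesis is extended by every candidate token, and only the `beam_size` highest-scoring hypotheses are kept. A minimal sketch follows; this is not Vedika's decoder, and `step_fn` is a stand-in for the model's next-token distribution.

```python
import math

def beam_search(step_fn, start, beam_size, max_steps):
    """Keep the beam_size highest-scoring partial sequences at each step.
    step_fn(seq) returns a list of (token, log_prob) continuations,
    or an empty list when the sequence is finished."""
    beams = [(0.0, [start])]  # (cumulative log-prob, sequence)
    for _ in range(max_steps):
        candidates = [
            (score + logp, seq + [token])
            for score, seq in beams
            for token, logp in step_fn(seq)
        ]
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beams  # best hypothesis first

# Toy model: always prefers 'a' (p=0.6) over 'b' (p=0.4).
toy = lambda seq: [("a", math.log(0.6)), ("b", math.log(0.4))]
best_score, best_seq = beam_search(toy, "<s>", beam_size=3, max_steps=2)[0]
print(best_seq)  # ['<s>', 'a', 'a']
```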
### Model Information
```python
# Get model details
info = splitter.get_model_info()
print(f"Vocabulary size: {info['vocabulary_size']}")
print(f"Device: {info['device']}")
print(f"Configuration: {info['model_config']}")
```
## Project Structure
```
vedika/
├── __init__.py
├── normalizer.py
├── sandhi_join.py
├── sandhi_split.py
├── sentence_splitter.py
├── syllabification.py
├── tokenizer.py
└── data/
    ├── cleaned_metres.json
    ├── sandhi_joiner.pth
    └── sandhi_split.pth
```
## Model Architecture
The sandhi processing models use:
- Bidirectional LSTM encoder
- GRU decoder with attention
- Multi-head attention mechanism
- Character-level processing
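To make the attention step concrete, here is a dependency-free sketch of single-head dot-product attention: the decoder's query is scored against each encoder state, the scores are softmax-normalized, and the context vector is the weighted sum of the encoder states. The actual models use multi-head attention over BiLSTM outputs, so this illustrates the mechanism, not the trained architecture.

```python
import math

def attend(query, keys, values):
    """One attention step: softmax(q · k_i) over the keys, then a
    weighted sum of the value vectors (the 'context' vector)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [
        sum(w * v[d] for w, v in zip(weights, values))
        for d in range(len(values[0]))
    ]
    return weights, context

# The query aligns with the first encoder state, so most weight goes there.
weights, context = attend([1.0, 0.0],
                          keys=[[1.0, 0.0], [0.0, 1.0]],
                          values=[[1.0, 0.0], [0.0, 1.0]])
print([round(w, 3) for w in weights])  # [0.731, 0.269]
```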
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Authors
- Tanuj Saxena
- Soumya Sharma
## Citation
If you use Vedika in your research, please cite:
```bibtex
@software{vedika2025,
  title={Vedika: A Sanskrit Text Processing Toolkit},
  author={Saxena, Tanuj and Sharma, Soumya},
  year={2025},
  url={https://github.com/tanuj437/vedika}
}
```