Tools for lexical and morphological analysis of Sanskrit
Parsers for Sanskrit / संस्कृतम्
NOTE: This project is still under development. Both over-generation (invalid forms/splits) and under-generation (missing valid forms/splits) are quite likely. Please see the Sanskrit Parser Stack section below for detailed status. Report any issues here.
Please feel free to ping us if you would like to collaborate on this project.
Try it out!
A web interface is available here - https://kmadathil.github.io/sanskrit_parser/ui/index.html
This project has been tested and developed using Python 2.7. A port to Python 3 has been completed, and everything should now work in both versions of Python.
pip install sanskrit_parser
- See the generated Sphinx docs.
- PS: Command line usage is also documented there.
Deploying REST API server
sudo mkdir /var/www/.sanskrit_parser
sudo chmod a+rwx /var/www/.sanskrit_parser
- Generate docs: cd docs; make html
Sanskrit Parser Stack
Stack of parsing tools
Sandhi splitting subroutine
Input: Phoneme sequence and the phoneme number to split at
Action: Perform a sandhi split at the given phoneme number
Output: Left and right sequences (multiple options will be output). No semantic validation is performed at this level; that is left to the higher levels.
The module that performs sandhi split/join, and supports convenient rule definition, is at lexical_analyzer/sandhi.py.
Rule definitions (human readable!) are at lexical_analyzer/sandhi_rules/*.txt
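The sandhi splitting subroutine described above can be sketched roughly as follows. This is only an illustration, not the code in lexical_analyzer/sandhi.py: the names `REVERSE_RULES` and `split_at` are hypothetical, the rule table is a tiny hand-picked subset of vowel sandhi, and the input is assumed to be in an SLP1-style encoding where each character is one phoneme.

```python
# Hypothetical reverse-sandhi table: surface phoneme -> possible
# (ending of left part, beginning of right part) pairs.
# Only a tiny illustrative subset of vowel sandhi, not the project's rule files.
REVERSE_RULES = {
    "A": [("a", "a"), ("a", "A"), ("A", "a"), ("A", "A")],
    "e": [("a", "i"), ("a", "I"), ("A", "i"), ("A", "I")],
    "o": [("a", "u"), ("a", "U"), ("A", "u"), ("A", "U")],
}


def split_at(phonemes, index):
    """Return every candidate (left, right) split at the given phoneme index.

    No validation is done here: most candidates will not be real words.
    """
    if not 0 <= index < len(phonemes):
        raise IndexError("split index out of range")
    before = phonemes[:index]
    here = phonemes[index]
    after = phonemes[index + 1:]
    candidates = []
    # Plain cut right after this phoneme, with no sandhi undone.
    if after:
        candidates.append((before + here, after))
    # Undo each sandhi rule that could have produced this phoneme.
    for left_end, right_start in REVERSE_RULES.get(here, []):
        candidates.append((before + left_end, right_start + after))
    return candidates
```

With this sketch, `split_at("tatrApi", 4)` includes the intended pair `("tatra", "api")` alongside unusable candidates such as `("tatrA", "pi")`; weeding those out is the job of the higher levels.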
- From dhAtu + lakAra + puruSha + vachana to pada and vice versa
- From prAtipadika + vibhakti + vachana to pada and vice versa
- Upasarga + dhAtu forms - forwards and backwards
- nAmadhAtu forms
- Krt forms - forwards and backwards
- Taddhita forms - forwards and backwards
To be done.
However, we have a usable solution: inriaxmlwrapper together with Prof. Gerard Huet's forms database acts as a queryable form database. That gives us the bare minimum we need from Level 1, so that Level 2 can work.
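To make the idea of a queryable form database concrete, here is a minimal sketch of a two-way lookup between padas and their analyses. It does not use inriaxmlwrapper or the actual forms database: the entries in `FORMS`, the tag spellings, and the functions `analyze` and `generate` are all hypothetical, and padas are written in an SLP1-style encoding as an assumption.

```python
from collections import defaultdict

# Hypothetical hand-written entries: (pada, base, tags).
# A real forms database would hold a very large number of such rows.
FORMS = [
    ("gacCati", "gam", ("laT", "prathama-puruSha", "ekavachana")),
    ("rAmaH", "rAma", ("prathamA", "ekavachana")),
    ("rAmO", "rAma", ("prathamA", "dvivachana")),
    ("rAmO", "rAma", ("dvitIyA", "dvivachana")),
]

# Index the rows in both directions.
_analyses = defaultdict(list)  # pada -> [(base, tags), ...]
_generated = {}                # (base, tags) -> pada
for pada, base, tags in FORMS:
    _analyses[pada].append((base, tags))
    _generated[(base, tags)] = pada


def analyze(pada):
    """Return all known (base, tags) analyses of a pada, or an empty list."""
    return list(_analyses.get(pada, []))


def generate(base, tags):
    """Return the pada for a base and tag sequence, or None if unknown."""
    return _generated.get((base, tuple(tags)))
```

An ambiguous form such as `rAmO` yields two analyses here, which is why the later levels need to carry multiple tagged candidates forward.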
Input: Sanskrit sentence
Action:
- Traverse the sentence, splitting it (or not) at each location to determine all possible valid splits
- Traverse from left to right
- Using dynamic programming, assemble the results of all choices
- To split or not to split at each phoneme
- If split, all possible left/right combinations of phonemes that can result
- Once split, check whether the left section is a valid pada (use Level 1 tools to pick the pada type and tag it morphologically)
- If the left section is valid, proceed to split the right section
- At the end of this step, we will have all possible syntactically valid splits with morphological tags
Output: All semantically valid sandhi split sequences
The module that performs the sentence split is at lexical_analyzer/SanskritLexicalAnalyzer.py
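The left-to-right splitting procedure described above can be sketched with memoized recursion, one common way of doing dynamic programming. This is not the project's analyzer: `LEXICON`, `REVERSE_RULES`, `candidate_cuts` and `all_splits` are hypothetical names, the lexicon stands in for the Level 1 validity check, and the input is assumed to be SLP1-style text with one character per phoneme (Python 3 only, because of `functools.lru_cache`).

```python
from functools import lru_cache

# Hypothetical stand-in for the Level 1 check "is this a valid pada?".
LEXICON = frozenset({"tatra", "api", "eva", "ca"})

# Toy reverse-sandhi rules: surface phoneme -> (left ending, right beginning).
REVERSE_RULES = {"A": [("a", "a")], "E": [("a", "e")]}


def candidate_cuts(text, i):
    """Yield (left, right) candidates for a cut at phoneme position i."""
    # Plain cut after position i, with no sandhi undone.
    yield text[:i + 1], text[i + 1:]
    # Cuts that undo a sandhi rule at position i.
    for left_end, right_start in REVERSE_RULES.get(text[i], []):
        yield text[:i] + left_end, right_start + text[i + 1:]


@lru_cache(maxsize=None)
def all_splits(text):
    """Return every way to split text into a sequence of valid padas."""
    results = []
    if text in LEXICON:
        results.append((text,))  # the "do not split" choice
    for i in range(len(text)):
        for left, right in candidate_cuts(text, i):
            # Recurse only when the left part is valid. Requiring the right
            # part to be strictly shorter guarantees the recursion terminates.
            if left in LEXICON and right and len(right) < len(text):
                for rest in all_splits(right):
                    results.append((left,) + rest)
    return tuple(results)
```

Because results are cached per remaining substring, a right-hand section reached through several different left-hand choices is only analyzed once. In this sketch `all_splits("tatrApi")` returns `(("tatra", "api"),)`.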
Input: Semantically valid sequence of tagged padas (output of Level 1)
Action:
- Assemble graphs of morphological constraints
  - viseShaNa - viseShya
  - karaka/vibhakti
  - vachana/puruSha constraints on tiGantas and subantas
- Check validity of graphs
Output:
- Is the input sequence a morphologically valid sentence?
- Enhanced sequence of tagged padas, with karakas tagged, and a dependency graph associated
Early experimental version (simple sentences only) is at morphological_analyzer/SanskritMorphologicalAnalyzer.py
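As a very rough illustration of the constraint checking described above, the sketch below links subantas to a single finite verb and rejects sequences where a nominative disagrees with the verb in vachana. It is not the code in SanskritMorphologicalAnalyzer.py: the `Pada` record, the tag spellings, the function `build_graph`, and the simplifying assumptions (active voice only, exactly one tiGanta, prathamA marks the kartA, dvitIyA marks the karma) are all made up for this example.

```python
from collections import namedtuple

# Hypothetical tagged pada; kind is "subanta" or "tiGanta".
Pada = namedtuple("Pada", "text kind vibhakti vachana")


def build_graph(padas):
    """Return (edges, ok) for a sequence of tagged padas.

    edges are (dependent index, verb index, karaka label) triples;
    ok is False when a constraint is violated.
    """
    verbs = [i for i, p in enumerate(padas) if p.kind == "tiGanta"]
    if len(verbs) != 1:
        return [], False  # this sketch handles exactly one finite verb
    v = verbs[0]
    edges, ok = [], True
    for i, p in enumerate(padas):
        if p.kind != "subanta":
            continue
        if p.vibhakti == "prathamA":
            # Assumed constraint: the kartA agrees with the verb in vachana.
            if p.vachana == padas[v].vachana:
                edges.append((i, v, "kartA"))
            else:
                ok = False
        elif p.vibhakti == "dvitIyA":
            edges.append((i, v, "karma"))
    return edges, ok


sentence = [
    Pada("rAmaH", "subanta", "prathamA", "ekavachana"),
    Pada("vanam", "subanta", "dvitIyA", "ekavachana"),
    Pada("gacCati", "tiGanta", None, "ekavachana"),
]
edges, ok = build_graph(sentence)
```

A fuller treatment would also cover viseShaNa - viseShya agreement and puruSha constraints, and would have to choose among competing tag sequences rather than take one as given.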
Seq2Seq based Sanskrit Parser
See: Grammar as a Foreign Language, Vinyals, Kaiser et al. (Google), http://arxiv.org/abs/1412.7449
- Method: Seq2Seq Neural Network (n? layers)
- Input Embedding with word2vec (optional)
Input: Sanskrit sentence
Output: Sentence split into padas with tags
Train/Test data: DCS corpus, converted by Vishvas Vasuki