# Tamil Tokenizer

Multi-level tokenizer for Tamil text: sentence, word, character, and morpheme tokenization.
A standalone, multi-level tokenizer for Tamil text. No external dependencies — uses only the Python standard library.
## Features

Four levels of tokenization:
| Level | Description | Example |
|---|---|---|
| sentence | Split text into sentences | "அவன் வந்தான். அவள் பார்த்தாள்." → 2 sentences |
| word | Split into words + punctuation | "அவன் வந்தான்." → அவன், வந்தான், . |
| character | Split into Tamil letters with classification (உயிர்/மெய்/உயிர்மெய், வல்லினம்/மெல்லினம்/இடையினம்) | "வந்தான்" → வ, ந், தா, ன் |
| morpheme | Split into root + grammatical suffixes (case, tense, person) | "பள்ளிக்கு" → root பள்ளி + case suffix க்கு (Dative) |
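The character-level grouping in the table can be sketched with plain Unicode handling. This is only an illustration of the idea, not the package's implementation: a Tamil dependent vowel sign or pulli (virama) attaches to the preceding base consonant to form one letter.

```python
# Tamil dependent vowel signs (U+0BBE–U+0BCC) and the pulli (U+0BCD).
# Illustrative subset; the package keeps its own constants.
TAMIL_SIGNS = set(
    "\u0bbe\u0bbf\u0bc0\u0bc1\u0bc2\u0bc6\u0bc7\u0bc8\u0bca\u0bcb\u0bcc\u0bcd"
)

def split_letters(word):
    """Group code points into Tamil letters: base + attached sign(s)."""
    letters = []
    for ch in word:
        if ch in TAMIL_SIGNS and letters:
            letters[-1] += ch  # attach vowel sign / pulli to its base consonant
        else:
            letters.append(ch)
    return letters

print(split_letters("வந்தான்"))  # ['வ', 'ந்', 'தா', 'ன்']
```

This reproduces the table's example: seven code points in "வந்தான்" collapse into four letters.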
## Installation

```bash
# From the project directory
pip install -e .

# Or just use directly (no install needed)
python -m tamil_tokenizer "அவன் வந்தான்."
```
## Usage

### Command Line

```bash
# Word tokenization (default)
python -m tamil_tokenizer "அவன் வந்தான்."

# Character tokenization
python -m tamil_tokenizer "தமிழ்நாடு" --level character

# Sentence tokenization
python -m tamil_tokenizer "அவன் வந்தான். அவள் பார்த்தாள்." --level sentence

# Morpheme tokenization
python -m tamil_tokenizer "பள்ளிக்கு சென்றான்." --level morpheme

# JSON output
python -m tamil_tokenizer "அவன் வந்தான்." --format json

# Plain text output (just token strings)
python -m tamil_tokenizer "அவன் வந்தான்." --format text

# Interactive mode
python -m tamil_tokenizer --interactive
```
### Python API

```python
from tamil_tokenizer import TamilTokenizer, Token, TokenType

tokenizer = TamilTokenizer()

# Sentence tokenization
sentences = tokenizer.sentence_tokenize("அவன் வந்தான். அவள் பார்த்தாள்.")

# Word tokenization
words = tokenizer.word_tokenize("அவன் வந்தான்.")

# Character tokenization
letters = tokenizer.character_tokenize("வந்தான்")
for letter in letters:
    print(f"{letter.text} -> {letter.token_type.value} ({letter.metadata})")

# Morpheme tokenization
morphemes = tokenizer.morpheme_tokenize("பள்ளிக்கு")
for m in morphemes:
    print(f"{m.text} -> {m.token_type.value} ({m.metadata})")

# Unified pipeline
tokens = tokenizer.tokenize("அவன் வந்தான்.", level="word")

# Convenience: get just strings
strings = tokenizer.tokenize_to_strings("அவன் வந்தான்.", level="word")
# ['அவன்', 'வந்தான்', '.']

# Convenience: get dicts (useful for JSON serialization)
dicts = tokenizer.tokenize_to_dicts("அவன் வந்தான்.", level="character")
```
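For comparison, the word-level split of the example sentence can be approximated with a small regex over the Tamil Unicode block (U+0B80–U+0BFF). This is a standalone sketch, not the package's tokenizer:

```python
import re

# Match runs of Tamil code points, digit runs, or single punctuation/symbols.
# Illustrative only — the package's word tokenizer has its own rules.
TOKEN_RE = re.compile(r"[\u0b80-\u0bff]+|\d+|[^\s\w]")

def simple_word_tokenize(text):
    return TOKEN_RE.findall(text)

print(simple_word_tokenize("அவன் வந்தான்."))  # ['அவன்', 'வந்தான்', '.']
```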
## Token Types

### Word-level

- `word` — Tamil word
- `number` — Numeric value
- `punctuation` — Punctuation mark
- `symbol` — Other symbol
### Character-level

- `vowel` — உயிரெழுத்து (அ, ஆ, இ, ...)
- `consonant` — மெய்யெழுத்து (க், ங், ச், ...)
- `vowel_consonant` — உயிர்மெய்யெழுத்து (க, கா, கி, ...)
- `special` — ஆய்த எழுத்து (ஃ)
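Given a letter already grouped as above, the four categories can be sketched from its shape — a rough illustration under assumed Unicode rules (a letter ending in the pulli U+0BCD is a pure consonant), not the package's classifier:

```python
# Independent vowels (உயிரெழுத்து) and the aytham; illustrative constants.
VOWELS = set("அஆஇஈஉஊஎஏஐஒஓஔ")
AYTHAM = "ஃ"
PULLI = "\u0bcd"

def classify(letter):
    if letter in VOWELS:
        return "vowel"            # உயிரெழுத்து
    if letter == AYTHAM:
        return "special"          # ஆய்த எழுத்து
    if letter.endswith(PULLI):
        return "consonant"        # மெய்: base consonant + pulli
    return "vowel_consonant"      # உயிர்மெய்: consonant with inherent/marked vowel

print([classify(l) for l in ["அ", "க்", "கா", "ஃ"]])
# ['vowel', 'consonant', 'vowel_consonant', 'special']
```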
### Morpheme-level

- `root` — Root word
- `suffix` — Generic suffix
- `case_suffix` — வேற்றுமை உருபு (case marker)
- `tense_marker` — கால இடைநிலை (tense marker)
- `person_marker` — விகுதி (person/number marker)
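The `root` + `case_suffix` split from the features table ("பள்ளிக்கு" → பள்ளி + க்கு) can be sketched as longest-suffix stripping. The suffix table below is a hypothetical illustration — the package's actual rules live in its `data/` `.list` files:

```python
# Hypothetical case-suffix table; illustrative, not the package's rule files.
CASE_SUFFIXES = {"க்கு": "Dative", "ஆல்": "Instrumental", "ஐ": "Accusative"}

def strip_case(word):
    """Return (root, (suffix, case)) if a known case suffix matches, else (word, None)."""
    for suf, case in CASE_SUFFIXES.items():
        if word.endswith(suf) and len(word) > len(suf):
            return word[: -len(suf)], (suf, case)
    return word, None

print(strip_case("பள்ளிக்கு"))  # ('பள்ளி', ('க்கு', 'Dative'))
```

A real analyzer must also handle sandhi changes at the morpheme boundary, which a plain `endswith` check ignores.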
Project Structure
tamil_tokenizer/
├── __init__.py # Package init + public API
├── __main__.py # CLI entry point
├── tokenizer.py # Main TamilTokenizer class
├── constants/ # Tamil Unicode constants & letter groups
├── grammar/ # Grammar analysis (util, case, tense)
├── config/ # Configuration & data file loading
├── parsers/ # Root word parser & core parsing
├── utils/ # Iterator, splitting, word class utilities
└── data/ # Grammar rule files (.list)
## Requirements

- Python 3.8+
- No external dependencies
## Project details

### Download files

- Source distribution: `vettu-1.0.2.tar.gz` (456.7 kB)
- Built distribution: `vettu-1.0.2-py3-none-any.whl` (480.5 kB)
### File details: `vettu-1.0.2.tar.gz`

- Size: 456.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9

| Algorithm | Hash digest |
|---|---|
| SHA256 | `bdfd5ab4bbbf8b22cd9140b09629b313588c6a0ecdb0f08faf147080b1975ca9` |
| MD5 | `4b64af95b4f4cc755ec6835b04c24fd6` |
| BLAKE2b-256 | `97f67f86398ef83b6ed3b415fc77e80b31479417a71e2429d8cd43a663821a39` |
### File details: `vettu-1.0.2-py3-none-any.whl`

- Size: 480.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9

| Algorithm | Hash digest |
|---|---|
| SHA256 | `839df209b4b75abf9c581217dba321114dd322c24c0123f1318f6627c7b61c74` |
| MD5 | `ca8ff0bf69da03c76b387be53c75abe6` |
| BLAKE2b-256 | `a812d32f8ae50aa0e3d4d3e3abf2c75343ca760cdc9075c704b868d08d87f7d0` |