# Tamil Tokenizer

Multi-level tokenizer for Tamil text: sentence, word, character, and morpheme tokenization.
A standalone, multi-level tokenizer for Tamil text. No external dependencies — uses only the Python standard library.
## Features

Four levels of tokenization:
| Level | Description | Example |
|---|---|---|
| sentence | Split text into sentences | "அவன் வந்தான். அவள் பார்த்தாள்." → 2 sentences |
| word | Split into words + punctuation | "அவன் வந்தான்." → அவன், வந்தான், . |
| character | Split into Tamil letters with classification (உயிர்/மெய்/உயிர்மெய், வல்லினம்/மெல்லினம்/இடையினம்) | "வந்தான்" → வ, ந், தா, ன் |
| morpheme | Split into root + grammatical suffixes (case, tense, person) | "பள்ளிக்கு" → root பள்ளி + case suffix க்கு (Dative) |
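The character level groups Unicode code points into Tamil letters rather than splitting on raw code points: a letter such as தா is a base consonant plus a dependent vowel sign. A minimal sketch of this clustering idea, using only the standard library (an illustration of the concept, not the package's implementation):

```python
# Illustrative sketch: attach Tamil combining marks (dependent vowel signs
# and the pulli/virama, U+0BBE..U+0BCD) to their preceding base character,
# approximating letter-level tokenization.
COMBINING = {chr(cp) for cp in range(0x0BBE, 0x0BCE)}  # includes U+0BCD (pulli)

def split_letters(text):
    letters = []
    for ch in text:
        if letters and ch in COMBINING:
            letters[-1] += ch   # combining mark joins the previous letter
        else:
            letters.append(ch)  # base character starts a new letter
    return letters

print(split_letters("வந்தான்"))  # ['வ', 'ந்', 'தா', 'ன்']
```

This reproduces the example in the table above: வந்தான் splits into four letters, not seven code points.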
## Installation

```bash
# From the project directory
pip install -e .

# Or use directly (no install needed)
python -m tamil_tokenizer "அவன் வந்தான்."
```
## Usage

### Command Line

```bash
# Word tokenization (default)
python -m tamil_tokenizer "அவன் வந்தான்."

# Character tokenization
python -m tamil_tokenizer "தமிழ்நாடு" --level character

# Sentence tokenization
python -m tamil_tokenizer "அவன் வந்தான். அவள் பார்த்தாள்." --level sentence

# Morpheme tokenization
python -m tamil_tokenizer "பள்ளிக்கு சென்றான்." --level morpheme

# JSON output
python -m tamil_tokenizer "அவன் வந்தான்." --format json

# Plain text output (just the token strings)
python -m tamil_tokenizer "அவன் வந்தான்." --format text

# Interactive mode
python -m tamil_tokenizer --interactive
```
### Python API

```python
from tamil_tokenizer import TamilTokenizer, Token, TokenType

tokenizer = TamilTokenizer()

# Sentence tokenization
sentences = tokenizer.sentence_tokenize("அவன் வந்தான். அவள் பார்த்தாள்.")

# Word tokenization
words = tokenizer.word_tokenize("அவன் வந்தான்.")

# Character tokenization
letters = tokenizer.character_tokenize("வந்தான்")
for letter in letters:
    print(f"{letter.text} -> {letter.token_type.value} ({letter.metadata})")

# Morpheme tokenization
morphemes = tokenizer.morpheme_tokenize("பள்ளிக்கு")
for m in morphemes:
    print(f"{m.text} -> {m.token_type.value} ({m.metadata})")

# Unified pipeline
tokens = tokenizer.tokenize("அவன் வந்தான்.", level="word")

# Convenience: get just strings
strings = tokenizer.tokenize_to_strings("அவன் வந்தான்.", level="word")
# ['அவன்', 'வந்தான்', '.']

# Convenience: get dicts (useful for JSON serialization)
dicts = tokenizer.tokenize_to_dicts("அவன் வந்தான்.", level="character")
```
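The dict form is convenient because each Token carries plain, serializable fields. The sketch below mirrors the three Token attributes used above (`text`, `token_type`, `metadata`) with a hypothetical stand-in class; the exact dict keys produced by `tokenize_to_dicts` are an assumption here, not taken from the package:

```python
import json
from dataclasses import dataclass, field

# Hypothetical stand-in mirroring the Token attributes used above;
# the real Token class lives in tamil_tokenizer.
@dataclass
class Token:
    text: str
    token_type: str
    metadata: dict = field(default_factory=dict)

def to_dict(token):
    return {"text": token.text, "type": token.token_type, "metadata": token.metadata}

tokens = [Token("அவன்", "word"), Token(".", "punctuation")]
print(json.dumps([to_dict(t) for t in tokens], ensure_ascii=False))
```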
## Token Types

### Word-level

- `word` — Tamil word
- `number` — Numeric value
- `punctuation` — Punctuation mark
- `symbol` — Other symbol
### Character-level

- `vowel` — உயிரெழுத்து (அ, ஆ, இ, ...)
- `consonant` — மெய்யெழுத்து (க், ங், ச், ...)
- `vowel_consonant` — உயிர்மெய்யெழுத்து (க, கா, கி, ...)
- `special` — ஆய்த எழுத்து (ஃ)
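The classification follows directly from a letter's shape: a standalone vowel code point is உயிர், a consonant carrying the புள்ளி (virama, U+0BCD) is மெய், and a consonant with an inherent or marked vowel is உயிர்மெய். A conceptual sketch of that rule, not the package's code:

```python
# Rough classifier for one Tamil letter (a base code point plus any
# combining marks). Conceptual only — the package's logic lives in
# its constants/ and grammar/ modules.
VOWELS = {chr(cp) for cp in range(0x0B85, 0x0B95)}  # அ .. ஔ (block has gaps)
PULLI = "\u0BCD"   # virama: marks a pure consonant (மெய்)
AYTHAM = "\u0B83"  # ஃ

def classify(letter):
    if letter == AYTHAM:
        return "special"
    if letter in VOWELS:
        return "vowel"
    if letter.endswith(PULLI):
        return "consonant"
    return "vowel_consonant"

for x in ["அ", "க்", "கா", "ஃ"]:
    print(x, classify(x))
```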
### Morpheme-level

- `root` — Root word
- `suffix` — Generic suffix
- `case_suffix` — வேற்றுமை உருபு (case marker)
- `tense_marker` — கால இடைநிலை (tense marker)
- `person_marker` — விகுதி (person/number marker)
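Morpheme splitting can be thought of as peeling known grammatical suffixes off the end of a word. A toy sketch with a two-entry suffix table (the package drives the real analysis from its `.list` rule files in `data/`; the table and tuple output here are illustrative only):

```python
# Toy morpheme splitter: strip one known case suffix from the end of a
# word. The suffix table is a tiny illustrative sample, not the
# package's rule data.
CASE_SUFFIXES = {
    "க்கு": "Dative",      # e.g. பள்ளிக்கு "to the school"
    "ஐ": "Accusative",
}

def split_morphemes(word):
    for suffix, case in CASE_SUFFIXES.items():
        # require a non-empty root so bare suffixes are not split
        if word.endswith(suffix) and len(word) > len(suffix):
            root = word[: -len(suffix)]
            return [(root, "root"), (suffix, f"case_suffix:{case}")]
    return [(word, "root")]

print(split_morphemes("பள்ளிக்கு"))
# [('பள்ளி', 'root'), ('க்கு', 'case_suffix:Dative')]
```

This matches the example in the features table: பள்ளிக்கு splits into the root பள்ளி and the dative marker க்கு.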
## Project Structure

```
tamil_tokenizer/
├── __init__.py     # Package init + public API
├── __main__.py     # CLI entry point
├── tokenizer.py    # Main TamilTokenizer class
├── constants/      # Tamil Unicode constants & letter groups
├── grammar/        # Grammar analysis (util, case, tense)
├── config/         # Configuration & data file loading
├── parsers/        # Root word parser & core parsing
├── utils/          # Iterator, splitting, word class utilities
└── data/           # Grammar rule files (.list)
```
## Requirements
- Python 3.8+
- No external dependencies
## Download files
### Source Distribution

`vettu-1.0.3.tar.gz` (668.1 kB)
### Built Distribution
`vettu-1.0.3-py3-none-any.whl` (696.1 kB)
### File details: vettu-1.0.3.tar.gz

- Download URL: vettu-1.0.3.tar.gz
- Size: 668.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `5b3bb81a6e1cba3748ac33f865d953000acbcc99ebe4474ed2ac47c565723dcc` |
| MD5 | `19ff35d4e2778b664e43d751176dec0d` |
| BLAKE2b-256 | `8ca4ec1150dba9cfb1b1b4049c3d1edde883a1a5b9c858e95146fd749554137d` |
### File details: vettu-1.0.3-py3-none-any.whl

- Download URL: vettu-1.0.3-py3-none-any.whl
- Size: 696.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `25315b0a4534620c370416f04ef3b44f6d5b2aeb05fed0a6179c5c3c468d471b` |
| MD5 | `30442641fd76f1f8c5907765a559369f` |
| BLAKE2b-256 | `49f5b101849a5b04ef71b2f8716067ece0e25ba80f6fc7d953521ef0b4041bde` |