# Tamil Tokenizer

A standalone, multi-level tokenizer for Tamil text: sentence, word, character, and morpheme tokenization. No external dependencies; it uses only the Python standard library.
## Features

Four levels of tokenization:
| Level | Description | Example |
|---|---|---|
| sentence | Split text into sentences | "அவன் வந்தான். அவள் பார்த்தாள்." → 2 sentences |
| word | Split into words + punctuation | "அவன் வந்தான்." → அவன், வந்தான், . |
| character | Split into Tamil letters with classification (உயிர்/மெய்/உயிர்மெய், வல்லினம்/மெல்லினம்/இடையினம்) | "வந்தான்" → வ, ந், தா, ன் |
| morpheme | Split into root + grammatical suffixes (case, tense, person) | "பள்ளிக்கு" → root பள்ளி + case suffix க்கு (Dative) |
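Sentence splitting, for instance, can be approximated with the standard library alone. The sketch below is an illustration of the idea, not the package's actual implementation; it simply splits on sentence-final punctuation:

```python
import re

def split_sentences(text: str) -> list[str]:
    # Split after sentence-final punctuation (., !, ?) followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("அவன் வந்தான். அவள் பார்த்தாள்."))
```

A real tokenizer also has to handle abbreviations and numbers with decimal points, which is where a rule-based package earns its keep.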
## Installation

```bash
# From the project directory
pip install -e .

# Or just use directly (no install needed)
python -m tamil_tokenizer "அவன் வந்தான்."
```
## Usage

### Command Line

```bash
# Word tokenization (default)
python -m tamil_tokenizer "அவன் வந்தான்."

# Character tokenization
python -m tamil_tokenizer "தமிழ்நாடு" --level character

# Sentence tokenization
python -m tamil_tokenizer "அவன் வந்தான். அவள் பார்த்தாள்." --level sentence

# Morpheme tokenization
python -m tamil_tokenizer "பள்ளிக்கு சென்றான்." --level morpheme

# JSON output
python -m tamil_tokenizer "அவன் வந்தான்." --format json

# Plain text output (just token strings)
python -m tamil_tokenizer "அவன் வந்தான்." --format text

# Interactive mode
python -m tamil_tokenizer --interactive
```
### Python API

```python
from tamil_tokenizer import TamilTokenizer, Token, TokenType

tokenizer = TamilTokenizer()

# Sentence tokenization
sentences = tokenizer.sentence_tokenize("அவன் வந்தான். அவள் பார்த்தாள்.")

# Word tokenization
words = tokenizer.word_tokenize("அவன் வந்தான்.")

# Character tokenization
letters = tokenizer.character_tokenize("வந்தான்")
for letter in letters:
    print(f"{letter.text} -> {letter.token_type.value} ({letter.metadata})")

# Morpheme tokenization
morphemes = tokenizer.morpheme_tokenize("பள்ளிக்கு")
for m in morphemes:
    print(f"{m.text} -> {m.token_type.value} ({m.metadata})")

# Unified pipeline
tokens = tokenizer.tokenize("அவன் வந்தான்.", level="word")

# Convenience: get just strings
strings = tokenizer.tokenize_to_strings("அவன் வந்தான்.", level="word")
# ['அவன்', 'வந்தான்', '.']

# Convenience: get dicts (useful for JSON serialization)
dicts = tokenizer.tokenize_to_dicts("அவன் வந்தான்.", level="character")
```
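Tokens of this shape serialize naturally to JSON. Here is a self-contained sketch of that shape; the field names `text`, `token_type`, and `metadata` come from the loops above, but the package's actual `Token` class and its dict layout may differ:

```python
import json
from dataclasses import dataclass, field
from enum import Enum

class TokenType(Enum):
    WORD = "word"
    PUNCTUATION = "punctuation"

@dataclass
class Token:
    text: str
    token_type: TokenType
    metadata: dict = field(default_factory=dict)

tokens = [Token("அவன்", TokenType.WORD), Token(".", TokenType.PUNCTUATION)]
# Flatten each token into a plain dict for json.dumps.
payload = [{"text": t.text, "type": t.token_type.value, **t.metadata} for t in tokens]
print(json.dumps(payload, ensure_ascii=False))
```

`ensure_ascii=False` keeps the Tamil text readable in the JSON output instead of emitting `\uXXXX` escapes.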
## Token Types

### Word-level

- `word`: Tamil word
- `number`: numeric value
- `punctuation`: punctuation mark
- `symbol`: other symbol
### Character-level

- `vowel`: உயிரெழுத்து (அ, ஆ, இ, ...)
- `consonant`: மெய்யெழுத்து (க், ங், ச், ...)
- `vowel_consonant`: உயிர்மெய்யெழுத்து (க, கா, கி, ...)
- `special`: ஆய்த எழுத்து (ஃ)
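These classes map onto code-point ranges in the Unicode Tamil block (U+0B80–U+0BFF): independent vowels occupy U+0B85–U+0B94, consonant bases U+0B95–U+0BB9, and a bare consonant is a base plus the puḷḷi sign U+0BCD. A rough stdlib-only classifier for an already-segmented letter, shown as an illustration rather than the package's logic:

```python
PULLI = "\u0BCD"   # puḷḷi / virama sign
AYTHAM = "\u0B83"  # ஃ

def classify(letter: str) -> str:
    # `letter` is one pre-segmented Tamil letter: a base character
    # plus any combining signs (e.g. "க்" is க + puḷḷi).
    if letter == AYTHAM:
        return "special"
    if letter.endswith(PULLI):
        return "consonant"        # மெய்: bare consonant, e.g. க்
    base = letter[0]
    if "\u0B95" <= base <= "\u0BB9":
        return "vowel_consonant"  # உயிர்மெய்: consonant + vowel, e.g. க, கா
    if "\u0B85" <= base <= "\u0B94":
        return "vowel"            # உயிர்: independent vowel, e.g. அ, ஆ
    return "other"

for letter in ["அ", "க்", "கா", "ஃ"]:
    print(letter, classify(letter))
```

The hard part a real tokenizer solves first is the segmentation itself: grouping each base character with its trailing combining signs before classification.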
### Morpheme-level

- `root`: root word
- `suffix`: generic suffix
- `case_suffix`: வேற்றுமை உருபு (case marker)
- `tense_marker`: கால இடைநிலை (tense marker)
- `person_marker`: விகுதி (person/number marker)
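Case-suffix splitting can be pictured as longest-match suffix stripping against a rule table. The package loads its rules from `.list` data files and must also handle sandhi (sound changes at the morpheme boundary); the table below is a tiny hypothetical excerpt for illustration only:

```python
# Hypothetical mini rule table; the real rules live in the package's data/ files.
CASE_SUFFIXES = {
    "க்கு": "Dative",        # e.g. பள்ளிக்கு "to school"
    "ால்": "Instrumental",
}

def split_case_suffix(word: str):
    # Try longer suffixes first so "க்கு" wins over any shorter match.
    for suffix, case in sorted(CASE_SUFFIXES.items(), key=lambda kv: -len(kv[0])):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], suffix, case
    return word, None, None

print(split_case_suffix("பள்ளிக்கு"))
```

Naive stripping like this breaks as soon as the boundary involves a sound change, which is why the real parser pairs its suffix table with sandhi rules.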
## Project Structure

```
tamil_tokenizer/
├── __init__.py    # Package init + public API
├── __main__.py    # CLI entry point
├── tokenizer.py   # Main TamilTokenizer class
├── constants/     # Tamil Unicode constants & letter groups
├── grammar/       # Grammar analysis (util, case, tense)
├── config/        # Configuration & data file loading
├── parsers/       # Root word parser & core parsing
├── utils/         # Iterator, splitting, word class utilities
└── data/          # Grammar rule files (.list)
```
## Requirements
- Python 3.8+
- No external dependencies
## Download files

Source distribution: `vettu-1.0.4.tar.gz` (4.5 MB). Built distribution: `vettu-1.0.4-py3-none-any.whl` (4.6 MB, Python 3). Both were uploaded via twine/6.2.0 on CPython/3.13.9, without Trusted Publishing.

### File hashes: vettu-1.0.4.tar.gz

| Algorithm | Hash digest |
|---|---|
| SHA256 | `483cf741abf14e49e056779af8b42116e31a123adfa5b594d5561df5d3322c35` |
| MD5 | `d6c5911e4f461c0d45122fe562489429` |
| BLAKE2b-256 | `683be26e609d4ede563153c707a1fd18cd2cb9e384febc881b6633cbb2549e3a` |

### File hashes: vettu-1.0.4-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | `150f3c33d3d2868944a77ee1070aee6d3307c48bb056f89813870e3c2fbca6bb` |
| MD5 | `04d10f6a019829c89342e07fe70efe01` |
| BLAKE2b-256 | `942027831ef9568f32e1fe70f107d944db975634cf46a7a9407c7daaac925c1e` |