
Multi-level tokenizer for Tamil text — sentence, word, character, and morpheme tokenization


Tamil Tokenizer

A standalone, multi-level tokenizer for Tamil text. No external dependencies — uses only the Python standard library.

Features

Four levels of tokenization:

| Level | Description | Example |
|---|---|---|
| sentence | Split text into sentences | "அவன் வந்தான். அவள் பார்த்தாள்." → 2 sentences |
| word | Split into words + punctuation | "அவன் வந்தான்." → அவன், வந்தான், . |
| character | Split into Tamil letters with classification (உயிர்/மெய்/உயிர்மெய், வல்லினம்/மெல்லினம்/இடையினம்) | "வந்தான்" → வ, ந், தா, ன் |
| morpheme | Split into root + grammatical suffixes (case, tense, person) | "பள்ளிக்கு" → root பள்ளி + case suffix க்கு (Dative) |
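Tamil is written in an abugida: a letter may be an independent vowel, a consonant marked with the pulli (்), or a consonant–vowel combination, so character-level tokenization must group several Unicode code points into one letter. The sketch below illustrates that grouping using only the Tamil Unicode block; it is a hypothetical approximation, not the library's actual implementation (which lives in `tamil_tokenizer.constants` and `tokenizer.py`).

```python
# Illustrative sketch of Tamil letter grouping and classification.
# The tables here are deliberately small; the real library's rules may differ.
VOWELS = set("அஆஇஈஉஊஎஏஐஒஓஔ")        # உயிரெழுத்து (independent vowels)
PULLI = "\u0BCD"                      # ் marks a bare consonant
VOWEL_SIGNS = set("ாிீுூெேைொோௌ")      # dependent vowel signs (sample)

def split_letters(word):
    """Group code points into Tamil letters (grapheme-like units)."""
    letters = []
    for ch in word:
        # pulli and vowel signs attach to the preceding consonant
        if letters and (ch == PULLI or ch in VOWEL_SIGNS):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters

def classify(letter):
    """Map a grouped letter to a character-level token type."""
    if letter in VOWELS:
        return "vowel"            # உயிரெழுத்து
    if letter == "ஃ":
        return "special"          # ஆய்த எழுத்து
    if letter.endswith(PULLI):
        return "consonant"        # மெய்யெழுத்து
    return "vowel_consonant"      # உயிர்மெய்யெழுத்து
```

With this grouping, "வந்தான்" (7 code points) comes apart as the four letters வ, ந், தா, ன், matching the table above.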

Installation

# From the project directory
pip install -e .

# Or just use directly (no install needed)
python -m tamil_tokenizer "அவன் வந்தான்."

Usage

Command Line

# Word tokenization (default)
python -m tamil_tokenizer "அவன் வந்தான்."

# Character tokenization
python -m tamil_tokenizer "தமிழ்நாடு" --level character

# Sentence tokenization
python -m tamil_tokenizer "அவன் வந்தான். அவள் பார்த்தாள்." --level sentence

# Morpheme tokenization
python -m tamil_tokenizer "பள்ளிக்கு சென்றான்." --level morpheme

# JSON output
python -m tamil_tokenizer "அவன் வந்தான்." --format json

# Plain text output (just token strings)
python -m tamil_tokenizer "அவன் வந்தான்." --format text

# Interactive mode
python -m tamil_tokenizer --interactive

Python API

from tamil_tokenizer import TamilTokenizer, Token, TokenType

tokenizer = TamilTokenizer()

# Sentence tokenization
sentences = tokenizer.sentence_tokenize("அவன் வந்தான். அவள் பார்த்தாள்.")

# Word tokenization
words = tokenizer.word_tokenize("அவன் வந்தான்.")

# Character tokenization
letters = tokenizer.character_tokenize("வந்தான்")
for letter in letters:
    print(f"{letter.text} -> {letter.token_type.value} ({letter.metadata})")

# Morpheme tokenization
morphemes = tokenizer.morpheme_tokenize("பள்ளிக்கு")
for m in morphemes:
    print(f"{m.text} -> {m.token_type.value} ({m.metadata})")

# Unified pipeline
tokens = tokenizer.tokenize("அவன் வந்தான்.", level="word")

# Convenience: get just strings
strings = tokenizer.tokenize_to_strings("அவன் வந்தான்.", level="word")
# ['அவன்', 'வந்தான்', '.']

# Convenience: get dicts (useful for JSON serialization)
dicts = tokenizer.tokenize_to_dicts("அவன் வந்தான்.", level="character")
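Since each Token exposes text, token_type, and metadata, the dicts from tokenize_to_dicts can be passed straight to json.dumps. The snippet below uses a hand-built stand-in list (the library's exact dict keys are an assumption here); note ensure_ascii=False, which keeps Tamil text readable instead of emitting \uXXXX escapes.

```python
import json

# Hand-built stand-in for tokenize_to_dicts output; the real key
# names come from tamil_tokenizer and may differ.
tokens = [
    {"text": "அவன்", "token_type": "word", "metadata": {}},
    {"text": "வந்தான்", "token_type": "word", "metadata": {}},
    {"text": ".", "token_type": "punctuation", "metadata": {}},
]

# ensure_ascii=False preserves the Tamil characters in the JSON output
print(json.dumps(tokens, ensure_ascii=False, indent=2))
```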

Token Types

Word-level

  • word — Tamil word
  • number — Numeric value
  • punctuation — Punctuation mark
  • symbol — Other symbol

Character-level

  • vowel — உயிரெழுத்து (அ, ஆ, இ, ...)
  • consonant — மெய்யெழுத்து (க், ங், ச், ...)
  • vowel_consonant — உயிர்மெய்யெழுத்து (க, கா, கி, ...)
  • special — ஆய்த எழுத்து (ஃ)

Morpheme-level

  • root — Root word
  • suffix — Generic suffix
  • case_suffix — வேற்றுமை உருபு (case marker)
  • tense_marker — கால இடைநிலை (tense marker)
  • person_marker — விகுதி (person/number marker)

Project Structure

tamil_tokenizer/
├── __init__.py          # Package init + public API
├── __main__.py          # CLI entry point
├── tokenizer.py         # Main TamilTokenizer class
├── constants/           # Tamil Unicode constants & letter groups
├── grammar/             # Grammar analysis (util, case, tense)
├── config/              # Configuration & data file loading
├── parsers/             # Root word parser & core parsing
├── utils/               # Iterator, splitting, word class utilities
└── data/                # Grammar rule files (.list)

Requirements

  • Python 3.8+
  • No external dependencies

Download files

Download the file for your platform.

Source Distribution

vettu-1.0.2.tar.gz (456.7 kB)


Built Distribution


vettu-1.0.2-py3-none-any.whl (480.5 kB)


File details

Details for the file vettu-1.0.2.tar.gz.

File metadata

  • Download URL: vettu-1.0.2.tar.gz
  • Size: 456.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for vettu-1.0.2.tar.gz

| Algorithm | Hash digest |
|---|---|
| SHA256 | bdfd5ab4bbbf8b22cd9140b09629b313588c6a0ecdb0f08faf147080b1975ca9 |
| MD5 | 4b64af95b4f4cc755ec6835b04c24fd6 |
| BLAKE2b-256 | 97f67f86398ef83b6ed3b415fc77e80b31479417a71e2429d8cd43a663821a39 |


File details

Details for the file vettu-1.0.2-py3-none-any.whl.

File metadata

  • Download URL: vettu-1.0.2-py3-none-any.whl
  • Size: 480.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for vettu-1.0.2-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 839df209b4b75abf9c581217dba321114dd322c24c0123f1318f6627c7b61c74 |
| MD5 | ca8ff0bf69da03c76b387be53c75abe6 |
| BLAKE2b-256 | a812d32f8ae50aa0e3d4d3e3abf2c75343ca760cdc9075c704b868d08d87f7d0 |

