Multi-level tokenizer for Tamil text — sentence, word, character, and morpheme tokenization

Tamil Tokenizer

A standalone, multi-level tokenizer for Tamil text. No external dependencies — uses only the Python standard library.

Features

Four levels of tokenization:

| Level | Description | Example |
|---|---|---|
| `sentence` | Split text into sentences | "அவன் வந்தான். அவள் பார்த்தாள்." → 2 sentences |
| `word` | Split into words + punctuation | "அவன் வந்தான்." → அவன், வந்தான், . |
| `character` | Split into Tamil letters with classification (உயிர்/மெய்/உயிர்மெய், வல்லினம்/மெல்லினம்/இடையினம்) | "வந்தான்" → வ, ந், தா, ன் |
| `morpheme` | Split into root + grammatical suffixes (case, tense, person) | "பள்ளிக்கு" → root பள்ளி + case suffix க்கு (Dative) |
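
The sentence level in the table above can be sketched with only the standard library. This is a naive illustration, not the package's actual rules (a real tokenizer would also handle abbreviations and quoted text):

```python
import re

# Split on sentence-final punctuation followed by whitespace.
SENT_RE = re.compile(r"(?<=[.!?])\s+")

def sentence_tokenize(text):
    return [s for s in SENT_RE.split(text.strip()) if s]

print(sentence_tokenize("அவன் வந்தான். அவள் பார்த்தாள்."))
# ['அவன் வந்தான்.', 'அவள் பார்த்தாள்.']
```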

Installation

```shell
# From the project directory
pip install -e .

# Or just use directly (no install needed)
python -m tamil_tokenizer "அவன் வந்தான்."
```

Usage

Command Line

```shell
# Word tokenization (default)
python -m tamil_tokenizer "அவன் வந்தான்."

# Character tokenization
python -m tamil_tokenizer "தமிழ்நாடு" --level character

# Sentence tokenization
python -m tamil_tokenizer "அவன் வந்தான். அவள் பார்த்தாள்." --level sentence

# Morpheme tokenization
python -m tamil_tokenizer "பள்ளிக்கு சென்றான்." --level morpheme

# JSON output
python -m tamil_tokenizer "அவன் வந்தான்." --format json

# Plain text output (just token strings)
python -m tamil_tokenizer "அவன் வந்தான்." --format text

# Interactive mode
python -m tamil_tokenizer --interactive
```

Python API

```python
from tamil_tokenizer import TamilTokenizer, Token, TokenType

tokenizer = TamilTokenizer()

# Sentence tokenization
sentences = tokenizer.sentence_tokenize("அவன் வந்தான். அவள் பார்த்தாள்.")

# Word tokenization
words = tokenizer.word_tokenize("அவன் வந்தான்.")

# Character tokenization
letters = tokenizer.character_tokenize("வந்தான்")
for letter in letters:
    print(f"{letter.text} -> {letter.token_type.value} ({letter.metadata})")

# Morpheme tokenization
morphemes = tokenizer.morpheme_tokenize("பள்ளிக்கு")
for m in morphemes:
    print(f"{m.text} -> {m.token_type.value} ({m.metadata})")

# Unified pipeline
tokens = tokenizer.tokenize("அவன் வந்தான்.", level="word")

# Convenience: get just the token strings
strings = tokenizer.tokenize_to_strings("அவன் வந்தான்.", level="word")
# ['அவன்', 'வந்தான்', '.']

# Convenience: get dicts (useful for JSON serialization)
dicts = tokenizer.tokenize_to_dicts("அவன் வந்தான்.", level="character")
```
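
The token shape implied by the examples above can be pictured with a small sketch. The field names (`text`, `token_type`, `metadata`) mirror the attributes used in the loops above, but the exact dict layout produced by `tokenize_to_dicts` is an assumption here:

```python
import json
from dataclasses import dataclass, field
from enum import Enum

# Illustrative token model (an assumption, not the package's source).
class TokenType(Enum):
    WORD = "word"
    PUNCTUATION = "punctuation"

@dataclass
class Token:
    text: str
    token_type: TokenType
    metadata: dict = field(default_factory=dict)

def to_dicts(tokens):
    # Enum members serialize via their .value, matching the type names
    # listed under "Token Types".
    return [{"text": t.text, "type": t.token_type.value, "metadata": t.metadata}
            for t in tokens]

tokens = [Token("அவன்", TokenType.WORD), Token(".", TokenType.PUNCTUATION)]
print(json.dumps(to_dicts(tokens), ensure_ascii=False))
```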

Token Types

Word-level

  • word — Tamil word
  • number — Numeric value
  • punctuation — Punctuation mark
  • symbol — Other symbol
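
A word-level pass like the one above can be approximated with a single regex over the Tamil Unicode block; this is an illustrative sketch, not the package's implementation:

```python
import re

# One named alternative per token type; match.lastgroup names the type.
TOKEN_RE = re.compile(
    r"(?P<word>[\u0B80-\u0BFF]+)"        # letters in the Tamil block
    r"|(?P<number>\d+)"
    r"|(?P<punctuation>[.,!?;:'\"()\-])"
    r"|(?P<symbol>\S)"                   # anything else that is not space
)

def word_tokenize(text):
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

print(word_tokenize("அவன் வந்தான்."))
# [('word', 'அவன்'), ('word', 'வந்தான்'), ('punctuation', '.')]
```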

Character-level

  • vowel — உயிரெழுத்து (அ, ஆ, இ, ...)
  • consonant — மெய்யெழுத்து (க், ங், ச், ...)
  • vowel_consonant — உயிர்மெய்யெழுத்து (க, கா, கி, ...)
  • special — ஆய்த எழுத்து (ஃ)
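
The character-level classification above can be sketched from the Tamil Unicode block alone. The code point ranges are standard Unicode facts; the grouping logic is a simplified illustration (the real tokenizer also classifies வல்லினம்/மெல்லினம்/இடையினம்):

```python
VOWELS = range(0x0B85, 0x0B95)            # அ..ஔ, உயிரெழுத்து
CONSONANT_BASES = range(0x0B95, 0x0BBA)   # க..ஹ (base consonants)
VOWEL_SIGNS = range(0x0BBE, 0x0BCD)       # dependent signs ா..ௌ
PULLI = 0x0BCD                            # ் (virama)
AYTHAM = 0x0B83                           # ஃ

def classify(text):
    letters, i = [], 0
    while i < len(text):
        cp = ord(text[i])
        nxt = ord(text[i + 1]) if i + 1 < len(text) else 0
        if cp in CONSONANT_BASES and nxt == PULLI:
            letters.append((text[i:i + 2], "consonant"))        # மெய்
            i += 2
        elif cp in CONSONANT_BASES and nxt in VOWEL_SIGNS:
            letters.append((text[i:i + 2], "vowel_consonant"))  # உயிர்மெய்
            i += 2
        elif cp in CONSONANT_BASES:
            letters.append((text[i], "vowel_consonant"))        # inherent அ
            i += 1
        elif cp in VOWELS:
            letters.append((text[i], "vowel"))                  # உயிர்
            i += 1
        elif cp == AYTHAM:
            letters.append((text[i], "special"))                # ஆய்தம்
            i += 1
        else:
            letters.append((text[i], "other"))
            i += 1
    return letters

print(classify("வந்தான்"))
# [('வ', 'vowel_consonant'), ('ந்', 'consonant'), ('தா', 'vowel_consonant'), ('ன்', 'consonant')]
```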

Morpheme-level

  • root — Root word
  • suffix — Generic suffix
  • case_suffix — வேற்றுமை உருபு (case marker)
  • tense_marker — கால இடைநிலை (tense marker)
  • person_marker — விகுதி (person/number marker)
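
A toy suffix-stripping pass shows the idea behind the morpheme level; the rule table here is a tiny illustrative stand-in for the package's actual grammar files in `data/`:

```python
# Hypothetical mini rule table (the real rules live in data/*.list).
CASE_SUFFIXES = {
    "க்கு": "Dative",
    "ஆல்": "Instrumental",
}

def morpheme_split(word):
    for suffix, case in CASE_SUFFIXES.items():
        if word.endswith(suffix) and len(word) > len(suffix):
            return [
                {"text": word[: -len(suffix)], "type": "root"},
                {"text": suffix, "type": "case_suffix", "case": case},
            ]
    return [{"text": word, "type": "root"}]  # no known suffix: whole word is root

print(morpheme_split("பள்ளிக்கு"))
# [{'text': 'பள்ளி', 'type': 'root'}, {'text': 'க்கு', 'type': 'case_suffix', 'case': 'Dative'}]
```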

Project Structure

```
tamil_tokenizer/
├── __init__.py          # Package init + public API
├── __main__.py          # CLI entry point
├── tokenizer.py         # Main TamilTokenizer class
├── constants/           # Tamil Unicode constants & letter groups
├── grammar/             # Grammar analysis (util, case, tense)
├── config/              # Configuration & data file loading
├── parsers/             # Root word parser & core parsing
├── utils/               # Iterator, splitting, word class utilities
└── data/                # Grammar rule files (.list)
```

Requirements

  • Python 3.8+
  • No external dependencies

Download files

Source Distribution

vettu-1.0.1.tar.gz (456.9 kB)

Built Distribution

vettu-1.0.1-py3-none-any.whl (480.7 kB)
File details

Details for the file vettu-1.0.1.tar.gz:

  • Size: 456.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

Hashes for vettu-1.0.1.tar.gz:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 4eb330dea08d4df85f34ae2029a062cd134d62030261e2e491708a68edbb1145 |
| MD5 | 95971e85e88ef4c1251c78f976d4d5b8 |
| BLAKE2b-256 | 9ff2a01e1a0201e08fe98c5274840e2fcc9d930c37b201d2f2d794c9e435709d |

File details

Details for the file vettu-1.0.1-py3-none-any.whl:

  • Size: 480.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

Hashes for vettu-1.0.1-py3-none-any.whl:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 2e6629c4b674880b79866a38bac8fcb09ad2a053e1314577bb333c3fcb8e18dd |
| MD5 | 63396e6f37f6ba10218858548f3114c6 |
| BLAKE2b-256 | 0ee5a87c0643377ef2e9a069fb0cc227c63e20db2923ab3e7c1d062e4243ca13 |
