Multi-level tokenizer for Tamil text — sentence, word, character, and morpheme tokenization

Project description

Tamil Tokenizer

A standalone, multi-level tokenizer for Tamil text. No external dependencies — uses only the Python standard library.

Features

Four levels of tokenization:

| Level | Description | Example |
|-----------|------------------------------------------------------|---------|
| sentence | Split text into sentences | "அவன் வந்தான். அவள் பார்த்தாள்." → 2 sentences |
| word | Split into words + punctuation | "அவன் வந்தான்." → அவன், வந்தான், . |
| character | Split into Tamil letters with classification (உயிர்/மெய்/உயிர்மெய், வல்லினம்/மெல்லினம்/இடையினம்) | "வந்தான்" → வ, ந், தா, ன் |
| morpheme | Split into root + grammatical suffixes (case, tense, person) | "பள்ளிக்கு" → root பள்ளி + case suffix க்கு (dative) |
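Character-level tokenization groups a base letter with its combining signs (vowel signs and the புள்ளி). As a rough illustration of what that grouping involves (not the library's actual implementation), here is a minimal grapheme splitter using only the standard `unicodedata` module:

```python
import unicodedata

def split_graphemes(text):
    """Group each Tamil base letter with its trailing combining signs."""
    clusters = []
    for ch in text:
        # Mn/Mc are combining marks: the pulli (virama) and vowel signs like "ா"
        if clusters and unicodedata.category(ch) in ("Mn", "Mc"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

print(split_graphemes("வந்தான்"))  # ['வ', 'ந்', 'தா', 'ன்']
```

The library's `character_tokenize` additionally classifies each cluster; this sketch only shows the splitting step.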

Installation

# From the project directory
pip install -e .

# Or just use directly (no install needed)
python -m tamil_tokenizer "அவன் வந்தான்."

Usage

Command Line

# Word tokenization (default)
python -m tamil_tokenizer "அவன் வந்தான்."

# Character tokenization
python -m tamil_tokenizer "தமிழ்நாடு" --level character

# Sentence tokenization
python -m tamil_tokenizer "அவன் வந்தான். அவள் பார்த்தாள்." --level sentence

# Morpheme tokenization
python -m tamil_tokenizer "பள்ளிக்கு சென்றான்." --level morpheme

# JSON output
python -m tamil_tokenizer "அவன் வந்தான்." --format json

# Plain text output (just token strings)
python -m tamil_tokenizer "அவன் வந்தான்." --format text

# Interactive mode
python -m tamil_tokenizer --interactive

Python API

from tamil_tokenizer import TamilTokenizer, Token, TokenType

tokenizer = TamilTokenizer()

# Sentence tokenization
sentences = tokenizer.sentence_tokenize("அவன் வந்தான். அவள் பார்த்தாள்.")

# Word tokenization
words = tokenizer.word_tokenize("அவன் வந்தான்.")

# Character tokenization
letters = tokenizer.character_tokenize("வந்தான்")
for letter in letters:
    print(f"{letter.text} -> {letter.token_type.value} ({letter.metadata})")

# Morpheme tokenization
morphemes = tokenizer.morpheme_tokenize("பள்ளிக்கு")
for m in morphemes:
    print(f"{m.text} -> {m.token_type.value} ({m.metadata})")

# Unified pipeline
tokens = tokenizer.tokenize("அவன் வந்தான்.", level="word")

# Convenience: get just strings
strings = tokenizer.tokenize_to_strings("அவன் வந்தான்.", level="word")
# ['அவன்', 'வந்தான்', '.']

# Convenience: get dicts (useful for JSON serialization)
dicts = tokenizer.tokenize_to_dicts("அவன் வந்தான்.", level="character")
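Each returned `Token` carries `text`, `token_type`, and `metadata`, as the loops above show. A hypothetical minimal shape, inferred only from those attribute accesses (the real class and the full `TokenType` enum live in the package):

```python
from dataclasses import dataclass, field
from enum import Enum

class TokenType(Enum):
    # a few of the documented values; the real enum has more members
    WORD = "word"
    PUNCTUATION = "punctuation"
    ROOT = "root"
    CASE_SUFFIX = "case_suffix"

@dataclass
class Token:
    text: str
    token_type: TokenType
    metadata: dict = field(default_factory=dict)

t = Token("பள்ளி", TokenType.ROOT)
print(f"{t.text} -> {t.token_type.value} ({t.metadata})")
```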

Token Types

Word-level

  • word — Tamil word
  • number — Numeric value
  • punctuation — Punctuation mark
  • symbol — Other symbol
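One way to picture these four categories is a regex with one named group per type; this is an illustrative sketch, not the library's actual rules:

```python
import re

# One named group per documented word-level type (illustrative only)
TOKEN_RE = re.compile(
    r"(?P<word>[\u0B80-\u0BFF]+)"        # Tamil Unicode block: letters + signs
    r"|(?P<number>\d+)"
    r"|(?P<punctuation>[.,!?;:'()-])"
    r"|(?P<symbol>\S)"
)

def classify_words(text):
    return [(m.group(), m.lastgroup) for m in TOKEN_RE.finditer(text)]

print(classify_words("அவன் வந்தான்."))
# [('அவன்', 'word'), ('வந்தான்', 'word'), ('.', 'punctuation')]
```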

Character-level

  • vowel — உயிரெழுத்து (அ, ஆ, இ, ...)
  • consonant — மெய்யெழுத்து (க், ங், ச், ...)
  • vowel_consonant — உயிர்மெய்யெழுத்து (க, கா, கி, ...)
  • special — ஆய்த எழுத்து (ஃ)
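The distinction between these types follows directly from a cluster's shape: a standalone vowel letter, a base plus புள்ளி, ஃ, or anything else. A toy classifier under those assumptions (not the package's implementation):

```python
VOWELS = set("அஆஇஈஉஊஎஏஐஒஓஔ")   # உயிரெழுத்து, the 12 vowel letters
AYTHAM = "ஃ"                       # ஆய்த எழுத்து
PULLI = "\u0BCD"                    # the pulli (virama) sign

def classify(cluster):
    """Classify one grapheme cluster into the documented character-level types."""
    if cluster in VOWELS:
        return "vowel"
    if cluster == AYTHAM:
        return "special"
    if cluster.endswith(PULLI):
        return "consonant"          # மெய்: base letter + pulli
    return "vowel_consonant"        # உயிர்மெய்: base letter, optionally + vowel sign

print([classify(c) for c in ["ஆ", "க்", "கா", "ஃ"]])
# ['vowel', 'consonant', 'vowel_consonant', 'special']
```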

Morpheme-level

  • root — Root word
  • suffix — Generic suffix
  • case_suffix — வேற்றுமை உருபு (case marker)
  • tense_marker — கால இடைநிலை (tense marker)
  • person_marker — விகுதி (person/number marker)
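Conceptually, morpheme tokenization peels known suffixes off the end of a word and labels the remainder as the root. A toy version for case suffixes only (the real parser is driven by the grammar rule files in `data/`; these three suffixes are just examples):

```python
# Example வேற்றுமை உருபு (case markers); illustrative, not the full rule set
CASE_SUFFIXES = {
    "க்கு": "dative",
    "ஐ": "accusative",
    "இல்": "locative",
}

def split_morphemes(word):
    for suffix, case in CASE_SUFFIXES.items():
        if word.endswith(suffix) and len(word) > len(suffix):
            return [(word[:-len(suffix)], "root"), (suffix, "case_suffix")]
    return [(word, "root")]

print(split_morphemes("பள்ளிக்கு"))
# [('பள்ளி', 'root'), ('க்கு', 'case_suffix')]
```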

Project Structure

tamil_tokenizer/
├── __init__.py          # Package init + public API
├── __main__.py          # CLI entry point
├── tokenizer.py         # Main TamilTokenizer class
├── constants/           # Tamil Unicode constants & letter groups
├── grammar/             # Grammar analysis (util, case, tense)
├── config/              # Configuration & data file loading
├── parsers/             # Root word parser & core parsing
├── utils/               # Iterator, splitting, word class utilities
└── data/                # Grammar rule files (.list)

Requirements

  • Python 3.8+
  • No external dependencies

Download files

Download the file for your platform.

Source Distribution

vettu-1.0.4.tar.gz (4.5 MB)

Built Distribution


vettu-1.0.4-py3-none-any.whl (4.6 MB)

File details

Details for the file vettu-1.0.4.tar.gz.

File metadata

  • Download URL: vettu-1.0.4.tar.gz
  • Upload date:
  • Size: 4.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for vettu-1.0.4.tar.gz
  • SHA256: 483cf741abf14e49e056779af8b42116e31a123adfa5b594d5561df5d3322c35
  • MD5: d6c5911e4f461c0d45122fe562489429
  • BLAKE2b-256: 683be26e609d4ede563153c707a1fd18cd2cb9e384febc881b6633cbb2549e3a


File details

Details for the file vettu-1.0.4-py3-none-any.whl.

File metadata

  • Download URL: vettu-1.0.4-py3-none-any.whl
  • Upload date:
  • Size: 4.6 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for vettu-1.0.4-py3-none-any.whl
  • SHA256: 150f3c33d3d2868944a77ee1070aee6d3307c48bb056f89813870e3c2fbca6bb
  • MD5: 04d10f6a019829c89342e07fe70efe01
  • BLAKE2b-256: 942027831ef9568f32e1fe70f107d944db975634cf46a7a9407c7daaac925c1e

