
Multi-level tokenizer for Tamil text — sentence, word, character, and morpheme tokenization


Tamil Tokenizer

A standalone, multi-level tokenizer for Tamil text. No external dependencies — uses only the Python standard library.

Features

Four levels of tokenization:

| Level | Description | Example |
|---|---|---|
| `sentence` | Split text into sentences | "அவன் வந்தான். அவள் பார்த்தாள்." → 2 sentences |
| `word` | Split into words and punctuation | "அவன் வந்தான்." → அவன், வந்தான், . |
| `character` | Split into Tamil letters with classification (உயிர்/மெய்/உயிர்மெய், வல்லினம்/மெல்லினம்/இடையினம்) | "வந்தான்" → வ, ந், தா, ன் |
| `morpheme` | Split into root + grammatical suffixes (case, tense, person) | "பள்ளிக்கு" → root பள்ளி + case suffix க்கு (dative) |

Installation

# From the project directory
pip install -e .

# Or just use directly (no install needed)
python -m tamil_tokenizer "அவன் வந்தான்."

Usage

Command Line

# Word tokenization (default)
python -m tamil_tokenizer "அவன் வந்தான்."

# Character tokenization
python -m tamil_tokenizer "தமிழ்நாடு" --level character

# Sentence tokenization
python -m tamil_tokenizer "அவன் வந்தான். அவள் பார்த்தாள்." --level sentence

# Morpheme tokenization
python -m tamil_tokenizer "பள்ளிக்கு சென்றான்." --level morpheme

# JSON output
python -m tamil_tokenizer "அவன் வந்தான்." --format json

# Plain text output (just token strings)
python -m tamil_tokenizer "அவன் வந்தான்." --format text

# Interactive mode
python -m tamil_tokenizer --interactive

Python API

from tamil_tokenizer import TamilTokenizer, Token, TokenType

tokenizer = TamilTokenizer()

# Sentence tokenization
sentences = tokenizer.sentence_tokenize("அவன் வந்தான். அவள் பார்த்தாள்.")

# Word tokenization
words = tokenizer.word_tokenize("அவன் வந்தான்.")

# Character tokenization
letters = tokenizer.character_tokenize("வந்தான்")
for letter in letters:
    print(f"{letter.text} -> {letter.token_type.value} ({letter.metadata})")

# Morpheme tokenization
morphemes = tokenizer.morpheme_tokenize("பள்ளிக்கு")
for m in morphemes:
    print(f"{m.text} -> {m.token_type.value} ({m.metadata})")

# Unified pipeline
tokens = tokenizer.tokenize("அவன் வந்தான்.", level="word")

# Convenience: get just strings
strings = tokenizer.tokenize_to_strings("அவன் வந்தான்.", level="word")
# ['அவன்', 'வந்தான்', '.']

# Convenience: get dicts (useful for JSON serialization)
dicts = tokenizer.tokenize_to_dicts("அவன் வந்தான்.", level="character")
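`tokenize_to_dicts` pairs naturally with the standard `json` module. A minimal sketch, assuming the dicts carry at least `text` and `token_type` keys (the exact keys and extra fields such as `metadata` may differ):

```python
import json

# Hypothetical output shape of tokenize_to_dicts("அவன் வந்தான்.", level="word");
# the real dict keys may differ.
tokens = [
    {"text": "அவன்", "token_type": "word"},
    {"text": "வந்தான்", "token_type": "word"},
    {"text": ".", "token_type": "punctuation"},
]

# ensure_ascii=False keeps Tamil script readable instead of \uXXXX escapes
payload = json.dumps(tokens, ensure_ascii=False, indent=2)
print(payload)
```

This mirrors what `--format json` produces on the command line.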

Token Types

Word-level

  • word — Tamil word
  • number — Numeric value
  • punctuation — Punctuation mark
  • symbol — Other symbol

Character-level

  • vowel — உயிரெழுத்து (அ, ஆ, இ, ...)
  • consonant — மெய்யெழுத்து (க், ங், ச், ...)
  • vowel_consonant — உயிர்மெய்யெழுத்து (க, கா, கி, ...)
  • special — ஆய்த எழுத்து (ஃ)
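These categories map directly onto the structure of the Tamil Unicode block (U+0B80–U+0BFF). The following is a rough, self-contained sketch of how such classification could work using only code points; the library's actual implementation may differ (for instance in how it handles non-Tamil characters):

```python
# Standalone sketch of character-level classification — not the library's code.
VOWELS = set("அஆஇஈஉஊஎஏஐஒஓஔ")      # உயிரெழுத்து (12 independent vowels)
AYTHAM = "ஃ"                          # ஆய்த எழுத்து, U+0B83
PULLI = "\u0BCD"                       # virama: marks a pure consonant (மெய்)
VOWEL_SIGNS = set("\u0BBE\u0BBF\u0BC0\u0BC1\u0BC2"
                  "\u0BC6\u0BC7\u0BC8\u0BCA\u0BCB\u0BCC")

def split_letters(word):
    """Group code points into Tamil letters and label each one."""
    letters, i = [], 0
    while i < len(word):
        ch = word[i]
        if ch in VOWELS:
            letters.append((ch, "vowel")); i += 1
        elif ch == AYTHAM:
            letters.append((ch, "special")); i += 1
        elif i + 1 < len(word) and word[i + 1] == PULLI:
            letters.append((word[i:i + 2], "consonant")); i += 2
        elif i + 1 < len(word) and word[i + 1] in VOWEL_SIGNS:
            letters.append((word[i:i + 2], "vowel_consonant")); i += 2
        else:
            # bare base letter carries the inherent அ vowel
            letters.append((ch, "vowel_consonant")); i += 1
    return letters

print(split_letters("வந்தான்"))
# [('வ', 'vowel_consonant'), ('ந்', 'consonant'), ('தா', 'vowel_consonant'), ('ன்', 'consonant')]
```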

Morpheme-level

  • root — Root word
  • suffix — Generic suffix
  • case_suffix — வேற்றுமை உருபு (case marker)
  • tense_marker — கால இடைநிலை (tense marker)
  • person_marker — விகுதி (person/number marker)
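A typical way to consume morpheme-level output is to separate the root from its grammatical suffixes. The `(text, type)` pairs below mirror the "பள்ளிக்கு" example from the table; the real `Token` objects carry the same information in `.text` and `.token_type.value`:

```python
# Hypothetical morpheme output for "பள்ளிக்கு" — assumed shape, not the
# library's actual return type.
morphemes = [
    ("பள்ளி", "root"),
    ("க்கு", "case_suffix"),
]

# The surface form is recoverable by concatenating morphemes in order.
surface = "".join(text for text, _ in morphemes)

root = next(text for text, t in morphemes if t == "root")
suffixes = [(text, t) for text, t in morphemes if t != "root"]
```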

Project Structure

tamil_tokenizer/
├── __init__.py          # Package init + public API
├── __main__.py          # CLI entry point
├── tokenizer.py         # Main TamilTokenizer class
├── constants/           # Tamil Unicode constants & letter groups
├── grammar/             # Grammar analysis (util, case, tense)
├── config/              # Configuration & data file loading
├── parsers/             # Root word parser & core parsing
├── utils/               # Iterator, splitting, word class utilities
└── data/                # Grammar rule files (.list)

Requirements

  • Python 3.8+
  • No external dependencies
