
Multi-level tokenizer for Tamil text — sentence, word, character, and morpheme tokenization


Tamil Tokenizer

A standalone, multi-level tokenizer for Tamil text. No external dependencies — uses only the Python standard library.

Features

Four levels of tokenization:

| Level | Description | Example |
|---|---|---|
| sentence | Split text into sentences | "அவன் வந்தான். அவள் பார்த்தாள்." → 2 sentences |
| word | Split into words + punctuation | "அவன் வந்தான்." → அவன், வந்தான், . |
| character | Split into Tamil letters with classification (உயிர்/மெய்/உயிர்மெய், வல்லினம்/மெல்லினம்/இடையினம்) | "வந்தான்" → வ, ந், தா, ன் |
| morpheme | Split into root + grammatical suffixes (case, tense, person) | "பள்ளிக்கு" → root பள்ளி + case suffix க்கு (Dative) |
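Tamil is written in an abugida: a letter may be an independent vowel, a consonant marked with the pulli (்), or a consonant–vowel combination, so character-level tokenization must group several Unicode code points into one letter. The sketch below illustrates that grouping using only the Tamil Unicode block; it is a hypothetical approximation, not the library's actual implementation (which lives in `tamil_tokenizer.constants` and `tokenizer.py`).

```python
# Illustrative sketch of Tamil letter grouping and classification.
# The tables here are deliberately small; the real library's rules may differ.
VOWELS = set("அஆஇஈஉஊஎஏஐஒஓஔ")        # உயிரெழுத்து (independent vowels)
PULLI = "\u0BCD"                      # ் marks a bare consonant
VOWEL_SIGNS = set("ாிீுூெேைொோௌ")      # dependent vowel signs (sample)

def split_letters(word):
    """Group code points into Tamil letters (grapheme-like units)."""
    letters = []
    for ch in word:
        # pulli and vowel signs attach to the preceding consonant
        if letters and (ch == PULLI or ch in VOWEL_SIGNS):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters

def classify(letter):
    """Map a grouped letter to a character-level token type."""
    if letter in VOWELS:
        return "vowel"            # உயிரெழுத்து
    if letter == "ஃ":
        return "special"          # ஆய்த எழுத்து
    if letter.endswith(PULLI):
        return "consonant"        # மெய்யெழுத்து
    return "vowel_consonant"      # உயிர்மெய்யெழுத்து
```

With this grouping, "வந்தான்" (7 code points) comes apart as the four letters வ, ந், தா, ன், matching the table above.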

Installation

# From the project directory
pip install -e .

# Or just use directly (no install needed)
python -m tamil_tokenizer "அவன் வந்தான்."

Usage

Command Line

# Word tokenization (default)
python -m tamil_tokenizer "அவன் வந்தான்."

# Character tokenization
python -m tamil_tokenizer "தமிழ்நாடு" --level character

# Sentence tokenization
python -m tamil_tokenizer "அவன் வந்தான். அவள் பார்த்தாள்." --level sentence

# Morpheme tokenization
python -m tamil_tokenizer "பள்ளிக்கு சென்றான்." --level morpheme

# JSON output
python -m tamil_tokenizer "அவன் வந்தான்." --format json

# Plain text output (just token strings)
python -m tamil_tokenizer "அவன் வந்தான்." --format text

# Interactive mode
python -m tamil_tokenizer --interactive

Python API

from tamil_tokenizer import TamilTokenizer, Token, TokenType

tokenizer = TamilTokenizer()

# Sentence tokenization
sentences = tokenizer.sentence_tokenize("அவன் வந்தான். அவள் பார்த்தாள்.")

# Word tokenization
words = tokenizer.word_tokenize("அவன் வந்தான்.")

# Character tokenization
letters = tokenizer.character_tokenize("வந்தான்")
for letter in letters:
    print(f"{letter.text} -> {letter.token_type.value} ({letter.metadata})")

# Morpheme tokenization
morphemes = tokenizer.morpheme_tokenize("பள்ளிக்கு")
for m in morphemes:
    print(f"{m.text} -> {m.token_type.value} ({m.metadata})")

# Unified pipeline
tokens = tokenizer.tokenize("அவன் வந்தான்.", level="word")

# Convenience: get just strings
strings = tokenizer.tokenize_to_strings("அவன் வந்தான்.", level="word")
# ['அவன்', 'வந்தான்', '.']

# Convenience: get dicts (useful for JSON serialization)
dicts = tokenizer.tokenize_to_dicts("அவன் வந்தான்.", level="character")
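Since each Token exposes text, token_type, and metadata, the dicts from tokenize_to_dicts can be passed straight to json.dumps. The snippet below uses a hand-built stand-in list (the library's exact dict keys are an assumption here); note ensure_ascii=False, which keeps Tamil text readable instead of emitting \uXXXX escapes.

```python
import json

# Hand-built stand-in for tokenize_to_dicts output; the real key
# names come from tamil_tokenizer and may differ.
tokens = [
    {"text": "அவன்", "token_type": "word", "metadata": {}},
    {"text": "வந்தான்", "token_type": "word", "metadata": {}},
    {"text": ".", "token_type": "punctuation", "metadata": {}},
]

# ensure_ascii=False preserves the Tamil characters in the JSON output
print(json.dumps(tokens, ensure_ascii=False, indent=2))
```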

Token Types

Word-level

  • word — Tamil word
  • number — Numeric value
  • punctuation — Punctuation mark
  • symbol — Other symbol

Character-level

  • vowel — உயிரெழுத்து (அ, ஆ, இ, ...)
  • consonant — மெய்யெழுத்து (க், ங், ச், ...)
  • vowel_consonant — உயிர்மெய்யெழுத்து (க, கா, கி, ...)
  • special — ஆய்த எழுத்து (ஃ)

Morpheme-level

  • root — Root word
  • suffix — Generic suffix
  • case_suffix — வேற்றுமை உருபு (case marker)
  • tense_marker — கால இடைநிலை (tense marker)
  • person_marker — விகுதி (person/number marker)

Project Structure

tamil_tokenizer/
├── __init__.py          # Package init + public API
├── __main__.py          # CLI entry point
├── tokenizer.py         # Main TamilTokenizer class
├── constants/           # Tamil Unicode constants & letter groups
├── grammar/             # Grammar analysis (util, case, tense)
├── config/              # Configuration & data file loading
├── parsers/             # Root word parser & core parsing
├── utils/               # Iterator, splitting, word class utilities
└── data/                # Grammar rule files (.list)

Requirements

  • Python 3.8+
  • No external dependencies

Download files

Download the file for your platform.

Source Distribution

vettu-1.0.2.tar.gz (456.7 kB)


Built Distribution


vettu-1.0.2-py3-none-any.whl (480.5 kB)


File details

Details for the file vettu-1.0.2.tar.gz.

File metadata

  • Download URL: vettu-1.0.2.tar.gz
  • Size: 456.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for vettu-1.0.2.tar.gz

| Algorithm | Hash digest |
|---|---|
| SHA256 | bdfd5ab4bbbf8b22cd9140b09629b313588c6a0ecdb0f08faf147080b1975ca9 |
| MD5 | 4b64af95b4f4cc755ec6835b04c24fd6 |
| BLAKE2b-256 | 97f67f86398ef83b6ed3b415fc77e80b31479417a71e2429d8cd43a663821a39 |


File details

Details for the file vettu-1.0.2-py3-none-any.whl.

File metadata

  • Download URL: vettu-1.0.2-py3-none-any.whl
  • Size: 480.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for vettu-1.0.2-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 839df209b4b75abf9c581217dba321114dd322c24c0123f1318f6627c7b61c74 |
| MD5 | ca8ff0bf69da03c76b387be53c75abe6 |
| BLAKE2b-256 | a812d32f8ae50aa0e3d4d3e3abf2c75343ca760cdc9075c704b868d08d87f7d0 |

