
Tamil word lemmatizer. Converts inflected Tamil words into their base lemmas.


Tamil Lemmatizer

Tamil Lemmatizer is a character-level lemmatization library for Tamil text.
It normalizes inflected Tamil word forms and maps each one to its base lemma using a deep learning model built with PyTorch.


✨ Features

  • ✅ Lemmatizes Tamil words to their base form
  • ✅ Handles unseen words using a character-level sequence model
  • ✅ Simple Python API
  • ✅ Supports batch inference
  • ✅ Open-source and extensible

📦 Installation

pip install tamil-lemmatizer

🚀 Quick Start

from tamil_lemmatizer import TamilLemmatizer

lemmatizer = TamilLemmatizer()

word = "சென்றார்கள்"
lemma = lemmatizer.lemmatize(word)

print(lemma)   # Output: செல்

Batch input

words = ["பாடுகிறது", "வந்தார்கள்", "சென்றேன்"]
print(lemmatizer.lemmatize_batch(words))
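When the same word appears many times in a text, it can help to lemmatize each distinct word only once. The sketch below is one way to do that. It assumes, since the README does not say so explicitly, that `lemmatize_batch` returns a list with one lemma per input word, in input order. The helper name `lemma_table` is hypothetical and not part of the package.

```python
from typing import Callable, Dict, List


def lemma_table(
    words: List[str],
    lemmatize_batch: Callable[[List[str]], List[str]],
) -> Dict[str, str]:
    """Map each distinct word to its lemma using a single batch call.

    Assumption: lemmatize_batch returns one lemma per input word,
    in the same order as the input list.
    """
    # dict.fromkeys drops duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(words))
    lemmas = lemmatize_batch(unique)
    if len(lemmas) != len(unique):
        raise ValueError("expected one lemma per input word")
    return dict(zip(unique, lemmas))
```

With the lemmatizer from the Quick Start, this would be called as `lemma_table(words, lemmatizer.lemmatize_batch)`.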

📚 Description

Tamil is morphologically rich: a single lemma can have hundreds of inflected forms. To handle this, the library's model:

  • Uses a character-level encoder-decoder architecture
  • Is trained with PyTorch on a curated Tamil lemma dataset
  • Lemmatizes both verbs and nouns
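To illustrate what "character-level" means in practice, here is a minimal sketch of how words could be turned into index sequences for a sequence model. It is not the package's actual preprocessing: the special tokens, the function names, and the choice to treat each Unicode code point as one symbol are all assumptions. Python iterates strings by code point, so a Tamil consonant followed by a vowel sign counts as two symbols here.

```python
from typing import Dict, Iterable, List

# Hypothetical special tokens; the real model may use different ones.
PAD, SOS, EOS, UNK = "<pad>", "<sos>", "<eos>", "<unk>"


def build_vocab(words: Iterable[str]) -> Dict[str, int]:
    """Assign an integer id to every code point seen in the words."""
    vocab = {tok: i for i, tok in enumerate([PAD, SOS, EOS, UNK])}
    for ch in sorted({ch for w in words for ch in w}):
        vocab[ch] = len(vocab)
    return vocab


def encode(word: str, vocab: Dict[str, int]) -> List[int]:
    """Wrap the word in start/end markers; unseen symbols map to UNK."""
    unk = vocab[UNK]
    return [vocab[SOS]] + [vocab.get(ch, unk) for ch in word] + [vocab[EOS]]


def decode(ids: Iterable[int], vocab: Dict[str, int]) -> str:
    """Turn ids back into text, dropping padding and start/end markers."""
    inverse = {i: tok for tok, i in vocab.items()}
    skip = {vocab[PAD], vocab[SOS], vocab[EOS]}
    return "".join(inverse[i] for i in ids if i not in skip)
```

Because the vocabulary is built from symbols rather than whole words, a word that never appeared in training can still be encoded as long as its individual characters are known.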

🛠️ Model Architecture

  • Encoder: BiLSTM or Transformer (depending on version)
  • Decoder: Attention-based sequence generator
  • Loss: Cross entropy over Tamil character vocabulary
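The loss term can be made concrete with a small, dependency-free sketch. It computes cross entropy from per-step scores (logits) over a character vocabulary, using the log-sum-exp form for numerical stability. Averaging over non-padding positions and the `pad_id` parameter are assumptions for illustration; the README does not describe how the actual training code reduces the loss.

```python
import math
from typing import Sequence


def char_cross_entropy(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    pad_id: int = 0,
) -> float:
    """Mean cross entropy over the non-padding target positions.

    logits[t] holds one score per vocabulary symbol at decoding step t;
    targets[t] is the id of the correct symbol at that step.
    """
    total, count = 0.0, 0
    for step_logits, target in zip(logits, targets):
        if target == pad_id:
            continue  # assumed convention: padding does not contribute
        # log(sum(exp(x))) computed stably by subtracting the maximum.
        peak = max(step_logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in step_logits))
        # -log softmax(target) = log_z - logit[target]
        total += log_z - step_logits[target]
        count += 1
    return total / count if count else 0.0
```

With equal scores across a vocabulary of V symbols, the loss is log(V), which is a useful sanity check for an untrained model.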


🔧 CLI Usage

tamil-lemmatizer "வந்தார்கள்"

📄 License

This project is released under the MIT License.


🤝 Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.


✉️ Contact

Maintainer: Hemanth Kumar
GitHub: Hemanth Thunder

