tamil-tokenizer

A simple and efficient Tamil text tokenizer library with modern Python structure.

License: Apache-2.0

Features

  • Tamil Text Tokenization: Comprehensive tokenization for Tamil text
  • Multiple Tokenization Methods: Word, sentence, character, syllable, and grapheme-level tokenization
  • Enhanced Text Normalization: Unicode normalization, digit standardization, punctuation standardization
  • Script Information Analysis: Comprehensive script detection, complexity scoring, and readability assessment
  • Language Detection: Automatic Tamil language detection with confidence scores
  • Text Validation: Tamil text validation with configurable thresholds
  • Character Type Analysis: Detailed analysis of vowels, consonants, conjuncts, and other character types
  • Modern Python API: Clean, type-hinted interface with both functional and object-oriented approaches
  • Command Line Interface: Full-featured CLI tool for Tamil text processing
  • Fast Processing: Efficient regex-based operations
  • Error Handling: Comprehensive exception handling with meaningful error messages
  • Well Tested: Extensive test suite with high coverage
  • Type Hints: Full type annotation support for better IDE experience

Installation

pip install tamil-tokenizer

Dependencies

  • Python 3.8+
  • regex >= 2022.0.0

Optional Dependencies

For development:

pip install "tamil-tokenizer[dev]"

Quick Start

from tamil_tokenizer import tokenize_words, tokenize_sentences, TamilTokenizer

# Quick tokenization
words = tokenize_words("தமிழ் மொழி அழகான மொழி")
print(f"Words: {words}")

sentences = tokenize_sentences("வணக்கம். நீங்கள் எப்படி இருக்கிறீர்கள்?")
print(f"Sentences: {sentences}")

# Using TamilTokenizer class
tokenizer = TamilTokenizer()
tokens = tokenizer.tokenize("தமிழ் உரை", method="words")
print(f"Tokens: {tokens}")

Usage Examples

Basic Tokenization

from tamil_tokenizer import tokenize_words, tokenize_sentences, tokenize_characters

# Word tokenization
text = "தமிழ் மொழி அழகான மொழி"
words = tokenize_words(text)
print(f"Words: {words}")
# Output: ['தமிழ்', 'மொழி', 'அழகான', 'மொழி']

# Sentence tokenization
text = "வணக்கம். நீங்கள் எப்படி இருக்கிறீர்கள்? நன்றாக இருக்கிறேன்!"
sentences = tokenize_sentences(text)
print(f"Sentences: {sentences}")
# Output: ['வணக்கம்', 'நீங்கள் எப்படி இருக்கிறீர்கள்', 'நன்றாக இருக்கிறேன்']

# Character tokenization
text = "தமிழ்"
characters = tokenize_characters(text)
print(f"Characters: {characters}")
# Output: ['த', 'ம', 'ி', 'ழ', '்']
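
The character output above has five entries because தமிழ் is stored as five Unicode code points: the vowel sign and the pulli are separate combining marks. The following is a minimal standard-library sketch of the difference between code points and user-perceived characters. The helper group_marks is hypothetical and is not part of tamil-tokenizer; it attaches each combining mark to the preceding base character, which is a simplification of full grapheme cluster segmentation.

```python
import unicodedata
from typing import List


def group_marks(text: str) -> List[str]:
    """Attach each combining mark to the base character before it."""
    clusters: List[str] = []
    for ch in text:
        # Mn = nonspacing mark (for example the pulli),
        # Mc = spacing combining mark (for example the vowel sign i)
        if clusters and unicodedata.category(ch) in ("Mn", "Mc"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "தமிழ்"
print(list(word))         # one entry per code point
print(group_marks(word))  # one entry per base character plus its marks
```

Which of the two views is appropriate depends on the task: code points suit low-level analysis, while grouped clusters are closer to what a reader counts as letters.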

Using TamilTokenizer Class

from tamil_tokenizer import TamilTokenizer

# Create tokenizer instance
tokenizer = TamilTokenizer()

# General tokenization method
text = "தமிழ் மொழி அழகான மொழி"
words = tokenizer.tokenize(text, method="words")
sentences = tokenizer.tokenize(text, method="sentences")
characters = tokenizer.tokenize(text, method="characters")

print(f"Words: {words}")
print(f"Sentences: {sentences}")
print(f"Characters: {characters}")

Text Cleaning and Normalization

from tamil_tokenizer import clean_text, normalize_text, TamilTokenizer

# Clean text with extra whitespace
messy_text = "  தமிழ்   மொழி   அழகு  "
cleaned = clean_text(messy_text)
print(f"Cleaned: '{cleaned}'")
# Output: 'தமிழ் மொழி அழகு'

# Clean text and remove punctuation
tokenizer = TamilTokenizer()
text_with_punct = "தமிழ், மொழி! அழகு?"
cleaned_no_punct = tokenizer.clean_text(text_with_punct, remove_punctuation=True)
print(f"No punctuation: '{cleaned_no_punct}'")
# Output: 'தமிழ் மொழி அழகு'

# Normalize text
normalized = normalize_text(messy_text)
print(f"Normalized: '{normalized}'")
# Output: 'தமிழ் மொழி அழகு'
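
To make the cleaning steps concrete, here is a rough standard-library sketch of whitespace collapsing and punctuation removal. The function names collapse_whitespace and strip_punctuation are hypothetical, and the library's own clean_text may use different rules; in particular, this sketch drops every character in a Unicode punctuation category.

```python
import unicodedata


def collapse_whitespace(text: str) -> str:
    # str.split() with no arguments splits on any run of whitespace
    # and discards leading and trailing whitespace.
    return " ".join(text.split())


def strip_punctuation(text: str) -> str:
    # General categories that start with "P" are punctuation (Po, Pd, Ps, ...).
    kept = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    return collapse_whitespace(kept)


print(repr(collapse_whitespace("  தமிழ்   மொழி   அழகு  ")))
print(repr(strip_punctuation("தமிழ், மொழி! அழகு?")))
```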

Enhanced Text Normalization

from tamil_tokenizer import normalize_text, TamilTokenizer

tokenizer = TamilTokenizer()

# Comprehensive normalization with all options
text = "  தமிழ்—௧௨௩\u200Cமொழி…அழகான—மொழி  "
normalized = tokenizer.normalize_text(
    text,
    form="NFC",                    # Unicode normalization
    standardize_digits=True,       # Tamil digits to Arabic
    standardize_punctuation=True,  # Standardize punctuation
    remove_zero_width=True         # Remove invisible characters
)
print(f"Normalized: '{normalized}'")
# Output: 'தமிழ்-123மொழி...அழகான-மொழி'

# Tamil digit standardization
text_with_digits = "தமிழ் ௧௨௩௪ வருடங்கள் பழமையான மொழி"
standardized = normalize_text(text_with_digits, standardize_digits=True)
print(f"Standardized: {standardized}")
# Output: 'தமிழ் 1234 வருடங்கள் பழமையான மொழி'
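
The four normalization options can be approximated with the standard library alone. The sketch below is an illustration under stated assumptions, not the library's implementation: normalize_sketch is a hypothetical name, and the punctuation and zero-width tables are small illustrative sets chosen for this example.

```python
import unicodedata

# Tamil digits U+0BE6..U+0BEF map to ASCII 0..9.
TAMIL_DIGITS = {0x0BE6 + i: str(i) for i in range(10)}
# Illustrative punctuation table: em dash, en dash, horizontal ellipsis.
PUNCTUATION = {0x2014: "-", 0x2013: "-", 0x2026: "..."}
# Zero-width space, non-joiner, joiner, and byte-order mark are deleted.
ZERO_WIDTH = dict.fromkeys([0x200B, 0x200C, 0x200D, 0xFEFF])


def normalize_sketch(text: str, form: str = "NFC") -> str:
    text = unicodedata.normalize(form, text)  # Unicode normalization
    text = text.translate(TAMIL_DIGITS)       # digit standardization
    text = text.translate(PUNCTUATION)        # punctuation standardization
    text = text.translate(ZERO_WIDTH)         # zero-width removal
    return " ".join(text.split())             # trim and collapse whitespace


print(normalize_sketch("  ௧௨௩\u200c\u2014\u2026  "))
```

str.translate accepts a mapping from code point to replacement string, and a value of None deletes the character, so each option reduces to one table lookup per character.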

Script Information Analysis

from tamil_tokenizer import get_script_info, TamilTokenizer

tokenizer = TamilTokenizer()

# Comprehensive script analysis
text = "தமிழ் மொழி உலகின் பழமையான மொழிகளில் ஒன்று"
info = tokenizer.get_script_info(text)

print(f"Tamil percentage: {info['tamil_percentage']:.1f}%")
print(f"Complexity score: {info['complexity_score']:.2f}")
print(f"Readability level: {info['readability_level']}")
print(f"Scripts detected: {info['scripts_detected']}")
print(f"Has conjuncts: {info['has_conjuncts']}")
print(f"Unicode blocks: {info['unicode_blocks']}")

# Character type analysis
char_types = info['character_types']
print(f"Vowels: {char_types['vowels']}")
print(f"Consonants: {char_types['consonants']}")
print(f"Vowel signs: {char_types['vowel_signs']}")
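
For readers curious how character types can be told apart, the Tamil Unicode block groups them into contiguous ranges. The sketch below counts types by code point range. The function names and the "pulli" and "other" keys are hypothetical, and the counts the library reports in character_types may be defined differently.

```python
from collections import Counter
from typing import Dict


def char_type(ch: str) -> str:
    cp = ord(ch)
    if 0x0B85 <= cp <= 0x0B94:
        return "vowels"        # independent vowels
    if 0x0B95 <= cp <= 0x0BB9:
        return "consonants"
    if 0x0BBE <= cp <= 0x0BCC or cp == 0x0BD7:
        return "vowel_signs"   # dependent vowel signs and the au length mark
    if cp == 0x0BCD:
        return "pulli"         # virama, marks a consonant without a vowel
    return "other"


def count_types(text: str) -> Dict[str, int]:
    # Whitespace is skipped so that it does not inflate the "other" bucket.
    return dict(Counter(char_type(ch) for ch in text if not ch.isspace()))


print(count_types("தமிழ் மொழி"))
```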

Language Detection

from tamil_tokenizer import detect_language, TamilTokenizer

tokenizer = TamilTokenizer()

# Detect language with confidence
texts = [
    "தமிழ் மொழி அழகான மொழி",
    "தமிழ் Tamil மொழி Language",
    "Hello World English Text"
]

for text in texts:
    result = tokenizer.detect_language(text)
    print(f"Text: {text}")
    print(f"Language: {result['primary_language']}")
    print(f"Confidence: {result['confidence']:.2f}")
    print(f"Is Tamil: {result['is_tamil']}")
    print("---")

Text Validation

from tamil_tokenizer import is_valid_tamil_text, TamilTokenizer

tokenizer = TamilTokenizer()

# Validate Tamil text with different thresholds
texts = [
    "தமிழ் மொழி அழகான மொழி",
    "தமிழ் Tamil மொழி",
    "Hello World"
]

for text in texts:
    # Default threshold (50%)
    is_valid_default = tokenizer.is_valid_tamil_text(text)
    
    # Strict threshold (80%)
    is_valid_strict = tokenizer.is_valid_tamil_text(text, min_tamil_percentage=80.0)
    
    print(f"Text: {text}")
    print(f"Valid (50%): {is_valid_default}")
    print(f"Valid (80%): {is_valid_strict}")
    print("---")
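
One plausible way to think about the threshold is as the share of non-whitespace characters that fall inside the Tamil Unicode block. The sketch below implements that definition with hypothetical helpers tamil_percentage and looks_tamil; the library may count characters differently, so treat the exact percentages as an assumption.

```python
def tamil_percentage(text: str) -> float:
    # Whitespace is ignored so that spacing does not dilute the ratio.
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    tamil = sum(1 for ch in chars if "\u0b80" <= ch <= "\u0bff")
    return 100.0 * tamil / len(chars)


def looks_tamil(text: str, min_tamil_percentage: float = 50.0) -> bool:
    return tamil_percentage(text) >= min_tamil_percentage


for sample in ["தமிழ் மொழி", "தமிழ் Tamil மொழி", "Hello World"]:
    print(sample, looks_tamil(sample), looks_tamil(sample, 80.0))
```

Under this definition the mixed sample lands between the two thresholds, so it passes at 50% and fails at 80%.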

Text Statistics

from tamil_tokenizer import TamilTokenizer

tokenizer = TamilTokenizer()
text = "தமிழ் மொழி அழகான மொழி. இது உலகின் பழமையான மொழிகளில் ஒன்று!"

stats = tokenizer.get_statistics(text)
print(f"Total characters: {stats['total_characters']}")
print(f"Tamil characters: {stats['tamil_characters']}")
print(f"Words: {stats['words']}")
print(f"Sentences: {stats['sentences']}")
print(f"Average word length: {stats['average_word_length']:.2f}")
print(f"Average sentence length: {stats['average_sentence_length']:.2f}")

Error Handling

from tamil_tokenizer import tokenize_words
from tamil_tokenizer.exceptions import InvalidTextError, TokenizationError

try:
    words = tokenize_words("")  # Empty text
except InvalidTextError as e:
    print(f"Invalid text: {e}")

try:
    words = tokenize_words(None)  # None text
except InvalidTextError as e:
    print(f"Invalid text: {e}")

Command Line Interface

The library includes a comprehensive CLI tool:

# Basic word tokenization (default)
tamil-tokenizer "தமிழ் மொழி அழகான மொழி"

# Sentence tokenization
tamil-tokenizer --method sentences "வணக்கம். நலமா?"

# Character tokenization
tamil-tokenizer --method characters "தமிழ்"

# Show text statistics
tamil-tokenizer --stats "தமிழ் உரை"

# Clean text
tamil-tokenizer --clean "தமிழ்   உரை"

# Clean text and remove punctuation
tamil-tokenizer --clean --remove-punctuation "தமிழ், உரை!"

# JSON output
tamil-tokenizer --json "தமிழ் மொழி"

# Verbose output
tamil-tokenizer --verbose "தமிழ் மொழி"

CLI Examples

# Basic tokenization
$ tamil-tokenizer "தமிழ் மொழி அழகான மொழி"
தமிழ்
மொழி
அழகான
மொழி

# Sentence tokenization with verbose output
$ tamil-tokenizer --method sentences --verbose "வணக்கம். நலமா?"
Tokenization method: sentences
Input text: வணக்கம். நலமா?
Token count: 2
Tokens:
--------------------
  1. வணக்கம்
  2. நலமா

# Text statistics
$ tamil-tokenizer --stats "தமிழ் மொழி"
Total characters: 9
Tamil characters: 8
Words: 2
Sentences: 1
Average word length: 4.00
Average sentence length: 2.00

# JSON output
$ tamil-tokenizer --json "தமிழ் மொழி"
{
  "method": "words",
  "input_text": "தமிழ் மொழி",
  "tokens": ["தமிழ்", "மொழி"],
  "token_count": 2
}
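
The JSON form is convenient when another program consumes the tokens. The sketch below parses the sample output shown above with the standard json module. parse_cli_json and run_cli are hypothetical helper names, and run_cli assumes that tamil-tokenizer is on PATH and that --json prints nothing but the JSON document.

```python
import json
import subprocess
from typing import Any, Dict


def parse_cli_json(raw: str) -> Dict[str, Any]:
    payload = json.loads(raw)
    # Cheap consistency check on the fields shown in the sample output.
    if payload["token_count"] != len(payload["tokens"]):
        raise ValueError("token_count does not match the tokens list")
    return payload


def run_cli(text: str) -> Dict[str, Any]:
    # Assumption: the command exists and writes only JSON to stdout.
    result = subprocess.run(
        ["tamil-tokenizer", "--json", text],
        capture_output=True, text=True, check=True,
    )
    return parse_cli_json(result.stdout)


# Sample copied from the CLI example above.
SAMPLE = """
{
  "method": "words",
  "input_text": "தமிழ் மொழி",
  "tokens": ["தமிழ்", "மொழி"],
  "token_count": 2
}
"""
payload = parse_cli_json(SAMPLE)
print(payload["method"], payload["token_count"])
```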

API Reference

Functions

tokenize_words(text: str) -> List[str]

Tokenize Tamil text into words.

Parameters:

  • text: Tamil text to tokenize

Returns: List of word tokens

tokenize_sentences(text: str) -> List[str]

Tokenize Tamil text into sentences.

Parameters:

  • text: Tamil text to tokenize

Returns: List of sentence tokens

tokenize_characters(text: str) -> List[str]

Tokenize Tamil text into individual characters.

Parameters:

  • text: Tamil text to tokenize

Returns: List of character tokens (Tamil characters only)

clean_text(text: str, remove_punctuation: bool = False) -> str

Clean Tamil text by normalizing whitespace and optionally removing punctuation.

Parameters:

  • text: Text to clean
  • remove_punctuation: Whether to remove non-Tamil punctuation

Returns: Cleaned text

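normalize_text(text: str) -> str

Normalize Tamil text by cleaning and standardizing format. Since v0.2.0 the Enhanced Text Normalization examples also pass the keyword options form, standardize_digits, standardize_punctuation, and remove_zero_width; the signature here lists only the required argument.

Parameters:

  • text: Text to normalize

Returns: Normalized text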

Classes

TamilTokenizer()

Main class for Tamil text tokenization operations.

Methods:

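  • tokenize(text, method="words"): General tokenization method; method is one of "words", "sentences", "characters", "syllables", or "graphemes"
  • tokenize_words(text): Tokenize into words
  • tokenize_sentences(text): Tokenize into sentences
  • tokenize_characters(text): Tokenize into characters
  • clean_text(text, remove_punctuation=False): Clean text
  • normalize_text(text): Normalize text
  • get_statistics(text): Get text statistics
  • get_script_info(text): Analyze scripts, character types, and complexity
  • detect_language(text): Detect the primary language with a confidence score
  • is_valid_tamil_text(text, min_tamil_percentage=...): Check the text against a minimum Tamil percentage (the default threshold is 50%)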

Exceptions

TamilTokenizerError

Base exception class for tamil-tokenizer library.

InvalidTextError

Raised when invalid text is provided (None, empty, or non-string).

TokenizationError

Raised when tokenization fails due to processing errors.
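
Because the library describes a base exception class, callers can presumably handle every library error with a single except clause. The sketch below demonstrates that pattern with local stand-in classes so that it runs without the package installed. The inheritance shown is an assumption based on the descriptions above, and validate is a hypothetical helper rather than a library function.

```python
# Stand-ins that mirror the documented names; the real classes live in
# tamil_tokenizer.exceptions, and their exact hierarchy is assumed here.
class TamilTokenizerError(Exception):
    """Base class for all errors in this sketch."""


class InvalidTextError(TamilTokenizerError):
    """Input is None, empty, or not a string."""


class TokenizationError(TamilTokenizerError):
    """Processing failed part-way through."""


def validate(text: object) -> str:
    # Mirrors the three documented conditions: None, non-string, empty.
    if not isinstance(text, str) or not text:
        raise InvalidTextError(f"expected non-empty str, got {text!r}")
    return text


for candidate in ["தமிழ்", "", None]:
    try:
        validate(candidate)
        print("ok")
    except TamilTokenizerError as exc:  # catches every subclass
        print(f"{type(exc).__name__}: {exc}")
```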

Development

Setup Development Environment

git clone https://github.com/rajacsp/tamil-tokenizer.git
cd tamil-tokenizer
pip install -e ".[dev]"

Run Tests

pytest

Run Tests with Coverage

pytest --cov=tamil_tokenizer --cov-report=html

Code Formatting

black tamil_tokenizer tests examples

Type Checking

mypy tamil_tokenizer

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Changelog

v0.2.0 (2025-01-07)

  • Enhanced Text Normalization
      • Comprehensive Unicode normalization (NFC, NFD, NFKC, NFKD)
      • Tamil digit standardization (௦-௯ to 0-9)
      • Punctuation standardization and zero-width character removal
  • Script Information Analysis
      • Added get_script_info() for comprehensive script analysis
      • Added detect_language() for language detection with confidence scores
      • Added is_valid_tamil_text() for Tamil text validation
      • Character type analysis and complexity scoring
  • Advanced Features
      • Language detection with confidence scoring
      • Text validation with configurable thresholds
      • Unicode block identification and readability assessment
      • Enhanced convenience functions with full parameter support

v0.1.1 (2025-01-07)

  • Enhanced Tamil Tokenization
  • Added syllable-level tokenization (tokenize_syllables())
  • Added grapheme cluster tokenization (tokenize_graphemes())
  • Added word structure analysis (analyze_word_structure())
  • Improved character tokenization for better Unicode handling
  • Enhanced text statistics with Tamil-specific metrics
  • Better support for Tamil conjunct consonants and vowel signs
  • Advanced Tamil script processing with improved regex patterns
  • Fixed character tokenization test compatibility
  • Enhanced tokenize() method to support "syllables" and "graphemes"
  • Added comprehensive test coverage for new features

v0.1.0 (2025-01-07)

  • Initial release
  • Basic Tamil text tokenization (words, sentences, characters)
  • Text cleaning and normalization
  • Command-line interface
  • Comprehensive test suite
  • Type hints throughout the codebase
  • Modern Python packaging with pyproject.toml

Tamil Language Support

This library is specifically designed for Tamil text processing and uses Unicode ranges for Tamil script (U+0B80–U+0BFF). It handles:

  • Tamil characters and diacritics
  • Common Tamil punctuation
  • Mixed Tamil-English text (extracts Tamil portions)
  • Various sentence ending patterns
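
The Unicode range above translates directly into a regular expression. The sketch below uses the standard re module to pull Tamil runs out of mixed text and to split on a few sentence-ending marks. Both helper names are hypothetical and the set of sentence-ending marks is illustrative; the library itself depends on the third-party regex package and may use richer patterns.

```python
import re
from typing import List

# Runs of code points inside the Tamil block U+0B80..U+0BFF.
TAMIL_RUN = re.compile(r"[\u0B80-\u0BFF]+")
# One or more sentence-ending marks; an illustrative set only.
SENTENCE_END = re.compile(r"[.!?]+")


def tamil_portions(text: str) -> List[str]:
    """Return the Tamil-script runs, skipping other scripts and spaces."""
    return TAMIL_RUN.findall(text)


def split_sentences(text: str) -> List[str]:
    """Split on sentence-ending marks and drop empty pieces."""
    parts = (piece.strip() for piece in SENTENCE_END.split(text))
    return [piece for piece in parts if piece]


print(tamil_portions("தமிழ் Tamil மொழி Language"))
print(split_sentences("வணக்கம். நலமா?"))
```

A split this simple will also break on abbreviations and decimal points, which is one reason a dedicated tokenizer is useful.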

Acknowledgments

  • The Tamil language community for inspiration
  • The Python community for excellent libraries like regex
  • Contributors and users who help improve this library

Support

If you encounter any issues or have questions, please:

  1. Check the documentation
  2. Search existing issues
  3. Create a new issue if needed

For general questions, you can also reach out via email: raja.csp@gmail.com
