tamil-tokenizer
A simple and efficient Tamil text tokenizer library with modern Python structure.
Features
- Tamil Text Tokenization: Comprehensive tokenization for Tamil text
- Multiple Tokenization Methods: Word, sentence, character, syllable, and grapheme-level tokenization
- Enhanced Text Normalization: Unicode normalization, digit standardization, punctuation standardization
- Script Information Analysis: Comprehensive script detection, complexity scoring, and readability assessment
- Language Detection: Automatic Tamil language detection with confidence scores
- Text Validation: Tamil text validation with configurable thresholds
- Character Type Analysis: Detailed analysis of vowels, consonants, conjuncts, and other character types
- Modern Python API: Clean, type-hinted interface with both functional and object-oriented approaches
- Command Line Interface: Full-featured CLI tool for Tamil text processing
- Fast Processing: Efficient regex-based operations
- Error Handling: Comprehensive exception handling with meaningful error messages
- Well Tested: Extensive test suite with high coverage
- Type Hints: Full type annotation support for better IDE experience
Installation
pip install tamil-tokenizer
Dependencies
- Python 3.8+
- regex >= 2022.0.0
Optional Dependencies
For development:
pip install "tamil-tokenizer[dev]"
Quick Start
from tamil_tokenizer import tokenize_words, tokenize_sentences, TamilTokenizer
# Quick tokenization
words = tokenize_words("தமிழ் மொழி அழகான மொழி")
print(f"Words: {words}")
sentences = tokenize_sentences("வணக்கம். நீங்கள் எப்படி இருக்கிறீர்கள்?")
print(f"Sentences: {sentences}")
# Using TamilTokenizer class
tokenizer = TamilTokenizer()
tokens = tokenizer.tokenize("தமிழ் உரை", method="words")
print(f"Tokens: {tokens}")
Usage Examples
Basic Tokenization
from tamil_tokenizer import tokenize_words, tokenize_sentences, tokenize_characters
# Word tokenization
text = "தமிழ் மொழி அழகான மொழி"
words = tokenize_words(text)
print(f"Words: {words}")
# Output: ['தமிழ்', 'மொழி', 'அழகான', 'மொழி']
# Sentence tokenization
text = "வணக்கம். நீங்கள் எப்படி இருக்கிறீர்கள்? நன்றாக இருக்கிறேன்!"
sentences = tokenize_sentences(text)
print(f"Sentences: {sentences}")
# Output: ['வணக்கம்', 'நீங்கள் எப்படி இருக்கிறீர்கள்', 'நன்றாக இருக்கிறேன்']
# Character tokenization
text = "தமிழ்"
characters = tokenize_characters(text)
print(f"Characters: {characters}")
# Output: ['த', 'ம', 'ி', 'ழ', '்']
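The character output above lists individual Unicode code points, which is why the vowel sign ி and the pulli ் appear as separate items. As a rough illustration in plain Python, the sketch below contrasts code points with a simplified grouping that attaches combining marks to the preceding base letter. The helper name group_marks is hypothetical, the grouping is an approximation rather than full Unicode grapheme segmentation, and it is not claimed to match the library's own tokenize_graphemes().

```python
import unicodedata
from typing import List


def group_marks(text: str) -> List[str]:
    """Attach combining marks (general categories M*) to the preceding character."""
    clusters: List[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # vowel sign or pulli joins its base letter
        else:
            clusters.append(ch)  # a new base character starts a new cluster
    return clusters


word = "தமிழ்"
print(list(word))         # one item per code point
print(group_marks(word))  # base letters together with their marks
```

For this word the first call yields five items and the second yields three, which is closer to what a reader perceives as letters.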
Using TamilTokenizer Class
from tamil_tokenizer import TamilTokenizer
# Create tokenizer instance
tokenizer = TamilTokenizer()
# General tokenization method
text = "தமிழ் மொழி அழகான மொழி"
words = tokenizer.tokenize(text, method="words")
sentences = tokenizer.tokenize(text, method="sentences")
characters = tokenizer.tokenize(text, method="characters")
print(f"Words: {words}")
print(f"Sentences: {sentences}")
print(f"Characters: {characters}")
Text Cleaning and Normalization
from tamil_tokenizer import clean_text, normalize_text, TamilTokenizer
# Clean text with extra whitespace
messy_text = " தமிழ் மொழி அழகு "
cleaned = clean_text(messy_text)
print(f"Cleaned: '{cleaned}'")
# Output: 'தமிழ் மொழி அழகு'
# Clean text and remove punctuation
tokenizer = TamilTokenizer()
text_with_punct = "தமிழ், மொழி! அழகு?"
cleaned_no_punct = tokenizer.clean_text(text_with_punct, remove_punctuation=True)
print(f"No punctuation: '{cleaned_no_punct}'")
# Output: 'தமிழ் மொழி அழகு'
# Normalize text
normalized = normalize_text(messy_text)
print(f"Normalized: '{normalized}'")
# Output: 'தமிழ் மொழி அழகு'
Enhanced Text Normalization
from tamil_tokenizer import normalize_text, TamilTokenizer
tokenizer = TamilTokenizer()
# Comprehensive normalization with all options
text = " தமிழ்—௧௨௩\u200Cமொழி…அழகான—மொழி "
normalized = tokenizer.normalize_text(
    text,
    form="NFC",                    # Unicode normalization
    standardize_digits=True,       # Tamil digits to Arabic
    standardize_punctuation=True,  # Standardize punctuation
    remove_zero_width=True         # Remove invisible characters
)
print(f"Normalized: '{normalized}'")
# Output: 'தமிழ்-123மொழி...அழகான-மொழி'
# Tamil digit standardization
text_with_digits = "தமிழ் ௧௨௩௪ வருடங்கள் பழமையான மொழி"
standardized = normalize_text(text_with_digits, standardize_digits=True)
print(f"Standardized: {standardized}")
# Output: 'தமிழ் 1234 வருடங்கள் பழமையான மொழி'
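To make the normalization steps above more concrete, here is a minimal standard-library sketch of the same ideas: Unicode normalization, Tamil digit mapping, punctuation mapping, and zero-width character removal. It is not the library's implementation. The function name normalize_sketch and the exact punctuation mapping are assumptions chosen for illustration.

```python
import unicodedata

# Tamil digits ௦-௯ occupy U+0BE6..U+0BEF.
TAMIL_DIGITS = {0x0BE6 + i: str(i) for i in range(10)}
# Assumed punctuation mapping (en dash, em dash, ellipsis), for illustration only.
PUNCTUATION = {0x2013: "-", 0x2014: "-", 0x2026: "..."}
# Zero-width space, non-joiner, joiner, and BOM; mapping to None deletes them.
ZERO_WIDTH = {0x200B: None, 0x200C: None, 0x200D: None, 0xFEFF: None}


def normalize_sketch(text: str, form: str = "NFC") -> str:
    """Apply Unicode normalization, then character mappings, then whitespace cleanup."""
    table = {**TAMIL_DIGITS, **PUNCTUATION, **ZERO_WIDTH}
    text = unicodedata.normalize(form, text)
    text = text.translate(table)
    return " ".join(text.split())  # trim ends and collapse runs of whitespace


print(normalize_sketch(" தமிழ்—௧௨௩\u200cமொழி…அழகான—மொழி "))
```

A single translation table keeps the per-character work in one pass. Note that removing U+200C and U+200D can change how some text renders, so a real pipeline may want that step to stay optional, as it is in the example above.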
Script Information Analysis
from tamil_tokenizer import get_script_info, TamilTokenizer
tokenizer = TamilTokenizer()
# Comprehensive script analysis
text = "தமிழ் மொழி உலகின் பழமையான மொழிகளில் ஒன்று"
info = tokenizer.get_script_info(text)
print(f"Tamil percentage: {info['tamil_percentage']:.1f}%")
print(f"Complexity score: {info['complexity_score']:.2f}")
print(f"Readability level: {info['readability_level']}")
print(f"Scripts detected: {info['scripts_detected']}")
print(f"Has conjuncts: {info['has_conjuncts']}")
print(f"Unicode blocks: {info['unicode_blocks']}")
# Character type analysis
char_types = info['character_types']
print(f"Vowels: {char_types['vowels']}")
print(f"Consonants: {char_types['consonants']}")
print(f"Vowel signs: {char_types['vowel_signs']}")
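The character type counts above can be approximated directly from code point ranges in the Tamil block. The sketch below is an assumption about how such a classification might work, not the library's code. The keys vowels, consonants, and vowel_signs follow the example above, while virama, digits, and other are hypothetical labels added here.

```python
from collections import Counter
from typing import Dict


def classify(ch: str) -> str:
    """Map one character to a coarse Tamil character type by code point."""
    cp = ord(ch)
    if 0x0B85 <= cp <= 0x0B94:
        return "vowels"        # independent vowels அ..ஔ
    if 0x0B95 <= cp <= 0x0BB9:
        return "consonants"    # க..ஹ
    if 0x0BBE <= cp <= 0x0BCC:
        return "vowel_signs"   # dependent vowel signs
    if cp == 0x0BCD:
        return "virama"        # pulli
    if 0x0BE6 <= cp <= 0x0BEF:
        return "digits"        # ௦..௯
    return "other"


def count_types(text: str) -> Dict[str, int]:
    """Count character types, ignoring whitespace."""
    return dict(Counter(classify(ch) for ch in text if not ch.isspace()))


print(count_types("தமிழ் மொழி"))
```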
Language Detection
from tamil_tokenizer import detect_language, TamilTokenizer
tokenizer = TamilTokenizer()
# Detect language with confidence
texts = [
    "தமிழ் மொழி அழகான மொழி",
    "தமிழ் Tamil மொழி Language",
    "Hello World English Text"
]
for text in texts:
    result = tokenizer.detect_language(text)
    print(f"Text: {text}")
    print(f"Language: {result['primary_language']}")
    print(f"Confidence: {result['confidence']:.2f}")
    print(f"Is Tamil: {result['is_tamil']}")
    print("---")
Text Validation
from tamil_tokenizer import is_valid_tamil_text, TamilTokenizer
tokenizer = TamilTokenizer()
# Validate Tamil text with different thresholds
texts = [
    "தமிழ் மொழி அழகான மொழி",
    "தமிழ் Tamil மொழி",
    "Hello World"
]
for text in texts:
    # Default threshold (50%)
    is_valid_default = tokenizer.is_valid_tamil_text(text)
    # Strict threshold (80%)
    is_valid_strict = tokenizer.is_valid_tamil_text(text, min_tamil_percentage=80.0)
    print(f"Text: {text}")
    print(f"Valid (50%): {is_valid_default}")
    print(f"Valid (80%): {is_valid_strict}")
    print("---")
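One plausible way to read the thresholds above is as the share of non-whitespace characters that fall in the Tamil Unicode block. The sketch below assumes that definition; the library may count differently (for example by excluding punctuation), and the names tamil_percentage and is_mostly_tamil are hypothetical.

```python
def tamil_percentage(text: str) -> float:
    """Percentage of non-whitespace characters in the Tamil block U+0B80..U+0BFF."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    tamil = sum(1 for ch in chars if "\u0b80" <= ch <= "\u0bff")
    return 100.0 * tamil / len(chars)


def is_mostly_tamil(text: str, min_tamil_percentage: float = 50.0) -> bool:
    """Threshold check under the assumed percentage definition."""
    return tamil_percentage(text) >= min_tamil_percentage


for sample in ("தமிழ் மொழி", "தமிழ் Tamil", "Hello World"):
    print(sample, round(tamil_percentage(sample), 1), is_mostly_tamil(sample, 80.0))
```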
Text Statistics
from tamil_tokenizer import TamilTokenizer
tokenizer = TamilTokenizer()
text = "தமிழ் மொழி அழகான மொழி. இது உலகின் பழமையான மொழிகளில் ஒன்று!"
stats = tokenizer.get_statistics(text)
print(f"Total characters: {stats['total_characters']}")
print(f"Tamil characters: {stats['tamil_characters']}")
print(f"Words: {stats['words']}")
print(f"Sentences: {stats['sentences']}")
print(f"Average word length: {stats['average_word_length']:.2f}")
print(f"Average sentence length: {stats['average_sentence_length']:.2f}")
Error Handling
from tamil_tokenizer import tokenize_words
from tamil_tokenizer.exceptions import InvalidTextError, TokenizationError
try:
    words = tokenize_words("")  # Empty text
except InvalidTextError as e:
    print(f"Invalid text: {e}")
try:
    words = tokenize_words(None)  # None text
except InvalidTextError as e:
    print(f"Invalid text: {e}")
Command Line Interface
The library includes a comprehensive CLI tool:
# Basic word tokenization (default)
tamil-tokenizer "தமிழ் மொழி அழகான மொழி"
# Sentence tokenization
tamil-tokenizer --method sentences "வணக்கம். நலமா?"
# Character tokenization
tamil-tokenizer --method characters "தமிழ்"
# Show text statistics
tamil-tokenizer --stats "தமிழ் உரை"
# Clean text
tamil-tokenizer --clean "தமிழ் உரை"
# Clean text and remove punctuation
tamil-tokenizer --clean --remove-punctuation "தமிழ், உரை!"
# JSON output
tamil-tokenizer --json "தமிழ் மொழி"
# Verbose output
tamil-tokenizer --verbose "தமிழ் மொழி"
CLI Examples
# Basic tokenization
$ tamil-tokenizer "தமிழ் மொழி அழகான மொழி"
தமிழ்
மொழி
அழகான
மொழி
# Sentence tokenization with verbose output
$ tamil-tokenizer --method sentences --verbose "வணக்கம். நலமா?"
Tokenization method: sentences
Input text: வணக்கம். நலமா?
Token count: 2
Tokens:
--------------------
1. வணக்கம்
2. நலமா
# Text statistics
$ tamil-tokenizer --stats "தமிழ் மொழி"
Total characters: 9
Tamil characters: 8
Words: 2
Sentences: 1
Average word length: 4.00
Average sentence length: 2.00
# JSON output
$ tamil-tokenizer --json "தமிழ் மொழி"
{
"method": "words",
"input_text": "தமிழ் மொழி",
"tokens": ["தமிழ்", "மொழி"],
"token_count": 2
}
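The JSON form is convenient when another program consumes the tokens. As a small sketch, the snippet below parses the sample output shown above with the standard json module. The string is copied from that example; in practice it would be the captured standard output of the command.

```python
import json

# Sample copied from the CLI example above.
raw = """{
  "method": "words",
  "input_text": "தமிழ் மொழி",
  "tokens": ["தமிழ்", "மொழி"],
  "token_count": 2
}"""

result = json.loads(raw)

# Basic consistency check before using the tokens downstream.
if result["token_count"] != len(result["tokens"]):
    raise ValueError("token_count does not match the tokens list")

for index, token in enumerate(result["tokens"], start=1):
    print(index, token)
```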
API Reference
Functions
tokenize_words(text: str) -> List[str]
Tokenize Tamil text into words.
Parameters:
text: Tamil text to tokenize
Returns: List of word tokens
tokenize_sentences(text: str) -> List[str]
Tokenize Tamil text into sentences.
Parameters:
text: Tamil text to tokenize
Returns: List of sentence tokens
tokenize_characters(text: str) -> List[str]
Tokenize Tamil text into individual characters.
Parameters:
text: Tamil text to tokenize
Returns: List of character tokens (Tamil characters only)
clean_text(text: str, remove_punctuation: bool = False) -> str
Clean Tamil text by normalizing whitespace and optionally removing punctuation.
Parameters:
text: Text to clean
remove_punctuation: Whether to remove non-Tamil punctuation
Returns: Cleaned text
normalize_text(text: str) -> str
Normalize Tamil text by cleaning and standardizing format.
Parameters:
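text: Text to normalize
form, standardize_digits, standardize_punctuation, remove_zero_width: Optional keyword arguments, used as shown under Enhanced Text Normalization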
Returns: Normalized text
Classes
TamilTokenizer()
Main class for Tamil text tokenization operations.
Methods:
- tokenize(text, method="words"): General tokenization method
- tokenize_words(text): Tokenize into words
- tokenize_sentences(text): Tokenize into sentences
- tokenize_characters(text): Tokenize into characters
- clean_text(text, remove_punctuation=False): Clean text
- normalize_text(text): Normalize text
- get_statistics(text): Get text statistics
Exceptions
TamilTokenizerError
Base exception class for tamil-tokenizer library.
InvalidTextError
Raised when invalid text is provided (None, empty, or non-string).
TokenizationError
Raised when tokenization fails due to processing errors.
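If the two specific exceptions derive from TamilTokenizerError, as the "base exception class" wording suggests, a single except clause on the base class handles both. The sketch below uses locally defined stand-in classes so that it runs without the library installed. The hierarchy is an assumption, and check_text is a hypothetical helper that mirrors the documented conditions (None, empty, or non-string).

```python
# Stand-in classes for illustration; the real ones live in
# tamil_tokenizer.exceptions and their exact hierarchy is assumed here.
class TamilTokenizerError(Exception):
    """Base class for all tokenizer errors."""


class InvalidTextError(TamilTokenizerError):
    """Input was None, empty, or not a string."""


class TokenizationError(TamilTokenizerError):
    """Processing failed during tokenization."""


def check_text(text: object) -> str:
    """Hypothetical validation step that rejects None, empty, and non-string input."""
    if not isinstance(text, str) or not text:
        raise InvalidTextError(f"expected non-empty text, got {text!r}")
    return text


for candidate in ("", None, "தமிழ்"):
    try:
        check_text(candidate)
        print("ok:", candidate)
    except TamilTokenizerError as exc:  # also catches the subclasses
        print("rejected:", exc)
```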
Development
Setup Development Environment
git clone https://github.com/rajacsp/tamil-tokenizer.git
cd tamil-tokenizer
pip install -e ".[dev]"
Run Tests
pytest
Run Tests with Coverage
pytest --cov=tamil_tokenizer --cov-report=html
Code Formatting
black tamil_tokenizer tests examples
Type Checking
mypy tamil_tokenizer
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Changelog
v0.2.0 (2025-01-07)
- Enhanced Text Normalization
- Comprehensive Unicode normalization (NFC, NFD, NFKC, NFKD)
- Tamil digit standardization (௦-௯ to 0-9)
- Punctuation standardization and zero-width character removal
- Script Information Analysis
- Added get_script_info() for comprehensive script analysis
- Added detect_language() for language detection with confidence scores
- Added is_valid_tamil_text() for Tamil text validation
- Character type analysis and complexity scoring
- Advanced Features
- Language detection with confidence scoring
- Text validation with configurable thresholds
- Unicode block identification and readability assessment
- Enhanced convenience functions with full parameter support
v0.1.1 (2025-01-07)
- Enhanced Tamil Tokenization
- Added syllable-level tokenization (tokenize_syllables())
- Added grapheme cluster tokenization (tokenize_graphemes())
- Added word structure analysis (analyze_word_structure())
- Improved character tokenization for better Unicode handling
- Enhanced text statistics with Tamil-specific metrics
- Better support for Tamil conjunct consonants and vowel signs
- Advanced Tamil script processing with improved regex patterns
- Fixed character tokenization test compatibility
- Enhanced tokenize() method to support "syllables" and "graphemes"
- Added comprehensive test coverage for new features
v0.1.0 (2025-01-07)
- Initial release
- Basic Tamil text tokenization (words, sentences, characters)
- Text cleaning and normalization
- Command-line interface
- Comprehensive test suite
- Type hints throughout the codebase
- Modern Python packaging with pyproject.toml
Tamil Language Support
This library is specifically designed for Tamil text processing and uses Unicode ranges for Tamil script (U+0B80–U+0BFF). It handles:
- Tamil characters and diacritics
- Common Tamil punctuation
- Mixed Tamil-English text (extracts Tamil portions)
- Various sentence ending patterns
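As an illustration of the range-based approach described above, the following sketch pulls the Tamil portions out of mixed Tamil-English text with the standard re module. It relies only on the documented block U+0B80 to U+0BFF; the function name tamil_portions is hypothetical and the library's own patterns may be more elaborate.

```python
import re
from typing import List

# One or more consecutive code points from the Tamil block.
TAMIL_RUN = re.compile(r"[\u0B80-\u0BFF]+")


def tamil_portions(text: str) -> List[str]:
    """Return each maximal run of Tamil-block characters, in order."""
    return TAMIL_RUN.findall(text)


print(tamil_portions("தமிழ் Tamil மொழி Language"))
```

Because vowel signs and the pulli live in the same block as the base letters, each word stays intact as one run.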
Acknowledgments
- The Tamil language community for inspiration
- The Python community for excellent libraries like regex
- Contributors and users who help improve this library
Support
If you encounter any issues or have questions, please:
- Check the documentation
- Search existing issues
- Create a new issue if needed
For general questions, you can also reach out via email: raja.csp@gmail.com