Advanced multi-format tokenization system with numerology, hashing, compression, and embeddings
SanTOK - Advanced Multi-Format Tokenization System
SanTOK (Sanitized Tokenization) is an advanced multi-format tokenization system that provides 9 different tokenization methods with integrated numerology, hashing, compression, and embedding capabilities.
🚀 Features
Core Tokenization Methods
- Space Tokenization: Splits text on whitespace
- Word Tokenization: Extracts words using regex patterns
- Character Tokenization: Character-by-character analysis
- Grammar Tokenization: Separates words, numbers, and punctuation
- Subword Tokenization: Fixed-length chunking
- Byte Tokenization: ASCII value representation
- BPE Tokenization: Byte Pair Encoding
- Syllable Tokenization: Vowel-based splitting
- Frequency Tokenization: Word frequency analysis
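Several of the methods listed above can be approximated in a few lines of plain Python. The sketch below is illustrative only: the function names, the regular expressions, and the default chunk length of 3 are assumptions made for this example and are not taken from SanTOK's source.

```python
import re

def space_tokens(text):
    # Split on runs of whitespace; this simple version drops the whitespace itself.
    return text.split()

def word_tokens(text):
    # Keep runs of word characters, so punctuation is discarded.
    return re.findall(r"\w+", text)

def char_tokens(text):
    # One token per character, including spaces and punctuation.
    return list(text)

def grammar_tokens(text):
    # ASSUMPTION: ASCII letters form words, digit runs form numbers, and any
    # other non-whitespace character is emitted as a single punctuation token.
    return re.findall(r"[A-Za-z]+|\d+|[^A-Za-z\d\s]", text)

def subword_tokens(text, size=3):
    # ASSUMPTION: fixed-length chunks are cut inside each word, not across words.
    return [word[i:i + size]
            for word in re.findall(r"\w+", text)
            for i in range(0, len(word), size)]

def byte_tokens(text):
    # UTF-8 byte values; for pure ASCII text these equal the ASCII codes.
    return list(text.encode("utf-8"))

sample = "Hello world! 42"
print(space_tokens(sample))    # ['Hello', 'world!', '42']
print(grammar_tokens(sample))  # ['Hello', 'world', '!', '42']
```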
Advanced Features
- Numerology Integration: 9-centric digital root calculations
- Hash-Driven Embeddings: Stable across vocabularies
- Lossless Reconstruction: Exact text reconstruction for the space, character, and BPE methods
- Multi-Format Output: JSON, CSV, TXT, XML, Excel, Parquet, Avro
- High Performance: Concurrent and async processing
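To make the numerology and embedding features above more concrete, here is a minimal sketch of a digital root and a hash-derived vector. How SanTOK turns a token into a number or a vector is not documented here, so the code-point sum in `token_digit` and the SHA-256 scaling in `hash_embedding` are assumptions chosen for illustration, and both function names are hypothetical.

```python
import hashlib

def digital_root(n):
    """Digital root of a non-negative integer: 0 for 0, otherwise a digit 1-9."""
    return 0 if n == 0 else 1 + (n - 1) % 9

def token_digit(token):
    # ASSUMPTION: the token is reduced to a number by summing its code points.
    return digital_root(sum(ord(ch) for ch in token))

def hash_embedding(token, dim=8):
    # ASSUMPTION: the first `dim` bytes of a SHA-256 digest (at most 32) are
    # scaled into [-1, 1]. The vector depends only on the token text, so it is
    # the same regardless of which vocabulary the token appears in.
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [b / 127.5 - 1.0 for b in digest[:dim]]

print(digital_root(38))  # 2, since 3 + 8 = 11 and 1 + 1 = 2
print(hash_embedding("hello") == hash_embedding("hello"))  # True
```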
📦 Installation
pip install santok
🎯 Quick Start
import santok
# Basic usage
text = "Hello world!"
result = santok.all_tokenizations(text)
# Access different tokenization methods
space_tokens = result['space']
char_tokens = result['char']
word_tokens = result['word']
print(f"Space tokens: {space_tokens}")
print(f"Character tokens: {char_tokens}")
print(f"Word tokens: {word_tokens}")
# Numerology calculation
numerology = santok.numerology_sum(text)
print(f"Numerology sum: {numerology}")
📊 Output Format
Each tokenization method returns a list of dictionaries:
[
{'text': 'Hello', 'frontend': 1},
{'text': 'world!', 'frontend': 2}
]
Where:
- text: The actual token
- frontend: Numerological frontend digit (1-9)
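Because each result is a plain list of dictionaries, it can be processed with ordinary Python. The sketch below groups token texts by their frontend digit; it reuses the sample data shown above, and `group_by_frontend` is a hypothetical helper, not part of the SanTOK API.

```python
from collections import defaultdict

# Sample data in the documented shape.
tokens = [
    {'text': 'Hello', 'frontend': 1},
    {'text': 'world!', 'frontend': 2},
]

def group_by_frontend(tokens):
    # Collect token texts under their frontend digit, preserving input order.
    groups = defaultdict(list)
    for tok in tokens:
        groups[tok['frontend']].append(tok['text'])
    return dict(groups)

print(group_by_frontend(tokens))  # {1: ['Hello'], 2: ['world!']}
```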
🔄 Lossless Reconstruction System
Lossless Reconstruction Methods (3/9 methods)
- ✅ SPACE: Preserves all whitespace and punctuation perfectly
- ✅ CHAR: Character-by-character perfect preservation
- ✅ BPE: Advanced subword with full structure preservation
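A tokenization is lossless when joining its tokens gives back the input exactly. The following sketch shows one way to get that property for whitespace-based splitting: whitespace runs are kept as tokens instead of being dropped. This is an assumption about how such a scheme could work, not a description of SanTOK's internals, and the function names are hypothetical.

```python
import re

def lossless_space_tokens(text):
    # Alternate runs of non-whitespace and whitespace so no character is lost.
    return re.findall(r"\S+|\s+", text)

def reconstruct(tokens):
    # Concatenating the tokens in order restores the original text.
    return "".join(tokens)

text = "Hello,  world!\n  Tabs\there "
assert reconstruct(lossless_space_tokens(text)) == text
assert reconstruct(list(text)) == text  # character tokens round-trip as well
```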
Analytical Methods (6/9 methods - Transform text for analysis)
- 🔄 WORD: Extracts words for linguistic analysis (removes punctuation by design)
- 🔄 GRAMMAR: Parses grammatical elements (removes spacing by design)
- 🔄 SUBWORD: Fixed-length chunking for subword modeling (transforms by design)
- 🔄 BYTE: ASCII representation for byte-level analysis (different format by design)
- 🔄 SYLLABLE: Syllable extraction for phonetic analysis (removes spacing by design)
- 🔄 FREQUENCY: Adds frequency metadata for statistical analysis (enhances by design)
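Of the analytical methods above, frequency tokenization is the one that adds information to each token. A minimal sketch, assuming lowercase word matching and a hypothetical `count` key (SanTOK's real field names and normalization may differ):

```python
import re
from collections import Counter

def frequency_tokens(text):
    # ASSUMPTION: words are lowercased runs of word characters.
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    # Attach to every occurrence the total count of that word in the text.
    return [{"text": w, "count": counts[w]} for w in words]

for tok in frequency_tokens("The cat saw the dog"):
    print(tok)
```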
🛠️ Advanced Usage
CLI Usage
santok
Programmatic Usage
import santok
# Get all tokenizations
result = santok.all_tokenizations("Your text here")
# Calculate numerology
numerology = santok.numerology_sum("Your text here")
# Run main function
santok.main()
📈 Performance
- Concurrent Processing: Multi-threaded tokenization
- Async Support: Asynchronous processing for large texts
- Memory Efficient: Stream processing for large datasets
- High Speed: Optimized algorithms for maximum performance
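The concurrency and streaming ideas above can be sketched with the standard library alone. The example uses a stand-in whitespace tokenizer so that it runs without SanTOK installed; `tokenize_many` and `stream_tokens` are hypothetical names, and how SanTOK schedules its own work is not specified here.

```python
from concurrent.futures import ThreadPoolExecutor

def tokenize(text):
    # Stand-in tokenizer for this sketch: plain whitespace split.
    return text.split()

def tokenize_many(texts, max_workers=4):
    # Tokenize several texts on a thread pool; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(tokenize, texts))

def stream_tokens(lines):
    # Yield tokens one line at a time so the whole input never sits in memory.
    for line in lines:
        yield from tokenize(line)

print(tokenize_many(["a b", "c"]))            # [['a', 'b'], ['c']]
print(list(stream_tokens(["a b\n", "c\n"])))  # ['a', 'b', 'c']
```

`stream_tokens` accepts any iterable of strings, including an open text file, which is what keeps memory use bounded for large inputs.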
🔧 Requirements
- Python 3.8+
- No external dependencies (pure Python)
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
👨‍💻 Author
Santosh chavala
- Email: chavalasantosh@hotmail.com
- GitHub: @chavalasantosh
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
📝 Changelog
See CHANGELOG.md for a list of changes and version history.
🔗 Links
SanTOK - Advanced Multi-Format Tokenization System by Santosh chavala
File details
Details for the file santok-1.0.6.tar.gz.
File metadata
- Download URL: santok-1.0.6.tar.gz
- Upload date:
- Size: 43.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c049355a64257d836cbcfba86f003204e529da5acffad26e979298f09b765c38 |
| MD5 | ff6949d9ebddc1672e6090509dc18546 |
| BLAKE2b-256 | 3f0b3e31ec0faa340709d6567696fc0884f78ce981c16bf8c76e86c328f8f282 |
File details
Details for the file santok-1.0.6-py3-none-any.whl.
File metadata
- Download URL: santok-1.0.6-py3-none-any.whl
- Upload date:
- Size: 6.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b4dd94f4f962a185432ba28a766afb098550f37d92cf8d739e8c1c54ffadf181 |
| MD5 | cac373d8ee996cd5929f6ba993ea57a4 |
| BLAKE2b-256 | fd544c8533bff075e451f6152ac5d6958be7b11f1ae70805e1ccf9d65e0cb330 |
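A downloaded file can be checked against the published SHA256 digest with Python's hashlib. This is a minimal sketch: `sha256_of` and `matches` are hypothetical helper names, and the file path in the usage comment assumes the archive was saved in the current directory.

```python
import hashlib

def sha256_of(path, chunk_size=65536):
    # Hash the file in chunks so large downloads do not need to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches(path, expected_hex):
    # Compare case-insensitively, ignoring surrounding whitespace in the expected value.
    return sha256_of(path) == expected_hex.strip().lower()

# Usage (assumes the wheel is in the current directory):
# matches("santok-1.0.6-py3-none-any.whl",
#         "b4dd94f4f962a185432ba28a766afb098550f37d92cf8d739e8c1c54ffadf181")
```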