Advanced multi-format tokenization system with numerology, hashing, compression, and embeddings

SanTOK - Advanced Multi-Format Tokenization System

SanTOK (Sanitized Tokenization) is a multi-format tokenization system that provides nine tokenization methods with integrated numerology, hashing, compression, and embedding capabilities.

🚀 Features

Core Tokenization Methods

Space Tokenization: Splits text on whitespace
Word Tokenization: Extracts words using regex patterns
Character Tokenization: Character-by-character analysis
Grammar Tokenization: Separates words, numbers, and punctuation
Subword Tokenization: Fixed-length chunking
Byte Tokenization: ASCII value representation
BPE Tokenization: Byte Pair Encoding
Syllable Tokenization: Vowel-based splitting
Frequency Tokenization: Word frequency analysis
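
The package's internals are not shown in this description, so the following is only a rough sketch of how several of the listed strategies could be approximated with the standard library. Every function name and regex below is an assumption made for illustration; none of them is part of SanTOK's API, and the real implementations may behave differently.

```python
import re

# Hypothetical helpers for illustration only; not SanTOK functions.

def space_tokens(text):
    # Split on runs of whitespace (the whitespace itself is discarded here).
    return text.split()

def word_tokens(text):
    # Keep runs of word characters; punctuation is dropped.
    return re.findall(r"\w+", text)

def char_tokens(text):
    # One token per character, spaces included.
    return list(text)

def grammar_tokens(text):
    # Separate ASCII words, numbers, and single punctuation marks.
    return re.findall(r"[A-Za-z]+|\d+|[^\w\s]", text)

def subword_tokens(text, size=3):
    # Cut each word into fixed-length chunks; the last chunk may be shorter.
    return [
        word[i:i + size]
        for word in re.findall(r"\w+", text)
        for i in range(0, len(word), size)
    ]

def byte_tokens(text):
    # UTF-8 byte values; for pure ASCII text these equal the ASCII codes.
    return list(text.encode("utf-8"))
```

For example, `grammar_tokens("Pay 42 now!")` yields the word, the number, and the exclamation mark as separate tokens, while `word_tokens` on the same input drops the punctuation.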

Advanced Features

Numerology Integration: 9-centric digital root calculations
Hash-Driven Embeddings: Stable across vocabularies
Lossless Reconstruction: Perfect text reconstruction for the space, character, and BPE methods
Multi-Format Output: JSON, CSV, TXT, XML, Excel, Parquet, Avro
High Performance: Concurrent and async processing
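
One plausible reading of "stable across vocabularies" is that a token's vector is derived from the token itself rather than from its position in a vocabulary table. The sketch below illustrates that idea under stated assumptions: SHA-256 as the hash, an 8-dimensional output, and a hypothetical `hash_embedding` helper. It is not SanTOK's embedding scheme.

```python
import hashlib

def hash_embedding(token, dim=8):
    """Map a token to `dim` floats in [-1, 1] using only its own bytes.

    Hypothetical helper: the hash function and scaling are assumptions.
    """
    digest = hashlib.sha256(token.encode("utf-8")).digest()  # 32 bytes
    if not 1 <= dim <= len(digest):
        raise ValueError("dim must be between 1 and 32 for this sketch")
    # Rescale each byte from 0..255 to -1.0..1.0.
    return [b / 127.5 - 1.0 for b in digest[:dim]]

vec_a = hash_embedding("hello")
vec_b = hash_embedding("hello")
```

Because no vocabulary lookup is involved, `vec_a` and `vec_b` are identical on any machine and in any run, which is the property the feature name suggests.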

📦 Installation

pip install santok

🎯 Quick Start

import santok

# Basic usage
text = "Hello world!"
result = santok.all_tokenizations(text)

# Access different tokenization methods
space_tokens = result['space']
char_tokens = result['char']
word_tokens = result['word']

print(f"Space tokens: {space_tokens}")
print(f"Character tokens: {char_tokens}")
print(f"Word tokens: {word_tokens}")

# Numerology calculation
numerology = santok.numerology_sum(text)
print(f"Numerology sum: {numerology}")
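
How `numerology_sum` computes its result is not documented here. As background, a 9-centric digital root repeatedly sums the decimal digits of a number until one digit remains, which has the closed form `1 + (n - 1) % 9` for positive `n`. The sketch below shows that arithmetic; summing character code points in `text_digital_root` is purely an assumption and may not match what SanTOK does.

```python
def digital_root(n):
    # Closed form of "sum the digits until a single digit remains".
    if n < 0:
        raise ValueError("n must be non-negative")
    return 0 if n == 0 else 1 + (n - 1) % 9

def text_digital_root(text):
    # Assumption for illustration: reduce the sum of Unicode code points.
    return digital_root(sum(ord(ch) for ch in text))

# 38 -> 3 + 8 = 11 -> 1 + 1 = 2
root_38 = digital_root(38)
```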

📊 Output Format

Each tokenization method returns a list of dictionaries:

[
    {'text': 'Hello', 'frontend': 1},
    {'text': 'world!', 'frontend': 2}
]

Where:

text: The actual token
frontend: Numerological frontend digit (1-9)
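
Because each result is a plain list of dictionaries, it can be written to JSON or CSV with the standard library alone. The sketch below reuses the sample records shown above instead of calling the package; `to_json` and `to_csv` are hypothetical helper names and are not SanTOK's export API.

```python
import csv
import io
import json

# Sample records copied from the output format above.
tokens = [
    {"text": "Hello", "frontend": 1},
    {"text": "world!", "frontend": 2},
]

def to_json(records):
    # Pretty-printed JSON; ensure_ascii=False keeps non-ASCII tokens readable.
    return json.dumps(records, ensure_ascii=False, indent=2)

def to_csv(records):
    # Write a header row plus one row per token into an in-memory buffer.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["text", "frontend"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

print(to_json(tokens))
print(to_csv(tokens))
```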

🔄 Lossless Reconstruction System

Lossless Reconstruction Methods (3/9 methods)

✅ SPACE: Preserves all whitespace and punctuation perfectly
✅ CHAR: Character-by-character perfect preservation
✅ BPE: Advanced subword with full structure preservation
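
A tokenization is lossless when concatenating its tokens gives back the input exactly. The sketch below shows one way a whitespace-based splitter could satisfy that, by keeping the whitespace runs as tokens of their own. The function names are hypothetical and this is not necessarily how SanTOK's SPACE method stores whitespace.

```python
import re

def split_keep_whitespace(text):
    # Alternate runs of whitespace and non-whitespace so nothing is dropped.
    return re.findall(r"\s+|\S+", text)

def reconstruct(tokens):
    # Lossless methods rebuild the text by plain concatenation.
    return "".join(tokens)

text = "Hello,  world!\n"
tokens = split_keep_whitespace(text)
assert reconstruct(tokens) == text        # whitespace-preserving split
assert reconstruct(list(text)) == text    # character tokens round-trip too
```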

Analytical Methods (6/9 methods - Transform text for analysis)

🔄 WORD: Extracts words for linguistic analysis (removes punctuation by design)
🔄 GRAMMAR: Parses grammatical elements (removes spacing by design)
🔄 SUBWORD: Fixed-length chunking for subword modeling (transforms by design)
🔄 BYTE: ASCII representation for byte-level analysis (different format by design)
🔄 SYLLABLE: Syllable extraction for phonetic analysis (removes spacing by design)
🔄 FREQUENCY: Adds frequency metadata for statistical analysis (enhances by design)
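
To see why an analytical method cannot round-trip, consider a word extractor that drops punctuation: once the comma and exclamation mark are gone, no join can restore them. The sketch below also shows how frequency metadata could be attached with `collections.Counter`. The helper names and the `count` key are assumptions for illustration, not SanTOK's actual field names.

```python
import re
from collections import Counter

def word_tokens(text):
    # Punctuation and spacing are discarded, so the result is lossy.
    return re.findall(r"\w+", text)

def frequency_tokens(text):
    # Attach each word's number of occurrences in the input.
    words = word_tokens(text)
    counts = Counter(words)
    return [{"text": w, "count": counts[w]} for w in words]

original = "Hello, world!"
rebuilt = " ".join(word_tokens(original))
print(rebuilt == original)
print(frequency_tokens("the cat and the hat"))
```

The first `print` shows `False` because `rebuilt` is `Hello world`, without the original punctuation.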

🛠️ Advanced Usage

CLI Usage

santok

Programmatic Usage

import santok

# Get all tokenizations
result = santok.all_tokenizations("Your text here")

# Calculate numerology
numerology = santok.numerology_sum("Your text here")

# Run main function
santok.main()

📈 Performance

Concurrent Processing: Multi-threaded tokenization
Async Support: Asynchronous processing for large texts
Memory Efficient: Stream processing for large datasets
High Speed: Optimized algorithms for maximum performance
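
The package's concurrency interface is not described here, so the sketch below only illustrates the general patterns named above using the standard library: a thread pool for batches of texts and a generator for streaming. `tokenize` is a stand-in function, not a SanTOK call, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def tokenize(text):
    # Stand-in tokenizer; replace with the call you want to parallelise.
    return text.split()

def tokenize_many(texts, max_workers=4):
    # Executor.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(tokenize, texts))

def stream_tokens(lines):
    # Generator: holds one line's tokens in memory at a time.
    for line in lines:
        yield tokenize(line)

batch = tokenize_many(["a b", "c d e"])
streamed = list(stream_tokens(["x y", "z"]))
```

On standard CPython builds, threads mainly help when work is I/O-bound; for CPU-bound pure-Python tokenization a process pool may be the better fit.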

🔧 Requirements

Python 3.8+
No external dependencies (pure Python)

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

👨‍💻 Author

Santosh chavala

Email: chavalasantosh@hotmail.com
GitHub: @chavalasantosh

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📝 Changelog

See CHANGELOG.md for a list of changes and version history.

🔗 Links

SanTOK - Advanced Multi-Format Tokenization System by Santosh chavala

Release history

2.0.0: Dec 24, 2025
1.0.6: Oct 3, 2025 (this version)

Download files

Source Distribution

santok-1.0.6.tar.gz (43.8 kB)

Uploaded Oct 3, 2025 Source

Built Distribution

santok-1.0.6-py3-none-any.whl (6.1 kB)

Uploaded Oct 3, 2025 Python 3

File details

Details for the file santok-1.0.6.tar.gz.

File metadata

Download URL: santok-1.0.6.tar.gz
Upload date: Oct 3, 2025
Size: 43.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for santok-1.0.6.tar.gz
Algorithm	Hash digest
SHA256	`c049355a64257d836cbcfba86f003204e529da5acffad26e979298f09b765c38`
MD5	`ff6949d9ebddc1672e6090509dc18546`
BLAKE2b-256	`3f0b3e31ec0faa340709d6567696fc0884f78ce981c16bf8c76e86c328f8f282`

File details

Details for the file santok-1.0.6-py3-none-any.whl.

File metadata

Download URL: santok-1.0.6-py3-none-any.whl
Upload date: Oct 3, 2025
Size: 6.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for santok-1.0.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b4dd94f4f962a185432ba28a766afb098550f37d92cf8d739e8c1c54ffadf181`
MD5	`cac373d8ee996cd5929f6ba993ea57a4`
BLAKE2b-256	`fd544c8533bff075e451f6152ac5d6958be7b11f1ae70805e1ccf9d65e0cb330`
