Advanced multi-format tokenization system with numerology, hashing, compression, and embeddings

SanTOK - Advanced Multi-Format Tokenization System

PyPI version Python 3.8+ License: MIT

SanTOK (Sanitized Tokenization) is a multi-format tokenization system that provides nine tokenization methods with integrated numerology, hashing, compression, and embedding capabilities.

🚀 Features

Core Tokenization Methods

  • Space Tokenization: Splits text on whitespace
  • Word Tokenization: Extracts words using regex patterns
  • Character Tokenization: Character-by-character analysis
  • Grammar Tokenization: Separates words, numbers, and punctuation
  • Subword Tokenization: Fixed-length chunking
  • Byte Tokenization: ASCII value representation
  • BPE Tokenization: Byte Pair Encoding
  • Syllable Tokenization: Vowel-based splitting
  • Frequency Tokenization: Word frequency analysis
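
To make the differences between these methods concrete, here is a minimal sketch of several of them using only the standard library. It is not SanTOK's implementation: the function names, the regular expressions, and the default chunk size of 3 are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def space_tokens(text):
    # Split on runs of whitespace; punctuation stays attached to words.
    return text.split()


def word_tokens(text):
    # Keep alphanumeric runs only; punctuation and spacing are dropped.
    return re.findall(r"\w+", text)


def char_tokens(text):
    # One token per character, including spaces and punctuation.
    return list(text)


def grammar_tokens(text):
    # Letters, digits, and punctuation marks become separate tokens.
    return re.findall(r"[A-Za-z]+|\d+|[^\w\s]", text)


def subword_tokens(text, size=3):
    # Cut every word into fixed-length chunks (the last chunk may be shorter).
    return [w[i:i + size] for w in word_tokens(text) for i in range(0, len(w), size)]


def byte_tokens(text):
    # Integer byte values; for pure ASCII text these are the ASCII codes.
    return list(text.encode("utf-8"))


def frequency_tokens(text):
    # Case-insensitive word counts.
    return Counter(w.lower() for w in word_tokens(text))


sample = "Hello world!"
print(space_tokens(sample))
print(grammar_tokens(sample))
print(subword_tokens(sample))
```

For "Hello world!" the space tokenizer keeps "world!" as one token, while the grammar tokenizer splits the exclamation mark into its own token.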

Advanced Features

  • Numerology Integration: 9-centric digital root calculations
  • Hash-Driven Embeddings: Stable across vocabularies
  • Lossless Reconstruction: Exact text reconstruction for the SPACE, CHAR, and BPE methods
  • Multi-Format Output: JSON, CSV, TXT, XML, Excel, Parquet, Avro
  • High Performance: Concurrent and async processing
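
The idea behind a hash-driven embedding is that the vector depends only on the token text, so the same token maps to the same vector regardless of which vocabulary it appears in. The sketch below shows one way to do that with SHA-256. It is an assumption-laden illustration, not SanTOK's actual scheme: `hash_embedding`, the `dim` parameter, and the scaling to [-1, 1] are hypothetical.

```python
import hashlib
from typing import List


def hash_embedding(token: str, dim: int = 8) -> List[float]:
    """Derive a deterministic vector from a token using SHA-256.

    Hypothetical helper: each component hashes "<index>:<token>" and maps
    the first 8 bytes of the digest to a float in [-1, 1].
    """
    vector = []
    for i in range(dim):
        digest = hashlib.sha256(f"{i}:{token}".encode("utf-8")).digest()
        as_int = int.from_bytes(digest[:8], "big")
        vector.append(as_int / 2**64 * 2.0 - 1.0)
    return vector


# The same token always yields the same vector; no vocabulary lookup is needed.
print(hash_embedding("hello") == hash_embedding("hello"))
```

Because no lookup table is involved, tokens that were never seen before still get a stable vector.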

📦 Installation

pip install santok

🎯 Quick Start

import santok

# Basic usage
text = "Hello world!"
result = santok.all_tokenizations(text)

# Access different tokenization methods
space_tokens = result['space']
char_tokens = result['char']
word_tokens = result['word']

print(f"Space tokens: {space_tokens}")
print(f"Character tokens: {char_tokens}")
print(f"Word tokens: {word_tokens}")

# Numerology calculation
numerology = santok.numerology_sum(text)
print(f"Numerology sum: {numerology}")
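
For readers unfamiliar with digital roots, the following sketch shows the arithmetic in isolation. It does not reproduce what `santok.numerology_sum` computes internally: `digital_root` and `text_digit` are hypothetical helpers, and weighting characters by their Unicode code point is an assumption.

```python
def digital_root(n: int) -> int:
    """Repeated digit sum in closed form: 0 stays 0, anything else lands in 1..9."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return 0 if n == 0 else 1 + (n - 1) % 9


def text_digit(text: str) -> int:
    # Assumption: each character contributes its Unicode code point.
    return digital_root(sum(ord(ch) for ch in text))


print(digital_root(38))  # 3 + 8 = 11, then 1 + 1 = 2
print(digital_root(18))  # multiples of 9 map to 9, not 0
print(text_digit("A"))   # ord("A") is 65, and 6 + 5 = 11, then 1 + 1 = 2
```

The closed form `1 + (n - 1) % 9` gives the same result as summing decimal digits until one digit remains, which is why multiples of 9 map to 9.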

📊 Output Format

Each tokenization method returns a list of dictionaries:

[
    {'text': 'Hello', 'frontend': 1},
    {'text': 'world!', 'frontend': 2}
]

Where:

  • text: The actual token
  • frontend: Numerological frontend digit (1-9)
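
Since the result is a plain list of dictionaries, it can be serialized with the standard library. The sketch below writes the example above as JSON and CSV in memory. It uses only the `json`, `csv`, and `io` modules and says nothing about SanTOK's own exporters; the variable names are illustrative.

```python
import csv
import io
import json

# The example token list from this section.
tokens = [
    {"text": "Hello", "frontend": 1},
    {"text": "world!", "frontend": 2},
]

# JSON: the structure maps directly onto a JSON array of objects.
json_text = json.dumps(tokens, ensure_ascii=False, indent=2)

# CSV: one row per token, one column per key.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "frontend"])
writer.writeheader()
writer.writerows(tokens)
csv_text = buffer.getvalue()

print(json_text)
print(csv_text)
```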

🔄 Lossless Reconstruction System

Lossless Reconstruction Methods (3/9 methods)

  • SPACE: Preserves all whitespace and punctuation perfectly
  • CHAR: Character-by-character perfect preservation
  • BPE: Advanced subword with full structure preservation

Analytical Methods (6/9 methods - Transform text for analysis)

  • 🔄 WORD: Extracts words for linguistic analysis (removes punctuation by design)
  • 🔄 GRAMMAR: Parses grammatical elements (removes spacing by design)
  • 🔄 SUBWORD: Fixed-length chunking for subword modeling (transforms by design)
  • 🔄 BYTE: ASCII representation for byte-level analysis (different format by design)
  • 🔄 SYLLABLE: Syllable extraction for phonetic analysis (removes spacing by design)
  • 🔄 FREQUENCY: Adds frequency metadata for statistical analysis (enhances by design)
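
The distinction between the two groups comes down to whether joining the tokens gives back the original string. The sketch below illustrates that with two stand-in tokenizers written for this example; they are assumptions, not SanTOK's SPACE and WORD implementations.

```python
import re


def split_keep_whitespace(text):
    # Alternate runs of whitespace and non-whitespace; no character is discarded.
    return re.findall(r"\s+|\S+", text)


def extract_words(text):
    # Keep alphanumeric runs only; punctuation and exact spacing are lost.
    return re.findall(r"\w+", text)


original = "Hello,  world!\n"

lossless = split_keep_whitespace(original)
print("".join(lossless) == original)   # concatenation restores the input exactly

lossy = extract_words(original)
print(" ".join(lossy) == original)     # the comma, double space, and newline are gone
```

A round-trip check like `"".join(tokens) == original` is a simple way to test whether a given tokenizer is lossless.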

🛠️ Advanced Usage

CLI Usage

santok

Programmatic Usage

import santok

# Get all tokenizations
result = santok.all_tokenizations("Your text here")

# Calculate numerology
numerology = santok.numerology_sum("Your text here")

# Run main function
santok.main()

📈 Performance

  • Concurrent Processing: Multi-threaded tokenization
  • Async Support: Asynchronous processing for large texts
  • Memory Efficient: Stream processing for large datasets
  • High Speed: Optimized algorithms for maximum performance
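
As a rough picture of what concurrent tokenization over many inputs can look like, here is a sketch built on `concurrent.futures`. It does not use or describe SanTOK's internal concurrency: `tokenize` is a stand-in tokenizer and `tokenize_many` is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor


def tokenize(text):
    # Stand-in tokenizer for the example: split on whitespace.
    return text.split()


def tokenize_many(texts, max_workers=4):
    # Executor.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(tokenize, texts))


documents = ["Hello world!", "Tokenize me too", ""]
print(tokenize_many(documents))
```

On standard CPython builds, threads mainly help when the work is I/O-bound. For CPU-bound pure-Python tokenization, a process pool may be the better fit.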

🔧 Requirements

  • Python 3.8+
  • No external dependencies (pure Python)

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

👨‍💻 Author

Santosh chavala

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📝 Changelog

See CHANGELOG.md for a list of changes and version history.

Download files

Source Distribution

santok-1.0.6.tar.gz (43.8 kB)

Uploaded Source

Built Distribution

santok-1.0.6-py3-none-any.whl (6.1 kB)

Uploaded Python 3

File details

Details for the file santok-1.0.6.tar.gz.

File metadata

  • Download URL: santok-1.0.6.tar.gz
  • Upload date:
  • Size: 43.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for santok-1.0.6.tar.gz
Algorithm Hash digest
SHA256 c049355a64257d836cbcfba86f003204e529da5acffad26e979298f09b765c38
MD5 ff6949d9ebddc1672e6090509dc18546
BLAKE2b-256 3f0b3e31ec0faa340709d6567696fc0884f78ce981c16bf8c76e86c328f8f282

File details

Details for the file santok-1.0.6-py3-none-any.whl.

File metadata

  • Download URL: santok-1.0.6-py3-none-any.whl
  • Upload date:
  • Size: 6.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for santok-1.0.6-py3-none-any.whl
Algorithm Hash digest
SHA256 b4dd94f4f962a185432ba28a766afb098550f37d92cf8d739e8c1c54ffadf181
MD5 cac373d8ee996cd5929f6ba993ea57a4
BLAKE2b-256 fd544c8533bff075e451f6152ac5d6958be7b11f1ae70805e1ccf9d65e0cb330
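
The digests listed above can be compared against a downloaded file. The sketch below computes a SHA-256 digest with `hashlib`. The helper names `sha256_of` and `matches` are hypothetical, and the file path in the usage comment assumes the wheel was downloaded into the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    # Compare case-insensitively, since hex digests may be written in either case.
    return sha256_of(path) == expected_hex.strip().lower()


# Usage, assuming the wheel sits in the current directory:
# matches(
#     "santok-1.0.6-py3-none-any.whl",
#     "b4dd94f4f962a185432ba28a766afb098550f37d92cf8d739e8c1c54ffadf181",
# )
```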
