Comprehensive Text Processing System: Tokenization, Embeddings, Training, Vector Stores, and More

SanTOK Complete - Comprehensive Text Processing System

SanTOK Complete is a comprehensive, production-ready text processing system that goes far beyond simple tokenization. It provides a complete toolkit for text analysis, semantic understanding, model training, vector storage, API deployment, and more.

🎯 What is SanTOK Complete?

SanTOK Complete is NOT just a tokenizer - it's a complete NLP (Natural Language Processing) system that includes:

✅ Multiple Tokenization Methods - Whitespace, word, character, grammar, subword tokenization
✅ Semantic Embeddings - Generate embeddings for semantic analysis
✅ Vector Stores - Weaviate and other vector database integrations
✅ Model Training - Vocabulary building, language model training
✅ API Servers - Production-ready FastAPI servers
✅ Data Integration - Source map integration, vocabulary adaptation
✅ Data Interpretation - Text analysis and interpretation
✅ Compression - Text compression algorithms
✅ Performance Testing - Benchmarking and performance analysis
✅ CLI Tools - Command-line interfaces
✅ Utilities - Configuration, logging, validation

🚀 Installation

Prerequisites

Python 3.7 or higher
pip (Python package installer)

Method 1: Install as Package (Recommended)

# Navigate to the module directory
cd santok_complete

# Install in editable mode (recommended for development)
pip install -e .

# Or install normally
pip install .

Method 2: Add to Python Path

If you don't want to install, you can add the parent directory to your Python path:

import sys

# Make the parent directory importable before importing the package
sys.path.insert(0, r'C:\path\to\SanTOK-Extracted\SanTOK-9a284bcf1b497d32e2041726fa2bba1e662d2770')

import santok_complete

Method 3: Set Environment Variable

Windows:

set PYTHONPATH=%PYTHONPATH%;C:\path\to\SanTOK-Extracted\SanTOK-9a284bcf1b497d32e2041726fa2bba1e662d2770

Linux/Mac:

export PYTHONPATH="${PYTHONPATH}:/path/to/SanTOK-Extracted/SanTOK-9a284bcf1b497d32e2041726fa2bba1e662d2770"

Verify Installation

import santok_complete
print(f"SanTOK Complete Version: {santok_complete.__version__}")

from santok_complete import TextTokenizationEngine
engine = TextTokenizationEngine()
print("✅ Installation successful!")

⚡ Quick Start

Basic Tokenization

from santok_complete import TextTokenizationEngine

# Create engine instance
engine = TextTokenizationEngine(
    random_seed=12345,
    normalize_case=True,
    remove_punctuation=False
)

# Tokenize text
text = "Hello World! This is SanTOK Complete."
result = engine.tokenize(text, tokenization_method="whitespace")

print(f"Tokens: {result['tokens']}")
print(f"Method: {result['method']}")

Generate Embeddings

from santok_complete import SanTOKEmbeddingGenerator

generator = SanTOKEmbeddingGenerator()
embeddings = generator.generate("Your text here")
print(f"Embedding shape: {embeddings.shape}")

Use Vector Store

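from santok_complete import SanTOKEmbeddingGenerator, SanTOKVectorStore

generator = SanTOKEmbeddingGenerator()
store = SanTOKVectorStore()

# Store an embedding, then search with a query embedding
store.add(generator.generate("Hello"), metadata={"text": "Hello", "id": 1})
query_embedding = generator.generate("Hello there")
results = store.search(query_embedding, top_k=5)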

🏗️ Core Components

1. Core Tokenization (`core/`)

The foundation of SanTOK Complete provides multiple tokenization methods:

TextTokenizationEngine - Main tokenization engine with multiple methods
TextTokenizer - Core tokenizer class
BaseTokenizer - Base class for custom tokenizers
ParallelTokenizer - Parallel processing support

Available Tokenization Methods:

whitespace - Split by whitespace
word - Word-based tokenization
character - Character-level tokenization
grammar - Grammar-aware tokenization
subword - Subword tokenization
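
To make the first three method names concrete, here is a minimal standard-library sketch of what whitespace, word, and character splitting typically mean. It is not SanTOK's implementation: the function `simple_tokenize` and its regular expression are assumptions for illustration, and the grammar and subword methods are left out because their rules are not described here.

```python
import re

def simple_tokenize(text, method="whitespace"):
    """Illustrative stand-in, not part of santok_complete."""
    if method == "whitespace":
        pieces = text.split()              # split on runs of whitespace
    elif method == "word":
        pieces = re.findall(r"\w+", text)  # word characters only; punctuation dropped (assumption)
    elif method == "character":
        pieces = list(text)                # every character, spaces included
    else:
        raise ValueError(f"Unknown tokenization method: {method}")
    # Mirror the documented result shape: a list of {'text', 'index'} dicts.
    return [{"text": piece, "index": i} for i, piece in enumerate(pieces)]

for name in ("whitespace", "word", "character"):
    print(name, len(simple_tokenize("Hello World!", name)))
```

The real engine may treat punctuation and case differently, so use this only to build intuition for how the methods differ in granularity.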

2. Embeddings (`embeddings/`)

Generate semantic embeddings for text analysis:

SanTOKEmbeddingGenerator - Generate embeddings from text
SanTOKVectorStore - Store and search embeddings
SanTOKInferencePipeline - Inference pipeline for embeddings
SemanticTrainer - Train semantic models

3. Training (`training/`)

Train and build language models:

SanTOKVocabularyBuilder - Build vocabularies from text
SanTOKLanguageModelTrainer - Train language models
SanTOKLanguageModel - Language model class
EnhancedTrainer - Enhanced training capabilities
DatasetDownloader - Download training datasets

4. Vector Stores (`vector_stores/`)

Integrate with vector databases:

Weaviate Integration - Full Weaviate vector database support
Vector search and retrieval
Metadata management

5. API Servers (`servers/`)

Production-ready API servers:

MainServer - Full-featured FastAPI server
LightweightServer - Lightweight API server
SimpleServer - Simple HTTP server
JobManager - Job management system
AdminConfig - Admin configuration

6. Integration (`integration/`)

System integration modules:

VocabularyAdapter - Adapt vocabularies between systems
SourceMapIntegration - Source map integration

7. Interpretation (`interpretation/`)

Text analysis and interpretation:

DataInterpreter - Interpret and analyze text data

8. Compression (`compression/`)

Text compression algorithms:

CompressionAlgorithm - Various compression methods

9. Performance (`performance/`)

Testing and benchmarking:

TestAccuracy - Accuracy testing
ComprehensivePerformanceTest - Full performance testing
TestOrganizedOutputs - Output validation

10. CLI (`cli/`)

Command-line interfaces:

Main CLI - Primary command-line interface
Decode Demo - Decoding demonstrations

11. Utilities (`utils/`)

Supporting utilities:

Config - Configuration management
Logging - Logging setup and management
Validation - Input validation functions

📖 Detailed Usage

Text Tokenization

Basic Tokenization

from santok_complete import TextTokenizationEngine

engine = TextTokenizationEngine()
result = engine.tokenize("Hello World!", tokenization_method="whitespace")

# Access tokens
tokens = result['tokens']
for token in tokens:
    print(f"Text: {token['text']}, Index: {token['index']}")

Advanced Tokenization

engine = TextTokenizationEngine(
    random_seed=12345,           # For reproducibility
    embedding_bit=False,          # Embedding bit disabled (set True to enable)
    normalize_case=True,          # Normalize to lowercase
    remove_punctuation=False,     # Keep punctuation
    collapse_repetitions=0        # No repetition collapsing
)

# Use different methods
methods = ["whitespace", "word", "character", "grammar", "subword"]

for method in methods:
    result = engine.tokenize("Your text here", tokenization_method=method)
    print(f"{method}: {len(result['tokens'])} tokens")

Comprehensive Text Analysis

analysis = engine.analyze_text_comprehensive("Your text here")

# Analysis includes multiple tokenization methods
for method, data in analysis.items():
    print(f"{method}: {len(data['tokens'])} tokens")

Semantic Embeddings

Generate Embeddings

from santok_complete import SanTOKEmbeddingGenerator

generator = SanTOKEmbeddingGenerator()

# Generate embeddings
text = "This is sample text for embedding generation"
embeddings = generator.generate(text)

print(f"Embedding shape: {embeddings.shape}")
print(f"Embedding vector: {embeddings}")

Batch Embedding Generation

texts = ["First text", "Second text", "Third text"]
embeddings_list = [generator.generate(text) for text in texts]

Vector Stores

Using SanTOK Vector Store

from santok_complete import SanTOKEmbeddingGenerator, SanTOKVectorStore

generator = SanTOKEmbeddingGenerator()
store = SanTOKVectorStore()

# Add documents
doc1_embedding = generator.generate("Document 1 text")
doc2_embedding = generator.generate("Document 2 text")

store.add(doc1_embedding, metadata={"id": 1, "title": "Doc 1"})
store.add(doc2_embedding, metadata={"id": 2, "title": "Doc 2"})

# Search
query_embedding = generator.generate("Search query")
results = store.search(query_embedding, top_k=5)

for result in results:
    print(f"Score: {result['score']}, Metadata: {result['metadata']}")
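
For readers who want to see what an add/search cycle like the one above involves, here is a minimal in-memory sketch. It assumes cosine similarity as the score, which this README does not confirm for SanTOKVectorStore; the class `TinyVectorStore` is hypothetical and only mirrors the documented `add` and `search` shapes.

```python
import math

class TinyVectorStore:
    """Illustrative in-memory store; not SanTOKVectorStore."""

    def __init__(self):
        self._items = []  # list of (id, vector, metadata) tuples

    def add(self, embedding, metadata=None):
        item_id = str(len(self._items))  # sequential string IDs (assumption)
        self._items.append((item_id, list(embedding), metadata or {}))
        return item_id

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def search(self, query_embedding, top_k=10):
        scored = [
            {"score": self._cosine(query_embedding, vec), "metadata": meta}
            for _, vec, meta in self._items
        ]
        scored.sort(key=lambda r: r["score"], reverse=True)  # best match first
        return scored[:top_k]

store = TinyVectorStore()
store.add([1.0, 0.0], metadata={"id": 1, "title": "Doc 1"})
store.add([0.0, 1.0], metadata={"id": 2, "title": "Doc 2"})
best = store.search([0.9, 0.1], top_k=1)[0]
print(best["metadata"]["title"])
```

A linear scan like this is fine for small collections; a dedicated vector database becomes worthwhile once the number of stored vectors grows.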

Weaviate Integration

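from santok_complete.vector_stores.weaviate_integration import (
    connect_weaviate,
    search_vectors,
    store_vector,
)

# Connect to Weaviate
client = connect_weaviate(url="http://localhost:8080")

# Store vectors (embeddings comes from generator.generate(...) above)
store_vector(client, embeddings, metadata={"text": "Sample"})

# Search (query_embedding as created in the vector store example)
results = search_vectors(client, query_embedding, limit=10)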

Model Training

Build Vocabulary

from santok_complete import SanTOKVocabularyBuilder

builder = SanTOKVocabularyBuilder()

# Build from text corpus
corpus = "Your training text corpus here..."
vocabulary = builder.build_from_text(corpus)

print(f"Vocabulary size: {len(vocabulary)}")
print(f"Vocabulary: {vocabulary}")
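
As a rough picture of what building a vocabulary from text can involve, the sketch below counts whitespace-separated tokens and assigns integer IDs by frequency. It is an assumption-laden illustration, not SanTOKVocabularyBuilder: the function `build_vocab`, the `min_count` parameter, and the `<pad>` and `<unk>` special tokens are all hypothetical.

```python
from collections import Counter

def build_vocab(corpus, min_count=1, specials=("<pad>", "<unk>")):
    """Illustrative vocabulary builder; not SanTOKVocabularyBuilder."""
    counts = Counter(corpus.lower().split())
    vocab = {token: i for i, token in enumerate(specials)}
    # Most frequent tokens first; ties broken alphabetically for determinism.
    for token, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_count and token not in vocab:
            vocab[token] = len(vocab)
    return vocab

vocab = build_vocab("the cat sat on the mat")
print(len(vocab), vocab["the"])
```

Raising `min_count` drops rare tokens, which is a common way to keep a vocabulary small at the cost of mapping more words to the unknown token.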

Train Language Model

from santok_complete import SanTOKLanguageModelTrainer

trainer = SanTOKLanguageModelTrainer()

# Train model
model = trainer.train(
    training_data="path/to/training/data",
    epochs=10,
    batch_size=32
)

# Save model
model.save("path/to/save/model")

API Server Deployment

Start Main Server

from santok_complete.servers.main_server import app
import uvicorn

uvicorn.run(app, host="0.0.0.0", port=8000)

Start Lightweight Server

from santok_complete.servers.lightweight_server import app
import uvicorn

uvicorn.run(app, host="0.0.0.0", port=8001)

Using the CLI

# From command line
python -m santok_complete.cli.cli "Hello World" --method whitespace

# Or if installed
santok "Hello World" --method word

📚 API Reference

TextTokenizationEngine

Main tokenization engine class.

Constructor

TextTokenizationEngine(
    random_seed: int = 12345,
    embedding_bit: bool = False,
    normalize_case: bool = True,
    remove_punctuation: bool = False,
    collapse_repetitions: int = 0
)

Parameters:

random_seed (int): Seed for reproducible tokenization (default: 12345)
embedding_bit (bool): Enable embedding bit for extra variation (default: False)
normalize_case (bool): Convert text to lowercase (default: True)
remove_punctuation (bool): Remove punctuation (default: False)
collapse_repetitions (int): Collapse repeated characters (0=disabled, 1=run-aware, N=collapse to N) (default: 0)
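
The "collapse to N" behaviour of `collapse_repetitions` can be pictured with a short regular-expression sketch. This is an assumption about the general technique, not the engine's code: the helper `collapse_runs` is hypothetical, and it treats every positive value the same way, so it does not attempt to reproduce whatever "run-aware" means for a value of 1.

```python
import re

def collapse_runs(text, max_run):
    """Illustrative only: shorten runs of one repeated character to max_run."""
    if max_run <= 0:
        return text  # 0 means disabled, as in the parameter description
    # Match a character followed by at least max_run copies of itself.
    pattern = re.compile(r"(.)\1{%d,}" % max_run, flags=re.DOTALL)
    return pattern.sub(lambda m: m.group(1) * max_run, text)

print(collapse_runs("coooool!!!", 2))
```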

Methods

`tokenize(text: str, tokenization_method: str = "whitespace") -> dict`

Tokenize input text using specified method.

Parameters:

text (str): Input text to tokenize
tokenization_method (str): Method to use ("whitespace", "word", "character", "grammar", "subword")

Returns:

dict: Dictionary containing:
- tokens: List of token dictionaries with 'text' and 'index'
- method: Tokenization method used
- count: Number of tokens

Example:

result = engine.tokenize("Hello World!", "whitespace")
# Returns (token text may be lowercased when normalize_case=True, the default):
# {
#     'tokens': [{'text': 'Hello', 'index': 0}, {'text': 'World!', 'index': 1}],
#     'method': 'whitespace',
#     'count': 2
# }
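
When consuming a result of this shape, a small helper can pull out the token strings and sanity-check the count. The helper `token_texts` is hypothetical, and the sample dictionary below is written by hand to match the documented structure rather than produced by the library.

```python
def token_texts(result):
    """Return the token strings from a tokenize()-style result dict."""
    tokens = result["tokens"]
    # 'count' is documented as the number of tokens, so the two should agree.
    if result.get("count", len(tokens)) != len(tokens):
        raise ValueError("count does not match the number of tokens")
    return [token["text"] for token in sorted(tokens, key=lambda t: t["index"])]

# Hand-written sample in the documented shape (not library output).
sample = {
    "tokens": [{"text": "Hello", "index": 0}, {"text": "World!", "index": 1}],
    "method": "whitespace",
    "count": 2,
}
print(token_texts(sample))
```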

`analyze_text_comprehensive(text: str) -> dict`

Analyze text using all available tokenization methods.

Parameters:

text (str): Input text to analyze

Returns:

dict: Dictionary with results for each method

Example:

analysis = engine.analyze_text_comprehensive("Hello World!")
# Returns: {
#     'whitespace': {'tokens': [...], 'count': 2},
#     'word': {'tokens': [...], 'count': 2},
#     'character': {'tokens': [...], 'count': 12},
#     ...
# }

SanTOKEmbeddingGenerator

Generate semantic embeddings from text.

Constructor

SanTOKEmbeddingGenerator(config: dict = None)

Methods

`generate(text: str) -> numpy.ndarray`

Generate embedding vector for input text.

Parameters:

text (str): Input text

Returns:

numpy.ndarray: Embedding vector

SanTOKVectorStore

Store and search embeddings.

Methods

`add(embedding: np.ndarray, metadata: dict = None) -> str`

Add embedding to the store.

Parameters:

embedding (np.ndarray): Embedding vector
metadata (dict): Optional metadata

Returns:

str: ID of stored embedding

`search(query_embedding: np.ndarray, top_k: int = 10) -> list`

Search for similar embeddings.

Parameters:

query_embedding (np.ndarray): Query embedding vector
top_k (int): Number of results to return

Returns:

list: List of results with 'score' and 'metadata'

💡 Examples

Example 1: Complete Text Processing Pipeline

from santok_complete import (
    TextTokenizationEngine,
    SanTOKEmbeddingGenerator,
    SanTOKVectorStore
)

# Initialize components
engine = TextTokenizationEngine()
generator = SanTOKEmbeddingGenerator()
store = SanTOKVectorStore()

# Process text
text = "SanTOK Complete is a comprehensive text processing system."

# Tokenize
tokens_result = engine.tokenize(text, "whitespace")
print(f"Tokens: {[t['text'] for t in tokens_result['tokens']]}")

# Generate embedding
embedding = generator.generate(text)
print(f"Embedding shape: {embedding.shape}")

# Store in vector database
doc_id = store.add(embedding, metadata={"text": text, "source": "example"})
print(f"Stored document ID: {doc_id}")

# Search
query_text = "text processing"
query_embedding = generator.generate(query_text)
results = store.search(query_embedding, top_k=3)
print(f"Search results: {len(results)} found")

Example 2: Training a Custom Model

from santok_complete import (
    SanTOKVocabularyBuilder,
    SanTOKLanguageModelTrainer
)

# Build vocabulary
builder = SanTOKVocabularyBuilder()
vocab = builder.build_from_text("Your training corpus...")
print(f"Vocabulary size: {len(vocab)}")

# Train model
trainer = SanTOKLanguageModelTrainer()
model = trainer.train(
    training_data="path/to/data",
    vocabulary=vocab,
    epochs=10
)

# Use model
predictions = model.predict("Input text")

Example 3: API Server with Custom Endpoints

import uvicorn

from santok_complete import TextTokenizationEngine
from santok_complete.servers.main_server import app

engine = TextTokenizationEngine()

# FastAPI treats plain str parameters as query parameters,
# so clients call POST /tokenize?text=...&method=...
@app.post("/tokenize")
async def tokenize_endpoint(text: str, method: str = "whitespace"):
    return engine.tokenize(text, method)

# Run server
if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
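
A client for the `/tokenize` endpoint defined in Example 3 could look like the sketch below, using only the standard library. It assumes the server is reachable at `http://localhost:8000` and that the endpoint returns JSON; the helper names `build_tokenize_url` and `call_tokenize` are hypothetical. Because the endpoint declares `text` and `method` as plain `str` parameters, FastAPI reads them from the query string, so the sketch sends them there.

```python
import json
import urllib.parse
import urllib.request

def build_tokenize_url(base_url, text, method="whitespace"):
    """Build the request URL; text and method travel as query parameters."""
    query = urllib.parse.urlencode({"text": text, "method": method})
    return f"{base_url.rstrip('/')}/tokenize?{query}"

def call_tokenize(base_url, text, method="whitespace", timeout=10):
    """POST to the /tokenize endpoint and decode the JSON response."""
    request = urllib.request.Request(
        build_tokenize_url(base_url, text, method), data=b"", method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)

print(build_tokenize_url("http://localhost:8000", "Hello World", "word"))
# With the server from Example 3 running:
# result = call_tokenize("http://localhost:8000", "Hello World", "word")
```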

🏛️ Architecture

Module Structure

santok_complete/
├── __init__.py              # Main module exports
├── README.md                # This file
├── INSTALL.md               # Installation guide
├── HOW_TO_USE.md            # Usage guide
├── setup.py                 # Package setup
│
├── core/                    # Core tokenization
│   ├── core_tokenizer.py   # Core tokenizer implementation
│   ├── base_tokenizer.py   # Base tokenizer class
│   ├── parallel_tokenizer.py # Parallel processing
│   └── santok_engine.py    # Main engine
│
├── embeddings/              # Embedding generation
│   ├── embedding_generator.py
│   ├── vector_store.py
│   ├── inference_pipeline.py
│   └── semantic_trainer.py
│
├── training/                # Model training
│   ├── vocabulary_builder.py
│   ├── language_model_trainer.py
│   └── enhanced_trainer.py
│
├── servers/                 # API servers
│   ├── main_server.py
│   ├── lightweight_server.py
│   └── job_manager.py
│
├── vector_stores/           # Vector database integrations
│   └── weaviate_integration.py
│
├── integration/             # System integration
│   ├── vocabulary_adapter.py
│   └── source_map_integration.py
│
├── interpretation/          # Text interpretation
│   └── data_interpreter.py
│
├── compression/             # Compression algorithms
│   └── compression_algorithms.py
│
├── performance/             # Performance testing
│   ├── test_accuracy.py
│   └── comprehensive_performance_test.py
│
├── cli/                     # Command-line interfaces
│   ├── cli.py
│   └── main.py
│
└── utils/                   # Utilities
    ├── config.py
    ├── logging_config.py
    └── validation.py

Data Flow

Input Text
    ↓
TextTokenizationEngine (Tokenization)
    ↓
Tokens
    ↓
SanTOKEmbeddingGenerator (Embedding Generation)
    ↓
Embeddings
    ↓
SanTOKVectorStore (Storage & Search)
    ↓
Results

🔧 Troubleshooting

Common Issues

Import Error

Problem: ModuleNotFoundError: No module named 'santok_complete'

Solution:

Ensure you've installed the package: pip install -e .
Check that the Python path (sys.path or PYTHONPATH) includes the parent directory of santok_complete
Verify you're using the correct Python environment
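
The checks above can be scripted with the standard library. This sketch reports which interpreter is running and whether `santok_complete` can be located, without actually importing it; the helper name `describe_module` is an assumption for illustration.

```python
import importlib.util
import sys

def describe_module(name):
    """Report whether a top-level module can be found, without importing it."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return f"{name}: not found on sys.path"
    return f"{name}: found at {spec.origin}"

print("Interpreter:", sys.executable)  # confirms which environment is active
print(describe_module("santok_complete"))
```

If the module is reported as not found, compare the interpreter path with the environment where `pip install` was run.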

Tokenization Method Not Found

Problem: ValueError: Unknown tokenization method

Solution: Use one of the supported methods: "whitespace", "word", "character", "grammar", "subword"
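
One way to catch this error before it reaches the engine is to validate the method name up front. The sketch below is a hypothetical helper, not part of the package; the trimming and lowercasing are conveniences of the sketch, since this README does not say whether the engine itself is case-sensitive.

```python
SUPPORTED_METHODS = ("whitespace", "word", "character", "grammar", "subword")

def checked_method(name):
    """Normalize a method name and fail early with a helpful message."""
    normalized = name.strip().lower()
    if normalized not in SUPPORTED_METHODS:
        raise ValueError(
            f"Unknown tokenization method {name!r}; "
            f"expected one of: {', '.join(SUPPORTED_METHODS)}"
        )
    return normalized

print(checked_method(" Word "))
```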

Embedding Generation Fails

Problem: Embedding generation returns errors

Solution:

Ensure input text is not empty
Check that required dependencies are installed
Verify model files are present (if using pre-trained models)

Server Won't Start

Problem: API server fails to start

Solution:

Check if port is already in use
Verify uvicorn is installed (pip show uvicorn) and install it if missing: pip install uvicorn
Check firewall settings
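
To check whether a port is already taken, a short socket probe is usually enough. The sketch below tries a TCP connection to the two ports used in the server examples (8000 and 8001); the helper `port_in_use` is hypothetical and only detects listeners reachable on the given host.

```python
import socket

def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(timeout)
        return probe.connect_ex((host, port)) == 0  # 0 means the connect succeeded

for port in (8000, 8001):
    print(port, "in use" if port_in_use(port) else "free")
```

If a port is reported as in use, either stop the other process or pass a different `port` to `uvicorn.run`.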

Getting Help

Check the documentation files: INSTALL.md, HOW_TO_USE.md
Review examples in the examples/ directory
Check GitHub issues for known problems

🤝 Contributing

Contributions are welcome! Please follow these steps:

Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

👤 Author

Santosh Chavala

GitHub: @chavalasantosh
Repository: SanTOK

🙏 Acknowledgments

Built with Python
Uses FastAPI for API servers
Integrates with Weaviate for vector storage
Thanks to all contributors

📊 Statistics

Total Files: 125+ Python files
Lines of Code: 48,000+
Components: 11 major modules
Tokenization Methods: 5+
Supported Python Versions: 3.7+

SanTOK Complete - Your complete solution for text processing, from tokenization to production deployment.

Release history

2.0.0 (this version) - Dec 24, 2025
1.0.6 - Oct 3, 2025

Download files

Source Distribution

santok-2.0.0.tar.gz (304.9 kB view details)

Uploaded Dec 24, 2025 Source

Built Distribution

santok-2.0.0-py3-none-any.whl (333.5 kB view details)

Uploaded Dec 24, 2025 Python 3

File details

Details for the file santok-2.0.0.tar.gz.

File metadata

Download URL: santok-2.0.0.tar.gz
Upload date: Dec 24, 2025
Size: 304.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for santok-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`ddccdffe9db092872117b91d5d59ea40ab221f0064d7107b384fa6c5d9b68c5d`
MD5	`0ae1969d388e6f6b77fff0a5276abce9`
BLAKE2b-256	`6cca241bc0409aef1325acbe446c5cc42203952d663e41cdf689834efe740dfd`

File details

Details for the file santok-2.0.0-py3-none-any.whl.

File metadata

Download URL: santok-2.0.0-py3-none-any.whl
Upload date: Dec 24, 2025
Size: 333.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for santok-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c6d712fa08d920cabf5f2979e5f806498958096f64fdb928a8c9a9e4be8cea73`
MD5	`3671d95e230f3543df83e00217497535`
BLAKE2b-256	`d7291afd8d4ddaa05832e9baa8998cdcfeef6343441991134c5e62769bf5234b`
