Comprehensive Text Processing System: Tokenization, Embeddings, Training, Vector Stores, and More
SanTOK Complete - Comprehensive Text Processing System
SanTOK Complete is a text processing system that goes beyond tokenization. It provides a toolkit for text analysis, semantic embeddings, model training, vector storage, and API deployment.
What is SanTOK Complete?
SanTOK Complete is NOT just a tokenizer - it's a complete NLP (Natural Language Processing) system that includes:
- Multiple Tokenization Methods - Whitespace, word, character, grammar, subword tokenization
- Semantic Embeddings - Generate embeddings for semantic analysis
- Vector Stores - Weaviate and other vector database integrations
- Model Training - Vocabulary building, language model training
- API Servers - Production-ready FastAPI servers
- Data Integration - Source map integration, vocabulary adaptation
- Data Interpretation - Text analysis and interpretation
- Compression - Text compression algorithms
- Performance Testing - Benchmarking and performance analysis
- CLI Tools - Command-line interfaces
- Utilities - Configuration, logging, validation
Table of Contents
- Installation
- Quick Start
- Core Components
- Detailed Usage
- API Reference
- Examples
- Architecture
- Troubleshooting
- Contributing
- License
Installation
Prerequisites
- Python 3.7 or higher
- pip (Python package installer)
Method 1: Install as Package (Recommended)
# Navigate to the module directory
cd santok_complete
# Install in editable mode (recommended for development)
pip install -e .
# Or install normally
pip install .
Method 2: Add to Python Path
If you don't want to install, you can add the parent directory to your Python path:
import sys
sys.path.insert(0, r'C:\path\to\SanTOK-Extracted\SanTOK-9a284bcf1b497d32e2041726fa2bba1e662d2770')
import santok_complete
Method 3: Set Environment Variable
Windows:
set PYTHONPATH=%PYTHONPATH%;C:\path\to\SanTOK-Extracted\SanTOK-9a284bcf1b497d32e2041726fa2bba1e662d2770
Linux/Mac:
export PYTHONPATH="${PYTHONPATH}:/path/to/SanTOK-Extracted/SanTOK-9a284bcf1b497d32e2041726fa2bba1e662d2770"
Verify Installation
import santok_complete
print(f"SanTOK Complete Version: {santok_complete.__version__}")
from santok_complete import TextTokenizationEngine
engine = TextTokenizationEngine()
print("Installation successful!")
Quick Start
Basic Tokenization
from santok_complete import TextTokenizationEngine
# Create engine instance
engine = TextTokenizationEngine(
    random_seed=12345,
    normalize_case=True,
    remove_punctuation=False
)
# Tokenize text
text = "Hello World! This is SanTOK Complete."
result = engine.tokenize(text, tokenization_method="whitespace")
print(f"Tokens: {result['tokens']}")
print(f"Method: {result['method']}")
Generate Embeddings
from santok_complete import SanTOKEmbeddingGenerator
generator = SanTOKEmbeddingGenerator()
embeddings = generator.generate("Your text here")
print(f"Embedding shape: {embeddings.shape}")
Use Vector Store
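from santok_complete import SanTOKEmbeddingGenerator, SanTOKVectorStore
generator = SanTOKEmbeddingGenerator()
store = SanTOKVectorStore()
store.add(generator.generate("Hello"), metadata={"text": "Hello", "id": 1})
# Embed the query text before searching
query_embedding = generator.generate("Hello")
results = store.search(query_embedding, top_k=5)
Core Components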
1. Core Tokenization (core/)
The foundation of SanTOK Complete provides multiple tokenization methods:
- TextTokenizationEngine - Main tokenization engine with multiple methods
- TextTokenizer - Core tokenizer class
- BaseTokenizer - Base class for custom tokenizers
- ParallelTokenizer - Parallel processing support
Available Tokenization Methods:
- whitespace - Split by whitespace
- word - Word-based tokenization
- character - Character-level tokenization
- grammar - Grammar-aware tokenization
- subword - Subword tokenization
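To make the difference between the methods concrete, here is a minimal standalone sketch of three of them. It does not use SanTOK; `sketch_tokenize` is a hypothetical helper, and the rule used for `word` (runs of letters, digits, and underscores) is an assumption rather than the library's documented behavior. Grammar-aware and subword tokenization are left out because their rules are not described here.

```python
import re

def sketch_tokenize(text, method="whitespace"):
    """Toy stand-in for three of the listed methods; not SanTOK's implementation."""
    if method == "whitespace":
        pieces = text.split()
    elif method == "word":
        # Assumption: a word is a run of letters, digits, or underscores
        pieces = re.findall(r"\w+", text)
    elif method == "character":
        pieces = list(text)
    else:
        raise ValueError(f"Unknown tokenization method: {method}")
    # Mirror the documented token shape: a dict with 'text' and 'index'
    return [{"text": piece, "index": i} for i, piece in enumerate(pieces)]

for name in ("whitespace", "word", "character"):
    print(name, [t["text"] for t in sketch_tokenize("Hello World!", name)])
```

Under these assumptions, whitespace splitting keeps the trailing "!" attached to "World", the word rule drops it, and character splitting yields one token per character including the space.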
2. Embeddings (embeddings/)
Generate semantic embeddings for text analysis:
- SanTOKEmbeddingGenerator - Generate embeddings from text
- SanTOKVectorStore - Store and search embeddings
- SanTOKInferencePipeline - Inference pipeline for embeddings
- SemanticTrainer - Train semantic models
3. Training (training/)
Train and build language models:
- SanTOKVocabularyBuilder - Build vocabularies from text
- SanTOKLanguageModelTrainer - Train language models
- SanTOKLanguageModel - Language model class
- EnhancedTrainer - Enhanced training capabilities
- DatasetDownloader - Download training datasets
4. Vector Stores (vector_stores/)
Integrate with vector databases:
- Weaviate Integration - Full Weaviate vector database support
- Vector search and retrieval
- Metadata management
5. API Servers (servers/)
Production-ready API servers:
- MainServer - Full-featured FastAPI server
- LightweightServer - Lightweight API server
- SimpleServer - Simple HTTP server
- JobManager - Job management system
- AdminConfig - Admin configuration
6. Integration (integration/)
System integration modules:
- VocabularyAdapter - Adapt vocabularies between systems
- SourceMapIntegration - Source map integration
7. Interpretation (interpretation/)
Text analysis and interpretation:
- DataInterpreter - Interpret and analyze text data
8. Compression (compression/)
Text compression algorithms:
- CompressionAlgorithm - Various compression methods
9. Performance (performance/)
Testing and benchmarking:
- TestAccuracy - Accuracy testing
- ComprehensivePerformanceTest - Full performance testing
- TestOrganizedOutputs - Output validation
10. CLI (cli/)
Command-line interfaces:
- Main CLI - Primary command-line interface
- Decode Demo - Decoding demonstrations
11. Utilities (utils/)
Supporting utilities:
- Config - Configuration management
- Logging - Logging setup and management
- Validation - Input validation functions
Detailed Usage
Text Tokenization
Basic Tokenization
from santok_complete import TextTokenizationEngine
engine = TextTokenizationEngine()
result = engine.tokenize("Hello World!", tokenization_method="whitespace")
# Access tokens
tokens = result['tokens']
for token in tokens:
    print(f"Text: {token['text']}, Index: {token['index']}")
Advanced Tokenization
engine = TextTokenizationEngine(
    random_seed=12345,         # For reproducibility
    embedding_bit=False,       # Set to True to enable the embedding bit
    normalize_case=True,       # Normalize to lowercase
    remove_punctuation=False,  # Keep punctuation
    collapse_repetitions=0     # No repetition collapsing
)
# Use different methods
methods = ["whitespace", "word", "character", "grammar", "subword"]
for method in methods:
    result = engine.tokenize("Your text here", tokenization_method=method)
    print(f"{method}: {len(result['tokens'])} tokens")
Comprehensive Text Analysis
analysis = engine.analyze_text_comprehensive("Your text here")
# Analysis includes multiple tokenization methods
for method, data in analysis.items():
    print(f"{method}: {len(data['tokens'])} tokens")
Semantic Embeddings
Generate Embeddings
from santok_complete import SanTOKEmbeddingGenerator
generator = SanTOKEmbeddingGenerator()
# Generate embeddings
text = "This is sample text for embedding generation"
embeddings = generator.generate(text)
print(f"Embedding dimension: {embeddings.shape}")
print(f"Embedding vector: {embeddings}")
Batch Embedding Generation
texts = ["First text", "Second text", "Third text"]
embeddings_list = [generator.generate(text) for text in texts]
Vector Stores
Using SanTOK Vector Store
from santok_complete import SanTOKEmbeddingGenerator, SanTOKVectorStore
generator = SanTOKEmbeddingGenerator()
store = SanTOKVectorStore()
# Add documents
doc1_embedding = generator.generate("Document 1 text")
doc2_embedding = generator.generate("Document 2 text")
store.add(doc1_embedding, metadata={"id": 1, "title": "Doc 1"})
store.add(doc2_embedding, metadata={"id": 2, "title": "Doc 2"})
# Search
query_embedding = generator.generate("Search query")
results = store.search(query_embedding, top_k=5)
for result in results:
    print(f"Score: {result['score']}, Metadata: {result['metadata']}")
Weaviate Integration
from santok_complete.vector_stores.weaviate_integration import *
# Connect to Weaviate
client = connect_weaviate(url="http://localhost:8080")
# Store vectors
store_vector(client, embeddings, metadata={"text": "Sample"})
# Search
results = search_vectors(client, query_embedding, limit=10)
Model Training
Build Vocabulary
from santok_complete import SanTOKVocabularyBuilder
builder = SanTOKVocabularyBuilder()
# Build from text corpus
corpus = "Your training text corpus here..."
vocabulary = builder.build_from_text(corpus)
print(f"Vocabulary size: {len(vocabulary)}")
print(f"Vocabulary: {vocabulary}")
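For readers unfamiliar with what a vocabulary is, the sketch below shows one common way to build one: count tokens and assign integer IDs by frequency. This is an illustration only. `sketch_build_vocab` and its `min_count` parameter are hypothetical, and nothing here is confirmed to match what `SanTOKVocabularyBuilder.build_from_text` returns.

```python
from collections import Counter

def sketch_build_vocab(corpus, min_count=1):
    """Map each lowercased whitespace token to an integer ID, most frequent first."""
    counts = Counter(corpus.lower().split())
    # Drop rare tokens before assigning IDs so the IDs stay contiguous
    kept = [(token, count) for token, count in counts.items() if count >= min_count]
    # Sort by descending count, then alphabetically, so IDs are deterministic
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return {token: index for index, (token, _) in enumerate(kept)}

vocab = sketch_build_vocab("the cat sat on the mat the end")
print(f"Vocabulary size: {len(vocab)}")
print(f"ID of 'the': {vocab['the']}")
```

Breaking ties alphabetically keeps the mapping stable across runs, which matters when a model trained with one vocabulary is loaded later.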
Train Language Model
from santok_complete import SanTOKLanguageModelTrainer
trainer = SanTOKLanguageModelTrainer()
# Train model
model = trainer.train(
    training_data="path/to/training/data",
    epochs=10,
    batch_size=32
)
# Save model
model.save("path/to/save/model")
API Server Deployment
Start Main Server
from santok_complete.servers.main_server import app
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8000)
Start Lightweight Server
from santok_complete.servers.lightweight_server import app
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8001)
Using the CLI
# From command line
python -m santok_complete.cli.cli "Hello World" --method whitespace
# Or if installed
santok "Hello World" --method word
API Reference
TextTokenizationEngine
Main tokenization engine class.
Constructor
TextTokenizationEngine(
    random_seed: int = 12345,
    embedding_bit: bool = False,
    normalize_case: bool = True,
    remove_punctuation: bool = False,
    collapse_repetitions: int = 0
)
Parameters:
- random_seed (int): Seed for reproducible tokenization (default: 12345)
- embedding_bit (bool): Enable embedding bit for extra variation (default: False)
- normalize_case (bool): Convert text to lowercase (default: True)
- remove_punctuation (bool): Remove punctuation (default: False)
- collapse_repetitions (int): Collapse repeated characters (0=disabled, 1=run-aware, N=collapse to N) (default: 0)
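The three preprocessing flags can be pictured with a small standalone sketch. This is not SanTOK's code: `sketch_normalize` is a hypothetical helper, and it assumes that `collapse_repetitions=N` shortens any run of one repeated character to at most N copies. The "run-aware" behavior of the value 1 is not specified here, so the sketch treats every positive N the same way.

```python
import re
import string

def sketch_normalize(text, normalize_case=True, remove_punctuation=False,
                     collapse_repetitions=0):
    """Illustrative preprocessing only; SanTOK's actual rules may differ."""
    if normalize_case:
        text = text.lower()
    if remove_punctuation:
        # Delete ASCII punctuation characters
        text = text.translate(str.maketrans("", "", string.punctuation))
    if collapse_repetitions > 0:
        n = collapse_repetitions
        # Assumption: a run longer than n copies of one character becomes n copies
        text = re.sub(r"(.)\1{%d,}" % n, lambda m: m.group(1) * n, text)
    return text

print(sketch_normalize("Sooooo Good!!!", collapse_repetitions=2))
print(sketch_normalize("Hello, World!", remove_punctuation=True))
```

Applying case normalization before the other steps means the repetition rule sees "S" and "s" as the same character, which is one reasonable ordering among several.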
Methods
tokenize(text: str, tokenization_method: str = "whitespace") -> dict
Tokenize input text using specified method.
Parameters:
- text (str): Input text to tokenize
- tokenization_method (str): Method to use ("whitespace", "word", "character", "grammar", "subword")
Returns:
dict: Dictionary containing:
- tokens: List of token dictionaries with 'text' and 'index'
- method: Tokenization method used
- count: Number of tokens
Example:
result = engine.tokenize("Hello World!", "whitespace")
# Returns: {
# 'tokens': [{'text': 'Hello', 'index': 0}, {'text': 'World!', 'index': 1}],
# 'method': 'whitespace',
# 'count': 2
# }
analyze_text_comprehensive(text: str) -> dict
Analyze text using all available tokenization methods.
Parameters:
- text (str): Input text to analyze
Returns:
dict: Dictionary with results for each method
Example:
analysis = engine.analyze_text_comprehensive("Hello World!")
# Returns: {
# 'whitespace': {'tokens': [...], 'count': 2},
# 'word': {'tokens': [...], 'count': 2},
# 'character': {'tokens': [...], 'count': 12},
# ...
# }
SanTOKEmbeddingGenerator
Generate semantic embeddings from text.
Constructor
SanTOKEmbeddingGenerator(config: dict = None)
Methods
generate(text: str) -> numpy.ndarray
Generate embedding vector for input text.
Parameters:
- text (str): Input text
Returns:
numpy.ndarray: Embedding vector
SanTOKVectorStore
Store and search embeddings.
Methods
add(embedding: np.ndarray, metadata: dict = None) -> str
Add embedding to the store.
Parameters:
- embedding (np.ndarray): Embedding vector
- metadata (dict): Optional metadata
Returns:
str: ID of stored embedding
search(query_embedding: np.ndarray, top_k: int = 10) -> list
Search for similar embeddings.
Parameters:
- query_embedding (np.ndarray): Query embedding vector
- top_k (int): Number of results to return
Returns:
list: List of results with 'score' and 'metadata'
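The `add` and `search` contract described above can be illustrated with a brute-force in-memory store. The class below is a hypothetical stand-in, not `SanTOKVectorStore`: it assumes cosine similarity as the score and random UUID strings as IDs, neither of which is confirmed for the real class, and it uses plain Python lists instead of NumPy arrays to stay dependency-free.

```python
import math
import uuid

class SketchVectorStore:
    """Brute-force cosine-similarity store; a stand-in, not SanTOKVectorStore."""

    def __init__(self):
        self._items = {}  # id -> (vector, metadata)

    def add(self, embedding, metadata=None):
        item_id = str(uuid.uuid4())
        self._items[item_id] = (list(embedding), metadata or {})
        return item_id

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def search(self, query_embedding, top_k=10):
        query = list(query_embedding)
        scored = [
            {"id": item_id, "score": self._cosine(query, vec), "metadata": meta}
            for item_id, (vec, meta) in self._items.items()
        ]
        # Highest similarity first, then keep the top_k results
        scored.sort(key=lambda r: r["score"], reverse=True)
        return scored[:top_k]

store = SketchVectorStore()
store.add([1.0, 0.0], metadata={"title": "Doc 1"})
store.add([0.0, 1.0], metadata={"title": "Doc 2"})
best = store.search([0.9, 0.1], top_k=1)[0]
print(best["metadata"]["title"])  # Doc 1
```

A linear scan like this costs one similarity computation per stored vector, which is fine for small collections; dedicated vector databases such as Weaviate exist to avoid that cost at scale.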
Examples
Example 1: Complete Text Processing Pipeline
from santok_complete import (
    TextTokenizationEngine,
    SanTOKEmbeddingGenerator,
    SanTOKVectorStore
)
# Initialize components
engine = TextTokenizationEngine()
generator = SanTOKEmbeddingGenerator()
store = SanTOKVectorStore()
# Process text
text = "SanTOK Complete is a comprehensive text processing system."
# Tokenize
tokens_result = engine.tokenize(text, "whitespace")
print(f"Tokens: {[t['text'] for t in tokens_result['tokens']]}")
# Generate embedding
embedding = generator.generate(text)
print(f"Embedding shape: {embedding.shape}")
# Store in vector database
doc_id = store.add(embedding, metadata={"text": text, "source": "example"})
print(f"Stored document ID: {doc_id}")
# Search
query_text = "text processing"
query_embedding = generator.generate(query_text)
results = store.search(query_embedding, top_k=3)
print(f"Search results: {len(results)} found")
Example 2: Training a Custom Model
from santok_complete import (
    SanTOKVocabularyBuilder,
    SanTOKLanguageModelTrainer
)
# Build vocabulary
builder = SanTOKVocabularyBuilder()
vocab = builder.build_from_text("Your training corpus...")
print(f"Vocabulary size: {len(vocab)}")
# Train model
trainer = SanTOKLanguageModelTrainer()
model = trainer.train(
    training_data="path/to/data",
    vocabulary=vocab,
    epochs=10
)
# Use model
predictions = model.predict("Input text")
Example 3: API Server with Custom Endpoints
from santok_complete.servers.main_server import app
from santok_complete import TextTokenizationEngine
import uvicorn
engine = TextTokenizationEngine()
# Register an extra route on the existing FastAPI app
@app.post("/tokenize")
async def tokenize_endpoint(text: str, method: str = "whitespace"):
    # text and method are read from the query string
    result = engine.tokenize(text, method)
    return result
# Run server
uvicorn.run(app, host="0.0.0.0", port=8000)
Architecture
Module Structure
santok_complete/
├── __init__.py                 # Main module exports
├── README.md                   # This file
├── INSTALL.md                  # Installation guide
├── HOW_TO_USE.md               # Usage guide
├── setup.py                    # Package setup
│
├── core/                       # Core tokenization
│   ├── core_tokenizer.py       # Core tokenizer implementation
│   ├── base_tokenizer.py       # Base tokenizer class
│   ├── parallel_tokenizer.py   # Parallel processing
│   └── santok_engine.py        # Main engine
│
├── embeddings/                 # Embedding generation
│   ├── embedding_generator.py
│   ├── vector_store.py
│   ├── inference_pipeline.py
│   └── semantic_trainer.py
│
├── training/                   # Model training
│   ├── vocabulary_builder.py
│   ├── language_model_trainer.py
│   └── enhanced_trainer.py
│
├── servers/                    # API servers
│   ├── main_server.py
│   ├── lightweight_server.py
│   └── job_manager.py
│
├── vector_stores/              # Vector database integrations
│   └── weaviate_integration.py
│
├── integration/                # System integration
│   ├── vocabulary_adapter.py
│   └── source_map_integration.py
│
├── interpretation/             # Text interpretation
│   └── data_interpreter.py
│
├── compression/                # Compression algorithms
│   └── compression_algorithms.py
│
├── performance/                # Performance testing
│   ├── test_accuracy.py
│   └── comprehensive_performance_test.py
│
├── cli/                        # Command-line interfaces
│   ├── cli.py
│   └── main.py
│
└── utils/                      # Utilities
    ├── config.py
    ├── logging_config.py
    └── validation.py
Data Flow
Input Text
↓
TextTokenizationEngine (Tokenization)
↓
Tokens
↓
SanTOKEmbeddingGenerator (Embedding Generation)
↓
Embeddings
↓
SanTOKVectorStore (Storage & Search)
↓
Results
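The flow above can be traced end to end with toy stand-ins. The sketch below is an illustration under stated assumptions, not SanTOK's pipeline: `sketch_embed` is a hypothetical helper that hashes whitespace tokens into a fixed-size count vector (a hashed bag of words, which carries no real semantics), and `rank` plays the role of storage and search with a plain list.

```python
import math
import zlib

def sketch_embed(text, dim=16):
    """Hashed bag-of-words vector, L2-normalised; a toy, not a semantic model."""
    vec = [0.0] * dim
    for token in text.lower().split():  # tokenization stage
        # crc32 is stable across runs, unlike the built-in hash() for str
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def rank(query, documents):
    """Order documents by cosine similarity to the query, best first."""
    q = sketch_embed(query)
    # Vectors are unit length, so the dot product is the cosine similarity
    scored = [(sum(a * b for a, b in zip(q, sketch_embed(d))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored]

docs = ["text processing system", "weather report for monday"]
print(rank("text processing", docs))
```

Because the vector only counts tokens, word order is ignored and unrelated words can collide in the same bucket; a trained embedding model is what replaces this stage in a real pipeline.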
Troubleshooting
Common Issues
Import Error
Problem: ModuleNotFoundError: No module named 'santok_complete'
Solution:
- Ensure you've installed the package: pip install -e .
- Check that the Python path includes the parent directory
- Verify you're using the correct Python environment
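One quick way to work through these checks is to ask the interpreter itself where, if anywhere, it finds the module. The helper below is a small diagnostic sketch using only the standard library; `describe_module` is a name chosen here for illustration.

```python
import importlib.util
import sys

def describe_module(name):
    """Report whether a top-level module is importable from this interpreter."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return f"{name}: not found by {sys.executable}"
    return f"{name}: found at {spec.origin}"

print(describe_module("santok_complete"))
```

If the reported interpreter path is not the one pip installed into, the environment is the likely culprit rather than the package.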
Tokenization Method Not Found
Problem: ValueError: Unknown tokenization method
Solution:
Use one of the supported methods: "whitespace", "word", "character", "grammar", "subword"
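When the method name comes from user input or a config file, it can help to validate it before calling the engine. The following is a minimal sketch, not part of SanTOK: `check_method` is a hypothetical helper, and the error message only imitates the one shown above. It uses `difflib` from the standard library to suggest the closest supported name.

```python
import difflib

SUPPORTED_METHODS = ("whitespace", "word", "character", "grammar", "subword")

def check_method(name):
    """Return name if supported, else raise ValueError with a suggestion."""
    if name in SUPPORTED_METHODS:
        return name
    hint = difflib.get_close_matches(name, SUPPORTED_METHODS, n=1)
    suffix = f" Did you mean '{hint[0]}'?" if hint else ""
    raise ValueError(f"Unknown tokenization method: {name!r}.{suffix}")

try:
    check_method("whitespaces")
except ValueError as err:
    print(err)
```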
Embedding Generation Fails
Problem: Embedding generation returns errors
Solution:
- Ensure input text is not empty
- Check that required dependencies are installed
- Verify model files are present (if using pre-trained models)
Server Won't Start
Problem: API server fails to start
Solution:
- Check if the port is already in use
- Verify uvicorn is installed: pip install uvicorn
- Check firewall settings
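The first check can be done from Python before starting the server. This sketch uses only the standard `socket` module; `port_is_free` is a name chosen here for illustration, and it tests the loopback address by default, so a server bound to another interface would need that host passed in.

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permissions error
            return False
    return True

print(f"Port 8000 free: {port_is_free(8000)}")
```

The result is only a snapshot: another process can still claim the port between this check and the server start.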
Getting Help
- Check the documentation files: INSTALL.md, HOW_TO_USE.md
- Review examples in the examples/ directory
- Check GitHub issues for known problems
Contributing
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Author
Santosh Chavala
- GitHub: @chavalasantosh
- Repository: SanTOK
Acknowledgments
- Built with Python
- Uses FastAPI for API servers
- Integrates with Weaviate for vector storage
- Thanks to all contributors
Statistics
- Total Files: 125+ Python files
- Lines of Code: 48,000+
- Components: 11 major modules
- Tokenization Methods: 5+
- Supported Python Versions: 3.7+
SanTOK Complete - Your complete solution for text processing, from tokenization to production deployment.