
Comprehensive Text Processing System: Tokenization, Embeddings, Training, Vector Stores, and More

SanTOK Complete - Comprehensive Text Processing System

Python 3.7+ License: MIT

SanTOK Complete is a comprehensive, production-ready text processing system that goes far beyond simple tokenization. It provides a complete toolkit for text analysis, semantic understanding, model training, vector storage, API deployment, and more.

🎯 What is SanTOK Complete?

SanTOK Complete is more than a tokenizer: it is an NLP (Natural Language Processing) toolkit that includes:

  • ✅ Multiple Tokenization Methods - Space, word, character, grammar, subword tokenization
  • ✅ Semantic Embeddings - Generate embeddings for semantic analysis
  • ✅ Vector Stores - Weaviate and other vector database integrations
  • ✅ Model Training - Vocabulary building, language model training
  • ✅ API Servers - Production-ready FastAPI servers
  • ✅ Data Integration - Source map integration, vocabulary adaptation
  • ✅ Data Interpretation - Text analysis and interpretation
  • ✅ Compression - Text compression algorithms
  • ✅ Performance Testing - Benchmarking and performance analysis
  • ✅ CLI Tools - Command-line interfaces
  • ✅ Utilities - Configuration, logging, validation

📋 Table of Contents

  1. Installation
  2. Quick Start
  3. Core Components
  4. Detailed Usage
  5. API Reference
  6. Examples
  7. Architecture
  8. Troubleshooting
  9. Contributing
  10. License

🚀 Installation

Prerequisites

  • Python 3.7 or higher
  • pip (Python package installer)

Method 1: Install as Package (Recommended)

# Navigate to the module directory
cd santok_complete

# Install in editable mode (recommended for development)
pip install -e .

# Or install normally
pip install .

Method 2: Add to Python Path

If you don't want to install, you can add the parent directory to your Python path:

import sys

# Make the directory that contains santok_complete importable
sys.path.insert(0, r'C:\path\to\SanTOK-Extracted\SanTOK-9a284bcf1b497d32e2041726fa2bba1e662d2770')

import santok_complete

Method 3: Set Environment Variable

Windows:

set PYTHONPATH=%PYTHONPATH%;C:\path\to\SanTOK-Extracted\SanTOK-9a284bcf1b497d32e2041726fa2bba1e662d2770

Linux/Mac:

export PYTHONPATH="${PYTHONPATH}:/path/to/SanTOK-Extracted/SanTOK-9a284bcf1b497d32e2041726fa2bba1e662d2770"

Verify Installation

import santok_complete
print(f"SanTOK Complete Version: {santok_complete.__version__}")

from santok_complete import TextTokenizationEngine
engine = TextTokenizationEngine()
print("✅ Installation successful!")
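
If the import above fails, it can help to check whether Python can locate the package at all before debugging anything else. The following is a minimal sketch using only the standard library; the helper name is_importable is hypothetical and not part of SanTOK Complete.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be located on the current import path."""
    # find_spec returns None when a top-level module cannot be found,
    # and for top-level names it does not run the package's own code.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    name = "santok_complete"
    if is_importable(name):
        print(f"{name} found on the import path")
    else:
        print(f"{name} not found; install it or adjust PYTHONPATH")
```

A False result points to an installation or PYTHONPATH problem rather than a bug inside the package.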

⚡ Quick Start

Basic Tokenization

from santok_complete import TextTokenizationEngine

# Create engine instance
engine = TextTokenizationEngine(
    random_seed=12345,
    normalize_case=True,
    remove_punctuation=False
)

# Tokenize text
text = "Hello World! This is SanTOK Complete."
result = engine.tokenize(text, tokenization_method="whitespace")

print(f"Tokens: {result['tokens']}")
print(f"Method: {result['method']}")

Generate Embeddings

from santok_complete import SanTOKEmbeddingGenerator

generator = SanTOKEmbeddingGenerator()
embeddings = generator.generate("Your text here")
print(f"Embedding shape: {embeddings.shape}")

Use Vector Store

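from santok_complete import SanTOKEmbeddingGenerator, SanTOKVectorStore

generator = SanTOKEmbeddingGenerator()
store = SanTOKVectorStore()
store.add(generator.generate("Hello"), metadata={"text": "Hello", "id": 1})

# Embed the query text before searching
query_embedding = generator.generate("Hi there")
results = store.search(query_embedding, top_k=5)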

🏗️ Core Components

1. Core Tokenization (core/)

The foundation of SanTOK Complete provides multiple tokenization methods:

  • TextTokenizationEngine - Main tokenization engine with multiple methods
  • TextTokenizer - Core tokenizer class
  • BaseTokenizer - Base class for custom tokenizers
  • ParallelTokenizer - Parallel processing support

Available Tokenization Methods:

  • whitespace - Split by whitespace
  • word - Word-based tokenization
  • character - Character-level tokenization
  • grammar - Grammar-aware tokenization
  • subword - Subword tokenization

2. Embeddings (embeddings/)

Generate semantic embeddings for text analysis:

  • SanTOKEmbeddingGenerator - Generate embeddings from text
  • SanTOKVectorStore - Store and search embeddings
  • SanTOKInferencePipeline - Inference pipeline for embeddings
  • SemanticTrainer - Train semantic models

3. Training (training/)

Train and build language models:

  • SanTOKVocabularyBuilder - Build vocabularies from text
  • SanTOKLanguageModelTrainer - Train language models
  • SanTOKLanguageModel - Language model class
  • EnhancedTrainer - Enhanced training capabilities
  • DatasetDownloader - Download training datasets

4. Vector Stores (vector_stores/)

Integrate with vector databases:

  • Weaviate Integration - Full Weaviate vector database support
  • Vector search and retrieval
  • Metadata management

5. API Servers (servers/)

Production-ready API servers:

  • MainServer - Full-featured FastAPI server
  • LightweightServer - Lightweight API server
  • SimpleServer - Simple HTTP server
  • JobManager - Job management system
  • AdminConfig - Admin configuration

6. Integration (integration/)

System integration modules:

  • VocabularyAdapter - Adapt vocabularies between systems
  • SourceMapIntegration - Source map integration

7. Interpretation (interpretation/)

Text analysis and interpretation:

  • DataInterpreter - Interpret and analyze text data

8. Compression (compression/)

Text compression algorithms:

  • CompressionAlgorithm - Various compression methods

9. Performance (performance/)

Testing and benchmarking:

  • TestAccuracy - Accuracy testing
  • ComprehensivePerformanceTest - Full performance testing
  • TestOrganizedOutputs - Output validation

10. CLI (cli/)

Command-line interfaces:

  • Main CLI - Primary command-line interface
  • Decode Demo - Decoding demonstrations

11. Utilities (utils/)

Supporting utilities:

  • Config - Configuration management
  • Logging - Logging setup and management
  • Validation - Input validation functions

📖 Detailed Usage

Text Tokenization

Basic Tokenization

from santok_complete import TextTokenizationEngine

engine = TextTokenizationEngine()
result = engine.tokenize("Hello World!", tokenization_method="whitespace")

# Access tokens
tokens = result['tokens']
for token in tokens:
    print(f"Text: {token['text']}, Index: {token['index']}")

Advanced Tokenization

engine = TextTokenizationEngine(
    random_seed=12345,           # For reproducibility
    embedding_bit=False,          # Embedding bit off (set True to enable)
    normalize_case=True,          # Normalize to lowercase
    remove_punctuation=False,     # Keep punctuation
    collapse_repetitions=0        # No repetition collapsing
)

# Use different methods
methods = ["whitespace", "word", "character", "grammar", "subword"]

for method in methods:
    result = engine.tokenize("Your text here", tokenization_method=method)
    print(f"{method}: {len(result['tokens'])} tokens")

Comprehensive Text Analysis

analysis = engine.analyze_text_comprehensive("Your text here")

# Analysis includes multiple tokenization methods
for method, data in analysis.items():
    print(f"{method}: {len(data['tokens'])} tokens")

Semantic Embeddings

Generate Embeddings

from santok_complete import SanTOKEmbeddingGenerator

generator = SanTOKEmbeddingGenerator()

# Generate embeddings
text = "This is sample text for embedding generation"
embeddings = generator.generate(text)

print(f"Embedding dimension: {embeddings.shape}")
print(f"Embedding vector: {embeddings}")

Batch Embedding Generation

texts = ["First text", "Second text", "Third text"]
embeddings_list = [generator.generate(text) for text in texts]

Vector Stores

Using SanTOK Vector Store

from santok_complete import SanTOKVectorStore

store = SanTOKVectorStore()

# Add documents
doc1_embedding = generator.generate("Document 1 text")
doc2_embedding = generator.generate("Document 2 text")

store.add(doc1_embedding, metadata={"id": 1, "title": "Doc 1"})
store.add(doc2_embedding, metadata={"id": 2, "title": "Doc 2"})

# Search
query_embedding = generator.generate("Search query")
results = store.search(query_embedding, top_k=5)

for result in results:
    print(f"Score: {result['score']}, Metadata: {result['metadata']}")

Weaviate Integration

from santok_complete.vector_stores.weaviate_integration import (
    connect_weaviate,
    store_vector,
    search_vectors,
)

# Connect to Weaviate
client = connect_weaviate(url="http://localhost:8080")

# Store vectors
store_vector(client, embeddings, metadata={"text": "Sample"})

# Search
results = search_vectors(client, query_embedding, limit=10)

Model Training

Build Vocabulary

from santok_complete import SanTOKVocabularyBuilder

builder = SanTOKVocabularyBuilder()

# Build from text corpus
corpus = "Your training text corpus here..."
vocabulary = builder.build_from_text(corpus)

print(f"Vocabulary size: {len(vocabulary)}")
print(f"Vocabulary: {vocabulary}")
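
As a rough idea of what vocabulary building involves, the sketch below counts whitespace tokens and assigns integer IDs by frequency. The function build_vocabulary, the min_count parameter, and the reserved "<unk>" entry are assumptions for illustration; the structure returned by SanTOKVocabularyBuilder.build_from_text may be different.

```python
from collections import Counter


def build_vocabulary(corpus: str, min_count: int = 1) -> dict:
    """Map each whitespace token to an integer ID, most frequent first."""
    counts = Counter(corpus.lower().split())
    vocab = {"<unk>": 0}  # ID 0 is reserved for unknown tokens
    for token, count in counts.most_common():
        if count >= min_count and token not in vocab:
            vocab[token] = len(vocab)
    return vocab


if __name__ == "__main__":
    vocab = build_vocabulary("the cat the dog")
    print(vocab)  # {'<unk>': 0, 'the': 1, 'cat': 2, 'dog': 3}
```

Raising min_count drops rare tokens, which keeps the vocabulary small at the cost of mapping more words to "<unk>".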

Train Language Model

from santok_complete import SanTOKLanguageModelTrainer

trainer = SanTOKLanguageModelTrainer()

# Train model
model = trainer.train(
    training_data="path/to/training/data",
    epochs=10,
    batch_size=32
)

# Save model
model.save("path/to/save/model")

API Server Deployment

Start Main Server

from santok_complete.servers.main_server import app
import uvicorn

uvicorn.run(app, host="0.0.0.0", port=8000)

Start Lightweight Server

from santok_complete.servers.lightweight_server import app
import uvicorn

uvicorn.run(app, host="0.0.0.0", port=8001)

Using the CLI

# From command line
python -m santok_complete.cli.cli "Hello World" --method whitespace

# Or if installed
santok "Hello World" --method word
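
The commands above take a positional text argument and a --method option. A command-line interface of that shape could be declared with argparse roughly as follows; this is a sketch under that assumption, and the real CLI in cli/cli.py may accept additional flags that are not documented here.

```python
import argparse

METHODS = ["whitespace", "word", "character", "grammar", "subword"]


def build_parser() -> argparse.ArgumentParser:
    """Parser with one positional argument and a --method option."""
    parser = argparse.ArgumentParser(prog="santok", description="Tokenize text.")
    parser.add_argument("text", help="Text to tokenize")
    parser.add_argument(
        "--method",
        choices=METHODS,
        default="whitespace",
        help="Tokenization method (default: whitespace)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["Hello World", "--method", "word"])
    print(args.text, args.method)
```

Restricting --method with choices makes argparse reject unsupported method names before any tokenization code runs.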

📚 API Reference

TextTokenizationEngine

Main tokenization engine class.

Constructor

TextTokenizationEngine(
    random_seed: int = 12345,
    embedding_bit: bool = False,
    normalize_case: bool = True,
    remove_punctuation: bool = False,
    collapse_repetitions: int = 0
)

Parameters:

  • random_seed (int): Seed for reproducible tokenization (default: 12345)
  • embedding_bit (bool): Enable embedding bit for extra variation (default: False)
  • normalize_case (bool): Convert text to lowercase (default: True)
  • remove_punctuation (bool): Remove punctuation (default: False)
  • collapse_repetitions (int): Collapse repeated characters (0=disabled, 1=run-aware, N=collapse to N) (default: 0)
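
The three text-cleaning options can be pictured as a preprocessing step applied before tokenization. The sketch below is one possible reading, not the engine's code: the function preprocess is hypothetical, and it treats collapse_repetitions=N as "cap every run of a repeated character at N", which may not match the engine's "run-aware" behavior for N=1.

```python
import re
import string


def preprocess(text: str, normalize_case: bool = True,
               remove_punctuation: bool = False,
               collapse_repetitions: int = 0) -> str:
    """Apply the cleaning options in a fixed order (illustrative only)."""
    if normalize_case:
        text = text.lower()
    if remove_punctuation:
        # Delete every ASCII punctuation character.
        text = text.translate(str.maketrans("", "", string.punctuation))
    if collapse_repetitions > 0:
        # Assumption: cap every run of a repeated character at N occurrences.
        n = collapse_repetitions
        text = re.sub(r"(.)\1{%d,}" % n, lambda m: m.group(1) * n, text)
    return text


if __name__ == "__main__":
    print(preprocess("Soooo goood!", collapse_repetitions=2))  # soo good!
```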

Methods

tokenize(text: str, tokenization_method: str = "whitespace") -> dict

Tokenize input text using specified method.

Parameters:

  • text (str): Input text to tokenize
  • tokenization_method (str): Method to use ("whitespace", "word", "character", "grammar", "subword")

Returns:

  • dict: Dictionary containing:
    • tokens: List of token dictionaries with 'text' and 'index'
    • method: Tokenization method used
    • count: Number of tokens

Example:

result = engine.tokenize("Hello World!", "whitespace")
# Returns: {
#     'tokens': [{'text': 'Hello', 'index': 0}, {'text': 'World!', 'index': 1}],
#     'method': 'whitespace',
#     'count': 2
# }

analyze_text_comprehensive(text: str) -> dict

Analyze text using all available tokenization methods.

Parameters:

  • text (str): Input text to analyze

Returns:

  • dict: Dictionary with results for each method

Example:

analysis = engine.analyze_text_comprehensive("Hello World!")
# Returns: {
#     'whitespace': {'tokens': [...], 'count': 2},
#     'word': {'tokens': [...], 'count': 2},
#     'character': {'tokens': [...], 'count': 12},
#     ...
# }

SanTOKEmbeddingGenerator

Generate semantic embeddings from text.

Constructor

SanTOKEmbeddingGenerator(config: dict = None)

Methods

generate(text: str) -> numpy.ndarray

Generate embedding vector for input text.

Parameters:

  • text (str): Input text

Returns:

  • numpy.ndarray: Embedding vector

SanTOKVectorStore

Store and search embeddings.

Methods

add(embedding: np.ndarray, metadata: dict = None) -> str

Add embedding to the store.

Parameters:

  • embedding (np.ndarray): Embedding vector
  • metadata (dict): Optional metadata

Returns:

  • str: ID of stored embedding

search(query_embedding: np.ndarray, top_k: int = 10) -> list

Search for similar embeddings.

Parameters:

  • query_embedding (np.ndarray): Query embedding vector
  • top_k (int): Number of results to return

Returns:

  • list: List of results with 'score' and 'metadata'

💡 Examples

Example 1: Complete Text Processing Pipeline

from santok_complete import (
    TextTokenizationEngine,
    SanTOKEmbeddingGenerator,
    SanTOKVectorStore
)

# Initialize components
engine = TextTokenizationEngine()
generator = SanTOKEmbeddingGenerator()
store = SanTOKVectorStore()

# Process text
text = "SanTOK Complete is a comprehensive text processing system."

# Tokenize
tokens_result = engine.tokenize(text, "whitespace")
print(f"Tokens: {[t['text'] for t in tokens_result['tokens']]}")

# Generate embedding
embedding = generator.generate(text)
print(f"Embedding shape: {embedding.shape}")

# Store in vector database
doc_id = store.add(embedding, metadata={"text": text, "source": "example"})
print(f"Stored document ID: {doc_id}")

# Search
query_text = "text processing"
query_embedding = generator.generate(query_text)
results = store.search(query_embedding, top_k=3)
print(f"Search results: {len(results)} found")

Example 2: Training a Custom Model

from santok_complete import (
    SanTOKVocabularyBuilder,
    SanTOKLanguageModelTrainer
)

# Build vocabulary
builder = SanTOKVocabularyBuilder()
vocab = builder.build_from_text("Your training corpus...")
print(f"Vocabulary size: {len(vocab)}")

# Train model
trainer = SanTOKLanguageModelTrainer()
model = trainer.train(
    training_data="path/to/data",
    vocabulary=vocab,
    epochs=10
)

# Use model
predictions = model.predict("Input text")

Example 3: API Server with Custom Endpoints

import uvicorn

from santok_complete import TextTokenizationEngine
from santok_complete.servers.main_server import app

engine = TextTokenizationEngine()

# Register an extra route on the existing FastAPI app.
# Plain str parameters are read from the query string.
@app.post("/tokenize")
async def tokenize_endpoint(text: str, method: str = "whitespace"):
    return engine.tokenize(text, method)

# Run server
if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

🏛️ Architecture

Module Structure

santok_complete/
├── __init__.py              # Main module exports
├── README.md                # This file
├── INSTALL.md               # Installation guide
├── HOW_TO_USE.md            # Usage guide
├── setup.py                 # Package setup
│
├── core/                    # Core tokenization
│   ├── core_tokenizer.py   # Core tokenizer implementation
│   ├── base_tokenizer.py   # Base tokenizer class
│   ├── parallel_tokenizer.py # Parallel processing
│   └── santok_engine.py    # Main engine
│
├── embeddings/              # Embedding generation
│   ├── embedding_generator.py
│   ├── vector_store.py
│   ├── inference_pipeline.py
│   └── semantic_trainer.py
│
├── training/                # Model training
│   ├── vocabulary_builder.py
│   ├── language_model_trainer.py
│   └── enhanced_trainer.py
│
├── servers/                 # API servers
│   ├── main_server.py
│   ├── lightweight_server.py
│   └── job_manager.py
│
├── vector_stores/           # Vector database integrations
│   └── weaviate_integration.py
│
├── integration/             # System integration
│   ├── vocabulary_adapter.py
│   └── source_map_integration.py
│
├── interpretation/          # Text interpretation
│   └── data_interpreter.py
│
├── compression/             # Compression algorithms
│   └── compression_algorithms.py
│
├── performance/             # Performance testing
│   ├── test_accuracy.py
│   └── comprehensive_performance_test.py
│
├── cli/                     # Command-line interfaces
│   ├── cli.py
│   └── main.py
│
└── utils/                   # Utilities
    ├── config.py
    ├── logging_config.py
    └── validation.py

Data Flow

Input Text
    ↓
TextTokenizationEngine (Tokenization)
    ↓
Tokens
    ↓
SanTOKEmbeddingGenerator (Embedding Generation)
    ↓
Embeddings
    ↓
SanTOKVectorStore (Storage & Search)
    ↓
Results
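
The same flow can be traced end to end with stand-in functions. Everything in the sketch below is an assumption made for illustration: tokenize, embed, and search are hypothetical helpers, and the embedding is a simple hashing-trick bag of words rather than whatever SanTOKEmbeddingGenerator computes.

```python
import hashlib
import math


def tokenize(text: str) -> list:
    """Stage 1: split lowercased text on whitespace."""
    return text.lower().split()


def embed(tokens: list, dim: int = 64) -> list:
    """Stage 2: hashing-trick bag of words, L2-normalized."""
    vec = [0.0] * dim
    for tok in tokens:
        # A stable hash keeps bucket assignments identical across runs.
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def search(query_vec: list, docs: dict, top_k: int = 3) -> list:
    """Stage 3: docs maps text -> vector; return (score, text), best first."""
    scored = [
        (sum(q * d for q, d in zip(query_vec, vec)), text)
        for text, vec in docs.items()
    ]
    return sorted(scored, reverse=True)[:top_k]


if __name__ == "__main__":
    texts = ["text processing system", "vector database storage"]
    docs = {t: embed(tokenize(t)) for t in texts}
    for score, text in search(embed(tokenize("text processing")), docs):
        print(f"{score:.3f}  {text}")
```

Because the vectors are normalized, the dot product in search equals cosine similarity, so a document compared with itself scores 1.0.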

🔧 Troubleshooting

Common Issues

Import Error

Problem: ModuleNotFoundError: No module named 'santok_complete'

Solution:

  1. Ensure you've installed the package: pip install -e .
  2. Check Python path includes the parent directory
  3. Verify you're using the correct Python environment

Tokenization Method Not Found

Problem: ValueError: Unknown tokenization method

Solution: Use one of the supported methods: "whitespace", "word", "character", "grammar", "subword"

Embedding Generation Fails

Problem: Embedding generation returns errors

Solution:

  1. Ensure input text is not empty
  2. Check that required dependencies are installed
  3. Verify model files are present (if using pre-trained models)

Server Won't Start

Problem: API server fails to start

Solution:

  1. Check if port is already in use
  2. Verify uvicorn is installed: pip install uvicorn
  3. Check firewall settings
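
For the first step, a quick way to see whether something is already listening on the port is to try connecting to it. This is a minimal standard-library sketch; the helper name port_in_use is hypothetical, and it only checks the given host (localhost by default).

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for port in (8000, 8001):
        state = "in use" if port_in_use(port) else "free"
        print(f"Port {port} is {state}")
```

If the port is taken, either stop the other process or pass a different port to uvicorn.run.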

Getting Help

  • Check the documentation files: INSTALL.md, HOW_TO_USE.md
  • Review examples in the examples/ directory
  • Check GitHub issues for known problems

🤝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


👤 Author

Santosh Chavala


🙏 Acknowledgments

  • Built with Python
  • Uses FastAPI for API servers
  • Integrates with Weaviate for vector storage
  • Thanks to all contributors

📊 Statistics

  • Total Files: 125+ Python files
  • Lines of Code: 48,000+
  • Components: 11 major modules
  • Tokenization Methods: 5+
  • Supported Python Versions: 3.7+

SanTOK Complete - Your complete solution for text processing, from tokenization to production deployment.
