SOMA - Advanced Tokenization & Intelligence Framework
SOMA: Advanced Intelligence Framework
SOMA is a next-generation tokenization and intelligence framework designed to bridge the gap between raw text and semantic understanding. Unlike traditional tokenizers that simply split text, SOMA applies mathematical analysis, feature extraction, and cognitive structures to create a richer representation of language.
"Intelligence begins with how we perceive the data. SOMA changes the perception."
🚀 Why SOMA?
SOMA is built for researchers and developers who need more than just BPE (Byte Pair Encoding). It offers a unified engine for:
- Universal Tokenization: Seamlessly switch between whitespace, word, character, subword, and grammar-based strategies.
- Mathematical Embeddings: Proprietary "Frontend Digit" calculation for deterministic, low-compute feature extraction.
- Cognitive Architecture: Integrated support for Small Language Models (SLMs) and reasoning pipelines.
- Structure-Aware: The `soma_core` module understands text hierarchy and structural patterns.
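To make the idea of switchable strategies concrete, here is a minimal standard-library sketch of a strategy table covering the whitespace, word, and character modes. It is an illustration only, not SOMA's implementation: `STRATEGIES` and `simple_tokenize` are hypothetical names, and the subword and grammar-based modes are left out because they depend on a vocabulary or grammar that this README does not describe.

```python
import re
from typing import Callable, Dict, List

# Hypothetical strategy table for illustration; not SOMA's internal code.
STRATEGIES: Dict[str, Callable[[str], List[str]]] = {
    # Split on runs of whitespace; punctuation stays attached to words.
    "whitespace": lambda text: text.split(),
    # Runs of word characters, plus each punctuation mark as its own token.
    "word": lambda text: re.findall(r"\w+|[^\w\s]", text),
    # One token per character, spaces included.
    "character": lambda text: list(text),
}


def simple_tokenize(text: str, method: str = "whitespace") -> List[str]:
    """Tokenize `text` with the named strategy, or raise ValueError."""
    try:
        strategy = STRATEGIES[method]
    except KeyError:
        raise ValueError(
            f"unknown method {method!r}; expected one of {sorted(STRATEGIES)}"
        ) from None
    return strategy(text)


if __name__ == "__main__":
    sample = "The future of AI is structural."
    for name in STRATEGIES:
        print(name, simple_tokenize(sample, name))
```

Keeping the strategies in a dictionary means that switching methods is a single string argument, which mirrors the `tokenization_method` parameter used in the Quick Start.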
📦 Installation
pip install somaya
⚡ Quick Start
Python API
from soma import TextTokenizationEngine
# Initialize the engine
engine = TextTokenizationEngine()
# Process text with advanced analysis
text = "The future of AI is structural."
result = engine.tokenize(text, tokenization_method="subword")
print(f"Tokens: {result['tokens']}")
print(f"Features: {result['features']}")
# Output:
# Tokens: ['The', 'fut', 'ure', 'of', 'AI', 'is', 'str', 'uct', 'ural', '.']
# Features: {'entropy_index': 7, 'balance_index': 4, ...}
Command Line Interface
Process files directly from your terminal:
# Tokenize a file
soma tokenize input.txt --method subword --output result.json
# Analyze text structure
soma analyze "Analyze this sentence for structural balance."
🏗️ Architecture
SOMA is modular by design, allowing you to use only what you need:
| Module | Purpose |
|---|---|
| `soma` | The high-level wrapper and entry point for all standard operations. |
| `soma_core` | Structural Core: Handles metrics, pattern recognition, and hierarchy detection. |
| `cognitive` | AI Layer: Contains reasoning engines, SLM (Small Language Model) architectures (`soma_gpt`), and verbalizers. |
| `src` | Engine Room: The low-level implementations of parallel tokenizers and embedding generators. |
| `semantic_trainer` | Training: Tools for training custom semantic embeddings on your own corpora. |
🔧 Modules Overview
1. SOMA Core (soma_core)
The backbone of the system. It replaces simple regex splitting with structure-aware parsing.
- Key Class: `StructureHierarchy`
- Capabilities: Pattern building, similarity metrics via `soma_core_metrics`.
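As a rough illustration of what structure-aware parsing adds over a flat regex split, the sketch below builds a document → paragraph → sentence → word tree and compares two nodes with a Jaccard word overlap. Everything here is an assumption made for the example: `Node`, `build_hierarchy`, `words`, and `jaccard` are hypothetical names, the sentence splitter is deliberately naive, and none of it reflects the actual `StructureHierarchy` or `soma_core_metrics` interfaces.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    level: str  # "document", "paragraph", "sentence", or "word"
    text: str
    children: List["Node"] = field(default_factory=list)


def build_hierarchy(text: str) -> Node:
    """Parse text into a document > paragraph > sentence > word tree."""
    root = Node("document", text)
    # Paragraphs are separated by one or more blank lines.
    for para in re.split(r"\n\s*\n", text.strip()):
        p_node = Node("paragraph", para)
        # Naive sentence split: break after ., !, or ? followed by whitespace.
        for sent in re.split(r"(?<=[.!?])\s+", para.strip()):
            s_node = Node("sentence", sent)
            s_node.children = [Node("word", w) for w in re.findall(r"\w+", sent)]
            p_node.children.append(s_node)
        root.children.append(p_node)
    return root


def words(node: Node) -> List[str]:
    """Collect the word tokens under a node, in document order."""
    if node.level == "word":
        return [node.text]
    return [w for child in node.children for w in words(child)]


def jaccard(a: Node, b: Node) -> float:
    """Jaccard overlap of the lowercase word sets of two nodes."""
    set_a = {w.lower() for w in words(a)}
    set_b = {w.lower() for w in words(b)}
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


if __name__ == "__main__":
    doc = build_hierarchy("The cat sat. The cat ran.\n\nA second paragraph.")
    first, second = doc.children[0].children
    print(len(doc.children), "paragraphs")
    print(jaccard(first, second))
```

Because every level keeps its children, a metric such as `jaccard` can be applied to sentences, paragraphs, or whole documents without re-tokenizing the text.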
2. Cognitive Layer (cognitive)
Where text meets reasoning.
- Reasoning: `soma_reasoner.py` enables logical deduction chains.
- SLM: `soma_gpt.py` provides a lightweight, trainable transformer implementation for specialized tasks.
3. Vector Integration
Seamlessly plug into vector databases.
- Built-in support for Weaviate and ChromaDB.
- Easy export of semantic embeddings to downstream ML tasks.
🤝 Contributing
We welcome contributions! Please see CONTRIBUTING.md for details.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
Author: Santosh Chavala
Repository: https://github.com/chavalasantosh/SanVerse
Download files
File details
Details for the file somaya-1.0.5.tar.gz.
File metadata
- Download URL: somaya-1.0.5.tar.gz
- Upload date:
- Size: 5.8 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e39db3314bf48ba6f44d6f1f293870879c69dbdd197c2df9329dc6881df8b849 |
| MD5 | 25c93d7cbb023e027c5d8dddf49d6c4b |
| BLAKE2b-256 | 4f549c48014329ff4e09d6b914092d8b8146f73aae3a613702c19d31dfe25985 |
File details
Details for the file somaya-1.0.5-py3-none-any.whl.
File metadata
- Download URL: somaya-1.0.5-py3-none-any.whl
- Upload date:
- Size: 477.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d477fb8b32e0ada08a2d36069fb51d6daf0c93156edfba80db983d89278e1c0a |
| MD5 | f09e62f143b4284a32f56f9d03cca685 |
| BLAKE2b-256 | 47a3e9850cf49bc4dcd93bb13442c9247c50b00e121b37e1604f70bf1ba2dafb |
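If you download one of these files by hand, you can check it against the digests listed above before installing. The sketch below uses only Python's standard `hashlib`; the helper names `file_digests` and `verify_sha256` are hypothetical, and BLAKE2b-256 is computed as BLAKE2b with a 32-byte digest.

```python
import hashlib
from typing import Dict


def file_digests(path: str, chunk_size: int = 1 << 20) -> Dict[str, str]:
    """Return hex digests of a file, reading it in chunks to bound memory use."""
    hashers = {
        "sha256": hashlib.sha256(),
        "md5": hashlib.md5(),
        # BLAKE2b-256 is BLAKE2b with a 32-byte (256-bit) digest.
        "blake2b_256": hashlib.blake2b(digest_size=32),
    }
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def verify_sha256(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return file_digests(path)["sha256"] == expected_hex.strip().lower()
```

For example, after saving the source distribution locally, `verify_sha256("somaya-1.0.5.tar.gz", "<SHA256 from the table>")` should return `True`; the local path is an assumption about where you stored the file.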