
SOMA - Advanced Tokenization & Intelligence Framework


SOMA: Advanced Intelligence Framework


SOMA is a next-generation tokenization and intelligence framework designed to bridge the gap between raw text and semantic understanding. Unlike traditional tokenizers that simply split text, SOMA applies mathematical analysis, feature extraction, and cognitive structures to create a richer representation of language.

"Intelligence begins with how we perceive the data. SOMA changes the perception."


🚀 Why SOMA?

SOMA is built for researchers and developers who need more than just BPE (Byte Pair Encoding). It offers a unified engine for:

  • Universal Tokenization: Seamlessly switch between whitespace, word, character, subword, and grammar-based strategies.
  • Mathematical Embeddings: SOMA's own "Frontend Digit" calculation for deterministic, low-compute feature extraction.
  • Cognitive Architecture: Integrated support for Small Language Models (SLMs) and reasoning pipelines.
  • Structure-Aware Parsing: The soma_core module detects text hierarchy and structural patterns.
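
To make the idea of switching strategies concrete, here is a minimal standard-library sketch of dispatching on a method name. It is not SOMA's implementation: the function names, the `STRATEGIES` table, and the regular expression are assumptions chosen for illustration, and only three of the listed strategies are covered.

```python
import re
from typing import Callable, Dict, List


# Illustrative strategies only; these are NOT SOMA's implementations.
def whitespace_tokens(text: str) -> List[str]:
    """Split on runs of whitespace; punctuation stays attached to words."""
    return text.split()


def word_tokens(text: str) -> List[str]:
    """Emit words and individual punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokens(text: str) -> List[str]:
    """Emit every character, including spaces, as its own token."""
    return list(text)


# Hypothetical lookup table mapping a method name to a strategy function.
STRATEGIES: Dict[str, Callable[[str], List[str]]] = {
    "whitespace": whitespace_tokens,
    "word": word_tokens,
    "character": char_tokens,
}


def tokenize(text: str, method: str = "word") -> List[str]:
    """Dispatch to the chosen strategy, rejecting unknown method names."""
    try:
        strategy = STRATEGIES[method]
    except KeyError:
        raise ValueError(f"unknown method: {method!r}") from None
    return strategy(text)


sample = "The future of AI is structural."
for name in STRATEGIES:
    print(name, tokenize(sample, name)[:6])
```

Keeping the strategies behind one lookup table means the caller changes a single string to compare outputs, which is the kind of switching the list above describes.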

📦 Installation

pip install somaya

⚡ Quick Start

Python API

from soma import TextTokenizationEngine

# Initialize the engine
engine = TextTokenizationEngine()

# Process text with advanced analysis
text = "The future of AI is structural."
result = engine.tokenize(text, tokenization_method="subword")

print(f"Tokens:   {result['tokens']}")
print(f"Features: {result['features']}")
# Output:
# Tokens:   ['The', 'fut', 'ure', 'of', 'AI', 'is', 'str', 'uct', 'ural', '.']
# Features: {'entropy_index': 7, 'balance_index': 4, ...}
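
The result above is shown as a dictionary with 'tokens' and 'features' keys. As a small sketch of consuming it defensively, the helper below formats such a dictionary without assuming which feature names are present. The `summarize` function is hypothetical, and the `example` dictionary is a hand-built stand-in copied from the sample output, not a value produced by running SOMA.

```python
from typing import Any, Dict


def summarize(result: Dict[str, Any]) -> str:
    """Render a tokenization result as text, tolerating missing keys."""
    tokens = result.get("tokens", [])
    features = result.get("features", {})
    lines = [f"{len(tokens)} tokens: {' | '.join(map(str, tokens))}"]
    # Sort feature names so the report is stable across runs.
    for name in sorted(features):
        lines.append(f"  {name} = {features[name]}")
    return "\n".join(lines)


# Stand-in shaped like the sample output above (the elided features are omitted).
example = {
    "tokens": ["The", "fut", "ure", "of", "AI", "is", "str", "uct", "ural", "."],
    "features": {"entropy_index": 7, "balance_index": 4},
}
print(summarize(example))
```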

Command Line Interface

Process files directly from your terminal:

# Tokenize a file
soma tokenize input.txt --method subword --output result.json

# Analyze text structure
soma analyze "Analyze this sentence for structural balance."
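
If the CLI needs to be driven from a script, one option is to wrap it with subprocess. The sketch below assumes the soma executable is on PATH and uses only the tokenize subcommand and the --method and --output flags shown above; the helper names are hypothetical, and the CLI's exit codes and output format are not documented here.

```python
import shlex
import subprocess
from typing import List, Optional


def build_tokenize_command(
    input_path: str,
    method: str = "subword",
    output_path: Optional[str] = None,
) -> List[str]:
    """Build the argv list for `soma tokenize` using only the flags shown above."""
    cmd = ["soma", "tokenize", input_path, "--method", method]
    if output_path is not None:
        cmd += ["--output", output_path]
    return cmd


def run_tokenize(
    input_path: str,
    method: str = "subword",
    output_path: Optional[str] = None,
) -> subprocess.CompletedProcess:
    """Run the CLI; raises CalledProcessError on a non-zero exit status."""
    cmd = build_tokenize_command(input_path, method, output_path)
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


# Print the command instead of running it, so the sketch works without SOMA installed.
print(shlex.join(build_tokenize_command("input.txt", output_path="result.json")))
```

Passing an argument list rather than a shell string avoids quoting problems when file names contain spaces.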

🏗️ Architecture

SOMA is modular by design, allowing you to use only what you need:

Module            Purpose
soma              The high-level wrapper and entry point for all standard operations.
soma_core         Structural Core: Handles metrics, pattern recognition, and hierarchy detection.
cognitive         AI Layer: Contains reasoning engines, SLM (Small Language Model) architectures (soma_gpt), and verbalizers.
src               Engine Room: The low-level implementations of parallel tokenizers and embedding generators.
semantic_trainer  Training: Tools for training custom semantic embeddings on your own corpora.

🔧 Modules Overview

1. SOMA Core (soma_core)

The backbone of the system. It replaces simple regex splitting with structure-aware parsing.

  • Key Class: StructureHierarchy
  • Capabilities: Pattern building and similarity metrics (via soma_core_metrics).
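
The metrics that soma_core_metrics actually provides are not listed here. As a generic illustration of what a token-level similarity metric computes, the sketch below uses Jaccard similarity over token sets; this is a swapped-in textbook technique, not SOMA's metric, and the function name is an assumption.

```python
from typing import Iterable


def jaccard_similarity(a: Iterable[str], b: Iterable[str]) -> float:
    """Return |A ∩ B| / |A ∪ B| over the distinct tokens of two sequences."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        # Two empty inputs are treated as identical by convention.
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


left = ["the", "future", "of", "ai"]
right = ["the", "future", "is", "structural"]
# Two shared tokens out of six distinct tokens overall.
print(round(jaccard_similarity(left, right), 3))
```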

2. Cognitive Layer (cognitive)

Where text meets reasoning.

  • Reasoning: soma_reasoner.py enables logical deduction chains.
  • SLM: soma_gpt.py provides a lightweight, trainable transformer implementation for specialized tasks.

3. Vector Integration

Seamlessly plug into vector databases.

  • Built-in support for Weaviate and ChromaDB.
  • Easy export of semantic embeddings to downstream ML tasks.
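
SOMA's own export helpers are not shown here, so the following is only a sketch of the general shape such an export takes: pairing each text with an id and a vector, then serializing to JSON Lines, a format most vector-store loaders and ML pipelines can ingest. The function names, the id scheme, and the vectors are placeholders, not SOMA output.

```python
import json
from typing import Any, Dict, List, Sequence


def to_records(
    texts: Sequence[str], vectors: Sequence[Sequence[float]]
) -> List[Dict[str, Any]]:
    """Pair each text with its vector and a generated id (hypothetical scheme)."""
    if len(texts) != len(vectors):
        raise ValueError("texts and vectors must have the same length")
    return [
        {"id": f"doc-{i}", "text": text, "embedding": [float(x) for x in vec]}
        for i, (text, vec) in enumerate(zip(texts, vectors))
    ]


def to_jsonl(records: List[Dict[str, Any]]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Placeholder vectors for illustration; real embeddings would come from the library.
texts = ["The future of AI is structural.", "Tokenization shapes perception."]
vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(to_jsonl(to_records(texts, vectors)))
```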

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for details.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Author: Santosh Chavala
Repository: https://github.com/chavalasantosh/SanVerse
