LENS - Language Embedder With No Synthesizer. Offline-first semantic search for everyday laptops.

JustEmbed - LENS

Language Embedder with No Synthesizer

Offline-first semantic search for everyday laptops. Train custom domain-specific models in seconds, no GPU required.

PyPI version Python 3.8+ License: MIT

Philosophy

JustEmbed is built on three core principles:

  1. Offline-First: Everything runs locally. No API keys, no cloud dependencies, no internet required.
  2. Laptop-Friendly: Designed for everyday hardware. CPU-only, fast training (<5 seconds), small models (~8 MB).
  3. Domain-Specific: Train custom models on your text to learn domain-specific synonyms (pyrexia ↔ fever, renal ↔ kidney).

Why JustEmbed?

Most embedding solutions require:

  • GPU hardware
  • Cloud API keys and costs
  • Hours of training time
  • Large model files (GB)
  • Internet connectivity

JustEmbed requires:

  • ✅ Any laptop with Python
  • ✅ No API keys or costs
  • ✅ Seconds of training time
  • ✅ Small models (8 MB)
  • ✅ Works completely offline

What's Working

✅ Core Features (v0.1.1a1)

  • E5-Small Embeddings: General-purpose 384-dim embeddings via ONNX
  • Custom Model Training: Train domain-specific models from your text
  • Knowledge Bases: Create multiple KBs with different models
  • Semantic Search: Query with natural language, get relevant results
  • Web UI: Browser-based interface for all operations
  • CLI: Command-line interface for automation
  • Offline Operation: No internet required after installation

✅ Custom Model Training

Train models that learn your domain's vocabulary:

# Medical domain example
# Training text contains: "pyrexia" and "fever"
# After training, model learns: pyrexia ↔ fever (similarity: 0.83)

# Legal domain example  
# Training text contains: "plaintiff" and "claimant"
# After training, model learns: plaintiff ↔ claimant (similarity: 0.85)

Training Performance:

  • Time: <5 seconds for 1000-word corpus
  • Hardware: CPU-only (no GPU needed)
  • Model size: ~8 MB
  • Embedding dim: 64-256 (configurable)

✅ Search Quality

  • Precision: High-quality results with scores 0.6-0.9
  • Recall: Finds synonyms and related concepts
  • Speed: <100ms query latency

Example query results:

Query: "fever"
Results:
  1. Score: 0.862 - "...fever in the context of infection..."
  2. Score: 0.862 - "...pyrexia, commonly referred to as fever..."
  3. Score: 0.836 - "Body temperature regulation..."
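
Scores like these behave like cosine similarities between the query embedding and each chunk embedding. The following is a minimal sketch of that ranking step, not JustEmbed's actual query pipeline; `l2_normalize`, `rank_chunks`, and the toy 3-dimensional vectors are illustrative assumptions.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def rank_chunks(query_vec, chunk_vecs, top_k=3):
    """Return (score, index) pairs sorted by descending cosine similarity."""
    q = l2_normalize(query_vec)
    scored = []
    for idx, vec in enumerate(chunk_vecs):
        c = l2_normalize(vec)
        # For unit-length vectors the dot product equals the cosine similarity.
        score = sum(a * b for a, b in zip(q, c))
        scored.append((score, idx))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Toy vectors standing in for real embeddings (hypothetical values).
chunks = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
ranking = rank_chunks([1.0, 0.0, 0.0], chunks)
```

A brute-force scan like this is usually fast enough for the small, local knowledge bases a laptop tool targets; larger collections would need an index.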

Quick Start

Installation

pip install justembed

Start the Server

justembed begin --workspace ~/my_docs --port 5424

Open your browser to http://localhost:5424

Train a Custom Model

  1. Click "🚀 Train Custom Model"
  2. Upload your domain-specific text file (.txt or .md)
  3. Enter model name (e.g., "medical_v1")
  4. Click "Train Model" (takes ~5 seconds)

Create a Knowledge Base

  1. Enter KB name (e.g., "medical_kb")
  2. Select model type: "Custom Model"
  3. Select your trained model
  4. Click "Create KB"

Upload Documents

  1. Choose your document file
  2. Select the KB
  3. Click "Upload & Preview Chunks"
  4. Review chunks and click "Apply Chunking"
  5. Wait for embedding to complete
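
Chunking splits a document into passages before they are embedded. The sketch below shows one possible strategy: split on blank lines, then fall back to packing sentences when a paragraph is too long. `chunk_text` and `max_chars` are hypothetical names, and JustEmbed's actual chunker may work differently.

```python
import re

def chunk_text(text, max_chars=500):
    """Split on blank lines; fall back to sentence packing for long paragraphs."""
    chunks = []
    for para in re.split(r"\n\s*\n", text.strip()):
        para = " ".join(para.split())  # collapse internal whitespace
        if not para:
            continue
        if len(para) <= max_chars:
            chunks.append(para)
            continue
        # Fallback: pack whole sentences into chunks of roughly max_chars.
        # A single sentence longer than max_chars is kept whole.
        current = ""
        for sentence in re.split(r"(?<=[.!?])\s+", para):
            if current and len(current) + 1 + len(sentence) > max_chars:
                chunks.append(current)
                current = sentence
            else:
                current = f"{current} {sentence}".strip()
        if current:
            chunks.append(current)
    return chunks
```

Previewing the result before embedding, as the web UI does, makes it easy to spot chunks that are too short or too long for the corpus at hand.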

Query

  1. Enter search query (e.g., "fever", "pyrexia")
  2. Select KB or "All KBs"
  3. Click "Search"
  4. View results with relevance scores

Use Cases

Medical Documentation

Train on medical texts to learn:

  • pyrexia ↔ fever
  • renal ↔ kidney
  • UTI ↔ urinary tract infection
  • hypertension ↔ high blood pressure

Legal Documents

Train on legal texts to learn:

  • plaintiff ↔ claimant
  • defendant ↔ respondent
  • tort ↔ civil wrong
  • litigation ↔ lawsuit

Technical Documentation

Train on technical texts to learn:

  • API ↔ application programming interface
  • REST ↔ representational state transfer
  • CRUD ↔ create read update delete
  • microservices ↔ service-oriented architecture

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        Web UI / CLI                          │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                      FastAPI Server                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │   Training   │  │   Embedding  │  │    Query     │      │
│  │   Pipeline   │  │   Pipeline   │  │   Pipeline   │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                    Embedder Layer                            │
│  ┌──────────────┐              ┌──────────────┐            │
│  │  E5-Small    │              │   Custom     │            │
│  │  (ONNX)      │              │   Models     │            │
│  │  384-dim     │              │   (ONNX)     │            │
│  └──────────────┘              └──────────────┘            │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                    Storage Layer                             │
│  ┌──────────────┐              ┌──────────────┐            │
│  │   DuckDB     │              │   File       │            │
│  │   (KBs)      │              │   System     │            │
│  └──────────────┘              └──────────────┘            │
└─────────────────────────────────────────────────────────────┘

Custom Model Training

How It Works

  1. TF-IDF Vectorization: Extract features from your text
  2. MLP Training: Neural network learns to compress features
  3. ONNX Export: Portable model format for fast inference
  4. L2 Normalization: Consistent similarity scores
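
Steps 1 and 4 can be illustrated without any dependencies. The sketch below computes TF-IDF vectors with a smoothed IDF and L2-normalizes them; it is a simplified stand-in, not JustEmbed's training code, and it omits the MLP and ONNX steps entirely. `tfidf_vectors` is an illustrative name.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for raw text docs."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec))
        vectors.append([x / norm for x in vec] if norm else vec)
    return vocab, vectors

vocab, vectors = tfidf_vectors(["fever and chills", "pyrexia and chills"])
```

In this raw TF-IDF space "fever" and "pyrexia" occupy separate dimensions, so the two documents are similar only through their shared context words. The learned compression step is what is meant to bring such terms closer together.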

Training Pipeline

Text Corpus → Chunking → TF-IDF (5000 features)
                              ↓
                    MLP (512 → 256 → 128)
                              ↓
                    ONNX Export (~8 MB)
                              ↓
                    Custom Embedder
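
As a rough sanity check on model size, the parameter count of a fully connected network with these layer widths can be computed directly. This is a back-of-the-envelope estimate, assuming dense layers with biases, float32 weights, and the full 5000 input features; `mlp_param_count` is an illustrative helper, not a JustEmbed function.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases of a fully connected network."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

# TF-IDF features -> hidden -> hidden -> embedding (assumed layout).
layers = [5000, 512, 256, 128]
params = mlp_param_count(layers)
size_mb = params * 4 / 1_000_000  # float32 = 4 bytes per parameter
```

Under these assumptions the weights alone come to about 10.9 MB, almost all of it in the first layer. The exported size would therefore presumably track how many TF-IDF features the corpus actually produces, and a small corpus with fewer distinct terms would give a smaller file.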

Model Configuration

  • Embedding Dimension: 64-256 (default: 128)
  • Max Features: 1000-10000 (default: 5000)
  • Hidden Layers: 512 → 256
  • Activation: ReLU
  • Optimizer: Adam
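
The documented ranges can be checked before training starts. The sketch below is a hypothetical configuration container that mirrors the options above; `TrainingConfig` is not part of the JustEmbed API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical container mirroring the documented options and defaults."""
    embedding_dim: int = 128
    max_features: int = 5000

    def __post_init__(self):
        # Reject values outside the documented ranges.
        if not 64 <= self.embedding_dim <= 256:
            raise ValueError("embedding_dim must be between 64 and 256")
        if not 1000 <= self.max_features <= 10000:
            raise ValueError("max_features must be between 1000 and 10000")

config = TrainingConfig(embedding_dim=64)
```

Validating up front gives a clear error message instead of a failure partway through training.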

Performance

Training

  • Time: <5 seconds (1000-word corpus)
  • Hardware: CPU-only
  • Memory: <500 MB
  • Model Size: ~8 MB

Inference

  • Query Latency: <100ms
  • Embedding Speed: ~1000 docs/second
  • Memory: <200 MB per model
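
Latency figures like these depend on hardware, so it is worth measuring them locally. The sketch below is a small timing helper; `median_latency_ms` is an illustrative name, and the `sorted` call is only a stand-in workload.

```python
import statistics
import time

def median_latency_ms(fn, *args, repeats=20):
    """Call fn(*args) `repeats` times; return the median wall-clock time in ms."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)

# Stand-in workload; replace with a real call such as embedder.embed_query.
latency = median_latency_ms(sorted, list(range(1000)))
```

The median is less sensitive than the mean to one-off stalls such as a cold model load on the first call.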

Quality

  • Similarity Scores: 0.6-0.9 for relevant results
  • Synonym Learning: 0.8+ for domain terms
  • Semantic Understanding: Related concepts found

Requirements

  • Python 3.8+
  • 500 MB disk space
  • 1 GB RAM
  • CPU (no GPU required)

Dependencies

Core:

  • FastAPI (web server)
  • ONNX Runtime (model inference)
  • DuckDB (storage)
  • scikit-learn (training)

The full list is in pyproject.toml.

CLI Commands

# Start server
justembed begin --workspace ~/docs --port 5424

# Start with a custom host (0.0.0.0 listens on all network interfaces)
justembed begin --workspace ~/docs --host 0.0.0.0 --port 8000

# Show version
justembed --version

# Show help
justembed --help
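
For automation, the server command can be assembled from Python and passed to `subprocess`. The sketch below only builds the argument list from the flags shown above; `build_begin_command` is a hypothetical helper, not something JustEmbed provides.

```python
import os

def build_begin_command(workspace, port=5424, host=None):
    """Build the argument list for `justembed begin` using the documented flags."""
    # subprocess does not expand "~" without a shell, so expand it here.
    workspace = os.path.expanduser(str(workspace))
    cmd = ["justembed", "begin", "--workspace", workspace, "--port", str(port)]
    if host is not None:
        cmd += ["--host", host]
    return cmd

cmd = build_begin_command("/tmp/docs", port=8000, host="0.0.0.0")
# To launch the server: subprocess.Popen(cmd)
```

Passing a list rather than a single shell string avoids quoting problems when the workspace path contains spaces.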

Python API

from justembed.embedder import E5Embedder, CustomEmbedder
from justembed.training.trainer import CustomModelTrainer

# Train custom model
trainer = CustomModelTrainer()
model_dir = trainer.train(
    corpus=["text1", "text2", "text3"],
    model_name="my_model",
    embedding_dim=128,
    max_features=5000,
)

# Use custom embedder
embedder = CustomEmbedder("my_model")
embeddings = embedder.embed(["query text"])
query_emb = embedder.embed_query("search query")

# Use the general-purpose E5 embedder (384-dim)
e5 = E5Embedder()
e5_embeddings = e5.embed(["text1", "text2"])

Configuration

Models are stored in ~/.cache/justembed/:

  • custom_models/ - Custom trained models
  • tokenizer.json - E5 tokenizer

Workspace structure:

workspace/
├── kb/
│   ├── kb1.duckdb
│   ├── kb2.duckdb
│   └── _history.duckdb
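
Given this layout, the knowledge bases in a workspace can be discovered from the file system. The sketch below assumes that each `.duckdb` file under `kb/` is one knowledge base and that underscore-prefixed files such as `_history.duckdb` are internal; `list_knowledge_bases` is an illustrative helper, not a JustEmbed function.

```python
from pathlib import Path

def list_knowledge_bases(workspace):
    """Return KB names under <workspace>/kb, skipping underscore-prefixed files."""
    kb_dir = Path(workspace).expanduser() / "kb"
    if not kb_dir.is_dir():
        return []
    return sorted(
        path.stem
        for path in kb_dir.glob("*.duckdb")
        if not path.name.startswith("_")
    )
```

For example, `list_knowledge_bases("~/my_docs")` would inspect `~/my_docs/kb/` and return an empty list if the directory does not exist yet.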

Roadmap

v0.1.x (Current)

  • ✅ E5-Small embeddings
  • ✅ Custom model training
  • ✅ Web UI
  • ✅ CLI
  • ✅ Knowledge bases
  • ✅ Semantic search

License

MIT License - see LICENSE file for details

Author

Krishnamoorthy Sankaran

Citation

If you use JustEmbed in your research, please cite:

@software{justembed2024,
  title = {JustEmbed: Offline-First Semantic Search for Everyday Laptops},
  author = {Sankaran, Krishnamoorthy},
  year = {2024},
  url = {https://github.com/sekarkrishna/justembed}
}

Acknowledgments

  • E5-Small model by Microsoft
  • ONNX Runtime by Microsoft
  • FastAPI by Sebastián Ramírez
  • DuckDB by DuckDB Labs

Changelog

v0.1.1a1 (2026-02-14)

New Features:

  • Custom model training from text files
  • Domain-specific synonym learning
  • Model selection in KB creation
  • Improved text chunking (sentence-based fallback)
  • Web UI for model training

Improvements:

  • Reduced minimum training corpus to 500 words
  • Better error messages
  • Model metadata display in UI
  • Query results show model used

Bug Fixes:

  • Fixed text chunking for continuous text
  • Fixed ONNX shape handling for custom models
  • Fixed model caching

v0.1.0 (2026-01-15)

  • Initial release
  • E5-Small embeddings
  • Basic web UI
  • Knowledge base management
  • Semantic search

JustEmbed - Semantic search that just works. Offline. On your laptop. In seconds.
