LENS - Language Embedder With No Synthesizer. Offline-first semantic search for everyday laptops.

JustEmbed - LENS

Language Embedder with No Synthesizer

Offline-first semantic search for everyday laptops. Train custom domain-specific models in seconds, no GPU required.

Philosophy

JustEmbed is built on three core principles:

  1. Offline-First: Everything runs locally. No API keys, no cloud dependencies, no internet required.
  2. Laptop-Friendly: Designed for everyday hardware. CPU-only, fast training (<5 seconds), small models (~8 MB).
  3. Domain-Specific: Train custom models on your text to learn domain-specific synonyms (pyrexia ↔ fever, renal ↔ kidney).

Why JustEmbed?

Most embedding solutions require:

  • GPU hardware
  • Cloud API keys and costs
  • Hours of training time
  • Large model files (GB)
  • Internet connectivity

JustEmbed requires:

  • ✅ Any laptop with Python
  • ✅ No API keys or costs
  • ✅ Seconds of training time
  • ✅ Small models (~8 MB)
  • ✅ Works completely offline

What's Working

✅ Core Features (v0.1.1a1)

  • E5-Small Embeddings: General-purpose 384-dim embeddings via ONNX
  • Custom Model Training: Train domain-specific models from your text
  • Knowledge Bases: Create multiple KBs with different models
  • Semantic Search: Query with natural language, get relevant results
  • Web UI: Browser-based interface for all operations
  • CLI: Command-line interface for automation
  • Offline Operation: No internet required after installation

✅ Custom Model Training

Train models that learn your domain's vocabulary:

# Medical domain example
# Training text contains: "pyrexia" and "fever"
# After training, model learns: pyrexia ↔ fever (similarity: 0.83)

# Legal domain example  
# Training text contains: "plaintiff" and "claimant"
# After training, model learns: plaintiff ↔ claimant (similarity: 0.85)

Training Performance:

  • Time: <5 seconds for 1000-word corpus
  • Hardware: CPU-only (no GPU needed)
  • Model size: ~8 MB
  • Embedding dim: 64-256 (configurable)

✅ Search Quality

  • Precision: High-quality results with scores 0.6-0.9
  • Recall: Finds synonyms and related concepts
  • Speed: <100ms query latency

Example query results:

Query: "fever"
Results:
  1. Score: 0.862 - "...fever in the context of infection..."
  2. Score: 0.862 - "...pyrexia, commonly referred to as fever..."
  3. Score: 0.836 - "Body temperature regulation..."
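
The ranking behind results like these can be sketched in a few lines. This is only an illustration, assuming the embeddings are already L2-normalized so that a dot product equals cosine similarity. The `top_k` helper and the toy vectors are hypothetical and not part of the JustEmbed API.

```python
import heapq

def top_k(query_vec, chunk_vecs, k=3):
    """Return (score, index) pairs for the k chunks most similar to the query.

    Assumes every vector is L2-normalized, so the dot product is the
    cosine similarity.
    """
    scores = (
        (sum(q * c for q, c in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(chunk_vecs)
    )
    return heapq.nlargest(k, scores)

# Toy, hand-made unit vectors (hypothetical data, not real embeddings)
query = [1.0, 0.0]
chunks = [[0.6, 0.8], [1.0, 0.0], [0.0, 1.0]]
results = top_k(query, chunks, k=2)  # best match first
```

Because the vectors are unit length, the scores fall between -1 and 1, which is why results can be compared on one scale across documents.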

Quick Start

Installation

pip install justembed

Start the Server

justembed begin --workspace ~/my_docs --port 5424

Open your browser at http://localhost:5424

Train a Custom Model

  1. Click "🚀 Train Custom Model"
  2. Upload your domain-specific text file (.txt or .md)
  3. Enter model name (e.g., "medical_v1")
  4. Click "Train Model" (typically takes under 5 seconds)

Create a Knowledge Base

  1. Enter KB name (e.g., "medical_kb")
  2. Select model type: "Custom Model"
  3. Select your trained model
  4. Click "Create KB"

Upload Documents

  1. Choose your document file
  2. Select the KB
  3. Click "Upload & Preview Chunks"
  4. Review chunks and click "Apply Chunking"
  5. Wait for embedding to complete
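
To give a feel for what the chunk preview step produces, here is a minimal sketch of sentence-based chunking. JustEmbed's actual chunking rules are not documented here, so the `chunk_by_sentences` function, its `max_chars` parameter, and the splitting regex are all assumptions made for illustration.

```python
import re

def chunk_by_sentences(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own chunk.
    """
    # Split after ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the current chunk
            current = sentence       # start a new one
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = (
    "Pyrexia is commonly called fever. It often accompanies infection. "
    "Renal means kidney-related."
)
chunks = chunk_by_sentences(text, max_chars=70)
```

Keeping sentences whole means each chunk stays readable when it appears in search results.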

Query

  1. Enter search query (e.g., "fever", "pyrexia")
  2. Select KB or "All KBs"
  3. Click "Search"
  4. View results with relevance scores

Use Cases

Medical Documentation

Train on medical texts to learn:

  • pyrexia ↔ fever
  • renal ↔ kidney
  • UTI ↔ urinary tract infection
  • hypertension ↔ high blood pressure

Legal Documents

Train on legal texts to learn:

  • plaintiff ↔ claimant
  • defendant ↔ respondent
  • tort ↔ civil wrong
  • litigation ↔ lawsuit

Technical Documentation

Train on technical texts to learn:

  • API ↔ application programming interface
  • REST ↔ representational state transfer
  • CRUD ↔ create read update delete
  • microservices ↔ service-oriented architecture

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        Web UI / CLI                          │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                      FastAPI Server                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │   Training   │  │   Embedding  │  │    Query     │      │
│  │   Pipeline   │  │   Pipeline   │  │   Pipeline   │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                    Embedder Layer                            │
│  ┌──────────────┐              ┌──────────────┐            │
│  │  E5-Small    │              │   Custom     │            │
│  │  (ONNX)      │              │   Models     │            │
│  │  384-dim     │              │   (ONNX)     │            │
│  └──────────────┘              └──────────────┘            │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                    Storage Layer                             │
│  ┌──────────────┐              ┌──────────────┐            │
│  │   DuckDB     │              │   File       │            │
│  │   (KBs)      │              │   System     │            │
│  └──────────────┘              └──────────────┘            │
└─────────────────────────────────────────────────────────────┘

Custom Model Training

How It Works

  1. TF-IDF Vectorization: Extract features from your text
  2. MLP Training: Neural network learns to compress features
  3. ONNX Export: Portable model format for fast inference
  4. L2 Normalization: Consistent similarity scores
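
The last step is easy to illustrate in isolation. The sketch below uses hypothetical helper names and plain Python lists instead of the library's internals. It shows why L2 normalization gives consistent scores: once vectors have unit length, the dot product depends only on their direction, not their magnitude.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0])     # -> [0.6, 0.8]
b = l2_normalize([30.0, 40.0])   # same direction, ten times the magnitude
similarity = dot(a, b)           # approximately 1.0
```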

Training Pipeline

Text Corpus → Chunking → TF-IDF (5000 features)
                              ↓
                    MLP (512 → 256 → 128)
                              ↓
                    ONNX Export (~8 MB)
                              ↓
                    Custom Embedder
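
The first stage of the pipeline can be sketched in plain Python. This is an illustrative approximation, not JustEmbed's code: the tokenizer regex and the `tfidf` function are assumptions, and the smoothed IDF formula is the one scikit-learn's `TfidfVectorizer` uses by default (without its row normalization).

```python
import math
import re
from collections import Counter

def tfidf(docs, max_features=5000):
    """Return (vocabulary, matrix) with one TF-IDF row per document."""
    tokenized = [re.findall(r"[a-z0-9]+", d.lower()) for d in docs]
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Keep the most frequent terms in the corpus, capped at max_features
    totals = Counter(term for tokens in tokenized for term in tokens)
    vocab = sorted(term for term, _ in totals.most_common(max_features))
    n = len(docs)
    # Smoothed inverse document frequency: rarer terms get larger weights
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[t] * idf[t] for t in vocab])
    return vocab, matrix

docs = ["fever and chills", "pyrexia and chills", "renal function"]
vocab, matrix = tfidf(docs)
```

Terms that appear in fewer documents (such as "fever" here) receive a higher weight than terms shared across documents (such as "and"). These are the features the MLP then compresses.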

Model Configuration

  • Embedding Dimension: 64-256 (default: 128)
  • Max Features: 1000-10000 (default: 5000)
  • Hidden Layers: 512 → 256
  • Activation: ReLU
  • Optimizer: Adam
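
To make the configuration concrete, the following sketch runs a forward pass through a network of the same shape, with tiny layer sizes standing in for 5000 → 512 → 256 → 128. It is not the library's implementation. The weights are random rather than trained, the training loop and loss are omitted because they are not described here, and the linear output layer is an assumption.

```python
import math
import random

def dense(x, weights, bias):
    """Fully connected layer: one output per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_layer(rng, n_in, n_out):
    """Random weights and zero biases (untrained, for illustration only)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(x, layers):
    """ReLU on hidden layers, linear output layer, then L2 normalization."""
    for i, (weights, bias) in enumerate(layers):
        x = dense(x, weights, bias)
        if i < len(layers) - 1:
            x = relu(x)
    norm = math.sqrt(sum(v * v for v in x)) or 1.0  # avoid dividing by zero
    return [v / norm for v in x]

rng = random.Random(0)
sizes = [20, 16, 8, 4]  # stand-ins for max_features -> 512 -> 256 -> embedding_dim
layers = [init_layer(rng, n_in, n_out) for n_in, n_out in zip(sizes, sizes[1:])]
embedding = forward([1.0] * sizes[0], layers)
```

The output length equals the last entry in `sizes`, which corresponds to the configurable embedding dimension.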

Performance

Training

  • Time: <5 seconds (1000-word corpus)
  • Hardware: CPU-only
  • Memory: <500 MB
  • Model Size: ~8 MB

Inference

  • Query Latency: <100ms
  • Embedding Speed: ~1000 docs/second
  • Memory: <200 MB per model

Quality

  • Precision: Relevant results typically score 0.6-0.9 in similarity
  • Synonym Learning: Similarity of 0.8+ for domain synonyms
  • Semantic Understanding: Related concepts found

Requirements

  • Python 3.8+
  • 500 MB disk space
  • 1 GB RAM
  • CPU (no GPU required)

Dependencies

Core:

  • FastAPI (web server)
  • ONNX Runtime (model inference)
  • DuckDB (storage)
  • scikit-learn (training)

See pyproject.toml for the full list.

CLI Commands

# Start server
justembed begin --workspace ~/docs --port 5424

# Start with a custom host and port (0.0.0.0 listens on all network interfaces)
justembed begin --workspace ~/docs --host 0.0.0.0 --port 8000

# Show version
justembed --version

# Show help
justembed --help

Python API

from justembed.embedder import E5Embedder, CustomEmbedder
from justembed.training.trainer import CustomModelTrainer

# Train custom model
trainer = CustomModelTrainer()
model_dir = trainer.train(
    corpus=["text1", "text2", "text3"],
    model_name="my_model",
    embedding_dim=128,
    max_features=5000,
)

# Use custom embedder
embedder = CustomEmbedder("my_model")
embeddings = embedder.embed(["query text"])
query_emb = embedder.embed_query("search query")

# Use E5 embedder
e5 = E5Embedder()
embeddings = e5.embed(["text1", "text2"])

Configuration

Models are stored in ~/.cache/justembed/:

  • custom_models/ - Custom trained models
  • tokenizer.json - E5 tokenizer

Workspace structure:

workspace/
├── kb/
│   ├── kb1.duckdb
│   ├── kb2.duckdb
│   └── _history.duckdb
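
Given this layout, the knowledge bases in a workspace can be discovered from the file system alone. The helper below is a hypothetical sketch, not part of JustEmbed. It assumes each KB is one `.duckdb` file under `kb/` and that underscore-prefixed files such as `_history.duckdb` are internal.

```python
from pathlib import Path

def list_knowledge_bases(workspace):
    """Return KB names found under <workspace>/kb, skipping internal files."""
    kb_dir = Path(workspace).expanduser() / "kb"
    if not kb_dir.is_dir():
        return []
    return sorted(
        p.stem for p in kb_dir.glob("*.duckdb") if not p.name.startswith("_")
    )

# Example: inspect the workspace used in the Quick Start
kbs = list_knowledge_bases("~/my_docs")
```

The function returns an empty list when the workspace has no `kb/` directory yet, so it is safe to call before the first knowledge base is created.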

Roadmap

v0.1.x (Current)

  • ✅ E5-Small embeddings
  • ✅ Custom model training
  • ✅ Web UI
  • ✅ CLI
  • ✅ Knowledge bases
  • ✅ Semantic search

License

MIT License - see LICENSE file for details

Author

Krishnamoorthy Sankaran

Citation

If you use JustEmbed in your research, please cite:

@software{justembed2024,
  title = {JustEmbed: Offline-First Semantic Search for Everyday Laptops},
  author = {Sankaran, Krishnamoorthy},
  year = {2024},
  url = {https://github.com/sekarkrishna/justembed}
}

Acknowledgments

  • E5-Small model by Microsoft
  • ONNX Runtime by Microsoft
  • FastAPI by Sebastián Ramírez
  • DuckDB by DuckDB Labs

Changelog

v0.1.1a1 (2026-02-14)

New Features:

  • Custom model training from text files
  • Domain-specific synonym learning
  • Model selection in KB creation
  • Improved text chunking (sentence-based fallback)
  • Web UI for model training

Improvements:

  • Reduced minimum training corpus to 500 words
  • Better error messages
  • Model metadata display in UI
  • Query results show model used

Bug Fixes:

  • Fixed text chunking for continuous text
  • Fixed ONNX shape handling for custom models
  • Fixed model caching

v0.1.0 (2026-01-15)

  • Initial release
  • E5-Small embeddings
  • Basic web UI
  • Knowledge base management
  • Semantic search

JustEmbed - Semantic search that just works. Offline. On your laptop. In seconds.
