LENS - Language Embedder With No Synthesizer. Offline-first semantic search for everyday laptops.

JustEmbed - LENS

Language Embedder with No Synthesizer

Offline-first semantic search for everyday laptops. Train custom domain-specific models in seconds, no GPU required.

Philosophy

JustEmbed is built on three core principles:

  1. Offline-First: Everything runs locally. No API keys, no cloud dependencies, no internet required.
  2. Laptop-Friendly: Designed for everyday hardware. CPU-only, fast training (<5 seconds), small models (~8 MB).
  3. Domain-Specific: Train custom models on your text to learn domain-specific synonyms (pyrexia ↔ fever, renal ↔ kidney).

Why JustEmbed?

Many embedding solutions require one or more of:

  • GPU hardware
  • Cloud API keys and costs
  • Hours of training time
  • Large model files (GB)
  • Internet connectivity

JustEmbed requires:

  • ✅ Any laptop with Python
  • ✅ No API keys or costs
  • ✅ Seconds of training time
  • ✅ Small models (~8 MB)
  • ✅ Works completely offline

What's Working

✅ Core Features (v0.1.1a1)

  • E5-Small Embeddings: General-purpose 384-dim embeddings via ONNX
  • Custom Model Training: Train domain-specific models from your text
  • Knowledge Bases: Create multiple KBs with different models
  • Semantic Search: Query with natural language, get relevant results
  • Web UI: Browser-based interface for all operations
  • CLI: Command-line interface for automation
  • Offline Operation: No internet required after installation

✅ Custom Model Training

Train models that learn your domain's vocabulary:

# Medical domain example
# Training text contains: "pyrexia" and "fever"
# After training, model learns: pyrexia ↔ fever (similarity: 0.83)

# Legal domain example  
# Training text contains: "plaintiff" and "claimant"
# After training, model learns: plaintiff ↔ claimant (similarity: 0.85)

Training Performance:

  • Time: <5 seconds for 1000-word corpus
  • Hardware: CPU-only (no GPU needed)
  • Model size: ~8 MB
  • Embedding dim: 64-256 (configurable)

✅ Search Quality

  • Precision: High-quality results with scores 0.6-0.9
  • Recall: Finds synonyms and related concepts
  • Speed: <100ms query latency

Example query results:

Query: "fever"
Results:
  1. Score: 0.862 - "...fever in the context of infection..."
  2. Score: 0.862 - "...pyrexia, commonly referred to as fever..."
  3. Score: 0.836 - "Body temperature regulation..."
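
The ranking behind results like these can be sketched in a few lines. This is a minimal illustration, not JustEmbed's actual query code: it assumes the scores are cosine similarities between L2-normalized embeddings (the project mentions L2 normalization but does not spell out the scoring function). The helper names `l2_normalize` and `top_k` and the toy 3-dimensional vectors are made up for the example.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k(query: Sequence[float],
          chunks: Sequence[Tuple[str, Sequence[float]]],
          k: int = 3) -> List[Tuple[float, str]]:
    """Return the k chunks with the highest cosine similarity to the query."""
    q = l2_normalize(query)
    scored = []
    for text, emb in chunks:
        e = l2_normalize(emb)
        # For unit vectors the dot product equals the cosine similarity.
        score = sum(a * b for a, b in zip(q, e))
        scored.append((score, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy embeddings, invented for illustration only.
chunks = [
    ("fever in the context of infection", [0.9, 0.1, 0.0]),
    ("pyrexia, commonly referred to as fever", [0.8, 0.2, 0.1]),
    ("contract termination clauses", [0.0, 0.1, 0.9]),
]
for score, text in top_k([1.0, 0.0, 0.0], chunks, k=2):
    print(f"Score: {score:.3f} - {text}")
```

Normalizing both sides keeps every score in the range -1 to 1, which is what makes scores comparable across queries and knowledge bases.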

Quick Start

Installation

pip install justembed

Start the Server

justembed begin --workspace ~/my_docs --port 5424

Open your browser at http://localhost:5424

Train a Custom Model

  1. Click "🚀 Train Custom Model"
  2. Upload your domain-specific text file (.txt or .md)
  3. Enter model name (e.g., "medical_v1")
  4. Click "Train Model" (takes ~5 seconds)

Create a Knowledge Base

  1. Enter KB name (e.g., "medical_kb")
  2. Select model type: "Custom Model"
  3. Select your trained model
  4. Click "Create KB"

Upload Documents

  1. Choose your document file
  2. Select the KB
  3. Click "Upload & Preview Chunks"
  4. Review chunks and click "Apply Chunking"
  5. Wait for embedding to complete
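
To give a feel for what the chunk preview in the steps above might contain, here is a hedged sketch of sentence-based chunking. It is not JustEmbed's chunker: the function name `chunk_by_sentences`, the `max_words` parameter, and the greedy packing strategy are all assumptions chosen for illustration.

```python
import re
from typing import List


def chunk_by_sentences(text: str, max_words: int = 50) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the limit.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Fever is common. Pyrexia is the clinical term. Renal means kidney."
for i, chunk in enumerate(chunk_by_sentences(text, max_words=8), start=1):
    print(i, chunk)
```

Keeping sentences whole means each chunk stays readable in the preview, at the cost of uneven chunk lengths.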

Query

  1. Enter search query (e.g., "fever", "pyrexia")
  2. Select KB or "All KBs"
  3. Click "Search"
  4. View results with relevance scores

Use Cases

Medical Documentation

Train on medical texts to learn:

  • pyrexia ↔ fever
  • renal ↔ kidney
  • UTI ↔ urinary tract infection
  • hypertension ↔ high blood pressure

Legal Documents

Train on legal texts to learn:

  • plaintiff ↔ claimant
  • defendant ↔ respondent
  • tort ↔ civil wrong
  • litigation ↔ lawsuit

Technical Documentation

Train on technical texts to learn:

  • API ↔ application programming interface
  • REST ↔ representational state transfer
  • CRUD ↔ create read update delete
  • microservices ↔ service-oriented architecture

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        Web UI / CLI                          │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                      FastAPI Server                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │   Training   │  │   Embedding  │  │    Query     │      │
│  │   Pipeline   │  │   Pipeline   │  │   Pipeline   │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                    Embedder Layer                            │
│  ┌──────────────┐              ┌──────────────┐            │
│  │  E5-Small    │              │   Custom     │            │
│  │  (ONNX)      │              │   Models     │            │
│  │  384-dim     │              │   (ONNX)     │            │
│  └──────────────┘              └──────────────┘            │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                    Storage Layer                             │
│  ┌──────────────┐              ┌──────────────┐            │
│  │   DuckDB     │              │   File       │            │
│  │   (KBs)      │              │   System     │            │
│  └──────────────┘              └──────────────┘            │
└─────────────────────────────────────────────────────────────┘

Custom Model Training

How It Works

  1. TF-IDF Vectorization: Extract features from your text
  2. MLP Training: Neural network learns to compress features
  3. ONNX Export: Portable model format for fast inference
  4. L2 Normalization: Consistent similarity scores
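
Steps 1 and 4 can be illustrated without any dependencies. The sketch below is a pure-Python stand-in, not the project's training code: it assumes a simple lowercase word tokenizer and uses the smoothed IDF formula that scikit-learn's `TfidfVectorizer` applies by default, ln((1 + n) / (1 + df)) + 1. Whether JustEmbed uses that exact vectorizer or those settings is not stated here. The names `tokenize` and `tfidf_vectors` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: List[str]) -> Tuple[List[str], List[List[float]]]:
    """Return the vocabulary and one L2-normalized TF-IDF row per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(x * x for x in row))
        vectors.append([x / norm for x in row] if norm > 0 else row)
    return vocab, vectors


docs = [
    "pyrexia, commonly referred to as fever",
    "fever in the context of infection",
]
vocab, vectors = tfidf_vectors(docs)
for tok, weight in zip(vocab, vectors[0]):
    if weight > 0:
        print(f"{tok}: {weight:.3f}")
```

In this toy corpus "fever" appears in both documents, so it receives a lower weight than "pyrexia", which appears in only one. The MLP in step 2 would then be trained on rows like these.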

Training Pipeline

Text Corpus → Chunking → TF-IDF (5000 features)
                              ↓
                    MLP (512 → 256 → 128)
                              ↓
                    ONNX Export (~8 MB)
                              ↓
                    Custom Embedder

Model Configuration

  • Embedding Dimension: 64-256 (default: 128)
  • Max Features: 1000-10000 (default: 5000)
  • Hidden Layers: 512 → 256
  • Activation: ReLU
  • Optimizer: Adam
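
As a back-of-envelope check on model size, the parameter count for the default configuration can be worked out directly. This sketch assumes plain fully connected layers with biases, 4-byte float32 weights, and the full 5000 input features; none of these details are confirmed by the project, and `dense_params` is a hypothetical helper.

```python
def dense_params(layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Defaults from the list above: 5000 TF-IDF features,
# hidden layers 512 and 256, embedding dimension 128.
layers = [5000, 512, 256, 128]
params = dense_params(layers)
size_mb = params * 4 / 1e6  # assumes 4-byte float32 weights
print(f"{params:,} parameters, about {size_mb:.1f} MB as float32")
# prints: 2,724,736 parameters, about 10.9 MB as float32
```

The first layer accounts for most of the parameters, so under these assumptions the exported size depends mainly on how many TF-IDF features the corpus actually produces. If max_features acts as an upper bound, a small corpus would yield fewer input features and a correspondingly smaller model.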

Performance

Training

  • Time: <5 seconds (1000-word corpus)
  • Hardware: CPU-only
  • Memory: <500 MB
  • Model Size: ~8 MB

Inference

  • Query Latency: <100ms
  • Embedding Speed: ~1000 docs/second
  • Memory: <200 MB per model

Quality

  • Relevance Scores: 0.6-0.9 similarity for relevant results
  • Synonym Learning: 0.8+ for domain terms
  • Semantic Understanding: Related concepts found

Requirements

  • Python 3.8+
  • 500 MB disk space
  • 1 GB RAM
  • CPU (no GPU required)

Dependencies

Core:

  • FastAPI (web server)
  • ONNX Runtime (model inference)
  • DuckDB (storage)
  • scikit-learn (training)

Full list in pyproject.toml

CLI Commands

# Start server
justembed begin --workspace ~/docs --port 5424

# Start with a custom host (0.0.0.0 listens on all network interfaces)
justembed begin --workspace ~/docs --host 0.0.0.0 --port 8000

# Show version
justembed --version

# Show help
justembed --help

Python API

from justembed.embedder import E5Embedder, CustomEmbedder
from justembed.training.trainer import CustomModelTrainer

# Train custom model
trainer = CustomModelTrainer()
model_dir = trainer.train(
    corpus=["text1", "text2", "text3"],
    model_name="my_model",
    embedding_dim=128,
    max_features=5000,
)

# Use custom embedder
embedder = CustomEmbedder("my_model")
embeddings = embedder.embed(["query text"])
query_emb = embedder.embed_query("search query")

# Use E5 embedder
e5 = E5Embedder()
embeddings = e5.embed(["text1", "text2"])

Configuration

Models are stored in ~/.cache/justembed/:

  • custom_models/ - Custom trained models
  • tokenizer.json - E5 tokenizer

Workspace structure:

workspace/
├── kb/
│   ├── kb1.duckdb
│   ├── kb2.duckdb
│   └── _history.duckdb

Roadmap

v0.1.x (Current)

  • ✅ E5-Small embeddings
  • ✅ Custom model training
  • ✅ Web UI
  • ✅ CLI
  • ✅ Knowledge bases
  • ✅ Semantic search

License

MIT License - see LICENSE file for details

Author

Krishnamoorthy Sankaran

Citation

If you use JustEmbed in your research, please cite:

@software{justembed2024,
  title = {JustEmbed: Offline-First Semantic Search for Everyday Laptops},
  author = {Sankaran, Krishnamoorthy},
  year = {2024},
  url = {https://github.com/sekarkrishna/justembed}
}

Acknowledgments

  • E5-Small model by Microsoft
  • ONNX Runtime by Microsoft
  • FastAPI by Sebastián Ramírez
  • DuckDB by DuckDB Labs

Changelog

v0.1.1a1 (2026-02-14)

New Features:

  • Custom model training from text files
  • Domain-specific synonym learning
  • Model selection in KB creation
  • Improved text chunking (sentence-based fallback)
  • Web UI for model training

Improvements:

  • Reduced minimum training corpus to 500 words
  • Better error messages
  • Model metadata display in UI
  • Query results show model used

Bug Fixes:

  • Fixed text chunking for continuous text
  • Fixed ONNX shape handling for custom models
  • Fixed model caching

v0.1.0 (2026-01-15)

  • Initial release
  • E5-Small embeddings
  • Basic web UI
  • Knowledge base management
  • Semantic search

JustEmbed - Semantic search that just works. Offline. On your laptop. In seconds.
