LENS - Language Embedder With No Synthesizer. Offline-first semantic search for everyday laptops.

JustEmbed - LENS

Language Embedder with No Synthesizer

Offline-first semantic search for everyday laptops. Train custom domain-specific models in seconds, no GPU required.

Philosophy

JustEmbed is built on three core principles:

Offline-First: Everything runs locally. No API keys, no cloud dependencies, no internet required.
Laptop-Friendly: Designed for everyday hardware. CPU-only, fast training (<5 seconds), small models (~8 MB).
Domain-Specific: Train custom models on your text to learn domain-specific synonyms (pyrexia ↔ fever, renal ↔ kidney).

Why JustEmbed?

Most embedding solutions require:

GPU hardware
Cloud API keys and costs
Hours of training time
Large model files (GB)
Internet connectivity

JustEmbed requires:

✅ Any laptop with Python
✅ No API keys or costs
✅ Seconds of training time
✅ Small models (~8 MB)
✅ Works completely offline

What's Working

✅ Core Features (v0.1.1a1)

E5-Small Embeddings: General-purpose 384-dim embeddings via ONNX
Custom Model Training: Train domain-specific models from your text
Knowledge Bases: Create multiple KBs with different models
Semantic Search: Query with natural language, get relevant results
Web UI: Browser-based interface for all operations
CLI: Command-line interface for automation
Offline Operation: No internet required after installation

✅ Custom Model Training

Train models that learn your domain's vocabulary:

# Medical domain example
# Training text contains: "pyrexia" and "fever"
# After training, model learns: pyrexia ↔ fever (similarity: 0.83)

# Legal domain example  
# Training text contains: "plaintiff" and "claimant"
# After training, model learns: plaintiff ↔ claimant (similarity: 0.85)

Training Performance:

Time: <5 seconds for 1000-word corpus
Hardware: CPU-only (no GPU needed)
Model size: ~8 MB
Embedding dim: 64-256 (configurable)

✅ Search Quality

Precision: High-quality results, with scores of 0.6-0.9
Recall: Finds synonyms and related concepts
Speed: <100ms query latency

Example query results:

Query: "fever"
Results:
  1. Score: 0.862 - "...fever in the context of infection..."
  2. Score: 0.862 - "...pyrexia, commonly referred to as fever..."
  3. Score: 0.836 - "Body temperature regulation..."
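Scores of this kind are what a cosine-similarity ranking produces. The following is a minimal sketch, assuming embeddings are L2-normalized (as listed under How It Works) so that a dot product equals cosine similarity. The `top_k` helper, the toy 3-dimensional vectors, and the chunk texts are invented for illustration and are not JustEmbed's internals.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k(query: Sequence[float],
          chunks: Sequence[Tuple[str, Sequence[float]]],
          k: int = 3) -> List[Tuple[float, str]]:
    """Rank (text, embedding) pairs by cosine similarity to the query."""
    q = l2_normalize(query)
    scored = []
    for text, emb in chunks:
        e = l2_normalize(emb)
        # For unit vectors the dot product is the cosine similarity.
        scored.append((sum(a * b for a, b in zip(q, e)), text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy embeddings, made up for this example.
chunks = [
    ("fever in the context of infection", [0.9, 0.1, 0.0]),
    ("contract law basics", [0.0, 0.2, 0.9]),
    ("body temperature regulation", [0.7, 0.5, 0.1]),
]
results = top_k([1.0, 0.0, 0.0], chunks, k=2)
for score, text in results:
    print(f"Score: {score:.3f} - {text}")
```

Normalizing once at indexing time, rather than per query as done here for brevity, would keep each lookup to a single dot product per chunk.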

Quick Start

Installation

pip install justembed

Start the Server

justembed begin --workspace ~/my_docs --port 5424

Open your browser at http://localhost:5424

Train a Custom Model

Click "🚀 Train Custom Model"
Upload your domain-specific text file (.txt or .md)
Enter model name (e.g., "medical_v1")
Click "Train Model" (takes ~5 seconds)

Create a Knowledge Base

Enter KB name (e.g., "medical_kb")
Select model type: "Custom Model"
Select your trained model
Click "Create KB"

Upload Documents

Choose your document file
Select the KB
Click "Upload & Preview Chunks"
Review chunks and click "Apply Chunking"
Wait for embedding to complete

Query

Enter search query (e.g., "fever", "pyrexia")
Select KB or "All KBs"
Click "Search"
View results with relevance scores

Use Cases

Medical Documentation

Train on medical texts to learn:

pyrexia ↔ fever
renal ↔ kidney
UTI ↔ urinary tract infection
hypertension ↔ high blood pressure

Legal Documents

Train on legal texts to learn:

plaintiff ↔ claimant
defendant ↔ respondent
tort ↔ civil wrong
litigation ↔ lawsuit

Technical Documentation

Train on technical texts to learn:

API ↔ application programming interface
REST ↔ representational state transfer
CRUD ↔ create read update delete
microservices ↔ service-oriented architecture

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        Web UI / CLI                          │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                      FastAPI Server                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │   Training   │  │   Embedding  │  │    Query     │      │
│  │   Pipeline   │  │   Pipeline   │  │   Pipeline   │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                    Embedder Layer                            │
│  ┌──────────────┐              ┌──────────────┐            │
│  │  E5-Small    │              │   Custom     │            │
│  │  (ONNX)      │              │   Models     │            │
│  │  384-dim     │              │   (ONNX)     │            │
│  └──────────────┘              └──────────────┘            │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                    Storage Layer                             │
│  ┌──────────────┐              ┌──────────────┐            │
│  │   DuckDB     │              │   File       │            │
│  │   (KBs)      │              │   System     │            │
│  └──────────────┘              └──────────────┘            │
└─────────────────────────────────────────────────────────────┘

Custom Model Training

How It Works

TF-IDF Vectorization: Extract features from your text
MLP Training: Neural network learns to compress features
ONNX Export: Portable model format for fast inference
L2 Normalization: Consistent similarity scores
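To make the first step concrete, here is a hand-rolled TF-IDF sketch in plain Python. It is an illustration only: the project lists scikit-learn as its training dependency, and the whitespace tokenizer, the smoothed IDF formula, and the `tfidf` function name are assumptions of this example rather than a description of JustEmbed's code.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[str], max_features: int = 5000) -> List[Dict[str, float]]:
    """Return one {term: weight} mapping per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Keep only the terms found in the most documents, capped at max_features.
    vocab = {term for term, _ in df.most_common(max_features)}
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(t for t in tokens if t in vocab)
        vectors.append({
            # Smoothed IDF: rare terms get larger weights than common ones.
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


docs = [
    "the patient has pyrexia",
    "the patient has fever",
    "the kidney filters blood",
]
vectors = tfidf(docs)
print(vectors[0])
```

At this stage "pyrexia" and "fever" are still unrelated dimensions; relating them is what the subsequent MLP step is described as doing.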

Training Pipeline

Text Corpus → Chunking → TF-IDF (5000 features)
                              ↓
                    MLP (512 → 256 → 128)
                              ↓
                    ONNX Export (~8 MB)
                              ↓
                    Custom Embedder
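The Chunking stage is not specified in detail here; the changelog only mentions a sentence-based fallback. A minimal sketch of one way that could work, assuming paragraphs are used as chunks and long paragraphs are split on sentence boundaries. The `chunk_text` name and the `max_words` parameter are hypothetical.

```python
import re
from typing import List


def chunk_text(text: str, max_words: int = 50) -> List[str]:
    """Split text into paragraph chunks, falling back to sentences when long."""
    chunks: List[str] = []
    for para in re.split(r"\n\s*\n", text):
        para = " ".join(para.split())  # collapse internal whitespace
        if not para:
            continue
        if len(para.split()) <= max_words:
            chunks.append(para)
            continue
        # Fallback: pack whole sentences together until max_words is reached.
        # A single sentence longer than max_words is kept whole.
        current: List[str] = []
        for sentence in re.split(r"(?<=[.!?])\s+", para):
            words = sentence.split()
            if current and len(current) + len(words) > max_words:
                chunks.append(" ".join(current))
                current = []
            current.extend(words)
        if current:
            chunks.append(" ".join(current))
    return chunks


sample = "Fever is common. Pyrexia is the clinical term.\n\nThe kidney filters blood."
chunks = chunk_text(sample, max_words=5)
print(chunks)
```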

Model Configuration

Embedding Dimension: 64-256 (default: 128)
Max Features: 1000-10000 (default: 5000)
Hidden Layers: 512 → 256
Activation: ReLU
Optimizer: Adam
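The configuration above describes a small feed-forward network. The sketch below shows only the forward pass at a scaled-down size, assuming fully connected layers, ReLU between hidden layers, no activation on the output layer, and L2 normalization at the end. The weights are random and untrained, the helper names are hypothetical, and the training objective is not shown because this README does not describe it.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]
Layer = Tuple[Matrix, List[float]]


def dense(x: List[float], weights: Matrix, bias: List[float]) -> List[float]:
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x: List[float]) -> List[float]:
    return [max(0.0, v) for v in x]


def embed(x: List[float], layers: List[Layer]) -> List[float]:
    """Run x through the layers, ReLU between them, then L2-normalize."""
    for i, (weights, bias) in enumerate(layers):
        x = dense(x, weights, bias)
        if i < len(layers) - 1:  # no activation on the output layer (assumed)
            x = relu(x)
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm > 0 else x


def random_layers(sizes: List[int], seed: int = 0) -> List[Layer]:
    """Untrained placeholder weights, only to exercise the shapes."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


# Scaled-down stand-in for 5000 -> 512 -> 256 -> 128.
sizes = [50, 16, 8, 4]
layers = random_layers(sizes)
features = [1.0 if i % 7 == 0 else 0.0 for i in range(sizes[0])]
embedding = embed(features, layers)
print(len(embedding))
```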

Performance

Training

Time: <5 seconds (1000-word corpus)
Hardware: CPU-only
Memory: <500 MB
Model Size: ~8 MB

Inference

Query Latency: <100ms
Embedding Speed: ~1000 docs/second
Memory: <200 MB per model

Quality

Precision: Similarity scores of 0.6-0.9 for relevant results
Synonym Learning: 0.8+ for domain terms
Semantic Understanding: Related concepts found

Requirements

Python 3.8+
500 MB disk space
1 GB RAM
CPU (no GPU required)

Dependencies

Core:

FastAPI (web server)
ONNX Runtime (model inference)
DuckDB (storage)
scikit-learn (training)

Full list in pyproject.toml

CLI Commands

# Start server
justembed begin --workspace ~/docs --port 5424

# Start with custom host
justembed begin --workspace ~/docs --host 0.0.0.0 --port 8000

# Show version
justembed --version

# Show help
justembed --help
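For automation, the same commands can be assembled from a script. A small sketch that uses only the `begin` subcommand and the `--workspace`, `--port`, and `--host` flags shown above; the `begin_command` helper is hypothetical, and actually launching the server requires justembed to be installed.

```python
import os
import shlex
from typing import List, Optional


def begin_command(workspace: str, port: int,
                  host: Optional[str] = None) -> List[str]:
    """Build the argument list for `justembed begin`."""
    cmd = ["justembed", "begin",
           "--workspace", os.path.expanduser(workspace),
           "--port", str(port)]
    if host is not None:
        cmd += ["--host", host]
    return cmd


cmd = begin_command("/tmp/docs", port=8000, host="0.0.0.0")
print(shlex.join(cmd))
# To launch the server from a script (not run here):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the command as a list instead of a single string avoids shell quoting problems when the workspace path contains spaces.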

Python API

from justembed.embedder import E5Embedder, CustomEmbedder
from justembed.training.trainer import CustomModelTrainer

# Train custom model
trainer = CustomModelTrainer()
model_dir = trainer.train(
    corpus=["text1", "text2", "text3"],
    model_name="my_model",
    embedding_dim=128,
    max_features=5000,
)

# Use custom embedder
embedder = CustomEmbedder("my_model")
embeddings = embedder.embed(["query text"])
query_emb = embedder.embed_query("search query")

# Use E5 embedder
e5 = E5Embedder()
embeddings = e5.embed(["text1", "text2"])

Configuration

Models stored in: ~/.cache/justembed/

custom_models/ - Custom trained models
tokenizer.json - E5 tokenizer

Workspace structure:

workspace/
├── kb/
│   ├── kb1.duckdb
│   ├── kb2.duckdb
│   └── _history.duckdb
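Given that layout, the knowledge bases in a workspace can be discovered from the file system alone. A minimal sketch, assuming each KB is one `.duckdb` file under `kb/` and that underscore-prefixed files such as `_history.duckdb` are internal; that second assumption and the `list_kbs` helper are this example's, not documented behavior.

```python
import tempfile
from pathlib import Path
from typing import List


def list_kbs(workspace: Path) -> List[str]:
    """Return KB names found under <workspace>/kb, skipping _-prefixed files."""
    kb_dir = workspace / "kb"
    if not kb_dir.is_dir():
        return []
    return sorted(p.stem for p in kb_dir.glob("*.duckdb")
                  if not p.name.startswith("_"))


# Demo against a throwaway directory that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "kb").mkdir()
    for name in ("kb1.duckdb", "kb2.duckdb", "_history.duckdb"):
        (root / "kb" / name).touch()
    found = list_kbs(root)
print(found)
```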

Roadmap

v0.1.x (Current)

✅ E5-Small embeddings
✅ Custom model training
✅ Web UI
✅ CLI
✅ Knowledge bases
✅ Semantic search

License

MIT License - see LICENSE file for details

Author

Krishnamoorthy Sankaran

Email: krishnamoorthy.sankaran@sekrad.org
GitHub: https://github.com/sekarkrishna/justembed

Citation

If you use JustEmbed in your research, please cite:

@software{justembed2024,
  title = {JustEmbed: Offline-First Semantic Search for Everyday Laptops},
  author = {Sankaran, Krishnamoorthy},
  year = {2024},
  url = {https://github.com/sekarkrishna/justembed}
}

Acknowledgments

E5-Small model by Microsoft
ONNX Runtime by Microsoft
FastAPI by Sebastián Ramírez
DuckDB by DuckDB Labs

Support

Issues: https://github.com/sekarkrishna/justembed/issues
Discussions: https://github.com/sekarkrishna/justembed/discussions
Email: krishnamoorthy.sankaran@sekrad.org

Changelog

v0.1.1a1 (2026-02-14)

New Features:

Custom model training from text files
Domain-specific synonym learning
Model selection in KB creation
Improved text chunking (sentence-based fallback)
Web UI for model training

Improvements:

Reduced minimum training corpus to 500 words
Better error messages
Model metadata display in UI
Query results show model used

Bug Fixes:

Fixed text chunking for continuous text
Fixed ONNX shape handling for custom models
Fixed model caching

v0.1.0 (2026-01-15)

Initial release
E5-Small embeddings
Basic web UI
Knowledge base management
Semantic search

JustEmbed - Semantic search that just works. Offline. On your laptop. In seconds.

Release history

0.1.1a9 pre-release

Feb 23, 2026

0.1.1a8 pre-release

Feb 23, 2026

0.1.1a7 pre-release

Feb 16, 2026

0.1.1a6 pre-release

Feb 16, 2026

0.1.1a5 pre-release

Feb 15, 2026

0.1.1a4 pre-release

Feb 15, 2026

0.1.1a3 pre-release

Feb 15, 2026

0.1.1a2 pre-release

Feb 15, 2026

This version

0.1.1a1 pre-release

Feb 14, 2026

0.1.0a6 pre-release

Jan 30, 2026

0.1.0a5 pre-release

Jan 29, 2026

0.1.0a3 pre-release

Jan 28, 2026

0.1.0a2 pre-release

Jan 28, 2026

0.1.0a1 pre-release

Jan 28, 2026

Download files

Source Distribution

justembed-0.1.1a1.tar.gz (22.2 MB)

Uploaded Feb 14, 2026 Source

Built Distribution

justembed-0.1.1a1-py3-none-any.whl (22.2 MB)

Uploaded Feb 14, 2026 Python 3

File details

Details for the file justembed-0.1.1a1.tar.gz.

File metadata

Download URL: justembed-0.1.1a1.tar.gz
Upload date: Feb 14, 2026
Size: 22.2 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for justembed-0.1.1a1.tar.gz
Algorithm	Hash digest
SHA256	`6778dd7f911ff385198d648b58468842537be0ba1414e754e7a13a75608edfec`
MD5	`828359ccb527df605784cad454668c61`
BLAKE2b-256	`f1967bdaadd556e26c09f8f36d28434b2e2a00116e94ddfb7a39757b039faed1`
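A downloaded archive can be checked against the SHA256 value listed here with the standard library. The sketch below is generic: `sha256_of` and `matches` are hypothetical helper names, and the demo hashes a temporary file rather than the real archive.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex string, case-insensitively."""
    return sha256_of(path) == expected_hex.strip().lower()


# Self-check on a temporary file. For a real download, pass the path to the
# archive and the SHA256 value from the table above.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"abc")
    ok = matches(sample, hashlib.sha256(b"abc").hexdigest())
print(ok)
```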

File details

Details for the file justembed-0.1.1a1-py3-none-any.whl.

File metadata

Download URL: justembed-0.1.1a1-py3-none-any.whl
Upload date: Feb 14, 2026
Size: 22.2 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for justembed-0.1.1a1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`60a976e874cf861341fe476a676ba7407c9d111d3c28bd5aa961b9939dbcb925`
MD5	`db91d459a98fb337c4a2a09cbde30d02`
BLAKE2b-256	`feb747d37edae9de10eaedf4f362ad9f3933bec4d29bd5100b8293b9c073545a`
