LENS - Language Embedder With No Synthesizer. Offline-first semantic search for everyday laptops.
JustEmbed - LENS
Language Embedder with No Synthesizer
Offline-first semantic search for everyday laptops. Train custom domain-specific models in seconds, no GPU required.
Philosophy
JustEmbed is built on three core principles:
- Offline-First: Everything runs locally. No API keys, no cloud dependencies, no internet required.
- Laptop-Friendly: Designed for everyday hardware. CPU-only, fast training (<5 seconds), small models (~8 MB).
- Domain-Specific: Train custom models on your text to learn domain-specific synonyms (pyrexia ↔ fever, renal ↔ kidney).
Why JustEmbed?
Most embedding solutions require:
- GPU hardware
- Cloud API keys and costs
- Hours of training time
- Large model files (GB)
- Internet connectivity
JustEmbed requires:
- ✅ Any laptop with Python 3.8+
- ✅ No API keys or costs
- ✅ Seconds of training time
- ✅ Small models (~8 MB)
- ✅ Works completely offline
What's Working
✅ Core Features (v0.1.1a1)
- E5-Small Embeddings: General-purpose 384-dim embeddings via ONNX
- Custom Model Training: Train domain-specific models from your text
- Knowledge Bases: Create multiple KBs with different models
- Semantic Search: Query with natural language, get relevant results
- Web UI: Browser-based interface for all operations
- CLI: Command-line interface for automation
- Offline Operation: No internet required after installation
✅ Custom Model Training
Train models that learn your domain's vocabulary:
# Medical domain example
# Training text contains: "pyrexia" and "fever"
# After training, model learns: pyrexia ↔ fever (similarity: 0.83)
# Legal domain example
# Training text contains: "plaintiff" and "claimant"
# After training, model learns: plaintiff ↔ claimant (similarity: 0.85)
Training Performance:
- Time: <5 seconds for 1000-word corpus
- Hardware: CPU-only (no GPU needed)
- Model size: ~8 MB
- Embedding dim: 64-256 (configurable)
✅ Search Quality
- Precision: High-quality results with scores 0.6-0.9
- Recall: Finds synonyms and related concepts
- Speed: <100ms query latency
Example query results:
Query: "fever"
Results:
1. Score: 0.862 - "...fever in the context of infection..."
2. Score: 0.862 - "...pyrexia, commonly referred to as fever..."
3. Score: 0.836 - "Body temperature regulation..."
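The ranked list above can be pictured as a top-k selection over similarity scores. The following is a minimal sketch, not JustEmbed's actual query code: the function name `top_k`, the toy 2-dimensional vectors, and the chunk texts are all hypothetical, and it assumes the embeddings are already L2-normalized so that a dot product acts as the similarity score.

```python
import heapq

def top_k(query_vec, chunks, k=3):
    """Return the k best (score, text) pairs, highest score first.

    `chunks` is a list of (text, unit_vector) pairs. Vectors are assumed
    to be L2-normalized, so the dot product equals cosine similarity.
    """
    scored = (
        (sum(q * c for q, c in zip(query_vec, vec)), text)
        for text, vec in chunks
    )
    return heapq.nlargest(k, scored)

# Toy 2-dimensional unit vectors standing in for real embeddings (hypothetical data).
chunks = [
    ("fever note", [1.0, 0.0]),
    ("pyrexia note", [0.8, 0.6]),
    ("billing note", [0.0, 1.0]),
]
results = top_k([1.0, 0.0], chunks, k=2)
for rank, (score, text) in enumerate(results, start=1):
    print(f"{rank}. Score: {score:.3f} - {text}")
```

Using `heapq.nlargest` avoids sorting every chunk when only a few results are needed.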
Quick Start
Installation
pip install justembed
Start the Server
justembed begin --workspace ~/my_docs --port 5424
Then open http://localhost:5424 in your browser.
Train a Custom Model
- Click "🚀 Train Custom Model"
- Upload your domain-specific text file (.txt or .md)
- Enter model name (e.g., "medical_v1")
- Click "Train Model" (takes ~5 seconds)
Create a Knowledge Base
- Enter KB name (e.g., "medical_kb")
- Select model type: "Custom Model"
- Select your trained model
- Click "Create KB"
Upload Documents
- Choose your document file
- Select the KB
- Click "Upload & Preview Chunks"
- Review chunks and click "Apply Chunking"
- Wait for embedding to complete
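To give a feel for what the chunk preview step might produce, here is a rough sketch of paragraph chunking with a sentence-based fallback for long, continuous text. It is an illustration only: the function name `chunk_text`, the `max_words` parameter, and the splitting rules are assumptions, not JustEmbed's actual chunker.

```python
import re

def chunk_text(text, max_words=120):
    """Split text into chunks of at most roughly `max_words` words.

    Paragraphs (separated by blank lines) are kept whole when short enough.
    Longer paragraphs fall back to packing whole sentences into chunks.
    A single sentence longer than `max_words` is kept as its own chunk.
    """
    chunks = []
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if len(para.split()) <= max_words:
            chunks.append(para)
            continue
        # Fallback: split after sentence-ending punctuation and pack sentences.
        current, count = [], 0
        for sentence in re.split(r"(?<=[.!?])\s+", para):
            n_words = len(sentence.split())
            if current and count + n_words > max_words:
                chunks.append(" ".join(current))
                current, count = [], 0
            current.append(sentence)
            count += n_words
        if current:
            chunks.append(" ".join(current))
    return chunks

sample = "Short paragraph.\n\n" + " ".join(["One two three four five."] * 10)
sample_chunks = chunk_text(sample, max_words=12)
```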
Query
- Enter search query (e.g., "fever", "pyrexia")
- Select KB or "All KBs"
- Click "Search"
- View results with relevance scores
Use Cases
Medical Documentation
Train on medical texts to learn:
- pyrexia ↔ fever
- renal ↔ kidney
- UTI ↔ urinary tract infection
- hypertension ↔ high blood pressure
Legal Documents
Train on legal texts to learn:
- plaintiff ↔ claimant
- defendant ↔ respondent
- tort ↔ civil wrong
- litigation ↔ lawsuit
Technical Documentation
Train on technical texts to learn:
- API ↔ application programming interface
- REST ↔ representational state transfer
- CRUD ↔ create read update delete
- microservices ↔ service-oriented architecture
Architecture
┌─────────────────────────────────────────────────────────────┐
│ Web UI / CLI │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────┐
│ FastAPI Server │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Training │ │ Embedding │ │ Query │ │
│ │ Pipeline │ │ Pipeline │ │ Pipeline │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────┐
│ Embedder Layer │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ E5-Small │ │ Custom │ │
│ │ (ONNX) │ │ Models │ │
│ │ 384-dim │ │ (ONNX) │ │
│ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────┐
│ Storage Layer │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ DuckDB │ │ File │ │
│ │ (KBs) │ │ System │ │
│ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
Custom Model Training
How It Works
- TF-IDF Vectorization: Extract features from your text
- MLP Training: Neural network learns to compress features
- ONNX Export: Portable model format for fast inference
- L2 Normalization: Consistent similarity scores
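The last step is worth a short illustration: once vectors are scaled to unit length, a plain dot product gives the same value as cosine similarity, which keeps scores comparable across documents of different lengths. The sketch below uses only the standard library; the helper names are hypothetical and are not part of the JustEmbed API.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (0.0 if either is a zero vector)."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)

a = [3.0, 4.0]
b = [4.0, 3.0]

# After normalization, the dot product and the cosine similarity agree.
unit_a, unit_b = l2_normalize(a), l2_normalize(b)
normalized_dot = dot(unit_a, unit_b)
cosine = cosine_similarity(a, b)
```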
Training Pipeline
Text Corpus → Chunking → TF-IDF (5000 features)
↓
MLP (512 → 256 → 128)
↓
ONNX Export (~8 MB)
↓
Custom Embedder
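The first stage of the pipeline above, TF-IDF vectorization, can be sketched in plain Python. This is a simplified illustration under stated assumptions: it tokenizes on whitespace, uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, and L2-normalizes each row. The function name `tfidf_vectors` is hypothetical, and JustEmbed's own training code (which lists scikit-learn as a dependency) may differ in tokenization and weighting.

```python
import math
from collections import Counter

def tfidf_vectors(chunks, max_features=5000):
    """Return (vocabulary, rows) where each row is an L2-normalized TF-IDF vector."""
    tokenized = [chunk.lower().split() for chunk in chunks]

    # Document frequency: in how many chunks each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Keep the most frequent terms overall, capped at max_features.
    totals = Counter(token for tokens in tokenized for token in tokens)
    vocab = sorted(term for term, _ in totals.most_common(max_features))
    index = {term: i for i, term in enumerate(vocab)}

    n_docs = len(chunks)
    rows = []
    for tokens in tokenized:
        row = [0.0] * len(vocab)
        for term, count in Counter(tokens).items():
            if term in index:
                idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
                row[index[term]] = count * idf
        norm = math.sqrt(sum(v * v for v in row))
        if norm > 0.0:
            row = [v / norm for v in row]
        rows.append(row)
    return vocab, rows

vocab, rows = tfidf_vectors(["fever and chills", "pyrexia and chills", "renal function"])
```

In this toy corpus, "fever" appears in one chunk and "and" in two, so "fever" gets the larger weight in the first row. Rare, domain-specific terms carry more signal than common words.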
Model Configuration
- Embedding Dimension: 64-256 (default: 128)
- Max Features: 1000-10000 (default: 5000)
- Hidden Layers: 512 → 256 (followed by the embedding output layer, 128-dim by default)
- Activation: ReLU
- Optimizer: Adam
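One way to keep these settings within their documented ranges is a small validated configuration object. The sketch below is hypothetical: the class name `TrainingConfig` and its fields are illustrative and do not describe JustEmbed's internal configuration. Only the ranges and defaults come from the list above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical container for the settings listed above."""
    embedding_dim: int = 128          # documented range: 64-256
    max_features: int = 5000          # documented range: 1000-10000
    hidden_layers: tuple = (512, 256)
    activation: str = "relu"
    optimizer: str = "adam"

    def __post_init__(self):
        if not 64 <= self.embedding_dim <= 256:
            raise ValueError("embedding_dim must be between 64 and 256")
        if not 1000 <= self.max_features <= 10000:
            raise ValueError("max_features must be between 1000 and 10000")

    def layer_sizes(self):
        """Hidden layer widths followed by the embedding output width."""
        return (*self.hidden_layers, self.embedding_dim)

default_config = TrainingConfig()
print(default_config.layer_sizes())
```

With the defaults, `layer_sizes()` matches the 512 → 256 → 128 shape shown in the training pipeline.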
Performance
Training
- Time: <5 seconds (1000-word corpus)
- Hardware: CPU-only
- Memory: <500 MB
- Model Size: ~8 MB
Inference
- Query Latency: <100ms
- Embedding Speed: ~1000 docs/second
- Memory: <200 MB per model
Quality
- Precision: High-quality results score 0.6-0.9 in similarity
- Synonym Learning: 0.8+ for domain terms
- Semantic Understanding: Related concepts found
Requirements
- Python 3.8+
- 500 MB disk space
- 1 GB RAM
- CPU (no GPU required)
Dependencies
Core:
- FastAPI (web server)
- ONNX Runtime (model inference)
- DuckDB (storage)
- scikit-learn (training)
See pyproject.toml for the full list.
CLI Commands
# Start server
justembed begin --workspace ~/docs --port 5424
# Start with custom host (0.0.0.0 listens on all network interfaces)
justembed begin --workspace ~/docs --host 0.0.0.0 --port 8000
# Show version
justembed --version
# Show help
justembed --help
Python API
from justembed.embedder import E5Embedder, CustomEmbedder
from justembed.training.trainer import CustomModelTrainer
# Train custom model
trainer = CustomModelTrainer()
model_dir = trainer.train(
    corpus=["text1", "text2", "text3"],
    model_name="my_model",
    embedding_dim=128,
    max_features=5000,
)
# Use custom embedder
embedder = CustomEmbedder("my_model")
embeddings = embedder.embed(["query text"])
query_emb = embedder.embed_query("search query")
# Use E5 embedder
e5 = E5Embedder()
embeddings = e5.embed(["text1", "text2"])
Configuration
Models stored in: ~/.cache/justembed/
- custom_models/ - Custom trained models
- tokenizer.json - E5 tokenizer
Workspace structure:
workspace/
├── kb/
│ ├── kb1.duckdb
│ ├── kb2.duckdb
│ └── _history.duckdb
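Given the layout above, the knowledge bases in a workspace can be discovered by scanning the `kb/` directory. This is a minimal sketch with two labeled assumptions: the helper name `list_knowledge_bases` is hypothetical, and files starting with an underscore (such as `_history.duckdb`) are treated as internal bookkeeping rather than user-facing KBs.

```python
import tempfile
from pathlib import Path

def list_knowledge_bases(workspace):
    """Return KB names found under <workspace>/kb, sorted alphabetically.

    Assumption: underscore-prefixed files such as _history.duckdb are
    internal and are skipped.
    """
    kb_dir = Path(workspace).expanduser() / "kb"
    if not kb_dir.is_dir():
        return []
    return sorted(
        path.stem
        for path in kb_dir.glob("*.duckdb")
        if not path.name.startswith("_")
    )

# Demonstrate on a throwaway workspace that mirrors the structure above.
with tempfile.TemporaryDirectory() as tmp:
    kb_dir = Path(tmp) / "kb"
    kb_dir.mkdir()
    for name in ("kb1.duckdb", "kb2.duckdb", "_history.duckdb"):
        (kb_dir / name).touch()
    found = list_knowledge_bases(tmp)

print(found)  # ['kb1', 'kb2']
```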
Roadmap
v0.1.x (Current)
- ✅ E5-Small embeddings
- ✅ Custom model training
- ✅ Web UI
- ✅ CLI
- ✅ Knowledge bases
- ✅ Semantic search
License
MIT License - see LICENSE file for details
Author
Krishnamoorthy Sankaran
Citation
If you use JustEmbed in your research, please cite:
@software{justembed2024,
title = {JustEmbed: Offline-First Semantic Search for Everyday Laptops},
author = {Sankaran, Krishnamoorthy},
year = {2024},
url = {https://github.com/sekarkrishna/justembed}
}
Acknowledgments
- E5-Small model by Microsoft
- ONNX Runtime by Microsoft
- FastAPI by Sebastián Ramírez
- DuckDB by DuckDB Labs
Support
- Issues: https://github.com/sekarkrishna/justembed/issues
- Discussions: https://github.com/sekarkrishna/justembed/discussions
- Email: krishnamoorthy.sankaran@sekrad.org
Changelog
v0.1.1a1 (2026-02-14)
New Features:
- Custom model training from text files
- Domain-specific synonym learning
- Model selection in KB creation
- Improved text chunking (sentence-based fallback)
- Web UI for model training
Improvements:
- Reduced minimum training corpus to 500 words
- Better error messages
- Model metadata display in UI
- Query results show model used
Bug Fixes:
- Fixed text chunking for continuous text
- Fixed ONNX shape handling for custom models
- Fixed model caching
v0.1.0 (2026-01-15)
- Initial release
- E5-Small embeddings
- Basic web UI
- Knowledge base management
- Semantic search
JustEmbed - Semantic search that just works. Offline. On your laptop. In seconds.
File details
Details for the file justembed-0.1.1a1.tar.gz.
File metadata
- Download URL: justembed-0.1.1a1.tar.gz
- Upload date:
- Size: 22.2 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6778dd7f911ff385198d648b58468842537be0ba1414e754e7a13a75608edfec |
| MD5 | 828359ccb527df605784cad454668c61 |
| BLAKE2b-256 | f1967bdaadd556e26c09f8f36d28434b2e2a00116e94ddfb7a39757b039faed1 |
File details
Details for the file justembed-0.1.1a1-py3-none-any.whl.
File metadata
- Download URL: justembed-0.1.1a1-py3-none-any.whl
- Upload date:
- Size: 22.2 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 60a976e874cf861341fe476a676ba7407c9d111d3c28bd5aa961b9939dbcb925 |
| MD5 | db91d459a98fb337c4a2a09cbde30d02 |
| BLAKE2b-256 | feb747d37edae9de10eaedf4f362ad9f3933bec4d29bd5100b8293b9c073545a |