JustEmbed

Your First Step Into Semantic Search

Experience embeddings hands-on. No cloud accounts, no setup complexity, no commitment. Just your laptop and your curiosity.

Python 3.8+ | License: MIT

Author: Krishnamoorthy Sankaran
Email: krishnamoorthy.sankaran@sekrad.org
GitHub: https://github.com/sekarkrishna/justembed
PyPI: https://pypi.org/project/justembed/


What is JustEmbed?

JustEmbed is a focused tool for semantic search: it matches on meaning, not just keywords. It's designed as your entry point into the embedding ecosystem, letting you experience how semantic search works before committing to cloud platforms or production tools.

For Non-Technical Users

Upload your documents through a web interface and search by meaning. No coding required, no technical knowledge needed. See exactly how your text is processed and understand what's happening at each step.

For Developers

A simple Python API (import justembed as je) lets you experiment with embeddings locally. Build confidence with semantic search concepts before moving to production vector databases.


Quick Start

Installation

pip install justembed

Web Interface

justembed begin --workspace ~/my_documents

Open http://localhost:5424 in your browser.

Python API

import justembed as je

je.begin(workspace="~/docs")
je.create_kb("my_kb")
je.add(kb="my_kb", path="document.txt")

# Hybrid search (default in v0.1.1a8+)
results = je.query("search term", kb="my_kb")

# Adjust the keyword/semantic balance (alpha below 0.5 favors keywords)
results = je.query("search term", kb="my_kb", alpha=0.3)

CLI

# Create KB
justembed create-kb my_kb

# Add documents
justembed add my_kb document.txt

# Hybrid search (v0.1.1a8+)
justembed search my_kb "search term"

# Adjust alpha parameter
justembed search my_kb "search term" --alpha 0.3

# Pure keyword search
justembed search my_kb "search term" --mode bm25

# Pure semantic search
justembed search my_kb "search term" --mode semantic

# With match explanation
justembed search my_kb "search term" --explain

# Inspect vocabulary (custom models)
justembed inspect-vocab my_custom_model --dimension 0 --top-k 10

# Evaluate model
justembed evaluate my_kb eval_data.json

# Compare models
justembed evaluate my_kb eval_data.json --compare baseline_kb

Understanding Semantic Search

Traditional keyword search looks for exact word matches. Semantic search understands meaning.

Example: Imagine a document with these paragraphs:

  1. "Volcanoes erupt with molten lava at temperatures exceeding 1000°C..."
  2. "Industrial smelting uses high-temperature furnaces above 800°C..."
  3. "Igloos are dome-shaped shelters built from compressed snow..."
  4. "Icebergs float in cold ocean waters at sub-zero temperatures..."

Search for "hot":

  • Traditional search: No results (word "hot" doesn't appear)
  • Semantic search: Returns paragraphs 1 & 2 (understands heat/temperature relationship)

This is what JustEmbed lets you experience.
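
The keyword half of this example can be checked directly. The sketch below is not part of JustEmbed; `literal_search` is a hypothetical helper that does whole-word matching over the four sample paragraphs. It shows that "hot" matches nothing, and that a literal match on "temperatures" also pulls in the iceberg paragraph, which is about cold.

```python
import re

paragraphs = [
    "Volcanoes erupt with molten lava at temperatures exceeding 1000°C...",
    "Industrial smelting uses high-temperature furnaces above 800°C...",
    "Igloos are dome-shaped shelters built from compressed snow...",
    "Icebergs float in cold ocean waters at sub-zero temperatures...",
]

def literal_search(query, docs):
    """Return 1-based indices of docs containing the query as a whole word."""
    word = query.lower()
    return [
        index
        for index, doc in enumerate(docs, start=1)
        if word in re.findall(r"\w+", doc.lower())
    ]

print(literal_search("hot", paragraphs))           # no paragraph contains the word
print(literal_search("temperatures", paragraphs))  # matches heat AND cold paragraphs
```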


Core Concepts

1. Chunking

Documents are broken into smaller pieces (chunks) for efficient searching. JustEmbed's UI shows you exactly how your text will be chunked before processing.
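
To make the idea concrete, here is a minimal sketch of rule-based chunking. It is not JustEmbed's actual algorithm: `chunk_text` is a hypothetical function, whitespace-separated words stand in for tokens, and the way `max_tokens` and `merge_threshold` are interpreted here is an assumption based only on their names.

```python
def chunk_text(text, max_tokens=300, merge_threshold=50):
    """Split text into paragraph-based chunks (illustrative sketch only)."""
    # Paragraphs are separated by blank lines; words approximate tokens.
    paragraphs = [p.split() for p in text.split("\n\n") if p.strip()]

    # Step 1: cut any paragraph longer than max_tokens into fixed-size pieces.
    pieces = []
    for words in paragraphs:
        for start in range(0, len(words), max_tokens):
            pieces.append(words[start:start + max_tokens])

    # Step 2: fold a piece shorter than merge_threshold into the previous
    # chunk, as long as the combined chunk still fits within max_tokens.
    chunks = []
    for piece in pieces:
        fits = chunks and len(chunks[-1]) + len(piece) <= max_tokens
        if fits and len(piece) < merge_threshold:
            chunks[-1] = chunks[-1] + piece
        else:
            chunks.append(piece)
    return [" ".join(chunk) for chunk in chunks]


sample = "a b c\n\nd e\n\n" + " ".join(["w"] * 10)
print(chunk_text(sample, max_tokens=6, merge_threshold=3))
```

Because the rules involve no randomness, calling the function twice on the same text yields identical chunks.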

2. Embedding

Each chunk is converted to a list of numbers (an embedding) that represents its meaning. Similar meanings have similar numbers.

3. Searching

When you search, your query is converted to an embedding and compared to all chunk embeddings. Results are ranked by similarity (0.0-1.0 score).
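
The comparison step can be sketched with cosine similarity, a common choice for comparing embeddings. Whether JustEmbed uses exactly this measure is an assumption. The two-dimensional vectors below are hand-made toy values (axis 0 roughly "heat", axis 1 roughly "cold"), not output from a real model, and `rank_chunks` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_chunks(query_vec, chunk_vecs, top_k=3):
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Toy embeddings; real ones have hundreds of dimensions.
chunk_vecs = {
    "volcano": [0.9, 0.1],
    "smelting": [0.8, 0.2],
    "igloo": [0.1, 0.9],
    "iceberg": [0.2, 0.8],
}
query_vec = [1.0, 0.0]  # stand-in for the query "hot"

for name, score in rank_chunks(query_vec, chunk_vecs, top_k=2):
    print(f"{name}: {score:.3f}")
```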

4. Hybrid Search (v0.1.1a8+)

Combines two search methods for better results:

  • Semantic search: Understands meaning (finds "hot" when you search for "temperature")
  • BM25 keyword search: Finds exact word matches (finds "Python" when you search for "Python")

The alpha parameter (0-1) controls the balance:

  • alpha=0.0: Pure keyword search (BM25 only)
  • alpha=0.5: Balanced (default)
  • alpha=1.0: Pure semantic search
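
One way to read the alpha values above is as a linear blend of the two scores. The sketch below assumes exactly that, and also assumes both scores are already normalized to the 0-1 range; JustEmbed's internal formula may differ. `hybrid_score` is a hypothetical helper, and the chunk scores are made up for illustration.

```python
def hybrid_score(semantic, bm25, alpha=0.5):
    """Blend two normalized scores: alpha=0 is keyword only, alpha=1 semantic only."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * semantic + (1.0 - alpha) * bm25

# Made-up scores: chunk A is close in meaning, chunk B shares exact words.
chunk_a = {"semantic": 0.9, "bm25": 0.1}
chunk_b = {"semantic": 0.4, "bm25": 0.8}

for alpha in (0.3, 0.8):
    score_a = hybrid_score(chunk_a["semantic"], chunk_a["bm25"], alpha)
    score_b = hybrid_score(chunk_b["semantic"], chunk_b["bm25"], alpha)
    winner = "A" if score_a > score_b else "B"
    print(f"alpha={alpha}: A={score_a:.2f} B={score_b:.2f} -> {winner} ranks first")
```

With these numbers, the keyword-heavy chunk B wins at alpha=0.3 and the meaning-heavy chunk A wins at alpha=0.8, which is the trade-off the parameter is meant to expose.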

Complete API Reference

Workspace Management

# Start workspace
je.begin(workspace="~/my_docs", port=5424)

# Register existing workspace
je.register_workspace("~/shared_workspace")

# List workspaces
workspaces = je.list_workspaces()

# Deregister (data stays on disk)
je.deregister_workspace("~/old_workspace", confirm=True)

# Stop server
je.terminate()

Knowledge Bases

# Create with default model
je.create_kb("general_kb")

# Create with custom model
je.create_kb("medical_kb", model="medical_v1")

# List all KBs
kbs = je.list_kbs()

# Delete KB
je.delete_kb("old_kb", confirm=True)

Adding Documents

# From file
je.add(kb="my_kb", path="document.txt")

# From text
je.add(kb="my_kb", text="Your content...")

# With chunking options
je.add(
    kb="my_kb",
    path="document.txt",
    max_tokens=300,
    merge_threshold=50,
)

Searching

# Basic search (uses hybrid search by default in v0.1.1a8+)
results = je.query("search term", kb="my_kb")

# Search all KBs
results = je.query("search term", kb="all")

# Advanced options
results = je.query(
    text="search term",
    kb="my_kb",
    top_k=10,
    min_score=0.5
)

# Hybrid search with alpha parameter (v0.1.1a8+)
results = je.query(
    text="search term",
    kb="my_kb",
    alpha=0.5,  # 0.0=keyword only, 1.0=semantic only
    mode="hybrid"  # or "semantic", "bm25"
)

# Results structure
for result in results:
    print(f"Score: {result['score']:.3f}")
    print(f"Text: {result['text']}")
    print(f"File: {result['file']}")
    print(f"KB: {result['kb']}")
    
    # Hybrid search provides score breakdown (v0.1.1a8+)
    if 'bm25_score' in result:
        print(f"BM25: {result['bm25_score']:.3f}")
        print(f"Semantic: {result['semantic_score']:.3f}")
        print(f"Matching terms: {result['matching_terms']}")

Match Explanation (v0.1.1a8+)

Understand why results matched your query:

# Get detailed explanation for a result
from justembed.interpretability import MatchExplainer

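# hybrid_engine, model, query, and result are assumed to exist already;
# this snippet does not show how to obtain them.
explainer = MatchExplainer(hybrid_engine, model, model_type="e5")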
explanation = explainer.explain_result(query, result)

# Explanation includes:
# - Score breakdown (BM25 vs semantic contribution)
# - Matching keywords
# - Semantic similarity analysis
# - Plain language summary for domain experts

Vocabulary Inspection (Custom Models, v0.1.1a8+)

Inspect what your custom model learned:

from justembed.interpretability import VocabularyInspector

# custom_model is assumed to be an already-trained custom model object.
inspector = VocabularyInspector(custom_model)

# Get top features for a dimension
features = inspector.get_top_features_for_dimension(dim=0, top_k=10)
for feature, weight in features:
    print(f"{feature}: {weight:.4f}")

# Find dimensions influenced by a term
dims = inspector.get_dimensions_for_feature("medical")

Custom Model Training

# Train from file
je.train_model(
    name="medical_v1",
    training_data="medical_textbook.txt",
    embedding_dim=128,
    max_features=5000
)

# Train from text
je.train_model(
    name="legal_v1",
    training_data=["Your training corpus..."],
    embedding_dim=128
)

# List models
models = je.list_models()

Key Features

Hybrid Search (v0.1.1a8+)

Combines keyword and semantic search for better results:

# Create KB (hybrid search enabled automatically)
je.create_kb("my_kb")
je.add(kb="my_kb", path="documents.txt")

# Search with default balanced mode (alpha=0.5)
results = je.query("Python programming", kb="my_kb")

# Adjust alpha for your use case
results = je.query(
    "Python programming",
    kb="my_kb",
    alpha=0.3  # More weight on keywords
)

# Pure keyword search (exact matches)
results = je.query("Python", kb="my_kb", alpha=0.0)

# Pure semantic search (meaning-based)
results = je.query("programming language", kb="my_kb", alpha=1.0)

When to adjust alpha:

  • Technical docs (alpha=0.2-0.4): Favor exact term matches (API names, error codes)
  • General content (alpha=0.5): Balanced approach (default)
  • Conceptual search (alpha=0.6-0.8): Favor meaning over exact words

Model Comparison (v0.1.1a8+)

Compare custom models against the E5 baseline:

# Train custom model
je.train_model("domain_v1", training_data="domain_docs.txt")

# Create KBs with different models
je.create_kb("custom_kb", model="domain_v1")
je.create_kb("baseline_kb", model="e5")

# Add same documents to both
je.add(kb="custom_kb", path="test_docs.txt")
je.add(kb="baseline_kb", path="test_docs.txt")

# Compare via web UI (tabbed interface)
# Or via evaluation API
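# NOTE: the original snippet calls evaluate_models without importing it;
# importing it from justembed.evaluation is an assumption.
from justembed.evaluation import EvaluationEngine, evaluate_models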

eval_data = {
    "queries": ["query1", "query2"],
    "relevance": {
        "query1": ["chunk_id_1", "chunk_id_2"],
        "query2": ["chunk_id_3"]
    }
}

comparison = evaluate_models(
    model_a="custom_kb",
    model_b="baseline_kb",
    eval_data=eval_data
)

print(f"Custom model MAP: {comparison['model_a']['MAP']:.3f}")
print(f"Baseline MAP: {comparison['model_b']['MAP']:.3f}")
print(f"Improvement: {comparison['delta']['MAP']:.3f}")

Evaluation Metrics (v0.1.1a8+)

Measure search quality with standard IR metrics:

from justembed.evaluation import EvaluationEngine

# Prepare evaluation data
queries = ["Python programming", "machine learning"]
relevance_judgments = {
    "Python programming": ["chunk_1", "chunk_2"],
    "machine learning": ["chunk_3"]
}

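# Evaluate
# HybridSearchEngine, kb_name, workspace, and embedder are not defined in
# this snippet; they are assumed to come from your own setup.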
engine = HybridSearchEngine(kb_name, workspace, embedder)
eval_engine = EvaluationEngine(engine)

results = eval_engine.evaluate(
    queries,
    relevance_judgments,
    k_values=[1, 3, 5, 10]
)

# Metrics provided:
# - Precision@k: Fraction of top-k results that are relevant
# - Recall@k: Fraction of relevant docs in top-k
# - MAP: Mean Average Precision across queries

print(f"Precision@3: {results['precision_at_k'][3]:.3f}")
print(f"Recall@3: {results['recall_at_k'][3]:.3f}")
print(f"MAP: {results['mean_average_precision']:.3f}")
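
For reference, the three metrics can be computed by hand in a few lines. This is a standalone sketch using the standard textbook definitions, not JustEmbed's `EvaluationEngine`; the function names and chunk IDs are illustrative.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant (divides by k)."""
    return sum(1 for cid in retrieved[:k] if cid in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant IDs that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for cid in retrieved[:k] if cid in relevant) / len(relevant)

def average_precision(retrieved, relevant):
    """Mean of precision values taken at each rank where a relevant ID appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    """runs is a list of (retrieved, relevant) pairs, one per query."""
    return sum(average_precision(ret, rel) for ret, rel in runs) / len(runs)

runs = [
    (["chunk_1", "chunk_9", "chunk_2"], {"chunk_1", "chunk_2"}),
    (["chunk_7", "chunk_3"], {"chunk_3"}),
]
print(f"Precision@3 (query 1): {precision_at_k(*runs[0], k=3):.3f}")
print(f"Recall@3 (query 1): {recall_at_k(*runs[0], k=3):.3f}")
print(f"MAP: {mean_average_precision(runs):.3f}")
```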

Embedding Visualization (v0.1.1a8+)

Visualize search results in 2D space:

from justembed.visualization import EmbeddingVisualizer

visualizer = EmbeddingVisualizer()

# Get search results
results = je.query("search term", kb="my_kb", top_k=20)

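# Extract embeddings
# embedder is assumed to be an existing embedder object; the 'embedding' key
# is not listed in the results structure shown earlier.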
query_embedding = embedder.embed_query("search term")
result_embeddings = [r['embedding'] for r in results]
result_texts = [r['text'] for r in results]
scores = [r['score'] for r in results]

# Create visualization
html = visualizer.visualize_query_results(
    query_embedding,
    result_embeddings,
    result_texts,
    scores
)

# Display in web UI or save to file
with open("visualization.html", "w") as f:
    f.write(html)

Domain-Specific Models

Train models that understand your domain's vocabulary:

# Medical domain
medical_text = """
Pyrexia, commonly known as fever, is elevated body temperature.
Renal function refers to kidney performance.
A UTI affects the bladder and kidneys.
"""

je.train_model("medical_v1", training_data=[medical_text])
je.create_kb("medical_kb", model="medical_v1")

# Now "fever" finds "pyrexia", "kidney" finds "renal"

Multiple Knowledge Bases

Organize by topic, each with its own model:

je.create_kb("medical_kb", model="medical_v1")
je.create_kb("legal_kb", model="legal_v1")
je.create_kb("general_kb")  # Uses default E5-Small model

Workspace Sharing

Share by zipping the workspace folder:

# Create and populate
je.begin(workspace="~/shared_kb")
je.create_kb("team_kb")
je.add(kb="team_kb", path="docs.txt")

# Zip ~/shared_kb and share

# Recipient registers and uses
je.register_workspace("~/received_kb")
je.begin(workspace="~/received_kb")
results = je.query("search", kb="team_kb")
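
The zip and unzip steps themselves can be scripted with the standard library. This sketch is independent of JustEmbed: `zip_workspace` and `unzip_workspace` are hypothetical helpers, the paths in the usage comment are placeholders, and whether a workspace stores anything machine-specific is not covered here.

```python
import shutil
from pathlib import Path

def zip_workspace(workspace_dir, archive_base):
    """Zip the contents of workspace_dir into archive_base + '.zip'.

    Keep archive_base outside workspace_dir so the archive does not
    end up containing itself. Returns the archive path as a string.
    """
    workspace = Path(workspace_dir).expanduser()
    base = Path(archive_base).expanduser()
    return shutil.make_archive(str(base), "zip", root_dir=str(workspace))

def unzip_workspace(archive_path, target_dir):
    """Extract a shared archive into target_dir and return that directory."""
    target = Path(target_dir).expanduser()
    shutil.unpack_archive(str(archive_path), str(target))
    return target

# Example usage (placeholder paths):
#   archive = zip_workspace("~/shared_kb", "~/shared_kb_export")
#   unzip_workspace(archive, "~/received_kb")
```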

Architecture

User Interface (Web UI / Python API / CLI)
           ↓
    FastAPI Server
           ↓
Hybrid Search Engine (v0.1.1a8+)
    ├─ BM25 Engine (SQLite FTS5)
    └─ Semantic Engine (Embeddings)
           ↓
Embedder Layer (E5-Small / Custom Models)
           ↓
Storage Layer (SQLite / File System)

Design Decisions

Offline-First: Everything runs locally. No API keys, no cloud dependencies, no internet after installation.

Hybrid Search (v0.1.1a8+): Combines keyword (BM25) and semantic search for better results. Automatically indexes documents for both search methods during KB build.

ONNX Models: Portable, CPU-friendly, small size (~8-15 MB). Works on any platform.

SQLite Storage: Embedded database with FTS5 for full-text search. No separate server. Fast and reliable.
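
As an illustration of what FTS5 provides, the sketch below builds an in-memory full-text index with Python's built-in sqlite3 module and ranks matches with FTS5's `bm25()` function. The table and column names (`chunks`, `body`) and both helper functions are hypothetical and do not reflect JustEmbed's real schema; the code also assumes your Python's SQLite build includes FTS5.

```python
import sqlite3

def build_index(texts):
    """Create an in-memory FTS5 table holding one row per text."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(body)")
    conn.executemany("INSERT INTO chunks(body) VALUES (?)", [(t,) for t in texts])
    return conn

def keyword_search(conn, term, limit=5):
    """Return (body, score) rows, best match first.

    FTS5's bm25() returns lower (more negative) values for better
    matches, so ascending order puts the best match first.
    """
    # Quote the term so FTS5 treats it as a plain phrase, not query syntax.
    phrase = '"' + term.replace('"', '""') + '"'
    return conn.execute(
        "SELECT body, bm25(chunks) FROM chunks "
        "WHERE chunks MATCH ? ORDER BY bm25(chunks) LIMIT ?",
        (phrase, limit),
    ).fetchall()

conn = build_index([
    "Python is a popular programming language",
    "Igloos are built from compressed snow",
    "A python is also a large snake",
])
for body, score in keyword_search(conn, "python"):
    print(f"{score:.3f}  {body}")
```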

Deterministic Chunking: Rule-based, predictable. Same input always produces same chunks.

Privacy: Your data never leaves your machine. No telemetry, no tracking.


Requirements

  • Python 3.8+
  • 500 MB disk space
  • 1 GB RAM
  • CPU (no GPU required)
  • No internet (after installation)

Guarantees

Technical:

  • Deterministic (same input → same output)
  • No hallucinations (only returns your text)
  • Offline (works without internet)
  • Private (data never leaves your machine)
  • No tracking or telemetry

File System:

  • Writes only to workspace and ~/.cache/justembed/
  • Reads only files you upload
  • Never deletes files outside workspace

License

MIT License


Author

Krishnamoorthy Sankaran


Citation

@software{justembed2026,
  title = {JustEmbed: Your First Step Into Semantic Search},
  author = {Sankaran, Krishnamoorthy},
  year = {2026},
  url = {https://github.com/sekarkrishna/justembed}
}

Acknowledgments

  • E5-Small model: Microsoft Research
  • ONNX Runtime: Microsoft
  • FastAPI: Sebastián Ramírez
  • DuckDB: DuckDB Labs
  • scikit-learn: scikit-learn developers

JustEmbed - Start here. Build confidence. Graduate to production tools when ready.
