Embs is a lightweight Python toolkit for document retrieval, embedding generation, and ranking, with optional caching. It is aimed at RAG pipelines, chatbots, and search systems.

embs

embs is your one-stop toolkit for document ingestion, embedding, and ranking workflows. Whether you are building a retrieval-augmented generation (RAG) system, a chatbot, or a semantic search engine, embs makes it fast and simple to integrate document retrieval, embedding, and ranking with minimal configuration.

Why Choose embs?

  • Free External APIs:
    • Docsifer for converting files/URLs (PDFs, HTML, images, etc.) to Markdown.
    • Lightweight Embeddings API for generating high-quality, multilingual embeddings.
  • Optimized for RAG & Chatbots:
    Automatically split documents into meaningful chunks, generate embeddings, and rank them by query relevance to empower your chatbot or generative model.

  • Flexible Splitting:
    Use the built-in Markdown splitter or provide a custom splitting function to best suit your documents.

  • Unified Pipeline:
    Handle document ingestion, content extraction, embedding generation, and relevance ranking in one library.

  • DuckDuckGo-powered Web Search:
    The new search_documents function leverages DuckDuckGo to find relevant URLs by keyword, retrieves their content via Docsifer, and ranks the results.

  • Optional Embedding Results:
    Simply pass options={"embeddings": True} to receive the raw embedding vectors with your ranking results.
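
To make the "custom splitting function" idea concrete, here is a minimal sketch of a paragraph-based splitter. It assumes, without confirmation from the text above, that a splitter is a callable that receives a list of dicts with "filename" and "markdown" keys and returns a list of the same shape. The name paragraph_splitter and the max_chars parameter are hypothetical.

```python
from typing import Dict, List


def paragraph_splitter(
    docs: List[Dict[str, str]], max_chars: int = 500
) -> List[Dict[str, str]]:
    """Split each document on blank lines and pack paragraphs into chunks.

    Assumption: each doc is a dict with "filename" and "markdown" keys.
    """
    chunks: List[Dict[str, str]] = []
    for doc in docs:
        buffer = ""
        for para in doc["markdown"].split("\n\n"):
            para = para.strip()
            if not para:
                continue
            # Start a new chunk when this paragraph would exceed the limit.
            if buffer and len(buffer) + len(para) + 2 > max_chars:
                chunks.append({"filename": doc["filename"], "markdown": buffer})
                buffer = para
            else:
                buffer = f"{buffer}\n\n{para}" if buffer else para
        if buffer:
            chunks.append({"filename": doc["filename"], "markdown": buffer})
    return chunks


# Made-up sample input, for illustration only.
sample = [
    {
        "filename": "notes.md",
        "markdown": "First paragraph.\n\nSecond paragraph.\n\nThird.",
    }
]
pieces = paragraph_splitter(sample, max_chars=35)
for piece in pieces:
    print(piece["filename"], repr(piece["markdown"]))
```

If the assumed signature holds, a function like this could be passed as the splitter argument in place of the built-in Markdown splitter shown in the Quick Start examples.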

Installation

Install via pip:

pip install embs

Or add to your pyproject.toml (for Poetry):

[tool.poetry.dependencies]
embs = "^0.1.0"

Quick Start Examples

1. Query Documents (Ranking by Relevance)

This example shows how to retrieve documents (from a file, URL, or both), rank them by relevance to your query, and optionally include the embeddings.

import asyncio
from functools import partial
from embs import Embs

# Configure the built-in Markdown splitter.
split_config = {
    "headers_to_split_on": [("#", "h1"), ("##", "h2"), ("###", "h3")],
    "return_each_line": False,
    "strip_headers": True,
}
md_splitter = partial(Embs.markdown_splitter, config=split_config)

client = Embs()

# Asynchronously retrieve and rank documents.
async def run_query():
    docs = await client.query_documents_async(
        query="Explain quantum computing",
        files=["/path/to/quantum_theory.pdf"],
        splitter=md_splitter,
        options={"embeddings": True}  # Include embeddings in each result.
    )
    for d in docs:
        print(f"{d['filename']} => Score: {d['probability']:.4f}")
        print(f"Snippet: {d['markdown'][:80]}...")
        if "embeddings" in d:
            print("Embeddings:", d["embeddings"])
        print()

asyncio.run(run_query())

For synchronous usage:

# Reuses `client` and `md_splitter` from the asynchronous example above.
docs = client.query_documents(
    query="Explain quantum computing",
    files=["/path/to/quantum_theory.pdf"],
    splitter=md_splitter,
    options={"embeddings": True}
)
for d in docs:
    print(d["filename"], "=> Score:", d["probability"])
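
In a RAG setting, the ranked results are usually trimmed before being handed to a model. The sketch below works on plain dicts with the "filename", "markdown", and "probability" keys used in the examples above. The helper names top_results and build_context, the threshold values, and the sample data are all hypothetical and not part of embs.

```python
from typing import Any, Dict, List


def top_results(
    docs: List[Dict[str, Any]], min_score: float = 0.0, k: int = 3
) -> List[Dict[str, Any]]:
    """Keep at most k results whose score is at least min_score, best first."""
    kept = [d for d in docs if d["probability"] >= min_score]
    kept.sort(key=lambda d: d["probability"], reverse=True)
    return kept[:k]


def build_context(docs: List[Dict[str, Any]], separator: str = "\n\n---\n\n") -> str:
    """Join the selected chunks into one prompt-ready context string."""
    return separator.join(f"[{d['filename']}]\n{d['markdown']}" for d in docs)


# Made-up results shaped like the dicts printed in the examples above.
sample_docs = [
    {"filename": "a.pdf", "markdown": "Qubits can be in superposition.", "probability": 0.62},
    {"filename": "b.pdf", "markdown": "Unrelated appendix.", "probability": 0.05},
    {"filename": "a.pdf", "markdown": "Gates act on qubits.", "probability": 0.33},
]

best = top_results(sample_docs, min_score=0.1, k=2)
context = build_context(best)
print(context)
```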

2. Search Documents via DuckDuckGo

Use DuckDuckGo to search for relevant URLs by keyword, then retrieve, split, and rank their content.

import asyncio
from embs import Embs

client = Embs()

async def run_search():
    results = await client.search_documents_async(
        query="Latest advances in AI",
        limit=5,         # Maximum number of search results.
        blocklist=["youtube.com"],  # Optional: filter out certain domains.
        options={"embeddings": True}  # Include embeddings in the returned items.
    )
    for item in results:
        print(f"File: {item['filename']} | Score: {item['probability']:.4f}")
        print(f"Snippet: {item['markdown'][:80]}...\n")

asyncio.run(run_search())

For synchronous usage:

# Reuses `client` from the asynchronous example above.
results = client.search_documents(
    query="Latest advances in AI",
    limit=5,
    blocklist=["youtube.com"],
    options={"embeddings": True}
)
for item in results:
    print(f"File: {item['filename']} | Score: {item['probability']:.4f}")
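
When embeddings are requested, one possible use is comparing returned chunks with each other, for example to spot near-duplicates. The sketch below computes cosine similarity in pure Python. It assumes each item's "embeddings" value is a flat list of floats, which the text above does not confirm, and the sample vectors are made up.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for item["embeddings"] (assumed flat lists).
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]
vec_c = [0.0, 0.0, 3.0]

print(f"a vs b: {cosine_similarity(vec_a, vec_b):.3f}")
print(f"a vs c: {cosine_similarity(vec_a, vec_c):.3f}")
```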

Caching for Performance

Enable caching to speed up repeated operations:

from embs import Embs

cache_conf = {
    "enabled": True,
    "type": "memory",       # or "disk"
    "prefix": "myapp",
    "dir": "cache_folder",  # required only for disk caching
    "max_mem_items": 128,
    "max_ttl_seconds": 86400
}

client = Embs(cache_config=cache_conf)
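
To illustrate what settings such as max_mem_items and max_ttl_seconds typically control, here is a minimal sketch of a bounded in-memory cache with time-based expiry. It is an illustration of the general technique, not the cache embs ships. The class name TTLCache and its interface are hypothetical.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional, Tuple


class TTLCache:
    """Bounded in-memory cache: entries expire after a TTL, oldest evicted first."""

    def __init__(
        self,
        max_items: int = 128,
        ttl_seconds: float = 86400.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_items = max_items
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._items: "OrderedDict[str, Tuple[float, Any]]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self.ttl_seconds:
            # Entry is too old: drop it and report a miss.
            del self._items[key]
            return None
        # Mark as recently used so it is evicted last.
        self._items.move_to_end(key)
        return value

    def set(self, key: str, value: Any) -> None:
        self._items[key] = (self._clock(), value)
        self._items.move_to_end(key)
        # Evict least recently used entries beyond the size limit.
        while len(self._items) > self.max_items:
            self._items.popitem(last=False)


cache = TTLCache(max_items=2, ttl_seconds=60)
cache.set("query:quantum", ["chunk 1", "chunk 2"])
print(cache.get("query:quantum"))
```

The injectable clock argument is a design choice that makes expiry testable without sleeping.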

Testing

The library is tested using pytest and pytest-asyncio. To run the tests:

pytest --asyncio-mode=auto
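
With --asyncio-mode=auto, pytest-asyncio collects plain async test functions without an explicit marker. The sketch below shows the shape such a test might take for your own code that consumes embs results. FakeEmbs is a hypothetical stand-in that returns canned data so the test needs no network access. It is not part of the library or its test suite.

```python
import asyncio
from typing import Any, Dict, List, Optional


class FakeEmbs:
    """Hypothetical stub that mimics the async query method with canned results."""

    async def query_documents_async(
        self, query: str, files: Optional[List[str]] = None, **kwargs: Any
    ) -> List[Dict[str, Any]]:
        return [
            {"filename": "a.md", "markdown": "alpha", "probability": 0.7},
            {"filename": "b.md", "markdown": "beta", "probability": 0.3},
        ]


async def test_results_are_sorted_by_probability() -> None:
    client = FakeEmbs()
    docs = await client.query_documents_async(
        query="anything", files=["a.md", "b.md"]
    )
    scores = [d["probability"] for d in docs]
    # Results are expected best-first and must carry the keys callers rely on.
    assert scores == sorted(scores, reverse=True)
    assert all({"filename", "markdown", "probability"} <= d.keys() for d in docs)


if __name__ == "__main__":
    # Allows running the check directly, outside pytest.
    asyncio.run(test_results_are_sorted_by_probability())
```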

License

Licensed under the MIT License. See LICENSE for details.

Contributions are welcome! Please submit issues, ideas, or pull requests.
