Embs is a lightweight Python toolkit for document retrieval, embedding generation, and ranking, with optional caching. It is aimed at RAG pipelines, chatbots, and search systems.

embs

embs is your one-stop toolkit for document ingestion, embedding, and ranking workflows. Whether you are building a retrieval-augmented generation (RAG) system, a chatbot, or a semantic search engine, embs makes it fast and simple to integrate document retrieval, embedding, and ranking with minimal configuration.

Why Choose embs?

  • Free External APIs:

    • Docsifer for converting files/URLs (PDFs, HTML, images, etc.) to Markdown.
    • Lightweight Embeddings API for generating high-quality, multilingual embeddings.
  • Optimized for RAG & Chatbots:
    Automatically split documents into meaningful chunks, generate embeddings, and rank them by query relevance to empower your chatbot or generative model.

  • Flexible Splitting:
    Use the built-in Markdown splitter or provide a custom splitting function to best suit your documents.

  • Unified Pipeline:
    Seamlessly handle document ingestion, content extraction, embedding generation, and relevance ranking—all in one library.

  • DuckDuckGo-powered Web Search:
    The new search_documents function leverages DuckDuckGo to find relevant URLs by keyword, retrieves their content via Docsifer, and ranks the results.

  • Optional Embedding Results:
    Simply pass options={"embeddings": True} to receive the raw embedding vectors with your ranking results.
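
To make the idea of a custom splitting function concrete, here is a minimal sketch of a header-based splitter. It is not the embs implementation. It assumes a splitter receives a list of document dicts with "filename" and "markdown" keys (the keys used in the result-printing code in this README) and returns a list of the same shape; the name split_by_headers and its max_level parameter are hypothetical.

```python
import re
from typing import Dict, List


def split_by_headers(docs: List[Dict[str, str]], max_level: int = 3) -> List[Dict[str, str]]:
    """Split each document into one chunk per Markdown header section.

    Hypothetical helper for illustration only; not part of embs.
    """
    # Matches lines such as "# Title" or "### Title" up to max_level hashes.
    header = re.compile(r"^#{1,%d}\s+\S" % max_level)
    chunks: List[Dict[str, str]] = []
    for doc in docs:
        current: List[str] = []
        for line in doc["markdown"].splitlines():
            # A new header closes the section collected so far.
            if header.match(line) and current:
                chunks.append({"filename": doc["filename"],
                               "markdown": "\n".join(current).strip()})
                current = []
            current.append(line)
        if current:
            chunks.append({"filename": doc["filename"],
                           "markdown": "\n".join(current).strip()})
    # Drop chunks that contain only whitespace.
    return [c for c in chunks if c["markdown"]]
```

If the assumed input and output shapes hold, a function like this could be passed wherever the built-in Markdown splitter is accepted.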

Installation

Install via pip:

pip install embs

Or add to your pyproject.toml (for Poetry):

[tool.poetry.dependencies]
embs = "^0.1.0"

Quick Start Examples

1. Query Documents (Ranking by Relevance)

This example shows how to retrieve documents (from a file, URL, or both), rank them by relevance to your query, and optionally include the embeddings.

import asyncio
from functools import partial
from embs import Embs

# Configure the built-in Markdown splitter.
split_config = {
    "headers_to_split_on": [("#", "h1"), ("##", "h2"), ("###", "h3")],
    "return_each_line": False,
    "strip_headers": True,
}
md_splitter = partial(Embs.markdown_splitter, config=split_config)

client = Embs()

# Asynchronously retrieve and rank documents.
async def run_query():
    docs = await client.query_documents_async(
        query="Explain quantum computing",
        files=["/path/to/quantum_theory.pdf"],
        splitter=md_splitter,
        options={"embeddings": True}  # Include embeddings in each result.
    )
    for d in docs:
        print(f"{d['filename']} => Score: {d['probability']:.4f}")
        print(f"Snippet: {d['markdown'][:80]}...")
        if "embeddings" in d:
            print("Embeddings:", d["embeddings"])
        print()

asyncio.run(run_query())

For synchronous usage, reusing client and md_splitter from the example above:

docs = client.query_documents(
    query="Explain quantum computing",
    files=["/path/to/quantum_theory.pdf"],
    splitter=md_splitter,
    options={"embeddings": True}
)
for d in docs:
    print(d["filename"], "=> Score:", d["probability"])
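
In a RAG setup, the ranked results are typically condensed into a context string for the generative model. The sketch below shows one way to do that, relying only on the "filename", "probability", and "markdown" keys shown above. The function build_context and its parameters (top_k, min_score, max_chars) are hypothetical and not part of embs.

```python
from typing import Any, Dict, List


def build_context(results: List[Dict[str, Any]], top_k: int = 3,
                  min_score: float = 0.0, max_chars: int = 2000) -> str:
    """Join the highest-scoring chunks into one prompt context string.

    Hypothetical helper for illustration only; not part of embs.
    """
    # Sort defensively in case the caller merged several result lists.
    ranked = sorted(results, key=lambda r: r["probability"], reverse=True)
    parts: List[str] = []
    used = 0
    for item in ranked[:top_k]:
        if item["probability"] < min_score:
            break  # Everything after this point scores even lower.
        block = f"[{item['filename']}]\n{item['markdown'].strip()}"
        if used + len(block) > max_chars:
            break  # Keep the context within the character budget.
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)
```

Calling build_context(docs, top_k=3, min_score=0.2) on the results of either query example would yield a string ready to place in a prompt.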

2. Search Documents via DuckDuckGo

Use DuckDuckGo to search for relevant URLs by keyword, then retrieve, split, and rank their content.

import asyncio
from embs import Embs

client = Embs()

async def run_search():
    results = await client.search_documents_async(
        query="Latest advances in AI",
        limit=5,         # Maximum number of search results.
        blocklist=["youtube.com"],  # Optional: filter out certain domains.
        options={"embeddings": True}  # Include embeddings in the returned items.
    )
    for item in results:
        print(f"File: {item['filename']} | Score: {item['probability']:.4f}")
        print(f"Snippet: {item['markdown'][:80]}...\n")

asyncio.run(run_search())

For synchronous usage, with the same client:

results = client.search_documents(
    query="Latest advances in AI",
    limit=5,
    blocklist=["youtube.com"],
    options={"embeddings": True}
)
for item in results:
    print(f"File: {item['filename']} | Score: {item['probability']:.4f}")
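
This README does not specify how blocklist entries are matched against search results. The sketch below shows one conservative interpretation, matching a URL's hostname against each blocked domain and its subdomains, using only the standard library. The function is_blocked is hypothetical, and embs may match differently.

```python
from typing import Iterable
from urllib.parse import urlparse


def is_blocked(url: str, blocklist: Iterable[str]) -> bool:
    """Return True if the URL's host is a blocked domain or one of its subdomains.

    Hypothetical helper for illustration only; embs may match differently.
    """
    host = (urlparse(url).hostname or "").lower()
    for domain in blocklist:
        domain = domain.lower()
        # Exact match, or a subdomain such as "www.youtube.com".
        if host == domain or host.endswith("." + domain):
            return True
    return False


urls = ["https://www.youtube.com/watch?v=1", "https://example.org/article"]
kept = [u for u in urls if not is_blocked(u, ["youtube.com"])]
```

Matching on the parsed hostname rather than on the raw URL string avoids false positives when a blocked domain merely appears in a path or query string.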

Caching for Performance

Enable caching to speed up repeated operations:

cache_conf = {
    "enabled": True,
    "type": "memory",       # or "disk"
    "prefix": "myapp",
    "dir": "cache_folder",  # required only for disk caching
    "max_mem_items": 128,
    "max_ttl_seconds": 86400
}

client = Embs(cache_config=cache_conf)
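
For intuition about what settings such as max_mem_items and max_ttl_seconds conventionally control, here is a minimal in-memory cache sketch with a size limit (least recently used entries are evicted first) and a per-entry time-to-live. It is an illustration under those assumptions, not the cache embs uses internally. The class TTLMemoryCache and its clock parameter are hypothetical.

```python
import time
from collections import OrderedDict
from typing import Any, Callable


class TTLMemoryCache:
    """Size-bounded, time-limited in-memory cache (illustrative only)."""

    def __init__(self, max_items: int = 128, ttl_seconds: float = 86400,
                 prefix: str = "", clock: Callable[[], float] = time.monotonic):
        self._max_items = max_items
        self._ttl = ttl_seconds
        self._prefix = prefix
        self._clock = clock  # Injectable so expiry can be tested without sleeping.
        self._items: "OrderedDict[str, tuple]" = OrderedDict()

    def _key(self, key: str) -> str:
        # Namespacing keys lets several applications share one store.
        return f"{self._prefix}:{key}"

    def set(self, key: str, value: Any) -> None:
        full = self._key(key)
        self._items[full] = (self._clock() + self._ttl, value)
        self._items.move_to_end(full)
        # Evict least recently used entries once the size limit is exceeded.
        while len(self._items) > self._max_items:
            self._items.popitem(last=False)

    def get(self, key: str, default: Any = None) -> Any:
        full = self._key(key)
        entry = self._items.get(full)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[full]  # Expired entries are dropped on access.
            return default
        self._items.move_to_end(full)  # Mark as recently used.
        return value
```

A smaller max_ttl_seconds trades cache hits for fresher content, which matters most when the cached documents come from frequently updated URLs.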

Testing

The library is tested using pytest and pytest-asyncio. To run the tests:

pytest --asyncio-mode=auto
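
With --asyncio-mode=auto, pytest-asyncio collects plain async def test functions without requiring a marker on each one. The sketch below shows the shape such a test might take. FakeClient is a hypothetical stand-in that mirrors the query_documents_async method name and the result keys used in this README, so the test runs without network access. It is not taken from the embs test suite.

```python
from typing import Any, Dict, List, Optional


class FakeClient:
    """Hypothetical stub standing in for a real client; makes no network calls."""

    async def query_documents_async(self, query: str,
                                    files: Optional[List[str]] = None,
                                    **kwargs: Any) -> List[Dict[str, Any]]:
        # Return one canned result per requested file.
        return [{"filename": name, "probability": 1.0, "markdown": f"About {query}"}
                for name in (files or [])]


async def test_results_have_expected_keys() -> None:
    client = FakeClient()
    docs = await client.query_documents_async(query="quantum", files=["a.md"])
    assert len(docs) == 1
    assert {"filename", "probability", "markdown"} <= docs[0].keys()
```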

License

Licensed under the MIT License. See LICENSE for details.

Contributions are welcome! Please submit issues, ideas, or pull requests.
