
Embs is a lightweight Python toolkit for document retrieval, embedding generation, and ranking, with optional caching. It targets RAG pipelines, chatbots, and search systems.


embs


embs is a toolkit for document ingestion, embedding, and ranking workflows. Whether you are building a retrieval-augmented generation (RAG) system, a chatbot, or a semantic search engine, embs combines these steps behind a single client with minimal configuration.

Why Choose embs?

  • Free External APIs:

    • Docsifer for converting files/URLs (PDFs, HTML, images, etc.) to Markdown.
    • Lightweight Embeddings API for generating high-quality, multilingual embeddings.
  • Optimized for RAG & Chatbots:
    Automatically split documents into meaningful chunks, generate embeddings, and rank them by query relevance to empower your chatbot or generative model.

  • Flexible Splitting:
    Use the built-in Markdown splitter or provide a custom splitting function to best suit your documents.

  • Unified Pipeline:
    Seamlessly handle document ingestion, content extraction, embedding generation, and relevance ranking—all in one library.

  • DuckDuckGo-powered Web Search:
    The new search_documents function leverages DuckDuckGo to find relevant URLs by keyword, retrieves their content via Docsifer, and ranks the results.

  • Optional Embedding Results:
    Simply pass options={"embeddings": True} to receive the raw embedding vectors with your ranking results.
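
To make the "embed, then rank by query relevance" step concrete, here is a minimal sketch of how chunks could be ordered against a query embedding. It is not the library's internal implementation: the toy three-dimensional vectors, the `rank_chunks` helper, and the softmax normalization into a `probability` field are assumptions chosen purely for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_chunks(query_vec, chunks):
    """Rank chunks (dicts with 'markdown' and 'embedding') by similarity to query_vec.

    Similarities are passed through a softmax so the scores sum to 1.
    This normalization is an assumption, not a documented embs behavior.
    """
    if not chunks:
        return []
    sims = [cosine(query_vec, c["embedding"]) for c in chunks]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    ranked = [
        {"markdown": c["markdown"], "probability": e / total}
        for c, e in zip(chunks, exps)
    ]
    return sorted(ranked, key=lambda r: r["probability"], reverse=True)


# Toy data: hand-written vectors standing in for real embeddings.
chunks = [
    {"markdown": "Bread needs yeast to rise.", "embedding": [0.0, 0.2, 0.9]},
    {"markdown": "Qubits can be in superposition.", "embedding": [0.9, 0.1, 0.0]},
]
ranked = rank_chunks([1.0, 0.0, 0.0], chunks)
for r in ranked:
    print(f"{r['probability']:.4f}  {r['markdown']}")
```

With real embeddings the vectors would come from the embeddings API rather than being written by hand; the ordering logic stays the same.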

Installation

Install via pip:

pip install embs

Or add to your pyproject.toml (for Poetry):

[tool.poetry.dependencies]
embs = "^0.1.0"

Quick Start Examples

1. Query Documents (Ranking by Relevance)

This example shows how to retrieve documents (from a file, URL, or both), rank them by relevance to your query, and optionally include the embeddings.

import asyncio
from functools import partial
from embs import Embs

# Configure the built-in Markdown splitter.
split_config = {
    "headers_to_split_on": [("#", "h1"), ("##", "h2"), ("###", "h3")],
    "return_each_line": False,
    "strip_headers": True,
}
md_splitter = partial(Embs.markdown_splitter, config=split_config)

client = Embs()

# Asynchronously retrieve and rank documents.
async def run_query():
    docs = await client.query_documents_async(
        query="Explain quantum computing",
        files=["/path/to/quantum_theory.pdf"],
        splitter=md_splitter,
        options={"embeddings": True}  # Include embeddings in each result.
    )
    for d in docs:
        print(f"{d['filename']} => Score: {d['probability']:.4f}")
        print(f"Snippet: {d['markdown'][:80]}...")
        if "embeddings" in d:
            print("Embeddings:", d["embeddings"])
        print()

asyncio.run(run_query())

For synchronous usage:

# Reuses `client` and `md_splitter` from the async example above.
docs = client.query_documents(
    query="Explain quantum computing",
    files=["/path/to/quantum_theory.pdf"],
    splitter=md_splitter,
    options={"embeddings": True},  # Include embeddings in each result.
)
for d in docs:
    print(f"{d['filename']} => Score: {d['probability']:.4f}")
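
A custom splitter can take the place of `md_splitter` in the calls above. The sketch below assumes, based on the result fields used in these examples, that a splitter receives a list of dicts with `filename` and `markdown` keys and returns a list of the same shape; verify that contract against the library source before relying on it. The `paragraph_splitter` name and its `max_chars` parameter are hypothetical.

```python
from typing import Any, Dict, List


def paragraph_splitter(
    docs: List[Dict[str, Any]], max_chars: int = 500
) -> List[Dict[str, Any]]:
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut.
    The input/output shape (dicts with 'filename' and 'markdown') is an assumption.
    """
    chunks: List[Dict[str, Any]] = []
    for doc in docs:
        buffer = ""
        for para in doc["markdown"].split("\n\n"):
            para = para.strip()
            if not para:
                continue
            # +2 accounts for the blank line that joins paragraphs in a chunk.
            if buffer and len(buffer) + len(para) + 2 > max_chars:
                chunks.append({"filename": doc["filename"], "markdown": buffer})
                buffer = para
            else:
                buffer = f"{buffer}\n\n{para}" if buffer else para
        if buffer:
            chunks.append({"filename": doc["filename"], "markdown": buffer})
    return chunks


sample = [
    {
        "filename": "notes.md",
        "markdown": "First paragraph.\n\nSecond paragraph.\n\nThird.",
    }
]
pieces = paragraph_splitter(sample, max_chars=35)
for piece in pieces:
    print(repr(piece["markdown"]))
```

Under the same assumption, it could be passed as `splitter=partial(paragraph_splitter, max_chars=800)`, mirroring how the Markdown splitter is configured.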

2. Search Documents via DuckDuckGo

Use DuckDuckGo to search for relevant URLs by keyword, then retrieve, split, and rank their content.

import asyncio
from embs import Embs

client = Embs()

async def run_search():
    results = await client.search_documents_async(
        query="Latest advances in AI",
        limit=5,  # Maximum number of search results.
        blocklist=["youtube.com"],  # Optional: filter out certain domains.
        options={"embeddings": True},  # Include embeddings in the returned items.
    )
    for item in results:
        print(f"File: {item['filename']} | Score: {item['probability']:.4f}")
        print(f"Snippet: {item['markdown'][:80]}...\n")

asyncio.run(run_search())

For synchronous usage:

# Reuses `client` from the async example above.
results = client.search_documents(
    query="Latest advances in AI",
    limit=5,
    blocklist=["youtube.com"],
    options={"embeddings": True},
)
for item in results:
    print(f"File: {item['filename']} | Score: {item['probability']:.4f}")
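
In a RAG setting, ranked items are usually assembled into prompt context for a language model. The helper below is a sketch that relies only on the `filename`, `markdown`, and `probability` fields shown in the examples above; the `build_context` name, the score threshold, and the character budget are illustrative assumptions, and the sample data is made up.

```python
from typing import Any, Dict, List


def build_context(
    results: List[Dict[str, Any]], min_score: float = 0.1, max_chars: int = 2000
) -> str:
    """Concatenate the highest-scoring snippets into one context string.

    Items below min_score are skipped, and collection stops once the
    approximate character budget (separators not counted) would be exceeded.
    """
    parts: List[str] = []
    used = 0
    for item in sorted(results, key=lambda r: r["probability"], reverse=True):
        if item["probability"] < min_score:
            break  # everything after this point scores lower still
        block = f"[{item['filename']}]\n{item['markdown'].strip()}"
        if used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)


# Made-up results in the same shape as the items printed above.
sample_results = [
    {"filename": "b.md", "markdown": "Less relevant text.", "probability": 0.05},
    {"filename": "a.md", "markdown": "Most relevant text.", "probability": 0.80},
    {"filename": "c.md", "markdown": "Somewhat relevant text.", "probability": 0.15},
]
context = build_context(sample_results, min_score=0.1)
print(context)
```

Labeling each snippet with its source keeps the context traceable, which helps when the generated answer needs to cite where a statement came from.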

Caching for Performance

Enable caching to speed up repeated operations:

cache_conf = {
    "enabled": True,
    "type": "memory",       # or "disk"
    "prefix": "myapp",
    "dir": "cache_folder",  # required only for disk caching
    "max_mem_items": 128,
    "max_ttl_seconds": 86400
}

client = Embs(cache_config=cache_conf)
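
To illustrate what settings such as `max_mem_items` and `max_ttl_seconds` plausibly control, here is a small in-memory cache with an item cap and per-entry expiry. This is a conceptual sketch, not the cache that embs ships: the `TTLCache` class, its least-recently-used eviction policy, and the injectable `clock` are all assumptions.

```python
import time
from collections import OrderedDict


class TTLCache:
    """In-memory cache with a maximum item count and a time-to-live per entry."""

    def __init__(self, max_items=128, ttl_seconds=86400, clock=time.monotonic):
        self.max_items = max_items
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._data = OrderedDict()  # key -> (expires_at, value), oldest use first

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: drop it and report a miss
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value):
        self._data[key] = (self._clock() + self.ttl_seconds, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_items:
            self._data.popitem(last=False)  # evict the least recently used entry


cache = TTLCache(max_items=2, ttl_seconds=60)
cache.set("myapp:a", [0.1, 0.2])
cache.set("myapp:b", [0.3, 0.4])
cache.get("myapp:a")  # touching "a" makes "b" the eviction candidate
cache.set("myapp:c", [0.5, 0.6])
print(cache.get("myapp:b"))  # evicted, so this is a miss
```

The key prefix in the usage lines echoes the `prefix` setting from the configuration above; how embs actually builds its cache keys is not specified here.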

Testing

The library is tested using pytest and pytest-asyncio. To run the tests:

pytest --asyncio-mode=auto
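
With `--asyncio-mode=auto`, pytest-asyncio runs plain `async def` test functions without requiring a marker on each one. A test in that style might look like the sketch below; `StubClient` is a hypothetical stand-in so the example runs without network access, and it is not part of embs.

```python
import asyncio


class StubClient:
    """Hypothetical stand-in that mimics the result shape shown in this README."""

    async def query_documents_async(self, query, files=None, options=None):
        await asyncio.sleep(0)  # yield to the event loop like a real I/O call would
        return [
            {"filename": "a.md", "markdown": "alpha", "probability": 0.7},
            {"filename": "b.md", "markdown": "beta", "probability": 0.3},
        ]


async def test_results_are_ranked():
    docs = await StubClient().query_documents_async("alpha", files=["a.md", "b.md"])
    scores = [d["probability"] for d in docs]
    # Results should arrive best-first and carry the documented fields.
    assert scores == sorted(scores, reverse=True)
    assert all({"filename", "markdown", "probability"} <= d.keys() for d in docs)


if __name__ == "__main__":
    # Allows running the file directly, outside pytest.
    asyncio.run(test_results_are_ranked())
```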

License

Licensed under the MIT License. See LICENSE for details.

Contributions are welcome! Please submit issues, ideas, or pull requests.
