Embs is a lightweight Python toolkit for document retrieval, embedding generation, and ranking, with caching support. It is suited to RAG-based AI, chatbots, and search systems.

embs

embs is your one-stop toolkit for document ingestion, embedding, and ranking workflows. Whether you are building a retrieval-augmented generation (RAG) system, a chatbot, or a semantic search engine, embs makes it fast and simple to integrate document retrieval, embedding, and ranking with minimal configuration.

Why Choose embs?

- Free External APIs:
  - Docsifer for converting files/URLs (PDFs, HTML, images, etc.) to Markdown.
  - Lightweight Embeddings API for generating high-quality, multilingual embeddings.
- Optimized for RAG & Chatbots: Automatically split documents into meaningful chunks, generate embeddings, and rank them by query relevance to empower your chatbot or generative model.
- Flexible Splitting: Use the built-in Markdown splitter or provide a custom splitting function to best suit your documents.
- Unified Pipeline: Seamlessly handle document ingestion, content extraction, embedding generation, and relevance ranking, all in one library.
- DuckDuckGo-powered Web Search: The new search_documents function leverages DuckDuckGo to find relevant URLs by keyword, retrieves their content via Docsifer, and ranks the results.
- Optional Embedding Results: Pass options={"embeddings": True} to receive the raw embedding vectors with your ranking results.
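
As a rough illustration of the split, embed, and rank flow described in the list above, the sketch below ranks pre-computed chunk vectors against a query vector. It is not embs' implementation: the toy vectors, the hypothetical rank_chunks helper, and the choice of cosine similarity followed by a softmax are all assumptions made for illustration. In embs the vectors come from the Lightweight Embeddings API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_chunks(query_vec, chunks):
    """Hypothetical helper (not part of embs): sort chunks by query relevance.

    `chunks` is a list of dicts with "markdown" and "embeddings" keys.
    Returns new dicts with an added "probability" (softmax over similarities).
    """
    scores = [cosine(query_vec, c["embeddings"]) for c in chunks]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    ranked = [{**c, "probability": e / total} for c, e in zip(chunks, exps)]
    return sorted(ranked, key=lambda c: c["probability"], reverse=True)

# Toy 3-dimensional vectors; real embeddings have far more dimensions.
chunks = [
    {"markdown": "Baking sourdough bread", "embeddings": [0.0, 0.2, 0.9]},
    {"markdown": "Qubits and superposition", "embeddings": [0.9, 0.1, 0.0]},
]
ranked = rank_chunks([1.0, 0.0, 0.0], chunks)
for c in ranked:
    print(f"{c['markdown']} => {c['probability']:.4f}")
```

The chunk whose vector points in nearly the same direction as the query vector ends up first, which is the behavior the "probability" field in the examples further down is meant to convey.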

Installation

Install via pip:

pip install embs

Or add to your pyproject.toml (for Poetry):

[tool.poetry.dependencies]
embs = "^0.1.0"

Quick Start Examples

1. Query Documents (Ranking by Relevance)

This example shows how to retrieve documents (from a file, URL, or both), rank them by relevance to your query, and optionally include the embeddings.

import asyncio
from functools import partial
from embs import Embs

# Configure the built-in Markdown splitter.
split_config = {
    "headers_to_split_on": [("#", "h1"), ("##", "h2"), ("###", "h3")],
    "return_each_line": False,
    "strip_headers": True,
}
md_splitter = partial(Embs.markdown_splitter, config=split_config)

client = Embs()

# Asynchronously retrieve and rank documents.
async def run_query():
    docs = await client.query_documents_async(
        query="Explain quantum computing",
        files=["/path/to/quantum_theory.pdf"],
        splitter=md_splitter,
        options={"embeddings": True}  # Include embeddings in each result.
    )
    for d in docs:
        print(f"{d['filename']} => Score: {d['probability']:.4f}")
        print(f"Snippet: {d['markdown'][:80]}...")
        if "embeddings" in d:
            print("Embeddings:", d["embeddings"])
        print()

asyncio.run(run_query())

For synchronous usage:

docs = client.query_documents(
    query="Explain quantum computing",
    files=["/path/to/quantum_theory.pdf"],
    splitter=md_splitter,
    options={"embeddings": True}
)
for d in docs:
    print(d["filename"], "=> Score:", d["probability"])
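
The splitter argument in the calls above can also be a custom function. The sketch below shows one possible shape for it. The calling convention is an assumption, not something this README confirms: it presumes the splitter receives a list of dicts with "filename" and "markdown" keys (the keys seen in the results above) and returns a list of the same shape. The paragraph_splitter name and the max_chars parameter are hypothetical.

```python
def paragraph_splitter(docs, max_chars=500):
    """Hypothetical custom splitter: one chunk per paragraph, capped in length.

    Assumes each doc is a dict with "filename" and "markdown" keys.
    """
    chunks = []
    for doc in docs:
        for para in doc["markdown"].split("\n\n"):
            text = para.strip()
            if not text:
                continue  # Skip empty paragraphs left by repeated blank lines.
            # Cut long paragraphs into pieces of at most max_chars characters.
            for start in range(0, len(text), max_chars):
                chunks.append({
                    "filename": doc["filename"],
                    "markdown": text[start:start + max_chars],
                })
    return chunks

sample = [{
    "filename": "notes.md",
    "markdown": "# Title\n\nFirst paragraph.\n\n\n\nSecond paragraph.",
}]
chunks = paragraph_splitter(sample)
for c in chunks:
    print(c["filename"], "|", c["markdown"])
```

If the assumed calling convention holds, the function would be passed as splitter=paragraph_splitter in place of md_splitter; check the library source before relying on it.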

2. Search Documents via DuckDuckGo

Use DuckDuckGo to search for relevant URLs by keyword, then retrieve, split, and rank their content.

import asyncio
from embs import Embs

client = Embs()

async def run_search():
    results = await client.search_documents_async(
        query="Latest advances in AI",
        limit=5,  # Maximum number of search results.
        blocklist=["youtube.com"],  # Optional: filter out certain domains.
        options={"embeddings": True}  # Include embeddings in the returned items.
    )
    for item in results:
        print(f"File: {item['filename']} | Score: {item['probability']:.4f}")
        print(f"Snippet: {item['markdown'][:80]}...\n")

asyncio.run(run_search())

For synchronous usage:

results = client.search_documents(
    query="Latest advances in AI",
    limit=5,
    blocklist=["youtube.com"],
    options={"embeddings": True}
)
for item in results:
    print(f"File: {item['filename']} | Score: {item['probability']:.4f}")
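
To make the blocklist option more concrete, here is a minimal sketch of how a list of search-result URLs could be filtered by domain. It is an illustration only: the is_blocked helper and the sample URLs are hypothetical, and embs may match domains differently (for example, by substring). The sketch treats a URL as blocked when its host equals a blocklisted domain or is a subdomain of one.

```python
from urllib.parse import urlparse

def is_blocked(url, blocklist):
    """Hypothetical helper: True if the URL's host is a blocklisted domain
    or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in (entry.lower() for entry in blocklist)
    )

urls = [
    "https://www.youtube.com/watch?v=abc",
    "https://example.org/ai-news",
    "https://notyoutube.com/article",
]
allowed = [u for u in urls if not is_blocked(u, ["youtube.com"])]
print(allowed)
```

Comparing hosts rather than raw URL strings avoids accidentally dropping unrelated sites whose names merely contain a blocklisted domain.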

Caching for Performance

Enable caching to speed up repeated operations:

cache_conf = {
    "enabled": True,
    "type": "memory",       # or "disk"
    "prefix": "myapp",
    "dir": "cache_folder",  # required only for disk caching
    "max_mem_items": 128,
    "max_ttl_seconds": 86400
}

client = Embs(cache_config=cache_conf)
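
The names max_mem_items and max_ttl_seconds suggest a bounded in-memory cache whose entries expire after a time limit. The sketch below shows how those two limits typically interact. It is not embs' cache implementation: the TinyMemoryCache class, the least-recently-used eviction policy, and the prefixed key strings are assumptions made for illustration.

```python
import time
from collections import OrderedDict

class TinyMemoryCache:
    """Hypothetical in-memory cache with a size cap and a per-entry TTL."""

    def __init__(self, max_items=128, ttl_seconds=86400, clock=time.monotonic):
        self._items = OrderedDict()  # key -> (stored_at, value)
        self.max_items = max_items
        self.ttl_seconds = ttl_seconds
        self._clock = clock

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self.ttl_seconds:
            del self._items[key]  # Entry is older than the TTL: drop it.
            return None
        self._items.move_to_end(key)  # Mark as most recently used.
        return value

    def set(self, key, value):
        self._items[key] = (self._clock(), value)
        self._items.move_to_end(key)
        while len(self._items) > self.max_items:
            self._items.popitem(last=False)  # Evict the least recently used.

cache = TinyMemoryCache(max_items=2, ttl_seconds=60)
cache.set("myapp:a", [0.1])
cache.set("myapp:b", [0.2])
cache.get("myapp:a")          # Touch "a" so "b" becomes least recently used.
cache.set("myapp:c", [0.3])   # Exceeds max_items, so "b" is evicted.
print(cache.get("myapp:b"))
```

A small max_mem_items keeps memory bounded at the cost of more repeated API calls, while a long max_ttl_seconds is reasonable when the underlying documents rarely change.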

Testing

The library is tested using pytest and pytest-asyncio. To run the tests:

pytest --asyncio-mode=auto

License

Licensed under the MIT License. See LICENSE for details.

Contributions are welcome! Please submit issues, ideas, or pull requests.

Release history

- 0.1.8: Feb 2, 2025
- 0.1.7: Feb 2, 2025
- 0.1.6: Feb 2, 2025
- 0.1.5: Feb 2, 2025
- 0.1.4: Feb 2, 2025 (this version)
- 0.1.3: Feb 2, 2025
- 0.1.2: Jan 27, 2025
- 0.1.1: Jan 27, 2025

Download files

Source Distribution

embs-0.1.4.tar.gz (12.5 kB)

Uploaded Feb 2, 2025 Source

Built Distribution

embs-0.1.4-py3-none-any.whl (11.3 kB)

Uploaded Feb 2, 2025 Python 3

File details

Details for the file embs-0.1.4.tar.gz.

File metadata

Download URL: embs-0.1.4.tar.gz
Upload date: Feb 2, 2025
Size: 12.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for embs-0.1.4.tar.gz
Algorithm	Hash digest
SHA256	`cc1bc33ecab5fa1e50a2a03abb32a363c8ef27a1699d7da60163445739d17eb2`
MD5	`240d6c115514d38fec526f3924d28781`
BLAKE2b-256	`dd7570e16563f5b30513593c1111a76e4bf106670bdbecb714aa35c20522ac36`

Provenance

The following attestation bundles were made for embs-0.1.4.tar.gz:

Publisher: release.yml on lh0x00/embs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: embs-0.1.4.tar.gz
- Subject digest: cc1bc33ecab5fa1e50a2a03abb32a363c8ef27a1699d7da60163445739d17eb2
- Sigstore transparency entry: 167948638
- Sigstore integration time: Feb 2, 2025
Source repository:
- Permalink: lh0x00/embs@5f2ad0c978e527d56d08de9ec78d31bfb76a31ed
- Branch / Tag: refs/tags/v0.1.4
- Owner: https://github.com/lh0x00
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@5f2ad0c978e527d56d08de9ec78d31bfb76a31ed
- Trigger Event: push

File details

Details for the file embs-0.1.4-py3-none-any.whl.

File metadata

Download URL: embs-0.1.4-py3-none-any.whl
Upload date: Feb 2, 2025
Size: 11.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for embs-0.1.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ff47bc06e51f65f5d4ef62d8ca4265cb23dc930ad582c15faba880905311961e`
MD5	`f391ccf3227648a2c2e95c171d6a50e3`
BLAKE2b-256	`69b2911a599346ee398c16c7865929f2f24146e5b82944771e11e94dded5f91d`

Provenance

The following attestation bundles were made for embs-0.1.4-py3-none-any.whl:

Publisher: release.yml on lh0x00/embs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: embs-0.1.4-py3-none-any.whl
- Subject digest: ff47bc06e51f65f5d4ef62d8ca4265cb23dc930ad582c15faba880905311961e
- Sigstore transparency entry: 167948640
- Sigstore integration time: Feb 2, 2025
Source repository:
- Permalink: lh0x00/embs@5f2ad0c978e527d56d08de9ec78d31bfb76a31ed
- Branch / Tag: refs/tags/v0.1.4
- Owner: https://github.com/lh0x00
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@5f2ad0c978e527d56d08de9ec78d31bfb76a31ed
- Trigger Event: push
