Embs is a lightweight Python toolkit for document retrieval, embedding generation, and ranking—ideal for RAG-based AI, chatbots, and search systems with caching support.
embs
embs is your one-stop toolkit for document ingestion, embedding, and ranking workflows. Whether you are building a retrieval-augmented generation (RAG) system, a chatbot, or a semantic search engine, embs makes it fast and simple to integrate document retrieval, embedding, and ranking with minimal configuration.
Why Choose embs?
- Free External APIs:
  - Docsifer for converting files/URLs (PDFs, HTML, images, etc.) to Markdown.
  - Lightweight Embeddings API for generating high-quality, multilingual embeddings.
- Optimized for RAG & Chatbots: Automatically split documents into meaningful chunks, generate embeddings, and rank them by query relevance to empower your chatbot or generative model.
- Flexible Splitting: Use the built-in Markdown splitter or provide a custom splitting function to best suit your documents.
- Unified Pipeline: Handle document ingestion, content extraction, embedding generation, and relevance ranking in one library.
- DuckDuckGo-powered Web Search: The new search_documents function uses DuckDuckGo to find relevant URLs by keyword, retrieves their content via Docsifer, and ranks the results.
- Optional Embedding Results: Pass options={"embeddings": True} to receive the raw embedding vectors with your ranking results.
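To make the split, embed, and rank idea concrete, here is a minimal, self-contained sketch. It does not call embs or its external APIs. The toy_embed function is a hypothetical stand-in (a bag-of-words count vector) for a real embedding model, and scoring with cosine similarity followed by a softmax is an assumption about how a relevance probability could be produced, not a description of the library's internals.

```python
import math
from collections import Counter


def toy_embed(text, vocab):
    """Hypothetical stand-in for an embedding model: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_chunks(query, chunks):
    """Score each chunk against the query and sort by descending probability."""
    vocab = sorted({w for t in [query, *chunks] for w in t.lower().split()})
    q_vec = toy_embed(query, vocab)
    sims = [cosine(q_vec, toy_embed(c, vocab)) for c in chunks]
    # Softmax turns the similarities into scores that sum to 1 (an assumed scoring rule).
    exps = [math.exp(s) for s in sims]
    total = sum(exps)
    ranked = [
        {"markdown": c, "probability": e / total} for c, e in zip(chunks, exps)
    ]
    return sorted(ranked, key=lambda r: r["probability"], reverse=True)


chunks = [
    "qubits enable quantum computing",
    "bread recipes for beginners",
    "classical computing uses bits",
]
ranked = rank_chunks("quantum computing", chunks)
for r in ranked:
    print(f"{r['probability']:.4f}  {r['markdown']}")
```

The chunk that shares the most words with the query ranks first and the unrelated chunk ranks last. A real embedding model would capture meaning rather than exact word overlap.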
Installation
Install via pip:
pip install embs
Or add to your pyproject.toml (for Poetry):
[tool.poetry.dependencies]
embs = "^0.1.0"
Quick Start Examples
1. Query Documents (Ranking by Relevance)
This example shows how to retrieve documents (from a file, URL, or both), rank them by relevance to your query, and optionally include the embeddings.
import asyncio
from functools import partial

from embs import Embs

# Configure the built-in Markdown splitter.
split_config = {
    "headers_to_split_on": [("#", "h1"), ("##", "h2"), ("###", "h3")],
    "return_each_line": False,
    "strip_headers": True,
}
md_splitter = partial(Embs.markdown_splitter, config=split_config)
client = Embs()

# Asynchronously retrieve and rank documents.
async def run_query():
    docs = await client.query_documents_async(
        query="Explain quantum computing",
        files=["/path/to/quantum_theory.pdf"],
        splitter=md_splitter,
        options={"embeddings": True},  # Include embeddings in each result.
    )
    for d in docs:
        print(f"{d['filename']} => Score: {d['probability']:.4f}")
        print(f"Snippet: {d['markdown'][:80]}...")
        if "embeddings" in d:
            print("Embeddings:", d["embeddings"])
        print()

asyncio.run(run_query())
For synchronous usage:
# Reuses client and md_splitter from the example above.
docs = client.query_documents(
    query="Explain quantum computing",
    files=["/path/to/quantum_theory.pdf"],
    splitter=md_splitter,
    options={"embeddings": True},
)
for d in docs:
    print(d["filename"], "=> Score:", d["probability"])
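A custom splitter can replace the built-in Markdown splitter. The sketch below is illustrative only. It assumes, without confirmation from the text above, that a splitter receives a list of dicts with "filename" and "markdown" keys and returns chunks in the same shape. The name paragraph_splitter and its max_chars parameter are hypothetical.

```python
def paragraph_splitter(docs, max_chars=500):
    """Hypothetical custom splitter: one chunk per blank-line-separated paragraph.

    Assumes each doc is a dict with "filename" and "markdown" keys and that
    chunks should be returned in the same shape.
    """
    chunks = []
    for doc in docs:
        for para in doc["markdown"].split("\n\n"):
            text = para.strip()
            if not text:
                continue  # skip empty paragraphs
            # Cut long paragraphs into slices so no chunk exceeds max_chars.
            for start in range(0, len(text), max_chars):
                chunks.append(
                    {
                        "filename": doc["filename"],
                        "markdown": text[start:start + max_chars],
                    }
                )
    return chunks


sample = [{"filename": "notes.md", "markdown": "First paragraph.\n\n\n\nSecond paragraph."}]
chunks = paragraph_splitter(sample)
print(chunks)
```

If that interface assumption holds, the function would be passed as splitter=paragraph_splitter in place of md_splitter.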
2. Search Documents via DuckDuckGo
Use DuckDuckGo to search for relevant URLs by keyword, then retrieve, split, and rank their content.
import asyncio

from embs import Embs

client = Embs()

async def run_search():
    results = await client.search_documents_async(
        query="Latest advances in AI",
        limit=5,  # Maximum number of search results.
        blocklist=["youtube.com"],  # Optional: filter out certain domains.
        options={"embeddings": True},  # Include embeddings in the returned items.
    )
    for item in results:
        print(f"File: {item['filename']} | Score: {item['probability']:.4f}")
        print(f"Snippet: {item['markdown'][:80]}...\n")

asyncio.run(run_search())
For synchronous usage:
# Reuses client from the example above.
results = client.search_documents(
    query="Latest advances in AI",
    limit=5,
    blocklist=["youtube.com"],
    options={"embeddings": True},
)
for item in results:
    print(f"File: {item['filename']} | Score: {item['probability']:.4f}")
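The blocklist option filters out results from certain domains. How embs matches entries is not stated here, so the following is only one plausible approach, using the standard library: a URL is treated as blocked when its host equals a blocked domain or is a subdomain of it. The helper name is_blocked is hypothetical.

```python
from urllib.parse import urlparse


def is_blocked(url, blocklist):
    """Return True if the URL's host equals or is a subdomain of a blocked domain."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocklist)


urls = [
    "https://www.youtube.com/watch?v=abc",
    "https://example.org/ai-news",
    "https://notyoutube.com/post",
]
allowed = [u for u in urls if not is_blocked(u, ["youtube.com"])]
print(allowed)
```

Matching on the parsed host rather than on a raw substring avoids blocking unrelated domains that merely contain the blocked name.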
Caching for Performance
Enable caching to speed up repeated operations:
cache_conf = {
    "enabled": True,
    "type": "memory",  # or "disk"
    "prefix": "myapp",
    "dir": "cache_folder",  # required only for disk caching
    "max_mem_items": 128,
    "max_ttl_seconds": 86400,
}
client = Embs(cache_config=cache_conf)
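To illustrate what settings such as max_mem_items and max_ttl_seconds typically control, here is a small in-memory cache sketch. It is not the library's implementation. The class name TinyTTLCache, the least-recently-used eviction policy, and the injectable clock are all assumptions made for the example.

```python
import time
from collections import OrderedDict


class TinyTTLCache:
    """Hypothetical in-memory cache: LRU eviction plus per-item expiry."""

    def __init__(self, max_items=128, ttl_seconds=86400, clock=time.monotonic):
        self.max_items = max_items
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._items = OrderedDict()  # key -> (stored_at, value)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at > self.ttl_seconds:
            del self._items[key]  # expired
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value):
        self._items[key] = (self.clock(), value)
        self._items.move_to_end(key)
        while len(self._items) > self.max_items:
            self._items.popitem(last=False)  # evict least recently used


cache = TinyTTLCache(max_items=2, ttl_seconds=60)
cache.set("myapp:a", 1)
cache.set("myapp:b", 2)
cache.set("myapp:c", 3)  # exceeds max_items, so "myapp:a" is evicted
print(cache.get("myapp:a"), cache.get("myapp:c"))
```

A size cap bounds memory use, while a time-to-live keeps stale embeddings or converted documents from being served indefinitely.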
Testing
The library is tested using pytest and pytest-asyncio. To run the tests:
pytest --asyncio-mode=auto
License
Licensed under the MIT License. See LICENSE for details.
Contributions are welcome! Please submit issues, ideas, or pull requests.