Embs is a Python toolkit for retrieving documents (via Docsifer), generating embeddings (via Lightweight Embeddings API), and ranking texts with an optional caching system.
embs
embs is your one-stop toolkit for handling document ingestion, embedding, and ranking workflows. Whether you're building a retrieval-augmented generation (RAG) system, a chatbot, or a semantic search engine, embs makes it easy to integrate document retrieval, embeddings, and ranking with minimal setup.
Why Choose embs?
- Free External APIs:
  - Docsifer for document conversion (PDFs, URLs, images, etc.), and
  - Lightweight Embeddings API for generating state-of-the-art embeddings, including multi-language support and some of the best models for NLP tasks, all provided free of charge.
  - These APIs support multilingual embedding models such as sentence-transformers and OpenAI-compatible embeddings, so you can achieve high-quality results with minimal configuration.
- Perfect for RAG Systems: Automatically convert and split documents into meaningful chunks, generate embeddings, and rank them, all tailored for retrieval-augmented workflows built on OpenAI GPT or other generative models.
- Integrates with Chatbots: Preprocess, split, and embed your knowledge base to build conversational systems that respond with accurate and contextually relevant answers.
- Flexible Splitting: Use built-in or custom chunking strategies to split large documents into smaller, retrievable sections. This improves relevance in retrieval-based workflows.
- Unified Pipeline: Streamline everything from document ingestion to semantic ranking with a single API.
- Lightweight & Extensible: No heavy dependencies beyond aiohttp. Easily fits into your existing infrastructure.
Installation
pip install embs
Or in pyproject.toml (Poetry):
[tool.poetry.dependencies]
embs = "^0.1.0"
Key Use Cases
Retrieval-Augmented Generation (RAG)
In RAG workflows, retrieved knowledge informs a generative model like GPT to produce accurate and relevant answers. embs simplifies this by:
- Converting raw documents (PDFs, URLs) to clean text or markdown.
- Splitting documents into retrievable chunks (e.g., by headers or lines).
- Embedding chunks with powerful multilingual models.
- Ranking the chunks for relevance to the query.
With caching enabled, repeated requests are even faster, ensuring scalability for real-world deployments.
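The splitting step can be illustrated without any network calls. The sketch below is a minimal, standalone approximation of header-based chunking. `split_by_headers` is a hypothetical helper written for this illustration, not the embs implementation, and it ignores edge cases such as `#` lines inside code blocks.

```python
import re
from typing import Dict, List

# Hypothetical helper for illustration only; not part of embs.
# Matches Markdown headers of level 1 to 3 ("#", "##", "###").
HEADER_RE = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headers(doc: Dict[str, str]) -> List[Dict[str, str]]:
    """Split one {"filename", "markdown"} doc into chunks at header lines."""
    chunks: List[Dict[str, str]] = []
    current: List[str] = []

    def flush() -> None:
        text = "\n".join(current).strip()
        if text:  # skip empty sections
            chunks.append({"filename": doc["filename"], "markdown": text})
        current.clear()

    for line in doc["markdown"].splitlines():
        if HEADER_RE.match(line):
            flush()  # a header closes the previous section
        current.append(line)
    flush()
    return chunks


sample = {"filename": "notes.md", "markdown": "# Intro\nHello.\n## Details\nMore text."}
for chunk in split_by_headers(sample):
    print(chunk["markdown"].splitlines()[0])
```

Each chunk keeps its source filename, so a later ranking step can report where a snippet came from.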
Code Practices
Below is an end-to-end example that retrieves documents, applies the built-in Markdown splitter, generates embeddings, and ranks the resulting chunks by query relevance. It shows how embs fits into a chatbot or RAG pipeline.
Example: Retrieve, Split, and Rank
import asyncio
from functools import partial

from embs import Embs


async def main():
    # Markdown-based splitter configuration
    split_config = {
        "headers_to_split_on": [("#", "h1"), ("##", "h2"), ("###", "h3")],
        "return_each_line": False,  # Keep chunks as sections, not individual lines
        "strip_headers": True,  # Remove header text from the chunks
    }
    md_splitter = partial(Embs.markdown_splitter, config=split_config)

    # Initialize the Embs client
    client = Embs()

    # Step 1: Retrieve documents and split them by Markdown headers
    raw_docs = await client.retrieve_documents_async(
        files=["/path/to/sample.pdf"],
        urls=["https://example.com"],
        splitter=md_splitter,  # Apply built-in markdown splitter
    )
    print(f"Total chunks after splitting: {len(raw_docs)}")

    # Step 2: Retrieve additional sources and rank their chunks by relevance to a query
    results = await client.search_documents_async(
        query="Explain quantum computing",
        files=["/path/to/quantum_theory.pdf"],  # Files to retrieve and rank
        urls=["https://example.com/quantum.html"],
        splitter=md_splitter,  # Apply the same splitter to these sources
    )

    # Step 3: Output the top-ranked results
    for item in results[:3]:
        print(f"File: {item['filename']} | Score: {item['probability']:.4f}")
        print(f"Snippet: {item['markdown'][:80]}...\n")


asyncio.run(main())
Why Is This Perfect for Chatbots?
- Context-Aware Answers: By splitting large documents into manageable chunks and ranking them for relevance, your chatbot can respond with the snippets most relevant to the user's question.
- Multilingual Embeddings: The Lightweight Embeddings API supports embeddings for multiple languages, so your chatbot can handle diverse user inputs and knowledge bases.
- Caching for Scalability: Repeated retrieval or ranking operations are sped up dramatically with in-memory or disk-based caching, ensuring low-latency responses.
API Reference
Below are the primary methods in embs. All async methods have a synchronous equivalent.
1. retrieve_documents_async / retrieve_documents
Convert files and/or URLs into Markdown using Docsifer. Optionally, apply a splitter to break down large documents into chunks.
async def retrieve_documents_async(
    files=None,
    urls=None,
    openai_config=None,
    settings=None,
    concurrency=5,
    options=None,
    splitter=None
) -> List[Dict[str, str]]:
    ...
- Params:
  - files: List of file paths or file-like objects.
  - urls: List of URLs for Docsifer to process.
  - splitter: A callable that receives and returns a list of docs, e.g., Embs.markdown_splitter.
- Returns: A list of documents ({"filename": <str>, "markdown": <str>}).
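Because the splitter is described as a callable that receives and returns a list of docs, a custom chunking strategy can be an ordinary function. The sketch below assumes each doc is a dict with filename and markdown keys, as in the return value above. `window_splitter`, its `size` and `overlap` parameters, and `my_splitter` are hypothetical names introduced for this example and are not part of embs.

```python
from functools import partial
from typing import Dict, List

Doc = Dict[str, str]


def window_splitter(docs: List[Doc], size: int = 500, overlap: int = 50) -> List[Doc]:
    """Hypothetical custom splitter: fixed-size character windows with overlap."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    out: List[Doc] = []
    for doc in docs:
        text = doc["markdown"]
        for start in range(0, max(len(text), 1), step):
            piece = text[start:start + size]
            if piece.strip():  # drop whitespace-only windows
                out.append({"filename": doc["filename"], "markdown": piece})
            if start + size >= len(text):
                break  # this window already reached the end of the text
    return out


# Bind the parameters so the callable takes only the list of docs.
my_splitter = partial(window_splitter, size=10, overlap=2)
for chunk in my_splitter([{"filename": "a.md", "markdown": "abcdefghijklmnopqrstuvwxyz"}]):
    print(chunk["markdown"])
```

If the splitter contract is as described above, my_splitter could then be passed as splitter=my_splitter to retrieve_documents_async.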
2. embed_async / embed
Generate embeddings for text or a list of texts using the Lightweight Embeddings API.
async def embed_async(
    text_or_texts: Union[str, List[str]],
    model=None
) -> Dict[str, Any]:
    ...
- Params:
  - text_or_texts: Single string or list of strings to embed.
  - model: Optional; specify the embedding model (defaults to snowflake-arctic-embed-l-v2.0).
- Returns: Embedding data as a dictionary.
3. rank_async / rank
Rank a list of text candidates by relevance to a query using the Lightweight Embeddings API.
async def rank_async(
    query: str,
    candidates: List[str],
    model=None
) -> List[Dict[str, Any]]:
    ...
- Params:
  - query: The query string.
  - candidates: List of candidate texts.
  - model: Optional; specify the ranking model (defaults to snowflake-arctic-embed-l-v2.0).
- Returns: A ranked list of {"text": <candidate>, "probability": <float>, "cosine_similarity": <float>}.
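To make the two score fields concrete, the sketch below ranks toy vectors locally. It is not the Lightweight Embeddings API: how the service derives probability is not stated here, so the softmax over cosine similarities is an assumption (one common choice), and `cosine` and `rank_locally` are hypothetical helpers.

```python
import math
from typing import Any, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_locally(query_vec: Sequence[float],
                 candidates: Dict[str, Sequence[float]]) -> List[Dict[str, Any]]:
    """Rank candidate texts (mapped to their vectors) against a query vector."""
    if not candidates:
        return []
    sims = {text: cosine(query_vec, vec) for text, vec in candidates.items()}
    # Assumption: probabilities are a softmax over the similarity scores.
    peak = max(sims.values())  # subtract the max for numerical stability
    exps = {text: math.exp(s - peak) for text, s in sims.items()}
    total = sum(exps.values())
    rows = [
        {"text": t, "probability": exps[t] / total, "cosine_similarity": sims[t]}
        for t in sims
    ]
    return sorted(rows, key=lambda r: r["probability"], reverse=True)


toy = {"quantum bits": [0.9, 0.1], "pasta recipe": [0.1, 0.9]}
for row in rank_locally([1.0, 0.0], toy):
    print(row["text"], round(row["probability"], 3))
```

Under this assumption the probabilities sum to 1 across candidates, while cosine_similarity stays an absolute per-candidate score.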
4. search_documents_async / search_documents
Retrieve documents (files/URLs), optionally split them, and rank their chunks by relevance to a query.
async def search_documents_async(
    query: str,
    files=None,
    urls=None,
    openai_config=None,
    settings=None,
    concurrency=5,
    options=None,
    model=None,
    splitter=None
) -> List[Dict[str, Any]]:
    ...
- Params:
  - query: The query to rank against.
  - files, urls: As in retrieve_documents_async.
  - splitter: Optional; e.g., use Embs.markdown_splitter.
- Returns: A ranked list of chunks with {"filename": ..., "markdown": ..., "probability": ..., "cosine_similarity": ...}.
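One way to picture this method is as retrieval followed by ranking. The sketch below composes the two documented calls by hand and runs offline against a stand-in client. It is not necessarily how embs implements the method internally: `search_by_composition` and `StubClient` are hypothetical, and the stub's scores are made up.

```python
import asyncio
from typing import Any, Dict, List


async def search_by_composition(client: Any, query: str, **retrieve_kwargs) -> List[Dict[str, Any]]:
    """Sketch: retrieve (and optionally split) docs, then rank their text."""
    docs = await client.retrieve_documents_async(**retrieve_kwargs)
    if not docs:
        return []
    ranked = await client.rank_async(query, [d["markdown"] for d in docs])
    # Map each ranked text back to the filename it came from.
    # Assumption: chunk texts are unique; duplicates would share one filename.
    name_for = {d["markdown"]: d["filename"] for d in docs}
    return [
        {"filename": name_for[r["text"]], "markdown": r["text"],
         "probability": r["probability"], "cosine_similarity": r["cosine_similarity"]}
        for r in ranked
    ]


class StubClient:
    """Offline stand-in exposing the same two method names; scores are made up."""

    async def retrieve_documents_async(self, **kwargs):
        return [{"filename": "a.md", "markdown": "qubits and gates"},
                {"filename": "b.md", "markdown": "cooking pasta"}]

    async def rank_async(self, query, candidates, model=None):
        # Toy scoring: fraction of query words present in the candidate.
        words = query.lower().split()
        scored = [(sum(w in c.lower() for w in words) / len(words), c) for c in candidates]
        scored.sort(reverse=True)
        return [{"text": c, "probability": s, "cosine_similarity": s} for s, c in scored]


results = asyncio.run(
    search_by_composition(StubClient(), "qubits gates", urls=["https://example.com"])
)
print(results[0]["filename"])
```

Swapping StubClient for a real Embs instance should work only if the method names and return shapes match the reference above.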
Caching for Performance
Enable in-memory or disk-based caching to avoid redundant processing:
cache_conf = {
    "enabled": True,
    "type": "memory",  # or "disk"
    "prefix": "myapp",
    "dir": "cache_folder",  # only needed for disk caching
    "max_mem_items": 128,
    "max_ttl_seconds": 86400
}
client = Embs(cache_config=cache_conf)
- Memory Caching: Quick lookups using LRU with TTL expiration.
- Disk Caching: Stores JSON files to a specified directory, evicting older files after TTL expiration.
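For readers unfamiliar with the combination of LRU eviction and TTL expiration, here is a minimal standalone sketch of the idea. It is illustrative only and not the cache embs ships: `TTLLRUCache` and its injectable `clock` parameter are hypothetical, and the defaults simply mirror the max_mem_items and max_ttl_seconds values shown above.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class TTLLRUCache:
    """Illustrative LRU cache with per-entry TTL; not the embs implementation."""

    def __init__(self, max_items: int = 128, ttl_seconds: float = 86400,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_items = max_items
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._data: "OrderedDict[str, tuple]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at > self.ttl:
            del self._data[key]  # entry outlived its TTL
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key: str, value: Any) -> None:
        self._data[key] = (self.clock(), value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_items:
            self._data.popitem(last=False)  # evict the least recently used entry


cache = TTLLRUCache(max_items=2, ttl_seconds=60)
cache.set("doc:a", "chunk A")
cache.set("doc:b", "chunk B")
cache.get("doc:a")             # touching "doc:a" makes "doc:b" the oldest entry
cache.set("doc:c", "chunk C")  # exceeds max_items, so "doc:b" is evicted
print(cache.get("doc:b"))
```

A disk-backed variant would follow the same pattern, with files in place of dictionary entries and file timestamps in place of stored_at.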
Testing
embs is tested with pytest and pytest-asyncio. To check that retrieval, embeddings, ranking, caching, and splitting work as expected, run:
pytest --asyncio-mode=auto
License
Licensed under the MIT License. See LICENSE for details.
Contributions are welcome! Submit issues, ideas, or pull requests to help improve embs.