Embs is a Python toolkit for retrieving documents (via Docsifer), generating embeddings (via Lightweight Embeddings API), and ranking texts with an optional caching system.

embs

embs is your one-stop toolkit for handling document ingestion, embedding, and ranking workflows. Whether you're building a retrieval-augmented generation (RAG) system, a chatbot, or a semantic search engine, embs makes it easy to integrate document retrieval, embeddings, and ranking with minimal setup.

Why Choose embs?

- Free External APIs:
  - Docsifer for document conversion (PDFs, URLs, images, etc.).
  - Lightweight Embeddings API for generating embeddings, with multi-language support, provided free of charge.
  - These APIs support multilingual embedding models such as sentence-transformers and OpenAI-compatible embeddings, so you can get good results with minimal configuration.
- Perfect for RAG Systems: Automatically convert and split documents into meaningful chunks, generate embeddings, and rank them for retrieval-augmented workflows that feed generative models such as OpenAI GPT.
- Integrates with Chatbots: Preprocess, split, and embed your knowledge base to build conversational systems that respond with accurate and contextually relevant answers.
- Flexible Splitting: Use built-in or custom chunking strategies to split large documents into smaller, retrievable sections. This improves relevance in retrieval-based workflows.
- Unified Pipeline: Streamline everything from document ingestion to semantic ranking with a single API.
- Lightweight & Extensible: No heavy dependencies beyond aiohttp. Easily fits into your existing infrastructure.

Installation

pip install embs

Or in pyproject.toml (Poetry):

[tool.poetry.dependencies]
embs = "^0.1.0"

Key Use Cases

Retrieval-Augmented Generation (RAG)

In RAG workflows, retrieved knowledge informs a generative model like GPT to produce accurate and relevant answers. embs simplifies this by:

Converting raw documents (PDFs, URLs) to clean text or markdown.
Splitting documents into retrievable chunks (e.g., by headers or lines).
Embedding chunks with powerful multilingual models.
Ranking the chunks for relevance to the query.

With caching enabled, repeated requests avoid redundant processing and return faster, which helps when the same sources or queries recur in production.

Code Practices

Below is an end-to-end example that retrieves documents, applies the built-in Markdown splitter, generates embeddings, and ranks the resulting chunks by query relevance. It shows how embs fits into a chatbot or RAG pipeline.

Example: Retrieve, Split, and Rank

import asyncio
from functools import partial
from embs import Embs

async def main():
    # Markdown-based splitter configuration
    split_config = {
        "headers_to_split_on": [("#", "h1"), ("##", "h2"), ("###", "h3")],
        "return_each_line": False,  # Keep chunks as sections, not individual lines
        "strip_headers": True       # Remove header text from the chunks
    }
    md_splitter = partial(Embs.markdown_splitter, config=split_config)

    # Initialize the Embs client
    client = Embs()

    # Step 1: Retrieve documents and split them by Markdown headers
    raw_docs = await client.retrieve_documents_async(
        files=["/path/to/sample.pdf"],
        urls=["https://example.com"],
        splitter=md_splitter  # Apply built-in markdown splitter
    )
    print(f"Total chunks after splitting: {len(raw_docs)}")

    # Step 2: Retrieve further sources, split them, and rank the chunks against a query
    results = await client.search_documents_async(
        query="Explain quantum computing",
        files=["/path/to/quantum_theory.pdf"],  # Additional files to retrieve and rank
        urls=["https://example.com/quantum.html"],
        splitter=md_splitter  # Apply splitter for additional sources
    )

    # Step 3: Output the top-ranked results
    for item in results[:3]:
        print(f"File: {item['filename']} | Score: {item['probability']:.4f}")
        print(f"Snippet: {item['markdown'][:80]}...\n")

asyncio.run(main())

Why Is This Perfect for Chatbots?

Context-Aware Answers: By splitting large documents into manageable chunks and ranking them for relevance, your chatbot can answer from the snippets that best match the user's question.
Multilingual Embeddings: The Lightweight Embeddings API supports embeddings for multiple languages, so your chatbot can handle diverse user inputs and knowledge bases.
Caching for Scalability: Repeated retrieval or ranking operations are served from an in-memory or disk-based cache, which keeps response latency low.
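
To make the "context-aware answers" point concrete, here is a minimal sketch of turning ranked chunks into a prompt for a generative model. The helper name build_prompt, the prompt wording, and the sample data are illustrative assumptions and not part of embs; the only thing taken from embs is the documented result shape with "filename", "markdown", and "probability" keys.

```python
from typing import Any, Dict, List


def build_prompt(query: str, ranked_chunks: List[Dict[str, Any]], top_k: int = 3) -> str:
    """Assemble a prompt from the highest-ranked chunks (hypothetical helper)."""
    # Results are assumed to be sorted best-first, as in the example above.
    selected = ranked_chunks[:top_k]
    context = "\n\n".join(
        f"[{i}] ({chunk['filename']}) {chunk['markdown'].strip()}"
        for i, chunk in enumerate(selected, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Stand-in data shaped like search_documents results; the values are made up.
sample_results = [
    {"filename": "a.pdf", "markdown": "Qubits can be in superposition.", "probability": 0.7},
    {"filename": "b.html", "markdown": "Classical bits are 0 or 1.", "probability": 0.2},
    {"filename": "c.pdf", "markdown": "Unrelated text.", "probability": 0.1},
]
prompt = build_prompt("Explain quantum computing", sample_results, top_k=2)
print(prompt)
```

Numbering the snippets and keeping the filename next to each one makes it easier for the chatbot to cite where an answer came from.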

API Reference

Below are the primary methods in embs. All async methods have a synchronous equivalent.

1. `retrieve_documents_async` / `retrieve_documents`

Convert files and/or URLs into Markdown using Docsifer. Optionally, apply a splitter to break down large documents into chunks.

async def retrieve_documents_async(
    files=None,
    urls=None,
    openai_config=None,
    settings=None,
    concurrency=5,
    options=None,
    splitter=None
) -> List[Dict[str, str]]:
    ...

Params:
- files: List of file paths or file-like objects.
- urls: List of URLs for Docsifer to process.
- splitter: A callable that receives and returns a list of docs, e.g., Embs.markdown_splitter.
Returns: A list of documents ({"filename": <str>, "markdown": <str>}).
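
Because a splitter is described as a callable that receives and returns a list of docs, a custom one can be sketched in a few lines. The example below splits on blank lines; the name paragraph_splitter and the min_chars parameter are hypothetical, and it assumes the callable is invoked synchronously with dicts of the documented {"filename", "markdown"} shape.

```python
import re
from typing import Dict, List


def paragraph_splitter(docs: List[Dict[str, str]], min_chars: int = 1) -> List[Dict[str, str]]:
    """Split each document's markdown on blank lines (hypothetical custom splitter)."""
    chunks: List[Dict[str, str]] = []
    for doc in docs:
        # One or more blank lines separate paragraphs.
        for part in re.split(r"\n\s*\n", doc["markdown"]):
            text = part.strip()
            if len(text) >= min_chars:
                # Keep the source filename so ranked chunks stay traceable.
                chunks.append({"filename": doc["filename"], "markdown": text})
    return chunks


docs = [{"filename": "notes.md", "markdown": "First paragraph.\n\nSecond paragraph.\n\n\n"}]
chunks = paragraph_splitter(docs)
print(chunks)
```

Under those assumptions it would be passed the same way as the built-in splitter, for example splitter=paragraph_splitter.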

2. `embed_async` / `embed`

Generate embeddings for text or a list of texts using the Lightweight Embeddings API.

async def embed_async(
    text_or_texts: Union[str, List[str]],
    model=None
) -> Dict[str, Any]:
    ...

Params:
- text_or_texts: Single string or list of strings to embed.
- model: Optional; specify the embedding model (defaults to snowflake-arctic-embed-l-v2.0).
Returns: Embedding data as a dictionary.

3. `rank_async` / `rank`

Rank a list of text candidates by relevance to a query using the Lightweight Embeddings API.

async def rank_async(
    query: str,
    candidates: List[str],
    model=None
) -> List[Dict[str, Any]]:
    ...

Params:
- query: The query string.
- candidates: List of candidate texts.
- model: Optional; specify the ranking model (defaults to snowflake-arctic-embed-l-v2.0).
Returns: A ranked list of {"text": <candidate>, "probability": <float>, "cosine_similarity": <float>}.
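
For readers unfamiliar with the cosine_similarity field, the sketch below shows the standard cosine measure between two vectors in plain Python. It is an illustration of the general formula, not embs code; how the API derives "probability" from these scores is not documented here and is not modelled. The function name and sample vectors are made up.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


vec_a = [1.0, 0.0, 1.0]
vec_b = [1.0, 0.0, 0.0]
similarity = cosine_similarity(vec_a, vec_b)
print(round(similarity, 4))  # 0.7071
```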

4. `search_documents_async` / `search_documents`

Retrieve documents (files/URLs), optionally split them, and rank their chunks by relevance to a query.

async def search_documents_async(
    query: str,
    files=None,
    urls=None,
    openai_config=None,
    settings=None,
    concurrency=5,
    options=None,
    model=None,
    splitter=None
) -> List[Dict[str, Any]]:
    ...

Params:
- query: The query to rank against.
- files, urls: As in retrieve_documents_async.
- splitter: Optional; e.g., use Embs.markdown_splitter.
Returns: A ranked list of chunks with {"filename": ..., "markdown": ..., "probability": ..., "cosine_similarity": ...}.
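
When several chunks come from the same source, it can be handy to keep only the strongest chunk per file before showing results. The sketch below does that on data shaped like the documented return value; best_chunk_per_file is a hypothetical helper and the sample scores are invented.

```python
from typing import Any, Dict, List


def best_chunk_per_file(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the highest-probability chunk for each filename, best-first."""
    best: Dict[str, Dict[str, Any]] = {}
    for item in results:
        current = best.get(item["filename"])
        if current is None or item["probability"] > current["probability"]:
            best[item["filename"]] = item
    return sorted(best.values(), key=lambda item: item["probability"], reverse=True)


# Stand-in results; real values would come from search_documents.
ranked = [
    {"filename": "a.pdf", "markdown": "chunk 1", "probability": 0.40, "cosine_similarity": 0.61},
    {"filename": "b.pdf", "markdown": "chunk 2", "probability": 0.35, "cosine_similarity": 0.58},
    {"filename": "a.pdf", "markdown": "chunk 3", "probability": 0.25, "cosine_similarity": 0.50},
]
top_per_file = best_chunk_per_file(ranked)
for item in top_per_file:
    print(item["filename"], item["markdown"], item["probability"])
```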

Caching for Performance

Enable in-memory or disk-based caching to avoid redundant processing:

cache_conf = {
    "enabled": True,
    "type": "memory",       # or "disk"
    "prefix": "myapp",
    "dir": "cache_folder",  # only needed for disk caching
    "max_mem_items": 128,
    "max_ttl_seconds": 86400
}

client = Embs(cache_config=cache_conf)

Memory Caching: Quick lookups using LRU with TTL expiration.
Disk Caching: Stores JSON files to a specified directory, evicting older files after TTL expiration.
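
To illustrate what "LRU with TTL expiration" means, here is a toy cache that combines both policies. It is a sketch of the general idea and not the embs implementation; the class name LruTtlCache, its methods, and the injectable clock are assumptions made for the example.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class LruTtlCache:
    """Toy in-memory cache: least-recently-used eviction plus per-entry TTL."""

    def __init__(self, max_items: int = 128, ttl_seconds: float = 86400,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_items = max_items
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._items: "OrderedDict[str, tuple]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self.ttl_seconds:
            del self._items[key]  # entry outlived its TTL
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key: str, value: Any) -> None:
        self._items[key] = (self._clock(), value)
        self._items.move_to_end(key)
        while len(self._items) > self.max_items:
            self._items.popitem(last=False)  # drop least recently used


now = [0.0]  # fake clock so the example does not need to sleep
cache = LruTtlCache(max_items=2, ttl_seconds=10, clock=lambda: now[0])
cache.set("a", 1)
cache.set("b", 2)
hit = cache.get("a")      # touching "a" makes "b" the eviction candidate
cache.set("c", 3)         # exceeds max_items, so "b" is evicted
evicted = cache.get("b")
now[0] = 11.0             # advance the fake clock past the TTL
expired = cache.get("a")
print(hit, evicted, expired)
```

The max_items and ttl_seconds arguments play the same role here as max_mem_items and max_ttl_seconds in the configuration above.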

Testing

embs is tested with pytest and pytest-asyncio. To check that retrieval, embedding, ranking, caching, and splitting work as expected, run:

pytest --asyncio-mode=auto

License

Licensed under the MIT License. See LICENSE for details.

Contributions are welcome! Submit issues, ideas, or pull requests to help improve embs.
