
Embs is a Python toolkit for retrieving documents (via Docsifer), generating embeddings (via Lightweight Embeddings API), and ranking texts with an optional caching system.


embs


embs is a Python toolkit that combines:

  • Document retrieval via Docsifer
  • Embeddings generation with Lightweight Embeddings API
  • Text ranking (reranking) based on query relevance
  • Optional caching (in-memory LRU or disk) for performance and scalability

It provides both asynchronous (asyncio) and synchronous methods. If you need to ingest documents (files, URLs), convert them to text/markdown, embed them, and then sort by relevance, embs simplifies these tasks in a single package.

Note: This library calls two external services: Docsifer (document conversion) and Lightweight Embeddings (text and image embeddings). Make sure you have valid endpoints for both, or deploy your own instances.

Why Use embs?

  • Unified Pipeline: Retrieve documents and convert them to text from PDFs, URLs, images, and other sources, then generate embeddings and rank the texts by relevance, all through a single API.
  • Async + Sync: Choose the style that fits your application. The library uses aiohttp internally but also offers synchronous wrappers via asyncio.run().
  • Caching: Supports in-memory LRU or disk-based caching with optional time-to-live (TTL) eviction to avoid repeated network calls and save resources.
  • Lightweight: No heavy dependencies besides aiohttp for async requests. Minimal overhead.
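
To illustrate the Async + Sync point, here is a minimal sketch of how a synchronous wrapper around a coroutine can be built with asyncio.run(). The names fetch_length_async and fetch_length are hypothetical stand-ins and are not part of the embs API; the sketch only shows the general pattern.

```python
import asyncio


async def fetch_length_async(text: str) -> int:
    """Hypothetical coroutine standing in for a network-bound call."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return len(text)


def fetch_length(text: str) -> int:
    """Synchronous wrapper: start an event loop, run the coroutine, return its result."""
    return asyncio.run(fetch_length_async(text))


length = fetch_length("Hello world")
print(length)
```

Note that asyncio.run() raises RuntimeError when called from a thread that already has a running event loop, so wrappers built this way are best called from plain synchronous code; inside a coroutine, await the async variant instead.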

Installation

Install embs via pip:

pip install embs

Or add it to your pyproject.toml dependencies (if using Poetry):

[tool.poetry.dependencies]
embs = "^0.1.0"

Quick Start

1. Basic Document Retrieval

import asyncio
from embs import Embs

async def main():
    client = Embs()
    documents = await client.retrieve_documents_async(
        files=["/path/to/local/file.pdf"],
        urls=["https://example.com"]
    )
    print(documents)
    # => [{"filename": "file.pdf", "markdown": "...converted text..."}, {"filename": "example.com", "markdown": "..."}]

asyncio.run(main())

2. Generate Embeddings

import asyncio
from embs import Embs

async def main():
    client = Embs()
    embedding_result = await client.embed_async("Hello world")
    print(embedding_result)
    # => {"object": "list", "data": [{"object": "embedding", "index": 0, "embedding": [...] }], "model": "...", "usage": {...}}

asyncio.run(main())
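
The response shown in the comment above nests each vector under data[i]["embedding"]. As a small sketch, the helper below pulls the vectors out in index order. extract_vectors is a hypothetical helper, not part of embs, and the sample dictionary is hand-written with made-up values in the documented shape.

```python
from typing import Any


def extract_vectors(response: dict[str, Any]) -> list[list[float]]:
    """Return the embedding vectors, ordered by each item's 'index' field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


# Hand-written sample in the documented shape; the numbers are made up.
sample = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.3, 0.4]},
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2]},
    ],
    "model": "example-model",
    "usage": {},
}

vectors = extract_vectors(sample)
print(vectors)
```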

3. Rank Documents

import asyncio
from embs import Embs

async def main():
    client = Embs()
    ranked = await client.rank_async("What is AI?", ["AI is about learning", "AI stands for artificial intelligence"])
    print(ranked)
    # => [{"text": "...", "probability": 0.9, "cosine_similarity": 0.85}, ...]

asyncio.run(main())
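
Each ranked item carries a cosine_similarity and a probability. The sketch below shows one common way such scores can be produced: cosine similarity between the query vector and each candidate vector, followed by a softmax over the similarities. This is an assumption for illustration only; the source does not state how embs or the Lightweight Embeddings API computes probability, and rank_by_similarity is a hypothetical name.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into values that sum to 1 (assumed normalization)."""
    if not scores:
        return []
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_by_similarity(query_vec, candidates):
    """candidates is a list of (text, vector) pairs; returns rows sorted best first."""
    sims = [cosine_similarity(query_vec, vec) for _, vec in candidates]
    probs = softmax(sims)
    rows = [
        {"text": text, "probability": p, "cosine_similarity": s}
        for (text, _), p, s in zip(candidates, probs, sims)
    ]
    return sorted(rows, key=lambda row: row["probability"], reverse=True)


# Toy two-dimensional vectors, made up for the example.
ranked_demo = rank_by_similarity(
    [1.0, 0.0],
    [("unrelated", [0.0, 1.0]), ("aligned", [1.0, 0.0])],
)
print(ranked_demo[0]["text"])
```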

4. Integrated Workflow: search_documents_async

import asyncio
from embs import Embs

async def main():
    client = Embs()
    results = await client.search_documents_async(
        query="Explain quantum computing",
        files=["/path/to/local/quantum.pdf"],
        urls=["https://example.com/quantum.html"]
    )
    for item in results:
        print(item["filename"], item["probability"], item["markdown"][:100])  # partial content

asyncio.run(main())
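
Since each result exposes filename, probability, and markdown, post-processing is plain list handling. The sketch below keeps only results above a probability threshold. top_matches and its parameters are hypothetical, not part of embs, and the sample results are made up.

```python
def top_matches(results, min_probability=0.5, limit=3):
    """Keep results at or above min_probability, best first, at most `limit` items."""
    kept = [r for r in results if r["probability"] >= min_probability]
    kept.sort(key=lambda r: r["probability"], reverse=True)
    return kept[:limit]


# Made-up results in the shape used by the loop above.
sample_results = [
    {"filename": "quantum.html", "probability": 0.3, "markdown": "Site navigation"},
    {"filename": "quantum.pdf", "probability": 0.7, "markdown": "Qubits can be in superposition"},
]

best = top_matches(sample_results)
for item in best:
    print(item["filename"], item["probability"])
```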

Using the Cache

Enable caching by specifying a cache_config:

from embs import Embs

cache_conf = {
    "enabled": True,
    "type": "memory",        # "memory" or "disk"
    "prefix": "myapp",       # optional prefix for cache keys
    "dir": "cache_folder",   # only needed if type="disk"
    "max_mem_items": 128,    # max items for LRU in memory
    "max_ttl_seconds": 86400 # 1-day TTL
}

client = Embs(cache_config=cache_conf)

  • Memory Caching: Uses an LRU policy; once the cache holds more than max_mem_items entries, the least recently used ones are evicted.
  • Disk Caching: Stores each entry as a .json file with a timestamp. Entries older than max_ttl_seconds are discarded the next time they are read.
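
The two caching strategies described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the embs implementation: MemoryLRU, disk_set, and disk_get are hypothetical names, and the JSON layout (a timestamp plus a value) is assumed.

```python
import json
import tempfile
import time
from collections import OrderedDict
from pathlib import Path


class MemoryLRU:
    """In-memory cache that evicts the least recently used entry when full."""

    def __init__(self, max_items: int = 128):
        self.max_items = max_items
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_items:
            self._data.popitem(last=False)  # drop the least recently used entry


def disk_set(cache_dir: Path, key: str, value) -> None:
    """Write the value to <key>.json together with the current timestamp."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    payload = {"timestamp": time.time(), "value": value}
    (cache_dir / f"{key}.json").write_text(json.dumps(payload), encoding="utf-8")


def disk_get(cache_dir: Path, key: str, max_ttl_seconds: float):
    """Return the cached value, or None if it is missing or older than the TTL."""
    path = cache_dir / f"{key}.json"
    if not path.exists():
        return None
    payload = json.loads(path.read_text(encoding="utf-8"))
    if time.time() - payload["timestamp"] > max_ttl_seconds:
        path.unlink()  # expired entries are discarded when they are read
        return None
    return payload["value"]


cache = MemoryLRU(max_items=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # "a" becomes the most recently used entry
cache.set("c", 3)  # the cache is over capacity, so "b" is evicted

with tempfile.TemporaryDirectory() as tmp:
    disk_set(Path(tmp), "greeting", {"text": "hello"})
    fresh = disk_get(Path(tmp), "greeting", max_ttl_seconds=60)
    expired = disk_get(Path(tmp), "greeting", max_ttl_seconds=-1)
```

In this sketch the key is used directly as a file name, so it must be filesystem-safe; a real cache would typically hash the key first.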

Testing

embs includes test suites that rely on pytest and pytest-asyncio to verify:

  • Document retrieval with Docsifer
  • Embeddings calls with Lightweight Embeddings
  • Ranking results
  • Caching behaviors (in-memory and on disk)

Run tests with:

pytest --asyncio-mode=auto

License

This project is licensed under the MIT License. See LICENSE for details.


