Embs is a lightweight Python toolkit for document retrieval, embedding generation, and ranking—ideal for RAG-based AI, chatbots, and search systems with caching support.

embs

embs is a powerful Python library for document retrieval, embedding, and ranking, making it easier to build Retrieval-Augmented Generation (RAG) systems, chatbots, and semantic search engines.

Why Choose embs?

  • Web & Local Document Search:

    • DuckDuckGo-powered web search retrieves and ranks relevant documents.
    • Supports PDFs, Word, HTML, Markdown, and more.
  • Optimized for RAG, Chatbots & Multilingual Search:

    • Automatic document chunking (Splitter) for improved retrieval accuracy.
    • Rank documents by relevance to a query.
    • Strong multilingual model support for global applications.
      ✅ Supported multilingual models:
      • snowflake-arctic-embed-l-v2.0
      • bge-m3
      • gte-multilingual-base
      • paraphrase-multilingual-MiniLM-L12-v2
      • paraphrase-multilingual-mpnet-base-v2
      • multilingual-e5-small
      • multilingual-e5-base
      • multilingual-e5-large
  • Fast & Efficient:

    • Cache support (in-memory & disk) for faster queries.
    • Flexible batch embedding with cache optimization.
  • Scalable & Customizable:

    • Works with synchronous & asynchronous processing.
    • Supports custom splitting rules.

🚀 Installation

Install via pip:

pip install embs

For Poetry users:

[tool.poetry.dependencies]
embs = "^0.1.8"

📖 Quick Start Guide

1️⃣ Searching Documents via DuckDuckGo (Recommended!)

Retrieve relevant web pages, convert them to Markdown, and rank them using embeddings.

🚀 Always use a splitter!
Splitting documents into chunks before ranking improves ranking quality and reduces redundancy in the results.

import asyncio
from functools import partial
from embs import Embs

# Configure a Markdown-based splitter
split_config = {
    "headers_to_split_on": [("#", "h1"), ("##", "h2"), ("###", "h3")],
    "return_each_line": True,
    "strip_headers": True,
    "split_on_double_newline": True,
}
md_splitter = partial(Embs.markdown_splitter, config=split_config)

client = Embs()

async def run_search():
    results = await client.search_documents_async(
        query="Latest AI research",
        limit=3,
        blocklist=["youtube.com"],  # Exclude unwanted domains
        splitter=md_splitter,  # Enable smart chunking
    )
    for item in results:
        print(f"File: {item['filename']} | Score: {item['similarity']:.4f}")
        print(f"Snippet: {item['markdown'][:80]}...\n")

asyncio.run(run_search())

For synchronous usage:

results = client.search_documents(
    query="Latest AI research",
    limit=3,
    blocklist=["youtube.com"],
    splitter=md_splitter,  # Always use a splitter
    model="snowflake-arctic-embed-l-v2.0",
)
for item in results:
    print(f"File: {item['filename']} | Score: {item['similarity']:.4f}")
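The blocklist option above excludes results from unwanted domains. How embs matches blocklist entries internally is not documented here, so the following is only a minimal sketch of one plausible rule (exact host or subdomain match). The helper names `is_blocked` and `filter_urls` are hypothetical and not part of embs.

```python
from urllib.parse import urlparse

def is_blocked(url, blocklist):
    """Return True if the URL's host equals a blocked domain or is a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocklist)

def filter_urls(urls, blocklist):
    """Keep only URLs whose host is not covered by the blocklist."""
    return [u for u in urls if not is_blocked(u, blocklist)]

urls = [
    "https://www.youtube.com/watch?v=abc",
    "https://arxiv.org/abs/1234.5678",
    "https://notyoutube.com/page",
]
kept = filter_urls(urls, ["youtube.com"])
print(kept)
```

Matching on the parsed host rather than on a substring of the URL avoids accidentally blocking unrelated domains such as notyoutube.com.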

2️⃣ Multilingual Document Querying (Local & Online)

Retrieve and rank multilingual documents from local files or URLs.

async def run_query():
    docs = await client.query_documents_async(
        query="Explique la mécanique quantique",  # French query
        files=["/path/to/quantum_theory.pdf"],
        urls=["https://example.com/quantum.html"],
        splitter=md_splitter,  # Chunking for better retrieval
    )
    for d in docs:
        print(f"{d['filename']} => Score: {d['similarity']:.4f}")
        print(f"Snippet: {d['markdown'][:80]}...\n")

asyncio.run(run_query())

For synchronous usage:

docs = client.query_documents(
    query="Explique la mécanique quantique",
    files=["/path/to/quantum_theory.pdf"],
    splitter=md_splitter,
)
for d in docs:
    print(d["filename"], "=> Score:", d["similarity"])
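In a RAG pipeline, the ranked results are usually trimmed and joined into a context string for the language model. The sketch below assumes each result is a dict with the `filename`, `similarity`, and `markdown` keys used in the examples above. The `build_context` helper, the `min_score` threshold, and the sample data are hypothetical.

```python
def build_context(results, min_score=0.5, max_chars=500):
    """Join the best-scoring chunks into one prompt context string."""
    # Drop weak matches, then order the rest from most to least similar.
    selected = [r for r in results if r["similarity"] >= min_score]
    selected.sort(key=lambda r: r["similarity"], reverse=True)
    # Prefix each chunk with its source file so answers can cite it.
    parts = [f"[{r['filename']}] {r['markdown'][:max_chars]}" for r in selected]
    return "\n\n".join(parts)

# Made-up results in the same shape as the examples above.
sample = [
    {"filename": "quantum.html", "similarity": 0.31, "markdown": "Cookie notice"},
    {"filename": "quantum_theory.pdf", "similarity": 0.82,
     "markdown": "La mécanique quantique décrit le comportement de la matière."},
]
context = build_context(sample)
print(context)
```

A sensible threshold depends on the embedding model, so treat 0.5 as a placeholder to tune.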

💡 Suited to multilingual retrieval: queries and documents can be in English, French, Spanish, German, or any other language supported by the chosen embedding model.

⚡ Caching for Performance

Enable in-memory or disk caching to speed up repeated queries.

cache_conf = {
    "enabled": True,
    "type": "memory",       # or "disk"
    "prefix": "myapp",
    "dir": "cache_folder",  # Required for disk caching
    "max_mem_items": 128,
    "max_ttl_seconds": 86400
}

client = Embs(cache_config=cache_conf)
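To illustrate what `max_mem_items` and `max_ttl_seconds` typically control, here is a minimal in-memory cache sketch with a size limit and a time-to-live. It is not the embs implementation: the `TTLCache` class and the least-recently-used eviction policy are assumptions made for illustration.

```python
import time
from collections import OrderedDict

class TTLCache:
    """Tiny in-memory cache with a size cap (LRU eviction) and per-entry expiry."""

    def __init__(self, max_items=128, ttl_seconds=86400, clock=time.monotonic):
        self._data = OrderedDict()
        self.max_items = max_items
        self.ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self._clock() - stored_at > self.ttl:
            del self._data[key]  # expired entry
            return None
        self._data.move_to_end(key)  # mark as recently used
        return value

    def set(self, key, value):
        self._data[key] = (value, self._clock())
        self._data.move_to_end(key)
        while len(self._data) > self.max_items:
            self._data.popitem(last=False)  # evict the least recently used entry

demo_cache = TTLCache(max_items=2, ttl_seconds=60)
demo_cache.set("myapp:a", [0.1])
demo_cache.set("myapp:b", [0.2])
demo_cache.get("myapp:a")         # touch "a" so "b" becomes the oldest entry
demo_cache.set("myapp:c", [0.3])  # exceeds max_items, so "b" is evicted
```

The key prefix (`myapp:` here) mirrors the `prefix` setting and keeps entries from different applications apart when they share a cache.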

🔍 Key Features & API Methods

🔹 search_documents_async()

Search for documents via DuckDuckGo, retrieve, and rank them.

await client.search_documents_async(
    query="Recent AI breakthroughs",
    limit=3,
    blocklist=["example.com"],
    splitter=md_splitter
)

🔹 query_documents_async()

Retrieve, split, and rank local/online documents.

await client.query_documents_async(
    query="Climate change effects",
    files=["/path/to/report.pdf"],
    urls=["https://example.com"],
    splitter=md_splitter,
)

🔹 embed_async()

Generate embeddings for texts with multilingual support.

embeddings = await client.embed_async(
    ["Este es un ejemplo de texto.", "Ceci est un exemple de phrase."],
    optimized=True  # Process one at a time for better caching
)
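The comment on `optimized=True` suggests that handling texts individually makes better use of the cache. A rough sketch of the underlying idea, assuming per-text cache keys: only texts that are not yet cached are sent to the embedding backend. The `cache_key` and `embed_with_cache` helpers and the fake embedder are hypothetical and do not reflect embs internals.

```python
import hashlib

def cache_key(prefix, model, text):
    """Build a stable per-text key from the prefix, model name, and text hash."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{prefix}:{model}:{digest}"

def embed_with_cache(texts, embed_fn, cache, prefix="myapp", model="demo-model"):
    """Embed texts, calling embed_fn only for texts that are not cached yet."""
    keys = [cache_key(prefix, model, t) for t in texts]
    # Collect cache misses, dropping duplicates while preserving order.
    missing = list(dict.fromkeys(t for t, k in zip(texts, keys) if k not in cache))
    if missing:
        for text, vector in zip(missing, embed_fn(missing)):
            cache[cache_key(prefix, model, text)] = vector
    return [cache[k] for k in keys]

calls = []

def fake_embed(batch):
    """Stand-in embedder: records each call and returns one-dimensional vectors."""
    calls.append(list(batch))
    return [[float(len(t))] for t in batch]

embedding_cache = {}
embed_with_cache(["hola", "bonjour"], fake_embed, embedding_cache)
vectors = embed_with_cache(["hola", "hello"], fake_embed, embedding_cache)
print(calls)
```

On the second call only "hello" reaches the embedder, because "hola" is served from the cache.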

🔹 rank_async()

Rank candidate texts by similarity to a query.

ranked_results = await client.rank_async(
    query="Machine learning",
    candidates=["Deep learning is a subset of ML", "Quantum computing is unrelated"]
)
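Ranking by similarity usually means comparing the query embedding with each candidate embedding. The sketch below uses cosine similarity on toy two-dimensional vectors. Whether embs uses cosine similarity, and the exact shape of the `rank_async` return value, are not stated in this document, so both are assumptions here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank(query_vec, candidates):
    """Sort (text, vector) pairs by similarity to the query, best first."""
    scored = [
        {"text": text, "similarity": cosine_similarity(query_vec, vec)}
        for text, vec in candidates
    ]
    return sorted(scored, key=lambda r: r["similarity"], reverse=True)

# Toy vectors chosen by hand; real embeddings have hundreds of dimensions.
toy_candidates = [
    ("Quantum computing is unrelated", [0.0, 1.0]),
    ("Deep learning is a subset of ML", [0.9, 0.1]),
]
ranked = rank([1.0, 0.0], toy_candidates)
for item in ranked:
    print(f"{item['similarity']:.4f}  {item['text']}")
```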

🔬 Testing

Tests use pytest with the pytest-asyncio plugin. Run them with:

pytest --asyncio-mode=auto
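With `--asyncio-mode=auto`, pytest-asyncio collects plain `async def` test functions without an explicit marker. A minimal sketch of such a test follows. `StubClient` is a hypothetical stand-in that scores by word overlap so the test needs no network access, and its return shape is an assumption, not the documented embs output.

```python
import asyncio

class StubClient:
    """Hypothetical stand-in for Embs: ranks candidates by word overlap with the query."""

    async def rank_async(self, query, candidates):
        await asyncio.sleep(0)  # yield control once, like a real async call would
        words = set(query.lower().split())
        scored = [
            {"text": c, "similarity": float(len(words & set(c.lower().split())))}
            for c in candidates
        ]
        return sorted(scored, key=lambda r: r["similarity"], reverse=True)

async def test_rank_orders_by_similarity():
    client = StubClient()
    ranked = await client.rank_async(
        "machine learning",
        ["Quantum computing", "Deep learning for machine translation"],
    )
    assert len(ranked) == 2
    assert ranked[0]["text"] == "Deep learning for machine translation"
```

Outside pytest, the same coroutine can be run directly with `asyncio.run(test_rank_orders_by_similarity())`.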

📝 Best Practices: Always Use a Splitter!

✅ How to Use the Built-in Markdown Splitter

from functools import partial
from embs import Embs

# Split on Markdown headers and blank lines, dropping the header text itself
split_config = {
    "headers_to_split_on": [("#", "h1"), ("##", "h2"), ("###", "h3")],
    "return_each_line": True,
    "strip_headers": True,
    "split_on_double_newline": True,
}

md_splitter = partial(Embs.markdown_splitter, config=split_config)

client = Embs()

docs = client.query_documents(
    query="Machine Learning Basics",
    files=["/path/to/ml_guide.pdf"],
    splitter=md_splitter
)
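To make the effect of chunking concrete, here is a simplified sketch of header-based and blank-line-based splitting. It is not `Embs.markdown_splitter`: the `split_markdown` function and its parameters are hypothetical, and it only approximates the `headers_to_split_on`, `strip_headers`, and `split_on_double_newline` options shown above.

```python
import re

def split_markdown(text, header_marks=("#", "##", "###"), strip_headers=True):
    """Split Markdown into chunks at the given header levels and at blank lines."""
    chunks, current = [], []

    def flush():
        body = "\n".join(current).strip()
        if body:
            chunks.append(body)
        current.clear()

    for line in text.splitlines():
        match = re.match(r"^(#+)\s+\S", line)
        if match and match.group(1) in header_marks:
            flush()  # a header starts a new chunk
            if not strip_headers:
                current.append(line)
        elif not line.strip():
            flush()  # a blank line (double newline) also ends the chunk
        else:
            current.append(line)
    flush()
    return chunks

doc = "# Intro\nML learns from data.\n\nIt needs examples.\n## Models\nTrees and networks."
chunks = split_markdown(doc)
print(chunks)
```

Each chunk is then embedded and ranked on its own, so a long document can surface its single most relevant passage rather than one averaged score.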

📜 License

Licensed under the MIT License. See LICENSE for details.

🤝 Contributing

Pull requests, issues, and discussions are welcome!

🚀 With enhanced multilingual support, embs is now even more powerful for global retrieval applications! 🌍
