Embs is a lightweight Python toolkit for document retrieval, embedding generation, and ranking—ideal for RAG-based AI, chatbots, and search systems with caching support.

These details have not been verified by PyPI

Project description

embs

embs is a powerful Python library for document retrieval, embedding, and ranking, making it easier to build Retrieval-Augmented Generation (RAG) systems, chatbots, and semantic search engines.

Why Choose embs?

Web & Local Document Search:
- DuckDuckGo-powered web search retrieves and ranks relevant documents.
- Supports PDFs, Word, HTML, Markdown, and more.
Optimized for RAG & Chatbots:
- Automatic document chunking (Splitter) for improved retrieval accuracy.
- Rank documents by relevance to a query.
Fast & Efficient:
- Cache support (in-memory & disk) for faster queries.
- Flexible batch embedding with cache optimization.
Scalable & Customizable:
- Works with synchronous & asynchronous processing.
- Supports custom splitting rules.

🚀 Installation

Install via pip:

pip install embs

For Poetry users:

[tool.poetry.dependencies]
embs = "^0.1.7"

📖 Quick Start Guide

1️⃣ Searching Documents via DuckDuckGo (Recommended!)

Retrieve relevant web pages, convert them to Markdown, and rank them using embeddings.

🚀 Always use a splitter!
Splitting documents into chunks before ranking improves ranking quality, reduces redundancy, and makes retrieval more precise.

import asyncio
from functools import partial
from embs import Embs

# Configure a Markdown-based splitter
split_config = {
    "headers_to_split_on": [("#", "h1"), ("##", "h2"), ("###", "h3")],
    "return_each_line": False,
    "strip_headers": True,
}
md_splitter = partial(Embs.markdown_splitter, config=split_config)

client = Embs()

async def run_search():
    results = await client.search_documents_async(
        query="Latest AI research",
        limit=5,
        blocklist=["youtube.com"],  # Exclude unwanted domains
        splitter=md_splitter,  # Enable smart chunking
        options={"embeddings": True}
    )
    for item in results:
        print(f"File: {item['filename']} | Score: {item['probability']:.4f}")
        print(f"Snippet: {item['markdown'][:80]}...\n")

asyncio.run(run_search())

For synchronous usage:

results = client.search_documents(
    query="Latest AI research",
    limit=5,
    blocklist=["youtube.com"],
    splitter=md_splitter,  # Always use a splitter
    options={"embeddings": True}
)
for item in results:
    print(f"File: {item['filename']} | Score: {item['probability']:.4f}")
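
The `blocklist` argument excludes results by domain. How embs matches domains internally is not documented here, so the following is only a sketch of one plausible approach using the standard library. The `is_blocked` helper is hypothetical and not part of the embs API.

```python
from urllib.parse import urlparse


def is_blocked(url: str, blocklist: list[str]) -> bool:
    """Return True if the URL's host equals, or is a subdomain of, a blocked domain."""
    host = (urlparse(url).hostname or "").lower()
    domains = [d.lower() for d in blocklist]
    return any(host == d or host.endswith("." + d) for d in domains)


urls = [
    "https://www.youtube.com/watch?v=abc",
    "https://arxiv.org/abs/1234.5678",
    "https://notyoutube.com/page",
]
# Only exact domains and their subdomains are dropped; look-alike hosts are kept.
kept = [u for u in urls if not is_blocked(u, ["youtube.com"])]
print(kept)
```

Matching on the parsed hostname rather than on a substring of the URL avoids dropping pages that merely mention a blocked domain in their path or query string.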

2️⃣ Querying Local & Online Documents with Ranking

Retrieve and rank documents from local files or URLs.

async def run_query():
    docs = await client.query_documents_async(
        query="Explain quantum computing",
        files=["/path/to/quantum_theory.pdf"],
        urls=["https://example.com/quantum.html"],
        splitter=md_splitter,  # Chunking for better retrieval
        options={"embeddings": True}
    )
    for d in docs:
        print(f"{d['filename']} => Score: {d['probability']:.4f}")
        print(f"Snippet: {d['markdown'][:80]}...\n")

asyncio.run(run_query())

For synchronous usage:

docs = client.query_documents(
    query="Explain quantum computing",
    files=["/path/to/quantum_theory.pdf"],
    splitter=md_splitter,
    options={"embeddings": True}
)
for d in docs:
    print(d["filename"], "=> Score:", d["probability"])

⚡ Caching for Performance

Enable in-memory or disk caching to speed up repeated queries.

cache_conf = {
    "enabled": True,
    "type": "memory",       # or "disk"
    "prefix": "myapp",
    "dir": "cache_folder",  # Required for disk caching
    "max_mem_items": 128,
    "max_ttl_seconds": 86400
}

client = Embs(cache_config=cache_conf)
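
To make the `max_mem_items` and `max_ttl_seconds` settings concrete, here is a minimal sketch of an in-memory cache with a size cap and per-entry expiry. It is an illustration only: `TTLCache` is a hypothetical class, and the least-recently-used eviction policy is an assumption, since the source does not say how embs evicts entries.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Tiny LRU cache whose entries expire after ttl_seconds (illustrative only)."""

    def __init__(self, max_items=128, ttl_seconds=86400, clock=time.monotonic):
        self.max_items = max_items
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._data[key]  # entry is stale
            return None
        self._data.move_to_end(key)  # mark as recently used
        return value

    def set(self, key, value):
        self._data[key] = (self.clock() + self.ttl_seconds, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_items:
            self._data.popitem(last=False)  # evict the least recently used entry


cache = TTLCache(max_items=2, ttl_seconds=60)
cache.set("myapp:a", [0.1, 0.2])
cache.set("myapp:b", [0.3, 0.4])
cache.set("myapp:c", [0.5, 0.6])  # exceeds max_items, so "myapp:a" is evicted
print(cache.get("myapp:a"), cache.get("myapp:c"))
```

The `clock` parameter exists only so that expiry can be tested without waiting for real time to pass.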

🔍 Key Features & API Methods

🔹 `search_documents_async()`

Search for documents via DuckDuckGo, then retrieve and rank them.

await client.search_documents_async(
    query="Recent AI breakthroughs",
    limit=5,
    blocklist=["example.com"],
    splitter=md_splitter
)

query: Search term.
limit: Number of DuckDuckGo results.
blocklist: Exclude unwanted domains.
splitter: Smart chunking for better ranking.

🔹 `query_documents_async()`

Retrieve, split, and rank local/online documents.

await client.query_documents_async(
    query="Climate change effects",
    files=["/path/to/report.pdf"],
    urls=["https://example.com"],
    splitter=md_splitter,
    options={"embeddings": True}
)

query: Search query.
files: List of file paths.
urls: List of webpage URLs.
splitter: Function to split document chunks.
options: Set {"embeddings": True} to include embeddings.

🔹 `embed_async()`

Generate embeddings for texts.
By default, it processes one item at a time for better cache efficiency.

embeddings = await client.embed_async(
    ["This is a test sentence.", "Another sentence."],
    optimized=True  # Process one at a time for better caching
)

text_or_texts: Single string or list of texts.
optimized: True processes items one by one (better cache reuse); False processes them in batches of 4 (faster, but higher API load).
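
The trade-off behind `optimized` can be seen by counting requests. The sketch below groups texts the way the two modes are described; `make_batches` is a hypothetical helper, not an embs function, and how embs actually issues requests is not shown in the source.

```python
def make_batches(texts, optimized=True, batch_size=4):
    """Group texts into request-sized batches: one per batch, or up to batch_size."""
    size = 1 if optimized else batch_size
    return [texts[i:i + size] for i in range(0, len(texts), size)]


texts = [f"sentence {n}" for n in range(6)]
print(len(make_batches(texts, optimized=True)))   # 6 requests, one text each
print(len(make_batches(texts, optimized=False)))  # 2 requests, up to 4 texts each
```

With one text per request, a cached text can be skipped on its own, which is consistent with the description of one-by-one processing as the cache-friendly mode.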

🔹 `rank_async()`

Rank candidate texts by similarity to a query.

ranked_results = await client.rank_async(
    query="Machine learning",
    candidates=["Deep learning is a subset of ML", "Quantum computing is unrelated"]
)

query: Search query.
candidates: List of text snippets to rank.

Returns a sorted list of items with:

"probability" (higher = more relevant)
"cosine_similarity"

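As a rough illustration of the two returned fields, the sketch below computes cosine similarity between a query vector and candidate vectors, then converts the similarities into probabilities with a softmax. The softmax step is an assumption; the source does not state how embs derives `probability`. The vectors are made-up toy values, and `rank` is a hypothetical helper rather than the library's method.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec, candidates):
    """candidates: list of (text, vector) pairs. Returns dicts sorted by probability, highest first."""
    sims = [cosine_similarity(query_vec, vec) for _, vec in candidates]
    exps = [math.exp(s) for s in sims]
    total = sum(exps)
    items = [
        {"text": text, "cosine_similarity": s, "probability": e / total}
        for (text, _), s, e in zip(candidates, sims, exps)
    ]
    return sorted(items, key=lambda item: item["probability"], reverse=True)


ranked = rank(
    [1.0, 0.0],
    [("unrelated", [0.0, 1.0]), ("close match", [0.9, 0.1])],
)
for item in ranked:
    print(item["text"], round(item["probability"], 3))
```
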
🔬 Testing

Tests use pytest with the pytest-asyncio plugin. Run them with:

pytest --asyncio-mode=auto
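
With `--asyncio-mode=auto`, pytest-asyncio collects plain `async def` test functions without an explicit marker. A minimal sketch of such a test follows; `fake_rank` is a hypothetical stand-in so the example runs without network access, and it is not part of embs.

```python
import asyncio


async def fake_rank(query, candidates):
    """Hypothetical async ranker: scores candidates by words shared with the query."""
    await asyncio.sleep(0)  # yield to the event loop, as a real network call would
    query_words = set(query.lower().split())
    scored = [
        {"text": c, "score": len(query_words & set(c.lower().split()))}
        for c in candidates
    ]
    return sorted(scored, key=lambda item: item["score"], reverse=True)


async def test_results_are_sorted_by_score():
    results = await fake_rank(
        "machine learning",
        ["quantum computing basics", "deep learning is machine learning"],
    )
    scores = [r["score"] for r in results]
    assert scores == sorted(scores, reverse=True)
    assert results[0]["text"] == "deep learning is machine learning"
```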

📝 Best Practices: Always Use a Splitter!

Why use a splitter?

Improves retrieval by processing smaller chunks of text.
Reduces token usage when embedding and ranking.
Speeds up RAG and chatbot applications.

✅ How to Use the Built-in Markdown Splitter

from functools import partial
from embs import Embs

split_config = {
    "headers_to_split_on": [("#", "h1"), ("##", "h2"), ("###", "h3")],
    "return_each_line": False,
    "strip_headers": True,
}

md_splitter = partial(Embs.markdown_splitter, config=split_config)

# Use it when querying documents
client = Embs()
docs = client.query_documents(
    query="Machine Learning Basics",
    files=["/path/to/ml_guide.pdf"],
    splitter=md_splitter
)
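
For intuition about what header-based splitting does, here is a minimal standalone sketch. It is not the `Embs.markdown_splitter` implementation, whose internals are not shown in the source; `split_by_headers` is a hypothetical function that only mirrors the `headers_to_split_on` and `strip_headers` options.

```python
def split_by_headers(text, headers_to_split_on, strip_headers=True):
    """Start a new chunk at every line that begins with one of the header markers."""
    # A marker must be followed by a space, so "##" lines never match the "#" marker.
    prefixes = tuple(marker + " " for marker, _ in headers_to_split_on)
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith(prefixes):
            if current:
                chunks.append("\n".join(current).strip())
            # Either drop the header line or keep it as the first line of the new chunk.
            current = [] if strip_headers else [line]
        else:
            current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


doc = (
    "# Intro\n"
    "Embeddings map text to vectors.\n"
    "## Ranking\n"
    "Candidates are scored against a query.\n"
)
headers = [("#", "h1"), ("##", "h2"), ("###", "h3")]
chunks = split_by_headers(doc, headers)
print(chunks)
```

Each chunk then covers a single section, so a ranking score refers to one topic rather than to a whole document.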

📜 License

Licensed under the MIT License. See LICENSE for details.

🤝 Contributing

Pull requests, issues, and discussions are welcome!

Release history

- 0.1.8 (Feb 2, 2025)
- 0.1.7 (Feb 2, 2025), this version
- 0.1.6 (Feb 2, 2025)
- 0.1.5 (Feb 2, 2025)
- 0.1.4 (Feb 2, 2025)
- 0.1.3 (Feb 2, 2025)
- 0.1.2 (Jan 27, 2025)
- 0.1.1 (Jan 27, 2025)

Download files

Source Distribution

embs-0.1.7.tar.gz (15.4 kB view details)

Uploaded Feb 2, 2025 Source

Built Distribution

embs-0.1.7-py3-none-any.whl (13.5 kB view details)

Uploaded Feb 2, 2025 Python 3

File details

Details for the file embs-0.1.7.tar.gz.

File metadata

Download URL: embs-0.1.7.tar.gz
Upload date: Feb 2, 2025
Size: 15.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for embs-0.1.7.tar.gz
Algorithm	Hash digest
SHA256	`cbe39f242c304787d0d4dddfdb86aad30b7f4c785a96c8996970d4077e12acce`
MD5	`b97860c7bdbf5eef142e313dc808e5a9`
BLAKE2b-256	`1efcf08bbd2fac3d26492a00e1fbbc4ebe83c5e4bd836db60421b95a9229b762`

Provenance

The following attestation bundles were made for embs-0.1.7.tar.gz:

Publisher: release.yml on lh0x00/embs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: embs-0.1.7.tar.gz
- Subject digest: cbe39f242c304787d0d4dddfdb86aad30b7f4c785a96c8996970d4077e12acce
- Sigstore transparency entry: 167975294
- Sigstore integration time: Feb 2, 2025
Source repository:
- Permalink: lh0x00/embs@2c15d70dd90e4061f373ed0bf54c160c0ecd8079
- Branch / Tag: refs/tags/v0.1.7
- Owner: https://github.com/lh0x00
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@2c15d70dd90e4061f373ed0bf54c160c0ecd8079
- Trigger Event: push

File details

Details for the file embs-0.1.7-py3-none-any.whl.

File metadata

Download URL: embs-0.1.7-py3-none-any.whl
Upload date: Feb 2, 2025
Size: 13.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for embs-0.1.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b4af12a1dc1c2cc96d93f3cc8a9529c5f1e4758382817838654217c509b6bc1c`
MD5	`7cc9977b6b3124081d2b4a5189944f61`
BLAKE2b-256	`b89ff1a98e01dd591b73f367c96baec81222b2f1dcc4d654a1fd397cf70bdb08`

Provenance

The following attestation bundles were made for embs-0.1.7-py3-none-any.whl:

Publisher: release.yml on lh0x00/embs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: embs-0.1.7-py3-none-any.whl
- Subject digest: b4af12a1dc1c2cc96d93f3cc8a9529c5f1e4758382817838654217c509b6bc1c
- Sigstore transparency entry: 167975295
- Sigstore integration time: Feb 2, 2025
Source repository:
- Permalink: lh0x00/embs@2c15d70dd90e4061f373ed0bf54c160c0ecd8079
- Branch / Tag: refs/tags/v0.1.7
- Owner: https://github.com/lh0x00
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@2c15d70dd90e4061f373ed0bf54c160c0ecd8079
- Trigger Event: push

embs 0.1.7
