
Lightweight, no-database document search engine using quantized numpy vectors with BM25 and definition-aware ranking.


DBless

Lightweight, no-database PDF search engine — pure Python, no servers, no setup.



💡 What is DBless?

DBless is a lightweight document search engine that works entirely in memory: no database, no server, and no dependencies beyond numpy and pymupdf. Point it at a PDF and start searching in seconds.

Why DBless?

  • 🚀 Zero setup — no database to install or configure
  • 🎯 Definition-aware — understands "What is X?" style queries
  • 🪶 Lightweight — only requires numpy and pymupdf
  • 📄 Multi-domain — works on legal, medical, corporate, and technical PDFs
  • Fast — sub-100ms search on CPU
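
As a rough illustration of what "definition-aware" could mean in practice, the sketch below detects "What is X?" style queries and boosts chunks that contain a defining phrase for X. The regular expressions, the helper names (`definition_term`, `definition_boost`), and the boost value are all assumptions made for illustration; they are not taken from the DBless source.

```python
import re

# Hypothetical pattern for definition-style questions (an assumption, not DBless code).
DEFINITION_QUERY = re.compile(
    r"^\s*(?:what\s+(?:is|are)|define|definition\s+of|meaning\s+of)\s+"
    r"(?:an?\s+|the\s+)?(?P<term>.+?)\s*\??\s*$",
    re.IGNORECASE,
)


def definition_term(query):
    """Return the term being asked about, or None if the query is not a definition question."""
    match = DEFINITION_QUERY.match(query)
    return match.group("term").lower() if match else None


def definition_boost(term, chunk_text, boost=1.5):
    """Return a score multiplier: `boost` if the chunk appears to define `term`, else 1.0."""
    pattern = re.compile(
        rf"\b{re.escape(term)}\s+(?:is|are|refers\s+to|means)\b",
        re.IGNORECASE,
    )
    return boost if pattern.search(chunk_text) else 1.0


term = definition_term("What is machine learning?")
if term is not None:
    print(term, definition_boost(term, "Machine learning is a field of study."))
```

A multiplier keeps the boost proportional to the base relevance score, so a chunk that matches the pattern but is otherwise irrelevant does not jump to the top.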


🚀 Quick Start

from dbless.engine import DBlessEngine

# 1. Load and index a PDF
engine = DBlessEngine.from_pdf(
    "document.pdf",
    chunk_size=100,   # words per chunk
    overlap=20        # overlap between chunks
)

# 2. Search
results = engine.search("What is machine learning?", k=5)

# 3. Print results
for result in results:
    print(f"Score : {result['score']:.2f}")
    print(f"Snippet: {result['snippet']}")
    print("---")

📦 Installation

pip install dbless

Or install from source:

git clone https://github.com/rahulreddy9725/Dbless.git
cd Dbless
pip install -e .

🔧 API Reference

DBlessEngine.from_pdf(path, chunk_size, overlap, vector_dim, factor_rank)

Loads a PDF, chunks it, embeds it, and builds the search index.

| Parameter | Type | Default | Description |
|---|---|---|---|
| path | str / Path | required | Path to the PDF file |
| chunk_size | int | 600 | Number of words per chunk |
| overlap | int | 150 | Word overlap between consecutive chunks |
| vector_dim | int | 512 | Hash vector dimensionality |
| factor_rank | int | 128 | SVD factorization rank |

Returns: DBlessEngine instance
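
To make the interaction between chunk_size and overlap concrete, here is a minimal sketch of word-based chunking with overlap. `chunk_words` is a hypothetical helper written for this illustration; DBless's own chunker may split text differently (for example, per page).

```python
def chunk_words(text, chunk_size=600, overlap=150):
    """Split text into chunks of `chunk_size` words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Each window advances by chunk_size - overlap words, so with the defaults (600 and 150) a new chunk starts every 450 words. Smaller chunks give more precise snippets at the cost of more vectors to store and score.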


engine.search(query, k)

Search the indexed PDF for the most relevant chunks.

| Parameter | Type | Default | Description |
|---|---|---|---|
| query | str | required | Natural language query |
| k | int | 5 | Number of results to return |

Returns: list[dict] — each result contains:

| Key | Description |
|---|---|
| text | Full chunk text |
| snippet | Best-matching sentence(s) from the chunk |
| score | Relevance score (0–100) |
| page | Page number in the PDF |
| chunk_id | Chunk index |

Example:

results = engine.search("What are exceptions?", k=3)
for r in results:
    print(r["snippet"])   # best answer sentence
    print(r["score"])     # relevance score 0-100
    print(r["page"])      # page number
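
Because each result is a plain dict, post-processing needs nothing beyond standard Python. The sketch below relies only on the documented `score` and `page` keys; the helper names and the threshold of 50 are assumptions chosen for illustration.

```python
def filter_results(results, min_score=50.0):
    """Keep only results whose relevance score is at least `min_score`."""
    return [r for r in results if r["score"] >= min_score]


def best_per_page(results):
    """Keep the highest-scoring result for each page, ordered by descending score."""
    best = {}
    for r in results:
        page = r["page"]
        if page not in best or r["score"] > best[page]["score"]:
            best[page] = r
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)


# Stand-in data shaped like the documented result dicts (not real engine output).
sample_results = [
    {"snippet": "An exception is an error raised at runtime.", "score": 82.0, "page": 4, "chunk_id": 7},
    {"snippet": "Exceptions can be caught with try/except.", "score": 64.5, "page": 4, "chunk_id": 8},
    {"snippet": "See the appendix for details.", "score": 21.0, "page": 9, "chunk_id": 15},
]

for r in best_per_page(filter_results(sample_results)):
    print(r["page"], r["score"], r["snippet"])
```

With a real engine, the same helpers would be applied to the list returned by engine.search(...).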

🖥️ CLI Usage

DBless ships with a command-line tool:

# Index a PDF and show chunk count
dbless index document.pdf

# Query a PDF and get top results
dbless query document.pdf "What is machine learning?" -k 5
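
The CLI can also be driven from a script. The sketch below wraps the `dbless query` command shown above using `subprocess`; it assumes the `dbless` executable is on PATH after installation, and the helper names are hypothetical. The format of the CLI's output is not documented here, so the wrapper returns it as raw text.

```python
import shutil
import subprocess


def build_query_command(pdf_path, question, k=5):
    """Build the argument list for `dbless query`, mirroring the documented command."""
    return ["dbless", "query", str(pdf_path), question, "-k", str(k)]


def run_query(pdf_path, question, k=5):
    """Run the CLI and return its standard output as text."""
    if shutil.which("dbless") is None:
        raise RuntimeError("dbless CLI not found on PATH; try `pip install dbless`")
    completed = subprocess.run(
        build_query_command(pdf_path, question, k),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout


print(build_query_command("document.pdf", "What is machine learning?", k=5))
```

Passing the arguments as a list avoids shell quoting problems when the question contains spaces or quotation marks.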

⚙️ How It Works

| Step | Description |
|---|---|
| 1. Chunk | PDF is split into overlapping word-based chunks |
| 2. Embed | Each chunk is hashed into a sparse numpy vector |
| 3. IDF Weight | Inverse document frequency (IDF) weighting applied across all chunks |
| 4. SVD Compress | Dimensionality reduced via matrix factorization |
| 5. Quantize | Vectors quantized to int8 for memory efficiency |
| 6. Search | Query vector matched via dot product, with BM25 boosting |
| 7. Re-rank | Definition-style queries get an additional re-ranking pass |
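
Steps 2 to 6 above can be sketched in plain numpy. This is an illustrative reconstruction under stated assumptions, not the DBless implementation: the crc32 hash, the smoothed IDF formula, the unit-norm scaling before quantization, the BM25 parameters (k1, b), and the mixing weight are all choices made for this example.

```python
import math
import re
import zlib

import numpy as np


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def hash_vector(tokens, dim):
    """Step 2: count tokens into `dim` buckets chosen by a stable hash (crc32 is an assumption)."""
    vec = np.zeros(dim, dtype=np.float32)
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    return vec


def build_index(chunks, dim=512, rank=128):
    counts = np.stack([hash_vector(tokenize(c), dim) for c in chunks])
    # Step 3: smoothed inverse document frequency per hash bucket.
    df = (counts > 0).sum(axis=0)
    idf = (np.log((1.0 + len(chunks)) / (1.0 + df)) + 1.0).astype(np.float32)
    weighted = counts * idf
    # Step 4: truncated SVD; keep at most `rank` right singular vectors.
    _, _, vt = np.linalg.svd(weighted, full_matrices=False)
    proj = vt[:rank].T
    reduced = weighted @ proj
    reduced /= np.maximum(np.linalg.norm(reduced, axis=1, keepdims=True), 1e-9)
    # Step 5: rows are unit-norm, so every entry lies in [-1, 1] and fits int8.
    scale = 1.0 / 127.0
    codes = np.round(reduced / scale).astype(np.int8)
    return {"dim": dim, "idf": idf, "proj": proj, "codes": codes, "scale": scale}


def vector_scores(index, query):
    """Step 6a: approximate cosine similarity between the query and each quantized chunk."""
    q = hash_vector(tokenize(query), index["dim"]) * index["idf"]
    q = q @ index["proj"]
    q /= max(float(np.linalg.norm(q)), 1e-9)
    return (index["codes"].astype(np.float32) * index["scale"]) @ q


def bm25_scores(chunks, query, k1=1.5, b=0.75):
    """Step 6b: standard BM25 over the raw tokens of each chunk."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / max(n, 1)
    scores = np.zeros(n, dtype=np.float32)
    for term in set(tokenize(query)):
        df = sum(term in d for d in docs)
        if df == 0:
            continue
        idf = math.log(1.0 + (n - df + 0.5) / (df + 0.5))
        for i, d in enumerate(docs):
            tf = d.count(term)
            scores[i] += idf * tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * len(d) / avg_len))
    return scores


def search(index, chunks, query, k=5, bm25_weight=0.5):
    lexical = bm25_scores(chunks, query)
    if lexical.max() > 0:
        lexical = lexical / lexical.max()  # bring BM25 into [0, 1] before mixing
    combined = vector_scores(index, query) + bm25_weight * lexical
    order = np.argsort(-combined)[:k]
    return [(int(i), float(combined[i])) for i in order]


chunks = [
    "Machine learning is the study of algorithms that improve through experience.",
    "The contract terminates after twelve months unless renewed.",
    "Exceptions are errors detected during program execution.",
]
index = build_index(chunks)
print(search(index, chunks, "machine learning", k=2))
```

Storing int8 codes instead of float32 values cuts vector memory by a factor of four, at the cost of a small rounding error in each dot product. The definition re-ranking in step 7 is left out of this sketch.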

🧪 Testing

pytest tests/

Test coverage includes:

  • ✅ PDF ingestion and chunking
  • ✅ Vector quantization accuracy
  • ✅ Phrase and keyword search
  • ✅ End-to-end engine from PDF to results
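
A quantization accuracy test might look like the sketch below. It is self-contained and uses its own hypothetical `quantize_int8` and `dequantize` helpers rather than DBless internals, so the function names and the symmetric scaling scheme are assumptions.

```python
import numpy as np


def quantize_int8(vectors):
    """Symmetric int8 quantization: one scale for the whole array."""
    max_abs = float(np.abs(vectors).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = np.clip(np.round(vectors / scale), -127, 127).astype(np.int8)
    return codes, scale


def dequantize(codes, scale):
    return codes.astype(np.float32) * scale


def test_quantization_round_trip():
    rng = np.random.default_rng(0)
    vectors = rng.normal(size=(50, 128)).astype(np.float32)
    codes, scale = quantize_int8(vectors)
    restored = dequantize(codes, scale)
    assert codes.dtype == np.int8
    # Rounding to the nearest step loses at most half a step per entry.
    assert np.abs(restored - vectors).max() <= scale / 2 + 1e-4
```

Saved under tests/ in a file whose name starts with test_, this would be collected by the pytest command above.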

📊 Performance

| Metric | Value |
|---|---|
| Memory per 100-page PDF | ~10–50 MB |
| Search speed | < 100 ms on CPU |
| Top-3 accuracy (definitions) | 85–95% |
| Python support | 3.8–3.12 |
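
Latency depends on document size and hardware, so it is worth measuring on your own PDFs. The helper below is a generic timing sketch; `median_latency_ms` is a hypothetical name, and the commented call assumes an `engine` built as in the Quick Start.

```python
import statistics
import time


def median_latency_ms(fn, *args, repeats=20, **kwargs):
    """Call fn(*args, **kwargs) `repeats` times and return the median wall-clock time in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# With a real engine:
# print(median_latency_ms(engine.search, "What is machine learning?", k=5))
print(median_latency_ms(sorted, list(range(1000))))
```

The median is less sensitive than the mean to a slow first call caused by cache warm-up.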

🤝 Contributing

Contributions are welcome!

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/my-feature
  3. Commit your changes: git commit -m "Add my feature"
  4. Push and open a Pull Request

Please open an issue first for major changes.


📄 License

MIT License — see LICENSE for details.



Built with ❤️ by Rahul Reddy
