Lightweight, no-database document search engine using quantized numpy vectors with BM25 and definition-aware ranking.
DBless
Lightweight, no-database PDF search engine — pure Python, no servers, no setup.
💡 What is DBless?
DBless is a lightweight document search engine that works entirely in-memory: no database, no server, and no external dependencies beyond numpy and pymupdf. Point it at a PDF and start searching in seconds.
Why DBless?
- 🚀 Zero setup: no database to install or configure
- 🎯 Definition-aware: understands "What is X?" style queries
- 🪶 Lightweight: only requires `numpy` and `pymupdf`
- 📄 Multi-domain: works on legal, medical, corporate, and technical PDFs
- ⚡ Fast: sub-100ms search on CPU
📚 Documentation Contents
- Quick Start: Up and running in 2 minutes
- Installation: Install via pip or from source
- API Reference: Full API for `DBlessEngine`
- CLI Usage: Command-line interface
- How It Works: Architecture overview
- Contributing: How to contribute
🚀 Quick Start
from dbless.engine import DBlessEngine
# 1. Load and index a PDF
engine = DBlessEngine.from_pdf(
    "document.pdf",
    chunk_size=100,  # words per chunk
    overlap=20,      # overlap between chunks
)
# 2. Search
results = engine.search("What is machine learning?", k=5)
# 3. Print results
for result in results:
    print(f"Score : {result['score']:.2f}")
    print(f"Snippet: {result['snippet']}")
    print("---")
📦 Installation
pip install dbless
Or install from source:
git clone https://github.com/rahulreddy9725/Dbless.git
cd Dbless
pip install -e .
🔧 API Reference
DBlessEngine.from_pdf(path, chunk_size, overlap, vector_dim, factor_rank)
Loads a PDF, chunks it, embeds it, and builds the search index.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `path` | `str` / `Path` | required | Path to the PDF file |
| `chunk_size` | `int` | 600 | Number of words per chunk |
| `overlap` | `int` | 150 | Word overlap between consecutive chunks |
| `vector_dim` | `int` | 512 | Hash vector dimensionality |
| `factor_rank` | `int` | 128 | SVD factorization rank |
Returns: DBlessEngine instance
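To make the `chunk_size` and `overlap` parameters concrete, here is a minimal sketch of overlapping word-based chunking. The helper name `chunk_words` and its edge-case handling are assumptions made for illustration; DBless's internal chunker may behave differently (for example around page boundaries).

```python
def chunk_words(text, chunk_size=600, overlap=150):
    """Split text into overlapping word-based chunks (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Ten words, windows of four words, one word shared between neighbours.
sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With the documented defaults (600 and 150), each chunk would share its last 150 words with the start of the next one, so a sentence that straddles a boundary still appears whole in at least one chunk.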
engine.search(query, k)
Search the indexed PDF for the most relevant chunks.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `query` | `str` | required | Natural language query |
| `k` | `int` | 5 | Number of results to return |
Returns: list[dict] — each result contains:
| Key | Description |
|---|---|
| `text` | Full chunk text |
| `snippet` | Best-matching sentence(s) from the chunk |
| `score` | Relevance score (0–100) |
| `page` | Page number in the PDF |
| `chunk_id` | Chunk index |
Example:
results = engine.search("What are exceptions?", k=3)
for r in results:
    print(r["snippet"])  # best answer sentence
    print(r["score"])    # relevance score 0-100
    print(r["page"])     # page number
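Because results are plain dictionaries, they can be post-processed with ordinary Python. The sketch below keeps only results above a score threshold. The helper `filter_results`, the threshold of 50, and the hand-written sample data are illustrative assumptions, not part of the DBless API.

```python
def filter_results(results, min_score=50.0):
    """Keep results scoring at least min_score, best first (hypothetical helper)."""
    kept = [r for r in results if r["score"] >= min_score]
    return sorted(kept, key=lambda r: r["score"], reverse=True)


# Hand-written stand-in for the list returned by engine.search(...),
# using the documented keys: text, snippet, score, page, chunk_id.
sample_results = [
    {"text": "chunk A", "snippet": "sentence A", "score": 42.0, "page": 3, "chunk_id": 7},
    {"text": "chunk B", "snippet": "sentence B", "score": 91.5, "page": 1, "chunk_id": 2},
    {"text": "chunk C", "snippet": "sentence C", "score": 67.0, "page": 5, "chunk_id": 11},
]

good = filter_results(sample_results, min_score=50.0)
for r in good:
    print(f"p.{r['page']}  {r['score']:.1f}  {r['snippet']}")
```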
🖥️ CLI Usage
DBless ships with a command-line tool:
# Index a PDF and show chunk count
dbless index document.pdf
# Query a PDF and get top results
dbless query document.pdf "What is machine learning?" -k 5
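If the CLI needs to be driven from a script, one option is to call it through `subprocess`. This sketch only uses the `query` subcommand and `-k` flag shown above; the helper name `build_query_command` is hypothetical, and the format of the CLI's output is not documented here, so the output is printed as-is rather than parsed.

```python
import shutil
import subprocess


def build_query_command(pdf_path, query, k=5):
    """Build the argv list for `dbless query` (hypothetical helper)."""
    return ["dbless", "query", str(pdf_path), query, "-k", str(k)]


cmd = build_query_command("document.pdf", "What is machine learning?", k=5)

if shutil.which("dbless") is not None:
    # check=False: report failures via the return code instead of raising.
    completed = subprocess.run(cmd, capture_output=True, text=True, check=False)
    print(completed.stdout)
else:
    print("dbless is not on PATH; the command would be:", cmd)
```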
⚙️ How It Works
| Step | Description |
|---|---|
| 1. Chunk | PDF is split into overlapping word-based chunks |
| 2. Embed | Each chunk is hashed into a sparse numpy vector |
| 3. IDF Weight | Inverse document frequency weighting applied across all chunks |
| 4. SVD Compress | Dimensionality reduced via matrix factorization |
| 5. Quantize | Vectors quantized to Int8 for memory efficiency |
| 6. Search | Query vector matched via dot product + BM25 boosting |
| 7. Re-rank | Definition-style queries get special re-ranking |
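Steps 2 through 6 can be sketched with numpy alone. This is an illustration under stated assumptions, not DBless's actual code: the CRC32 hashing trick, the smoothed IDF formula, the single-scale symmetric Int8 quantization, and every function name below are choices made for the example. BM25 boosting and definition re-ranking are left out.

```python
import zlib

import numpy as np


def hash_embed(chunks, dim=512):
    """Hashing trick: each token increments the bucket chosen by its CRC32 hash."""
    mat = np.zeros((len(chunks), dim), dtype=np.float32)
    for row, chunk in enumerate(chunks):
        for token in chunk.lower().split():
            mat[row, zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return mat


def idf_weight(mat):
    """Scale each bucket by a smoothed inverse document frequency (assumed formula)."""
    n_docs = mat.shape[0]
    doc_freq = np.count_nonzero(mat, axis=0)
    idf = np.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0
    return mat * idf, idf


def svd_compress(mat, rank=128):
    """Project rows onto the top right-singular vectors of the chunk matrix."""
    _, _, vt = np.linalg.svd(mat, full_matrices=False)
    basis = vt[:rank].T  # shape (dim, r) with r <= min(rank, n_chunks)
    return mat @ basis, basis


def quantize_int8(mat):
    """Symmetric Int8 quantization with one scale for the whole matrix."""
    scale = float(np.abs(mat).max()) / 127.0 or 1.0  # avoid a zero scale
    q = np.clip(np.round(mat / scale), -127, 127).astype(np.int8)
    return q, scale


def search(query, index, k=3):
    """Embed the query the same way, then rank chunks by dot product."""
    q_vec = hash_embed([query], index["dim"]) * index["idf"]
    q_low = (q_vec @ index["basis"])[0]
    scores = index["q"].astype(np.float32) @ q_low * index["scale"]
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


chunks = [
    "machine learning is a field of study",
    "the contract may be terminated by either party",
    "python raises exceptions when errors occur",
]
dim = 512
weighted, idf = idf_weight(hash_embed(chunks, dim))
reduced, basis = svd_compress(weighted, rank=128)  # only 3 chunks, so 3 columns
q, scale = quantize_int8(reduced)
index = {"dim": dim, "idf": idf, "basis": basis, "q": q, "scale": scale}

top = search("what is machine learning", index, k=1)
print(top)
```

Storing `q` as Int8 uses one byte per value instead of four for float32, which is where the memory saving in step 5 comes from; the price is a small rounding error in each score.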
🧪 Testing
pytest tests/
Test coverage includes:
- ✅ PDF ingestion and chunking
- ✅ Vector quantization accuracy
- ✅ Phrase and keyword search
- ✅ End-to-end engine from PDF to results
📊 Performance
| Metric | Value |
|---|---|
| Memory per 100-page PDF | ~10–50 MB |
| Search speed | < 100ms on CPU |
| Top-3 accuracy (definitions) | 85–95% |
| Python support | 3.8 – 3.12 |
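Search latency depends on the document and the hardware, so it is worth measuring on your own data. A minimal timing sketch follows; `median_latency_ms` is a hypothetical helper, and `fake_search` is a stand-in to be swapped for something like `lambda q: engine.search(q, k=5)`.

```python
import statistics
import time


def median_latency_ms(search_fn, query, repeats=20):
    """Return the median wall-clock time of search_fn(query) in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        search_fn(query)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def fake_search(query):
    """Stand-in workload so the sketch runs without a PDF."""
    return sorted(query.split())


latency = median_latency_ms(fake_search, "What is machine learning?")
print(f"median latency: {latency:.3f} ms")
```

The median is used rather than the mean so that a single slow run (for example the first call warming caches) does not dominate the figure.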
🤝 Contributing
Contributions are welcome!
- Fork the repository
- Create a feature branch: `git checkout -b feature/my-feature`
- Commit your changes: `git commit -m "Add my feature"`
- Push and open a Pull Request
Please open an issue first for major changes.
📄 License
MIT License — see LICENSE for details.
Built with ❤️ by Rahul Reddy