DocRag: An advanced document search and retrieval system leveraging Retrieval-Augmented Generation for intelligent natural language search across PDF document collections.
DocRag
DocRag is a document search and retrieval system that uses Retrieval-Augmented Generation (RAG) to provide natural language search across PDF document collections. It combines PDF processing, vector embeddings, and large language models to capture the meaning of document content and give context-aware answers to complex queries.
Installation
pip install docrag
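To confirm that the install succeeded, the package version can be read from the installed metadata. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of DocRag.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="docrag"):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Example use:
#   print(installed_version())  # a version string, or None if docrag is not installed
```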
Getting Started
For a full tutorial on using DocRag, see the Getting Started notebook, which covers:
- Setting up your Google Gemini API key
- Creating Document objects from PDFs
- Exploring document content (figures, tables, text, formulas)
- Converting documents to markdown
- Saving and exporting processed data
The notebook walks through each step with examples and explanations to help you get up and running quickly with DocRag.
Quick Example
import os
from docrag import Document
# Set your Google Gemini API key
os.environ['GEMINI_API_KEY'] = 'your-api-key-here'
# Process a PDF document
doc = Document.from_pdf('your_document.pdf')
# Access different content types
print(f"Found {len(doc.figures)} figures, {len(doc.tables)} tables")
# Convert to markdown
markdown = doc.to_markdown()
print(markdown)
# Save processed document
doc.save('output_directory')
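The same calls can be applied to a whole folder of PDFs. The sketch below is illustrative rather than part of DocRag: `require_api_key`, `find_pdfs`, and `process_all` are hypothetical helpers, and the code assumes that `Document.from_pdf` and `doc.save` accept string paths as in the example above. It reads the key from the environment instead of hardcoding it, and imports `docrag` only when processing starts.

```python
import os
from pathlib import Path


def require_api_key(name="GEMINI_API_KEY"):
    """Return the API key from the environment, or raise if it is missing."""
    key = os.environ.get(name, "").strip()
    if not key or key == "your-api-key-here":
        raise RuntimeError(f"Set the {name} environment variable before processing.")
    return key


def find_pdfs(root):
    """Collect every .pdf file under root, searching subdirectories too."""
    return sorted(p for p in Path(root).rglob("*.pdf") if p.is_file())


def process_all(root, out_root):
    """Process each PDF under root and save it to its own output directory."""
    from docrag import Document  # deferred so the helpers above work without docrag

    require_api_key()
    out_root = Path(out_root)
    out_root.mkdir(parents=True, exist_ok=True)
    for pdf in find_pdfs(root):
        doc = Document.from_pdf(str(pdf))
        print(f"{pdf.name}: {len(doc.figures)} figures, {len(doc.tables)} tables")
        # Assumption: save() takes a directory path, as in the example above.
        doc.save(str(out_root / pdf.stem))


# Example use:
#   process_all("papers", "processed")
```

Giving each document its own output directory, named after the PDF, keeps the saved results of different files from overwriting each other.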
Contributing
Contributions are welcome! Please open an issue or submit a pull request on GitHub.
License
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.
Authors
Logan Lang
File details
Details for the file docrag-0.0.3.tar.gz.
File metadata
- Download URL: docrag-0.0.3.tar.gz
- Upload date:
- Size: 13.2 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6004fd10357256255a1f607bfa29a1dfa524e50b598aa2ddbb725135db990541 |
| MD5 | 3872c62a270012a05f300bd5e022cb58 |
| BLAKE2b-256 | dea7b4b3ca4395dacbaace085236a0b592318454fb539dc672ad286bf52591ef |
File details
Details for the file docrag-0.0.3-py3-none-any.whl.
File metadata
- Download URL: docrag-0.0.3-py3-none-any.whl
- Upload date:
- Size: 34.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 628d1bb33a2d3fc713c8144a06a6cc0e89bbb3ec88cecfc818ddb5c56c0c9521 |
| MD5 | 66f7f38a2fac4da8836b3b5572ff33f5 |
| BLAKE2b-256 | ae3a82e36b1bc25893ef4cdd1bfdb3ea0428058758965e2200963c2b6585fa99 |
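A downloaded file can be checked against the digests listed above with Python's standard `hashlib` module. The following is a minimal sketch; the helper names `file_digest` and `matches` are hypothetical, and the file is assumed to be in the current directory.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Compute the hex digest of a file, reading it in chunks to limit memory use."""
    if algorithm == "blake2b-256":
        # BLAKE2b-256 is BLAKE2b with a 32-byte (256-bit) digest.
        hasher = hashlib.blake2b(digest_size=32)
    else:
        hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches(path, expected, algorithm="sha256"):
    """Return True if the file's digest equals the expected hex string."""
    return file_digest(path, algorithm) == expected.strip().lower()


# Example use, with the SHA256 digest listed for the source distribution:
#   matches(
#       "docrag-0.0.3.tar.gz",
#       "6004fd10357256255a1f607bfa29a1dfa524e50b598aa2ddbb725135db990541",
#   )
```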