DocRag: An advanced document search and retrieval system leveraging Retrieval-Augmented Generation for intelligent natural language search across PDF document collections.

DocRag

DocRag is an advanced document search and retrieval system that leverages Retrieval-Augmented Generation (RAG) to provide intelligent natural language search capabilities across PDF document collections. This system combines sophisticated PDF processing, vector embeddings, and large language models to enable semantic understanding of document content and context-aware responses to complex queries.

Installation

pip install docrag
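After installing, it can be useful to confirm which version ended up in the active environment. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and is not part of DocRag.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution: str = "docrag"):
    """Return the installed version string of a distribution, or None if it is absent.

    `installed_version` is an illustrative helper, not a DocRag function.
    """
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("docrag is not installed in this environment")
else:
    print(f"docrag {found} is installed")
```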

Getting Started

For a comprehensive tutorial on using DocRag, see the Getting Started notebook, which covers:

  • Setting up your Google Gemini API key
  • Creating Document objects from PDFs
  • Exploring document content (figures, tables, text, formulas)
  • Converting documents to markdown
  • Saving and exporting processed data

The notebook provides step-by-step examples and explanations to help you get up and running quickly with DocRag.
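For the API key step, one option is to read the key from the environment and fail early with a clear message instead of embedding it in source code. This is a sketch under assumptions: the variable name GEMINI_API_KEY comes from the Quick Example, while the helper require_api_key is hypothetical and not part of DocRag.

```python
import os


def require_api_key(env_var: str = "GEMINI_API_KEY") -> str:
    """Return the key stored in `env_var`, or raise if it is unset or blank.

    Illustrative helper only; DocRag itself may read the variable differently.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before processing documents."
        )
    return key
```

Calling require_api_key() at the top of a script surfaces a missing key before any PDF processing starts, and keeps the key itself out of version control.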

Quick Example

import os
from docrag import Document

# Set your Google Gemini API key
os.environ['GEMINI_API_KEY'] = 'your-api-key-here'

# Process a PDF document
doc = Document.from_pdf('your_document.pdf')

# Access different content types
print(f"Found {len(doc.figures)} figures, {len(doc.tables)} tables")

# Convert to markdown
markdown = doc.to_markdown()
print(markdown)

# Save processed document
doc.save('output_directory')
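The same two calls can be applied to a whole folder of PDFs. The sketch below is an assumption-laden illustration, not a DocRag feature: find_pdfs and process_all are hypothetical helpers, the loader is passed in as an argument so the loop does not depend on DocRag internals, and it assumes the loaded object exposes save(directory) as in the Quick Example.

```python
from pathlib import Path


def find_pdfs(root) -> list[Path]:
    """Return every PDF under `root` (recursively), sorted for a stable order."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def process_all(pdf_paths, output_root, load_document) -> list[Path]:
    """Load each PDF with `load_document` and save it to its own subdirectory.

    `load_document` is any callable that takes a path string and returns an
    object with a `save(directory)` method (assumed, based on the Quick Example).
    """
    saved = []
    for pdf in pdf_paths:
        out_dir = Path(output_root) / Path(pdf).stem  # one folder per document
        doc = load_document(str(pdf))
        doc.save(str(out_dir))
        saved.append(out_dir)
    return saved
```

With DocRag installed, a call might look like process_all(find_pdfs('papers'), 'processed', Document.from_pdf), where 'papers' and 'processed' are placeholder directory names.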

Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.

License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.

Authors

Logan Lang

Download files

Source Distribution

docrag-0.0.3.tar.gz (13.2 MB)

Uploaded Source

Built Distribution

docrag-0.0.3-py3-none-any.whl (34.5 kB)

Uploaded Python 3

File details

Details for the file docrag-0.0.3.tar.gz.

File metadata

  • Download URL: docrag-0.0.3.tar.gz
  • Upload date:
  • Size: 13.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for docrag-0.0.3.tar.gz
Algorithm Hash digest
SHA256 6004fd10357256255a1f607bfa29a1dfa524e50b598aa2ddbb725135db990541
MD5 3872c62a270012a05f300bd5e022cb58
BLAKE2b-256 dea7b4b3ca4395dacbaace085236a0b592318454fb539dc672ad286bf52591ef
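The SHA256 digest listed above can be compared against a downloaded copy of the archive. The following is a minimal sketch using Python's hashlib; the helper names sha256_of and matches_published_hash are hypothetical, and the local file path is whatever location the archive was downloaded to.

```python
import hashlib
from pathlib import Path

# SHA256 digest published above for docrag-0.0.3.tar.gz
SDIST_SHA256 = "6004fd10357256255a1f607bfa29a1dfa524e50b598aa2ddbb725135db990541"


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex: str) -> bool:
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches_published_hash('docrag-0.0.3.tar.gz', SDIST_SHA256) should return True for an intact download placed in the current directory. The same function works for the wheel when given the wheel's SHA256 digest.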

File details

Details for the file docrag-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: docrag-0.0.3-py3-none-any.whl
  • Upload date:
  • Size: 34.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for docrag-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 628d1bb33a2d3fc713c8144a06a6cc0e89bbb3ec88cecfc818ddb5c56c0c9521
MD5 66f7f38a2fac4da8836b3b5572ff33f5
BLAKE2b-256 ae3a82e36b1bc25893ef4cdd1bfdb3ea0428058758965e2200963c2b6585fa99
