DocRag: An advanced document search and retrieval system leveraging Retrieval-Augmented Generation for intelligent natural language search across PDF document collections.

DocRag

DocRag is a document search and retrieval system that uses Retrieval-Augmented Generation (RAG) to support natural language search across collections of PDF documents. It combines PDF processing, vector embeddings, and large language models so that queries are matched on meaning rather than exact wording and are answered with context drawn from the documents.
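
To make the retrieve-then-generate idea concrete, here is a minimal sketch of the retrieval half of a RAG pipeline. It is not DocRag's implementation: the bag-of-words `embed` function is a toy stand-in for a learned embedding model, and `retrieve` and `build_prompt` are hypothetical helper names chosen for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a vector embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list, k: int = 2) -> list:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list) -> str:
    """Assemble the text a language model would receive (hypothetical format)."""
    joined = "\n".join(f"- {chunk}" for chunk in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


chunks = [
    "the table lists thermal conductivity values",
    "figure 3 shows the crystal structure",
    "the authors thank the funding agency",
]
top = retrieve("crystal structure figure", chunks, k=1)
print(build_prompt("crystal structure figure", top))
```

A real system would replace `embed` with a learned embedding model and send the assembled prompt to a language model; the ranking step is the part that stays conceptually the same.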

Installation

pip install docrag
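
After installing, one way to confirm which version is present is to query the package metadata from Python. This is a small sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and it assumes the distribution is registered under the name `docrag`, as in the install command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("docrag")
print(f"docrag version: {found}" if found else "docrag is not installed")
```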

Getting Started

For a full tutorial on using DocRag, see the Getting Started notebook, which covers:

  • Setting up your Google Gemini API key
  • Creating Document objects from PDFs
  • Exploring document content (figures, tables, text, formulas)
  • Converting documents to markdown
  • Saving and exporting processed data

The notebook walks through each step with examples and explanations so you can get up and running with DocRag quickly.
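
For the API key step, a common pattern is to read the key from the environment and fail early with a clear message if it is missing, rather than writing the key into source code. The sketch below assumes the `GEMINI_API_KEY` variable name used in this README; `require_api_key` is a hypothetical helper, not part of the docrag package.

```python
import os


def require_api_key(name: str = "GEMINI_API_KEY") -> str:
    """Return the API key from the environment, or raise if it is unset or blank."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {name} environment variable before processing documents."
        )
    return key
```

Calling `require_api_key()` at the top of a script surfaces a missing key before any PDF processing starts.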

Quick Example

import os
from docrag import Document

# Set your Google Gemini API key
os.environ['GEMINI_API_KEY'] = 'your-api-key-here'

# Process a PDF document
doc = Document.from_pdf('your_document.pdf')

# Access different content types
print(f"Found {len(doc.figures)} figures, {len(doc.tables)} tables")

# Convert to markdown
markdown = doc.to_markdown()
print(markdown)

# Save processed document
doc.save('output_directory')
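
Since DocRag targets collections of PDFs, the same calls can be applied to a whole folder. The following is a sketch under stated assumptions: `convert_folder` and `docrag_to_markdown` are hypothetical helpers, and the code assumes `Document.from_pdf` accepts a path string and `to_markdown()` returns a string, as the example above suggests. The converter is passed in as a function so the loop can be tried without any PDFs or an API key.

```python
from pathlib import Path
from typing import Callable, List


def convert_folder(pdf_dir, out_dir, to_markdown: Callable[[Path], str]) -> List[Path]:
    """Convert every *.pdf in pdf_dir to a .md file in out_dir."""
    pdf_dir, out_dir = Path(pdf_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        target = out_dir / f"{pdf_path.stem}.md"
        target.write_text(to_markdown(pdf_path), encoding="utf-8")
        written.append(target)
    return written


def docrag_to_markdown(pdf_path: Path) -> str:
    """Converter built only from the calls shown in the Quick Example."""
    from docrag import Document  # imported lazily so the loop works without docrag

    return Document.from_pdf(str(pdf_path)).to_markdown()
```

With docrag installed and the API key set, `convert_folder("pdfs", "markdown", docrag_to_markdown)` would write one markdown file per PDF; the folder names here are placeholders.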

Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.

License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.

Authors

Logan Lang

Download files

Source Distribution

docrag-0.0.2.tar.gz (8.7 MB)

Uploaded Source

Built Distribution

docrag-0.0.2-py3-none-any.whl (34.4 kB)

Uploaded Python 3

File details

Details for the file docrag-0.0.2.tar.gz.

File metadata

  • Download URL: docrag-0.0.2.tar.gz
  • Upload date:
  • Size: 8.7 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for docrag-0.0.2.tar.gz
Algorithm    Hash digest
SHA256       af6a47fd9c62d5c9c5d6eb702059d05d9672c80674803b20042a6cc2a54634fe
MD5          6cdf12611ec7ae158a0b355eab133219
BLAKE2b-256  b5d2b2da5f5715b0c3412c87c1a51be92128b95e2bf7bdb91ba18e3d799148d2
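
A downloaded file can be checked against the published SHA256 digest with Python's standard `hashlib` module. This is a minimal sketch: `sha256_of_file` and `matches_sha256` are hypothetical helper names, and the local file path in the usage comment is an assumption about where the download was saved.

```python
import hashlib

# SHA256 digest listed above for docrag-0.0.2.tar.gz
EXPECTED_SHA256 = "af6a47fd9c62d5c9c5d6eb702059d05d9672c80674803b20042a6cc2a54634fe"


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage, assuming the archive was saved to the current directory:
# print(matches_sha256("docrag-0.0.2.tar.gz", EXPECTED_SHA256))
```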

File details

Details for the file docrag-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: docrag-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 34.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for docrag-0.0.2-py3-none-any.whl
Algorithm    Hash digest
SHA256       e0da5220bb3824b75a80734d801541346eb00498e76ea7568bbf35e13edcecc4
MD5          839b4be91b9ffc5378c4d6b6c15712ff
BLAKE2b-256  7549288558bb3a6e3b089f31154416de6b15f2c70dee6a39906e71772373bcc2
