
Extracts images from PDFs, stores them in S3, and retrieves them by keyword search


PDF Image Retrieval

Overview

This Python package extracts images from PDFs, stores them in AWS S3, and retrieves relevant images by matching a search query against keywords extracted from the PDFs.

Features

  • Extract images from PDFs automatically
  • Upload images to AWS S3 for storage
  • Extract keywords from PDFs using NLP
  • Train on PDFs to improve retrieval accuracy
  • Search and retrieve images based on keywords
  • Open-source & developer-friendly
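
The keyword-extraction feature is not described in detail here, so the following is only a minimal sketch of one simple approach: counting term frequency after removing stopwords. The function name `extract_keywords`, the `STOPWORDS` set, and the sample text are all assumptions made for illustration and are not part of the package's API.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption); a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "as", "for", "in", "is", "of", "on", "the", "to", "with"}


def extract_keywords(text, top_n=5):
    """Return up to top_n of the most frequent non-stopword terms in text."""
    # Lowercase and keep alphabetic runs only.
    words = re.findall(r"[a-z]+", text.lower())
    # Drop stopwords and very short tokens, then count what remains.
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]


# Hypothetical text extracted from a PDF page.
page_text = "Neural networks learn features. The diagram shows a neural network with hidden layers."
print(extract_keywords(page_text, top_n=3))
```

Frequency counting is a stand-in for whatever NLP method the package actually uses; it is shown only because it needs nothing beyond the standard library.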

Installation

Install the package from PyPI:

pip install pdf-image-retrieval

Or install directly from GitHub:

pip install git+https://github.com/aryadhandhukiya/pdf-image-retrieval.git

Usage

Extracting Images from a PDF and Storing in S3

from pdf_image_retrieval import PdfImageExtractor

# Initialize extractor
extractor = PdfImageExtractor(pdf_path="sample.pdf", s3_bucket="my-bucket")

# Extract images and store them in S3
extractor.extract_and_upload()
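
When several PDFs need processing, the call above can be wrapped in a loop. The sketch below assumes only the constructor arguments and the `extract_and_upload()` method shown in this section; the helper name `upload_all` and its `make_extractor` parameter are hypothetical. The exception types the package raises are not documented here, so the helper catches `Exception` broadly and records the message.

```python
def upload_all(pdf_paths, s3_bucket, make_extractor):
    """Run extract_and_upload() for each PDF and collect failures instead of stopping.

    make_extractor is any callable accepting pdf_path= and s3_bucket=,
    for example the PdfImageExtractor class itself.
    """
    failures = {}
    for path in pdf_paths:
        try:
            make_extractor(pdf_path=path, s3_bucket=s3_bucket).extract_and_upload()
        except Exception as exc:  # broad on purpose: exception types are not documented
            failures[path] = str(exc)
    return failures


# Usage with the real class (requires the package and AWS credentials):
#   from pdf_image_retrieval import PdfImageExtractor
#   failed = upload_all(["a.pdf", "b.pdf"], "my-bucket", PdfImageExtractor)
```

Passing the class in as a parameter keeps the helper testable without network access.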

Retrieving Relevant Images Based on a Query

from pdf_image_retrieval import PdfImageRetriever

# Initialize retriever
retriever = PdfImageRetriever(s3_bucket="my-bucket")

# Search for images based on keywords
images = retriever.search_images("machine learning diagram")

# Print retrieved image URLs
for img in images:
    print(img)
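
How `search_images` matches a query to images is not specified in this section. As a rough illustration of keyword-based retrieval in general, the sketch below ranks images by how many query words appear in each image's keyword list. The function `rank_images`, the index layout, and the S3 keys are assumptions for illustration only.

```python
def rank_images(query, image_keywords):
    """Rank image keys by how many query words appear in each image's keywords."""
    query_terms = set(query.lower().split())
    scored = []
    for key, keywords in image_keywords.items():
        overlap = len(query_terms & {k.lower() for k in keywords})
        if overlap:  # skip images that share no keyword with the query
            scored.append((overlap, key))
    # Highest overlap first; ties are broken alphabetically by key.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [key for _, key in scored]


# Hypothetical index: S3 object key -> keywords associated with that image.
index = {
    "sample/page1_img0.png": ["machine", "learning", "pipeline"],
    "sample/page3_img1.png": ["diagram", "network"],
    "sample/page7_img0.png": ["revenue", "chart"],
}
print(rank_images("machine learning diagram", index))
```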

Configuration

Set up AWS credentials via environment variables:

export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_REGION="your-region"
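
A quick way to confirm these variables are set before running the package is sketched below. The helper `missing_aws_vars` is hypothetical and simply checks the three names listed above.

```python
import os

# The three variable names listed in the Configuration section.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION")


def missing_aws_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_aws_vars()
if missing:
    print("Missing AWS settings:", ", ".join(missing))
```

If the package relies on boto3's default configuration, note that boto3 reads the region from `AWS_DEFAULT_REGION`; whether this package reads `AWS_REGION` itself is not stated here, so setting both may be the safer choice.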

Contributing

We welcome contributions! Here's how you can help:

  1. Fork the repo and create a new branch.

  2. Make your changes and commit them.

  3. Open a pull request with a description of your changes.

To set up for development:

git clone https://github.com/aryandhandhukiya/pdf-image-retrieval.git
cd pdf-image-retrieval
pip install -r requirements.txt
