Extracts images from PDFs, stores them in S3, and retrieves them by keyword search

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

PDF Image Retrieval

Overview

This Python package extracts images from PDFs, stores them in AWS S3, and retrieves relevant images by matching a search query against keywords extracted from the PDFs.

Features

- Extract images from PDFs automatically
- Upload images to AWS S3 for storage
- Extract keywords from PDFs using NLP
- Train on PDFs to improve retrieval accuracy
- Search and retrieve images based on keywords
- Open-source and developer-friendly
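
To give a feel for what keyword extraction can look like, here is a minimal sketch that ranks words by frequency after dropping stop words. It is not the package's implementation, which this page does not describe; the function name `extract_keywords` and the small stop-word list are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import List

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "by", "for", "from",
    "in", "is", "of", "on", "the", "to", "with",
}


def extract_keywords(text: str, top_n: int = 5) -> List[str]:
    """Return up to top_n of the most frequent non-stop-words in text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    # most_common keeps first-seen order for words with equal counts.
    return [word for word, _ in counts.most_common(top_n)]


sample = "Neural networks learn features. Neural networks need data."
print(extract_keywords(sample, top_n=2))
```

Frequency counting is the simplest possible baseline; it is shown here only because it needs nothing beyond the standard library.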

Installation

Install the package from PyPI:

pip install pdf-image-retrieval

Or install directly from GitHub:

pip install git+https://github.com/aryadhandhukiya/pdf-image-retrieval.git
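
After installing, one way to confirm that the distribution is visible to the current interpreter is to query its metadata. The sketch below uses only the standard library; the helper name `installed_version` is an assumption made for this example, not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# "pdf-image-retrieval" is the distribution name used in the pip command above.
found = installed_version("pdf-image-retrieval")
print(found if found else "pdf-image-retrieval is not installed")
```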

Usage

Extracting Images from a PDF and Storing in S3

from pdf_image_retrieval import PdfImageExtractor

# Initialize extractor
extractor = PdfImageExtractor(pdf_path="sample.pdf", s3_bucket="my-bucket")

# Extract images and store them in S3
extractor.extract_and_upload()

Retrieving Relevant Images Based on a Query

from pdf_image_retrieval import PdfImageRetriever

# Initialize retriever
retriever = PdfImageRetriever(s3_bucket="my-bucket")

# Search for images based on keywords
images = retriever.search_images("machine learning diagram")

# Print retrieved image URLs
for img in images:
    print(img)
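
How `search_images` ranks results is not documented on this page. As a rough mental model only, keyword-based retrieval can be sketched as an inverted index that maps each keyword to the image keys tagged with it and scores images by how many query words they match. The class `KeywordIndex` and the image keys below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class KeywordIndex:
    """Hypothetical in-memory inverted index: keyword -> set of image keys."""

    def __init__(self) -> None:
        self._keys_by_word: Dict[str, Set[str]] = defaultdict(set)

    def add(self, image_key: str, keywords: Iterable[str]) -> None:
        """Register an image under each of its keywords (case-insensitive)."""
        for word in keywords:
            self._keys_by_word[word.lower()].add(image_key)

    def search(self, query: str) -> List[str]:
        """Return image keys ordered by number of matching query words."""
        scores: Dict[str, int] = defaultdict(int)
        for word in set(query.lower().split()):
            for image_key in self._keys_by_word.get(word, ()):
                scores[image_key] += 1
        # Highest score first; ties broken alphabetically for stable output.
        return sorted(scores, key=lambda key: (-scores[key], key))


index = KeywordIndex()
index.add("images/page3_img1.png", ["machine", "learning", "diagram"])
index.add("images/page7_img2.png", ["learning", "curve"])
print(index.search("machine learning diagram"))
```

An exact-match index like this misses synonyms and word variants; it is meant to illustrate the idea of keyword lookup, not to predict the package's results.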

Configuration

Set up AWS credentials via environment variables:

export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_REGION="your-region"
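
Before running an extraction, it can help to fail early if any of these variables is unset. The following is a small sketch using only the standard library; the helper `missing_aws_vars` is an assumption for illustration, and the variable names are taken from the exports above.

```python
import os
from typing import List, Mapping

# Variable names as listed in the configuration section above.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION")


def missing_aws_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_aws_vars()
if missing:
    print("Missing AWS settings: " + ", ".join(missing))
else:
    print("All AWS settings are present.")
```

Note that some AWS tooling reads the region from `AWS_DEFAULT_REGION` rather than `AWS_REGION`; if uploads fail with a region error, setting both may be worth trying.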

Contributing

We welcome contributions! Here's how you can help:

- Fork the repo and create a new branch.
- Make your changes and commit them.
- Open a pull request with a description of your changes.

To set up for development:

git clone https://github.com/aryandhandhukiya/pdf-image-retrieval.git
cd pdf-image-retrieval
pip install -r requirements.txt

Project details

Release history

- 0.1.1 (this version): Nov 2, 2025
- 0.1.0: Mar 22, 2025

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

pdf_image_retrieval-0.1.1-py3-none-any.whl (8.3 kB)

Uploaded Nov 2, 2025 Python 3

File details

Details for the file pdf_image_retrieval-0.1.1-py3-none-any.whl.

File metadata

Download URL: pdf_image_retrieval-0.1.1-py3-none-any.whl
Upload date: Nov 2, 2025
Size: 8.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.4

File hashes

Hashes for pdf_image_retrieval-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b2c0a6a7946c868d2ad2bca872fc5dedfb8ce606ae4655ba6b4e91197b877ded`
MD5	`e584d1dbf821ab753cb454c3d9240976`
BLAKE2b-256	`e813d5f36a014527cf9881e438d89029cc577c7f69f9822916b62d89bf10a893`
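
A downloaded wheel can be checked against the SHA256 digest above with the standard library. This is a minimal sketch; the helper names `sha256_of` and `matches_expected` are assumptions, and the file is assumed to be in the current directory.

```python
import hashlib
from pathlib import Path

# SHA256 digest published for pdf_image_retrieval-0.1.1-py3-none-any.whl.
EXPECTED_SHA256 = "b2c0a6a7946c868d2ad2bca872fc5dedfb8ce606ae4655ba6b4e91197b877ded"


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Compare a file's SHA256 digest with the expected hex string."""
    return sha256_of(path) == expected.lower()


wheel = Path("pdf_image_retrieval-0.1.1-py3-none-any.whl")
if wheel.exists():
    print("hash OK" if matches_expected(wheel) else "hash MISMATCH")
else:
    print(f"{wheel} not found; download it first")
```

pip can also enforce this check itself when a requirements file pins the hash and `--require-hashes` is used.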
