Skip to main content

A tool to extract PDF files to markdown, or any other format using AI

Project description

AIPDF: Simple PDF OCR with GPT-like Multimodal Models

Screw traditional OCRs or heavy libraries to get data from PDFs, GenAI does a better job!

AIPDF is a stand-alone, minimalistic, yet powerful pure Python library that leverages multi-modal gen AI models (OpenAI, llama3 or compatible alternatives) to extract data from PDFs and convert it into various formats such as Markdown or JSON.

Installation

pip install aipdf

in macOS you will need to install poppler

brew install poppler 

Quick Start

from aipdf.ocr import ocr

# Your OpenAI API key   
api_key = 'your_openai_api_key'

file = open('somepdf.pdf', 'rb')
markdown_pages = ocr(file, api_key, prompt="extract markdown, extract tables and turn charts into tables")

Ollama

You can use with any ollama multi-modal models

ocr(pdf_file, api_key='ollama', model="llama3.2", base_url= 'http://localhost:11434/v1', prompt=DEFAULT_PROMPT)

Any file system

We chose that you pass a file object, because that way it is flexible for you to use this with any type of file system, s3, localfiles, urls etc

From url

pdf_file = io.BytesIO(requests.get('https://arxiv.org/pdf/2410.02467').content)

# extract markdown
pages = ocr(pdf_file, api_key, prompt="extract tables and turn charts into tables, return each table in json")

From S3

s3 = boto3.client('s3', config=Config(signature_version='s3v4'),
                  aws_access_key_id=access_token,
                  aws_secret_access_key='', # Not needed for token-based auth
                  aws_session_token=access_token)


pdf_file = io.BytesIO(s3.get_object(Bucket=bucket_name, Key=object_key)['Body'].read())
# extract markdown
pages = ocr(pdf_file, api_key, prompt="extract tables and turn charts into tables, return each table in json")

Why AIPDF?

  1. Simplicity: AIPDF provides a straightforward function, it requires minimal setup, dependencies and configuration.
  2. Flexibility: Extract data into Markdown, JSON, HTML, YAML, whatever... file format and schema.
  3. Power of AI: Leverages state-of-the-art multi modal models (gpt, llama, ..).
  4. Customizable: Tailor the extraction process to your specific needs with custom prompts.
  5. Efficient: Utilizes parallel processing for faster extraction of multi-page PDFs.

Requirements

  • Python 3.7+

We will keep this super clean, only 3 required libraries:

  • openai library to talk to completion endpoints
  • pdf2image library (for PDF to image conversion)
  • Pillow (PIL) library

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Support

If you encounter any problems or have any questions, please open an issue on the GitHub repository.


AIPDF makes PDF data extraction simple, flexible, and powerful. Try it out and simplify your PDF processing workflow today!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

aipdf-0.0.1.tar.gz (6.4 kB view details)

Uploaded Source

Built Distribution

aipdf-0.0.1-py3-none-any.whl (5.5 kB view details)

Uploaded Python 3

File details

Details for the file aipdf-0.0.1.tar.gz.

File metadata

  • Download URL: aipdf-0.0.1.tar.gz
  • Upload date:
  • Size: 6.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.15

File hashes

Hashes for aipdf-0.0.1.tar.gz
Algorithm Hash digest
SHA256 49eb1444a533bb802238476d117f7e82005b111020f32aee4a6e9741a6263fce
MD5 0de7c9e6184d64d8b16c54c20fc151ad
BLAKE2b-256 fb8bf08b5373eb2fb5ab3dc8608f7cd69b8da9b5240e29b47d22de03d9466266

See more details on using hashes here.

File details

Details for the file aipdf-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: aipdf-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 5.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.15

File hashes

Hashes for aipdf-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 0618333f15f48b5d0d5d32871d3befa5cfff4977dc8dbfd6360dce38f967e081
MD5 0337abdb3e52a4d76eb3c3ecb5e3bd51
BLAKE2b-256 dc7ee8cf2d4afc0c44e1db8bee2dd39f284faafdde45858989a72e48a25a0ec5

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page