Skip to main content

A tool to extract PDF files to markdown, or any other format using AI

Project description

AIPDF: Simple PDF OCR with GPT-like Multimodal Models

Screw traditional OCRs or heavy libraries to get data from PDFs, GenAI does a better job!

AIPDF is a stand-alone, minimalistic, yet powerful pure Python library that leverages multi-modal gen AI models (OpenAI, llama3 or compatible alternatives) to extract data from PDFs and convert it into various formats such as Markdown or JSON.

Installation

pip install aipdf

in macOS you will need to install poppler

brew install poppler 

Quick Start

from aipdf import ocr

# Your OpenAI API key   
api_key = 'your_openai_api_key'

file = open('somepdf.pdf', 'rb')
markdown_pages = ocr(file, api_key)

Ollama

You can use with any ollama multi-modal models

ocr(pdf_file, api_key='ollama', model="llama3.2", base_url= 'http://localhost:11434/v1', prompt=...)

Any file system

We chose that you pass a file object, because that way it is flexible for you to use this with any type of file system, s3, localfiles, urls etc

From url

pdf_file = io.BytesIO(requests.get('https://arxiv.org/pdf/2410.02467').content)

# extract
pages = ocr(pdf_file, api_key, prompt="extract tables, return each table in json")

From S3

s3 = boto3.client('s3', config=Config(signature_version='s3v4'),
                  aws_access_key_id=access_token,
                  aws_secret_access_key='', # Not needed for token-based auth
                  aws_session_token=access_token)


pdf_file = io.BytesIO(s3.get_object(Bucket=bucket_name, Key=object_key)['Body'].read())
# extract 
pages = ocr(pdf_file, api_key, prompt="extract charts data, turn it into tables that represent the variables in the chart")

Why AIPDF?

  1. Simplicity: AIPDF provides a straightforward function, it requires minimal setup, dependencies and configuration.
  2. Flexibility: Extract data into Markdown, JSON, HTML, YAML, whatever... file format and schema.
  3. Power of AI: Leverages state-of-the-art multi modal models (gpt, llama, ..).
  4. Customizable: Tailor the extraction process to your specific needs with custom prompts.
  5. Efficient: Utilizes parallel processing for faster extraction of multi-page PDFs.

Requirements

  • Python 3.7+

We will keep this super clean, only 3 required libraries:

  • openai library to talk to completion endpoints
  • pdf2image library (for PDF to image conversion)
  • Pillow (PIL) library

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Support

If you encounter any problems or have any questions, please open an issue on the GitHub repository.


AIPDF makes PDF data extraction simple, flexible, and powerful. Try it out and simplify your PDF processing workflow today!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

aipdf-0.0.4.tar.gz (6.8 kB view details)

Uploaded Source

Built Distribution

aipdf-0.0.4-py3-none-any.whl (5.8 kB view details)

Uploaded Python 3

File details

Details for the file aipdf-0.0.4.tar.gz.

File metadata

  • Download URL: aipdf-0.0.4.tar.gz
  • Upload date:
  • Size: 6.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.15

File hashes

Hashes for aipdf-0.0.4.tar.gz
Algorithm Hash digest
SHA256 d0fce45c256b23c50ef13fb5095dae114364e994bf6f2ac9821aaad358fd7578
MD5 5407e1757199c25b6dcf03761d1d2875
BLAKE2b-256 34f985a89f59e8cc8d10a5b4ad5f7217cc2562388036548b8bcd21a9e0836433

See more details on using hashes here.

File details

Details for the file aipdf-0.0.4-py3-none-any.whl.

File metadata

  • Download URL: aipdf-0.0.4-py3-none-any.whl
  • Upload date:
  • Size: 5.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.15

File hashes

Hashes for aipdf-0.0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 65de523e9ce064982546525bf90021d217c01b8b9655ac3f855594ba52d8f271
MD5 b3a8fdad17378f19893735e636ac1faf
BLAKE2b-256 d2af773140ec0643c11ad1d68ecb3c9fb942ffbfd46ce909bb3aa8aa2f0bc8df

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page