Turn PDFs into image files for use in machine learning projects

pdfsplitter

A simple way to extract and parse images for machine learning workflows.

Install

pip install --upgrade pdfsplitter

How to use

The highest-level function for exporting image files from a series of PDFs is extract_images_from_pdfs, which takes all the PDF files inside a source directory and extracts their images to a destination directory. You can optionally specify the image file type for the exported images, as in this example:

from pathlib import Path

# The pdfsplitter functions used below are assumed to be imported already.

source = Path("./tryout/")
destination = Path("./tryout/processed")

# download all the PDFs linked from a given web page into the source directory
download_pdf_files(
    get_pdf_links("https://open.defense.gov/Transparency/FOIA.aspx"), "./tryout"
)

# extract all the images from the downloaded PDFs and save them as JPGs in the destination directory
extract_images_from_pdfs(source, destination, "jpg")

# display stats on the downloaded PDF files
display_stats(get_stats(source))
                                  Stats for your PDF Files                                   
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ PageCou…  Filename                                       ocr_lay…  pdf_fil…  author   ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│       27  2014_ACFO_Report_FINAL_REPORT.pdf              False     236655    Stephan… │
│          │                                               │          │          │ Carr     │
│        3  7-26-2013_Determination.pdf                    False     214683             │
│        2  DA Determination-DCRIT Hawaii Water Wells.pdf  False     115574             │
│        3  12-18-14_Determination.pdf                     False     50925              │
│        4  6-1-2012_Determination.pdf                     False     463902             │
│        2  8-19-2021_Determination.pdf                    False     350438             │
│       15  2012_ACFO_Report_FINAL_REPORT.pdf              False     242305    CarrS    │
│        3  2-12-2014_Determination.pdf                    False     23823     timothy… │
│        2  DA%20Determination%20DoD%20Flights.pdf         False     111521             │
│       22  2013_ACFO_Report_FINAL_REPORT.pdf              False     258462    CarrS    │
│        2  2-15-2018_Determination.pdf                    False     342195             │
│       49  DoDFY2020AnnualFOIA_Report.pdf                 False     1247446            │
│        3  7-5-2019_Determination.pdf                     False     204453             │
│       30  2017_DoD_Chief_FOIA_Officer_Report.pdf         False     4810077            │
│       28  2021_DoD_Chief_FOIA_Officer_Report.pdf         False     1131474            │
│       10  2011_DoD_Chief_FOIA_OfficerReport.pdf          False     113387    CarrS    │
│       27  2018_DoD_Chief_FOIA_Officer_Report.pdf         False     788227    brandoct │
│        2  8-3-15_Determination.pdf                       False     105563             │
│        3  1-21-2016_Determination.pdf                    False     122706             │
│        2  12-6-2017_Determination.pdf                    False     189563    deleonv  │
│        2  12-18-2018_Determination.pdf                   False     153675             │
│       30  2016_ACFO_Report_FINAL_REPORT.pdf              False     1108008            │
│        2  11-29-2017_Determination.pdf                   False     369290             │
│        2  DoD SAP IT DCRIT Determination.pdf             False     127858             │
│        3  10-19-2018_Determination.pdf                   False     70088     JAMES    │
│          │                                               │          │          │ HOGAN    │
│       30  2015_ACFO_Report_FINAL_REPORT.pdf              False     287445    Stephan… │
│          │                                               │          │          │ Carr     │
│        3  7-31-2020_Determination.pdf                    False     88447     Dziecic… │
│          │                                               │          │          │ Gerald J │
│          │                                               │          │          │ Jr CIV   │
│          │                                               │          │          │ OSD OGC  │
│          │                                               │          │          │ (USA)    │
└──────────┴───────────────────────────────────────────────┴──────────┴──────────┴──────────┘
TOTAL PAGECOUNT: 311
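
The reported total can be checked against the page-count column of the table. The following is a minimal sketch, assuming the values are copied by hand from the rows above in order; the name page_counts is illustrative and not part of pdfsplitter.

```python
# Page counts copied from the first column of the table above, in row order.
page_counts = [
    27, 3, 2, 3, 4, 2, 15, 3, 2, 22, 2, 49, 3, 30,
    28, 10, 27, 2, 3, 2, 2, 30, 2, 2, 3, 30, 3,
]

# Summing the column reproduces the total shown under the table.
total_pagecount = sum(page_counts)
print(f"TOTAL PAGECOUNT: {total_pagecount}")
```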

What is pdfsplitter?

Features

  • statistics generation
  • image extraction
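
To give a rough idea of what statistics generation over a directory of PDFs involves, here is a minimal standard-library sketch. It is not pdfsplitter's implementation: the function name collect_file_stats and the size_bytes key are hypothetical, and it only gathers each file's name and size in bytes, whereas get_stats also reports fields such as page count and author.

```python
from pathlib import Path


def collect_file_stats(source: Path) -> list[dict]:
    """Return one dict per PDF in `source` with its name and size in bytes.

    Hypothetical helper for illustration; not part of pdfsplitter.
    """
    stats = []
    # Only files directly inside `source` with a lowercase .pdf suffix are matched.
    for pdf_path in sorted(source.glob("*.pdf")):
        stats.append(
            {
                "Filename": pdf_path.name,
                "size_bytes": pdf_path.stat().st_size,
            }
        )
    return stats
```

Calling collect_file_stats(Path("./tryout/")) would return a list that could then be summed or tabulated, much like the table shown under How to use.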

Download files

Source Distribution

pdfsplitter-0.0.5.tar.gz (14.5 kB)

Built Distribution

pdfsplitter-0.0.5-py3-none-any.whl (11.8 kB)
