Turn PDFs into image files for use in machine learning projects
pdfsplitter
A simple way to extract and parse images for machine learning workflows.
Install
pip install --upgrade pdfsplitter
How to use
The highest-level function for exporting image files from a series of PDFs is extract_images_from_pdfs, which takes all the PDF files inside a source directory and extracts their images to a destination directory. You can also specify the image file type for the exported images, as in this example:
from pathlib import Path

# NOTE: the pdfsplitter functions used below must also be imported from the
# package; the import line is not shown in this README.
source = Path("./tryout/")
destination = Path("./tryout/processed")
# download all the PDFs linked from a particular page
download_pdf_files(
    get_pdf_links("https://open.defense.gov/Transparency/FOIA.aspx"), "./tryout"
)
# extract all the images from the downloaded PDFs and save them to a directory
extract_images_from_pdfs(source, destination, "jpg")
# get stats on the downloaded PDF files
display_stats(get_stats(source))
Stats for your PDF Files
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ PageCou… ┃ Filename                                      ┃ ocr_lay… ┃ pdf_fil… ┃ author   ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ 27       │ 2014_ACFO_Report_FINAL_REPORT.pdf             │ False    │ 236655   │ Stephan… │
│          │                                               │          │          │ Carr     │
│ 3        │ 7-26-2013_Determination.pdf                   │ False    │ 214683   │          │
│ 2        │ DA Determination-DCRIT Hawaii Water Wells.pdf │ False    │ 115574   │          │
│ 3        │ 12-18-14_Determination.pdf                    │ False    │ 50925    │          │
│ 4        │ 6-1-2012_Determination.pdf                    │ False    │ 463902   │          │
│ 2        │ 8-19-2021_Determination.pdf                   │ False    │ 350438   │          │
│ 15       │ 2012_ACFO_Report_FINAL_REPORT.pdf             │ False    │ 242305   │ CarrS    │
│ 3        │ 2-12-2014_Determination.pdf                   │ False    │ 23823    │ timothy… │
│ 2        │ DA%20Determination%20DoD%20Flights.pdf        │ False    │ 111521   │          │
│ 22       │ 2013_ACFO_Report_FINAL_REPORT.pdf             │ False    │ 258462   │ CarrS    │
│ 2        │ 2-15-2018_Determination.pdf                   │ False    │ 342195   │          │
│ 49       │ DoDFY2020AnnualFOIA_Report.pdf                │ False    │ 1247446  │          │
│ 3        │ 7-5-2019_Determination.pdf                    │ False    │ 204453   │          │
│ 30       │ 2017_DoD_Chief_FOIA_Officer_Report.pdf        │ False    │ 4810077  │          │
│ 28       │ 2021_DoD_Chief_FOIA_Officer_Report.pdf        │ False    │ 1131474  │          │
│ 10       │ 2011_DoD_Chief_FOIA_OfficerReport.pdf         │ False    │ 113387   │ CarrS    │
│ 27       │ 2018_DoD_Chief_FOIA_Officer_Report.pdf        │ False    │ 788227   │ brandoct │
│ 2        │ 8-3-15_Determination.pdf                      │ False    │ 105563   │          │
│ 3        │ 1-21-2016_Determination.pdf                   │ False    │ 122706   │          │
│ 2        │ 12-6-2017_Determination.pdf                   │ False    │ 189563   │ deleonv  │
│ 2        │ 12-18-2018_Determination.pdf                  │ False    │ 153675   │          │
│ 30       │ 2016_ACFO_Report_FINAL_REPORT.pdf             │ False    │ 1108008  │          │
│ 2        │ 11-29-2017_Determination.pdf                  │ False    │ 369290   │          │
│ 2        │ DoD SAP IT DCRIT Determination.pdf            │ False    │ 127858   │          │
│ 3        │ 10-19-2018_Determination.pdf                  │ False    │ 70088    │ JAMES    │
│          │                                               │          │          │ HOGAN    │
│ 30       │ 2015_ACFO_Report_FINAL_REPORT.pdf             │ False    │ 287445   │ Stephan… │
│          │                                               │          │          │ Carr     │
│ 3        │ 7-31-2020_Determination.pdf                   │ False    │ 88447    │ Dziecic… │
│          │                                               │          │          │ Gerald J │
│          │                                               │          │          │ Jr CIV   │
│          │                                               │          │          │ OSD OGC  │
│          │                                               │          │          │ (USA)    │
└──────────┴───────────────────────────────────────────────┴──────────┴──────────┴──────────┘
TOTAL PAGECOUNT: 311
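The example above starts by collecting PDF links from a web page before downloading them. As a rough illustration of that step only, here is a minimal sketch using the Python standard library. It is not pdfsplitter's implementation: the names PdfLinkCollector and find_pdf_links are hypothetical, and the sketch works on an HTML string you have already fetched rather than on a URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Hypothetical helper: gathers <a href="..."> targets that end in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags carry the links we care about.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Match the extension case-insensitively and resolve relative links
        # against the page URL so every result is absolute.
        if href and href.lower().endswith(".pdf"):
            self.links.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    """Return absolute URLs of all PDF links found in an HTML string."""
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Usage on a small inline HTML snippet (example.org is a placeholder domain).
sample = '<a href="/reports/2021.pdf">Report</a> <a href="/about.html">About</a>'
print(find_pdf_links(sample, "https://example.org/foia/"))
```

Matching on the file extension is a simplification; links that serve PDFs from URLs without a .pdf suffix would be missed by this sketch.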
What is pdfsplitter?
Features
- statistics generation
- image extraction
Roadmap
Download files
Source Distribution
pdfsplitter-0.0.5.tar.gz (14.5 kB)
Built Distribution
Hashes for pdfsplitter-0.0.5-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | eed9abf748bbc45e571e747e21a7d822b10e5556368b24e31483bb433f68e3d6
MD5 | 3848b6bf9b0af9aef3dae03234a825c8
BLAKE2b-256 | 5b8b0986c67b124ea31a6b27e00ef0b19615476407783bc99e2f792784ee319c