Extracts data from PDF files and saves it to Excel files.
📄 pdfsp
pdfsp is a Python package that extracts tables from PDF files and saves them to Excel. It also provides a simple Streamlit app for interactive viewing of the extracted data.
🚀 Features
- Extracts tabular data from PDFs using pdfplumber
- Converts tables into pandas DataFrames
- Saves output as .xlsx Excel files using openpyxl
- Ensures column names are unique to prevent issues
- Visualizes DataFrames with streamlit
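Unique column names matter because tables pulled from PDFs often repeat a header or leave header cells blank, and duplicate labels make column selection in pandas ambiguous. The following is a minimal sketch of one way to deduplicate names. The helper name make_unique and the suffix scheme are illustrative assumptions, not pdfsp's actual implementation.

```python
def make_unique(names):
    """Return a copy of `names` where repeated or empty labels get a suffix.

    Hypothetical helper for illustration; pdfsp may use a different scheme.
    """
    seen = set()
    result = []
    for index, raw in enumerate(names):
        # Header cells extracted from PDFs can be None or blank.
        base = str(raw).strip() if raw is not None else ""
        if not base:
            base = f"column_{index}"
        candidate = base
        suffix = 1
        # Bump the suffix until the label has not been used yet.
        while candidate in seen:
            candidate = f"{base}_{suffix}"
            suffix += 1
        seen.add(candidate)
        result.append(candidate)
    return result


print(make_unique(["Date", "Amount", "Amount", None, "Date"]))
# ['Date', 'Amount', 'Amount_1', 'column_3', 'Date_1']
```

The resulting list can be assigned to a DataFrame's columns before the table is written to Excel.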
📦 Installation
Make sure you're using Python 3.10 or newer, then install with:
pip install pdfsp -U
Python script
# pdf.py
from pdfsp import extract_tables, Options
# Define extraction options
source_folder = "."
output_folder = "output"
combine_tables = True
options = Options(
    source_folder=source_folder,
    output_folder=output_folder,
    combine=combine_tables,
)
# Run the table extraction
extract_tables(options)
From console / Terminal / Command Line
# Extract all tables from all PDF files in the current folder and save them to the current folder
pdfsp . .
# Extract and COMBINE large tables (spanning multiple pages) into single files, saved to the current folder
pdfsp . . --combine
# Extract and COMBINE tables, skipping the first row of each table (e.g., header rows)
pdfsp . . --combine --skiprows=1
# Extract all tables from PDF files in 'someFolder' and save them to 'SomeOutFolder'
pdfsp someFolder SomeOutFolder
# Extract all tables from 'some.pdf' and save them to the current folder
pdfsp some.pdf .
# Extract all tables from 'some.pdf' and save them to 'toThisFolder'
pdfsp some.pdf toThisFolder
=== 📊 Extraction Summary Report ===
✅ Successful Files: 3
- pdfs/report1.pdf → 🗂️ 5 tables extracted
- pdfs/summary2.pdf → 🗂️ 3 tables extracted
- pdfs/report2.pdf → 🗂️ 7 tables extracted
❌ Failed Files: 1
- pdfs/corrupted.pdf
⚠️ Some files failed to process. See details above.
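To picture what the --combine and --skiprows options in the commands above do, here is a rough sketch in plain Python. It assumes each table is a list of rows, that consecutive tables with the same number of columns are continuations of one another, and that skiprows drops the leading rows of every table. The function name combine_tables is hypothetical, and pdfsp's real matching rules may differ.

```python
def combine_tables(tables, skiprows=0):
    """Merge consecutive tables that share a column count into one table.

    Hypothetical illustration only: `tables` is a list of tables in page
    order, each table being a list of rows (lists of cell values).
    """
    combined = []
    for table in tables:
        rows = table[skiprows:]  # drop the leading rows of every table
        if not rows:
            continue
        width = len(rows[0])
        # Assumption: same width as the previous table means "continuation".
        if combined and len(combined[-1][0]) == width:
            combined[-1].extend(rows)
        else:
            combined.append(list(rows))
    return combined


pages = [
    [["Item", "Qty"], ["Apple", "3"]],                   # table on page 1
    [["Item", "Qty"], ["Pear", "5"]],                    # continuation on page 2
    [["Name", "City", "Zip"], ["Ann", "Oslo", "0150"]],  # unrelated table
]
print(combine_tables(pages, skiprows=1))
# [[['Apple', '3'], ['Pear', '5']], [['Ann', 'Oslo', '0150']]]
```

Matching on column count alone is a deliberately simple heuristic; two unrelated tables of the same width on adjacent pages would also be merged under this assumption.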