A very fast and efficient text and image PDF extractor.
PdfTextract
A very fast and efficient Python PDF text and image extractor built on the Xpdf C++ library.
Features
- Several times faster than any Python-based PDF text extractor
- Very easy and simple to use
- Extracts text while maintaining the original document layout
- Tries to automatically extract tables if they exist (still in beta)
- No local server setup required
- No dependencies needed
Installation
Install via PyPI:
pip install pdftextract
or via GitHub:
- first, clone the repo:
git clone https://github.com/Bnilss/pdftextract.git
- then install it from inside the cloned directory:
cd pdftextract
python setup.py install
Usage
- Importing the package
from pdftextract import XPdf
file_path = "examples/pubmed_example.pdf"
pdf = XPdf(file_path)
- Get the PDF metadata
print(pdf.info) # a dict of PDF metadata (author, size, pages, ...)
# to get the number of pages for example
print(pdf.info['Pages'])
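As a rough illustration of where such a dict could come from: the Xpdf pdfinfo tool prints one "Key: value" pair per line, and a wrapper could parse that output along the lines below. This is only a sketch, not pdftextract's actual code; `parse_info` and the sample text are hypothetical. Whether XPdf returns the page count as a string or an integer is not stated here, so the example converts it explicitly.

```python
def parse_info(raw: str) -> dict:
    """Turn 'Key: value' lines into a dict, splitting on the first colon only."""
    info = {}
    for line in raw.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that contain no colon
            info[key.strip()] = value.strip()
    return info


# Hypothetical sample in the style of pdfinfo output.
sample = (
    "Title:          Example\n"
    "Author:         Jane Doe\n"
    "Pages:          12\n"
    "CreationDate:   Mon Jan  4 10:15:00 2021\n"
)

info = parse_info(sample)
n_pages = int(info["Pages"])  # values are plain strings, so convert when needed
print(n_pages)
```

Splitting on the first colon only keeps values that contain colons themselves, such as timestamps, in one piece.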
- Extracting text from all pages of a PDF and returning it as a string
txt = pdf.to_text()
print(txt)
- Extracting a single page only, for example the 3rd page
# we can extract it using the previous method (page numbers start at 1)
txt = pdf.to_text(just_one=3)
# or use bracket notation (indices start at 0)
txt = pdf[2]
- Extracting text from a single page (page 7) and saving it to a .txt file
pdf.to_text("page7.txt", just_one=7)
- Extracting text from pages 1 to 5
txt = pdf.to_text(start=1, end=5)
# or use slice notation (indices start at 0, end is exclusive)
txt = pdf[:5]
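The two numbering schemes above are easy to mix up: `to_text` counts pages from 1, while bracket notation counts from 0. The sketch below shows one way the mapping between them could work. `to_page_range` is a hypothetical helper written for illustration, not part of pdftextract, and the library's own handling of negative or stepped slices may differ.

```python
def to_page_range(key, n_pages):
    """Map a 0-based index or slice to an inclusive 1-based (start, end) pair."""
    if isinstance(key, int):
        if key < 0:  # allow negative indices, as with lists
            key += n_pages
        if not 0 <= key < n_pages:
            raise IndexError("page index out of range")
        return key + 1, key + 1

    # slice.indices() clamps the bounds to the document length
    start, stop, step = key.indices(n_pages)
    if step != 1:
        raise ValueError("only contiguous page ranges are supported")
    if start >= stop:
        raise ValueError("empty page range")
    # 0-based exclusive stop equals 1-based inclusive end
    return start + 1, stop


print(to_page_range(2, 10))               # pdf[2]  -> page 3 only
print(to_page_range(slice(None, 5), 10))  # pdf[:5] -> pages 1 to 5
```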
- Extract tables
pdf = XPdf("examples/table_sample.pdf")
txt = pdf.to_text(table=True)
# then use a regex or similar to parse the text ...
# or try automatic parsing (still in beta)
tables = pdf.table[:]
print(len(tables)) # 3
print(tables[0]) # print the formatted content of table 1
#Number of Coils | Number of Paperclips
#______________________________________
# 5 | 3, 5, 4
# 10 | 7, 8, 6
# 15 | 11, 10, 12
# 20 | 15, 13, 14
table1_data = tables[0].data # all rows of the table, excluding the header
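If automatic parsing does not suit a particular document, the layout-preserving text can be parsed by hand. The sketch below assumes the text looks like the formatted table printed above (cells separated by "|", with a rule of underscores under the header) and uses plain string splitting rather than a regex. `parse_table` is a hypothetical helper, not part of pdftextract, and real `to_text(table=True)` output may need different handling.

```python
def parse_table(text):
    """Split pipe-separated table text into (headers, rows)."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # skip blank lines and horizontal rules made of "_" or "-"
        if not line or set(line) <= set("_-"):
            continue
        rows.append([cell.strip() for cell in line.split("|")])
    if not rows:
        return [], []
    return rows[0], rows[1:]


# Sample text copied from the formatted table shown above.
sample = """\
Number of Coils | Number of Paperclips
______________________________________
5 | 3, 5, 4
10 | 7, 8, 6
15 | 11, 10, 12
20 | 15, 13, 14
"""

headers, data = parse_table(sample)
print(headers)
print(data[0])
```

Cells are kept as strings, so a value such as "3, 5, 4" still has to be split and converted if numbers are needed.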
OS support
By default, the package supports Windows.
To use it on Linux or macOS:
- Download the Xpdf command-line tools for Linux or macOS
- Extract the archive and go to the folder matching your OS version (32/64 bit)
- Copy these files: pdftotext, pdfinfo, pdfimages
- Go to packages-directory/pdftextract/xpdf
- Paste the files into that directory
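The copy steps above can also be scripted. The sketch below is one possible way to do it, using only the standard library. `install_xpdf_tools` is a hypothetical helper, not something shipped with pdftextract, and both directory arguments are placeholders that depend on where the Xpdf archive was extracted and where the package is installed.

```python
import shutil
import stat
from pathlib import Path

# The three Xpdf tools named in the steps above.
TOOLS = ("pdftotext", "pdfinfo", "pdfimages")


def install_xpdf_tools(src_dir, dest_dir):
    """Copy the Xpdf tools from src_dir to dest_dir and mark them executable."""
    src, dest = Path(src_dir), Path(dest_dir)
    missing = [name for name in TOOLS if not (src / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {src}: {', '.join(missing)}")

    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in TOOLS:
        target = dest / name
        shutil.copy2(src / name, target)
        # make sure the execute bits survive the copy
        mode = target.stat().st_mode
        target.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
        copied.append(target)
    return copied


# Example call (both paths are placeholders):
# install_xpdf_tools("path/to/extracted/xpdf/bin64",
#                    "packages-directory/pdftextract/xpdf")
```

Assuming the package layout described above, the destination can usually be located from Python with `Path(pdftextract.__file__).parent / "xpdf"` after importing `pdftextract`.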
Credits
- Xpdf (C++) by Derek Noonburg
License
pdftextract is licensed under the GNU General Public License (GPL), version 3.