PdfTextract
A very fast and efficient Python extractor for PDF text and images, built on the xpdf C++ library.
Features
- Several times faster than pure-Python PDF text extractors
- Easy and simple to use
- Extracts text while maintaining the original document layout
- Tries to automatically extract tables if they exist (still in beta)
- No local server setup required
- No dependencies needed
Installation
Install via PyPI:
pip install pdftextract
or via GitHub:
- first clone the repo:
git clone https://github.com/Bnilss/pdftextract.git
- then run, from inside the cloned directory:
python setup.py install
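After either install route, it can help to confirm that Python can actually find the package before moving on. The following is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, not part of pdftextract.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Prints True once the installation above has succeeded.
    print(is_installed("pdftextract"))
```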
Usage
- Importing the package
from pdftextract import XPdf
file_path = "examples/pubmed_example.pdf"
pdf = XPdf(file_path)
- Get the PDF metadata
print(pdf.info)  # returns a dict of PDF metadata (author, size, pages, ...)
# to get the number of pages, for example
print(pdf.info['Pages'])
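The type of each value in `pdf.info` is not stated here. Assuming the page count may come back either as an integer or as a numeric string, a small helper can normalise it before it is used in arithmetic or loops. `page_count` is a hypothetical name introduced for illustration.

```python
def page_count(info):
    """Return the number of pages from a metadata dict as an int.

    Assumes the dict has a 'Pages' key, as in the example above; the
    value may be an int or a numeric string.
    """
    return int(str(info["Pages"]).strip())
```

With the object created earlier, this would be called as `page_count(pdf.info)`.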
- Extracting text from all pages of a PDF and returning it as a string
txt = pdf.to_text()
print(txt)
- Extracting a single page only, for example the 3rd page
# we can extract it with the previous method (pages are counted from 1)
txt = pdf.to_text(just_one=3)
# or use the bracket notation (pages are indexed from 0)
txt = pdf[2]
- Extracting text from a single page (page 7) and saving it to .txt file
pdf.to_text("page7.txt", just_one=7)
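To save several pages, each to its own file, the single-page call can be wrapped in a loop. This is a minimal sketch: `export_pages` is a hypothetical helper, and it relies only on the `to_text(path, just_one=n)` call shown above.

```python
from pathlib import Path


def export_pages(pdf, pages, out_dir):
    """Write each requested page to its own .txt file and return the paths.

    `pdf` is any object with a to_text(path, just_one=n) method, such as
    the XPdf instance above; `pages` holds 1-based page numbers.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for n in pages:
        target = out / f"page{n}.txt"
        pdf.to_text(str(target), just_one=n)
        written.append(target)
    return written
```

For example, `export_pages(pdf, [1, 2, 7], "pages")` would be expected to produce `pages/page1.txt`, `pages/page2.txt` and `pages/page7.txt`.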
- Extracting text from pages 1 to 5
txt = pdf.to_text(start=1, end=5)
# or
txt = pdf[:5]
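The two styles above count pages differently: `to_text` numbers pages from 1, while the bracket notation indexes from 0. The sketch below spells out the conversion. It assumes the bracket notation follows ordinary Python sequence semantics, which the examples suggest but do not state; `page_to_index` and `range_to_slice` are hypothetical helper names.

```python
def page_to_index(page):
    """Convert a 1-based page number to a 0-based bracket index."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return page - 1


def range_to_slice(start, end):
    """Convert an inclusive 1-based page range to a slice.

    Pages 1 to 5 become slice(0, 5), which selects the same items as [:5].
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return slice(start - 1, end)
```

Under that assumption, `pdf[page_to_index(3)]` corresponds to `pdf.to_text(just_one=3)`, and `pdf[range_to_slice(1, 5)]` to `pdf.to_text(start=1, end=5)`.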
- Extract tables
pdf = XPdf("examples/table_sample.pdf")
txt = pdf.to_text(table=True)
# then use a regex or similar to parse the text
# or try automatic parsing (still in beta)
tables = pdf.table[:]
print(len(tables)) # 3
print(tables[0]) # print the formatted content of table 1
#Number of Coils | Number of Paperclips
#______________________________________
# 5 | 3, 5, 4
# 10 | 7, 8, 6
# 15 | 11, 10, 12
# 20 | 15, 13, 14
table1_data = tables[0].data # will return all rows in table except headers
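For the manual route (parsing the text yourself), a small parser is often enough. The exact layout returned by `to_text(table=True)` is not documented here, so this sketch assumes the pipe-separated layout of the printed table above; `parse_table_text` is a hypothetical helper and the sample string is copied from that printout.

```python
import re


def parse_table_text(text):
    """Parse pipe-separated table text into (headers, rows).

    Assumes one row per line, cells separated by '|', and an optional
    rule line made only of underscores or dashes under the header.
    """
    rule = re.compile(r"^[\s_\-|]+$")
    lines = [ln for ln in text.splitlines() if ln.strip() and not rule.match(ln)]
    cells = [[c.strip() for c in ln.split("|")] for ln in lines]
    if not cells:
        return [], []
    return cells[0], cells[1:]


sample = """\
Number of Coils | Number of Paperclips
______________________________________
 5 | 3, 5, 4
 10 | 7, 8, 6
"""
headers, rows = parse_table_text(sample)
print(headers)  # ['Number of Coils', 'Number of Paperclips']
print(rows)     # [['5', '3, 5, 4'], ['10', '7, 8, 6']]
```

All cells are kept as strings; converting them to numbers is left to the caller, since a cell such as `3, 5, 4` holds several values.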
OS support
By default the package supports Windows.
To use it on Linux or Mac:
- download the xpdf files for Linux or Mac
- extract the archive and go to the folder for your OS version (32/64 bit)
- copy these files: pdftotext, pdfinfo, pdfimages
- go to packages-directory/pdftextract/xpdf
- paste the files in that directory
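The copy steps above can also be scripted. The following is a minimal sketch using only the standard library; `install_xpdf_tools` is a hypothetical helper, and both directory arguments are placeholders you would fill in for your own system.

```python
import shutil
import stat
from pathlib import Path

XPDF_TOOLS = ("pdftotext", "pdfinfo", "pdfimages")


def install_xpdf_tools(src_dir, dest_dir, names=XPDF_TOOLS):
    """Copy the xpdf command-line tools into dest_dir and mark them executable.

    Raises FileNotFoundError if any tool is missing from src_dir.
    """
    src, dest = Path(src_dir), Path(dest_dir)
    missing = [n for n in names if not (src / n).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {src}: {', '.join(missing)}")
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        target = dest / name
        shutil.copy2(src / name, target)
        # Make sure the copied binary has its execute permission set.
        mode = target.stat().st_mode
        target.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
        copied.append(target)
    return copied
```

Once the package is installed, the destination can be located as `Path(pdftextract.__file__).parent / "xpdf"`, and the source is the 32- or 64-bit folder of the extracted xpdf archive (its exact name depends on the archive you downloaded).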
Credits
- xpdf (C++) by Derek Noonburg
License
pdftextract is licensed under the GNU General Public License (GPL), version 3.