A very fast and efficient PDF text and image extractor.

PdfTextract

A very fast and efficient Python PDF text and image extractor built on the Xpdf C++ library.

Features

Several times faster than any Python-based PDF text extractor
Very easy and simple to use
Extracts text while maintaining the original document layout
Tries to automatically extract tables if they exist (still in beta)
No local server setup required
No dependencies needed

Installation

Install via PyPI:

pip install pdftextract

Or via GitHub. First clone the repo:

git clone https://github.com/Bnilss/pdftextract.git

Then run the install command from inside the cloned directory:

cd pdftextract
python setup.py install

Usage

Importing the package

from pdftextract import XPdf

file_path = "examples/pubmed_example.pdf"
pdf = XPdf(file_path)

Getting the PDF metadata

print(pdf.info)  # prints a dict of PDF metadata (author, size, pages, ...)
# to get the number of pages, for example
print(pdf.info['Pages'])

Extracting text from all pages of a PDF and returning it as a string

txt = pdf.to_text()
print(txt)

Extracting a single page only, for example the 3rd page

# use the previous method (page numbers start at 1)
txt = pdf.to_text(just_one=3)
# or use bracket notation (indices start at 0)
txt = pdf[2]

Extracting text from a single page (page 7) and saving it to a .txt file

pdf.to_text("page7.txt", just_one=7)

Extracting text from pages 1 to 5

txt = pdf.to_text(start=1, end=5)
# or
txt = pdf[:5]
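
The two indexing conventions above (1-based for just_one, 0-based for brackets) are easy to mix up when looping over a whole document. The sketch below wraps the loop in a small helper. iter_pages is a hypothetical name, not part of pdftextract, and it assumes that pdf.info['Pages'] holds the page count in a form int() can convert. It only relies on the info and to_text(just_one=...) members shown earlier.

```python
from typing import Iterator, Tuple


def iter_pages(pdf) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, text) pairs, with page numbers starting at 1.

    `pdf` is expected to behave like the XPdf object used above: it needs
    an `info` dict with a 'Pages' entry and a `to_text(just_one=...)` method.
    """
    # Assumption: the page count may be stored as a string, so convert it.
    page_count = int(pdf.info["Pages"])
    for number in range(1, page_count + 1):
        # just_one is 1-based, unlike the 0-based bracket notation.
        yield number, pdf.to_text(just_one=number)


# Possible usage (requires pdftextract to be installed):
#
#     from pdftextract import XPdf
#     pdf = XPdf("examples/pubmed_example.pdf")
#     for number, text in iter_pages(pdf):
#         print(f"page {number}: {len(text)} characters")
```

Extracting page by page is slower than a single to_text() call, so this is mainly useful when each page needs separate handling.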

Extracting tables

pdf = XPdf("examples/table_sample.pdf")
txt = pdf.to_text(table=True)
# then use a regex or similar to parse the text
# or try automatic parsing (still in beta)
tables = pdf.table[:]
print(len(tables))  # 3
print(tables[0])  # prints the formatted content of table 1
# Number of Coils | Number of Paperclips
# ______________________________________
#        5        |       3, 5, 4
#       10        |       7, 8, 6
#       15        |      11, 10, 12
#       20        |      15, 13, 14
table1_data = tables[0].data  # returns all rows in the table except the headers
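
If the automatic table parser does not fit a document, the text can be parsed by hand. The sketch below handles the pipe-separated layout shown in the printed table above. parse_pipe_table is a hypothetical helper, not part of pdftextract, and the text returned by to_text(table=True) may be laid out differently, in which case the splitting rule would need adjusting.

```python
from typing import List, Tuple


def parse_pipe_table(text: str) -> Tuple[List[str], List[List[str]]]:
    """Split a pipe-separated text table into (headers, rows)."""
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and separator lines made only of '_' or '-'.
        if not stripped or set(stripped) <= {"_", "-"}:
            continue
        rows.append([cell.strip() for cell in stripped.split("|")])
    if not rows:
        return [], []
    # The first remaining line is treated as the header row.
    return rows[0], rows[1:]


sample = """\
Number of Coils | Number of Paperclips
______________________________________
       5        |       3, 5, 4
      10        |       7, 8, 6
"""

headers, data = parse_pipe_table(sample)
print(headers)  # ['Number of Coils', 'Number of Paperclips']
print(data)     # [['5', '3, 5, 4'], ['10', '7, 8, 6']]
```

Cells are kept as strings, so values such as '3, 5, 4' still need their own splitting and int() conversion if numbers are required.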

OS support

By default, the package supports Windows.

To use it on Linux or macOS:

Download the Xpdf files for Linux or macOS
Extract the files and go to the folder matching your OS version (32/64 bit)
Copy these files: pdftotext, pdfinfo, pdfimages
Go to packages-directory/pdftextract/xpdf
Paste the files into that directory
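
The copy steps above can also be scripted. This is a minimal sketch that assumes the Xpdf archive has already been downloaded and extracted. install_xpdf_tools is a hypothetical helper, not part of pdftextract, and both directory arguments are supplied by the caller. It also sets the executable bit, which the manual steps do not mention but which Linux and macOS need before the tools can run.

```python
import shutil
import stat
from pathlib import Path

# The three Xpdf command-line tools named in the steps above.
XPDF_TOOLS = ("pdftotext", "pdfinfo", "pdfimages")


def install_xpdf_tools(source_dir, target_dir):
    """Copy the Xpdf tools from source_dir into target_dir.

    Returns the list of destination paths. Raises FileNotFoundError
    if one of the tools is missing from source_dir.
    """
    source, target = Path(source_dir), Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in XPDF_TOOLS:
        src = source / name
        if not src.is_file():
            raise FileNotFoundError(f"missing Xpdf tool: {src}")
        dst = target / name
        shutil.copy2(src, dst)
        # Make sure the copied binary is executable.
        mode = dst.stat().st_mode
        dst.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
        copied.append(dst)
    return copied


# Possible usage (assumption: the package keeps its binaries in an
# "xpdf" folder next to its __init__.py, as the steps above describe):
#
#     import pdftextract
#     package_xpdf = Path(pdftextract.__file__).parent / "xpdf"
#     install_xpdf_tools("path/to/extracted/xpdf/bin64", package_xpdf)
```

The source path in the usage comment is a placeholder; the folder name inside the Xpdf archive depends on the release that was downloaded.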

Credits

Xpdf (C++) by Derek Noonburg

License

pdftextract is licensed under the GNU General Public License (GPL), version 3.

Release history

0.0.5 (this version): Mar 18, 2021
0.0.3: Mar 14, 2021
0.0.2: Mar 14, 2021
0.0.1: Mar 14, 2021

Download files

Source Distribution

pdftextract-0.0.5.tar.gz (2.3 MB), uploaded Mar 18, 2021, Source

Built Distribution

pdftextract-0.0.5-py3-none-any.whl (1.4 MB), uploaded Mar 18, 2021, Python 3

File details

Details for the file pdftextract-0.0.5.tar.gz.

File metadata

Download URL: pdftextract-0.0.5.tar.gz
Upload date: Mar 18, 2021
Size: 2.3 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.8.5

File hashes

Hashes for pdftextract-0.0.5.tar.gz
Algorithm	Hash digest
SHA256	`3319dd7e87533187701ddf9a1218ebc738f12f7eb102671ff5119a8b5eea66aa`
MD5	`6649e0efe2e535359419b6220e7e7162`
BLAKE2b-256	`8a98cae04a81da26650858f90368b0c1f70cf262c2c5fda02e46927d6ff17d83`

File details

Details for the file pdftextract-0.0.5-py3-none-any.whl.

File metadata

Download URL: pdftextract-0.0.5-py3-none-any.whl
Upload date: Mar 18, 2021
Size: 1.4 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.8.5

File hashes

Hashes for pdftextract-0.0.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a0f1f4f9a2b4b99ec403a2a8646b556c55a8da0ece74cbd0452bc6e01a7fada1`
MD5	`6dbc7769fd9b2b96b31adb23d1e8abc9`
BLAKE2b-256	`a96d7a54570c2e16da40e0fbb8d2a5ab67fc95c569fde73b080ea8d6aa138d01`
