A very fast and efficient text and image PDF extractor.

PdfTextract

A very fast and efficient Python PDF text and image extractor that uses the xpdf C++ library.

Features

  • Several times faster than any Python-based PDF text extractor
  • Very easy and simple to use
  • Extracts text while maintaining the original document layout
  • Tries to automatically extract tables if they exist (still in beta)
  • No local server setup required
  • No dependencies needed

Installation

Install via PyPI:

pip install pdftextract

Or via GitHub:

  1. First clone the repo:
git clone https://github.com/Bnilss/pdftextract.git
  2. Then run the setup script from inside the cloned directory:
cd pdftextract
python setup.py install
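
After either install route, a quick check that the package can be found on the import path may save some debugging later. The sketch below uses only the standard library; `is_installed` is a hypothetical helper name and is not part of pdftextract.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    # find_spec returns None when no top-level module of that name is found.
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    print("pdftextract installed:", is_installed("pdftextract"))
```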

Usage

  1. Import the package and open a PDF
from pdftextract import XPdf

file_path = "examples/pubmed_example.pdf"
pdf = XPdf(file_path)
  2. Get the PDF metadata
print(pdf.info)  # prints a dict of PDF metadata (author, size, pages, ...)
# to get the number of pages, for example
print(pdf.info['Pages'])
  3. Extract text from all pages of a PDF and return it as a string
txt = pdf.to_text()
print(txt)
  4. Extract a single page only, for example the 3rd page
# use the previous method (page numbers start at 1)
txt = pdf.to_text(just_one=3)
# or use bracket notation (indices start at 0)
txt = pdf[2]
  5. Extract text from a single page (page 7) and save it to a .txt file
pdf.to_text("page7.txt", just_one=7)
  6. Extract text from pages 1 to 5
txt = pdf.to_text(start=1, end=5)
# or
txt = pdf[:5]
  7. Extract tables
pdf = XPdf("examples/table_sample.pdf")
txt = pdf.to_text(table=True)
# then use a regex or similar to parse the text
# or try automatic parsing (still in beta)
tables = pdf.table[:]
print(len(tables))  # 3
print(tables[0])  # prints the formatted content of table 1
#Number of Coils | Number of Paperclips
#______________________________________
#       5        |       3, 5, 4
#      10        |       7, 8, 6
#      15        |      11, 10, 12
#      20        |      15, 13, 14
table1_data = tables[0].data  # returns all rows in the table except the headers
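
For the manual route in the tables step (parsing the text yourself), here is a minimal sketch. It assumes the text uses the pipe-delimited layout shown in the formatted table output above; the actual output of `to_text(table=True)` may differ. `parse_pipe_table` is a hypothetical helper, not part of pdftextract.

```python
import re


def parse_pipe_table(text):
    """Split pipe-delimited table text into (header, rows).

    Assumes one table row per line, cells separated by "|", and an
    optional separator line made only of underscores or dashes.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Drop separator lines such as "______" or "------".
    lines = [ln for ln in lines if not re.fullmatch(r"[\s_\-]+", ln)]
    cells = [[cell.strip() for cell in ln.split("|")] for ln in lines]
    if not cells:
        return [], []
    return cells[0], cells[1:]


sample = """Number of Coils | Number of Paperclips
______________________________________
       5        |       3, 5, 4
      10        |       7, 8, 6
"""
header, rows = parse_pipe_table(sample)
print(header)
print(rows)
```

Cells are kept as strings, so a value such as "3, 5, 4" would still need to be split and converted if numbers are required.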

OS support

By default, the package supports Windows.

To use it on Linux or macOS:

  1. Download the xpdf files for Linux or macOS.
  2. Extract the archive and go to the directory matching your OS version (32/64 bit).
  3. Copy these files: pdftotext, pdfinfo, pdfimages.
  4. Go to packages-directory/pdftextract/xpdf.
  5. Paste the files into that directory.
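
The copy steps above could also be scripted. This is a rough sketch under stated assumptions: `copy_xpdf_tools` is a hypothetical helper, the source directory is wherever you extracted the xpdf download, and the destination is the xpdf directory inside the installed package. Marking the copies as executable is an extra precaution and is not mentioned in the steps.

```python
import shutil
import stat
from pathlib import Path

# The three xpdf command-line tools named in the steps above.
XPDF_TOOLS = ("pdftotext", "pdfinfo", "pdfimages")


def copy_xpdf_tools(src_dir, dest_dir):
    """Copy the xpdf tools from src_dir into dest_dir; return the new paths."""
    src, dest = Path(src_dir), Path(dest_dir)
    missing = [name for name in XPDF_TOOLS if not (src / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {src}: {', '.join(missing)}")
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in XPDF_TOOLS:
        target = Path(shutil.copy2(src / name, dest / name))
        # Make sure the owner can execute the copied binary.
        target.chmod(target.stat().st_mode | stat.S_IXUSR)
        copied.append(target)
    return copied


# Hypothetical usage; both paths are assumptions about your machine:
# import pdftextract
# dest = Path(pdftextract.__file__).parent / "xpdf"
# copy_xpdf_tools("path/to/extracted/xpdf/bin64", dest)
```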

Credits

License

pdftextract is licensed under the GNU General Public License (GPL), version 3.

Download files

Source Distribution

pdftextract-0.0.5.tar.gz (2.3 MB)

Uploaded Source

Built Distribution

pdftextract-0.0.5-py3-none-any.whl (1.4 MB)

Uploaded Python 3

File details

Details for the file pdftextract-0.0.5.tar.gz.

File metadata

  • Download URL: pdftextract-0.0.5.tar.gz
  • Upload date:
  • Size: 2.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.8.5

File hashes

Hashes for pdftextract-0.0.5.tar.gz
Algorithm    Hash digest
SHA256       3319dd7e87533187701ddf9a1218ebc738f12f7eb102671ff5119a8b5eea66aa
MD5          6649e0efe2e535359419b6220e7e7162
BLAKE2b-256  8a98cae04a81da26650858f90368b0c1f70cf262c2c5fda02e46927d6ff17d83

File details

Details for the file pdftextract-0.0.5-py3-none-any.whl.

File metadata

  • Download URL: pdftextract-0.0.5-py3-none-any.whl
  • Upload date:
  • Size: 1.4 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.8.5

File hashes

Hashes for pdftextract-0.0.5-py3-none-any.whl
Algorithm    Hash digest
SHA256       a0f1f4f9a2b4b99ec403a2a8646b556c55a8da0ece74cbd0452bc6e01a7fada1
MD5          6dbc7769fd9b2b96b31adb23d1e8abc9
BLAKE2b-256  a96d7a54570c2e16da40e0fbb8d2a5ab67fc95c569fde73b080ea8d6aa138d01
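
The published digests can be compared against a downloaded file. The following is a minimal sketch using the standard library's hashlib; `sha256_of` and `matches_published` are hypothetical helper names, and the local file path in the usage comment is an assumption.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare a file's SHA256 with a published hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Hypothetical usage, assuming the wheel was downloaded to the current directory:
# matches_published(
#     "pdftextract-0.0.5-py3-none-any.whl",
#     "a0f1f4f9a2b4b99ec403a2a8646b556c55a8da0ece74cbd0452bc6e01a7fada1",
# )
```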
