A tool to extract text from PDF files.

Project description

extractpdf

A Python package focused on extracting content from PDF files.

Many options exist, but we found no single solution that is easy to install (even on Windows) and focuses specifically on PDF files. So we created the extractpdf package.

It is based on the structure of Textract but handles PDF only, and adds other tools to the pipeline, such as PyPDF2 and Camelot.

Usage:

To use this package, install it from PyPI:

pip install extractpdf

Then use it like so:

import extractpdf as epdf

# local file
content = epdf.process('my_file.pdf')
# url:
content = epdf.process('http://www.example.com/some_file.pdf')
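When several files or URLs need processing, it can help to keep one failure from stopping the rest. The following is a minimal sketch, not part of extractpdf: `extract_many` is a hypothetical helper, and it assumes only that `extractpdf.process` accepts a path or URL as shown above. The return type of `process` and the exceptions it may raise are not documented here, so the sketch treats results as opaque objects and catches `Exception` broadly.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


def extract_many(
    sources: Iterable[str],
    process: Optional[Callable[[str], object]] = None,
) -> Tuple[Dict[str, object], Dict[str, Exception]]:
    """Run `process` on each source; collect results and failures separately."""
    if process is None:
        # Imported lazily so the helper can be tried with a stand-in function
        # even when extractpdf is not installed.
        import extractpdf
        process = extractpdf.process

    results: Dict[str, object] = {}
    errors: Dict[str, Exception] = {}
    for source in sources:
        try:
            results[source] = process(source)
        except Exception as exc:  # exception types are an unknown here
            errors[source] = exc
    return results, errors
```

A call such as `results, errors = extract_many(['my_file.pdf', 'http://www.example.com/some_file.pdf'])` would then return the extracted content keyed by source, with any failures kept in `errors` for later inspection.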

Advanced Usage:

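To control more features, use the PDFExtractor class directly:

from extractpdf import PDFExtractor

# use a distinct name so the instance does not shadow the `epdf` module alias above
extractor = PDFExtractor()
# keep_download=True keeps the downloaded file on disk (as the flag name suggests)
content = extractor.get_content('http://www.example.com/some_file.pdf', keep_download=True)
f = extractor.filename  # f = some_file.pdf
extractor.delete_file()  # remove the downloaded file when done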
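If code between `get_content` and `delete_file` raises, the kept download is left behind. One way to guard against that is a small context manager. This is a sketch under assumptions: `downloaded_pdf` is a hypothetical helper, not part of extractpdf, and it relies only on the `get_content`, `filename`, and `delete_file` members shown above.

```python
from contextlib import contextmanager


@contextmanager
def downloaded_pdf(extractor, url):
    """Yield (content, filename), then delete the kept download on exit."""
    # If get_content itself fails, nothing is yielded and delete_file is not called.
    content = extractor.get_content(url, keep_download=True)
    try:
        yield content, extractor.filename
    finally:
        # Runs whether the body of the `with` block succeeded or raised.
        extractor.delete_file()
```

Usage would look like `with downloaded_pdf(PDFExtractor(), url) as (content, filename): ...`, with the cleanup happening automatically when the block exits.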

Development

We warmly welcome contributors!

To run this project locally, you first need to install the dependency packages. To install them, you can use pipenv:

Installation using pipenv (which combines virtualenv with pip)

Install pipenv

# if you haven't installed pip (easy_install is deprecated; ensurepip ships with Python)
python -m ensurepip --upgrade

# install pipenv
pip install pipenv

On macOS, you can use Homebrew:

brew install pipenv

Set pipenv to create the virtual environment inside the project directory.

On Windows:

set PIPENV_VENV_IN_PROJECT=true 

On Mac/Linux:

export PIPENV_VENV_IN_PROJECT=true 

Then install the packages:

# install all packages
pipenv install
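When `PIPENV_VENV_IN_PROJECT` is enabled, pipenv places the environment in a `.venv` directory in the project root. A quick check that the setting took effect could look like the sketch below. The helper name `venv_in_project` is hypothetical, and the check only tests whether the variable is non-empty; it does not reproduce how pipenv itself interprets the value.

```python
import os
from pathlib import Path


def venv_in_project(project_dir=".", environ=None):
    """Return (flag_present, venv_exists) for the given project directory."""
    env = os.environ if environ is None else environ
    # True for any non-empty value; pipenv's own parsing may be stricter.
    flag_present = bool(env.get("PIPENV_VENV_IN_PROJECT"))
    # pipenv creates `.venv` in the project root when the option is enabled.
    venv_exists = (Path(project_dir) / ".venv").is_dir()
    return flag_present, venv_exists


if __name__ == "__main__":
    print(venv_in_project())
```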

Download files

Source Distribution

extractpdf-0.0.4.tar.gz (7.7 kB)

Uploaded Source

Built Distribution

extractpdf-0.0.4-py3-none-any.whl (22.6 kB)

Uploaded Python 3

File details

Details for the file extractpdf-0.0.4.tar.gz.

File metadata

  • Download URL: extractpdf-0.0.4.tar.gz
  • Upload date:
  • Size: 7.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.0 setuptools/40.5.0 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.6.5

File hashes

Hashes for extractpdf-0.0.4.tar.gz
Algorithm    Hash digest
SHA256       6a94e12dea1ce7b33e3016e0f4d00f2150b9850952cf107e0db441844b442c59
MD5          5f4c6d83d8b693a6ea38bf58e54e5be0
BLAKE2b-256  ce2bac1cd6ddd8a6a6e9c606bf9b83cf06e95598b7b8d54a740eccc5ecf937ca
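
A downloaded archive can be checked against the published SHA256 digest before installing it. The following is a minimal sketch using only the standard library; the helper names `sha256_of` and `matches_published_digest` are illustrative, and the local file path is an assumption about where the archive was saved.

```python
import hashlib

# SHA256 digest listed above for extractpdf-0.0.4.tar.gz
EXPECTED_SHA256 = "6a94e12dea1ce7b33e3016e0f4d00f2150b9850952cf107e0db441844b442c59"


def sha256_of(path, chunk_size=65536):
    """Compute the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 equals the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_published_digest('extractpdf-0.0.4.tar.gz')` would return `True` only if the local copy is byte-for-byte identical to the uploaded file.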

File details

Details for the file extractpdf-0.0.4-py3-none-any.whl.

File metadata

  • Download URL: extractpdf-0.0.4-py3-none-any.whl
  • Upload date:
  • Size: 22.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.0 setuptools/40.5.0 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.6.5

File hashes

Hashes for extractpdf-0.0.4-py3-none-any.whl
Algorithm    Hash digest
SHA256       08ce7a29bbd88a2c4bcfd175c690ac2ad2cd0a76d0884ef4e11bea80916eb1c8
MD5          df83014e1ad537291bb680f0100c1c5f
BLAKE2b-256  8bb6e5d89dd613136096631bc0ed47513025e431c767d69a516900939a685436
