Extract data from pdf to a dataframe

Project description

Pdf2Df

This is a simple Python package that creates a dataframe from the text extracted from PDF files.

To install:

$ pip install pdf2df

Get Started

To use the package, import Pdf2df, create an instance with the path to your PDFs, and call get_text() to get the dataframe:

from pdf2df import Pdf2df

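# path is a str pointing to a PDF file or a folder of PDFs (see Arguments)
sfd = Pdf2df(path, page=True, single_file=False)
# get_text() returns the dataframe with the extracted text
df = sfd.get_text()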
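
Because the constructor takes both a path and a single_file flag, it can be convenient to derive the flag from the path instead of setting it by hand. The following is only a sketch: infer_single_file and load_pdf_text are hypothetical helpers, not part of pdf2df, and the sketch assumes that single_file=True means the path is one PDF file and single_file=False means it is a folder.

```python
from pathlib import Path


def infer_single_file(path):
    """Return True if `path` is one PDF file, False if it is a folder."""
    p = Path(path)
    if p.is_file() and p.suffix.lower() == ".pdf":
        return True
    if p.is_dir():
        return False
    raise ValueError(f"{path!r} is neither a PDF file nor a folder")


def load_pdf_text(path, page=True):
    """Build the dataframe, choosing single_file from the path.

    Assumption: single_file=True means `path` is a single PDF file.
    """
    # Imported lazily so infer_single_file works without pdf2df installed.
    from pdf2df import Pdf2df

    sfd = Pdf2df(str(path), page=page, single_file=infer_single_file(path))
    return sfd.get_text()
```

With these helpers, load_pdf_text(path) would cover both the single-file case and the folder case, provided the assumption about single_file holds.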

Arguments

  • path (str) : Location of the files. It can be a single PDF file or a folder containing multiple PDF files.
  • page (bool) : If True, each page of the PDF goes in its own row of the dataframe; if False, all the text of a PDF goes in a single row.
  • single_file (bool) : Tells the method whether path is a single PDF file or a folder containing multiple PDF files.
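
The effect of the page argument on the row layout can be illustrated without the package itself. The sketch below uses plain Python only; build_rows and the column names file, page and text are hypothetical and chosen for illustration, so the actual columns produced by pdf2df may differ.

```python
def build_rows(pages_by_file, page=True):
    """Mimic the row layout described for the page argument.

    pages_by_file maps a file name to the list of its page texts.
    Column names here are illustrative, not taken from pdf2df.
    """
    rows = []
    for filename, pages in pages_by_file.items():
        if page:
            # One row per page.
            for number, text in enumerate(pages, start=1):
                rows.append({"file": filename, "page": number, "text": text})
        else:
            # One row per file, with all page texts joined together.
            rows.append({"file": filename, "text": "\n".join(pages)})
    return rows


sample = {
    "report.pdf": ["Page one text", "Page two text"],
    "notes.pdf": ["Only page"],
}

per_page = build_rows(sample, page=True)
per_file = build_rows(sample, page=False)

print(len(per_page))  # 3 rows: one per page
print(len(per_file))  # 2 rows: one per file
```

A list of dictionaries like this can be passed to pandas.DataFrame(rows) to obtain a dataframe with the same layout.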

Download files

Source Distribution

pdf2df-0.0.2.tar.gz (2.8 kB view details)

Uploaded Source

Built Distribution

pdf2df-0.0.2-py3-none-any.whl (3.8 kB view details)

Uploaded Python 3

File details

Details for the file pdf2df-0.0.2.tar.gz.

File metadata

  • Download URL: pdf2df-0.0.2.tar.gz
  • Upload date:
  • Size: 2.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.10.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.8.8

File hashes

Hashes for pdf2df-0.0.2.tar.gz
Algorithm Hash digest
SHA256 acfeb70e2536f46572e6459c719fc324d0252bf6f41b69fa9d28f7c3f1bd63f9
MD5 bde72e8d666e17af3f0786746fd6e12b
BLAKE2b-256 8b1aa98e150441ec649fe682d0a00d828154e5b4ee085c264b4534207200d270

File details

Details for the file pdf2df-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: pdf2df-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 3.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.10.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.8.8

File hashes

Hashes for pdf2df-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 706724b5ade29c50aee78e6afd159b8c4bc722a31d6e5aa45ebd4d1d7a349d2f
MD5 e60396fe70956058c00a4489f58ba352
BLAKE2b-256 6544c186ff4cfa109ac8bd2747d17af3a5dffecfa8d81463a6ad0c8dd595d27c
