Extract text from PDFs into a dataframe

Pdf2Df

This is a simple Python package that creates a dataframe from the text extracted from PDF files.

To install:

$ pip install pdf2df
$ pip install PyMuPDF==1.16.14

Get Started

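To use the package, import the Pdf2df class, create an instance that points at your PDF files, and call get_text() to obtain the dataframe:

from pdf2df import Pdf2df

path = "path/to/pdf_folder"  # placeholder: replace with your own folder of PDF files
sfd = Pdf2df(path, page=True, single_file=False)
df = sfd.get_text()  # with page=True, each page of each PDF is a separate row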

Arguments

  • path (str) : Where the files are located. It can be a single PDF file or a folder containing multiple PDF files.
  • page (bool) : If True, each page of a PDF goes into its own row of the dataframe. If False, all the text of a PDF goes into a single row.
  • single_file (bool) : Tells the method whether path is a single PDF file (True) or a folder containing multiple PDF files (False).
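
As an illustration of how these arguments interact, here is a minimal standard-library sketch. It is not the package's actual implementation: the helper names collect_pdf_paths and build_rows, and the column names "file", "page" and "text", are assumptions made for this example only. The page texts are passed in as plain strings, so no PDF parsing happens here.

```python
from pathlib import Path
from typing import Dict, List


def collect_pdf_paths(path: str, single_file: bool) -> List[Path]:
    """Resolve `path` to the PDF files to read (hypothetical helper)."""
    target = Path(path)
    if single_file:
        # Single-file mode: `path` is taken to be the PDF itself.
        return [target]
    # Folder mode: every .pdf file directly inside the folder, in a stable order.
    return sorted(
        p for p in target.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def build_rows(file_name: str, page_texts: List[str], page: bool) -> List[Dict[str, object]]:
    """Lay out already-extracted page texts as row records.

    The column names used here are assumptions, not the package's real schema.
    """
    if page:
        # One row per page, numbered from 1.
        return [
            {"file": file_name, "page": number, "text": text}
            for number, text in enumerate(page_texts, start=1)
        ]
    # One row per document: all page texts joined together.
    return [{"file": file_name, "text": "\n".join(page_texts)}]


pages = ["first page", "second page"]
rows_per_page = build_rows("report.pdf", pages, page=True)
rows_per_file = build_rows("report.pdf", pages, page=False)
```

With page=True the two-page document produces two records; with page=False it produces one record whose text is both pages joined by a newline. A list of such records could then be turned into a dataframe, for example with pandas.DataFrame(rows_per_page).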
