Extracting text and data from PDFs

Supported Python versions: 3.6, 3.7, 3.8, 3.9

{pdfdata}

Python package for extracting text and data from PDFs.

Installation

pip install pdfdata

Usage

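from pprint import pprint

from pdfdata import (
    parse_pdf,
    pdf_doc_extract_span_df,
    pdf_doc_extract_span_list,
    pdf_text_to_jsonnl,
)


# parse the PDF, then extract its spans with pdf_doc_extract_span_list
pdf_parsed = parse_pdf('pdfs/0641-20.pdf')
res = pdf_doc_extract_span_list(pdf_parsed)

pprint(res, depth=3)


# parse the PDF, then extract its spans with pdf_doc_extract_span_df
pdf_parsed = parse_pdf('pdfs/0641-20.pdf')
res = pdf_doc_extract_span_df(pdf_parsed)

pprint(res[0])


# write the text of the PDF to a .jsonnl file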
pdf_text_to_jsonnl('pdfs/0641-20.pdf', '0641-20.jsonnl')
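
Reading the output back

The snippet below is a minimal sketch for loading a .jsonnl file back into Python using only the standard library. It assumes the file holds one JSON value per line (the usual newline-delimited JSON layout); this README does not document the exact format, so check a generated file first. The helper name read_jsonl is hypothetical and is not part of pdfdata.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return one parsed JSON value per non-empty line of the file at `path`.

    Hypothetical helper, not part of pdfdata. Assumes newline-delimited JSON.
    """
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

Under that assumption, read_jsonl('0641-20.jsonnl') would return a list with one entry per line of the file written above.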

Dev notes

Build

python -m build

TestPyPI upload

python -m twine upload --repository testpypi dist/* --skip-existing

Download files


Source Distribution

pdfdata-0.1.3.2.tar.gz (4.8 kB)

Built Distribution

pdfdata-0.1.3.2-py3-none-any.whl (7.4 kB)
