Extracting text and data from PDFs
{pdfdata}
Python package for extracting text and data from PDFs.
Installation
pip install pdfdata
Usage
from pprint import pprint

from pdfdata import (
    parse_pdf,
    pdf_doc_extract_span_df,
    pdf_doc_extract_span_list,
    pdf_text_to_jsonnl,
)

# Parse the PDF as a dictionary
pdf_parsed = parse_pdf('pdfs/0641-20.pdf')
res = pdf_doc_extract_span_list(pdf_parsed)
pprint(res, depth=3)

# Parse the PDF as a list of spans
pdf_parsed = parse_pdf('pdfs/0641-20.pdf')
res = pdf_doc_extract_span_df(pdf_parsed)
pprint(res[0])

# Convert the PDF text to jsonnl
pdf_text_to_jsonnl('pdfs/0641-20.pdf', '0641-20.jsonnl')
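Reading the converted file back might look like the sketch below. It assumes the .jsonnl output holds one JSON value per line, which the name suggests but this README does not state. The helper iter_json_lines is hypothetical and not part of pdfdata; it uses only the standard library.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def iter_json_lines(path: str) -> Iterator[Any]:
    """Yield one decoded JSON value per non-empty line of the file.

    Assumption: the file stores one JSON value per line. This helper is
    illustrative only and is not provided by pdfdata.
    """
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line so a malformed file is easy to fix
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
```

With the file from the usage example, `for record in iter_json_lines('0641-20.jsonnl'): pprint(record)` would print each record in turn, provided the one-value-per-line assumption holds.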
Dev notes
Build the distributions:
python -m build
Upload to TestPyPI:
python -m twine upload --repository testpypi dist/* --skip-existing
Download files
Source Distribution
pdfdata-0.1.3.2.tar.gz (4.8 kB)
Built Distribution
Hashes for pdfdata-0.1.3.2-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | d31532d7bf22c892af14cc81c9dfb88635e773d0b01088ed4d99413594bc7640
MD5 | f07d0001c90bacaa335414a44c3e72ac
BLAKE2b-256 | 7026b2f78af285146c7be2bd584265a345dff2ab257e49e26ce1de1ad3e4ed13
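A downloaded file can be checked against the digests listed above with the standard library's hashlib. The sketch below is one way to do it; the helpers file_digests and matches_published are hypothetical names, not part of pdfdata, and BLAKE2b-256 is computed here as BLAKE2b with a 32-byte digest.

```python
import hashlib
from pathlib import Path
from typing import Dict


def file_digests(path: str, chunk_size: int = 65536) -> Dict[str, str]:
    """Return hex digests of a file for the three listed algorithms."""
    hashers = {
        "SHA256": hashlib.sha256(),
        "MD5": hashlib.md5(),
        # BLAKE2b-256 means BLAKE2b truncated to a 32-byte (256-bit) digest
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with Path(path).open("rb") as handle:
        # Read in chunks so large files are not loaded into memory at once
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def matches_published(path: str, expected_sha256: str) -> bool:
    """Compare the file's SHA256 digest with a published hex digest."""
    return file_digests(path)["SHA256"] == expected_sha256.strip().lower()
```

For example, calling matches_published on the downloaded wheel with the SHA256 value from the table should return True if the file is intact.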