Project description

About

This script converts PDF files to plain text using PDFMiner (http://www.unixuser.org/~euske/python/pdfminer/index.html).

PDFMiner is a PDF parsing library written in Python by Yusuke Shinyama.

In addition to the pdf2txt.py and dumppdf.py command line tools, there is a way of analyzing the content tree of each page programmatically.

This is a more complete example of programming with PDFMiner, which continues where the default documentation (http://www.unixuser.org/~euske/python/pdfminer/programming.html#layout) stops.

This code is still a work-in-progress, with room for improvement.

Install

The package is available on PyPI, so it can be installed with pip:

pip3 install pdf_layout_scanner

Advantages over PDFMiner

This script will extract text from PDFs with multiple columns.
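To make the multi-column problem concrete, here is a minimal sketch of one way to restore reading order on a two-column page. It is not the heuristic this package uses; the `Box` tuple, the `order_by_columns` function and the `tolerance` parameter are all hypothetical names chosen for illustration, and the approach simply assumes that boxes in the same column share roughly the same left edge.

```python
from typing import List, Tuple

# Hypothetical box representation: (x0, y0, text). PDF coordinates grow
# upward, so a larger y0 means the box sits higher on the page.
Box = Tuple[float, float, str]


def order_by_columns(boxes: List[Box], tolerance: float = 20.0) -> List[str]:
    """Group boxes with similar left edges into columns, then read the
    columns left to right and each column top to bottom."""
    columns: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[0]):
        # Join the current column if the left edge is close to its first box.
        if columns and abs(box[0] - columns[-1][0][0]) <= tolerance:
            columns[-1].append(box)
        else:
            columns.append([box])

    ordered: List[str] = []
    for column in columns:
        # Higher y0 first, i.e. the top of the page first.
        for _, _, text in sorted(column, key=lambda b: -b[1]):
            ordered.append(text)
    return ordered


boxes = [
    (300.0, 700.0, "right column, first paragraph"),
    (50.0, 400.0, "left column, second paragraph"),
    (52.0, 700.0, "left column, first paragraph"),
    (300.0, 400.0, "right column, second paragraph"),
]
for text in order_by_columns(boxes):
    print(text)
```

A fixed tolerance like this would likely fail on pages with indented paragraphs, full-width headings or figures that span columns, which is why real layouts need a fuzzier rule.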

Usage

General Usage

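from pdf_layout_scanner import layout_scanner

# get the table of contents as a list
toc = layout_scanner.get_toc('/path/to/your/pdf-file.pdf')

# get the full text, one string per page
pages = layout_scanner.get_pages('/path/to/your/pdf-file.pdf')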

Practical examples

from pdf_layout_scanner import layout_scanner

toc = layout_scanner.get_toc('/path/to/your/pdf-file.pdf')
print(len(toc))
# the number of elements in the pdf document's table of contents

print(toc[0])
# a tuple containing the ordinal sequence and the title string,
#  for example:
#  (1, 'Introduction')

pages = layout_scanner.get_pages('/path/to/your/pdf-file.pdf')
print(len(pages))
# the number of pages in the pdf document
print(pages[0])
# a string of all the text on the first page
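Since the example above treats `pages` as a list of strings, one per page, saving the result to disk takes only a few lines. The sketch below is an illustration rather than part of the package: `save_pages` is a hypothetical helper, and the form-feed separator is just one possible convention for marking page breaks.

```python
import os
import tempfile
from pathlib import Path
from typing import List


def save_pages(pages: List[str], out_path: str) -> int:
    """Write all pages to one UTF-8 text file, separated by form feeds.

    Returns the number of pages written.
    """
    Path(out_path).write_text("\f".join(pages), encoding="utf-8")
    return len(pages)


if __name__ == "__main__":
    # Stand-in data; in real use this list would come from
    # layout_scanner.get_pages('/path/to/your/pdf-file.pdf').
    sample_pages = ["text of the first page", "text of the second page"]
    target = os.path.join(tempfile.mkdtemp(), "output.txt")
    print(save_pages(sample_pages, target), "pages written to", target)
```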

Room for Improvement

Column Merging - while the fuzzy heuristic I described works well for the pdf files I've parsed so far, I can imagine more complex documents where it would break down (perhaps this is where the analysis should be more sophisticated, and not ignore so many types of pdfminer.layout.LT* objects).
Image Extraction - I'd like this to be at least as good as pdftoimages, and save every file in ppm or pnm default format, but I'm not sure what I could be doing differently.
Title and Heading Capitalization - this seems to be an issue with PDFMiner, since I get similar results in using the command line tools, but it is annoying to have to go back and fix all the mis-capitalizations manually, particularly for larger documents.
Title and Heading Fonts and Spacing - a related issue, though probably something in my own code, is that those same title and paragraph headings aren't distinguished from the rest of the text. In many cases, I have to go back and add vertical spacing and font attributes for those manually.
Page Number Removal - originally, I thought I could just use a regex for an all-numeric value on a single physical line, but each document does page numbering slightly differently, and it's very difficult to get rid of these without manually proofreading each page.
Footnotes - handling these where the note and the reference both appear on the same page is hard enough, but doing it when they span different (even consecutive) pages is worse.
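The page-number item in the list above mentions a regex for an all-numeric value on a single line. A minimal sketch of that idea follows, restricted to the first and last non-empty line of each page to reduce false positives. The pattern and the `strip_page_number` name are assumptions for illustration, not code from this package, and the sketch inherits the weakness described above: documents that number pages differently will slip through, and a bare year at the top of a page would be removed by mistake.

```python
import re

# A line holding only a short number, optionally written as "- 12 -" or "Page 12".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?-?\s*\d{1,4}\s*-?\s*$", re.IGNORECASE)


def strip_page_number(page_text: str) -> str:
    """Drop the first or last non-empty line if it looks like a page number."""
    lines = page_text.splitlines()
    non_empty = [i for i, line in enumerate(lines) if line.strip()]
    if not non_empty:
        return page_text
    # Only the outermost non-empty lines are candidates; numbers in the
    # middle of the page are left alone.
    candidates = {non_empty[0], non_empty[-1]}
    kept = [
        line
        for i, line in enumerate(lines)
        if not (i in candidates and PAGE_NUMBER.match(line))
    ]
    return "\n".join(kept)


print(repr(strip_page_number("Some text\nwith 3 numbers\n\n12")))
```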

Contribution

In this fork, I made a few changes to the original project:

Added support for text inside LTFigure objects.
Optimized data manipulation and changed storage from a simple dict to a dataframe. This should make further contributions easier.
Added a progress bar.

Project details

Release history

1.3.3 - Sep 1, 2019
1.3.2 - Aug 7, 2019
1.3.1 - Aug 7, 2019
1.3 - Aug 7, 2019
1.2 - Jul 24, 2019
1.1 - Jul 24, 2019 (this version)
1.0 - Jul 24, 2019

Download files

Source Distribution

PDF Layout Scanner-1.1.tar.gz (6.2 kB)

Uploaded Jul 24, 2019 Source

Built Distribution

PDF_Layout_Scanner-1.1-py3-none-any.whl (7.6 kB)

Uploaded Jul 24, 2019 Python 3

File details

Details for the file PDF Layout Scanner-1.1.tar.gz.

File metadata

Download URL: PDF Layout Scanner-1.1.tar.gz
Upload date: Jul 24, 2019
Size: 6.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.6.8

File hashes

Hashes for PDF Layout Scanner-1.1.tar.gz
Algorithm	Hash digest
SHA256	`1d98f5e46c611141ed021f3f0c81e42f701a5bf49c3146560cc2d20ee9744a00`
MD5	`af08cd4eab1ab581e4fd6828b8144a42`
BLAKE2b-256	`916e6fa485f12042360aa05e1b8c04f34ec298645f8f2e8f1e8c71983cd1401d`
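
The SHA256 digest listed above can be used to check a downloaded file before installing it. Here is a minimal sketch using only the standard library; `sha256_of` and `matches_published_hash` are hypothetical helper names, and the local file path in the usage comment is an assumption about where the download was saved.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with a published one, ignoring case and spaces."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage, assuming the source distribution was saved in the current directory:
# expected = "1d98f5e46c611141ed021f3f0c81e42f701a5bf49c3146560cc2d20ee9744a00"
# print(matches_published_hash("PDF Layout Scanner-1.1.tar.gz", expected))
```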

File details

Details for the file PDF_Layout_Scanner-1.1-py3-none-any.whl.

File metadata

Download URL: PDF_Layout_Scanner-1.1-py3-none-any.whl
Upload date: Jul 24, 2019
Size: 7.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.6.8

File hashes

Hashes for PDF_Layout_Scanner-1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0c8b160720a07dd2e8757a9814c2ef8044b42f78d83b50bf2cee4128636dd54c`
MD5	`e3a67072a9305cff6c3b5b94ebb956df`
BLAKE2b-256	`c90725121e337b49fc12fadd7764dc4da9b56ae114ca4eb2afd1b6ba943697b4`

PDF-Layout-Scanner 1.1
