Skip to main content

Library for Parsing PDF/TXT and Python Objects with Text Using Regular Expressions

Project description

icon

prettyparser is a Python library for parsing PDF/TXT and Python objects with text (str, list) using regular expressions. In case of PDF files, the package reads the content using pdfplumber and then performs a series of data manipulations to generate a higher quality output, removing the boilerplate code needed to read/process/write the content of multiple files with multiple pages. A custom processing function using pdfplumber that takes a page and returns a processed text is also allowed. Additional data processing steps can be added via custom regular expressions, that are compiled for improved speed.

Installation

$ git clone https://github.com/leandroroser/prettyparser
$ cd prettyparser
$ pip install -e .

or

$ pip install prettyparser

Example: processing a folder with multiple PDF files

import regex as re
from prettyparser import PrettyParser

directory = "./BOOKS/PDF"
output = "./BOOKS/TXT"
parser = PrettyParser(directory, output, mode = 'pdf',
                      args = [[r"(\n\s*\d+\s*\n)|(\n\s*\d+\s*$)", r'\n\n'],
                            [r"\n\s*-\d-\s*\n", r'\n\n'],
                            [r"\n\s*(\* *)+\s*\n", r'\n\n'],
                            [r"__some_header_text", r'\n\n', re.IGNORECASE]],
                            remove_whitelines = True,
                            paragraphs_spacing = 1,
                            remove_hyphen_eol = True)
parser.run()

Example: processing a folder with multiple TXT files

Let’s assume that the previous output isn’t good enough and needs additional corrections. A quicker way for testing additional corrections can be implemented by using the previous TXT output:

directory = "./BOOKS/TXT"
output = "./BOOKS/TXT_REPARSED"
parser = PrettyParser(directory, output,  mode = 'txt',
                        args=[[r"some other header.*\d+", r''],
                            [r"^\d+.*", r'', re.MULTILINE],
                            [r"([A-Z]+)( *\n)([A-Z]+)", r'\1\3'],
                            remove_whitelines = True,
                            paragraphs_spacing = 1,
                            remove_hyphen_eol = True)
parser.run()

Example: processing a Python str for a quick test of the app

import regex as re
from prettyparser import PrettyParser


txt = """
header to remove

This is a text with multiple problems. For exam-
ple the latter word can be joined.
The portions of this line can be
joined
in a single line.
HERE ALSO IS SOME
UPPERCASE TEXT
TO JOIN
Some Other Ugly Stuff To Remove IGNORING Case.

Remove the line below:

* * *

Remove empty lines and finally separate paragraphs with a blank line.


Below is the page number->.
99
"""
parser = PrettyParser(txt, mode = "pyobj", args = [[r"\s*header to remove\s*\n",r""],
                                                    [r"(\n\s*\d+\s*\n)", r'\n\n'],
                                                    [r"\n\s*(\* *)+\s*\n", r'\n\n'],
                                                    [r"\n.*some other ugly stuff.*",
                                                    r'\n\n', re.IGNORECASE]],
                                                    remove_whitelines = True,
                                                    paragraphs_spacing = 1,
                                                    remove_hyphen_eol = True)
output = parser.run()
print(output[0])
This is a text with multiple problems. For example the latter word can be joined.

The portions of this line can be joined in a single line.

HERE ALSO IS SOME UPPERCASE CASE TEXT TO JOIN

Remove the line below:

Remove empty lines and finally separate each line with a blank line.

Below is the page number->.

Arguments

  • files (list or str): Path to parse for pdf/txt operations. If a string is passed, it will be treated as a directory when mode is ‘pdf’ or ‘txt’. If a str or list is passed when mode is ‘pyobj’, it will be treated as a str/list of text files already loaded in memory in the corresponding object

  • output (str): output directory

  • args (list): list of tuples of the form (regex, replacement, flags). The flag can be absent.

  • mode (str): ‘pdf’, ‘txt’ or ‘pyobj’ (the latter for Python lists and strings)

  • default (bool): if True, perform several default cleanup operations (default)

  • remove_whitelines (bool): if True, remove whitespaces

  • paragraphs_spacing (int): number of newlines between paragraphs

  • page_spacing (str): string to insert between pages

  • remove_hyphen_eol (bool): if True, remove end of line hyphens and merge subwords

  • custom_pdf_fun (Callable): custom function to parse pdf files It must accept a pdfplumber page as argument and return a text to be joined with previous pages

Current language support for the default parser

English, Spanish, German, French, Portuguese

License

© Leandro Roser, 2021. Licensed under an Apache-2 license.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

prettyparser-1.0.15.tar.gz (10.0 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page