This package can read PDF documents, automatically recognise chapter titles, enumerations and other elements in the text, and summarize the document part by part.

Project description

Textsplitter package

This package is meant for structure recognition in PDF documents. It reads the text from the document using the pdfminer.six library. Then, it uses an in-house rule-based algorithm to identify chapter titles, enumerations, etc.

The package also offers functionality to summarize the document part-by-part. A chapter is summarized by summarizing the summaries of the sections beneath it. Likewise, the document as a whole is summarized by summarizing the summaries of its chapters. The summaries are generated by calling the ChatGPT API. This means that an access token to a paid ChatGPT account is required to use this functionality.
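
The part-by-part idea can be pictured with a small, self-contained sketch. It does not use this package or ChatGPT: the Part class, first_words and layered_summary below are hypothetical names chosen for illustration, and the placeholder summarizer simply keeps the first few words of a text.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Part:
    """Hypothetical stand-in for one text part (document, chapter, section)."""
    title: str
    text: str = ""
    children: list[Part] = field(default_factory=list)


def first_words(text: str, max_words: int) -> str:
    """Placeholder summarizer: keep the first max_words words (no API call)."""
    return " ".join(text.split()[:max_words])


def layered_summary(part: Part, max_words: int = 50) -> str:
    """Summarize the children first, then the part's own text plus their summaries."""
    child_summaries = [layered_summary(child, max_words) for child in part.children]
    combined = " ".join([part.text, *child_summaries]).strip()
    return first_words(combined, max_words)


document = Part("Report", children=[
    Part("Chapter 1", "Intro text.", children=[
        Part("Section 1.1", "Details about the method used here."),
    ]),
    Part("Chapter 2", "Results and discussion."),
])
print(layered_summary(document, max_words=8))
```

Because every level is summarized from the summaries one level below, the text handed to the summarizer stays short even for long documents.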

Finally, the package can also render the split and summarized document as a local html page using an in-house parser.

Installation

To install from PyPI, use:

pip install pdftextsplitter

Getting started

After installing the package, simply run:

from pdftextsplitter import textsplitter 
mysplitter = textsplitter() 
mysplitter.set_documentpath("/absolute/path/to/the/folder/of/your/document/") 
mysplitter.set_documentname("your_document_name") # no .pdf-extension! 
mysplitter.set_outputpath("/absolute/path/to/where/you/want/your/outputs/") 
mysplitter.standard_params() 
mysplitter.process() 

After running the process-command, it can take a long time (up to an hour) to process your document, depending on the document size. Afterwards, you can view the results by opening the outputs in the specified output folder (have a look at the html-file), or you can further process the results using your own code.

Setting some properties

If you wish to configure some parameters yourself, specify any of the following values between the standard_params-command and the process-command:

mysplitter.set_histogramsize(100)               # Specifies the granularity of the histograms used to perform calculations on font size and white lines in the document.
mysplitter.set_MaxSummaryLength(50)             # Guideline for how long a summary of a single textpart should be in words (outcomes are not exact, but they can be steered with this).
mysplitter.set_summarization_threshold(50)      # If a textpart has fewer words than this amount, it will not be summarized, but copied. This is done to save costs.
mysplitter.set_LanguageModel("gpt-3.5-turbo")   # The ChatGPT language model you would like to use.
mysplitter.set_LanguageChoice("Default")        # Language you would like to receive your summaries in from ChatGPT.
mysplitter.set_LanguageTemperature(0.1)         # Temperature of the ChatGPT responses.
mysplitter.set_MaxCallRepeat(20)                # If the ChatGPT api returns an unusable response or an error, we attempt the call again, until this maximum. A higher value
                                                # will give a more trusted output, but also potentially higher costs.
mysplitter.set_UseDummySummary(True)            # If this is set to True, summaries will be created in-house by selecting the first n words from the text; no ChatGPT calls are used.
                                                # Hence, a ChatGPT account is not required in this case.
                                                # This is a great way to experiment with the package without burning any money. Set it to False for receiving usable summaries.

To let the package use your own paid ChatGPT account (only required when you call set_UseDummySummary(False)), set:

mysplitter.ChatGPT_Key = "my personal access token"

between the standard_params-command and the process-command.
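
How the summarization threshold and the maximum number of repeated calls might interact can be sketched as follows. This is an illustration under assumptions, not the package's implementation: summarize_with_retry and make_flaky_model are hypothetical names, and the "model" is a local function that fails a fixed number of times instead of a real ChatGPT call.

```python
from typing import Callable


def summarize_with_retry(
    text: str,
    call_model: Callable[[str], str],
    threshold: int = 50,
    max_repeat: int = 20,
) -> tuple[str, bool]:
    """Return (summary, ok). Short texts are copied; failed calls are retried."""
    if len(text.split()) < threshold:
        return text, True  # too short to be worth a paid call: copy it
    for _ in range(max_repeat):
        try:
            answer = call_model(text)
        except Exception:
            continue  # simulated API error: try again
        if answer.strip():
            return answer, True  # usable response
    return "", False  # gave up after max_repeat attempts


def make_flaky_model(failures: int) -> Callable[[str], str]:
    """Build a fake model that raises an error for the first `failures` calls."""
    state = {"calls": 0}

    def model(text: str) -> str:
        state["calls"] += 1
        if state["calls"] <= failures:
            raise RuntimeError("simulated API error")
        return "short summary"

    return model


long_text = "word " * 60
print(summarize_with_retry(long_text, make_flaky_model(2)))
print(summarize_with_retry(long_text, make_flaky_model(5), max_repeat=3))
```

In this sketch a failed result is reported through the second tuple element, which plays a role comparable to counting the calls that could not be corrected within the repeat limit.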

Controlling the terminal output

You can control the number of messages printed to the terminal by running mysplitter.process(0) instead. Values of -1, 0, 1, 2, etc. can be used. The higher the value, the more output you receive. A value of -1 makes the terminal completely quiet and also suppresses the output files.

NOTE: Choosing option -1 here will also change the html-visualisation from a complete page to a fragment of a page. This is because option -1 is meant for when this package is used as a building block for a web application, with Django as the preferred framework. In that case, you want html that you can send to a Django template, which is not the same as a standalone html page. So -1 gives you html output you can use in a web app, while the other choices give you a standalone html page.

Controlling individual actions of the package

You can replace the process-command by any one, or a combination, of the following commands:

mysplitter.document_metadata()                  # This will extract meta-data from the PDF like author, creation date, etc. using the PyPDF2 library.
mysplitter.read_native_toc("pdfminer")          # If the document has a table of contents in its meta-data, this is extracted. Supported libraries are pymupdf and pdfminer.
mysplitter.textgeneration("pdfminer")           # This will read the text from the PDF document. Supported libraries are: pypdf2 (limited support), pymupdf, pdfminer.
mysplitter.export("default")                    # This will write the extracted text to a .txt-file for your convenience.
mysplitter.fontsizehist()                       # This creates histograms of all the font sizes encountered in the document.
mysplitter.findfontregions()                    # This utilizes those histograms to decide which font sizes are large or small.
mysplitter.calculate_footerboundaries(0)        # This will calculate the cut-offs between headers, footers and body text in the document. -1, 0, 1, 2, etc. control the terminal output.
mysplitter.whitelinehist()                      # Same as fontsizehist, but now for white lines between textlines in the document.
mysplitter.findlineregions()                    # Same as findfontregions, but now for white lines between textlines in the document.
mysplitter.passinfo()                           # This will pass the calculated information to the internal rules so the structure elements in the document can be identified.
mysplitter.breakdown()                          # This will split the text in the document into distinct chapters, sections, etc.
mysplitter.shiftcontents()                      # This refines the outcome of breakdown in the case the PDF document quotes any personal letters.
mysplitter.calculatefulltree()                  # This calculates which sections belong to which chapter and so on.
num_errors = mysplitter.layered_summary(0)      # This will call ChatGPT to create summaries for each part of your document. -1, 0, 1, 2, etc. can be entered to control terminal output.
mysplitter.exportdecisions()                    # This will write the decisions per textline made by breakdown to a .txt-file for future analysis.
mysplitter.exportalineas("default")             # This will write the outcome of breakdown to a .txt-file for future analysis.
mysplitter.alineas_to_html()                    # This will generate a standalone html-page with all output from the package (enter "django" for incomplete html).

Note that the process-command is nothing more than the sequential execution of all commands specified above. It is possible to execute only some of the commands while skipping others that you do not need. However, many commands need the outcome of some of the previous commands, so not all orders and combinations of the above commands will result in workable code. At the very least, one should remember to always execute passinfo immediately before breakdown.
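
To give a feeling for what a font-size histogram can be used for, here is a deliberately simplified sketch. It is not the package's in-house algorithm, which is not described here: classify_font_sizes is a hypothetical function, and the assumption that the most frequent font size is the body text is ours.

```python
from collections import Counter


def classify_font_sizes(line_sizes: list[float]) -> dict[float, str]:
    """Label each font size relative to the most frequent (assumed body) size."""
    if not line_sizes:
        return {}
    # Histogram: how many text lines use each (rounded) font size.
    histogram = Counter(round(size, 1) for size in line_sizes)
    body_size, _ = histogram.most_common(1)[0]
    labels = {}
    for size in histogram:
        if size > body_size:
            labels[size] = "large"  # candidate chapter or section title
        elif size < body_size:
            labels[size] = "small"  # candidate footnote, header or footer
        else:
            labels[size] = "body"
    return labels


# 40 body lines, a few headings and some small print (made-up numbers).
sizes = [10.0] * 40 + [14.0] * 3 + [18.0] + [8.0] * 5
print(classify_font_sizes(sizes))
```

A rule-based splitter can then treat lines in a "large" font as title candidates, which is one reason the histogram commands have to run before passinfo and breakdown.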

Access to the produced data

After running the process-command (or some decomposition of the commands), you can directly access the produced data by retrieving the class members. For a full discussion of all members, we refer to the source code (textsplitter/Textpart/textsplitter.py), but we will discuss the most important members here.

First, each set-function discussed above under 'Setting some properties' comes with a matching get-function to retrieve the parameters you used. This can also be done without setting the parameters in advance, in which case you will obtain the default values.

Secondly, the most important class member is mysplitter.textalineas which is an array of textalinea-objects (textsplitter/Textpart/textalinea.py). Useful members are:

mysplitter.textalineas[index].texttitle: str            # The title of this textpart (chapter, section, etc.) 
mysplitter.textalineas[index].textlevel: int            # How deep this part occurs in the document. 0=entire document, 1=chapter, 2=section, etc. 
mysplitter.textalineas[index].summary: str              # The summary generated by ChatGPT from this textpart. 
mysplitter.textalineas[index].textcontent: list[str]    # The original text of this textpart, as obtained from the PDF. 
mysplitter.textalineas[index].nativeID: int             # The order in which the textparts are identified in the original PDF. 
mysplitter.textalineas[index].parentID: int             # The nativeID of the parent of this textpart. For example, a section belongs to a chapter (we call this the parent). 
                                                        # Note: a summary is generated by adding the textcontent of this textpart to the summaries of all children of this textpart. 
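
The members above are enough to rebuild the document outline from the flat list. The sketch below runs on hand-made stand-in objects rather than on real package output: FakeAlinea, outline and children_of are hypothetical names, and using -1 as the parentID of the top-level part is an assumption made only for this example.

```python
from dataclasses import dataclass


@dataclass
class FakeAlinea:
    """Hypothetical stand-in exposing only members named above."""
    texttitle: str
    textlevel: int
    nativeID: int
    parentID: int


def outline(alineas: list[FakeAlinea]) -> list[str]:
    """Return one indented title per text part, in document order."""
    ordered = sorted(alineas, key=lambda alinea: alinea.nativeID)
    return ["  " * alinea.textlevel + alinea.texttitle for alinea in ordered]


def children_of(alineas: list[FakeAlinea], native_id: int) -> list[FakeAlinea]:
    """Return the text parts whose parentID points at the given nativeID."""
    return [alinea for alinea in alineas if alinea.parentID == native_id]


parts = [
    FakeAlinea("Report", 0, 0, -1),  # -1 as "no parent" is an assumption
    FakeAlinea("Chapter 1", 1, 1, 0),
    FakeAlinea("Section 1.1", 2, 2, 1),
    FakeAlinea("Chapter 2", 1, 3, 0),
]
print("\n".join(outline(parts)))
print([child.texttitle for child in children_of(parts, 0)])
```

If the real textalinea-objects expose these members as listed, passing mysplitter.textalineas instead of the sample list should work the same way.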

Some other useful members are:

mysplitter.native_TOC: list[Native_TOC_Element] # The table of contents of the PDF obtained from its metadata (if present); see textsplitter/TextPart/read_native_toc.py.
mysplitter.doc_metadata_author: str             # The author of the PDF, as obtained from its metadata (if present).
mysplitter.doc_metadata_creator: str            # Same, but now the creator-field.
mysplitter.doc_metadata_producer: str           # Same, but now the producer-field.
mysplitter.doc_metadata_title: str              # Same, but now the title-field.
mysplitter.doc_metadata_subject: str            # Same, but now the subject-field.
mysplitter.html_visualization: str              # The full html-code of the output (as produced by alineas_to_html()).
mysplitter.api_wrongcalls_duetomaxwhile: int    # The number of calls to ChatGPT that could not be corrected because it would cross the limit set by set_MaxCallRepeat().
                                                # For a trustworthy output, this number should equal zero.
mysplitter.api_totalprice: float                # Total price in dollars that processing this document by ChatGPT took. For this field to contain useful information, one should set
                                                # the following fields prior to running process: mysplitter.Costs_price & mysplitter.Costs_tokenportion.
                                                # These fields should match the pricing of your LLM choice, see [OpenAI pricing](https://openai.com/pricing).
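
As a rough illustration of how a total price could follow from the two cost fields, consider the sketch below. It assumes that Costs_price is the price in dollars per block of Costs_tokenportion tokens; this reading is our assumption and is not confirmed here, and estimate_cost is a hypothetical helper, not part of the package.

```python
def estimate_cost(total_tokens: int, price: float, token_portion: int) -> float:
    """Price in dollars, assuming `price` is charged per `token_portion` tokens."""
    if token_portion <= 0:
        raise ValueError("token_portion must be positive")
    return total_tokens / token_portion * price


# Made-up numbers: 25,000 tokens at 0.002 dollars per 1,000 tokens.
print(round(estimate_cost(25_000, 0.002, 1_000), 4))
```

Check the current pricing of your chosen model before relying on any such estimate.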

Database models

This package comes with a collection of Django database models that can be used for integrating the functionality of this package into a Django web application. In case you want to use these models, run pip install djangotextsplitter. For further details on these models, we refer to the documentation of djangotextsplitter.

Testing tools

The package also provides some testing tools out-of-the-box. See the README.md-file in textsplitter/Tests/Tools/ for more information on how to work with them.

Download files

Source Distribution

pdftextsplitter-2.1.4.tar.gz (80.4 MB)

Uploaded Source

Built Distribution

pdftextsplitter-2.1.4-py3-none-any.whl (122.2 kB)

Uploaded Python 3
