
This package can read PDF documents, automatically recognise chapter titles, enumerations and other elements in the text, and summarize the document part by part.

Project description

Textsplitter package

This package is meant for structure recognition in PDF documents. It reads the text from the document using the pdfminer.six library. Then, it uses an in-house rule-based algorithm to identify chapter titles, enumerations, etc.

The package also offers functionality to summarize the document part by part. This means that a chapter is summarized by summarizing the summaries of the various sections beneath it. Likewise, the document as a whole is summarized by summarizing the summaries of its chapters. The summaries are generated by calling the ChatGPT api. This means that an access token to a paid ChatGPT account is required to use this functionality.
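The bottom-up summarization described above can be sketched as a small recursion. This is only an illustration, not the package's implementation: the `Part` class, `first_words` and `summarize_tree` are hypothetical names, and a placeholder that keeps the first few words stands in for the ChatGPT call.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Part:
    """Hypothetical stand-in for one text part (document, chapter, section)."""
    title: str
    text: str = ""
    children: List["Part"] = field(default_factory=list)
    summary: str = ""


def first_words(text: str, n: int = 8) -> str:
    """Placeholder summarizer: keeps the first n words instead of calling an LLM."""
    return " ".join(text.split()[:n])


def summarize_tree(part: Part, summarize: Callable[[str], str]) -> str:
    """Summarize all children first, then summarize own text plus child summaries."""
    child_summaries = [summarize_tree(child, summarize) for child in part.children]
    combined = " ".join([part.text, *child_summaries]).strip()
    part.summary = summarize(combined)
    return part.summary


doc = Part("Report", children=[
    Part("Chapter 1", text="Intro text.", children=[
        Part("Section 1.1", text="Details about the first topic."),
        Part("Section 1.2", text="Details about the second topic."),
    ]),
    Part("Chapter 2", text="Closing remarks for the report."),
])
summarize_tree(doc, first_words)
print(doc.summary)
```

Because every part is summarized from its children's summaries rather than from their full text, the input to each summarization call stays small even for large documents.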

Finally, the package can also turn the split and summarized document into a local html page using an in-house parser.

Installation

To install from PyPI, use:

pip install pdftextsplitter

Getting started

After installing the package, simply run:

from pdftextsplitter import textsplitter 
mysplitter = textsplitter() 
mysplitter.set_documentpath("/absolute/path/to/the/folder/of/your/document/") 
mysplitter.set_documentname("your_document_name") # no .pdf-extension! 
mysplitter.set_outputpath("/absolute/path/to/where/you/want/your/outputs/") 
mysplitter.standard_params() 
mysplitter.process() 

The process-command can take a long time (up to an hour), depending on the size of your document. Afterwards, you can inspect the results by opening the outputs in the specified output folder (have a look at the html-file), or you can further process the results using your own code.

Setting some properties

If you wish to configure some parameters yourself, specify any of the following values between the standard_params-command and the process-command:

mysplitter.set_histogramsize(100)               # Specifies the granularity of the histograms used to perform calculations on font size and white lines in the document.
mysplitter.set_MaxSummaryLength(50)             # Guideline for how long a summary of a single textpart should be, in words (outcomes are not exact, but they can be steered with this).
mysplitter.set_summarization_threshold(50)      # If a textpart has fewer words than this amount, it will not be summarized, but copied. This is done to save costs.
mysplitter.set_LanguageModel("gpt-3.5-turbo")   # Choice of the Large Language Model you would like ChatGPT to use.
mysplitter.set_LanguageChoice("Default")        # Language you would like to receive your summaries in from ChatGPT.
mysplitter.set_LanguageTemperature(0.1)         # Temperature of the ChatGPT responses.
mysplitter.set_MaxCallRepeat(20)                # If the ChatGPT api returns an unusable response or an error, we attempt the call again, until this maximum. A higher value
                                                # will give a more trusted output, but also potentially higher costs.
mysplitter.set_UseDummySummary(True)            # If this is set to True, summaries will be created in-house by selecting the first n words from the text; no ChatGPT calls are used.
                                                # Hence, a ChatGPT account is not required in this case.
                                                # This is a great way to experiment with the package without burning any money. Set it to False for receiving usable summaries.
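To make the interplay between the summarization threshold and the dummy summary concrete, here is a minimal sketch. The function `dummy_summary` is hypothetical, and it assumes that the "first n words" of the dummy summary correspond to the maximum summary length; the package may choose n differently.

```python
def dummy_summary(text: str, max_summary_length: int = 50,
                  summarization_threshold: int = 50) -> str:
    """Illustrative dummy summary: no LLM call is made.

    Assumption: n (the number of words kept) equals max_summary_length.
    """
    words = text.split()
    if len(words) < summarization_threshold:
        # Short text parts are copied unchanged instead of being summarized.
        return text
    # Longer text parts are cut down to their first n words.
    return " ".join(words[:max_summary_length])


short_part = "A short section."
long_part = " ".join(f"word{i}" for i in range(120))

print(dummy_summary(short_part))
print(len(dummy_summary(long_part).split()))
```

With a real language model, only the second branch would cost money, which is why copying short text parts saves costs.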

To let the package use your own paid ChatGPT account (only required when UseDummySummary is set to False), run:

mysplitter.ChatGPT_Key = "my personal access token"

between the standard_params-command and the process-command.

Controlling the terminal output

You can control the number of messages printed to the terminal by passing a value to the process-command, for example mysplitter.process(0). Values of -1, 0, 1, 2, etc. can be used; the higher the value, the more output you receive. A value of -1 makes the terminal completely quiet and also suppresses the output files.

NOTE: Choosing option -1 here will also change the html-visualisation from a complete page to a fragment of a page. This is because option -1 is meant for cases where this package is used as a building block for a web application, preferably one built with Django. In that case, you want html that you can pass to a Django template, which is not the same as a standalone html-page. So -1 gives you html-output that you can use in a web app, while all other choices give you a standalone html-page.
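The difference between a fragment and a standalone page can be illustrated as follows. This is a sketch only: `render_fragment` and `render_standalone` are hypothetical helpers, and the markup is invented for the example; it is not the html that the package produces.

```python
from html import escape


def render_fragment(title: str, summary: str) -> str:
    """Return an html fragment that can be embedded in a larger page."""
    return f"<section><h2>{escape(title)}</h2><p>{escape(summary)}</p></section>"


def render_standalone(title: str, summary: str) -> str:
    """Wrap the same fragment in a complete html document."""
    body = render_fragment(title, summary)
    return (
        "<!DOCTYPE html>\n<html>\n"
        f'<head><meta charset="utf-8"><title>{escape(title)}</title></head>\n'
        f"<body>{body}</body>\n</html>"
    )


fragment = render_fragment("Costs & risks", "Short summary.")
page = render_standalone("Costs & risks", "Short summary.")
print(fragment)
```

In a Django view, a fragment like this would typically be passed to the template context and rendered with the `safe` filter, so that the surrounding template supplies the `<html>` and `<body>` elements.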

Controlling individual actions of the package

You can replace the process-command by any one, or a combination, of the following commands:

mysplitter.document_metadata()                  # This will extract meta-data from the PDF like author, creation date, etc. using the PyPDF2 library.
mysplitter.read_native_toc("pdfminer")          # If the document has a table of contents in its meta-data, this is extracted. Supported libraries are pymupdf and pdfminer. 
mysplitter.textgeneration("pdfminer")           # This will read the text from the PDF document. Supported libraries are: pypdf2 (limited support), pymupdf, pdfminer. 
mysplitter.export("default")                    # This will write the extracted text to a .txt-file for your convenience.
mysplitter.fontsizehist()                       # This creates histograms of all the font sizes encountered in the document.
mysplitter.findfontregions()                    # This utilizes those histograms to decide which font sizes are large or small. 
mysplitter.calculate_footerboundaries(0)        # This will calculate the cut-offs between headers, footers and body text in the document. -1, 0, 1, 2, etc. control the terminal output.
mysplitter.whitelinehist()                      # Same as fontsizehist, but now for white lines between textlines in the document.
mysplitter.findlineregions()                    # Same as findfontregions, but now for white lines between textlines in the document.
mysplitter.passinfo()                           # This will pass the calculated information to the internal rules so the structure elements in the document can be identified.
mysplitter.breakdown()                          # This will split the text in the document into distinct chapters, sections, etc. 
mysplitter.shiftcontents()                      # This refines the outcome of breakdown in the case the PDF document quotes any personal letters. 
mysplitter.calculatefulltree()                  # This calculates which sections belong to which chapter and so on. 
num_errors = mysplitter.layered_summary(0)      # This will call ChatGPT to create summaries for each part of your document. -1, 0, 1, 2, etc. can be entered to control terminal output.
mysplitter.exportdecisions()                    # This will write the decisions per textline made by breakdown to a .txt-file for future analysis. 
mysplitter.exportalineas("default")             # This will write the outcome of breakdown to a .txt-file for future analysis. 
mysplitter.alineas_to_html()                    # This will generate a standalone html-page with all output from the package (enter "django" for incomplete html). 
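The idea behind fontsizehist and findfontregions, building a histogram of font sizes and then deciding which sizes count as large or small, can be sketched as below. The package's actual rules are in-house and not documented here, so this sketch assumes one simple rule: the most frequent font size is body text, anything larger is "large" and anything smaller is "small". All names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def fontsize_histogram(lines: List[Tuple[str, float]]) -> Counter:
    """Count the number of characters seen at each (rounded) font size."""
    hist: Counter = Counter()
    for text, size in lines:
        hist[round(size, 1)] += len(text)
    return hist


def classify_sizes(hist: Counter) -> Dict[float, str]:
    """Label each font size relative to the most common one (assumed body text)."""
    body_size, _ = hist.most_common(1)[0]
    labels: Dict[float, str] = {}
    for size in hist:
        if size > body_size:
            labels[size] = "large"
        elif size < body_size:
            labels[size] = "small"
        else:
            labels[size] = "body"
    return labels


# Invented sample data: (text of a line, its font size in points).
lines = [
    ("Introduction", 18.0),
    ("This is ordinary body text of the document.", 11.0),
    ("More body text follows here.", 11.0),
    ("1 A footnote.", 8.0),
]
labels = classify_sizes(fontsize_histogram(lines))
print(labels)
```

Lines in a "large" font are then natural candidates for chapter or section titles, while "small" lines are candidates for footnotes.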

Note that the process-command is nothing more than the sequential execution of all commands specified above. It is possible to execute only some of the commands while skipping others that you do not need. However, many commands need the outcome of some of the previous commands, so not all orders and combinations of the above commands will result in workable code. At the very least, one should remember to always execute passinfo immediately before breakdown.
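As an illustration, the following sketch runs only the structure-recognition steps and skips the ChatGPT summaries and the html generation. The helper name `split_without_summaries` is hypothetical, the method names and arguments are the ones listed above, and whether exactly this subset runs cleanly has not been verified against the package.

```python
def split_without_summaries(splitter, verbosity: int = 0) -> None:
    """Run the splitting steps in their documented order, without summaries."""
    splitter.document_metadata()
    splitter.read_native_toc("pdfminer")
    splitter.textgeneration("pdfminer")
    splitter.fontsizehist()
    splitter.findfontregions()
    splitter.calculate_footerboundaries(verbosity)
    splitter.whitelinehist()
    splitter.findlineregions()
    splitter.passinfo()        # must be executed immediately before breakdown
    splitter.breakdown()
    splitter.shiftcontents()
    splitter.calculatefulltree()
    splitter.exportalineas("default")


class Recorder:
    """Test double that records which methods were called, in order."""

    def __init__(self) -> None:
        self.calls = []

    def __getattr__(self, name):
        def method(*args):
            self.calls.append(name)
        return method


recorder = Recorder()
split_without_summaries(recorder)
print(recorder.calls)
```

With the real package, one would pass the configured mysplitter object (after standard_params) instead of the Recorder.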

Access to the produced data

After running the process-command (or some decomposition of it), you can directly access the produced data by retrieving the class members. For a full discussion of all members, we refer to the source code (textsplitter/Textpart/textsplitter.py), but we will discuss the most important members here.

First, each set-function discussed above under 'Setting some properties' comes with an associated get-function to retrieve the parameters you used. This can also be done without setting the parameters in advance, in which case you will obtain the default values.

Secondly, the most important class member is mysplitter.textalineas, which is an array of textalinea-objects (textsplitter/Textpart/textalinea.py). Useful members are:

mysplitter.textalineas[index].texttitle: str            # The title of this textpart (chapter, section, etc.) 
mysplitter.textalineas[index].textlevel: int            # How deep this part occurs in the document. 0=entire document, 1=chapter, 2=section, etc. 
mysplitter.textalineas[index].summary: str              # The summary generated by ChatGPT from this textpart. 
mysplitter.textalineas[index].textcontent: list[str]    # The original text of this textpart, as obtained from the PDF. 
mysplitter.textalineas[index].nativeID: int             # The order in which the textparts are identified in the original PDF. 
mysplitter.textalineas[index].parentID: int             # The nativeID of the parent of this textpart. For example, a section belongs to a chapter (we call this the parent). 
                                                        # Note: a summary is generated by adding the textcontent of this textpart to the summaries of all children of this textpart. 
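The members above are enough to reconstruct the document outline. The sketch below uses a hypothetical `Alinea` dataclass as a stand-in for the textalinea-objects, so that it runs without the package; the value -1 for the parentID of the top-level part is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alinea:
    """Hypothetical stand-in exposing some of the textalinea members listed above."""
    texttitle: str
    textlevel: int
    nativeID: int
    parentID: int


def outline(alineas: List[Alinea]) -> List[str]:
    """Return one line per text part, in document order, indented by textlevel."""
    ordered = sorted(alineas, key=lambda a: a.nativeID)
    return ["  " * a.textlevel + a.texttitle for a in ordered]


def children_of(alineas: List[Alinea], parent_id: int) -> List[Alinea]:
    """Return all text parts whose parentID points at the given nativeID."""
    return [a for a in alineas if a.parentID == parent_id]


parts = [
    Alinea("Report", 0, 0, -1),   # -1 as parentID of the root is an assumption
    Alinea("Chapter 1", 1, 1, 0),
    Alinea("Section 1.1", 2, 2, 1),
    Alinea("Chapter 2", 1, 3, 0),
]
print("\n".join(outline(parts)))
```

With the real package, `outline(mysplitter.textalineas)` should work in the same way, provided the objects expose the members listed above.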

Some other useful members are:

mysplitter.native_TOC: list[Native_TOC_Element] # See textsplitter/TextPart/read_native_toc.py. This holds the table of contents of the PDF, obtained from its metadata (if present).
mysplitter.doc_metadata_author: str             # The author of the PDF, as obtained from its metadata (if present). 
mysplitter.doc_metadata_creator: str            # Same, but now the creator-field. 
mysplitter.doc_metadata_producer: str           # Same, but now the producer-field.
mysplitter.doc_metadata_title: str              # Same, but now the title-field.
mysplitter.doc_metadata_subject: str            # Same, but now the subject-field.
mysplitter.html_visualization: str              # The full html-code of the output (as produced by alineas_to_html()). 
mysplitter.api_wrongcalls_duetomaxwhile: int    # The number of calls to ChatGPT that could not be corrected because it would cross the limit set by set_MaxCallRepeat().
                                                # For a trustworthy output, this number should equal zero.
mysplitter.api_totalprice: float                # Total price in dollars that processing this document by ChatGPT took. For this field to contain useful information, one should set
                                                # the following fields prior to running process: mysplitter.Costs_price & mysplitter.Costs_tokenportion.
                                                # These fields should match the pricing of your LLM choice, see [OpenAI pricing](https://openai.com/pricing).
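To clarify what the counter api_wrongcalls_duetomaxwhile measures, here is a minimal sketch of a retry loop with an upper limit. It is not the package's code: `call_with_retries`, `make_flaky_call` and the usability check are hypothetical, and a fake call replaces the ChatGPT api so that the example runs offline.

```python
from typing import Callable, Iterable, Optional, Tuple


def call_with_retries(call: Callable[[], str],
                      is_usable: Callable[[str], bool],
                      max_call_repeat: int = 20) -> Tuple[Optional[str], int]:
    """Repeat the call until it yields a usable response or the limit is hit.

    Returns (response, attempts); response is None when the limit was reached.
    """
    attempts = 0
    while attempts < max_call_repeat:
        attempts += 1
        try:
            response = call()
        except Exception:
            continue  # an error counts as a failed attempt; try again
        if is_usable(response):
            return response, attempts
    return None, attempts


def make_flaky_call(outcomes: Iterable) -> Callable[[], str]:
    """Build a fake api call that yields the given outcomes one by one."""
    iterator = iter(outcomes)

    def call() -> str:
        item = next(iterator)
        if isinstance(item, Exception):
            raise item
        return item

    return call


outcomes = [RuntimeError("timeout"), "", "A usable summary."]
not_empty = lambda response: bool(response.strip())

result, attempts = call_with_retries(make_flaky_call(outcomes), not_empty)
failed, tries = call_with_retries(make_flaky_call(outcomes), not_empty, max_call_repeat=2)
print(result, attempts, failed, tries)
```

In this sketch, a call that ends with `None` is the kind of case a counter like api_wrongcalls_duetomaxwhile would record. A higher limit makes such cases rarer, at the price of potentially more paid calls.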

Database models

This package comes with a collection of Django database models that can be used for integrating the functionality of this package into a Django web application. In case you want to use these models, run pip install djangotextsplitter. For further details on these models, we refer to the documentation of djangotextsplitter.

Testing tools

The package also provides some testing tools out of the box. See the README.md-file in textsplitter/Tests/Tools/ for more information on how to work with them.
