This package can read PDF documents, automatically recognise chapter titles, enumerations and other elements in the text, and summarize the document part by part.

Textsplitter package

This package is meant for structure recognition in PDF documents. It reads the text from the document using the pdfminer.six library. Then, it uses an in-house rule-based algorithm to identify chapter titles, enumerations, etc.

The package also offers functionality to summarize the document part by part. This means that a chapter is summarized by summarizing the summaries of the various sections beneath it. Likewise, the document is summarized by summarizing the summaries of the chapters in it, and so on. The summaries are generated by calling the ChatGPT api. This means that an access token to a paid ChatGPT account is required to use this functionality.
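The layered approach can be sketched as a small recursion. This is a minimal illustration, not the package's own implementation: the `Part` class and the functions `dummy_summarize` and `layered_summary` are hypothetical names, and a first-n-words stand-in replaces the ChatGPT call so that the sketch runs without an account.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    """Hypothetical stand-in for one part of a document (chapter, section, ...)."""
    title: str
    text: str = ""
    children: list["Part"] = field(default_factory=list)
    summary: str = ""


def dummy_summarize(text: str, max_words: int = 5) -> str:
    """Stand-in for a language-model call: keep only the first max_words words."""
    return " ".join(text.split()[:max_words])


def layered_summary(part: Part, max_words: int = 5) -> str:
    """Summarize the children first, then summarize this part's own text
    together with the summaries of its children."""
    child_summaries = [layered_summary(child, max_words) for child in part.children]
    combined = " ".join([part.text, *child_summaries]).strip()
    part.summary = dummy_summarize(combined, max_words)
    return part.summary


# A made-up document: one chapter holding two sections.
section_a = Part("1.1", text="Pumps move water through the plant every day.")
section_b = Part("1.2", text="Valves control the flow rate.")
chapter = Part("1", children=[section_a, section_b])
document = Part("Manual", children=[chapter])

layered_summary(document, max_words=5)
print(section_a.summary)
print(chapter.summary)
```

Because every part is summarized from the summaries beneath it, the amount of text handed to the summarizer at each level stays small, even for long documents.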

Finally, the package can also process the split and summarized document into a local html page using an in-house parser.

Installation

To install from PyPI, use:

pip install pdftextsplitter

Getting started

After installing the package, simply run:

from pdftextsplitter import textsplitter 
mysplitter = textsplitter() 
mysplitter.set_documentpath("/absolute/path/to/the/folder/of/your/document/") 
mysplitter.set_documentname("your_document_name") # no .pdf-extension! 
mysplitter.set_outputpath("/absolute/path/to/where/you/want/your/outputs/") 
mysplitter.standard_params() 
mysplitter.process()

Processing can take a long time (up to an hour), depending on the size of your document. Afterwards, you can inspect the results by opening the outputs in the specified output folder (have a look at the html-file), or you can further process the results using your own code.
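Since the document folder and the document name (without the .pdf extension) are passed separately, a small helper can derive both from a single file path. The sketch below is only a convenience suggestion: `split_pdf_path` is a hypothetical function that is not part of the package, and the trailing slash on the folder simply mirrors the example paths shown above.

```python
from pathlib import PurePosixPath


def split_pdf_path(full_path: str) -> tuple[str, str]:
    """Split '/some/folder/report.pdf' into ('/some/folder/', 'report')."""
    path = PurePosixPath(full_path)
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"not a PDF file: {full_path}")
    folder = str(path.parent)
    # Mirror the style of the example paths, which end in a slash.
    if not folder.endswith("/"):
        folder += "/"
    return folder, path.stem


folder, name = split_pdf_path("/home/user/docs/report.pdf")
print(folder)
print(name)
```

The two return values could then be handed to set_documentpath and set_documentname respectively.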

Setting some properties

If you wish to configure some parameters yourself, specify any of the following values between the standard_params-command and the process-command:

mysplitter.set_histogramsize(100)               # Specifies the granularity of the histograms used to perform calculations on font sizes and white lines in the document.
mysplitter.set_MaxSummaryLength(50)             # Guideline for how long the summary of a single textpart should be, in words (outcomes are not exact, but they can be steered with this).
mysplitter.set_summarization_threshold(50)      # If a textpart has fewer words than this amount, it will not be summarized, but copied. This is done to save costs.
mysplitter.set_LanguageModel("gpt-3.5-turbo")   # Choice of the Large Language Model you would like ChatGPT to use.
mysplitter.set_LanguageChoice("Default")        # Language you would like to receive your summaries in from ChatGPT.
mysplitter.set_LanguageTemperature(0.1)         # Temperature of the ChatGPT responses.
mysplitter.set_MaxCallRepeat(20)                # If the ChatGPT api returns an unusable response or an error, we attempt the call again, until this maximum. A higher value
                                                # will give a more trusted output, but also potentially higher costs.
mysplitter.set_UseDummySummary(True)            # If this is set to True, summaries will be created in-house by selecting the first n words from the text; no ChatGPT calls are used.
                                                # Hence, a ChatGPT account is not required in this case.
                                                # This is a great way to experiment with the package without burning any money. Set it to False to receive usable summaries.
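How the summarization threshold and the dummy summary interact can be illustrated with a few lines of plain Python. This is a sketch of the behaviour described in the comments above, not the package's code: the function names are hypothetical, and we assume here that the "first n words" of the dummy summary are limited by the maximum summary length.

```python
def first_n_words(text: str, n: int) -> str:
    """Dummy summary: keep the first n words of the text."""
    return " ".join(text.split()[:n])


def summarize_or_copy(text: str, threshold: int = 50, max_summary_length: int = 50) -> str:
    """Copy short texts unchanged; shorten longer ones.

    Texts with fewer words than `threshold` are returned as-is, which is
    what saves costs when a paid language model is used for the rest.
    """
    if len(text.split()) < threshold:
        return text
    return first_n_words(text, max_summary_length)


short_text = "only four words here"
long_text = "one two three four five six"

print(summarize_or_copy(short_text, threshold=5, max_summary_length=2))
print(summarize_or_copy(long_text, threshold=5, max_summary_length=2))
```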

To let the package use your own paid ChatGPT account (only required after calling set_UseDummySummary(False)), run:

mysplitter.ChatGPT_Key = "my personal access token"

between the standard_params-command and the process-command.
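To keep the access token out of your source code, it can be read from an environment variable before it is assigned. The helper below is a suggestion only: `load_api_key` is a hypothetical function, and the variable name `OPENAI_API_KEY` is an assumption that you can replace with any name you prefer.

```python
import os


def load_api_key(env_name: str = "OPENAI_API_KEY") -> str:
    """Read an access token from the environment and fail loudly if it is missing."""
    key = os.environ.get(env_name, "").strip()
    if not key:
        raise RuntimeError(f"environment variable {env_name} is not set")
    return key


# Intended use, between standard_params() and process():
#     mysplitter.ChatGPT_Key = load_api_key()
```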

Controlling the terminal output

You can control the number of messages printed in the terminal by running mysplitter.process(0) instead of mysplitter.process(). Values of -1, 0, 1, 2, etc. can be used: the higher the value, the more output you receive. A value of -1 makes the terminal completely quiet and also suppresses the output files.

NOTE: Choosing option -1 here will also change the html-visualisation from a complete page to a fragment of a page. This is because option -1 is meant for situations where this package is used as a building block for a web application (Django is the preferred framework). In that case, you want html that you can send to a Django template, which is not the same as a standalone html page. So -1 gives you html output you can use in a web app, while other choices give you a standalone html page.
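The difference between a fragment and a standalone page can be shown with the standard library alone. In the sketch below, the fragment string is a made-up example (the exact html produced by the package is not shown here), and `wrap_fragment` is a hypothetical helper in which `string.Template` plays the role that a Django template would play in a real web application.

```python
from string import Template

# A base page with a placeholder, comparable to a template in a web framework.
PAGE = Template(
    "<!DOCTYPE html>\n"
    "<html>\n"
    "<head><title>$title</title></head>\n"
    "<body>\n"
    "$fragment\n"
    "</body>\n"
    "</html>"
)


def wrap_fragment(fragment: str, title: str = "Summary") -> str:
    """Place an html fragment inside a complete page."""
    return PAGE.substitute(title=title, fragment=fragment)


fragment = "<div><h1>Chapter 1</h1><p>A short summary.</p></div>"
print(wrap_fragment(fragment))
```

In a Django template, html that arrives as a string generally has to be marked as safe (for example with the `safe` filter) before it is rendered rather than escaped.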

Controlling individual actions of the package

You can replace the process-command by any one of, or a combination of, the following commands:

mysplitter.document_metadata()                  # This will extract meta-data from the PDF like author, creation date, etc. using the PyPDF2 library.
mysplitter.read_native_toc("pdfminer")          # If the document has a table of contents in its meta-data, this is extracted. Supported libraries are pymupdf and pdfminer.
mysplitter.textgeneration("pdfminer")           # This will read the text from the PDF document. Supported libraries are: pypdf2 (limited support), pymupdf, pdfminer.
mysplitter.export("default")                    # This will write the extracted text to a .txt-file for your convenience.
mysplitter.fontsizehist()                       # This creates histograms of all the font sizes encountered in the document.
mysplitter.findfontregions()                    # This utilizes those histograms to decide which font sizes are large or small.
mysplitter.calculate_footerboundaries(0)        # This will calculate the cut-offs between headers, footers and body text in the document. The argument (-1, 0, 1, 2, etc.) controls the terminal output.
mysplitter.whitelinehist()                      # Same as fontsizehist, but now for white lines between textlines in the document.
mysplitter.findlineregions()                    # Same as findfontregions, but now for white lines between textlines in the document.
mysplitter.passinfo()                           # This will pass the calculated information to the internal rules so the structure elements in the document can be identified.
mysplitter.breakdown()                          # This will split the text in the document into distinct chapters, sections, etc. 
mysplitter.shiftcontents()                      # This refines the outcome of breakdown in the case the PDF document quotes any personal letters. 
mysplitter.calculatefulltree()                  # This calculates which sections belong to which chapter and so on. 
num_errors = mysplitter.layered_summary(0)      # This will call ChatGPT to create summaries for each part of your document. -1, 0, 1, 2, etc. can be entered to control terminal output.
mysplitter.exportdecisions()                    # This will write the decisions per textline made by breakdown to a .txt-file for future analysis. 
mysplitter.exportalineas("default")             # This will write the outcome of breakdown to a .txt-file for future analysis. 
mysplitter.alineas_to_html()                    # This will generate a standalone html-page with all output from the package (enter "django" for incomplete html).

Note that the process-command is nothing more than the sequential execution of all commands specified above. It is possible to execute only some of the commands while skipping others that you do not need. However, many commands need the outcome of some of the previous commands, so not all orders and combinations of the above commands will result in workable code. At the very least, one should remember to always execute passinfo immediately before breakdown.
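If you assemble your own sequence of commands, a simple check on the list of command names can catch the one rule stated above before any processing starts. The sketch below is hypothetical helper code, not part of the package. It only verifies that every name is a known command and that passinfo comes immediately before breakdown; passing this check does not guarantee that the sequence as a whole is workable.

```python
# The command names listed above, in the order in which they are listed.
KNOWN_STEPS = [
    "document_metadata", "read_native_toc", "textgeneration", "export",
    "fontsizehist", "findfontregions", "calculate_footerboundaries",
    "whitelinehist", "findlineregions", "passinfo", "breakdown",
    "shiftcontents", "calculatefulltree", "layered_summary",
    "exportdecisions", "exportalineas", "alineas_to_html",
]


def check_steps(steps: list[str]) -> list[str]:
    """Return a list of problems found in a proposed sequence of command names."""
    problems = [f"unknown step: {name}" for name in steps if name not in KNOWN_STEPS]
    for position, name in enumerate(steps):
        # breakdown must be directly preceded by passinfo.
        if name == "breakdown" and (position == 0 or steps[position - 1] != "passinfo"):
            problems.append("passinfo must run immediately before breakdown")
    return problems


print(check_steps(["textgeneration", "passinfo", "breakdown"]))
print(check_steps(["passinfo", "textgeneration", "breakdown"]))
```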

Access to the produced data

After running the process-command (or some subset of the individual commands), you can directly access the produced data through the class members. For a full discussion of all members, we refer to the source code (textsplitter/Textpart/textsplitter.py); the most important members are discussed here.

First, each set-function discussed above under 'Setting some properties' comes with an associated get-function to retrieve the parameters you used. This can also be done without setting the parameters in advance, in which case you will obtain the default values.

Secondly, the most important class member is mysplitter.textalineas which is an array of textalinea-objects (textsplitter/Textpart/textalinea.py). Useful members are:

mysplitter.textalineas[index].texttitle: str            # The title of this textpart (chapter, section, etc.) 
mysplitter.textalineas[index].textlevel: int            # How deep this part occurs in the document. 0=entire document, 1=chapter, 2=section, etc. 
mysplitter.textalineas[index].summary: str              # The summary generated by ChatGPT from this textpart. 
mysplitter.textalineas[index].textcontent: list[str]    # The original text of this textpart, as obtained from the PDF. 
mysplitter.textalineas[index].nativeID: int             # The order in which the textparts are identified in the original PDF. 
mysplitter.textalineas[index].parentID: int             # The nativeID of the parent of this textpart. For example, a section belongs to a chapter (we call this the parent). 
                                                        # Note: a summary is generated by adding the textcontent of this textpart to the summaries of all children of this textpart.
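The members nativeID, parentID and textlevel are enough to rebuild the document outline. The sketch below uses a hypothetical stand-in class, `Alinea`, that only mimics the member names listed above, so it runs without the package. The sample data is made up, and the value -1 for the parent of the top-level part is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Alinea:
    """Hypothetical stand-in that mimics some members of a textalinea object."""
    texttitle: str
    textlevel: int
    nativeID: int
    parentID: int
    summary: str = ""


def children_of(alineas: list[Alinea], parent_id: int) -> list[Alinea]:
    """All textparts whose parent has the given nativeID."""
    return [a for a in alineas if a.parentID == parent_id]


def outline(alineas: list[Alinea]) -> list[str]:
    """Titles in document order, indented by their depth in the document."""
    ordered = sorted(alineas, key=lambda a: a.nativeID)
    return ["  " * a.textlevel + a.texttitle for a in ordered]


alineas = [
    Alinea("Manual", textlevel=0, nativeID=0, parentID=-1),
    Alinea("Chapter 1", textlevel=1, nativeID=1, parentID=0),
    Alinea("Section 1.1", textlevel=2, nativeID=2, parentID=1),
    Alinea("Chapter 2", textlevel=1, nativeID=3, parentID=0),
]

print("\n".join(outline(alineas)))
print([a.texttitle for a in children_of(alineas, 0)])
```

With the real package, the same functions could be tried on mysplitter.textalineas, provided the members behave as described above.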

Some other useful members are:

mysplitter.native_TOC: list[Native_TOC_Element] # See textsplitter/TextPart/read_native_toc.py. This holds the table of contents of the PDF, as obtained from its metadata (if present).
mysplitter.doc_metadata_author: str             # The author of the PDF, as obtained from its metadata (if present).
mysplitter.doc_metadata_creator: str            # Same, but now the creator-field.
mysplitter.doc_metadata_producer: str           # Same, but now the producer-field.
mysplitter.doc_metadata_title: str              # Same, but now the title-field.
mysplitter.doc_metadata_subject: str            # Same, but now the subject-field.
mysplitter.html_visualization: str              # The full html-code of the output (as produced by alineas_to_html()).
mysplitter.api_wrongcalls_duetomaxwhile: int    # The number of calls to ChatGPT that could not be corrected because doing so would cross the limit set by set_MaxCallRepeat().
                                                # For a trustworthy output, this number should equal zero.
mysplitter.api_totalprice: float                # Total price in dollars of processing this document with ChatGPT. For this field to contain useful information, one should set
                                                # the following fields prior to running process: mysplitter.Costs_price & mysplitter.Costs_tokenportion.
                                                # These fields should match the pricing of your LLM choice, see [OpenAI pricing](https://openai.com/pricing).
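The idea behind a counter such as api_wrongcalls_duetomaxwhile can be illustrated with a generic retry loop. This is a sketch under assumptions, not the package's implementation: `call_with_retries` and `FlakyService` are hypothetical names, and we assume that the maximum counts the total number of attempts per call.

```python
from typing import Callable, Optional


def call_with_retries(call: Callable[[], Optional[str]], max_repeat: int) -> tuple[Optional[str], int]:
    """Try call() at most max_repeat times.

    Returns (response, attempts); the response is None if every attempt
    raised an error or returned an unusable (empty) result.
    """
    for attempt in range(1, max_repeat + 1):
        try:
            response = call()
        except Exception:
            response = None
        if response:
            return response, attempt
    return None, max_repeat


class FlakyService:
    """Fake api that fails a fixed number of times before it succeeds."""

    def __init__(self, failures_before_success: int) -> None:
        self.failures = failures_before_success
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise RuntimeError("temporary failure")
        return "a usable summary"


# Count the calls that could not be completed within the limit.
wrong_calls = 0
for failures in (0, 2, 5):
    response, attempts = call_with_retries(FlakyService(failures), max_repeat=3)
    if response is None:
        wrong_calls += 1

print(wrong_calls)
```

A counter that stays at zero means that every call eventually produced a usable response within the limit.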

Database models

This package comes with a collection of Django database models that can be used for integrating the functionality of this package into a Django web application. In case you want to use these models, run pip install djangotextsplitter. For further details on these models, we refer to the documentation of djangotextsplitter.

Testing tools

The package also provides some testing tools out-of-the-box. See the README.md-file in textsplitter/Tests/Tools/ for more information on how to work with them.

Release history

2.1.4: Dec 23, 2023 (this version)
2.1.2: Dec 21, 2023
2.1.1: Dec 21, 2023
2.1.0: Dec 21, 2023
2.0.6: Dec 18, 2023
2.0.5: Dec 15, 2023
2.0.4: Dec 15, 2023
2.0.3: Nov 16, 2023
2.0.2: Nov 16, 2023
2.0.1: Nov 14, 2023
2.0.0: Nov 14, 2023
1.2.6: Oct 26, 2023
1.2.5: Oct 25, 2023
1.2.4: Oct 23, 2023
1.2.3: Oct 19, 2023
1.2.2: Oct 18, 2023
1.2.1: Oct 17, 2023
1.2.0: Oct 16, 2023
1.1.4: Oct 13, 2023
1.1.3: Oct 13, 2023
1.1.2: Oct 13, 2023
1.1.1: Oct 13, 2023
1.1.0: Oct 12, 2023

Download files

Source Distribution

pdftextsplitter-2.1.4.tar.gz (80.4 MB)

Uploaded Dec 23, 2023 Source

Built Distribution

pdftextsplitter-2.1.4-py3-none-any.whl (122.2 kB)

Uploaded Dec 23, 2023 Python 3

File details

Details for the file pdftextsplitter-2.1.4.tar.gz.

File metadata

Download URL: pdftextsplitter-2.1.4.tar.gz
Upload date: Dec 23, 2023
Size: 80.4 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for pdftextsplitter-2.1.4.tar.gz
Algorithm	Hash digest
SHA256	`44945172340d0bb59d980483e76c0854ae9709b88a9082ac43f41e4b27a6798f`
MD5	`b165cb1d3319aa379a442e89c95697a0`
BLAKE2b-256	`8b93619ebd910396517b716fc24dc9acdd0db99b3f7e2d61c48bb2e03d5ac363`

File details

Details for the file pdftextsplitter-2.1.4-py3-none-any.whl.

File metadata

Download URL: pdftextsplitter-2.1.4-py3-none-any.whl
Upload date: Dec 23, 2023
Size: 122.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for pdftextsplitter-2.1.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5a2e73e061990b5bb6dc8e770090b2ecddabab019e7c1eb235572a64e0b0427a`
MD5	`7346a0142ffdbd0d6968520892ac83da`
BLAKE2b-256	`1872c0b8e7bf2d3e976abdbc214d86f815b8226aa422583dba43203eff41de3d`
