
A tool to quickly create sweet PDF files from text files.


PDF Confectionary :cupcake:

Work in the NLP domain and find that your end users/clients don't like using .txt files with your excellent results? Look no further!

PDF Confectionary is a tool for quickly creating templated PDFs from text files using FPDF2. Essentially, point it at a directory of text files, and generate some sweet PDFs.


Quickly convert text files into readable, paragraph-segmented PDFs that are easy to navigate.



About

The focus of this repo is to provide a simple, easy-to-use, and extensible PDF creation tool. Relevant features in PDF Confectionary include:

  • Automatic paragraph separation via the textsplit module
  • Fast navigation through generated PDFs via links from the TOC to each chapter and footer links back to the TOC
  • Keyword extraction for each text file

This module was inspired by the need to create clean output documents for reading and reviewing speech transcriptions from the vid2cleantxt project. PDF Confectionary was initially designed as a command-line tool, but it also provides a Python API for more advanced use cases.


Installation

Requirements

Primary modules used by confectionary are: FPDF2, textsplit, gensim, and clean-text.

All dependencies are listed in the requirements.txt file.

Package Installation

The package can be installed using pip:

pip install confectionary

To install from source as an editable package (useful for development), run:

git clone https://github.com/pszemraj/confectionary.git
cd confectionary
pip install -e .
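
To confirm that either installation method worked, the installed version can be queried from Python. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of confectionary.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("confectionary")
if version is None:
    print("confectionary is not installed in this environment")
else:
    print(f"confectionary {version} is installed")
```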

Usage

There are two ways to use PDF Confectionary:

  1. command line, via python confectionary/text2pdf.py -i <input_dir> -o <output_dir>

  2. Python API via functions in the confectionary.text2pdf module. The dir_to_pdf function is the equivalent of the command line tool application.

Both create one PDF from all .txt files in the input directory and save it to output_dir. Add the -r switch (or recurse=True in the function) to load files recursively.
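
The difference between the default and recursive modes can be illustrated with a small standard-library sketch. It is not the package's own implementation; `find_txt_files` is a hypothetical helper that only mirrors the behaviour described above.

```python
import tempfile
from pathlib import Path


def find_txt_files(input_dir, recurse=False):
    """Return sorted .txt paths in input_dir, descending into subfolders if recurse is True."""
    root = Path(input_dir)
    pattern = "**/*.txt" if recurse else "*.txt"
    return sorted(p for p in root.glob(pattern) if p.is_file())


# Build a throwaway directory tree to compare the two modes.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("first", encoding="utf-8")
    (root / "notes.md").write_text("ignored", encoding="utf-8")
    (root / "sub").mkdir()
    (root / "sub" / "b.txt").write_text("second", encoding="utf-8")

    top_level = find_txt_files(root)
    everything = find_txt_files(root, recurse=True)
    print(f"{len(top_level)} file(s) without recursion, {len(everything)} with recursion")
```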

Command Line Usage

Call python confectionary/text2pdf.py -i /path/to/input/dir -o /path/to/output/dir to create a pdf from all txt files in the input directory and save it to the output directory:

    python confectionary/text2pdf.py -i /path/to/input/dir -o /path/to/output/dir \
    -kw "my keywords to label this document."

The example below shows the output of the command line tool and uses the -m switch to select a specific word2vec model.

$ python confectionary/text2pdf.py -i "example/text-files" -o "example/outputs" -kw "my keywords to label this document" \
    -m "glove-wiki-gigaword-200"

Output:

Since the GPL-licensed package `unidecode` is not installed, using Pythons `unicodedata` package which yields worse results.
3 files found matching extension .txt

# entries is 3, < title thresh 39
will use one page for TOC

Building Chapters in PDF file:   0%|                                                                                    | 0/3 [00:00<?, ?it/s]
No local model file - downloading glove-wiki-gigaword-200 from gensim-data API
[==================================================] 100.0% 252.1/252.1MB downloaded

Loaded word2vec model glove-wiki-gigaword-200
Building Chapters in PDF file: 100%|████████████████████████████████████████████████████████████████████████████| 3/3 [01:23<00:00, 27.77s/it]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3484.61it/s]

PDF file written to example/outputs/text-to-PDF/my keywords to label this document_txt2pdf_Oct-18-2022_standard.pdf

Find out more about the command line tool by running python confectionary/text2pdf.py -h.
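
For orientation, the switches mentioned in this README can be lined up with the dir_to_pdf parameters it names. The sketch below is a hypothetical wrapper, not the package's actual parser: it assumes that -r maps to recurse, -kw to keywords, -m to word2vec_model, and --no-split to do_paragraph_splitting=False. The -h output remains the authoritative reference.

```python
import argparse


def build_parser():
    """Hypothetical parser covering only the switches named in this README."""
    parser = argparse.ArgumentParser(description="Sketch of a text-to-PDF command line")
    parser.add_argument("-i", dest="input_dir", required=True, help="directory with .txt files")
    parser.add_argument("-o", dest="output_dir", required=True, help="directory for the PDF")
    parser.add_argument("-r", dest="recurse", action="store_true", help="load files recursively")
    parser.add_argument("-kw", dest="keywords", default=None, help="keywords to label the document")
    parser.add_argument("-m", dest="word2vec_model", default=None, help="word2vec model to use")
    parser.add_argument("--no-split", dest="no_split", action="store_true",
                        help="disable paragraph splitting")
    return parser


def to_kwargs(args):
    """Translate parsed switches into keyword arguments (assumed mapping)."""
    kwargs = {
        "input_dir": args.input_dir,
        "output_dir": args.output_dir,
        "recurse": args.recurse,
        "do_paragraph_splitting": not args.no_split,
    }
    # Only pass optional values that were actually given on the command line.
    if args.keywords is not None:
        kwargs["keywords"] = args.keywords
    if args.word2vec_model is not None:
        kwargs["word2vec_model"] = args.word2vec_model
    return kwargs


example = to_kwargs(build_parser().parse_args(
    ["-i", "example/text-files", "-o", "example/outputs", "-kw", "my keywords", "-r", "--no-split"]
))
print(example)
```

The resulting dictionary could then be passed on as dir_to_pdf(**example), provided the assumed parameter names match the function's docstring.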

Basic Usage within Python

Three basic functions are available in confectionary.text2pdf: dir_to_pdf, file_to_pdf, and str_to_pdf:

  • dir_to_pdf takes a directory path and creates a pdf from all txt files in the directory.
  • file_to_pdf takes a file path and creates a pdf from the file.
  • str_to_pdf takes a string and creates a pdf from the string.

Details on the function arguments can be found in the relevant function docstrings (or call help()). To replicate the above command line usage, use the following code:

from confectionary.text2pdf import dir_to_pdf

report_path = dir_to_pdf(
    input_dir="/path/to/input/dir",
    output_dir="/path/to/output/dir",
    keywords="my keywords to label this document",
)

print(f"Report saved to {report_path}")

Check out the dir_to_pdf docstring for more options:

import inspect
from confectionary.text2pdf import dir_to_pdf

# getdoc() only returns the docstring, so print it to see it outside a REPL
print(inspect.getdoc(dir_to_pdf))

Note: word2vec model loading

Splitting input text into paragraphs is enabled by default and uses a word2vec model. If the model is not available locally, it is downloaded via gensim's API and saved to the ./models directory.

  • The quality of the paragraph splitting and the runtime of the script both depend on the size and complexity of the word2vec model. To use a different model, pass the path to the model to the dir_to_pdf function via the word2vec_model argument.
    • The default model is glove-wiki-gigaword-100, a 100-dimensional model with a download size of ~130 MB.
  • Additional downloadable models are listed here. This info is also available by passing the --api-info flag to the command line tool or calling the confectionary.utils.print_api_info() function.
  • Using paragraph splitting is not required and can be disabled by setting the do_paragraph_splitting parameter to False or, in command line mode, by adding the --no-split switch.
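
The download-once behaviour described above can be sketched with gensim directly. This is an illustration under stated assumptions, not confectionary's own code: the helper names and the `<model_name>.kv` file naming are hypothetical, and gensim is imported lazily so the path helper runs without it installed.

```python
from pathlib import Path


def local_model_path(model_name, models_dir="models"):
    """Hypothetical naming scheme: <models_dir>/<model_name>.kv"""
    return Path(models_dir) / f"{model_name}.kv"


def load_or_download(model_name="glove-wiki-gigaword-100", models_dir="models"):
    """Load vectors from a local file if present, otherwise download and save them."""
    # Imported here so that the rest of the sketch works without gensim installed.
    import gensim.downloader as api
    from gensim.models import KeyedVectors

    path = local_model_path(model_name, models_dir)
    if path.exists():
        return KeyedVectors.load(str(path))

    vectors = api.load(model_name)  # downloads the model on first use
    path.parent.mkdir(parents=True, exist_ok=True)
    vectors.save(str(path))
    return vectors


# Only the path helper is exercised here; calling load_or_download() triggers a large download.
print(local_model_path("glove-wiki-gigaword-100"))
```

Note that gensim's downloader also keeps its own cache of downloaded data, so a second local copy is a design choice rather than a requirement.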

TODO list

  • convert the text2pdf.py script to a module/function
  • publish to PyPI
  • add alternate, smaller, word2vec models for splitting paragraphs
  • improve TOC calculation beyond a simple title threshold
  • Add a basic notebook demo

License

Apache License 2.0


Contributing

  • Given the open-ended nature of documentation creation, there are a lot of features that are not yet implemented. Please feel free to contribute!
  • Developers can contribute to this project by submitting pull requests in this repo - see details in the contributing guide.


Download files


Source Distribution

confectionary-0.3.tar.gz (25.3 kB)

Uploaded Source

Built Distribution

confectionary-0.3-py3-none-any.whl (31.6 kB)

Uploaded Python 3

File details

Details for the file confectionary-0.3.tar.gz.

File metadata

  • Download URL: confectionary-0.3.tar.gz
  • Size: 25.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.9.15

File hashes

Hashes for confectionary-0.3.tar.gz
Algorithm Hash digest
SHA256 ca0e2793debfac2b0db07a4d45da80f84afe3c56a66e53f7e101f9b0696bb6a0
MD5 5db2e2f067f2b06def5317ccf827d3e2
BLAKE2b-256 b3d8b1a8a0344e825e9732baf3c2bb8cbf7f20aea022422f795b1a8cc2996f46


File details

Details for the file confectionary-0.3-py3-none-any.whl.

File metadata

  • Download URL: confectionary-0.3-py3-none-any.whl
  • Size: 31.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.9.15

File hashes

Hashes for confectionary-0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 c6b4d02984c9ed9d4a2b3904fc0fb409a5e25c3e6425692841bc83b9fe4f9cc1
MD5 4ca0ab1a8d4d2d60d058aba57083ca23
BLAKE2b-256 9bbc2e766776d70d22896c8f7aadf3d4cd42f5b59a6d98a3c07a0bd7f0a9058c
