
textsum

utility for using transformers summarization models on text docs

This package provides easy-to-use interfaces for using summarization models on text documents of arbitrary length. Currently implemented interfaces include a python API, CLI, and a shareable demo app.

For details, explanations, and docs, see the wiki.



Installation

Install using pip:

# create a virtual environment (optional)
pip install textsum

The textsum package is now installed in your environment. Both the CLI commands and the python API can summarize text docs from anywhere on your system; see the Usage section for more details.

Full Installation

To install all the dependencies (includes PDF OCR, gradio UI demo, optimum, etc), run:

git clone https://github.com/pszemraj/textsum.git
cd textsum
# create a virtual environment (optional)
pip install -e .[all]

Additional Details

This package uses the clean-text python package and, like the "base" version of that package, does not include the GPL-licensed unidecode dependency by default. If you want to use unidecode, install textsum with the extra:

pip install textsum[unidecode]

In practice, cleaning text before summarization with or without unidecode should not make a significant difference.
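To see the effect yourself, here is a minimal sketch using the clean-text package directly (textsum's own cleaning settings may differ):

from cleantext import clean

text = "Déjà vu: naïve café"

# keep unicode characters as-is (accents preserved)
print(clean(text, fix_unicode=True, to_ascii=False, lower=False))

# transliterate to ASCII; uses unidecode if installed, otherwise
# clean-text falls back to Python's unicodedata with a warning
print(clean(text, fix_unicode=True, to_ascii=True, lower=False))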

Usage

There are three ways to use this package:

  1. python API
  2. CLI
  3. Demo App

Python API

To use the python API, import the Summarizer class and instantiate it. This will load the default model and parameters.

You can then use the summarize_string method to summarize a long text string.

from textsum.summarize import Summarizer

summarizer = Summarizer() # loads default model and parameters

# summarize a long string
out_str = summarizer.summarize_string('This is a long string of text that will be summarized.')
print(f'summary: {out_str}')

You can also summarize a file directly:

out_path = summarizer.summarize_file('/path/to/file.txt')
print(f'summary saved to {out_path}')
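For example, to summarize every text file in a directory from python, you can loop over files (a sketch that mirrors the textsum-dir CLI, assuming summarize_file accepts any path to a text file):

from pathlib import Path

from textsum.summarize import Summarizer

summarizer = Summarizer()

# summarize each .txt file in a directory, one at a time
for txt_file in sorted(Path('/path/to/dir').glob('*.txt')):
    out_path = summarizer.summarize_file(str(txt_file))
    print(f'summary saved to {out_path}')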

CLI

To summarize a directory of text files, run the following command:

textsum-dir /path/to/dir

The following options are available:

usage: textsum-dir [-h] [-o OUTPUT_DIR] [-m MODEL_NAME] [--no_cuda] [--tf32] [-8bit]
                   [-batch BATCH_LENGTH] [-stride BATCH_STRIDE] [-nb NUM_BEAMS]
                   [-l2 LENGTH_PENALTY] [-r2 REPETITION_PENALTY]
                   [-length_ratio MAX_LENGTH_RATIO] [-ml MIN_LENGTH]
                   [-enc_ngram ENCODER_NO_REPEAT_NGRAM_SIZE] [-dec_ngram NO_REPEAT_NGRAM_SIZE]
                   [--no_early_stopping] [--shuffle] [--lowercase] [-v] [-vv] [-lf LOGFILE]
                   input_dir

For more information, run the following:

textsum-dir --help

Demo App

For convenience, a UI demo[^1] is provided using gradio. To make sure the necessary dependencies are installed, install the app extra:

pip install textsum[app]

To run the demo, run the following command:

textsum-ui

This will start a local server that you can access in your browser, and a shareable link will be printed to the console.

[^1]: The demo is minimal but will be expanded to accept other arguments and options.

Using Big Models

Summarization is a memory-intensive task, and the default model is relatively small and efficient for long-form text summarization. If you want to use a bigger model, you can specify the model_name_or_path argument when instantiating the Summarizer class.

summarizer = Summarizer(model_name_or_path='pszemraj/long-t5-tglobal-xl-16384-book-summary')

You can also use the -m argument when using the CLI:

textsum-dir /path/to/dir -m pszemraj/long-t5-tglobal-xl-16384-book-summary

Reducing Memory Usage

Efficient Inference

If you have compatible hardware, you can reduce memory usage by loading the model in 8-bit precision via LLM.int8 and/or by using the --tf32 flag to enable TensorFloat32 precision. See the transformers docs for more details on how these work. Using LLM.int8 requires the bitsandbytes package, which can either be installed directly or via the textsum[8bit] extra:

pip install textsum[8bit]

To enable these options, pass the -8bit and --tf32 flags when using the CLI:

textsum-dir /path/to/dir -8bit --tf32

Or in python, using the load_in_8bit argument:

summarizer = Summarizer(load_in_8bit=True)

If you're using the python API, it's better to enable TF32 yourself; see here for how.
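For reference, a minimal sketch of enabling TF32 before loading the model; these are standard PyTorch flags, not textsum-specific, and only take effect on Ampere or newer NVIDIA GPUs:

import torch

from textsum.summarize import Summarizer

# enable TensorFloat32 for matmuls and cuDNN ops (global PyTorch flags)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

summarizer = Summarizer()  # inference can now use TF32 on supported GPUs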

Parameters

Memory usage can also be reduced by adjusting the parameters for inference. This is discussed in detail in the project wiki.

tl;dr for this README: use the .set_inference_params() and .get_inference_params() methods to adjust the parameters for inference.
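A sketch of what that looks like, assuming get_inference_params() returns a dict of generation settings and set_inference_params() accepts one back; the num_beams key is inferred from the CLI's -nb flag, so check the wiki for the exact interface:

from textsum.summarize import Summarizer

summarizer = Summarizer()

# inspect the current generation settings
params = summarizer.get_inference_params()
print(params)

# assumed: fewer beams trades some summary quality for lower memory usage
params['num_beams'] = 2
summarizer.set_inference_params(params)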


Contributing

Contributions are welcome! Please open an issue or PR if you have any ideas or suggestions.

See the CONTRIBUTING.md file for details on how to contribute.

Roadmap

  • add CLI for summarization of all text files in a directory
  • python API for summarization of text docs
  • add argparse CLI for UI demo
  • put on PyPI
  • LLM.int8 inference
  • optimum inference integration
  • better documentation in the wiki, details on improving performance (speed, quality, memory usage, etc.)
  • improvements to the PDF OCR helper module

Other ideas? Open an issue or PR!


Project generated with PyScaffold
