Python library for article summarization with LoRA and CLI support

DeepCompend

DeepCompend is a Python library that can use any Hugging Face summarization model to quickly generate summaries of scientific articles (PDF) in TXT format.

Installation

First, we clone the repository and change into its directory:

git clone https://github.com/spolivin/deep-compend.git
cd deep-compend

Next, we need to create and activate the virtual environment:

  • Windows
python -m venv ./.venv
.venv\Scripts\activate.bat
  • Linux
sudo apt-get update
sudo apt-get install python3.12-venv
python3.12 -m venv .venv
source .venv/bin/activate

Lastly, we install the library in editable mode:

pip install setuptools wheel
pip install -e .

Since the library uses spaCy's language models for keyword extraction, it is also necessary to install one of them, for instance en_core_web_sm:

# Windows
python -m spacy download en_core_web_sm

# Linux
python3.12 -m spacy download en_core_web_sm

One can of course opt for other spaCy language models, such as en_core_web_md or en_core_web_lg.
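
Before running keyword extraction, it can help to confirm that the chosen model is actually installed. The following is a minimal sketch that relies only on the fact that spaCy models are installed as regular Python packages; the helper name spacy_model_installed is hypothetical and not part of DeepCompend:

```python
from importlib.util import find_spec


def spacy_model_installed(model_name: str) -> bool:
    """Return True if a package named `model_name` is importable."""
    # spaCy pipelines such as en_core_web_sm are installed as ordinary
    # Python packages, so a plain import lookup is enough to detect them.
    return find_spec(model_name) is not None


if __name__ == "__main__":
    for name in ("en_core_web_sm", "en_core_web_md", "en_core_web_lg"):
        status = "installed" if spacy_model_installed(name) else "missing"
        print(f"{name}: {status}")
```

The check only looks the package up and does not load the model, so it is cheap to run inside the activated virtual environment.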

Python API

Suppose we have a test article located at articles/test1.pdf. First, we import the classes needed for summarization:

from deep_compend import ArticleSummarizer, SummaryGenerationConfig
  • ArticleSummarizer - core class for conducting summarization, loading models and generating reports.
  • SummaryGenerationConfig - configuration for storing the parameters of the summary generation (min/max length of summary, penalties for repetition and length, etc.).

Next, we instantiate objects of these two classes. For instance, if we want to use the facebook/bart-large-cnn model for summarization:

# Specifying the config for summary generation (given with default values)
summ_config = SummaryGenerationConfig()

# Instantiating a Summarizer object and specifying the device
summarizer = ArticleSummarizer(model_path="facebook/bart-large-cnn", run_on="cuda")

We can optionally attach LoRA adapters compatible with the model we used in model_path:

# Attaching compatible LoRA adapters if needed
summarizer.load_lora_adapters(lora_adapters_path="spolivin/bart-arxiv-lora")

We can now specify the path to the article we need to summarize and can easily generate the summary:

# Generating summary
generated_summary = summarizer.summarize(pdf_path="articles/test1.pdf", config=summ_config)

The text in generated_summary now contains the summary of the article from articles/test1.pdf. Lastly, we generate the report:

# Generating summary report
summarizer.generate_summary_report("summary_report.txt")

After successful generation, a message shows where the summary has been saved (by default, the summary is saved as a txt file in the summaries folder, which is created if it does not exist).
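
The steps above can be collected into one helper. This is a sketch rather than an official recipe: the wrapper summarize_article and its parameter defaults are hypothetical, while the deep_compend calls are the ones shown in this section. It assumes summarize returns the summary text, as described above.

```python
from pathlib import Path
from typing import Optional


def summarize_article(
    pdf_path: str,
    model_path: str = "facebook/bart-large-cnn",
    run_on: str = "cuda",
    lora_adapters_path: Optional[str] = None,
    report_name: str = "summary_report.txt",
) -> str:
    """Summarize one PDF, write a report, and return the summary text."""
    # Fail early, before any model weights are downloaded or loaded.
    if not Path(pdf_path).is_file():
        raise FileNotFoundError(f"Article not found: {pdf_path}")

    # Imported lazily so the path check above works without the library.
    from deep_compend import ArticleSummarizer, SummaryGenerationConfig

    summarizer = ArticleSummarizer(model_path=model_path, run_on=run_on)
    if lora_adapters_path is not None:
        # Adapters must be compatible with the model given in model_path.
        summarizer.load_lora_adapters(lora_adapters_path=lora_adapters_path)

    summary = summarizer.summarize(
        pdf_path=pdf_path, config=SummaryGenerationConfig()
    )
    summarizer.generate_summary_report(report_name)
    return summary
```

Checking the path first avoids loading a large model only to find that the article is missing.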

Command Line Interface (CLI)

After installation, the deep-compend command becomes available for launching summarization from the terminal. It provides the following subcommands, which extend the analysis of the article to be summarized:

$ deep-compend --help

usage: deep-compend [-h] {summarize,extract-text,extract-keywords} ...

Article summarization tool

positional arguments:
  {summarize,extract-text,extract-keywords}
    summarize           Summarizes a PDF article using a Hugging Face model
    extract-text        Extracts text from article
    extract-keywords    Extracts keywords from article

options:
  -h, --help            show this help message and exit
  • summarize

This subcommand launches summarization and report generation, for instance:

deep-compend summarize articles/test1.pdf --config=configs/config.json

More examples of using this subcommand can be consulted here.

Other CLI arguments for this command are as follows:

$ deep-compend summarize --help

usage: deep-compend summarize [-h] [-c CONFIG] [-mp MODEL_PATH] [-tp TOKENIZER_PATH] [-mxot MAX_OUTPUT_TOKENS] [-mnot MIN_OUTPUT_TOKENS] [-nb NUM_BEAMS]
                              [-lp LENGTH_PENALTY] [-rp REPETITION_PENALTY] [-nrns NO_REPEAT_NGRAM_SIZE] [-lap LORA_ADAPTERS_PATH] [-lw LINE_WIDTH]
                              [-mkn MAX_KEYWORDS_NUM] [-mkl MIN_KEYWORDS_LENGTH] [-rn REPORT_NAME] [-sf SAVE_FOLDER] [-slm SPACY_LANG_MODEL]
                              filepath

Summarizes a PDF article using a Hugging Face model

positional arguments:
  filepath              Path to the PDF article to be summarized

options:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        Path to the config JSON file
  -mp MODEL_PATH, --model-path MODEL_PATH
                        Path to summarization model
  -tp TOKENIZER_PATH, --tokenizer-path TOKENIZER_PATH
                        Path to summarization model tokenizer
  -mxot MAX_OUTPUT_TOKENS, --max-output-tokens MAX_OUTPUT_TOKENS
                        Maximum number of output tokens
  -mnot MIN_OUTPUT_TOKENS, --min-output-tokens MIN_OUTPUT_TOKENS
                        Minimum number of output tokens
  -nb NUM_BEAMS, --num-beams NUM_BEAMS
                        Number of beams for beam search
  -lp LENGTH_PENALTY, --length-penalty LENGTH_PENALTY
                        Penalty for the summary length
  -rp REPETITION_PENALTY, --repetition-penalty REPETITION_PENALTY
                        Penalty for repetitive words
  -nrns NO_REPEAT_NGRAM_SIZE, --no-repeat-ngram-size NO_REPEAT_NGRAM_SIZE
                        Avoid repetitive phrases
  -lap LORA_ADAPTERS_PATH, --lora-adapters-path LORA_ADAPTERS_PATH
                        Path to LoRA adapters
  -lw LINE_WIDTH, --line-width LINE_WIDTH
                        Maximum line width for report formatting
  -mkn MAX_KEYWORDS_NUM, --max-keywords-num MAX_KEYWORDS_NUM
                        Maximum number of keywords in the summary report
  -mkl MIN_KEYWORDS_LENGTH, --min-keywords-length MIN_KEYWORDS_LENGTH
                        Minimum length of keywords to consider in the summary report
  -rn REPORT_NAME, --report-name REPORT_NAME
                        Name of the output summary report
  -sf SAVE_FOLDER, --save-folder SAVE_FOLDER
                        Folder to save the generated summary
  -slm SPACY_LANG_MODEL, --spacy-lang-model SPACY_LANG_MODEL
                        Name of Spacy language model to be used for keyword extraction
  • extract-text

This subcommand shows the preprocessed article text that is passed as input to the summarization model, without running the summarization itself. In other words, it prints the text retrieved from the article, starting at the Introduction and ending before the References:

deep-compend extract-text articles/test1.pdf

The command is useful for checking whether the text was retrieved correctly and for analyzing the input before actually running any models.

  • extract-keywords

This subcommand retrieves keywords from the article using spaCy's language models and can be useful for getting a general idea of what the paper is about:

deep-compend extract-keywords articles/test1.pdf --spacy-lang-model=en_core_web_lg --max-keywords-num=10 --min-keywords-length=7

The command allows specifying the language model to use for extraction (--spacy-lang-model), the maximum number of keywords to show (--max-keywords-num) and the minimum keyword length to consider (--min-keywords-length).
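
To make the effect of the two filtering options concrete, here is a rough sketch of how a length threshold and a cap on the number of keywords could interact. It deliberately swaps spaCy for a plain frequency count from the standard library, so it is not how DeepCompend extracts keywords. The function top_keywords is hypothetical, and treating the minimum length as inclusive is an assumption.

```python
import re
from collections import Counter


def top_keywords(
    text: str, max_keywords_num: int = 10, min_keywords_length: int = 7
) -> list[str]:
    """Return the most frequent long words in `text` (illustration only)."""
    # Lowercase alphabetic tokens only; a real pipeline would lemmatize
    # and filter by part of speech, which this sketch does not attempt.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop tokens shorter than the minimum keyword length (inclusive bound).
    long_enough = [t for t in tokens if len(t) >= min_keywords_length]
    # Keep the most frequent tokens, capped at max_keywords_num.
    counts = Counter(long_enough)
    return [word for word, _ in counts.most_common(max_keywords_num)]
```

Raising the minimum length tends to remove short function words, while the cap keeps the output short enough to scan.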

Overriding arguments

There are two ways that one can specify arguments for the script:

  • Configuration file (--config flag) => examples of configs can be found here.

  • CLI arguments.

When both a config file and CLI arguments are specified, a CLI argument overrides the config value of the same name. For instance, in the following command the --num-beams value from the config is overridden with 5:

deep-compend summarize articles/test1.pdf --config=configs/t5_small_config.json --num-beams=5
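
The precedence rule can be sketched with argparse and a plain dictionary. This is an illustration of the general pattern, not the library's own code: the function resolve_settings is hypothetical, the config keys are assumed to mirror the CLI argument names with underscores, and the config values are made up.

```python
import argparse
import json


def resolve_settings(config: dict, argv: list[str]) -> dict:
    """Merge config values with CLI flags; CLI flags win when given."""
    parser = argparse.ArgumentParser()
    # default=None lets us tell "flag not given" apart from a real value.
    parser.add_argument("--num-beams", type=int, default=None)
    parser.add_argument("--max-output-tokens", type=int, default=None)
    cli = vars(parser.parse_args(argv))

    merged = dict(config)
    # Only flags the user actually passed replace the config values.
    merged.update({key: value for key, value in cli.items() if value is not None})
    return merged


if __name__ == "__main__":
    # Made-up config contents, standing in for a JSON config file.
    config = json.loads('{"num_beams": 4, "max_output_tokens": 200}')
    print(resolve_settings(config, ["--num-beams=5"]))
```

Using None as the parser default is what makes the merge safe: a flag that was not passed never clobbers a value from the config.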

Example scripts

I have prepared a few shell scripts that demonstrate how the CLI can be used for summarization, along with a script that automatically downloads an article from ArXiv given its ID. For instance, we can download the well-known paper Deep Residual Learning for Image Recognition, which has the ArXiv ID 1512.03385:

# Loading paper from ArXiv and saving it in 'articles' folder
python pull_arxiv_paper.py 1512.03385
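
The source of pull_arxiv_paper.py is not shown here. As a rough idea of what such a helper could look like, the sketch below uses only the standard library and arXiv's public https://arxiv.org/pdf/<id> address. The function names are hypothetical, only new-style IDs such as 1512.03385 are handled, and arXiv may reject requests without a browser-like user agent.

```python
from pathlib import Path
from urllib.request import urlretrieve


def arxiv_pdf_url(arxiv_id: str) -> str:
    """Build the public PDF address for a new-style arXiv ID."""
    return f"https://arxiv.org/pdf/{arxiv_id}"


def download_paper(arxiv_id: str, folder: str = "articles") -> Path:
    """Download the paper into `folder` and return the saved path."""
    target_dir = Path(folder)
    # Create the folder on first use, mirroring the 'articles' convention.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{arxiv_id}.pdf"
    urlretrieve(arxiv_pdf_url(arxiv_id), str(target))
    return target
```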

Now we can run each of the below scripts one by one to test the CLI and different configurations:

# Using "facebook/bart-large-cnn"
bash scripts/run_bart_large.sh articles/1512.03385.pdf

# Using "facebook/bart-large-cnn" with LoRA adapters
bash scripts/run_bart_lora.sh articles/1512.03385.pdf

# Using default settings
bash scripts/run_default.sh articles/1512.03385.pdf

# Using "google-t5/t5-base"
bash scripts/run_t5_base.sh articles/1512.03385.pdf

# Using "google-t5/t5-small"
bash scripts/run_t5_small.sh articles/1512.03385.pdf

# Using "google-t5/t5-small" with overridden arguments
bash scripts/run_t5_small_override.sh articles/1512.03385.pdf

After running these commands, the respective summary reports, with additional information and statistics, are generated and saved in the summaries folder (by default).

Library limitations

The main limitation lies in how article sections are named. The library retrieves the text from an "Introduction-like" section up to a "References-like" section and uses the result as input to the summary generation model. It recognizes the most common names for the Introduction and References sections, but when an article names these sections differently, the text may not be retrieved correctly.
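
The limitation is easier to see with a small sketch of heading-based extraction. The patterns and the function extract_body below are hypothetical and much simpler than whatever the library actually uses; they only illustrate why an unusual section name leads to a failed retrieval.

```python
import re
from typing import Optional

# Illustrative heading patterns only; the library's real list is not shown here.
START = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]+|[IVX]+\.[ \t]+)?introduction[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)
END = re.compile(
    r"^[ \t]*(?:references|bibliography)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_body(text: str) -> Optional[str]:
    """Return the text between the start and end headings, or None."""
    start = START.search(text)
    if start is None:
        # No recognizable Introduction heading: nothing can be retrieved.
        return None
    end = END.search(text, start.end())
    # Without an end heading, fall back to the rest of the document.
    stop = end.start() if end else len(text)
    return text[start.end():stop].strip()
```

An article whose opening section is titled "Background", for example, would not match the start pattern, so this sketch returns None for it.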
