Python library for article summarization with LoRA and CLI support
Project description
DeepCompend
DeepCompend is a Python library that can use any Hugging Face summarization model to quickly generate summaries of scientific articles (PDF) and save them as TXT reports.
Installation
Firstly, we clone the repository:
git clone https://github.com/spolivin/deep-compend.git
Next, we need to create and activate the virtual environment:
- Windows
python -m venv ./.venv
.venv\Scripts\activate.bat
- Linux
sudo apt-get update
sudo apt-get install python3.12-venv
python3.12 -m venv .venv
source .venv/bin/activate
Lastly, we install the library in editable mode:
pip install setuptools wheel
pip install -e .
Since the library uses spaCy's language models for keyword extraction, a language model must also be installed, for instance en_core_web_sm:
# Windows
python -m spacy download en_core_web_sm
# Linux
python3.12 -m spacy download en_core_web_sm
One can of course opt for loading other spaCy language models:
en_core_web_md or en_core_web_lg
Python API
Suppose we have a test article located at articles/test1.pdf. First, we import the classes needed for summarization:
from deep_compend import ArticleSummarizer, SummaryGenerationConfig
- ArticleSummarizer: core class for conducting summarization, loading models and generating reports.
- SummaryGenerationConfig: configuration storing the parameters of summary generation (min/max length of summary, penalties for repetition and length, etc.).
Next, we instantiate objects of these two classes. For instance, if we want to use the facebook/bart-large-cnn model for summarization:
# Specifying the config for summary generation (given with default values)
summ_config = SummaryGenerationConfig()
# Instantiating a summarizer object and specifying the device
summarizer = ArticleSummarizer(model_path="facebook/bart-large-cnn", run_on="cuda")
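If a GPU may not be available, the device can be chosen at run time. The helper below is a minimal sketch, not part of the library: pick_device is a hypothetical name, and it assumes that PyTorch is the backend and that run_on also accepts "cpu", neither of which is confirmed above.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
# Assumption: the result can be passed as run_on=device when creating the summarizer.
```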
We can optionally attach LoRA adapters compatible with the model specified in model_path:
# Attaching compatible LoRA adapters if needed
summarizer.load_lora_adapters(lora_adapters_path="spolivin/bart-arxiv-lora")
We can now specify the path to the article we need to summarize and can easily generate the summary:
# Generating summary
generated_summary = summarizer.summarize(pdf_path="articles/test1.pdf", config=summ_config)
The text in generated_summary now contains the summary of the article from articles/test1.pdf. Lastly, we generate the report:
# Generating summary report
summarizer.generate_summary_report("summary_report.txt")
After successful generation, a message shows where the summary has been saved (by default, the summary is saved as a txt file in the summaries folder, which is created if it does not exist).
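The same two calls can be wrapped in a loop to process a whole folder. This is a sketch only: summarize_folder is a hypothetical helper, and it assumes that one summarizer object can be reused across articles and that generate_summary_report reports on the most recently summarized article.

```python
from pathlib import Path


def summarize_folder(summarizer, config, folder):
    """Summarize every PDF in `folder` and write one report per article."""
    summaries = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # Same two calls as in the walkthrough above
        summaries[pdf.name] = summarizer.summarize(pdf_path=str(pdf), config=config)
        summarizer.generate_summary_report(f"{pdf.stem}_report.txt")
    return summaries
```

With the objects created earlier, this would be called as summarize_folder(summarizer, summ_config, "articles").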
Command Line Interface (CLI)
To make the library easier to use, installation also provides the deep-compend command for launching summarization. This CLI command comes with the following subcommands, which extend the analysis of the article to be summarized:
$ deep-compend --help
usage: deep-compend [-h] {summarize,extract-text,extract-keywords} ...
Article summarization tool
positional arguments:
{summarize,extract-text,extract-keywords}
summarize Summarizes a PDF article using a Hugging Face model
extract-text Extracts text from article
extract-keywords Extracts keywords from article
options:
-h, --help show this help message and exit
summarize
This subcommand launches summarization and report generation, for instance:
deep-compend summarize articles/test1.pdf --config=configs/config.json
More examples of using this subcommand can be found in the Example scripts section.
Other CLI arguments for this command are as follows:
$ deep-compend summarize --help
usage: deep-compend summarize [-h] [-c CONFIG] [-mp MODEL_PATH] [-tp TOKENIZER_PATH] [-mxot MAX_OUTPUT_TOKENS] [-mnot MIN_OUTPUT_TOKENS] [-nb NUM_BEAMS]
[-lp LENGTH_PENALTY] [-rp REPETITION_PENALTY] [-nrns NO_REPEAT_NGRAM_SIZE] [-lap LORA_ADAPTERS_PATH] [-lw LINE_WIDTH]
[-mkn MAX_KEYWORDS_NUM] [-mkl MIN_KEYWORDS_LENGTH] [-rn REPORT_NAME] [-sf SAVE_FOLDER] [-slm SPACY_LANG_MODEL]
filepath
Summarizes a PDF article using a Hugging Face model
positional arguments:
filepath Path to the PDF article to be summarized
options:
-h, --help show this help message and exit
-c CONFIG, --config CONFIG
Path to the config JSON file
-mp MODEL_PATH, --model-path MODEL_PATH
Path to summarization model
-tp TOKENIZER_PATH, --tokenizer-path TOKENIZER_PATH
Path to summarization model tokenizer
-mxot MAX_OUTPUT_TOKENS, --max-output-tokens MAX_OUTPUT_TOKENS
Maximum number of output tokens
-mnot MIN_OUTPUT_TOKENS, --min-output-tokens MIN_OUTPUT_TOKENS
Minimum number of output tokens
-nb NUM_BEAMS, --num-beams NUM_BEAMS
Number of beams for beam search
-lp LENGTH_PENALTY, --length-penalty LENGTH_PENALTY
Penalty for the summary length
-rp REPETITION_PENALTY, --repetition-penalty REPETITION_PENALTY
Penalty for repetitive words
-nrns NO_REPEAT_NGRAM_SIZE, --no-repeat-ngram-size NO_REPEAT_NGRAM_SIZE
Avoid repetitive phrases
-lap LORA_ADAPTERS_PATH, --lora-adapters-path LORA_ADAPTERS_PATH
Path to LoRA adapters
-lw LINE_WIDTH, --line-width LINE_WIDTH
Maximum line width for report formatting
-mkn MAX_KEYWORDS_NUM, --max-keywords-num MAX_KEYWORDS_NUM
Maximum number of keywords in the summary report
-mkl MIN_KEYWORDS_LENGTH, --min-keywords-length MIN_KEYWORDS_LENGTH
Minimum length of keywords to consider in the summary report
-rn REPORT_NAME, --report-name REPORT_NAME
Name of the output summary report
-sf SAVE_FOLDER, --save-folder SAVE_FOLDER
Folder to save the generated summary
-slm SPACY_LANG_MODEL, --spacy-lang-model SPACY_LANG_MODEL
Name of Spacy language model to be used for keyword extraction
extract-text
This subcommand shows, before summarization is run, the preprocessed article text that is passed as input to the summarization model. In other words, it prints the article text starting from the Introduction and ending before the References:
deep-compend extract-text articles/test1.pdf
This command is useful for checking whether the text was retrieved correctly, and it allows analyzing the input before actually running any models.
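The idea of cutting the text between the two section headings can be sketched with regular expressions. This is not the library's implementation: extract_body and both heading patterns are hypothetical, and the real list of recognized heading names is not shown here.

```python
import re

# Hypothetical heading patterns (assumption, not the library's actual list)
INTRO_RE = re.compile(
    r"^\s*(?:\d+\.?\s+|[IVX]+\.\s+)?introduction\s*$",
    re.IGNORECASE | re.MULTILINE,
)
REFS_RE = re.compile(
    r"^\s*(?:\d+\.?\s+)?(?:references|bibliography)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_body(text):
    """Return the text from the Introduction heading up to the References heading.

    Returns None when either heading cannot be located.
    """
    intro = INTRO_RE.search(text)
    if intro is None:
        return None
    # Look for the References heading only after the Introduction
    refs = REFS_RE.search(text, intro.end())
    if refs is None:
        return None
    return text[intro.start():refs.start()].strip()


sample = (
    "Title\nAbstract\nShort abstract.\n"
    "1. Introduction\nBody text.\n"
    "References\n[1] A paper.\n"
)
print(extract_body(sample))
```

Returning None when a heading is missing makes the failure explicit, which is the situation the extract-text subcommand helps to spot before any model is run.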
extract-keywords
This subcommand retrieves the keywords from the article using spaCy's language models and can be useful for getting a general idea of what the paper is about:
deep-compend extract-keywords articles/test1.pdf --spacy-lang-model=en_core_web_lg --max-keywords-num=10 --min-keywords-length=7
The command allows specifying the language model to use for extraction (--spacy-lang-model), the maximum number of keywords to show (--max-keywords-num) and the minimum keyword length to consider (--min-keywords-length).
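To illustrate what the two numeric options control, here is a standard-library sketch. It swaps spaCy for plain word-frequency counting, so it is not the library's extraction method. top_keywords is a hypothetical function, and it assumes that the minimum keyword length is measured in characters.

```python
import re
from collections import Counter


def top_keywords(text, max_keywords_num=10, min_keywords_length=7):
    """Return the most frequent words with at least `min_keywords_length` characters."""
    words = re.findall(r"[a-z]+", text.lower())
    # Short words are dropped before counting
    counts = Counter(w for w in words if len(w) >= min_keywords_length)
    # At most `max_keywords_num` words are returned, most frequent first
    return [word for word, _ in counts.most_common(max_keywords_num)]


text = (
    "Residual learning eases training. Residual networks gain accuracy "
    "from depth; residual blocks help."
)
print(top_keywords(text, max_keywords_num=2, min_keywords_length=7))
```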
Overriding arguments
There are two ways to specify arguments for the script:
- Configuration file (--config flag); examples of configs can be found in the configs folder.
- CLI arguments.
When both a config and CLI arguments are specified, an argument that appears in both takes the value given on the command line. For instance, in the following command the --num-beams value from the config is overridden with 5:
deep-compend summarize articles/test1.pdf --config=configs/t5_small_config.json --num-beams=5
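This precedence rule can be sketched with argparse and json. It is an illustration under assumptions, not the script's actual code: resolve_settings is a hypothetical function, only two of the flags are reproduced, and the config keys (num_beams, length_penalty) are assumed to mirror the flag names.

```python
import argparse
import json
import tempfile
from pathlib import Path


def resolve_settings(argv):
    """Merge config-file values with CLI values; CLI values win."""
    parser = argparse.ArgumentParser()
    parser.add_argument("filepath")
    parser.add_argument("-c", "--config")
    # default=None distinguishes "flag not given" from an explicit value
    parser.add_argument("-nb", "--num-beams", type=int, default=None)
    parser.add_argument("-lp", "--length-penalty", type=float, default=None)
    args = parser.parse_args(argv)

    settings = {}
    if args.config:
        with open(args.config, encoding="utf-8") as fh:
            settings.update(json.load(fh))
    # Only flags that were actually passed override the config
    cli_values = {
        key: value
        for key, value in vars(args).items()
        if key != "config" and value is not None
    }
    settings.update(cli_values)
    return settings


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    # Hypothetical config content
    config_path.write_text(json.dumps({"num_beams": 4, "length_penalty": 2.0}))
    settings = resolve_settings(
        ["articles/test1.pdf", "--config", str(config_path), "--num-beams", "5"]
    )
print(settings)
```

Here num_beams ends up as 5 (from the CLI), while length_penalty keeps the config value because the flag was not passed.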
Example scripts
I have prepared a few shell scripts that demonstrate how the CLI can be used for summarization. To try them on a test article, one can use the script I have also prepared for automatically downloading an article from arXiv given its ID. For instance, we can load the well-known paper Deep Residual Learning for Image Recognition, which has the arXiv ID 1512.03385:
# Loading paper from ArXiv and saving it in 'articles' folder
python pull_arxiv_paper.py 1512.03385
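For readers who want to see what such a download involves, below is a minimal sketch using only the standard library. It is not the content of pull_arxiv_paper.py: arxiv_pdf_url and download_paper are hypothetical names, and the default folder and file name are chosen to match the paths used in this section.

```python
import urllib.request
from pathlib import Path


def arxiv_pdf_url(arxiv_id):
    """Build the PDF URL for an arXiv identifier such as "1512.03385"."""
    return f"https://arxiv.org/pdf/{arxiv_id}"


def download_paper(arxiv_id, folder="articles"):
    """Download the PDF into `folder` and return the path of the saved file."""
    target = Path(folder) / f"{arxiv_id}.pdf"
    target.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(arxiv_pdf_url(arxiv_id), str(target))
    return target


print(arxiv_pdf_url("1512.03385"))
# download_paper("1512.03385") would save the file as articles/1512.03385.pdf
```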
Now we can run each of the below scripts one by one to test the CLI and different configurations:
# Using "facebook/bart-large-cnn"
bash scripts/run_bart_large.sh articles/1512.03385.pdf
# Using "facebook/bart-large-cnn" with LoRA adapters
bash scripts/run_bart_lora.sh articles/1512.03385.pdf
# Using default settings
bash scripts/run_default.sh articles/1512.03385.pdf
# Using "google-t5/t5-base"
bash scripts/run_t5_base.sh articles/1512.03385.pdf
# Using "google-t5/t5-small"
bash scripts/run_t5_small.sh articles/1512.03385.pdf
# Using "google-t5/t5-small" with overridden arguments
bash scripts/run_t5_small_override.sh articles/1512.03385.pdf
After running these commands, the respective summary reports with additional information and statistics will be generated and saved in the summaries folder (by default).
Library limitations
The main limitation concerns how article sections are named. The library retrieves the text from the "Introduction-like" section up to the "References-like" section and uses the result as input to the summary generation model. While the library recognizes the most common names of the Introduction and References sections and retrieves text accordingly, these sections sometimes have other names, in which case the text may not be retrieved correctly.