THExt - Transformer-based Highlights Extraction

THExt

Transformer-based Highlights Extraction from scientific papers (THExt)

Examples and demo

All examples below were extracted with the best-performing model reported in the paper. No manual pre- or post-processing was applied during highlights extraction. The text of the papers was parsed from PDF files using GROBID.
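
GROBID returns each parsed paper as TEI XML, so the abstract and body text have to be pulled out of that file before they can be passed to THExt. The following is a minimal sketch of that step using only the standard library. The function name `extract_text_from_tei` is hypothetical and not part of THExt, and the element paths assume GROBID's usual full-text TEI layout (abstract under `profileDesc`, paragraphs under `text/body`).

```python
import xml.etree.ElementTree as ET

# TEI namespace used by GROBID output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def _clean(element):
    """Join all nested text of an element and collapse whitespace."""
    return " ".join("".join(element.itertext()).split())


def extract_text_from_tei(tei_xml):
    """Return (abstract, body_paragraphs) from a GROBID TEI XML string.

    Hypothetical helper: the paths below are an assumption about the
    TEI layout, not something defined by THExt.
    """
    root = ET.fromstring(tei_xml)
    abstract_parts = [
        _clean(p)
        for p in root.findall(".//tei:profileDesc/tei:abstract//tei:p", TEI_NS)
    ]
    body_paragraphs = [
        _clean(p)
        for p in root.findall(".//tei:text/tei:body//tei:p", TEI_NS)
    ]
    # Drop paragraphs that are empty after whitespace cleanup.
    return " ".join(abstract_parts), [p for p in body_paragraphs if p]
```

The paragraphs returned here would still need to be split into sentences (for example with spaCy) before being used as candidate sentences.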

Pre-trained models will be released after the paper revision process.

Installation

Run the following commands to install THExt and the spaCy model it uses:

pip install git+https://github.com/MorenoLaQuatra/THExt.git
python -m spacy download en_core_web_lg
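
To confirm that both commands succeeded, a quick check like the one below can help. It is only a sketch: `check_install` is a hypothetical helper, and the module names assume that the package imports as `thext` (as in the usage example) and that the spaCy model is installed as the importable package `en_core_web_lg`.

```python
from importlib.util import find_spec


def check_install(modules=("thext", "spacy", "en_core_web_lg")):
    """Map each top-level module name to True if it can be found.

    Hypothetical helper; it only locates the modules and does not
    import them, so it is cheap to run.
    """
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, found in check_install().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```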

Usage

Using pretrained models

from thext import SentenceRankerPlus
from thext import RedundancyManager
from thext import Highlighter

base_model_name = "morenolq/thext-cs-scibert"
model_name_or_path = "morenolq/thext-cs-scibert"
sr = SentenceRankerPlus()
sr.load_model(base_model_name=base_model_name, model_name_or_path=model_name_or_path)
h = Highlighter(sr)

# Define a set of sentences
sentences = [
    "We propose a new approach, based on Transformer-based encoding, to highlight extraction. To the best of our knowledge, this is the first attempt to use transformer architectures to address automatic highlight generation.", 
    "We design a context-aware sentence-level regressor, in which the semantic similarity between candidate sentences and highlights is estimated by also attending the contextual knowledge provided by the other paper sections.",
    "Fig. 2, Fig. 3, Fig. 4 show the effect of varying the number K of selected highlights on the extraction performance. As expected, recall values increase while increasing the number of selected highlights, whereas precision values show an opposite trend.",
]
abstract = "Highlights are short sentences used to annotate scientific papers. They complement the abstract content by conveying the main result findings. To automate the process of paper annotation, highlights extraction aims at extracting from 3 to 5 paper sentences via supervised learning. Existing approaches rely on ad hoc linguistic features, which depend on the analyzed context, and apply recurrent neural networks, which are not effective in learning long-range text dependencies. This paper leverages the attention mechanism adopted in transformer models to improve the accuracy of sentence relevance estimation. Unlike existing approaches, it relies on the end-to-end training of a deep regression model. To attend patterns relevant to highlights content it also enriches sentence encodings with a section-level contextualization. The experimental results, achieved on three different benchmark datasets, show that the designed architecture is able to achieve significant performance improvements compared to the state-of-the-art."

num_highlights = 1

highlights = h.get_highlights_simple(
    sentences,
    abstract,
    rel_w=1.0,
    pos_w=0.0,
    red_w=0.0,
    prefilter=False,
    NH=num_highlights,
)

# Use a distinct loop variable so the Highlighter `h` is not overwritten
for i, highlight in enumerate(highlights):
    print(f"{i}\t{highlight}")
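
When the same extraction is run on many papers, it can be convenient to wrap the call in a small function. The sketch below assumes only the `get_highlights_simple` call and keyword arguments shown in the usage example. The name `extract_top_k` and its handling of `k` are illustrative choices, not part of THExt.

```python
def extract_top_k(highlighter, sentences, abstract, k=3):
    """Return up to k highlights chosen by `highlighter`.

    Hypothetical helper: `highlighter` is any object exposing
    get_highlights_simple() with the keyword arguments used in the
    usage example (for instance a thext Highlighter).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not sentences:
        return []
    # Never request more highlights than there are candidate sentences.
    k = min(k, len(sentences))
    return list(
        highlighter.get_highlights_simple(
            sentences,
            abstract,
            rel_w=1.0,
            pos_w=0.0,
            red_w=0.0,
            prefilter=False,
            NH=k,
        )
    )
```

With the objects from the usage example, a call such as `extract_top_k(h, sentences, abstract, k=2)` would then request two highlights.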

Developing THExt

To install THExt together with the tools needed for development and testing, run the following from the repository root inside your virtualenv:

$ pip install -e ".[dev]"

Version 1.0
