ACLSum: A New Dataset for Aspect-based Summarization of Scientific Publications

This repository contains data for our paper "ACLSum: A New Dataset for Aspect-based Summarization of Scientific Publications" and a small utility class to work with it.

Hugging Face datasets

You can use Hugging Face datasets to load ACLSum (dataset link). This is convenient if you want to train transformer models on our dataset.

Just run:

from datasets import load_dataset
dataset = load_dataset("sobamchan/aclsum")

Our utility class

If you want to inspect our data more closely, the following example shows how to use our utility class.

You can install the library together with the dataset via pip:

pip install aclsum

Then you can load the dataset from your Python code:

from aclsum import ACLSum

# Load per split ("train", "val", "test")
train = ACLSum("train")

# One data sample (= paper)
document = train[0]

# Three summaries on each aspect (dict[aspect, summary])
document.summaries

# Get all sentences from the paper as list[str]
# (only the abstract, introduction, and conclusion sections are included)
document.get_all_sentences()

# You can specify sections to extract sentences from
document.get_all_sentences(["abstract", "conclusion"])

# Get highlight labels (list[0 or 1])
document.get_all_highlights()

# Get highlighted sentences (list[str])
document.get_all_highlighted_sentences()
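
The sentence list and the highlight labels described above can be combined by hand, which is useful if you want to filter or inspect them yourself. The following is a minimal sketch, not part of the library: `select_highlighted` is a hypothetical helper, the sample sentences are made up, and it assumes the labels are aligned one-to-one with the sentences, which the README suggests but does not state outright.

```python
from typing import List


def select_highlighted(sentences: List[str], labels: List[int]) -> List[str]:
    """Return the sentences whose highlight label is 1.

    Hypothetical helper; assumes labels[i] belongs to sentences[i].
    """
    if len(sentences) != len(labels):
        raise ValueError("sentences and labels must have the same length")
    return [sentence for sentence, label in zip(sentences, labels) if label == 1]


# Stand-in data (made up for illustration). With the library installed,
# you would pass document.get_all_sentences() and
# document.get_all_highlights() instead.
sentences = [
    "We introduce a dataset.",
    "Prior work is limited.",
    "Results improve.",
]
labels = [1, 0, 1]

highlighted = select_highlighted(sentences, labels)
print(highlighted)
```

If the alignment assumption holds, the result should match what `document.get_all_highlighted_sentences()` returns for the same sections.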

Get original PDF parses

Although the final dataset includes only the Abstract, Introduction, and Conclusion sections, you can also get the raw Grobid output as follows:

# This will load a json file in our repo.
raw_data_from_grobid_in_dict = document.get_fulltext_parse()

# For instance, you can get author information
raw_data_from_grobid_in_dict["authors"]

# Or the fulltext including other sections
# This will return a list of dicts, {"text": str, "cite_spans": list, "eq_spans": list, "section": str, "sec_num": str}
raw_data_from_grobid_in_dict["pdf_parse"]["body_text"]
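
Because `body_text` is a flat list of paragraph dicts, it can help to group the paragraphs by section before reading them. The sketch below is an illustration rather than part of the library: `group_text_by_section` is a hypothetical helper, the sample entries are invented, and it assumes only the keys listed in the comment above (`text`, `cite_spans`, `eq_spans`, `section`, `sec_num`).

```python
from collections import defaultdict
from typing import Dict, List


def group_text_by_section(body_text: List[dict]) -> Dict[str, List[str]]:
    """Collect paragraph texts under their section name, keeping order.

    Hypothetical helper; entries without a section name are grouped
    under "(no section)".
    """
    grouped = defaultdict(list)
    for paragraph in body_text:
        section = paragraph.get("section") or "(no section)"
        grouped[section].append(paragraph["text"])
    return dict(grouped)


# Stand-in entries (made up) that follow the schema described above.
# With the library, pass raw_data_from_grobid_in_dict["pdf_parse"]["body_text"].
body_text = [
    {"text": "First paragraph.", "cite_spans": [], "eq_spans": [],
     "section": "Introduction", "sec_num": "1"},
    {"text": "Second paragraph.", "cite_spans": [], "eq_spans": [],
     "section": "Introduction", "sec_num": "1"},
    {"text": "Model details.", "cite_spans": [], "eq_spans": [],
     "section": "Method", "sec_num": "2"},
]

sections = group_text_by_section(body_text)
for name, paragraphs in sections.items():
    print(name, len(paragraphs))
```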
