Entity and Relation Extraction on scientific text using HGERE with a span-pruning stage.

gsapere — Entity and Relation Extraction for Scientific Text

A fork of HGERE adapted for scientific text, with a two-stage pipeline for joint entity and relation extraction (ERE).

Paper under review. Configs used for our experiments are in configs/.

The pipeline consists of:

  1. Rule-based pre-filter (optional) — removes deterministically non-entity spans (punctuation, function-word sequences, etc.) before the neural pruner sees training data, reducing trivial negatives and speeding up training
  2. Span Pruner — a binary classifier that scores remaining candidate n-grams and filters them to a manageable set (target: ≥ 98 % entity recall)
  3. HGERE — a Hypergraph GNN that jointly predicts entity types and relations on the pruned candidates
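
The pruning stage can be pictured with a small sketch. It is not the package's implementation: the function names (`enumerate_spans`, `prune`, `entity_recall`), the maximum span width, and the toy scorer standing in for the neural pruner are all assumptions made for illustration. The sketch enumerates candidate n-grams, keeps those whose score reaches a threshold, and measures how many gold entity spans survive.

```python
from typing import Callable, List, Set, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end inclusive


def enumerate_spans(tokens: List[str], max_width: int = 3) -> List[Span]:
    """List every n-gram of up to max_width tokens as a candidate span."""
    return [
        (start, end)
        for start in range(len(tokens))
        for end in range(start, min(start + max_width, len(tokens)))
    ]


def prune(spans: List[Span], score: Callable[[Span], float], threshold: float) -> List[Span]:
    """Keep only the spans whose score reaches the threshold."""
    return [span for span in spans if score(span) >= threshold]


def entity_recall(kept: List[Span], gold: Set[Span]) -> float:
    """Fraction of gold entity spans that survived pruning."""
    if not gold:
        return 1.0
    return len(gold & set(kept)) / len(gold)


tokens = ["We", "train", "BERT", "on", "SQuAD", "."]
gold = {(2, 2), (4, 4)}  # hypothetical gold entities: "BERT" and "SQuAD"


def toy_score(span: Span) -> float:
    """Hypothetical scorer: capitalised single tokens score high, the rest low."""
    start, end = span
    if start == end and tokens[start][0].isupper():
        return 0.9
    return 0.0001


candidates = enumerate_spans(tokens)
kept = prune(candidates, toy_score, threshold=0.0005)
recall = entity_recall(kept, gold)
print(len(candidates), kept, recall)
```

A low threshold trades a larger candidate set for higher recall, which matters because an entity dropped at this stage cannot be recovered by the later stage.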

Supported datasets: GSAP-ERE, SciER, SciNLP, SciERC


Changes from the original

  • Large-scale code restructuring: Pydantic-first configs, typed signatures throughout, proper package layout under src/
  • All dependencies updated to current versions
  • The transformers package is no longer hardcoded; any compatible Hugging Face transformers version works
  • Added rule-based pre-filter, span pruner stage, multi-dataset joint training, and full CLI entry points
  • Tests for all major components

Requirements

  • Python 3.9 (tested; <3.11 required by some dependencies)
  • CUDA 12.8 (adjust pyproject.toml for other CUDA versions)
  • A GPU with at least ~24 GB VRAM for default batch sizes (tested on A40 / 40 GB)

Installation

Install uv:

curl -LsSf https://astral.sh/uv/install.sh | sh

Clone the repository and install:

git clone <repo-url>
cd HGERE
uv sync

Datasets

Datasets are downloaded from their original sources with the download command:

uv run gsapere-download-dataset --list          # list available datasets
uv run gsapere-download-dataset gsap-ere
uv run gsapere-download-dataset scier
uv run gsapere-download-dataset scinlp
uv run gsapere-download-dataset scierc
uv run gsapere-download-dataset --all           # download everything

See documentation/download-dataset.md for split details and manual download fallbacks.

GSAP-ERE

Fine-grained entity and relation extraction focused on machine learning — 100 annotated full-text ML publications, 63K entities, 35K relations, 10 entity types, 18 relation types. DOI: https://doi.org/10.60914/c4c1d-s0587

Otto et al., "GSAP-ERE: Fine-Grained Scholarly Entity and Relation Extraction Focused on Machine Learning", AAAI 2026. https://ojs.aaai.org/index.php/AAAI/article/view/40537

SciER

Entity and relation extraction dataset for datasets, methods, and tasks in scientific documents — 106 annotated full-text papers, 24k entities, 12k relations.

Zhang et al., "SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents", EMNLP 2024. https://aclanthology.org/2024.emnlp-main.726/

SciNLP

Full-text entity and relation extraction benchmark for the NLP domain — 60 annotated ACL papers, 6,409 entities, 1,648 relations.

"SciNLP: A Domain-Specific Benchmark for Full-Text Scientific Entity and Relation Extraction in NLP", EMNLP 2025. https://aclanthology.org/2025.emnlp-main.732/

SciERC

Scientific information extraction benchmark — 500 annotated AI abstracts, 6 entity types, 7 relation types.

Luan et al., "Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction", EMNLP 2018. https://aclanthology.org/D18-1360/


Training

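Training has two required stages: first train the span pruner, then train HGERE on the pruner's output. Fitting the rule-based pre-filter beforehand is optional, which is why the steps below are numbered 1 to 3.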

Step 1 — Fit the rule-based pre-filter (optional)

uv run gsapere-fit-rulebased-pruner configs/train/gsap/fit_rulebased_pruner.yaml

This fits token n-gram patterns from the training data that deterministically exclude non-entity spans. The saved JSON file is referenced in the pruner training config to speed up training.
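
One way to read "fitting" such patterns is sketched below. This is an assumption about the general idea, not the project's algorithm: the function `fit_exclusion_patterns`, its `max_width` and `min_count` parameters, and the `{"exclude": [...]}` JSON layout are hypothetical. The sketch collects lowercased token n-grams that occur at least `min_count` times in the training data and never coincide with a gold entity span.

```python
import json
from collections import Counter
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end inclusive


def fit_exclusion_patterns(
    sentences: List[List[str]],
    gold_spans: List[Set[Span]],
    max_width: int = 2,
    min_count: int = 2,
) -> List[str]:
    """Return n-gram patterns seen often enough and never labelled as an entity."""
    seen: Counter = Counter()
    entity_patterns: Set[str] = set()
    for tokens, gold in zip(sentences, gold_spans):
        for start in range(len(tokens)):
            for end in range(start, min(start + max_width, len(tokens))):
                pattern = " ".join(tok.lower() for tok in tokens[start:end + 1])
                seen[pattern] += 1
                if (start, end) in gold:
                    entity_patterns.add(pattern)
    return sorted(
        pattern
        for pattern, count in seen.items()
        if count >= min_count and pattern not in entity_patterns
    )


# Hypothetical two-sentence training set; "BERT" is the only gold entity.
sentences = [
    ["We", "train", "BERT", "."],
    ["We", "evaluate", "BERT", "."],
]
gold_spans = [{(2, 2)}, {(2, 2)}]

patterns = fit_exclusion_patterns(sentences, gold_spans)
patterns_json = json.dumps({"exclude": patterns})  # hypothetical file layout
print(patterns_json)
```

Requiring a minimum count keeps the filter from excluding a pattern on the evidence of a single occurrence.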

Step 2 — Train the span pruner

uv run gsapere-train-pruner configs/train/gsap/train_gsap_pruner.yaml

After training, run pruner inference on train/dev/test to produce the enriched input files for HGERE (see scripts/pruner/).

Step 3 — Train HGERE (single dataset)

uv run gsapere-train-hgere configs/train/gsap/train_gsap_hgere.yaml

Example config:

schema_version: "1.0"
label_set: gsap
model_dir: saves/hgere/gsap
base_model_name_or_path: pretrained_models/scibert_scivocab_uncased
ner_prediction_dir: saves/pruner/gsap/output
max_seq_length: 512
n_iter: 3
layernorm: true
attn_self: true

train_params:
  learning_rate: 1e-5
  num_train_epochs: 8
  per_gpu_train_batch_size: 21
  fp16: true
  evaluate_during_training: true
  eval_epochs: 1
  loss_re_weight_alpha: 0.9
  log_wandb: true

Step 3 (alt) — Train HGERE on multiple datasets jointly

Multi-dataset mode trains a shared encoder with per-dataset NER and relation heads. Each dataset must have its own pruner output directory.

uv run gsapere-train-hgere configs/multi-sciere-scinlp-gsap-ere/train/hgere/train_multi.yaml

Example config:

schema_version: "1.0"
model_dir: saves/multi/hgere/run1
base_model_name_or_path: pretrained_models/scibert_scivocab_uncased
max_seq_length: 512
n_iter: 3
layernorm: true
attn_self: true
sampling_temperature: 0.8   # 0 = always largest dataset, 1 = proportional to size
seeds: [42, 43, 44]          # run once per seed; _seed<n> appended to model_dir

datasets:
  - label_set: scier
    ner_prediction_dir: saves/pruner/scier/output
    train_file: ent_pred_train.json
    dev_file: ent_pred_dev.json
    test_file: ent_pred_test.json
  - label_set: scinlp
    ner_prediction_dir: saves/pruner/scinlp/output
    train_file: ent_pred_train.json
    dev_file: ent_pred_dev.json   # omit (null) to skip dev evaluation for this dataset
  - label_set: gsap
    ner_prediction_dir: saves/pruner/gsap/output
    train_file: ent_pred_train.json

train_params:
  learning_rate: 1e-5
  num_train_epochs: 8
  per_gpu_train_batch_size: 21
  fp16: true
  evaluate_during_training: true
  log_wandb: true
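
The comment on `sampling_temperature` is consistent with sampling each dataset with probability proportional to `size ** (1 / temperature)`. That formula is an assumption inferred from the comment, not taken from the code, and the dataset sizes below are made-up numbers. Under that assumption the sampling probabilities could be computed as follows:

```python
from typing import Dict


def sampling_probabilities(sizes: Dict[str, int], temperature: float) -> Dict[str, float]:
    """Per-dataset sampling probabilities, assuming p_i is proportional to n_i ** (1 / T)."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Limit case: all probability mass goes to the largest dataset.
        largest = max(sizes, key=lambda name: sizes[name])
        return {name: 1.0 if name == largest else 0.0 for name in sizes}
    # Divide by the largest size first so the powers stay in (0, 1].
    biggest = max(sizes.values())
    weights = {name: (n / biggest) ** (1.0 / temperature) for name, n in sizes.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical training-set sizes, for illustration only.
sizes = {"scier": 200, "scinlp": 100, "gsap": 700}

proportional = sampling_probabilities(sizes, temperature=1.0)
sharpened = sampling_probabilities(sizes, temperature=0.8)
largest_only = sampling_probabilities(sizes, temperature=0.0)
print(proportional, sharpened, largest_only)
```

With this reading, a value of 0.8 shifts sampling slightly towards the largest dataset compared with purely proportional sampling.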

Inference

Full pipeline (pruner → HGERE)

CUDA_VISIBLE_DEVICES=0 uv run gsapere-pipeline \
    --config configs/inference/gsap-pipeline-best.yaml \
    --input input/ \
    --output output/

--input can be a .jsonl file or a directory of .jsonl files. Ready-to-use configs for all supported datasets are in configs/inference/.
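
A minimal sketch for preparing an input file follows. It assumes that each line of a .jsonl file holds one document with the same `doc_key` and `sentences` fields as the request body in the REST API example; that file layout is not confirmed here, and the helper names `write_jsonl` and `read_jsonl` are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def write_jsonl(documents: List[Dict[str, Any]], path: Path) -> None:
    """Write one JSON document per line."""
    with path.open("w", encoding="utf-8") as handle:
        for doc in documents:
            handle.write(json.dumps(doc, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Read a JSON Lines file back into a list, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Assumed document layout: pre-tokenised sentences under "sentences".
documents = [{"doc_key": "doc1", "sentences": [["We", "train", "BERT", "."]]}]

with tempfile.TemporaryDirectory() as tmp:
    input_file = Path(tmp) / "input" / "doc1.jsonl"
    input_file.parent.mkdir()
    write_jsonl(documents, input_file)
    loaded = read_jsonl(input_file)

print(loaded)
```

See the ready-to-use configs and the dataset files for the authoritative field names before relying on this layout.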

The pipeline config combines pruner and HGERE settings in a single YAML file:

label_set: gsap

pruner:
  model_dir: saves/pruner/gsap/best
  base_model_name_or_path: pretrained_models/scibert_scivocab_uncased
  model_type: bertspanmarkerpruner
  max_seq_length: 256
  per_gpu_eval_batch_size: 32
  final_pruning:
    method: threshold
    threshold: 0.0005

hgere:
  model_dir: saves/hgere/gsap/best
  base_model_name_or_path: pretrained_models/scibert_scivocab_uncased
  model_type: hyper
  max_seq_length: 512
  per_gpu_eval_batch_size: 32
  n_iter: 3
  layernorm: true
  attn_self: true
  pre_filter_params:
    method: threshold
    value: 0.0125

Docker API

The pipeline can be served as a REST API. Build and run with Docker (requires --gpus all):

docker build -t gsapere-api .

docker run --gpus all \
    -v /path/to/models:/app/models \
    -v /path/to/config.yaml:/app/config.yaml \
    -e PIPELINE_CONFIG=/app/config.yaml \
    -p 8000:8000 \
    gsapere-api

Models and the pipeline config are mounted at runtime — the image itself contains only the code.

Endpoints:

Method  Path      Description
GET     /health   Liveness check
POST    /predict  Run the pipeline on a batch of documents
Example request:

curl -X POST http://localhost:8000/predict \
    -H "Content-Type: application/json" \
    -d '{"documents": [{"doc_key": "doc1", "sentences": [["We", "train", "BERT", "."]]}]}'
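
The same request can be sent from Python using only the standard library. The sketch below mirrors the curl call above; the helper names `build_predict_request` and `predict` are hypothetical, and the response is returned as parsed JSON without assuming any particular schema, since the response format is not described here.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_predict_request(
    documents: List[Dict[str, Any]],
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build the POST /predict request without sending it."""
    payload = json.dumps({"documents": documents}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(
    documents: List[Dict[str, Any]],
    base_url: str = "http://localhost:8000",
    timeout: float = 60.0,
) -> Any:
    """Send the request to a running server and return the decoded JSON body."""
    request = build_predict_request(documents, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_predict_request(
    [{"doc_key": "doc1", "sentences": [["We", "train", "BERT", "."]]}]
)
print(request.get_method(), request.full_url)
```

Calling `predict(...)` requires the container from the previous step to be running and listening on port 8000.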

CLI reference

Command                                   Description
gsapere-train-pruner                      Train the span pruner
gsapere-train-hgere                       Train the HGERE ERE model
gsapere-pipeline                          Run the full two-stage pipeline on new documents
gsapere-download-dataset                  Download supported datasets
gsapere-tune-pruner                       Threshold sweep and optimisation for the pruner
gsapere-fit-rulebased-pruner              Fit a rule-based pruner baseline
infer-fixed-spans                         Run HGERE on fixed (gold) spans
infer-pruner-augmented                    Run HGERE on pruner-predicted spans
gsap-ere-benchmark-pipeline               Benchmark pipeline throughput
gsapere-fix-gold-annos                    Add gold annotations to prediction files
gsapere-analysis-ner-length-distribution  Analyse entity length distributions
gsapere-generate-pruner-docs              Regenerate parameter docs in documentation/api/

Development

uv run pytest                          # run tests
uv run ruff format src/ tests/         # format
uv run ruff check src/ tests/          # lint

Building and publishing

uv build                               # produces dist/ wheel + sdist
bash publish.sh                        # build + upload to PyPI (requires .pypi token file)

Citation

Please cite this work and the original HGERE:

@article{Otto2026GSAP-ERE,
  title   = {{GSAP-ERE}: Fine-Grained Scholarly Entity and Relation Extraction Focused on Machine Learning},
  author  = {Otto, Wolfgang and Gan, Lu and Upadhyaya, Sharmila and Karmakar, Saurav and Dietze, Stefan},
  journal = {Proceedings of the AAAI Conference on Artificial Intelligence},
  volume  = {40},
  number  = {38},
  pages   = {32600--32609},
  year    = {2026},
  month   = {Mar.},
  doi     = {10.1609/aaai.v40i38.40537},
  url     = {https://ojs.aaai.org/index.php/AAAI/article/view/40537},
}

@misc{yan2023joint,
  title         = {Joint Entity and Relation Extraction with Span Pruning and Hypergraph Neural Networks},
  author        = {Zhaohui Yan and Songlin Yang and Wei Liu and Kewei Tu},
  year          = {2023},
  eprint        = {2310.17238},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}

License

MIT — see LICENSE.
