Entity and Relation Extraction on scientific text using HGERE with a span-pruning stage.
gsapere — Entity and Relation Extraction for Scientific Text
A fork of HGERE adapted for scientific text, with a two-stage pipeline for joint entity and relation extraction (ERE).
Paper under review. Configs used for our experiments are in configs/.
The pipeline consists of:
- Rule-based pre-filter (optional) — removes deterministically non-entity spans (punctuation, function-word sequences, etc.) before the neural pruner sees training data, reducing trivial negatives and speeding up training
- Span Pruner — a binary classifier that scores remaining candidate n-grams and filters them to a manageable set (target: ≥ 98 % entity recall)
- HGERE — a Hypergraph GNN that jointly predicts entity types and relations on the pruned candidates
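To make the first two stages concrete, the following is a minimal sketch of candidate n-gram generation followed by one simple rule. The function names, the `max_width` parameter, and the punctuation-only rule are illustrative assumptions, not gsapere's actual API or rule set.

```python
import string
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices


def enumerate_spans(tokens: List[str], max_width: int) -> List[Span]:
    """List every n-gram span of up to max_width tokens."""
    return [
        (start, end)
        for start in range(len(tokens))
        for end in range(start, min(start + max_width, len(tokens)))
    ]


def is_punctuation_only(tokens: List[str], span: Span) -> bool:
    """Hypothetical pre-filter rule: every token in the span is punctuation."""
    start, end = span
    return all(
        all(ch in string.punctuation for ch in tok)
        for tok in tokens[start : end + 1]
    )


tokens = ["We", "train", "BERT", "."]
candidates = enumerate_spans(tokens, max_width=2)
kept = [s for s in candidates if not is_punctuation_only(tokens, s)]
print(len(candidates), len(kept))
```

The number of candidates grows roughly with sentence length times the maximum span width, which is why the candidate set is reduced before HGERE runs on it.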
Supported datasets: GSAP-ERE, SciER, SciNLP, SciERC
Changes from the original
- Large-scale code restructuring: Pydantic-first configs, typed signatures throughout, proper package layout under src/
- All dependencies updated to current versions
- The transformers package is no longer hardcoded; any compatible HuggingFace transformers version works
- Added rule-based pre-filter, span pruner stage, multi-dataset joint training, and full CLI entry points
- Tests for all major components
Requirements
- Python 3.9 (tested; <3.11 required by some dependencies)
- CUDA 12.8 (adjust pyproject.toml for other CUDA versions)
- A GPU with at least ~24 GB VRAM for default batch sizes (tested on A40 / 40 GB)
Installation
Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
Clone the repository and install:
git clone <repo-url>
cd HGERE
uv sync
Datasets
Datasets are loaded from their original sources via the download command:
uv run gsapere-download-dataset --list # list available datasets
uv run gsapere-download-dataset gsap-ere
uv run gsapere-download-dataset scier
uv run gsapere-download-dataset scinlp
uv run gsapere-download-dataset scierc
uv run gsapere-download-dataset --all # download everything
See documentation/download-dataset.md for split details and manual download fallbacks.
GSAP-ERE
Fine-grained entity and relation extraction focused on machine learning — 100 annotated full-text ML publications, 63K entities, 35K relations, 10 entity types, 18 relation types. DOI: https://doi.org/10.60914/c4c1d-s0587
Otto et al., "GSAP-ERE: Fine-Grained Scholarly Entity and Relation Extraction Focused on Machine Learning", AAAI 2026. https://ojs.aaai.org/index.php/AAAI/article/view/40537
SciER
Entity and relation extraction dataset for datasets, methods, and tasks in scientific documents — 106 annotated full-text papers, 24k entities, 12k relations.
Zhang et al., "SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents", EMNLP 2024. https://aclanthology.org/2024.emnlp-main.726/
SciNLP
Full-text entity and relation extraction benchmark for the NLP domain — 60 annotated ACL papers, 6,409 entities, 1,648 relations.
"SciNLP: A Domain-Specific Benchmark for Full-Text Scientific Entity and Relation Extraction in NLP", EMNLP 2025. https://aclanthology.org/2025.emnlp-main.732/
SciERC
Scientific information extraction benchmark — 500 annotated AI abstracts, 6 entity types, 7 relation types.
Luan et al., "Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction", EMNLP 2018. https://aclanthology.org/D18-1360/
Training
Training has two main stages: first train the pruner, then train HGERE on the pruner's output. An optional preliminary step fits the rule-based pre-filter used during pruner training.
Step 1 — Fit the rule-based pre-filter (optional)
uv run gsapere-fit-rulebased-pruner configs/train/gsap/fit_rulebased_pruner.yaml
This fits token n-gram patterns from the training data that deterministically exclude non-entity spans. The saved JSON file is referenced in the pruner training config to speed up training.
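As a rough illustration of what fitting such patterns could look like, the sketch below collects lowercased n-grams that occur at least `min_count` times in the training sentences and never coincide with a gold entity span. The function name, the exact-match criterion, and both parameters are assumptions made for this example; the actual fitting logic and JSON layout in gsapere may differ.

```python
from collections import Counter
from typing import List, Set, Tuple

Sentence = List[str]
Span = Tuple[int, int]  # inclusive (start, end) token indices


def fit_exclusion_patterns(
    sentences: List[Sentence],
    gold_spans: List[Set[Span]],
    max_width: int = 2,
    min_count: int = 2,
) -> Set[Tuple[str, ...]]:
    """Return lowercased n-grams seen >= min_count times and never as an entity."""
    seen: Counter = Counter()
    entity_ngrams: Set[Tuple[str, ...]] = set()
    for tokens, gold in zip(sentences, gold_spans):
        for start in range(len(tokens)):
            for end in range(start, min(start + max_width, len(tokens))):
                ngram = tuple(t.lower() for t in tokens[start : end + 1])
                seen[ngram] += 1
                if (start, end) in gold:
                    entity_ngrams.add(ngram)
    return {
        ngram
        for ngram, count in seen.items()
        if count >= min_count and ngram not in entity_ngrams
    }


# Toy training data: the entity spans are "BERT" and the dataset name.
sentences = [
    ["We", "train", "BERT", "on", "SQuAD", "."],
    ["We", "evaluate", "BERT", "on", "GLUE", "."],
]
gold_spans = [{(2, 2), (4, 4)}, {(2, 2), (4, 4)}]
patterns = fit_exclusion_patterns(sentences, gold_spans)
print(sorted(patterns))
```

Requiring a minimum count keeps rare n-grams out of the exclusion set, so a span is only dropped when the training data gives repeated evidence that it is never an entity.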
Step 2 — Train the span pruner
uv run gsapere-train-pruner configs/train/gsap/train_gsap_pruner.yaml
After training, run pruner inference on train/dev/test to produce the enriched input files for HGERE (see scripts/pruner/).
Step 3 — Train HGERE (single dataset)
uv run gsapere-train-hgere configs/train/gsap/train_gsap_hgere.yaml
Example config:
schema_version: "1.0"
label_set: gsap
model_dir: saves/hgere/gsap
base_model_name_or_path: pretrained_models/scibert_scivocab_uncased
ner_prediction_dir: saves/pruner/gsap/output
max_seq_length: 512
n_iter: 3
layernorm: true
attn_self: true
train_params:
  learning_rate: 1e-5
  num_train_epochs: 8
  per_gpu_train_batch_size: 21
  fp16: true
  evaluate_during_training: true
  eval_epochs: 1
  loss_re_weight_alpha: 0.9
  log_wandb: true
Step 3 (alt) — Train HGERE on multiple datasets jointly
Multi-dataset mode trains a shared encoder with per-dataset NER and relation heads. Each dataset must have its own pruner output directory.
uv run gsapere-train-hgere configs/multi-sciere-scinlp-gsap-ere/train/hgere/train_multi.yaml
Example config:
schema_version: "1.0"
model_dir: saves/multi/hgere/run1
base_model_name_or_path: pretrained_models/scibert_scivocab_uncased
max_seq_length: 512
n_iter: 3
layernorm: true
attn_self: true
sampling_temperature: 0.8 # 0 = always largest dataset, 1 = proportional to size
seeds: [42, 43, 44] # run once per seed; _seed<n> appended to model_dir
datasets:
  - label_set: scier
    ner_prediction_dir: saves/pruner/scier/output
    train_file: ent_pred_train.json
    dev_file: ent_pred_dev.json
    test_file: ent_pred_test.json
  - label_set: scinlp
    ner_prediction_dir: saves/pruner/scinlp/output
    train_file: ent_pred_train.json
    dev_file: ent_pred_dev.json # omit (null) to skip dev evaluation for this dataset
  - label_set: gsap
    ner_prediction_dir: saves/pruner/gsap/output
    train_file: ent_pred_train.json
train_params:
  learning_rate: 1e-5
  num_train_epochs: 8
  per_gpu_train_batch_size: 21
  fp16: true
  evaluate_during_training: true
  log_wandb: true
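The effect of sampling_temperature can be illustrated with a small sketch. It assumes the rule p_i ∝ size_i^(1/T), which is one formula consistent with the inline comment (T = 1 gives size-proportional sampling, T = 0 always picks the largest dataset); the formula gsapere actually uses is not stated here and may differ. The function name and the sizes are illustrative.

```python
from typing import Dict


def sampling_probabilities(sizes: Dict[str, int], temperature: float) -> Dict[str, float]:
    """Assumed rule: p_i proportional to size_i ** (1 / temperature)."""
    largest = max(sizes.values())
    if temperature == 0:
        # Limit case: all probability mass on the largest dataset(s).
        winners = [name for name, n in sizes.items() if n == largest]
        return {name: (1.0 / len(winners) if name in winners else 0.0) for name in sizes}
    # Divide by the largest size first so small temperatures cannot overflow.
    weights = {name: (n / largest) ** (1.0 / temperature) for name, n in sizes.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Illustrative sizes only: the document counts quoted in the Datasets section.
sizes = {"scier": 106, "scinlp": 60, "gsap": 100}
for t in (0.0, 0.8, 1.0):
    probs = sampling_probabilities(sizes, t)
    print(t, {name: round(p, 3) for name, p in probs.items()})
```

Under this assumed rule, values between 0 and 1 shift probability toward the larger datasets relative to purely proportional sampling.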
Inference
Full pipeline (pruner → HGERE)
CUDA_VISIBLE_DEVICES=0 uv run gsapere-pipeline \
--config configs/inference/gsap-pipeline-best.yaml \
--input input/ \
--output output/
--input can be a .jsonl file or a directory of .jsonl files.
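A small sketch of preparing such an input file follows. It assumes each JSONL line is one document with the same doc_key and sentences fields used in the REST API example in this README; the documented input schema should be checked before relying on this. The helper name and the second document are made up for illustration.

```python
import json
from typing import Dict, List


def to_jsonl(docs: List[Dict]) -> str:
    """Serialize one JSON object per line."""
    return "".join(json.dumps(doc, ensure_ascii=False) + "\n" for doc in docs)


docs = [
    {"doc_key": "doc1", "sentences": [["We", "train", "BERT", "."]]},
    {"doc_key": "doc2", "sentences": [["SciBERT", "is", "the", "encoder", "."]]},
]
payload = to_jsonl(docs)
# To save it for the pipeline, write the string to a file under the input path, e.g.
#     pathlib.Path("input/docs.jsonl").write_text(payload, encoding="utf-8")
print(payload, end="")
```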
Ready-to-use configs for all supported datasets are in configs/inference/.
The pipeline config combines pruner and HGERE settings in a single YAML file:
label_set: gsap
pruner:
  model_dir: saves/pruner/gsap/best
  base_model_name_or_path: pretrained_models/scibert_scivocab_uncased
  model_type: bertspanmarkerpruner
  max_seq_length: 256
  per_gpu_eval_batch_size: 32
  final_pruning:
    method: threshold
    threshold: 0.0005
hgere:
  model_dir: saves/hgere/gsap/best
  base_model_name_or_path: pretrained_models/scibert_scivocab_uncased
  model_type: hyper
  max_seq_length: 512
  per_gpu_eval_batch_size: 32
  n_iter: 3
  layernorm: true
  attn_self: true
  pre_filter_params:
    method: threshold
    value: 0.0125
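The threshold method in the config above can be pictured with a minimal sketch: spans whose pruner score reaches the threshold are kept, and entity recall measures how many gold spans survive. The function names, the `>=` comparison, and the scores are assumptions for illustration, not the pipeline's actual implementation.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices


def prune_by_threshold(scores: Dict[Span, float], threshold: float) -> Set[Span]:
    """Keep spans whose score reaches the threshold (>= is an assumption)."""
    return {span for span, score in scores.items() if score >= threshold}


def entity_recall(kept: Set[Span], gold: Set[Span]) -> float:
    """Fraction of gold entity spans that survive pruning."""
    return len(kept & gold) / len(gold) if gold else 1.0


# Made-up pruner scores for the sentence "We train BERT ."
scores = {(0, 0): 0.0001, (1, 1): 0.0003, (2, 2): 0.97, (1, 2): 0.002, (3, 3): 0.00001}
gold = {(2, 2)}
kept = prune_by_threshold(scores, threshold=0.0005)
print(sorted(kept), entity_recall(kept, gold))
```

A very low threshold keeps many false candidates but rarely drops a true entity, which fits the high-recall role the pruner plays ahead of HGERE.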
Docker API
The pipeline can be served as a REST API. Build and run with Docker (requires --gpus all):
docker build -t gsapere-api .
docker run --gpus all \
-v /path/to/models:/app/models \
-v /path/to/config.yaml:/app/config.yaml \
-e PIPELINE_CONFIG=/app/config.yaml \
-p 8000:8000 \
gsapere-api
Models and the pipeline config are mounted at runtime — the image itself contains only the code.
Endpoints:
| Method | Path | Description |
|---|---|---|
| GET | /health | Liveness check |
| POST | /predict | Run the pipeline on a batch of documents |
Example request:
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"documents": [{"doc_key": "doc1", "sentences": [["We", "train", "BERT", "."]]}]}'
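The same request can be built from Python with only the standard library. This sketch mirrors the curl call above; the helper name and the default base URL are assumptions, and the response format is not shown because it is not documented here.

```python
import json
import urllib.request
from typing import Dict, List


def build_predict_request(
    documents: List[Dict], base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build a POST /predict request with a JSON body."""
    body = json.dumps({"documents": documents}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request(
    [{"doc_key": "doc1", "sentences": [["We", "train", "BERT", "."]]}]
)
# With the container running, send it and decode the reply:
#     with urllib.request.urlopen(request) as response:
#         result = json.loads(response.read())
print(request.get_method(), request.full_url)
```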
CLI reference
| Command | Description |
|---|---|
| gsapere-train-pruner | Train the span pruner |
| gsapere-train-hgere | Train the HGERE ERE model |
| gsapere-pipeline | Run the full two-stage pipeline on new documents |
| gsapere-download-dataset | Download supported datasets |
| gsapere-tune-pruner | Threshold sweep and optimisation for the pruner |
| gsapere-fit-rulebased-pruner | Fit a rule-based pruner baseline |
| infer-fixed-spans | Run HGERE on fixed (gold) spans |
| infer-pruner-augmented | Run HGERE on pruner-predicted spans |
| gsap-ere-benchmark-pipeline | Benchmark pipeline throughput |
| gsapere-fix-gold-annos | Add gold annotations to prediction files |
| gsapere-analysis-ner-length-distribution | Analyse entity length distributions |
| gsapere-generate-pruner-docs | Regenerate parameter docs in documentation/api/ |
Development
uv run pytest # run tests
uv run ruff format src/ tests/ # format
uv run ruff check src/ tests/ # lint
Building and publishing
uv build # produces dist/ wheel + sdist
bash publish.sh # build + upload to PyPI (requires .pypi token file)
Citation
Please cite this work and the original HGERE:
@article{Otto2026GSAP-ERE,
title = {{GSAP-ERE}: Fine-Grained Scholarly Entity and Relation Extraction Focused on Machine Learning},
author = {Otto, Wolfgang and Gan, Lu and Upadhyaya, Sharmila and Karmakar, Saurav and Dietze, Stefan},
journal = {Proceedings of the AAAI Conference on Artificial Intelligence},
volume = {40},
number = {38},
pages = {32600--32609},
year = {2026},
month = {Mar.},
doi = {10.1609/aaai.v40i38.40537},
url = {https://ojs.aaai.org/index.php/AAAI/article/view/40537},
}
@misc{yan2023joint,
title = {Joint Entity and Relation Extraction with Span Pruning and Hypergraph Neural Networks},
author = {Zhaohui Yan and Songlin Yang and Wei Liu and Kewei Tu},
year = {2023},
eprint = {2310.17238},
archivePrefix = {arXiv},
primaryClass = {cs.CL}
}
License
MIT — see LICENSE.