Extract references from legal documents
Legal Reference Extraction
Extract citations from German legal documents — law references (§ 433 BGB)
and case references (BGH, VIII ZR 295/01).
Used by de.openlegaldata.io.
Supported Python versions: 3.11, 3.12, 3.13 (tested on every CI run).
Install
pip install legal-reference-extraction
# or from git
pip install git+https://github.com/openlegaldata/legal-reference-extraction.git
# local dev
make install
Usage
from refex.orchestrator import CitationExtractor
extractor = CitationExtractor()
result = extractor.extract("Die Entscheidung beruht auf § 42 VwGO.")
for cit in result.citations:
    print(cit.type, cit.span.text)
# law § 42 VwGO
Input formats
Plain text, HTML, and Markdown are supported. The format is auto-detected by default, or it can be set explicitly with fmt:
# HTML — tags are stripped, entities decoded, spans map to plain text
result = extractor.extract("<p>Gemäß § 433 BGB ist der Käufer verpflichtet.</p>", fmt="html")
# Markdown — formatting markers stripped
result = extractor.extract("Gemäß **§ 433 BGB** ist der Käufer verpflichtet.", fmt="markdown")
# Auto-detect (based on content sniffing)
result = extractor.extract(html_content)
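How the auto-detection decides is not spelled out here. As a rough illustration only, a content sniffer could check for HTML tags first and Markdown markers second. The helper `sniff_format` and the return value "plain" are assumptions made for this sketch and are not part of refex.

```python
import re

# Hypothetical heuristics for illustration; refex's real detection may differ.
_HTML_TAG = re.compile(r"</?[a-zA-Z][a-zA-Z0-9]*(\s[^<>]*)?/?>")
_MD_MARKER = re.compile(
    r"(\*\*[^*\n]+\*\*|^#{1,6}\s|\[[^\]\n]+\]\([^)\n]+\))", re.MULTILINE
)


def sniff_format(text: str) -> str:
    """Guess 'html', 'markdown', or 'plain' from simple surface cues."""
    if _HTML_TAG.search(text):
        return "html"
    if _MD_MARKER.search(text):
        return "markdown"
    return "plain"


print(sniff_format("<p>Gemäß § 433 BGB ist der Käufer verpflichtet.</p>"))  # html
print(sniff_format("Gemäß **§ 433 BGB** ist der Käufer verpflichtet."))     # markdown
print(sniff_format("Die Entscheidung beruht auf § 42 VwGO."))               # plain
```

Heuristics like these can misfire on short or mixed input, which is one reason to pass fmt explicitly when the format is known.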
For HTML and Markdown input, span offsets reference the canonical plain-text
projection. Use map_span_to_raw to recover positions in the original:
from refex.document import Document, map_span_to_raw
doc = Document(raw="<p>§ 433 BGB</p>", format="html")
result = extractor.extract(doc)
for cit in result.citations:
    raw_span = map_span_to_raw(cit.span, doc)
    print(f"{cit.span.text} → raw[{raw_span.start}:{raw_span.end}]")
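To show why a mapping step is needed at all, the sketch below strips tags while recording, for every character of the plain-text projection, its index in the raw string. It is a standalone illustration using only the standard library; `project_with_offsets` and `to_raw_span` are hypothetical names, entity decoding is ignored, and refex's own map_span_to_raw may work differently.

```python
import re

_TAG = re.compile(r"<[^>]+>")


def project_with_offsets(raw: str) -> tuple[str, list[int]]:
    """Strip tags; offsets[i] is the index in raw of plain_text[i]."""
    plain_chars: list[str] = []
    offsets: list[int] = []
    pos = 0
    for match in _TAG.finditer(raw):
        # Copy the text between the previous tag and this one.
        for i in range(pos, match.start()):
            plain_chars.append(raw[i])
            offsets.append(i)
        pos = match.end()
    # Copy whatever follows the last tag.
    for i in range(pos, len(raw)):
        plain_chars.append(raw[i])
        offsets.append(i)
    return "".join(plain_chars), offsets


def to_raw_span(start: int, end: int, offsets: list[int]) -> tuple[int, int]:
    """Map a non-empty half-open plain-text span [start, end) to raw offsets."""
    return offsets[start], offsets[end - 1] + 1


raw = "<p>§ 433 BGB</p>"
plain, offsets = project_with_offsets(raw)
raw_start, raw_end = to_raw_span(0, len(plain), offsets)
print(plain, raw_start, raw_end)  # § 433 BGB 3 12
```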
Output formats
from refex.serializers import to_jsonl, to_hf_bio, to_gliner, to_spacy_doc, to_web_annotation, to_akn_ref
to_jsonl(result, doc_id="example") # JSONL (primary format)
to_hf_bio(result, text) # HuggingFace BIO tags
to_gliner(result) # GLiNER span format
to_spacy_doc(result, text) # spaCy Doc dict
to_web_annotation(result) # W3C Web Annotation
to_akn_ref(result, text) # Akoma Ntoso XML
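For readers unfamiliar with BIO tagging, the following standalone sketch shows the general idea: each token is labelled B- (begins a span), I- (inside a span), or O (outside). It assumes whitespace tokenization and label names such as B-LAW, both chosen for illustration; the tokenizer and label set used by to_hf_bio may differ.

```python
import re


def bio_tags(text: str, spans: list[tuple[int, int, str]]) -> list[tuple[str, str]]:
    """Tag whitespace tokens with B-/I-/O labels from (start, end, label) spans."""
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span when the two character ranges overlap.
            if match.start() < end and match.end() > start:
                prefix = "B" if match.start() <= start else "I"
                tag = f"{prefix}-{label.upper()}"
                break
        tagged.append((match.group(), tag))
    return tagged


text = "Die Entscheidung beruht auf § 42 VwGO."
start = text.index("§ 42 VwGO")
tags = bio_tags(text, [(start, start + len("§ 42 VwGO"), "law")])
for token, tag in tags:
    print(token, tag)
```

With whitespace tokenization the final token is "VwGO." including the full stop, which is why real pipelines usually align spans to subword tokens instead.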
Examples
Law references — § and §§ patterns with section numbers and law book codes:
result = extractor.extract(
    "Bar und bar §§ 1, 2 Abs. 2, 3, 10 Abs. 1 Nr. 1 BGB foo."
)
for cit in result.citations:
    print(cit.book, cit.number)
# bgb 1
# bgb 2
# bgb 3
# bgb 10
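The expansion of a §§ enumeration into one reference per section can be approximated with two regular expressions: one to capture the enumeration and the trailing law book code, and one to pull the leading section number out of each comma-separated part. This is a simplified sketch and not the extractor's actual grammar; `expand_law_refs` and the short qualifier list are assumptions.

```python
import re

# Words that follow a number but are not law book codes (assumed, incomplete).
_QUALIFIERS = r"(?:Abs|Nr|Satz)\b"

_REF = re.compile(
    r"§§?\s*(?P<body>\d[^§\n]*?)\s+"
    r"(?P<book>(?!" + _QUALIFIERS + r")[A-Z][A-Za-z]+)\b"
)
_SECTION = re.compile(r"\s*(\d+[a-z]?)")


def expand_law_refs(text: str) -> list[tuple[str, str]]:
    """Return (book, section) pairs, one per section in each § / §§ reference."""
    refs = []
    for match in _REF.finditer(text):
        book = match.group("book").lower()
        for part in match.group("body").split(","):
            section = _SECTION.match(part)
            if section:
                refs.append((book, section.group(1)))
    return refs


print(expand_law_refs("Bar und bar §§ 1, 2 Abs. 2, 3, 10 Abs. 1 Nr. 1 BGB foo."))
# [('bgb', '1'), ('bgb', '2'), ('bgb', '3'), ('bgb', '10')]
```

The sketch keeps only the section number and discards the Abs. and Nr. qualifiers; it also treats every comma-separated number as a section, which mirrors the output listed above.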
Cross-references — i.V.m. (in conjunction with) linking sections across law books:
result = extractor.extract(
    "Die vorläufige Vollstreckbarkeit folgt aus "
    "§ 167 VwGO i.V.m. §§ 708 Nr. 11, 711 ZPO."
)
for cit in result.citations:
    print(cit.book, cit.number)
# vwgo 167
# zpo 708
# zpo 711
Case references — court names and file numbers:
result = extractor.extract(
    "Das OVG Schleswig habe bereits in seinem Urteil vom 22.04.2010 "
    "(1 KN 19/09) entschieden."
)
for cit in result.citations:
    print(cit.court, cit.file_number)
# OVG Schleswig 1 KN 19/09
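The file numbers in this README (1 KN 19/09, VIII ZR 295/01) share a common shape: a senate or chamber numeral, a short register sign, and a running number with a two-digit year. The pattern below captures only that shape, as a rough sketch; real German file numbers are more varied, and `find_file_numbers` is a hypothetical helper, not the matcher refex uses.

```python
import re

_FILE_NUMBER = re.compile(
    r"\b(?:[IVXLC]+|\d+)\s+"   # senate or chamber: Roman or Arabic numeral
    r"[A-Z][A-Za-z]{0,3}\s+"   # register sign such as KN or ZR
    r"\d+/\d{2}\b"             # running number / two-digit year
)


def find_file_numbers(text: str) -> list[str]:
    """Return every substring that looks like '<numeral> <sign> <no>/<yy>'."""
    return [match.group() for match in _FILE_NUMBER.finditer(text)]


print(find_file_numbers(
    "Das OVG Schleswig habe bereits in seinem Urteil vom 22.04.2010 "
    "(1 KN 19/09) entschieden."
))
# ['1 KN 19/09']
```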
Artikel / Grundgesetz — Art. references are supported:
result = extractor.extract("Gemäß Art. 12 Abs. 1 GG besteht Berufsfreiheit.")
Law book context — extract bare § references within a specific law:
extractor = CitationExtractor()
# ... set law_book_context on the underlying engine if needed
Legacy API
The old RefExtractor API is still available but deprecated:
from refex.extractor import RefExtractor
extractor = RefExtractor()
content, markers = extractor.extract("Ein Satz mit § 3b AsylG.")
# Note: content no longer contains [ref=UUID] markers (deprecated in v0.7.0)
Development
make install # create venv + install in editable mode with dev deps
make test # run pytest (271 tests)
make lint # ruff check + format check
make format # auto-fix lint + format
Benchmark
Run the extraction benchmark against gold-annotated German legal documents:
make bench-ci # vendored CI subset (15 docs, no external data needed)
make bench-dev # full validation split (821 docs)
make bench-quick # quick check (50 docs on validation)
make bench-validate # dataset integrity checks
make diagnose # error analysis
Current metrics (validation split, 821 docs):
| Metric | Value |
|---|---|
| Span F1 (exact) | 0.734 |
| Case F1 (exact) | 0.613 |
| Law F1 (exact) | 0.797 |
| Throughput | 418 docs/s |
See benchmarks/README.md for details.
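As a reading aid for the table, exact-match span F1 is commonly computed by counting a prediction as correct only when its start, end, and type all equal a gold span. The sketch below follows that common definition; whether the benchmark averages per document or over the whole split is not stated here, so treat `span_f1` as an assumption rather than the harness's scoring code.

```python
Span = tuple[int, int, str]  # (start, end, type)


def span_f1(gold: set[Span], predicted: set[Span]) -> dict[str, float]:
    """Precision, recall, and F1 under exact span-and-type matching."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = {(0, 9, "law"), (20, 30, "case"), (40, 50, "law")}
predicted = {(0, 9, "law"), (20, 31, "case")}  # second span is off by one
scores = span_f1(gold, predicted)
print(scores)
```

In this toy example one of two predictions is exact (precision 0.5) and one of three gold spans is found (recall 1/3), giving an F1 of 0.4; the off-by-one span earns no partial credit.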
Optional extras
The base install has zero runtime dependencies. Inference engines and format adapters live in opt-in extras — pick the ones you need:
pip install "legal-reference-extraction[adapters]" # spaCy adapter for to_spacy_doc
pip install "legal-reference-extraction[crf]" # CRF engine (~30 MB, sklearn-crfsuite)
pip install "legal-reference-extraction[transformers]" # transformer engine (~2 GB, transformers + torch)
pip install "legal-reference-extraction[training]" # fine-tuning utilities (wandb, seqeval, datasets, accelerate)
Most users pick exactly one inference engine ([crf] or
[transformers]). [training] is only needed when fine-tuning a
transformer via scripts/train_transformer.py.
The default transformer model is
openlegaldata/legal-reference-extraction-base-de
(a fine-tune of EuroBERT/EuroBERT-210m, CC BY-NC 4.0) — so
TransformerExtractor() with no arguments downloads and uses it
automatically. Override via TransformerExtractor(model="...").
The benchmark harness also honours REFEX_TRANSFORMER_MODEL /
REFEX_TRANSFORMER_DEVICE env vars for quick A/B runs.
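A harness that honours these variables would typically fall back to defaults when they are unset. The sketch below illustrates that pattern only: `resolve_transformer_settings` is a hypothetical helper, and the "cpu" fallback for the device is an assumption, since the actual default is not stated here.

```python
import os
from collections.abc import Mapping

# Default model name as given in this README.
DEFAULT_MODEL = "openlegaldata/legal-reference-extraction-base-de"


def resolve_transformer_settings(
    env: Mapping[str, str] = os.environ,
) -> tuple[str, str]:
    """Read model and device overrides from the environment."""
    model = env.get("REFEX_TRANSFORMER_MODEL", DEFAULT_MODEL)
    device = env.get("REFEX_TRANSFORMER_DEVICE", "cpu")  # assumed fallback
    return model, device


# Simulate an A/B run without touching the real environment.
print(resolve_transformer_settings({"REFEX_TRANSFORMER_DEVICE": "cuda"}))
# ('openlegaldata/legal-reference-extraction-base-de', 'cuda')
```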
License
MIT