Extract references from legal documents

Maintainers

malteos openlegaldata

Project description

Legal Reference Extraction

Extract citations from German legal documents — law references (§ 433 BGB) and case references (BGH, VIII ZR 295/01).

Used by de.openlegaldata.io.

Supported Python versions: 3.11, 3.12, 3.13 (tested on every CI run).

Install

pip install legal-reference-extraction

# or from git
pip install git+https://github.com/openlegaldata/legal-reference-extraction.git

# local dev
make install

Usage

from refex.orchestrator import CitationExtractor

extractor = CitationExtractor()
result = extractor.extract("Die Entscheidung beruht auf § 42 VwGO.")

for cit in result.citations:
    print(cit.type, cit.span.text)
# law § 42 VwGO

Input formats

Plain text, HTML, and Markdown are supported. The format is auto-detected by default, or can be set explicitly with the fmt argument:

# HTML — tags are stripped, entities decoded, spans map to plain text
result = extractor.extract("<p>Gemäß &#167; 433 BGB ist der Käufer verpflichtet.</p>", fmt="html")

# Markdown — formatting markers stripped
result = extractor.extract("Gemäß **§ 433 BGB** ist der Käufer verpflichtet.", fmt="markdown")

# Auto-detect (based on content sniffing)
result = extractor.extract(html_content)
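
The detection rules themselves are not described here. As a rough illustration of what content sniffing can look like, the following standard-library sketch classifies a string by looking for tags, entities, or common Markdown markers. The name sniff_format and the regular expressions are hypothetical and are not taken from refex.

```python
import re

# Hypothetical heuristics, not the rules refex actually applies.
_HTML_HINT = re.compile(r"</?[a-zA-Z][^>]*>|&#?\w+;")
_MARKDOWN_HINT = re.compile(
    r"(\*\*|__)\S.*?\1"          # bold: **text** or __text__
    r"|^#{1,6}\s"                # ATX heading at line start
    r"|\[[^\]]+\]\([^)]+\)",     # inline link: [label](target)
    re.MULTILINE,
)


def sniff_format(text: str) -> str:
    """Guess "html", "markdown", or "plain" from the content alone."""
    if _HTML_HINT.search(text):
        return "html"
    if _MARKDOWN_HINT.search(text):
        return "markdown"
    return "plain"


print(sniff_format("<p>Gemäß &#167; 433 BGB ist der Käufer verpflichtet.</p>"))
print(sniff_format("Gemäß **§ 433 BGB** ist der Käufer verpflichtet."))
print(sniff_format("Die Entscheidung beruht auf § 42 VwGO."))
```

Heuristics of this kind can misfire on plain text that happens to contain an entity-like sequence, which is one reason to pass fmt explicitly when the format is known.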

For HTML and Markdown input, span offsets reference the canonical plain-text projection. Use map_span_to_raw to recover positions in the original:

from refex.document import Document, map_span_to_raw

doc = Document(raw="<p>§ 433 BGB</p>", format="html")
result = extractor.extract(doc)
for cit in result.citations:
    raw_span = map_span_to_raw(cit.span, doc)
    print(f"{cit.span.text} → raw[{raw_span.start}:{raw_span.end}]")
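
To make the idea of a plain-text projection concrete, here is a minimal standard-library sketch that strips tags, decodes entities, and records for every plain-text character where it came from in the raw input. It only illustrates the concept: project_html and to_raw are hypothetical names, and refex may build and store its offset map differently.

```python
import html
import re

# Tags are dropped and entities are decoded. Deliberately simplified.
_TOKEN = re.compile(r"<[^>]+>|&#?\w+;")


def project_html(raw: str) -> tuple[str, list[tuple[int, int]]]:
    """Return the plain text plus, per plain character, its (start, end) in raw."""
    chars: list[str] = []
    extents: list[tuple[int, int]] = []

    def copy_literal(begin: int, stop: int) -> None:
        for i in range(begin, stop):
            chars.append(raw[i])
            extents.append((i, i + 1))

    pos = 0
    for m in _TOKEN.finditer(raw):
        copy_literal(pos, m.start())
        if m.group().startswith("&"):
            # A decoded entity covers the whole raw entity.
            for ch in html.unescape(m.group()):
                chars.append(ch)
                extents.append((m.start(), m.end()))
        pos = m.end()
    copy_literal(pos, len(raw))
    return "".join(chars), extents


def to_raw(start: int, end: int, extents: list[tuple[int, int]]) -> tuple[int, int]:
    """Map an end-exclusive, non-empty plain-text span to raw offsets."""
    return extents[start][0], extents[end - 1][1]


raw = "<p>Gemäß &#167; 433 BGB ist der Käufer verpflichtet.</p>"
plain, extents = project_html(raw)
start = plain.index("§ 433 BGB")
raw_start, raw_end = to_raw(start, start + len("§ 433 BGB"), extents)
print(raw[raw_start:raw_end])  # &#167; 433 BGB
```

Storing a (start, end) pair per character keeps the mapping correct when one decoded character stands for a multi-character entity in the raw text.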

Output formats

from refex.serializers import to_jsonl, to_hf_bio, to_gliner, to_spacy_doc, to_web_annotation, to_akn_ref

to_jsonl(result, doc_id="example")      # JSONL (primary format)
to_hf_bio(result, text)                 # HuggingFace BIO tags
to_gliner(result)                       # GLiNER span format
to_spacy_doc(result, text)              # spaCy Doc dict
to_web_annotation(result)               # W3C Web Annotation
to_akn_ref(result, text)                # Akoma Ntoso XML
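
For readers unfamiliar with BIO tagging, the sketch below shows the general scheme on whitespace tokens: the first token of a citation gets a B- tag, following tokens get I- tags, and everything else is O. It is a standalone illustration using only the standard library. The function bio_tags and the label LAW are assumptions, and the actual output of to_hf_bio (tokenisation, label names, structure) may differ.

```python
import re


def bio_tags(text: str, spans: list[tuple[int, int, str]]) -> tuple[list[str], list[str]]:
    """Tag whitespace tokens with B-/I-/O given end-exclusive character spans."""
    tokens, tags = [], []
    for m in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            if m.start() < end and m.end() > start:  # token overlaps the span
                prefix = "B" if m.start() <= start else "I"
                tag = f"{prefix}-{label}"
                break
        tokens.append(m.group())
        tags.append(tag)
    return tokens, tags


text = "Die Entscheidung beruht auf § 42 VwGO."
start = text.index("§ 42 VwGO")
tokens, tags = bio_tags(text, [(start, start + len("§ 42 VwGO"), "LAW")])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Note that whitespace tokenisation leaves the trailing full stop attached to the last token; a real pipeline would typically use the tokenizer of the target model instead.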

Examples

Law references — § and §§ patterns with section numbers and law book codes:

result = extractor.extract(
    "Bar und bar §§ 1, 2 Abs. 2, 3, 10 Abs. 1 Nr. 1 BGB foo."
)
for cit in result.citations:
    print(cit.book, cit.number)
# bgb 1
# bgb 2
# bgb 3
# bgb 10
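
The expansion of one §§ enumeration into several citations can be sketched with two regular expressions. This is a simplified, hypothetical illustration of the technique, not the pattern set refex uses: it assumes the law book code starts and ends with an uppercase letter and that each comma-separated part begins with its section number.

```python
import re

# Hypothetical, simplified pattern: "§" or "§§", a body starting with a digit,
# then a law book code that starts and ends with an uppercase letter.
_LAW_REF = re.compile(
    r"§§?\s*(?P<body>\d[^§]*?)\s+(?P<book>[A-ZÄÖÜ][A-Za-zÄÖÜäöü]*[A-ZÄÖÜ])\b"
)
_SECTION = re.compile(r"\s*(\d+[a-z]?)")


def expand_law_refs(text: str) -> list[tuple[str, str]]:
    """Return one (book, section) pair per section in each enumeration."""
    refs = []
    for m in _LAW_REF.finditer(text):
        book = m.group("book").lower()
        for part in m.group("body").split(","):
            sec = _SECTION.match(part)
            if sec:  # the leading number of each part is the section
                refs.append((book, sec.group(1)))
    return refs


for book, number in expand_law_refs(
    "Bar und bar §§ 1, 2 Abs. 2, 3, 10 Abs. 1 Nr. 1 BGB foo."
):
    print(book, number)
```

Qualifiers such as Abs. and Nr. are skipped because they end in a lowercase letter and so cannot match the book pattern. The sketch does not stop at sentence boundaries, so it should not be mistaken for a production-grade matcher.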

Cross-references — i.V.m. (in conjunction with) linking sections across law books:

result = extractor.extract(
    "Die vorläufige Vollstreckbarkeit folgt aus "
    "§ 167 VwGO i.V.m. §§ 708 Nr. 11, 711 ZPO."
)
for cit in result.citations:
    print(cit.book, cit.number)
# vwgo 167
# zpo 708
# zpo 711

Case references — court names and file numbers:

result = extractor.extract(
    "Das OVG Schleswig habe bereits in seinem Urteil vom 22.04.2010 "
    "(1 KN 19/09) entschieden."
)
for cit in result.citations:
    print(cit.court, cit.file_number)
# OVG Schleswig 1 KN 19/09
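
File numbers like the one above follow a recognisable shape: a chamber number (Arabic or Roman), a short register code, and a running number with a two-digit year. The following sketch matches only that shape. The pattern is a hypothetical simplification for illustration; German file numbers come in many more variants, and refex is not claimed to use this expression.

```python
import re

# Hypothetical pattern for file numbers such as "1 KN 19/09" or "VIII ZR 295/01":
# chamber (Arabic or Roman numeral), register code, running number "/" two-digit year.
_FILE_NUMBER = re.compile(r"\b(?:\d+|[IVXLC]+)\s+[A-Z][A-Za-z]{0,3}\s+\d+/\d{2}\b")


def find_file_numbers(text: str) -> list[str]:
    """Return every substring that looks like a court file number."""
    return _FILE_NUMBER.findall(text)


print(find_file_numbers(
    "Das OVG Schleswig habe bereits in seinem Urteil vom 22.04.2010 "
    "(1 KN 19/09) entschieden."
))
print(find_file_numbers("BGH, VIII ZR 295/01"))
```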

Artikel / Grundgesetz — Art. references are supported:

result = extractor.extract("Gemäß Art. 12 Abs. 1 GG besteht Berufsfreiheit.")

Law book context — extract bare § references within a specific law:

extractor = CitationExtractor()
# ... set law_book_context on the underlying engine if needed

Legacy API

The old RefExtractor API is still available but deprecated:

from refex.extractor import RefExtractor

extractor = RefExtractor()
content, markers = extractor.extract("Ein Satz mit § 3b AsylG.")
# Note: content no longer contains [ref=UUID] markers (deprecated in v0.7.0)

Development

make install   # create venv + install in editable mode with dev deps
make test      # run pytest (271 tests)
make lint      # ruff check + format check
make format    # auto-fix lint + format

Benchmark

Run the extraction benchmark against gold-annotated German legal documents:

make bench-ci           # vendored CI subset (15 docs, no external data needed)
make bench-dev          # full validation split (821 docs)
make bench-quick        # quick check (50 docs on validation)
make bench-validate     # dataset integrity checks
make diagnose           # error analysis

Current metrics (validation split, 821 docs):

Metric	Value
Span F1 (exact)	0.734
Case F1 (exact)	0.613
Law F1 (exact)	0.797
Throughput	418 docs/s

See benchmarks/README.md for details.
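
As a reminder of what an exact-match span F1 measures, the sketch below counts a prediction as correct only if its start, end, and type all equal a gold annotation, then combines precision and recall. It is a generic illustration with made-up spans; the benchmark harness may aggregate across documents differently.

```python
Span = tuple[int, int, str]  # (start, end, type), end exclusive


def exact_span_f1(gold: set[Span], pred: set[Span]) -> float:
    """F1 over spans that match a gold span exactly."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up example: one exact hit, one near miss (end offset off by one),
# and one spurious prediction.
gold = {(28, 37, "law"), (50, 60, "case")}
pred = {(28, 37, "law"), (50, 59, "case"), (70, 80, "law")}
score = exact_span_f1(gold, pred)
print(round(score, 3))  # 0.4
```

Here precision is 1/3 and recall is 1/2, so F1 is 0.4. The near miss earns no credit, which is what makes the exact variant strict.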

Optional extras

The base install has zero runtime dependencies. Inference engines and format adapters live in opt-in extras — pick the ones you need:

pip install "legal-reference-extraction[adapters]"     # spaCy adapter for to_spacy_doc
pip install "legal-reference-extraction[crf]"          # CRF engine  (~30 MB, sklearn-crfsuite)
pip install "legal-reference-extraction[transformers]" # transformer engine (~2 GB, transformers + torch)
pip install "legal-reference-extraction[training]"     # fine-tuning utilities (wandb, seqeval, datasets, accelerate)

Most users pick exactly one inference engine ([crf] or [transformers]). [training] is only needed when fine-tuning a transformer via scripts/train_transformer.py.

The default transformer model is openlegaldata/legal-reference-extraction-base-de (a fine-tune of EuroBERT/EuroBERT-210m, CC BY-NC 4.0), so TransformerExtractor() with no arguments downloads and uses it automatically. To use a different model, pass TransformerExtractor(model="..."). The benchmark harness also honours the REFEX_TRANSFORMER_MODEL and REFEX_TRANSFORMER_DEVICE environment variables for quick A/B runs.

License

MIT

Release history

0.5.3 (this version): Jun 15, 2026
0.5.2: May 6, 2026
0.5.1: May 6, 2026
0.5.0: Apr 22, 2026
0.4.2: Feb 13, 2026

Download files

Source Distribution

legal_reference_extraction-0.5.3.tar.gz (90.4 kB)

Uploaded Jun 15, 2026 Source

Built Distribution

legal_reference_extraction-0.5.3-py3-none-any.whl (64.2 kB)

Uploaded Jun 15, 2026 Python 3

File details

Details for the file legal_reference_extraction-0.5.3.tar.gz.

File metadata

Download URL: legal_reference_extraction-0.5.3.tar.gz
Upload date: Jun 15, 2026
Size: 90.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for legal_reference_extraction-0.5.3.tar.gz
Algorithm	Hash digest
SHA256	`b3270e4e4c2bda1c5e10df46914d407d11d77164f9c6ca80815d571c2bc59c53`
MD5	`a215a2b82828ea1a01a51bee12b4fdfe`
BLAKE2b-256	`153cfaae47d534f017f3f6e76419a1ecd6c0a9ca521b9e52ff30dcd121005491`

Provenance

The following attestation bundles were made for legal_reference_extraction-0.5.3.tar.gz:

Publisher: publish.yml on openlegaldata/legal-reference-extraction

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: legal_reference_extraction-0.5.3.tar.gz
- Subject digest: b3270e4e4c2bda1c5e10df46914d407d11d77164f9c6ca80815d571c2bc59c53
- Sigstore transparency entry: 1827830546
- Sigstore integration time: Jun 15, 2026
Source repository:
- Permalink: openlegaldata/legal-reference-extraction@6b7e4399faf1b352d42b5f97955cd4250a01d4f6
- Branch / Tag: refs/heads/master
- Owner: https://github.com/openlegaldata
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6b7e4399faf1b352d42b5f97955cd4250a01d4f6
- Trigger Event: workflow_dispatch

File details

Details for the file legal_reference_extraction-0.5.3-py3-none-any.whl.

File metadata

Download URL: legal_reference_extraction-0.5.3-py3-none-any.whl
Upload date: Jun 15, 2026
Size: 64.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for legal_reference_extraction-0.5.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`16b80d6664dd2b23ce2fb8f71fce4ccdd840872d05a22d5c07cadf8708b80f1a`
MD5	`9fe5d6ddc34eefcbdb10f986abce3566`
BLAKE2b-256	`6a4c2eedec35243d036c9ec2b5369af3de4d68ff508f27ea8fbf273175d010f1`

Provenance

The following attestation bundles were made for legal_reference_extraction-0.5.3-py3-none-any.whl:

Publisher: publish.yml on openlegaldata/legal-reference-extraction

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: legal_reference_extraction-0.5.3-py3-none-any.whl
- Subject digest: 16b80d6664dd2b23ce2fb8f71fce4ccdd840872d05a22d5c07cadf8708b80f1a
- Sigstore transparency entry: 1827830671
- Sigstore integration time: Jun 15, 2026
Source repository:
- Permalink: openlegaldata/legal-reference-extraction@6b7e4399faf1b352d42b5f97955cd4250a01d4f6
- Branch / Tag: refs/heads/master
- Owner: https://github.com/openlegaldata
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6b7e4399faf1b352d42b5f97955cd4250a01d4f6
- Trigger Event: workflow_dispatch
