Python toolkit for empirical research on the Court of Justice of the European Union

cjeu-py

A Python toolkit for empirical research on the Court of Justice of the European Union (CJEU). Collects structured data from the EU's CELLAR endpoint, parses judgment texts, extracts case-law citations, and builds research-ready datasets. Citations can optionally be classified by an LLM.

Why Python

cjeu-py is designed to be idiomatic Python: installation via pip install cjeu-py, a CLI entry point, pandas DataFrames in and out, standard logging, and a flat module structure you can import piecemeal. Data flows through the pipeline as Parquet files and JSONL logs, so each stage is independently inspectable and resumable. LLM classification uses structured JSON output rather than hand-coded labels, making taxonomies easy to extend.

The goal is a toolkit that fits naturally into the workflows that computational social scientists and NLP researchers already use – Jupyter notebooks, pandas, scikit-learn, HuggingFace – without requiring a separate ecosystem.

What it does

Stage	What	How
Collect	Case metadata, citation networks, subject matter (4 taxonomies), procedural links, legislation links, AG opinion links, academic citations, referring court details	CELLAR SPARQL endpoint (CDM ontology)
Download	Full judgment and AG opinion texts	CELLAR REST API (content negotiation, XHTML)
Parse	Court composition, parties, representatives, procedural dates	XHTML header parser (structural, not heuristic)
Derive	Judge assignments, case names, operative parts	Flattened from parsed headers
Extract	Case-law citations with paragraph-level context	Regex (14 patterns) + italic markers + party name matching
Classify (optional)	Precision, use, treatment, topic of each citation	Gemini structured JSON output (requires API key)
Validate (optional)	Classification quality	Stratified sample export for human coding
Scrape (optional)	Judge biographical data (all current and former members)	Curia.europa.eu + LLM structured extraction (requires API key)
Search	Full-text, party, citation graph, topic, legislation, and live CELLAR headnote queries	`cjeu-py search text "proportionality"`
Network	Interactive citation network with centrality metrics, community detection, subject/procedure/year filters	Self-contained HTML (D3.js), GEXF (Gephi / Gephi Lite), D3 JSON
Export	All tables as CSV or Excel	`cjeu-py export --format csv`

Interactive citation network

export-network builds a directed citation graph from cached pipeline data (nodes = cases, edges = citing → cited) and exports it in three formats:

HTML – a self-contained interactive visualisation (D3.js force-directed graph) that opens directly in a browser, with no server or dependencies required
GEXF – for Gephi desktop and Gephi Lite, with all node attributes (centrality, subjects, formation, procedure) preserved as typed attributes
D3 JSON – for custom web visualisations or programmatic analysis

Each node carries PageRank, betweenness centrality, in-degree, out-degree, and a Louvain community assignment. The graph includes external cited cases (authorities cited by your downloaded decisions but not part of the downloaded set) to preserve full citation neighbourhoods. External nodes have year and court derived from the CELEX number; run enrich-network (below) to add full metadata.
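As a rough illustration of how year and court can be read off a CELEX number, and how external nodes can be told apart from downloaded ones, here is a minimal sketch. It is not the package's code: the function names, the output field names, and the edge list are made up for illustration. It assumes the usual sector 6 layout (sector digit, four-digit year, two-letter descriptor, four-digit serial) and treats anything else, such as corrigenda suffixes, as unparseable. Note that the year embedded in a case-law CELEX number is the year of the case number, which can differ from the year of the judgment.

```python
import re
from collections import Counter

# Sector 6 (case law) CELEX numbers look like 62014CJ0362:
# sector digit, four-digit year, two-letter descriptor, serial number.
CELEX_RE = re.compile(r"^6(?P<year>\d{4})(?P<kind>[A-Z]{2})(?P<serial>\d{4})$")

# The first letter of the descriptor identifies the court.
COURTS = {"C": "CJ", "T": "GC", "F": "CST"}


def parse_celex(celex):
    """Return (year, court) for a case-law CELEX number, or (None, None)."""
    match = CELEX_RE.match(celex)
    if match is None:
        return None, None
    return int(match["year"]), COURTS.get(match["kind"][0])


def describe_nodes(edges, downloaded):
    """Summarise every node of a (citing, cited) edge list."""
    in_deg = Counter(cited for _, cited in edges)
    out_deg = Counter(citing for citing, _ in edges)
    nodes = {}
    for celex in sorted(set(in_deg) | set(out_deg)):
        year, court = parse_celex(celex)
        nodes[celex] = {
            "year": year,
            "court": court,
            "in_degree": in_deg[celex],
            "out_degree": out_deg[celex],
            # External = cited by the corpus but not itself downloaded.
            "external": celex not in downloaded,
        }
    return nodes


# Made-up edges for illustration only.
edges = [("62018CJ0311", "62014CJ0362"), ("62018CJ0311", "62010TJ0001")]
nodes = describe_nodes(edges, downloaded={"62018CJ0311"})
print(nodes["62014CJ0362"])
```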

The HTML export provides:

Interactive controls – node sizing (by any centrality metric, with user-definable min/max radius), node colouring (community, procedure type, year, court, or formation), edge thickness scaling, toggleable community hull shading
Filters – year range slider, court checkboxes (CJ/GC/CST), subject matter checkboxes (36 case-law subject codes with human-readable labels, sorted by frequency), procedure type checkboxes, with All/None toggles
Detail sidebar – click any node to see collapsible sections: case metadata (CELEX, ECLI, date, court, formation, procedure, judge-rapporteur, AG, subjects), centrality metrics, procedural links (joined cases, appeals, interveners, annulled acts), legislation links, academic citations, and citing/cited-by lists. Available sections depend on which metadata tiers have been downloaded

A pre-built example is available at examples/grand_chamber_network.html; download it and open it in a browser. It contains the 500 most central Grand Chamber cases (by PageRank) plus the cases they cite.

Handling external nodes: By default, nodes outside your downloaded set appear with limited metadata. Two options:

# Option 1: Enrich external nodes by fetching their metadata from CELLAR
# (one-time, cached — fetches ECLI, date, court, formation)
cjeu-py enrich-network

# Then export as usual — external nodes now have full metadata
cjeu-py export-network --format html --max-nodes 500

# Option 2: Restrict to downloaded decisions only (no external nodes)
cjeu-py export-network --format html --internal-only

Other export examples:

# Full network as interactive HTML
cjeu-py export-network --format html

# Cap at 500 most central cases (PageRank) for fast rendering
cjeu-py export-network --format html --max-nodes 500

# Filter by subject and date
cjeu-py export-network --format html --topic "competition" --date-from 2010-01-01

# GEXF for Gephi
cjeu-py export-network --format gexf --formation GRAND_CH

Networks above 5,000 nodes trigger a performance warning; above 10,000 a stronger warning suggests using --max-nodes or filters. Gephi desktop handles large networks without issue.
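To make the idea behind capping by centrality concrete, the sketch below ranks nodes with a plain power-iteration PageRank and keeps only edges between the top-ranked nodes. It is an assumption about how such a cap could work, not a description of what --max-nodes does internally; the function names, damping factor, and iteration count are illustrative choices.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (citing, cited) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}
    out_links = {node: [] for node in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes that cite nothing is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for node in nodes:
            for target in out_links[node]:
                new_rank[target] += damping * rank[node] / len(out_links[node])
        rank = new_rank
    return rank


def cap_nodes(edges, max_nodes):
    """Keep only edges whose endpoints are both among the top-ranked nodes."""
    rank = pagerank(edges)
    keep = set(sorted(rank, key=rank.get, reverse=True)[:max_nodes])
    return [(citing, cited) for citing, cited in edges if citing in keep and cited in keep]


# Toy graph: A and B cite C, and C cites D.
toy_edges = [("A", "C"), ("B", "C"), ("C", "D")]
toy_rank = pagerank(toy_edges)
print(cap_nodes(toy_edges, max_nodes=2))
```

Because rank flows from citing to cited cases, heavily cited authorities end up at the top, which is why a capped network tends to retain the landmark judgments.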

Quick start

# Install from PyPI
pip install cjeu-py

# Or install with all optional dependencies (LLM, network analysis, visualisation)
pip install cjeu-py[all]

# Set your Gemini API key (only needed for classification)
export GEMINI_API_KEY="your-key-here"

# Download metadata from CELLAR (cached to disk – runs once, then instant)
cjeu-py download-cellar --max-items 100

# Filter by court, formation, judge, date range
cjeu-py download-cellar --court CJ --formation GRAND_CH --date-from 2020-01-01
cjeu-py download-cellar --judge Lenaerts --date-from 2015-01-01 --date-to 2020-12-31

# Re-download even if local data exists
cjeu-py download-cellar --force

# Download extended metadata (joins, appeals, legislation links, AG opinions, case names, etc.)
cjeu-py download-cellar-meta

# Download only high-value metadata (skip academic citations and referring judgments)
cjeu-py download-cellar-meta --detail high

# Include rare legislation link types
cjeu-py download-cellar-meta --detail all

# Supplementary research data (dossiers, summaries, successor chains, etc.)
cjeu-py download-cellar-meta --detail exhaustive

# Every remaining CDM property (administrative metadata bulk dump)
cjeu-py download-cellar-meta --detail kitchen_sink

# Download judgment texts (20 concurrent connections, resumable)
cjeu-py fetch-texts --max-items 50

# Download with language fallback (try English, then French)
cjeu-py fetch-texts --lang eng,fra

# Parse judgment headers (composition, parties, representatives)
cjeu-py parse-headers data/texts/xhtml/

# Extract citations
cjeu-py extract-citations

# Merge data sources
cjeu-py merge

# Classify citations via LLM (5 concurrent workers, safe for free tier)
cjeu-py classify --max-items 20

# Use higher concurrency with a Tier 2 Gemini API key
cjeu-py classify --max-workers 50

# Export sample for human validation
cjeu-py validate --sample-size 50

# Scrape judge biographical data from Curia
cjeu-py scrape-judges --cache-dir data/raw/judges/cache/

# Extract structured bios via LLM
cjeu-py extract-judge-bios --max-items 10

# Export citation network as interactive HTML (see section above)
cjeu-py export-network --format html --max-nodes 500

# Export all pipeline data as CSV
cjeu-py export --format csv

# Generate variable codebook
cjeu-py codebook

All variable definitions are documented in CODEBOOK.md.

Local model support

Classification works with any OpenAI-compatible API (including Ollama, vLLM, llama.cpp, and LM Studio) via the --provider openai flag.

# Start Ollama with Gemma 2
ollama pull gemma2
ollama serve

# Classify using local Gemma 2
cjeu-py classify --provider openai --model gemma2

# Use a different endpoint (e.g. vLLM on a remote server)
cjeu-py classify --provider openai --model meta-llama/Llama-3.1-8B \
    --api-base http://gpu-server:8000/v1

# Use LM Studio
cjeu-py classify --provider openai --model local-model \
    --api-base http://localhost:1234/v1

Unlike Gemini, local models do not guarantee structured JSON output. The pipeline therefore validates each response against the expected schema and retries up to 3 times on malformed responses. Larger models (13B+) tend to produce more reliable structured output.
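The validate-and-retry behaviour can be pictured with a small sketch like the one below. It is not the package's implementation: the field names, the function names, and the reading of "up to 3 times" as three attempts in total are all assumptions, and the model is replaced by a stand-in callable so the example runs offline.

```python
import json

# Hypothetical field names, one per classification dimension.
REQUIRED_KEYS = {"precision", "use", "treatment", "topic"}
MAX_ATTEMPTS = 3  # assumption: three attempts in total


def parse_classification(raw):
    """Return the parsed object, or None if it is not a JSON object with all keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data


def classify_with_retry(call_model, prompt, max_attempts=MAX_ATTEMPTS):
    """Call the model until it returns conforming JSON or attempts run out."""
    for _ in range(max_attempts):
        result = parse_classification(call_model(prompt))
        if result is not None:
            return result
    return None


# Stand-in model: fails twice (bad JSON, then missing keys), then succeeds.
replies = iter([
    "not json",
    '{"precision": "general reference"}',
    '{"precision": "general reference", "use": "principle", '
    '"treatment": "follows", "topic": "data protection"}',
])
result = classify_with_retry(lambda prompt: next(replies), "classify this citation")
print(result)
```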

Search

Query collected data or the live CELLAR endpoint directly from the command line.

Local searches (text, party, citing, cited-by, topic, legislation, list) query data you have already downloaded. Remote searches (headnote) query the CELLAR SPARQL endpoint live and require no local data.

# Full-text search across downloaded judgment paragraphs
cjeu-py search text "common market"
cjeu-py search text "proportionality" --limit 50

# Search by party name
cjeu-py search party "Google"
cjeu-py search party "Commission v Germany" --date-from 2015-01-01

# Citation graph queries
cjeu-py search citing 62014CJ0362      # cases citing Schrems I
cjeu-py search cited-by 62014CJ0362    # cases cited by Schrems I

# Search by subject matter (code or label)
cjeu-py search topic "State aid"
cjeu-py search topic PDON

# Cases linked to a piece of legislation
cjeu-py search legislation 32016R0679

# Live CELLAR headnote/title search (no local data needed)
cjeu-py search headnote "data protection"
cjeu-py search headnote "state aid"

# List available values
cjeu-py search list topics
cjeu-py search list judges
cjeu-py search list formations

All search modes support --format csv and --format json for piping into other tools. Local searches also accept --date-from, --date-to, and --court filters.

Data sources

CELLAR SPARQL

Metadata is collected via the CELLAR SPARQL endpoint at https://publications.europa.eu/webapi/rdf/sparql using CDM ontology properties:

Core metadata – CELEX, ECLI, date, court formation, judge-rapporteur, advocate general, procedure type, procedural classification, published-in-eReports flag, authentic language, EEA relevance
Extended metadata – defendant/applicant agents, referring court, treaty basis, date lodged
Relational data – citation network, joined cases, appeals, interveners, annulled acts
Legislation links – which legislation a case interprets, confirms, amends, annuls, etc. (17 link types)
AG opinion links – direct judgment-to-AG-opinion pairing via CELEX
Subject matter – four taxonomies: EuroVoc (260 broad categories), case-law subject matter, hierarchical case-law directory (fd_578, ~3,800 codes), and old directory (fd_577)
Academic citations – journal articles discussing each case (bibliographic references from CELLAR)
Referring national judgments – court name, decision type, date, and reference number for preliminary rulings
Case names – short popular names (expression_title_alternative, older cases) and full party names (expression_case-law_parties, newer cases), queried at the expression level
Supplementary data (--detail exhaustive) – dossier groupings, case summaries, miscellaneous information, successor chains, legislative incorporation links
Administrative metadata (--detail kitchen_sink) – all remaining CDM properties (authoring institution, creation dates, transmission timestamps, collection memberships, obsolete identifiers, etc.) as a long-format table

Each query is paginated and saves results as Parquet. Queries target specific CDM properties rather than fetching all triples. Five detail levels control how much metadata to collect: high, medium (default), all, exhaustive, kitchen_sink.
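A paginated, property-specific query can be sketched as follows. This is a hedged illustration rather than the package's query code: the two CDM properties shown (cdm:resource_legal_id_celex and cdm:work_cites_work) are believed to exist in the CDM ontology but should be checked against it, and the page size, function name, and the run_query callable are hypothetical. The callable is expected to send the query to the endpoint and return a list of rows.

```python
QUERY_TEMPLATE = """
PREFIX cdm: <http://publications.europa.eu/ontology/cdm#>
SELECT ?celex ?cited WHERE {{
  ?work cdm:resource_legal_id_celex ?celex ;
        cdm:work_cites_work ?cited_work .
  ?cited_work cdm:resource_legal_id_celex ?cited .
}}
ORDER BY ?celex ?cited
LIMIT {limit} OFFSET {offset}
"""


def paginate(run_query, page_size=1000, max_pages=None):
    """Yield rows page by page until a short page signals the end."""
    page = 0
    while max_pages is None or page < max_pages:
        query = QUERY_TEMPLATE.format(limit=page_size, offset=page * page_size)
        rows = run_query(query)
        yield from rows
        if len(rows) < page_size:
            break  # last page reached
        page += 1
```

The ORDER BY clause matters here: without a stable ordering, LIMIT/OFFSET pages are not guaranteed to be disjoint.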

Caching

All CELLAR downloads are cache-first: if a Parquet file already exists on disk, it is reused without hitting the network. This makes re-running the pipeline instant after the first download and protects against network interruptions.

# First run: downloads from CELLAR (~2 min for Grand Chamber)
cjeu-py download-cellar --court CJ --formation GRAND_CH

# Second run: loads from disk in <1s
cjeu-py download-cellar --court CJ --formation GRAND_CH

# Force re-download (overwrites cached files)
cjeu-py download-cellar --force

The same applies to download-cellar-meta – each table (joined cases, appeals, legislation links, AG opinions, case names, etc.) is cached independently. Text downloads via fetch-texts are resumable via a checkpoint file, so interrupted downloads pick up where they left off.
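The two mechanisms described here, cache-first loading and checkpoint-based resumption, can be illustrated with a short standard-library sketch. It is an assumption-laden approximation: it caches JSON instead of Parquet to avoid extra dependencies, and the function names and the one-id-per-line checkpoint format are hypothetical rather than the package's actual layout.

```python
import json
from pathlib import Path


def load_or_fetch(cache_path, fetch, force=False):
    """Return cached data if present, otherwise call fetch() and cache the result."""
    cache_path = Path(cache_path)
    if cache_path.exists() and not force:
        return json.loads(cache_path.read_text(encoding="utf-8"))
    data = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(data), encoding="utf-8")
    return data


def fetch_resumable(ids, fetch_one, checkpoint_path):
    """Fetch each id once, recording finished ids so a rerun skips them."""
    checkpoint_path = Path(checkpoint_path)
    done = set()
    if checkpoint_path.exists():
        done = set(checkpoint_path.read_text(encoding="utf-8").split())
    with checkpoint_path.open("a", encoding="utf-8") as checkpoint:
        for item_id in ids:
            if item_id in done:
                continue
            fetch_one(item_id)
            # Record progress immediately so an interruption loses at most one item.
            checkpoint.write(item_id + "\n")
            checkpoint.flush()
```

Passing force=True mirrors the effect of --force: the cached file is ignored and overwritten.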

CELLAR REST API

Full judgment and opinion texts are downloaded via content negotiation against the canonical CELLAR resource URI:

GET http://publications.europa.eu/resource/celex/{CELEX}
Accept: application/xhtml+xml, text/html
Accept-Language: eng

This returns the official XHTML representation – well-structured, with semantic CSS classes (coj-normal, coj-bold, coj-italic, coj-count) that preserve document structure. No website scraping involved. With 20 concurrent connections (default), the full Grand Chamber corpus (976 documents) downloads in under 2 minutes. Downloads are resumable – interrupted fetches pick up where they left off via a checkpoint file.
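For readers who want to reproduce the request outside the CLI, the following sketch builds the same content-negotiated request with the standard library. The function names, the timeout, and the assumption that the response decodes as UTF-8 are illustrative and not taken from the package; only the URI pattern and headers come from the request shown above.

```python
import urllib.request

BASE_URI = "http://publications.europa.eu/resource/celex/"


def build_request(celex, lang="eng"):
    """Build (but do not send) the content-negotiated request for one document."""
    return urllib.request.Request(
        BASE_URI + celex,
        headers={
            "Accept": "application/xhtml+xml, text/html",
            "Accept-Language": lang,
        },
    )


def fetch_xhtml(celex, lang="eng", timeout=30):
    """Send the request and return the response body as text (assumes UTF-8)."""
    request = build_request(celex, lang)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the request does not touch the network.
request = build_request("62014CJ0362")
print(request.full_url)
```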

Judgment header parser

The XHTML header (everything before paragraph 1) contains structured metadata that CELLAR SPARQL does not fully expose. The parser extracts:

Court composition – full panel with roles (President, Vice-President, Presidents of Chambers, Rapporteur, Judges)
Parties – applicants, defendants, and interveners
Representatives – lawyers and agents for each party, with professional titles
Procedural dates – hearing date, AG opinion delivery date
Operative part – the Court's ruling, extracted via the "On those grounds" delimiter

The parser handles both pre-2016 (class="normal") and post-2016 (class="coj-normal") XHTML formats. From parsed data, parse-headers also produces:

Assignments table – one row per judge per decision (Parquet)
Case names – "applicant v defendant" format (Parquet)
Operative parts – full dispositif text (JSONL)
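The derived outputs can be thought of as simple transformations of the parsed header. The sketch below shows two of them under stated assumptions: the dictionary keys (celex, composition, name, role) and the function names are hypothetical, and the real parser works on XHTML rather than on a ready-made list of paragraph strings.

```python
def split_operative_part(paragraphs, delimiter="On those grounds"):
    """Return the paragraphs that follow the first one starting with the delimiter."""
    for index, text in enumerate(paragraphs):
        if text.strip().startswith(delimiter):
            return paragraphs[index + 1:]
    return []


def flatten_assignments(headers):
    """Turn one parsed header per decision into one row per judge per decision."""
    rows = []
    for header in headers:
        for member in header["composition"]:
            rows.append({
                "celex": header["celex"],
                "judge": member["name"],
                "role": member["role"],
            })
    return rows


# Placeholder data for illustration only.
sample_headers = [{
    "celex": "62014CJ0362",
    "composition": [
        {"name": "Judge A", "role": "President"},
        {"name": "Judge B", "role": "Rapporteur"},
    ],
}]
sample_paragraphs = [
    "Costs",
    "On those grounds, the Court (Grand Chamber) hereby rules:",
    "1. First point of the ruling.",
    "2. Second point of the ruling.",
]
print(flatten_assignments(sample_headers))
print(split_operative_part(sample_paragraphs))
```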

Curia biographical data

Judge biographical data is scraped from the Court's official member pages at curia.europa.eu. Raw bios are then structured via LLM into: birth year, nationality, gender, education, prior careers, CJEU roles with dates, and death year. Covers all current and former members of the Court of Justice, General Court, and Civil Service Tribunal (~260 individuals).

Citation extraction

Three detection layers run in sequence:

Regex (14 patterns) – formal case references: ECLI identifiers, Case C-xxx/xx, joined cases, ECR references, paragraph pinpoints
Italic markers – case names in *italics* preserved from CELLAR XHTML <span class="coj-italic">, matching the Court's typographic convention for case names
Party name matching – gazetteer built from XHTML headers of cited cases, searched against the citing document text

Citations are anchored to their source paragraph and enriched with configurable context windows for downstream classification.
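To give a feel for the regex layer, here is a deliberately small sketch with two patterns, one for ECLI identifiers and one for case numbers. These are illustrative patterns written for this example, not the 14 patterns shipped with the package, and the record field names are hypothetical. The case-number pattern also accepts the non-breaking hyphen (U+2011), which often appears in published judgment texts.

```python
import re

PATTERNS = {
    # Matches both "ECLI:EU:C:2015:650" and the shorter "EU:C:2015:650".
    "ecli": re.compile(r"\b(?:ECLI:)?EU:[CTF]:\d{4}:\d+\b"),
    # Matches case numbers such as C-362/14 or T-201/04.
    "case_number": re.compile(r"\b[CTF][-\u2011]\d{1,4}/\d{2}\b"),
}


def extract_citations(paragraphs):
    """Return one record per match, anchored to the citing paragraph number."""
    found = []
    for number, text in paragraphs:
        for kind, pattern in PATTERNS.items():
            for match in pattern.finditer(text):
                found.append({
                    "paragraph": number,
                    "kind": kind,
                    # Normalise the non-breaking hyphen to a plain one.
                    "reference": match.group(0).replace("\u2011", "-"),
                })
    return found


sample = [
    (12, "see judgment of 6 October 2015, Schrems, C\u2011362/14, "
         "EU:C:2015:650, paragraph 39"),
    (13, "That interpretation follows from the wording of the provision."),
]
citations = extract_citations(sample)
print(citations)
```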

Classification taxonomy (optional)

If you have a Gemini API key (or a local OpenAI-compatible model, see Local model support), extracted citations can optionally be classified along four dimensions. The default is Gemini 2.5 Flash with structured JSON output. Classification requires judgment texts to have been downloaded and citations extracted first (fetch-texts → extract-citations → classify).

Dimension	Categories
Precision	string citation, general reference, substantive engagement
Use	principle, interpretation, legal test, factual analogy, procedural, definition, distinguish, other
Treatment	follows, extends, distinguishes (facts/law/scope), departs (explicit/implicit), neutral
Topic	Free-text area of EU law

The taxonomy draws on Marc Jacob's Precedents and Case-Based Reasoning in the European Court of Justice (Cambridge, 2014). Classification uses structured JSON output with schema validation, so categories are easy to extend without changing extraction code.
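One way to picture why schema-validated categories are easy to extend is to hold the taxonomy in a plain data structure and validate against it, as in the sketch below. The dimension keys and the exact spelling of the labels (in particular the expanded treatment sub-categories) are assumptions made for this example and may not match the strings the package emits.

```python
# Hypothetical label spellings, derived from the table above.
TAXONOMY = {
    "precision": {"string citation", "general reference", "substantive engagement"},
    "use": {
        "principle", "interpretation", "legal test", "factual analogy",
        "procedural", "definition", "distinguish", "other",
    },
    "treatment": {
        "follows", "extends",
        "distinguishes (facts)", "distinguishes (law)", "distinguishes (scope)",
        "departs (explicit)", "departs (implicit)",
        "neutral",
    },
}


def validation_errors(record):
    """Return a list of problems; an empty list means the record is valid."""
    errors = []
    for dimension, allowed in TAXONOMY.items():
        value = record.get(dimension)
        if value not in allowed:
            errors.append(f"{dimension}: unexpected value {value!r}")
    # Topic is free text, so only check that something was provided.
    topic = record.get("topic")
    if not isinstance(topic, str) or not topic.strip():
        errors.append("topic: expected non-empty free text")
    return errors


valid = {
    "precision": "substantive engagement",
    "use": "legal test",
    "treatment": "follows",
    "topic": "data protection",
}
print(validation_errors(valid))
print(validation_errors({"precision": "vague", "use": "principle", "treatment": "neutral"}))
```

Under this design, adding a category means adding one string to the relevant set; the extraction code is untouched.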

Project structure

cjeu-py/
├── cjeu_py/                           # Core library (pip-installable)
│   ├── main.py                        # CLI entry point (15 commands)
│   ├── config.py                      # Central configuration
│   ├── data_collection/               # CELLAR SPARQL + REST clients, header parser, Curia scraper
│   ├── citation_extraction/           # Regex patterns, context windows, party name matching
│   ├── search.py                      # CLI search (8 modes: text, headnote, party, citing, etc.)
│   ├── classification/                # LLM pipeline with checkpointing & cost tracking
│   ├── llm/                           # Gemini + OpenAI-compatible API wrapper
│   └── utils/                         # XHTML parsing, logging utilities
│
├── examples/                          # Pre-built example outputs
├── data/                              # Pipeline output (Parquet, JSONL, cached XHTML)
├── tests/                             # 93 tests
├── CODEBOOK.md                        # Variable definitions for all tables
├── CITATION.cff                       # Academic citation metadata
├── LICENSE                            # MIT
├── pyproject.toml
└── requirements.txt

Current limitations

No procedural event timeline – hearing and AG opinion dates are parsed from headers, but the full event sequence (date lodged, written procedure, oral hearing) requires InfoCuria scraping.
Header parser coverage – tested on Grand Chamber cases (2013–2025). Earlier cases and smaller formations may have formatting variations.
Citation extraction tuned for English – texts can be downloaded in all 24 EU official languages via --lang, but citation regex patterns are currently tuned for English-language judgments.

Testing

python -m pytest tests/ -v

If you use R

If R is your preferred language, see Michal Ovádek's eurlex package, which provides access to EUR-Lex data including CJEU case law via the CELLAR SPARQL endpoint.

Acknowledgements

Marc Jacob – Precedents and Case-Based Reasoning in the European Court of Justice (Cambridge, 2014). Taxonomy of citation use and treatment.
EU Publications Office – CELLAR SPARQL endpoint, CDM ontology, and REST API.
Court of Justice of the European Union – biographical data from official member pages at curia.europa.eu.

Project details

Maintainers

niccoloridi

Release history

Version	Release date
0.5.0	Mar 12, 2026
0.4.1	Mar 1, 2026
0.4.0	Mar 1, 2026
0.3.0	Mar 1, 2026
0.2.3	Feb 28, 2026
0.2.2	Feb 28, 2026
0.2.1 (this version)	Feb 28, 2026
0.2.0	Feb 28, 2026
0.1.3	Feb 27, 2026
0.1.2	Feb 27, 2026
0.1.1	Feb 27, 2026
0.1.0	Feb 27, 2026

Download files

Source Distribution

cjeu_py-0.2.1.tar.gz (96.1 kB)

Uploaded Feb 28, 2026 Source

Built Distribution

cjeu_py-0.2.1-py3-none-any.whl (91.4 kB)

Uploaded Feb 28, 2026 Python 3

File details

Details for the file cjeu_py-0.2.1.tar.gz.

File metadata

Download URL: cjeu_py-0.2.1.tar.gz
Upload date: Feb 28, 2026
Size: 96.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for cjeu_py-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`46ce5de47fe5c53dfeabc1ce0d824907eb57336a809a60732d4e9ae21056b82a`
MD5	`5a807e75ddf9202b89b9aa5c89719f88`
BLAKE2b-256	`d3d69250e34a97de368e543d2ad0cec9be5d24637b167894285f7309588a64eb`

Provenance

The following attestation bundles were made for cjeu_py-0.2.1.tar.gz:

Publisher: publish.yml on niccoloridi/cjeu-py

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cjeu_py-0.2.1.tar.gz
- Subject digest: 46ce5de47fe5c53dfeabc1ce0d824907eb57336a809a60732d4e9ae21056b82a
- Sigstore transparency entry: 1004999077
- Sigstore integration time: Feb 28, 2026
Source repository:
- Permalink: niccoloridi/cjeu-py@72cf9e084a457b24605b072ada37947b187360d1
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/niccoloridi
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@72cf9e084a457b24605b072ada37947b187360d1
- Trigger Event: release

File details

Details for the file cjeu_py-0.2.1-py3-none-any.whl.

File metadata

Download URL: cjeu_py-0.2.1-py3-none-any.whl
Upload date: Feb 28, 2026
Size: 91.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for cjeu_py-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1309877f1b508faa8eb6c2101bb979b70fbd09e58a22cb5fdd675e9bcded3f41`
MD5	`715a92cfc463a4293510cb0a59a0101a`
BLAKE2b-256	`3fbfad231b6118968452579eca04de78acf88d03dab211c506d48b2304839760`

Provenance

The following attestation bundles were made for cjeu_py-0.2.1-py3-none-any.whl:

Publisher: publish.yml on niccoloridi/cjeu-py

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cjeu_py-0.2.1-py3-none-any.whl
- Subject digest: 1309877f1b508faa8eb6c2101bb979b70fbd09e58a22cb5fdd675e9bcded3f41
- Sigstore transparency entry: 1004999081
- Sigstore integration time: Feb 28, 2026
Source repository:
- Permalink: niccoloridi/cjeu-py@72cf9e084a457b24605b072ada37947b187360d1
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/niccoloridi
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@72cf9e084a457b24605b072ada37947b187360d1
- Trigger Event: release
