
Cellar Extractor

CI Coverage

A Python library for extracting CELLAR case law data from EUR-Lex.

This library provides functions to fetch CELLAR case law data from the EUR-Lex SPARQL endpoint and to enrich it with additional information from InfoCuria and CELLAR item sources.

Version

1.3.0 (requires Python 3.9+)

Tests

  • CI: the badge above tracks the status of the default test workflow
  • Coverage: the badge above tracks the coverage snapshot of the local test suite

Contributors

pranavnbapat (Pranav Bapat)
Cloud956 (Piotr Lewandowski)
shashankmc
gijsvd
venvis
davidwickerhf

How to install?

pip install cellar-extractor

What The Project Does

cellar-extractor builds enriched EUR-Lex / CELLAR case-law datasets.

It starts from CELLAR metadata and then adds:

  • citation edges
  • summaries and keywords
  • full text
  • sector-specific metadata
  • graph-ready node/edge projections

The extractor is currently centered on:

  • sector 6 case law: CJEU-style material via InfoCuria
  • sector 8 case law: mixed / national-case-law material via CELLAR RDF + item downloads

The main workflow has two stages.

  1. get_cellar(...)
    • fetches the base CELLAR corpus
    • returns CSV-like dataframe output or JSON-like dictionary output
  2. get_cellar_extra(...)
    • enriches that corpus with citations, full text, summaries, keywords, provenance, and missing-data flags

The citation graph is extracted through the public CELLAR SPARQL endpoint. Legacy EUR-Lex SOAP webservice support is kept only for validation tests and is no longer part of the production path.

Data Sources By Type

  • Base corpus metadata: CELLAR SPARQL
  • Citation edges (citing, cited_by): CELLAR SPARQL
  • Sector 6 full text and structured metadata: InfoCuria
  • Sector 8 full text and summaries: CELLAR RDF + downloadable item manifestations
  • Legacy citation comparison only: EUR-Lex SOAP webservice

Quick Start

1. Fetch Base CELLAR Metadata

import cellar_extractor as cell

df = cell.get_cellar(
    save=False,
    file_format="csv",
    sd="2025-01-01",
    ed="2025-01-31T23:59:59",
    max_ecli=100,
)

Returns a dataframe with base metadata such as CELEX, ECLI, type, dates, and subject-matter-related fields.

You can also save explicitly to a custom path instead of the default data/ location:

cell.get_cellar(
    save=True,
    file_format="csv",
    sd="2025-01-01",
    ed="2025-01-31T23:59:59",
    output_path="exports/cellar_january.csv",
)

2. Fetch The Enriched Dataset

import cellar_extractor as cell

extra_df, fulltext = cell.get_cellar_extra(
    save=False,
    sd="2025-01-01",
    ed="2025-01-31T23:59:59",
    max_ecli=100,
    threads=4,
)

Returns:

  • extra_df: enriched dataframe
  • fulltext: list of JSON rows containing extracted text and provenance

You can independently control where the enriched CSV and fulltext JSON are written:

cell.get_cellar_extra(
    save=True,
    sd="2025-01-01",
    ed="2025-01-31T23:59:59",
    metadata_output_path="exports/cellar_extra.csv",
    fulltext_output_path="exports/cellar_fulltext.json",
    threads=4,
)

3. Build A Citation Graph

import cellar_extractor as cell

nodes, edges = cell.get_nodes_and_edges_lists(extra_df, only_local=True)

only_local=True keeps only edges whose target CELEX is also present in extra_df.
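What only_local=True does can be illustrated in plain Python. This is a simplified sketch of the filtering behaviour, not the library's implementation, and the CELEX identifiers below are invented for the example:

```python
# Sketch of only_local=True: keep an edge only when its target (cited) CELEX
# also appears among the local nodes. The identifiers are invented.
local_nodes = {"62019CJ0001", "62020CJ0042", "62021CJ0100"}

edges = [
    ("62019CJ0001", "62020CJ0042"),  # target is local -> kept
    ("62020CJ0042", "61999CJ0999"),  # target outside the corpus -> dropped
    ("62021CJ0100", "62019CJ0001"),  # target is local -> kept
]

local_edges = [(src, tgt) for src, tgt in edges if tgt in local_nodes]
print(local_edges)
```

With only_local=False, edges pointing to documents outside the fetched corpus would be kept as well.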

4. Filter By Subject Matter

filtered = cell.filter_subject_matter(extra_df, "competition")

Full-Scrape Strategy

If you want the largest reproducible scrape, do not run one enormous date range blindly. Use bounded windows and persist each window.

Recommended approach:

  1. choose a date window by sd / ed
  2. run get_cellar(...) or get_cellar_extra(...)
  3. save outputs to disk
  4. repeat for the next window
  5. concatenate downstream

Practical guidance:

  • use month-sized or week-sized windows for stability
  • keep threads moderate, typically 4 to 10
  • use save=True for long runs
  • keep the fulltext JSON files; they are the canonical extracted text output

Example file-based run:

import cellar_extractor as cell

cell.get_cellar_extra(
    save=True,
    sd="2025-01-01",
    ed="2025-01-31T23:59:59",
    max_ecli=5000,
    threads=6,
)

By default this writes into data/:

  • a CSV with the enriched tabular dataset
  • a _fulltext.json file with the text rows
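The bounded-window strategy can be sketched as a small helper that yields month-sized (sd, ed) pairs; each pair would then be passed to one get_cellar_extra(save=True, ...) run as shown above. The helper below is an illustrative sketch, not part of the library:

```python
from datetime import date, timedelta

def month_windows(start: date, end: date):
    """Yield (sd, ed) string pairs covering [start, end] in month-sized windows."""
    current = start
    while current <= end:
        # First day of the next month.
        if current.month == 12:
            nxt = date(current.year + 1, 1, 1)
        else:
            nxt = date(current.year, current.month + 1, 1)
        window_end = min(nxt - timedelta(days=1), end)
        yield (current.isoformat(), f"{window_end.isoformat()}T23:59:59")
        current = nxt

for sd, ed in month_windows(date(2025, 1, 1), date(2025, 3, 31)):
    print(sd, ed)
    # cell.get_cellar_extra(save=True, sd=sd, ed=ed, threads=6)  # one run per window
```

Each window writes its own CSV and _fulltext.json, which can be concatenated downstream.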

Main Outputs

get_cellar_extra(...) produces:

  1. an enriched dataframe / CSV
  2. a fulltext JSON list / file

Important Enriched DataFrame Columns

  • citing
  • cited_by
  • celex_summary
  • celex_keywords
  • celex_directory_codes
  • celex_eurovoc
  • advocate_general
  • judge_rapporteur
  • affecting_ids
  • affecting_strings
  • citations_extra_info
  • fulltext_source
  • summary_source
  • missing_reasons

Important Fulltext JSON Fields

  • celex
  • ecli
  • text
  • text_source
  • text_language
  • text_format
  • missing_reasons

Completeness Rules

The extractor does not treat empty values as silent success.

Important cases:

  • if citation data exists, it should populate citing / cited_by
  • if a document has no citation edges, the columns still exist and are empty
  • if full text or summary is not available upstream, missing_reasons should reflect that

Typical missing_reasons values:

  • FULLTEXT_UNAVAILABLE_UPSTREAM
  • SUMMARY_UNAVAILABLE_UPSTREAM
  • UNAVAILABLE_UPSTREAM

Sector 8 is still best effort because upstream availability is uneven, but the extractor now flags absence explicitly instead of implying completeness.
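These rules make a run easy to audit afterwards. The snippet below is an illustrative sketch over hand-made fulltext rows; the rows and CELEX ids are invented, and only the field names follow the fulltext JSON format above:

```python
# Audit sketch: split fulltext JSON rows into complete and flagged ones.
rows = [
    {"celex": "62019CJ0001", "text": "judgment text", "missing_reasons": []},
    {"celex": "62020CJ0042", "text": "",
     "missing_reasons": ["FULLTEXT_UNAVAILABLE_UPSTREAM"]},
    {"celex": "62021CJ0100", "text": "judgment text",
     "missing_reasons": ["SUMMARY_UNAVAILABLE_UPSTREAM"]},
]

# Rows with an empty missing_reasons list are fully resolved upstream.
complete = [r["celex"] for r in rows if not r["missing_reasons"]]
# Flagged rows map each CELEX to the explicit reasons for missing content.
flagged = {r["celex"]: r["missing_reasons"] for r in rows if r["missing_reasons"]}

print(complete)
print(flagged)
```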

Public API Reference

Root-Level Package API

Imported from cellar_extractor/__init__.py:

  • get_cellar(...): fetch base CELLAR metadata
  • get_cellar_extra(...): fetch enriched metadata + full text
  • get_nodes_and_edges_lists(df, only_local=False): build citation graph lists
  • filter_subject_matter(df, phrase): filter dataframe by subject phrase
  • FetchOperativePart: extract the operative part from a single case document
  • Writing: write operative-part outputs to CSV / JSON / TXT

Core Modules

cellar_extractor/cellar.py

  • get_cellar(ed=None, save_file=<deprecated>, max_ecli=100, sd="2022-05-01", file_format="csv", output_dir="data", output_path=None, return_data=None, save=None)
  • get_cellar_extra(ed=None, save_file=<deprecated>, max_ecli=100, sd="2022-05-01", threads=10, username="", password="", output_dir="data", metadata_output_path=None, fulltext_output_path=None, save_metadata=None, save_fulltext=None, return_data=None, save=None)
  • get_nodes_and_edges_lists(df=None, only_local=False)
  • filter_subject_matter(df=None, phrase=None)

Notes:

  • username / password are legacy compatibility parameters and no longer change the extraction path
  • save is the preferred save toggle; save_file is kept as a deprecated compatibility alias
  • output_path, metadata_output_path, and fulltext_output_path let callers choose exact output locations instead of relying on fixed folders
  • when save flags are disabled, the package returns in-memory objects without writing files

cellar_extractor/citations_adder.py

  • add_citations_separate(data, threads): production citation enrichment
  • add_citations_separate_webservice(data, username, password): deprecated legacy comparison path
  • add_citations(data, threads): older citation replacement helper

cellar_extractor/fulltext_saving.py

  • add_sections(data, threads, output_path=None, json_filepath=None, fulltext_output_path=None): enriches summaries, keywords, text metadata, provenance, and missing-data flags

cellar_extractor/eurlex_scraping.py

Main higher-level adapter functions:

  • get_case_data_by_celex_id(celex, language="EN")
  • get_html_text_by_celex_id(id)
  • get_summary_html(celex)
  • get_full_text_from_html(html_text)

This module contains the sector-aware source logic for InfoCuria and CELLAR item retrieval.

cellar_extractor/sparql.py

  • get_citations(source_celex, cites_depth=1, cited_depth=1, max_retries=3)
  • get_citations_csv(celex, max_retries=3)
  • get_citing(celex, cites_depth, max_retries=3)
  • get_cited(celex, cited_depth, max_retries=3)
  • run_eurlex_webservice_query(query_input, username, password) for legacy SOAP validation only

cellar_extractor/cellar_sparql_queries.py

Advanced query helper class:

  • CellarSparqlQuery
    • get_endorsements()
    • get_subjects()
    • get_parties()
    • get_keywords()
    • get_citations()
    • get_grounds()

cellar_extractor/operative_extractions.py

Classes:

  • FetchOperativePart
  • Writing

Use this path when you want operative-part extraction for individual documents rather than the full dataset pipeline.

Upstream Endpoints Used

These are the upstream systems the extractor relies on.

  • CELLAR SPARQL (https://publications.europa.eu/webapi/rdf/sparql): corpus discovery, metadata, citation edges
  • InfoCuria webservice (https://infocuriaws.curia.europa.eu/...): sector 6 text and metadata
  • InfoCuria documents (https://infocuria.curia.europa.eu/document/...): sector 6 document HTML
  • CELLAR resource/item URLs (under https://publications.europa.eu/resource/cellar/...): sector 8 downloadable text / summary manifestations
  • EUR-Lex SOAP (https://eur-lex.europa.eu/EURLexWebService?wsdl): legacy redundancy tests only

Testing

Fast Local Suite

pytest -q

Live Integration Flags

  • RUN_INFOCURIA_INTEGRATION=1
  • RUN_SECTOR8_INTEGRATION=1
  • RUN_CITATION_INTEGRATION=1

Examples:

RUN_INFOCURIA_INTEGRATION=1 pytest -q tests/test_infocuria_integration.py
RUN_SECTOR8_INTEGRATION=1 pytest -q tests/test_sector8_integration.py
RUN_CITATION_INTEGRATION=1 pytest -q tests/test_citation_graph_integration.py

Legacy Webservice Tests

Only needed if you want to re-check SOAP redundancy:

RUN_WEBSERVICE_INTEGRATION=1 pytest -q tests/test_webservice_credentials_integration.py tests/test_webservice_redundancy_integration.py

If used, credentials are read from .env:

EURLEX_WEBSERVICE_USERNAME=
EURLEX_WEBSERVICE_PASSWORD=

These credentials are not required for normal extraction.

Troubleshooting

missing_reasons is populated

That means the extractor could not find the requested upstream content. This is expected when upstream does not expose a summary or full text for the document.

Citation columns are empty

Check:

  • that the document actually has graph relations upstream
  • the live SPARQL endpoint availability
  • whether you are looking at a very small or isolated sample

Sector 8 feels sparse

That is usually an upstream availability issue, not a silent extractor failure. Sector 8 is intentionally handled as best effort with explicit flags.

Releasing

This project uses setuptools_scm for automatic versioning based on git tags. Follow these steps to release a new version:

1. Create a git tag

git tag v<major>.<minor>.<patch>

For example:

git tag v1.2.3

2. Push the tag to remote

git push origin v<major>.<minor>.<patch>

License

Apache License 2.0
