
eu-consultations: A Python package for scraping textual data from EU public consultations


eu-consultations lets you scrape textual data from EU public consultations at https://ec.europa.eu/info/law/better-regulation. Its aim is to facilitate academic analysis of how the public participates in EU public consultations, in which the broader EU public is asked to supply input on proposed regulations.

The package has three main functions:

  • Scrape metadata on feedback by the public to EU consultations, filtered by topic and/or text search (in the title and description of consultations), by accessing the API that supplies the frontend of https://ec.europa.eu/info/law/better-regulation.
  • Download files (e.g. .pdf and .docx) attached to feedback
  • Extract text from files using docling

Downloaded data is validated and stored as JSON.
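
The classes and field names below are purely illustrative (the package's real dataclasses live in eu_consultations.consultation_data), but they show the general pattern of validating records as dataclasses and round-tripping them through JSON:

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Feedback:
    # Hypothetical fields, for illustration only
    feedback_id: int
    text: str
    attachments: list[str] = field(default_factory=list)


@dataclass
class Initiative:
    initiative_id: int
    title: str
    feedback: list[Feedback] = field(default_factory=list)


# Serialize a small example to JSON and back
initiative = Initiative(1, "Cloud regulation", [Feedback(10, "We support this.")])
raw = json.dumps(asdict(initiative))
restored = json.loads(raw)
```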

eu-consultations is partially based on https://github.com/desrist/AskThePublic.

Installation

⚠️ eu-consultations requires Python 3.12 or higher.

eu-consultations is available through PyPI:

pip install eu-consultations

How to use eu-consultations

The following describes the typical pipeline for using eu-consultations:

1) Get consultation data

Here we will scrape data on consultations matching the search terms "cloud" or "parrot" within the topic "DIGITAL". To get an overview of all available topics, use:

from eu_consultations.scrape import show_available_topics

show_available_topics()

Now let's scrape all metadata on feedback to consultations on the topic "DIGITAL" (Digital economy and society) where the search terms "cloud" or "parrot" appear.

from eu_consultations.scrape import scrape

initiatives_data = scrape(
    topic_list=["DIGITAL"],
    text_list=["cloud", "parrot"],
    max_pages=None,  # restrict the number of frontend pages to crawl
    max_feedback=None,  # set a maximum number of feedback items to gather
    output_folder=<my-folder>,
    filename="<my-filename>.json")

This:

  • serializes all data to .json in the given output_folder
  • returns a list of eu_consultations.consultation_data.Initiative dataclass objects, which store feedback data per consultation

2) Download files attached to consultation feedback

Using our initial scrape, we can now download all files attached to the feedback:

from eu_consultations.extract_filetext import download_consultation_files

data_with_downloads = download_consultation_files(
    initiatives_data = initiatives_data,
    output_folder=<my-folder>)

This:

  • downloads all attached files to a files/ sub-directory of output_folder
  • returns a list of eu_consultations.consultation_data.Initiative dataclass objects with file locations attached

3) Extract texts

A lot of feedback to consultations already contains the respondents' opinions as text available through step 1), but much of it is also contained in attached documents. Let's extract that text and attach it to our data:

from eu_consultations.extract_filetext import extract_text_from_attachments

data_with_extracted_text = extract_text_from_attachments(
    initiatives_data_with_attachments=data_with_downloads,  # created in step 2)
    stream_out_folder=<my-folder>/files  # stream out to the same location as the files
)

This:

  • extracts text from all files referenced in data_with_downloads
  • stores extracted text in lossless Docling JSON format inside the folder set by stream_out_folder, per document in a docling/ sub-directory and per consultation in a consultations/ sub-directory
  • returns a list of eu_consultations.consultation_data.Initiative dataclass objects with text and docling JSON attached.
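
The stream_out_folder layout makes it easy to pick extracted text back up with the standard library alone. This sketch writes a dummy per-document JSON file and collects it again; the real Docling JSON schema is much richer than the two fields assumed here:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

with TemporaryDirectory() as tmp:
    # Stand-in for the docling/ sub-directory under stream_out_folder
    docling_dir = Path(tmp) / "docling"
    docling_dir.mkdir()

    # One extracted document (field names are illustrative)
    (docling_dir / "feedback_123.json").write_text(
        json.dumps({"name": "feedback_123", "text": "Extracted feedback text."})
    )

    # Collect the extracted text of every document in the folder
    texts = {
        p.stem: json.loads(p.read_text())["text"]
        for p in docling_dir.glob("*.json")
    }
```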

We can save the resulting object, but be aware that it can be quite large:

from eu_consultations.scrape import save_to_json

save_to_json(data_with_extracted_text, 
    <my-folder>, 
    filename="initiatives_with_extracted.json")

Load serialized consultation data

If you have exported output from any of the above steps using eu_consultations.scrape.save_to_json, you can re-import it as a list of eu_consultations.consultation_data.Initiative objects with eu_consultations.scrape.read_initiatives_from_json.

Development

The package is developed using uv. Run tests (using pytest) with:

uv run pytest --capture no

The --capture no option shows loguru log output; it is optional.

Some tests need a working internet connection and scrape small amounts of data from https://ec.europa.eu/info/law/better-regulation.
