A Python package for scraping textual data from EU public consultations

eu-consultations: A Python package for scraping textual data from EU public consultations

eu-consultations scrapes textual data from EU public consultations published at https://ec.europa.eu/info/law/better-regulation. Its aim is to facilitate academic analysis of how the public participates in EU public consultations, in which the broader public of the EU is asked to supply input on proposed regulations.

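The package has three main functions:

  • scraping feedback data on EU public consultations
  • downloading files attached to consultation feedback
  • extracting text from the downloaded files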

Downloaded data is validated and stored as JSON.

eu-consultations is partially based on https://github.com/desrist/AskThePublic.

Installation

⚠️ eu-consultations requires Python 3.12 or higher.

eu-consultations is available through PyPI:

pip install eu-consultations
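
If pip refuses to install the package, the interpreter may be older than the required version. A minimal check using only the standard library might look like this (the helper name python_is_supported is hypothetical and not part of eu-consultations):

```python
import sys

# Minimum version stated in the installation note above.
MIN_VERSION = (3, 12)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the given (major, minor, ...) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_VERSION


if __name__ == "__main__":
    print("Python version OK" if python_is_supported() else "Python 3.12+ required")
```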

How to use eu-consultations

The following describes the typical pipeline for using eu-consultations:

1) Get consultation data

Here we will scrape consultation data for the topic "DIGITAL". To get an overview of all available topics, use:

from eu_consultations.scrape import show_available_topics

show_available_topics()

Now let's scrape all metadata on feedback to consultations on the topic "DIGITAL" (Digital economy and society).

from eu_consultations.scrape import scrape

initiatives_data = scrape(
    topic_list=["DIGITAL"],
    max_pages=None,  # restrict number of frontend pages to crawl
    max_feedback=None,  # set a maximum number of feedback to gather
    output_folder="<my-folder>",  # placeholder: replace with your output folder
    filename="<my-filename>.json",  # placeholder: replace with your file name
)

This:

  • serializes all data to the JSON file named by filename, in the folder given by output_folder
  • returns a list of eu_consultations.consultation_data.Initiative dataclass objects, which store feedback data per consultation

2) Download files attached to consultation feedback

Using the data from our initial scrape, we can now download all files attached to feedback, if we want to:

from eu_consultations.extract_filetext import download_consultation_files

data_with_downloads = download_consultation_files(
    initiatives_data=initiatives_data,  # created in step 1)
    output_folder="<my-folder>",  # placeholder: replace with your output folder
)

This:

  • downloads all attached files to the files/ sub-directory of output_folder
  • returns a list of eu_consultations.consultation_data.Initiative dataclass objects with file locations attached

3) Extract texts

A lot of feedback to consultations already contains text on the opinions of those consulted, available through step 1), but much of it is contained in attached documents. Let's extract that text and attach it to our data:

from eu_consultations.extract_filetext import extract_text_from_attachments

data_with_extracted_text = extract_text_from_attachments(
    initiatives_data_with_attachments=data_with_downloads,  # created in step 2)
    stream_out_folder="<my-folder>/files",  # let's stream out to the same location as the files
)

This:

  • extracts text from all files referenced in data_with_downloads
  • stores extracted text in lossless Docling JSON format in the folder set by stream_out_folder: per document in a docling/ sub-directory and per consultation in a consultations/ sub-directory
  • returns a list of eu_consultations.consultation_data.Initiative dataclass objects with text and docling JSON attached.

We can save the object, but be aware, it might be quite large:

from eu_consultations.scrape import save_to_json

save_to_json(
    data_with_extracted_text,
    "<my-folder>",  # placeholder: replace with your output folder
    filename="initiatives_with_extracted.json",
)

Load serialized consultation data

If you have exported output from any of the above steps using eu_consultations.scrape.save_to_json, you can re-import into a list of eu_consultations.consultation_data.Initiative objects with eu_consultations.scrape.read_initiatives_from_json.
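
Conceptually, exporting and re-importing comes down to a round trip between dataclass objects and JSON. The sketch below illustrates that idea with the standard library only. It is not the package's implementation: ExampleInitiative and ExampleFeedback are hypothetical stand-ins with made-up fields, not the real Initiative schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ExampleFeedback:  # hypothetical stand-in, not the package's schema
    text: str


@dataclass
class ExampleInitiative:  # hypothetical stand-in for an Initiative-like record
    title: str
    feedback: list[ExampleFeedback] = field(default_factory=list)


def dump_initiatives(items: list[ExampleInitiative]) -> str:
    """Serialize dataclass instances to a JSON string via plain dicts."""
    return json.dumps([asdict(item) for item in items], ensure_ascii=False)


def load_initiatives(payload: str) -> list[ExampleInitiative]:
    """Rebuild dataclass instances, including the nested feedback list."""
    return [
        ExampleInitiative(
            title=raw["title"],
            feedback=[ExampleFeedback(**fb) for fb in raw["feedback"]],
        )
        for raw in json.loads(payload)
    ]
```

The nested list has to be rebuilt explicitly on load because JSON carries no type information, which is one reason to validate data when re-importing it.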

Development

The package is developed using uv. Run tests (using pytest) with:

uv run pytest --capture no

The --capture no option shows loguru log output; it is optional.

Some tests need a working internet connection and scrape small amounts of data from https://ec.europa.eu/info/law/better-regulation.

Download files

Source Distribution

eu_consultations-0.2.2.tar.gz (11.2 kB, Source)

Built Distribution

eu_consultations-0.2.2-py3-none-any.whl (10.2 kB, Python 3)

File details

Details for the file eu_consultations-0.2.2.tar.gz.

File metadata

  • Download URL: eu_consultations-0.2.2.tar.gz
  • Upload date:
  • Size: 11.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.5.2

File hashes

Hashes for eu_consultations-0.2.2.tar.gz:

  • SHA256: 4551b5634271916bd1d093f21ef0384f4fe53f62dd8b1aec7f43bc69e4715bd2
  • MD5: 80a2526f808e26d218bd5f2b22a112ed
  • BLAKE2b-256: c1ed36596a899f66d6897900f5cec3401e6b8128913ddb693f3728c3c7a8f409

File details

Details for the file eu_consultations-0.2.2-py3-none-any.whl.

File hashes

Hashes for eu_consultations-0.2.2-py3-none-any.whl:

  • SHA256: ff37d8214f4d42529e27be2f31db8d01fa91a796061d2836c7ca1520b2776eda
  • MD5: 59700bb3c21ee3002cb88d6a0f99a5dc
  • BLAKE2b-256: ae8b3f84eb6a7a2e759276adaabff2c58278908cc47d5df0269d73362b3c330e
