eu-consultations: A Python package for scraping textual data from EU public consultations

eu-consultations scrapes textual data from EU public consultations published at https://ec.europa.eu/info/law/better-regulation. Its aim is to facilitate academic analysis of how the public participates in EU public consultations, in which the broader public of the EU is asked to supply input on proposed regulations.

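The package has three main functions:

  • scraping consultation and feedback data for selected topics
  • downloading the files attached to consultation feedback
  • extracting text from the downloaded files

Downloaded data is validated and stored as JSON.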

eu-consultations is partially based on https://github.com/desrist/AskThePublic.

Installation

⚠️ eu-consultations requires Python 3.12 or higher.

eu-consultations is available through PyPI:

pip install eu-consultations
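
Because the package requires Python 3.12 or higher, it can help to check the interpreter before installing. The following is a minimal sketch using only the standard library; the helper name `meets_minimum_python` is hypothetical and not part of eu-consultations.

```python
import sys

# Minimum version stated in the installation note above.
MINIMUM_PYTHON = (3, 12)


def meets_minimum_python(version_info=sys.version_info, minimum=MINIMUM_PYTHON):
    """Return True if the interpreter version is at least `minimum`.

    Hypothetical helper for illustration; not part of eu-consultations.
    """
    return tuple(version_info[:2]) >= tuple(minimum)


if not meets_minimum_python():
    print(
        f"Python {sys.version_info.major}.{sys.version_info.minor} found; "
        "eu-consultations requires 3.12 or higher."
    )
```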

How to use eu-consultations

The following describes the typical pipeline for using eu-consultations:

1) Get consultation data

Here we will scrape data on consultations for the topic "DIGITAL". To get an overview of all available topics, use:

from eu_consultations.scrape import show_available_topics

show_available_topics()

Now let's scrape all metadata on feedback to consultations on the topic "DIGITAL" (Digital economy and society).

from eu_consultations.scrape import scrape

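consultation_data = scrape(
    topic_list=["DIGITAL"],
    max_pages=None,  # optionally restrict the number of frontend pages to crawl
    max_feedback=None,  # optionally set a maximum number of feedback items to gather
    output_folder="<my-folder>",  # placeholder: replace with your output folder
    filename="<my-filename>.json",  # placeholder: replace with your file name
)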

This:

  • saves all crawled frontend pages in a subdirectory /pages
  • serializes all data to the .json file named by filename, in output_folder
  • returns a list of eu_consultations.consultation_data.Consultations dataclass objects, which store feedback data per consultation
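
Before a full crawl, a small trial run can confirm that everything works. The sketch below builds arguments for such a run from the parameters shown above. The helper names are hypothetical, the limits 1 and 10 are arbitrary, and it is an assumption that `max_pages` and `max_feedback` accept integers and that `output_folder` accepts a string path.

```python
from pathlib import Path


def trial_scrape_kwargs(output_folder, topic="DIGITAL"):
    """Build keyword arguments for a small test scrape.

    Hypothetical helper; the limits 1 and 10 are arbitrary illustrative values.
    """
    folder = Path(output_folder)
    return {
        "topic_list": [topic],
        "max_pages": 1,  # assumption: an integer caps the pages crawled
        "max_feedback": 10,  # assumption: an integer caps the feedback gathered
        "output_folder": str(folder),  # assumption: a string path is accepted
        "filename": "trial.json",
    }


def run_trial_scrape(output_folder):
    """Run a small scrape; needs eu-consultations installed and network access."""
    from eu_consultations.scrape import scrape  # imported lazily on purpose

    return scrape(**trial_scrape_kwargs(output_folder))
```

Keeping the argument construction separate from the call makes it possible to inspect the settings without touching the network.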

2) Download files attached to consultation feedback

Using the data from the initial scrape, we can now download all files attached to feedback, if we want to:

from eu_consultations.extract_filetext import download_consultation_files

data_with_downloads = download_consultation_files(
    consultation_data=consultation_data,  # created in step 1)
    output_folder="<my-folder>",  # placeholder: replace with your output folder
)

This:

  • downloads all attached files to <my-folder>/files
  • returns a list of eu_consultations.consultation_data.Consultations dataclass objects with file locations attached
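
To see what kinds of attachments were downloaded, the files folder can be summarized by file extension. This is a minimal standard-library sketch; `count_files_by_extension` is a hypothetical helper, not part of eu-consultations, and it makes no assumption about which file types the package downloads.

```python
from collections import Counter
from pathlib import Path


def count_files_by_extension(folder):
    """Count files under `folder` (recursively) by lowercase file extension."""
    counts = Counter()
    for path in Path(folder).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts
```

It could be called as `count_files_by_extension("<my-folder>/files")`, with the placeholder replaced by the real output folder.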

3) Extract texts

Much consultation feedback already contains text on the opinions of those consulted, available through step 1), but a lot of it is also contained in attached documents. Let's extract that text and attach it to our data:

from eu_consultations.extract_filetext import extract_text_from_attachments

data_with_extracted_text = extract_text_from_attachments(
    consultation_data_with_attachments=data_with_downloads,  # created in step 2)
    stream_out_folder="<my-folder>/files",  # stream out to the same location as the files
)

This:

  • extracts text from all files referenced in data_with_downloads
  • stores extracted text in lossless Docling JSON format in the folder set by stream_out_folder: per document in a sub-directory docling/ and per consultation in a sub-directory consultations/
  • returns a list of eu_consultations.consultation_data.Consultations dataclass objects with text and docling JSON attached.
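
The streamed-out results can be checked by counting the JSON files in the two sub-directories named above. This is a sketch under assumptions: `summarize_stream_out` is a hypothetical helper, and it assumes the files carry a `.json` extension, which the description above does not state explicitly.

```python
from pathlib import Path


def summarize_stream_out(stream_out_folder, subdirs=("docling", "consultations")):
    """Return {subdir: (number of .json files, total size in bytes)}.

    Assumes the streamed-out files end in ".json"; missing sub-directories
    are reported as (0, 0).
    """
    summary = {}
    for name in subdirs:
        directory = Path(stream_out_folder) / name
        files = []
        if directory.is_dir():
            files = [p for p in directory.rglob("*.json") if p.is_file()]
        summary[name] = (len(files), sum(p.stat().st_size for p in files))
    return summary
```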

We can save the object, but be aware that it might be quite large:

from eu_consultations.scrape import save_to_json

save_to_json(
    data_with_extracted_text,
    "<my-folder>",  # placeholder: replace with your output folder
    filename="consultations_with_extracted.json",
)

Load serialized consultation data

If you have exported output from any of the above steps using eu_consultations.scrape.save_to_json, you can re-import into a list of eu_consultations.consultation_data.Consultations objects with eu_consultations.scrape.read_consultations_from_json.
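
For a quick look at an exported file without rebuilding the dataclass objects, the standard `json` module is enough. The sketch below assumes only that the file is valid UTF-8 JSON; it makes no assumption about the schema, and `describe_json_file` is a hypothetical helper, not part of eu-consultations.

```python
import json
from pathlib import Path


def describe_json_file(path):
    """Return (top-level type name, number of top-level items or None)."""
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    # Only lists and dicts have a meaningful top-level item count.
    size = len(data) if isinstance(data, (list, dict)) else None
    return type(data).__name__, size
```

Since exported files can be large, note that `json.load` reads the whole file into memory.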

Development

The package is developed using uv. Run tests (using pytest) with:

uv run pytest --capture no

The --capture no setting shows loguru log output; it is optional.

Some tests need a working internet connection and scrape small amounts of data from https://ec.europa.eu/info/law/better-regulation.
