
eu-consultations: A Python package for scraping textual data from EU public consultations

eu-consultations lets you scrape textual data from EU public consultations at https://ec.europa.eu/info/law/better-regulation. Its aim is to facilitate academic analysis of how the public participates in EU public consultations. In EU public consultations, the broader public of the EU is asked to supply input on proposed regulations.

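The package has three main functions:

  • scraping consultation and feedback data (eu_consultations.scrape.scrape)
  • downloading files attached to feedback (eu_consultations.extract_filetext.download_consultation_files)
  • extracting text from the downloaded attachments (eu_consultations.extract_filetext.extract_text_from_attachments)

Downloaded data is validated and stored as JSON.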

eu-consultations is partially based on https://github.com/desrist/AskThePublic.

Installation

⚠️ eu-consultations requires Python 3.12 or higher.
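
A quick way to confirm that the interpreter meets this requirement before installing is sketched below. The helper name `meets_minimum_python` is hypothetical and not part of eu-consultations; it only compares version tuples.

```python
import sys


def meets_minimum_python(version_info=sys.version_info, minimum=(3, 12)):
    """Return True if the (major, minor) version is at least `minimum`."""
    # Only major and minor matter for the "3.12 or higher" requirement.
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    print("Python version OK:", meets_minimum_python())
```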

eu-consultations is available through PyPI:

pip install eu-consultations

How to use eu-consultations

The following describes the typical pipeline for using eu-consultations:

1) Get consultation data

Here we will scrape data on consultations for the topic "DIGITAL". To get an overview of all possible topics, use:

from eu_consultations.scrape import show_available_topics

show_available_topics()

Now let's scrape all metadata on feedback to consultations on the topic "DIGITAL" (Digital economy and society).

from eu_consultations.scrape import scrape

initiatives_data = scrape(
    topic_list=["DIGITAL"],
    max_pages=None,  # set a number to restrict the frontend pages to crawl
    max_feedback=None,  # set a maximum number of feedback entries to gather
    output_folder="<my-folder>",  # placeholder: replace with your output folder
    filename="<my-filename>.json",  # placeholder: replace with your file name
)

This:

  • serializes all data to a .json file in the folder given by output_folder
  • returns a list of eu_consultations.consultation_data.Initiative dataclass objects, which store feedback data per consultation
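
Since the scrape result is stored as plain JSON, the exported file can be inspected with the standard library alone. The sketch below makes no assumption about the field names inside the file; it only reports the top-level type, the number of entries and the file size. The helper name `summarize_json` is hypothetical, and the path in the usage comment is a placeholder.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Report top-level type, entry count and size in bytes of a JSON file."""
    path = Path(path)
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    # Lists and dicts have a length; scalar JSON values do not.
    n_entries = len(data) if isinstance(data, (list, dict)) else None
    return {
        "top_level_type": type(data).__name__,
        "n_entries": n_entries,
        "size_bytes": path.stat().st_size,
    }


# Usage (replace the placeholder path with your own):
# print(summarize_json("<my-folder>/<my-filename>.json"))
```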

2) Download files attached to consultation feedback

Using the data from our initial scrape, we can now download all files attached to feedback, if we want to:

from eu_consultations.extract_filetext import download_consultation_files

data_with_downloads = download_consultation_files(
    initiatives_data=initiatives_data,  # created in step 1)
    output_folder="<my-folder>",  # placeholder: replace with your output folder
)

This:

  • downloads all attached files to <my-folder>/files
  • returns a list of eu_consultations.consultation_data.Initiative dataclass objects with file locations attached
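
To get a feel for what was downloaded, the files folder can be summarized by file type with the standard library. This is a minimal sketch: the helper name `count_files_by_extension` is hypothetical, and because the exact layout below files/ is not documented here, it walks the folder recursively.

```python
from collections import Counter
from pathlib import Path


def count_files_by_extension(folder):
    """Count regular files below `folder` (recursively) per lower-cased suffix."""
    counts = Counter()
    for entry in Path(folder).rglob("*"):
        if entry.is_file():
            # Files without an extension are grouped under "<none>".
            counts[entry.suffix.lower() or "<none>"] += 1
    return dict(counts)


# Usage (replace the placeholder path with your own):
# print(count_files_by_extension("<my-folder>/files"))
```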

3) Extract texts

A lot of the feedback gathered in step 1) already contains text stating the opinions of those consulted, but much of it is contained only in attached documents. Let's extract that text and attach it to our data:

from eu_consultations.extract_filetext import extract_text_from_attachments

data_with_extracted_text = extract_text_from_attachments(
    initiatives_data_with_attachments=data_with_downloads,  # created in step 2)
    stream_out_folder="<my-folder>/files",  # let's stream out to the same location as the files
)

This:

  • extracts text from all files referenced in data_with_downloads
  • stores extracted text in lossless Docling JSON format in the folder set by stream_out_folder: per document in a sub-directory docling/ and per consultation in a sub-directory consultations/
  • returns a list of eu_consultations.consultation_data.Initiative dataclass objects with text and docling JSON attached.
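
After a long extraction run it can be useful to check how much output was streamed to disk. The sketch below counts JSON files in the docling/ and consultations/ sub-directories named above. It assumes the streamed files carry a .json extension, which this description does not confirm, and the helper name `count_extraction_outputs` is hypothetical.

```python
from pathlib import Path


def count_extraction_outputs(stream_out_folder, subdirs=("docling", "consultations")):
    """Count *.json files per expected sub-directory (None if it is missing)."""
    base = Path(stream_out_folder)
    result = {}
    for name in subdirs:
        sub = base / name
        # Assumption: extracted output is written with a .json extension.
        result[name] = sum(1 for _ in sub.rglob("*.json")) if sub.is_dir() else None
    return result


# Usage (replace the placeholder path with your own):
# print(count_extraction_outputs("<my-folder>/files"))
```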

We can save the object, but be aware that the resulting file might be quite large:

from eu_consultations.scrape import save_to_json

save_to_json(
    data_with_extracted_text,
    "<my-folder>",  # placeholder: replace with your output folder
    filename="initiatives_with_extracted.json",
)

Load serialized consultation data

If you have exported output from any of the above steps using eu_consultations.scrape.save_to_json, you can re-import into a list of eu_consultations.consultation_data.Initiative objects with eu_consultations.scrape.read_initiatives_from_json.

Development

The package is developed using uv. Run tests (using pytest) with:

uv run pytest --capture no

The --capture no option shows loguru log output while the tests run; it is optional.

Some tests need a working internet connection and scrape small amounts of data from https://ec.europa.eu/info/law/better-regulation.

Download files

Source Distribution

eu_consultations-0.2.3.tar.gz (11.2 kB view details)

Uploaded Source

Built Distribution

eu_consultations-0.2.3-py3-none-any.whl (10.3 kB view details)

Uploaded Python 3

File details

Details for the file eu_consultations-0.2.3.tar.gz.

File metadata

  • Download URL: eu_consultations-0.2.3.tar.gz
  • Size: 11.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.5.2

File hashes

Hashes for eu_consultations-0.2.3.tar.gz
Algorithm Hash digest
SHA256 1a648e3c27211c2b5c3642d27dba88fb78cbf1efaaab3cb7f8d362b3ee4191b0
MD5 c5c231336314d312be7ea8ae9acda2ca
BLAKE2b-256 9bbab4ebe4287f148b5fee22a9f859fd6a0d0ff46f256f9c263d6234312f7ea0
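
A digest such as the SHA256 value listed above can be checked against a locally downloaded copy of the file. The following is a minimal sketch using Python's hashlib; the helper names `sha256_of_file` and `matches_expected` are hypothetical, and the file is assumed to have been downloaded to the working directory.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage, with the digest published for the source distribution:
# matches_expected(
#     "eu_consultations-0.2.3.tar.gz",
#     "1a648e3c27211c2b5c3642d27dba88fb78cbf1efaaab3cb7f8d362b3ee4191b0",
# )
```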

File details

Details for the file eu_consultations-0.2.3-py3-none-any.whl.

File hashes

Hashes for eu_consultations-0.2.3-py3-none-any.whl
Algorithm Hash digest
SHA256 af121339637744e9bc10483c99780e582f5edf43d5b85b6cc7b076c07b4fff46
MD5 68e6f0137586d954be213599cc471357
BLAKE2b-256 fd7eb765c2e0af06489dc7d6498c5843630c93f24c726893e5e8f1481b6a9cfe
