eu-consultations: A Python package for scraping textual data from EU public consultations
eu-consultations allows scraping textual data from EU public consultations from https://ec.europa.eu/info/law/better-regulation. Its aim is to facilitate academic analysis of how the public participates in EU public consultations, in which the broader public of the EU is asked to supply input on proposed regulations.
The package has three main functions:
- Scrape metadata on public feedback to EU consultations, by topic, by accessing the API that backs the frontend at https://ec.europa.eu/info/law/better-regulation.
- Download files (e.g. .pdf and .docx) attached to feedback
- Extract text from files using docling
Downloaded data is validated and stored as JSON.
eu-consultations is partially based on https://github.com/desrist/AskThePublic.
Installation
⚠️ eu-consultations requires Python 3.12 or higher.
eu-consultations is available through PyPI:
pip install eu-consultations
How to use eu-consultations
The following describes the typical pipeline for using eu-consultations:
1) Get consultation data
Here we will scrape data on consultations for the topic "DIGITAL". To get an overview of all possible topics, use:
from eu_consultations.scrape import show_available_topics
show_available_topics()
Now let's scrape all metadata on feedback to consultations on the topic "DIGITAL" (Digital economy and society).
from eu_consultations.scrape import scrape
initiatives_data = scrape(
    topic_list=["DIGITAL"],
    max_pages=None,  # optionally restrict the number of frontend pages to crawl
    max_feedback=None,  # optionally set a maximum number of feedback items to gather
    output_folder="<my-folder>",
    filename="<my-filename>.json",
)
This:
- serializes all data to a .json file in the folder given as output_folder
- returns a list of eu_consultations.consultation_data.Initiative dataclass objects, which store feedback data per consultation
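Because the returned objects are dataclasses, the standard library can list their fields without guessing at the schema. The following is a minimal sketch of that idea; DemoInitiative and its fields are hypothetical stand-ins used only so the example runs on its own, and the real Initiative class defines its own fields.

```python
from dataclasses import asdict, dataclass, field, fields, is_dataclass


# Hypothetical stand-in: the real Initiative class has its own fields.
@dataclass
class DemoInitiative:
    id: int
    title: str
    feedback: list = field(default_factory=list)


def describe(obj) -> list[str]:
    """Return the field names of a dataclass instance, in definition order."""
    if not is_dataclass(obj) or isinstance(obj, type):
        raise TypeError("expected a dataclass instance")
    return [f.name for f in fields(obj)]


demo = DemoInitiative(id=1, title="Example initiative")
field_names = describe(demo)   # names of the fields defined on the class
as_plain_dict = asdict(demo)   # recursive conversion to plain dicts and lists
print(field_names)
print(as_plain_dict)
```

Calling describe(initiatives_data[0]) on a real scrape result should work the same way, since it relies only on the object being a dataclass instance.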
2) Download files attached to consultation feedback
Using the data from our initial scrape, we can now download all files attached to feedback, if we want to:
from eu_consultations.extract_filetext import download_consultation_files
data_with_downloads = download_consultation_files(
    initiatives_data=initiatives_data,
    output_folder="<my-folder>",
)
This:
- downloads all attached files to <my-folder>/files
- returns a list of eu_consultations.consultation_data.Initiative dataclass objects with file locations attached
3) Extract texts
Much of the feedback to consultations already contains text with the respondents' opinions, which is available through step 1), but a lot of it is contained in attached documents. Let's extract that text and attach it to our data:
from eu_consultations.extract_filetext import extract_text_from_attachments
data_with_extracted_text = extract_text_from_attachments(
    initiatives_data_with_attachments=data_with_downloads,  # created in step 2)
    stream_out_folder="<my-folder>/files",  # stream out to the same location as the files
)
This:
- extracts text from all files referenced in data_with_downloads
- stores extracted text in lossless Docling JSON format per document at the folder set by stream_out_folder in a sub-directory docling/ and per consultation at a sub-directory consultations/
- returns a list of eu_consultations.consultation_data.Initiative dataclass objects with text and docling JSON attached.
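The streamed-out per-document files can be read back with the standard json module. The sketch below assumes that the files sit directly inside the docling/ sub-directory and carry a .json extension; both are assumptions, as is the helper name load_docling_documents, and the placeholder document does not reflect the actual Docling schema.

```python
import json
import tempfile
from pathlib import Path


def load_docling_documents(stream_out_folder):
    """Yield (file name, parsed JSON) for each .json file in docling/."""
    docling_dir = Path(stream_out_folder) / "docling"
    # Assumption: one .json file per document, directly inside docling/.
    for path in sorted(docling_dir.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            yield path.name, json.load(handle)


# Demo with a placeholder document; real Docling JSON is far richer.
with tempfile.TemporaryDirectory() as tmp:
    docling_dir = Path(tmp) / "docling"
    docling_dir.mkdir()
    (docling_dir / "example.json").write_text(
        json.dumps({"placeholder": True}), encoding="utf-8"
    )
    loaded = list(load_docling_documents(tmp))

print(loaded)
```

Using a generator keeps memory use low when many documents were extracted, because only one file is parsed at a time.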
We can save the object, but be aware, it might be quite large:
from eu_consultations.scrape import save_to_json
save_to_json(
    data_with_extracted_text,
    "<my-folder>",
    filename="initiatives_with_extracted.json",
)
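If the saved file turns out to be large, compressing it for storage or sharing is one option. The following sketch uses only the standard library; gzip_file and read_gzipped_json are hypothetical helpers rather than package functions, and the sample data is made up. It is an assumption that the package's own loader expects an uncompressed file, so decompress before re-importing.

```python
import gzip
import json
import shutil
import tempfile
from pathlib import Path


def gzip_file(source) -> Path:
    """Write a gzip-compressed copy next to `source` and return its path."""
    source = Path(source)
    target = source.with_name(source.name + ".gz")
    with source.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


def read_gzipped_json(path):
    """Parse a gzip-compressed JSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Demo with made-up data standing in for the saved initiatives file.
sample = [{"text": "lorem ipsum " * 100}]
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "initiatives_with_extracted.json"
    original.write_text(json.dumps(sample), encoding="utf-8")
    compressed = gzip_file(original)
    compressed_name = compressed.name
    restored = read_gzipped_json(compressed)

print(compressed_name)
```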
Load serialized consultation data
If you have exported output from any of the above steps using eu_consultations.scrape.save_to_json, you can re-import it into a list of eu_consultations.consultation_data.Initiative objects with eu_consultations.scrape.read_initiatives_from_json.
Development
The package is developed using uv. Run tests (using pytest) with:
uv run pytest --capture no
The --capture no setting shows loguru log output. It is optional.
Some tests need a working internet connection and scrape small amounts of data from https://ec.europa.eu/info/law/better-regulation.