GDPR document crawler
Project description
About The Project
With the introduction of the European Union's General Data Protection Regulation (GDPR), there has been an explosion in the number of legal documents pertaining to case reviews, analyses, legal decisions, and so on that mark the enforcement of the GDPR. These documents are spread across more than 30 Data Protection Authorities (DPAs) and Supervisory Authorities, which makes it cumbersome for researchers and legal teams to access and download a large quantity of GDPR documents at once.
To address this, we created GDPRxiv Crawler, a command-line tool that allows users to efficiently filter and download GDPR documents. Users select their desired DPA and document type, and GDPRxiv Crawler scrapes the web and downloads all up-to-date documents.
Of course, it is impossible to keep up entirely with DPA website redesigns and newly added document categories. However, we hope that this tool eliminates the bulk of the workload and lets users focus on more important tasks.
Getting Started
Prerequisites
Python 3.9 is required. This Python version includes the pip installer and the venv module, which is needed to create a virtual environment.
It is strongly recommended that users utilize a virtual environment when installing this package. See below to create and activate one.
In a directory:
- Create a virtual environment with the venv module:
python3 -m venv <virtual env name>
- Activate the virtual environment:
source <virtual env name>/bin/activate
Installation
At any moment, use the command 'pip3 list' to view all installed packages.
- Download requirements.txt and place it in the directory that contains the virtual environment.
- Install the package requirements:
pip3 install -r requirements.txt
- Install the GDPRxiv Crawler package:
pip3 install -i https://test.pypi.org/simple/ gdprCrawlerTest15
Usage
Downloaded documents will be organized into a set of folders based on DPA and document type.
A file called visitedDocs.txt is created on the first run within a new directory. This file records each downloaded document's unique hash, which allows the tool to recognize already-downloaded documents in future runs and avoid overwriting them unless overwriting is requested.
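As a rough illustration of how such a hash ledger can work (the hashing algorithm and file names below are assumptions for the sketch, not the tool's documented internals), the check boils down to:

```shell
# Sketch of a visitedDocs.txt-style dedup check.
# ASSUMPTION: SHA-256 and these file names are illustrative only;
# GDPRxiv Crawler's actual implementation may differ.
doc="example.pdf"
printf 'dummy document content' > "$doc"   # stand-in for a downloaded file

hash=$(sha256sum "$doc" | cut -d' ' -f1)   # fingerprint the document
touch visitedDocs.txt

if grep -q "$hash" visitedDocs.txt; then
    echo "already recorded, skipping"
else
    echo "$hash" >> visitedDocs.txt        # remember it for future runs
    echo "recorded new document"
fi
```

Running the same check twice records the hash once and skips on the second pass.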
- Scrape desired documents:
gdprCrawler scrape --country <country name> --document_type <document type> --path <directory to store documents>
The same directory can be used for multiple countries: the scraper automatically organizes documents by country and document type.
- Optionally, include the --overwrite argument to overwrite existing documents:
gdprCrawler scrape --country <country name> --document_type <document type> --path <directory to store documents> --overwrite <True/False>
Overwrite is False by default.
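A concrete invocation might look like this (Belgium's decisions, stored under a hypothetical ./gdpr_docs directory; pick any supported country/type pair from the table below):

```shell
# Download all Belgian DPA decisions into ./gdpr_docs
gdprCrawler scrape --country Belgium --document_type Decisions --path ./gdpr_docs

# Re-run later with overwriting enabled to refresh existing files
gdprCrawler scrape --country Belgium --document_type Decisions --path ./gdpr_docs --overwrite True
```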
Country and document type arguments should be written exactly as they appear below:
SUPPORTED COUNTRIES:   DOCUMENT TYPES:
Austria                Decisions
Belgium                Annual Reports, Decisions, Opinions
Bulgaria               Annual Reports, Opinions
Croatia                Decisions
Cyprus                 Annual Reports, Decisions
Czech Republic         Annual Reports, Completed Inspections, Court Rulings, Decisions, Opinions, Press Releases
Denmark                Annual Reports, Decisions, Permissions
EDPB (Agency)          Annual Reports, Decisions, Guidelines, Letters, Opinions, Recommendations
Estonia                Annual Reports, Instructions, Prescriptions
Finland                Advice, Decisions, Guides, Notices
France                 FUTURE UPDATE
Germany                N/A
Greece                 Annual Reports, Decisions, Guidelines, Opinions, Recommendations
Hungary                Annual Reports, Decisions, Notices, Recommendations, Resolutions
Ireland                Decisions, Judgements, News
Italy                  Annual Reports, Hearings, Injunctions, Interviews, Newsletters, Publications
Latvia                 Annual Reports, Decisions, Guidances, Opinions, Violations
Lithuania              Decisions, Guidelines, Inspection Reports
Luxembourg             Annual Reports, Opinions
Malta                  Guidelines, News Articles
Netherlands            Decisions, Opinions, Public Disclosures, Reports
Poland                 Decisions, Tutorials
Portugal               Decisions, Guidelines, Reports
Romania                Decisions, Reports
Slovakia               Fines, Opinions, Reports
Slovenia               Blogs, Guidelines, Infographics, Opinions, Reports
Spain                  Blogs, Decisions, Guides, Infographics, Reports
Sweden                 Decisions, Guidances, Judgements, Publications
United Kingdom         Decisions, Judgements, Notices
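Since a single directory can hold documents for multiple countries, batch runs can be scripted. A minimal sketch (the country/type pairs are examples taken from the table above):

```shell
# Fetch decisions for several DPAs into one shared directory;
# the scraper organizes the output by country and document type.
for country in Austria Croatia Ireland; do
    gdprCrawler scrape --country "$country" --document_type Decisions --path ./gdpr_docs
done
```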
Contributing
All suggestions and contributions you make are greatly appreciated.
License
Distributed under the MIT License. See LICENSE.txt for more information.
Contact
Project Link: https://github.com/GDPRxiv/crawler
Acknowledgments
Thank you to everyone who has supported the project in any way. We greatly appreciate your time and effort!
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file gdprCrawlerTest20-0.0.1.tar.gz.
File metadata
- Download URL: gdprCrawlerTest20-0.0.1.tar.gz
- Upload date:
- Size: 123.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c913952d625f07ddf019c58ccd7b775b6da22d2ba86564958f356dcb1bc189af |
| MD5 | a43d27153a9383793ef7315c2a1ffaed |
| BLAKE2b-256 | 91c955578a4bbb38288b2bc70944c13a0de872f665603f2c635b4b60d9810c74 |
File details
Details for the file gdprCrawlerTest20-0.0.1-py3-none-any.whl.
File metadata
- Download URL: gdprCrawlerTest20-0.0.1-py3-none-any.whl
- Upload date:
- Size: 179.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 932f9300fdb6fdfd01a19b74c22138ce29545de82529248100039fb98d3b8090 |
| MD5 | 937f634696751c28e66b41ecde63720c |
| BLAKE2b-256 | 97608fc77e8357c55c30d15e7db579ec76fb84e710448cfa0be986af9e80c90d |