GDPRxiv Crawler (README)

An efficient tool to crawl GDPR legal documents!

About The Project

With the introduction of the European Union's General Data Protection Regulation (GDPR), there has been an explosion in the number of legal documents (case reviews, analyses, legal decisions, and more) that mark the enforcement of the GDPR. These documents are spread across more than 30 Data Protection Authorities (DPAs) and Supervisory Authorities. As a result, it is cumbersome for researchers and legal teams to access and download a large quantity of GDPR documents at once.

To address this, we have created GDPRxiv Crawler, a command-line tool that allows users to efficiently filter and download GDPR documents. Users may select their desired DPA and document type, and GDPRxiv Crawler will scrape the web and download all up-to-date documents.

Of course, it is impossible to entirely keep up with DPA website redesigns and newly added document categories. However, we hope that this tool will eliminate the bulk of the workload and allow users to focus on more important tasks.

Getting Started

Prerequisites

Python 3.9 is required. This Python version includes the pip installer and the venv module, which is needed to create a virtual environment.
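
To confirm which Python and pip are on your PATH before installing (these are standard commands, not specific to this project):

    python3 --version
    pip3 --version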

It is strongly recommended that users utilize a virtual environment when installing this package. See below to create and activate one.

In a directory:

  1. Create a virtual environment with the built-in venv module:

    python3 -m venv <virtual env name>

  2. Activate the virtual environment:

    source <virtual env name>/bin/activate
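
  For example, creating and activating an environment named gdprxiv-env (the name is arbitrary):

    python3 -m venv gdprxiv-env
    source gdprxiv-env/bin/activate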
    

Installation

At any point, use the command 'pip3 list' to view all installed packages.

  1. Download requirements.txt and place it in the directory that contains the virtual environment.
  2. Install the package requirements:
    pip3 install -r requirements.txt
    
  3. Install the GDPRxiv Crawler package:
    pip3 install -i https://test.pypi.org/simple/ gdprCrawlerTest15
    

Usage

Downloaded documents will be organized into a set of folders based on DPA and document type.
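
For illustration only, the layout will look roughly like this (folder names follow the --country and --document_type arguments; the exact structure below is an example, not a guarantee):

    <directory to store documents>/
        Austria/
            Decisions/
                <downloaded documents>
        Belgium/
            Annual Reports/
            Decisions/
            Opinions/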

A file called visitedDocs.txt is created on the first run in a new directory. It records each downloaded document's unique hash, which allows the tool to avoid overwriting existing documents (if desired) in future runs.
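
As a rough sketch of the idea (this is not the tool's actual code; the hashing algorithm and function names are assumptions made for illustration):

    import hashlib
    import os

    LEDGER = "visitedDocs.txt"

    def is_new_document(content: bytes) -> bool:
        """Return True if this document has not been seen before, recording its hash."""
        # SHA-256 over the raw bytes is an assumption for this sketch,
        # not necessarily the hash the tool itself uses.
        digest = hashlib.sha256(content).hexdigest()
        seen = set()
        if os.path.exists(LEDGER):
            with open(LEDGER) as ledger:
                seen = {line.strip() for line in ledger}
        if digest in seen:
            return False  # already downloaded: skip unless --overwrite is used
        with open(LEDGER, "a") as ledger:
            ledger.write(digest + "\n")
        return True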

  • Scrape desired documents:

    gdprCrawler scrape --country <country name> --document_type <document type> --path <directory to store documents>
    

    The same directory can be used for multiple countries: the scraper automatically organizes documents based on country and document type.

  • Optionally, the --overwrite argument can be included if users would like to overwrite existing documents:

       gdprCrawler scrape --country <country name> --document_type <document type> --path <directory to store documents> --overwrite <True/False>
    

    Overwrite is False by default.

 

Country and document type arguments should be written exactly as they appear below:

SUPPORTED COUNTRIES:     DOCUMENT TYPES:

        Austria                  Decisions
        Belgium                  Annual Reports, Decisions, Opinions
        Bulgaria                 Annual Reports, Opinions
        Croatia                  Decisions
        Cyprus                   Annual Reports, Decisions
        Czech Republic           Annual Reports, Completed Inspections, Court Rulings, Decisions, Opinions, Press Releases
        Denmark                  Annual Reports, Decisions, Permissions
        EDPB (Agency)            Annual Reports, Decisions, Guidelines, Letters, Opinions, Recommendations
        Estonia                  Annual Reports, Instructions, Prescriptions
        Finland                  Advice, Decisions, Guides, Notices
        France                   FUTURE UPDATE
        Germany                  N/A
        Greece                   Annual Reports, Decisions, Guidelines, Opinions, Recommendations
        Hungary                  Annual Reports, Decisions, Notices, Recommendations, Resolutions
        Ireland                  Decisions, Judgements, News
        Italy                    Annual Reports, Hearings, Injunctions, Interviews, Newsletters, Publications
        Latvia                   Annual Reports, Decisions, Guidances, Opinions, Violations
        Lithuania                Decisions, Guidelines, Inspection Reports
        Luxembourg               Annual Reports, Opinions
        Malta                    Guidelines, News Articles
        Netherlands              Decisions, Opinions, Public Disclosures, Reports
        Poland                   Decisions, Tutorials
        Portugal                 Decisions, Guidelines, Reports
        Romania                  Decisions, Reports
        Slovakia                 Fines, Opinions, Reports
        Slovenia                 Blogs, Guidelines, Infographics, Opinions, Reports
        Spain                    Blogs, Decisions, Guides, Infographics, Reports
        Sweden                   Decisions, Guidances, Judgements, Publications
        United Kingdom           Decisions, Judgements, Notices
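
For example, to download all of Austria's decisions into a local folder and overwrite any existing copies (the path is just an example):

    gdprCrawler scrape --country Austria --document_type Decisions --path ./gdpr_documents --overwrite True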

Contributing

All suggestions and contributions are greatly appreciated.

License

Distributed under the MIT License. See LICENSE.txt for more information.

Contact

Project Link: https://github.com/GDPRxiv/crawler

Acknowledgments

Thank you to everyone who has supported the project in any way. We greatly appreciate your time and effort!

