Tools to extract and compile enforcement decisions from the Singapore Personal Data Protection Commission

pdpc-decisions

This package contains utilities which allow you to create a corpus of decisions from the Personal Data Protection Commission of Singapore's Data Protection Enforcement Cases.

The corpus is primarily intended for study, for example with data science tools such as natural language processing.

It currently has the following features:

Visit the Personal Data Protection Commission of Singapore's Data Protection Enforcement Cases and compile a table of decisions with information from the summaries provided by the PDPC for each case.
Save this table of decisions as a CSV file.
Download all the PDF files of the decisions from the PDPC's website. If a decision is not a PDF, collect the information provided on the decision web page and save it as a text file.
Convert the PDF files into text files.
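
Once the text files exist, the corpus can be explored with the standard library alone. The following is a minimal sketch, assuming the converted decisions are .txt files in the default corpus/ folder. The function names load_corpus and top_words are hypothetical and are not part of pdpc-decisions.

```python
from collections import Counter
from pathlib import Path
import re


def load_corpus(corpus_dir="corpus"):
    """Return a dict mapping each .txt file name to its text content."""
    texts = {}
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        # errors="replace" avoids a crash on unexpected byte sequences.
        texts[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return texts


def top_words(texts, n=10):
    """Return the n most common lowercase alphabetic tokens across all texts."""
    counts = Counter()
    for text in texts.values():
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts.most_common(n)
```

Calling top_words(load_corpus()) gives a rough first look at the vocabulary; a real study would likely swap the regular expression for a proper tokenizer such as the one in spaCy.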

Features provided by scraper

Published date
Respondent
Title
Summary
URL of PDF of decision

The following additional features are collected when --extras is passed to the command:

[Extras] Citation
[Extras] Basic enforcement information (Financial penalty, warning, directions)
[Extras] References (referred by, referring to)
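
The saved table can be read back with Python's csv module. This is a sketch only: the exact column headers written by the scraper are not listed here, so the hypothetical helper read_decisions reads them from the file rather than assuming them.

```python
import csv


def read_decisions(csv_path="scrape_results.csv"):
    """Return (column_names, rows) from the CSV file written by the scraper."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)  # each row is a dict keyed by column name
        return reader.fieldnames or [], rows
```

Printing the returned column names first is a quick way to see which of the features above are present in a given file.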

What pdpc-decisions uses

Python 3
PDF Miner
Selenium
Chrome
spaCy

Installation

Docker Image

I dockerised the application for my own ease of use. It is probably the easiest and most straightforward way to use the application, and I recommend it. The dockerised image contains all prerequisites, so no manual installs are needed.

You need to have Docker installed. Pull the image from Docker Hub:

docker pull houfu/pdpc-decisions

After that you can run the image and pass commands and arguments to it. For example, to have the application perform all actions:

docker run houfu/pdpc-decisions all

Running it this way is not very useful, because the downloads are stored inside the Docker container and are not easily accessed. Instead, bind a directory from your filesystem and use the --root option to direct the application to save the files there. For example:

# The target directory must exist before running this command.
docker run \
  --mount type=bind,source="$(pwd)"/target,target=/code/download \
  houfu/pdpc-decisions \
  all \
  --root /code/download/
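
The same invocation can be assembled from Python, which makes it easy to create the target directory beforehand. This is a sketch under assumptions: build_docker_command is a hypothetical helper, and it only reuses the image name, mount target and --root value shown in the command above.

```python
from pathlib import Path


def build_docker_command(host_dir, action="all", image="houfu/pdpc-decisions"):
    """Return the docker run argument list with host_dir bound to /code/download."""
    host_path = Path(host_dir).resolve()
    # A bind mount needs an existing source directory, so create it first.
    host_path.mkdir(parents=True, exist_ok=True)
    return [
        "docker", "run",
        "--mount", f"type=bind,source={host_path},target=/code/download",
        image,
        action,
        "--root", "/code/download/",
    ]
```

The resulting list can be passed to subprocess.run(..., check=True) on a machine where Docker is installed.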

Local install

Install via pip:

pip install pdpc-decisions

Once the package is installed, use the command-line tool pdpc-decisions to run the script.
If necessary, install Chrome and ChromeDriver so that Selenium can work.
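
A quick way to see whether Chrome and ChromeDriver are reachable is to look them up on PATH. The sketch below uses shutil.which from the standard library; the executable names in CANDIDATES are common examples only and vary by platform, so treat them as assumptions.

```python
import shutil


def find_executables(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Typical names; adjust for your operating system and browser build.
CANDIDATES = ["chromedriver", "google-chrome", "chromium", "chrome"]

for name, path in find_executables(CANDIDATES).items():
    print(f"{name}: {path or 'not found on PATH'}")
```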

The main entry point for the script is pdpcdecision.py

Usage

The script accepts the following actions and options.

Actions:

"all" Does all the actions (scraping the website, saving a csv, downloading all files and creating a corpus).

"corpus" Converts PDF format of decisions into plain text files.

"csv" Save the items gathered by the scraper as a csv file.

"files" Downloads all the decisions from the PDPC website into a folder.

Options:

--csv FILE Filename for saving the items gathered by scraper as a csv file. [default: scrape_results.csv]

--download DIRECTORY Destination folder for downloads of all PDF/web pages of PDPC decisions [default: download/]

--corpus DIRECTORY Destination folder for PDPC decisions converted to text files [default: corpus/]

-r, --root DIRECTORY Root directory for downloads and files [default: Your current working directory]

--extras/--no-extras Add extra features to the data collected. This increases processing time. This feature is ignored if action is files or downloads. (Experimental and requires reading of actual decisions) [default: False, '--no-extras']

--extra-corpus/--no-extra-corpus Enable experimental features for corpus. This increases processing time.

--verbose Verbose output

--help Show this message and exit.
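
For scripted runs, the command line can be built programmatically. The following sketch uses only the actions and options listed above and places the options after the action, as in the Docker example; build_cli_args is a hypothetical helper, not part of the package.

```python
VALID_ACTIONS = {"all", "corpus", "csv", "files"}


def build_cli_args(action, root=None, csv_file=None, extras=False, verbose=False):
    """Return an argument list for the pdpc-decisions command-line tool."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    args = ["pdpc-decisions", action]
    if root is not None:
        args += ["--root", str(root)]
    if csv_file is not None:
        args += ["--csv", str(csv_file)]
    if extras:
        args.append("--extras")
    if verbose:
        args.append("--verbose")
    return args


# Example (requires the package to be installed):
# import subprocess
# subprocess.run(build_cli_args("csv", extras=True), check=True)
```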

Contact

Feel free to send me your suggestions, comments or issues using the issue tracker or by emailing me.

I would also like to hear how you have used this corpus; please reach me through the same channels.


Release history

Version	Release date
1.3.2 (this version)	Jul 1, 2020
1.3.1	Jun 27, 2020
1.3.0	Jun 22, 2020
1.2.2	May 19, 2020
1.2.1	Apr 27, 2020
1.2.0	Apr 8, 2020
1.1.2	Mar 27, 2020
1.1.1	Mar 20, 2020
1.1.0	Mar 19, 2020
1.0.2	Feb 21, 2020
1.0.1	Feb 14, 2020

Download files

Download the file for your platform.

Source Distribution

pdpc-decisions-1.3.2.tar.gz (21.1 kB)

Uploaded Jul 1, 2020 Source

File details

Details for the file pdpc-decisions-1.3.2.tar.gz.

File metadata

Download URL: pdpc-decisions-1.3.2.tar.gz
Upload date: Jul 1, 2020
Size: 21.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/47.3.1 requests-toolbelt/0.8.0 tqdm/4.47.0 CPython/3.8.1

File hashes

Hashes for pdpc-decisions-1.3.2.tar.gz
Algorithm	Hash digest
SHA256	`6023348d70f4edd81acfb85820d049fef60c3f64d9470bc99db6f1c9dc57fb5e`
MD5	`3edeb56fd2c4eec66ccbd59f4a6145a0`
BLAKE2b-256	`d7aeb7521145f9be4f6352ecf313ac32a7c9d6fc962bee44f23cb1593fb88e62`
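
A downloaded archive can be checked against the SHA256 digest above with hashlib. This is a minimal sketch; the helper names sha256_of_file and matches_published_hash are hypothetical, and the local file path is whatever you saved the archive as.

```python
import hashlib

# SHA256 digest published for pdpc-decisions-1.3.2.tar.gz (copied from the table above).
EXPECTED = "6023348d70f4edd81acfb85820d049fef60c3f64d9470bc99db6f1c9dc57fb5e"


def sha256_of_file(path, chunk_size=65536):
    """Return the hexadecimal SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED):
    """Return True if the file's SHA256 digest equals the expected digest."""
    return sha256_of_file(path) == expected.lower()
```

For example, matches_published_hash("pdpc-decisions-1.3.2.tar.gz") should return True for an intact download of that file.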
