Tools to extract and compile enforcement decisions from the Singapore Personal Data Protection Commission

pdpc-decisions

This package contains utilities which allow you to create a corpus of decisions from the Personal Data Protection Commission of Singapore's Data Protection Enforcement Cases.

The primary use of such a corpus is for studying, possibly using data science tools such as natural language processing.
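As a rough illustration of that kind of study, the sketch below counts word frequencies across a folder of converted decisions using only the Python standard library. It is not part of pdpc-decisions: the function name `token_counts` is hypothetical, and the sketch assumes the text files carry a `.txt` extension and are UTF-8 encoded.

```python
import re
from collections import Counter
from pathlib import Path


def token_counts(corpus_dir):
    """Count lowercase alphabetic tokens across all .txt files in corpus_dir."""
    counts = Counter()
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        # Assumption: files are UTF-8; undecodable bytes are replaced, not fatal.
        text = path.read_text(encoding="utf-8", errors="replace")
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


if __name__ == "__main__":
    # "corpus" is the documented default of the --corpus option.
    for word, n in token_counts("corpus").most_common(10):
        print(word, n)
```

A real analysis would more likely hand the same files to a library such as spaCy; the plain counter is only meant to show how little is needed to start reading the corpus.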

The package currently has the following features:

  • Visit the Personal Data Protection Commission of Singapore's Data Protection Enforcement Cases and compile a table of decisions with information from the summaries provided by the PDPC for each case.
  • Save this table of decisions as CSV
  • Download all the PDF files of the decisions from the PDPC's website. If a decision is not available as a PDF, collect the information provided on its web page and save it as a text file.
  • Convert the PDF files into text files
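Once the table of decisions has been saved, it can be inspected with the standard library alone. The following is a minimal sketch, not code from the package: `load_decisions` is a hypothetical helper, the file is assumed to be UTF-8 with a header row, and no particular column names are assumed because the README does not list them.

```python
import csv


def load_decisions(csv_path="scrape_results.csv"):
    """Return (column_names, rows) from the CSV written by the scraper.

    "scrape_results.csv" is the documented default of the --csv option.
    Each row is a dict keyed by whatever header the file actually contains.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        return list(reader.fieldnames or []), rows


if __name__ == "__main__":
    columns, rows = load_decisions()
    print("columns:", columns)
    print("decisions:", len(rows))
```

Printing the column names first is a deliberate choice: it lets you discover the table's structure before writing any analysis that depends on it.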

What pdpc-decisions uses

  • Python 3
  • PDF Miner
  • Selenium
  • Chrome
  • spaCy

Installation

Docker Image

I dockerised the application for my personal ease of use. It is probably the easiest and most straightforward way to use the application, and it is the method I recommend.

You need to have Docker installed. Pull the image from Docker Hub:

docker pull houfu/pdpc-decisions

After that, you can run the image and pass commands and arguments to it. For example, to have the application perform all actions:

docker run houfu/pdpc-decisions all

On its own this is not very useful, because the downloads are stored inside the container and are not easily accessed. Bind-mount a directory from your filesystem and use the --root option to direct the application to save the files there. For example:

# The target directory must exist before you run this!
docker run \
  --mount type=bind,source="$(pwd)"/target,target=/code/download \
  houfu/pdpc-decisions \
  all \
  --root /code/download/
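If you prefer to drive Docker from a script, the same invocation can be assembled in Python. This is only a sketch under stated assumptions: `docker_command` is a hypothetical helper, not part of pdpc-decisions, and it creates the host directory first because a bind mount requires the source to exist.

```python
import subprocess
from pathlib import Path


def docker_command(host_dir, action="all", container_dir="/code/download"):
    """Build the `docker run` argument list for a bind-mounted output folder."""
    host = Path(host_dir).resolve()
    # A bind mount fails if the source directory is missing, so create it.
    host.mkdir(parents=True, exist_ok=True)
    return [
        "docker", "run",
        "--mount", f"type=bind,source={host},target={container_dir}",
        "houfu/pdpc-decisions",
        action,
        "--root", f"{container_dir}/",
    ]


if __name__ == "__main__":
    # Requires Docker to be installed and the image to be available.
    subprocess.run(docker_command("target"), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.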

Local install

  • Clone this repository.
git clone https://github.com/houfu/pdpc-decisions.git
  • Install using setup.py (this also installs all dependencies except Chrome and ChromeDriver).
$ cd pdpc-decisions
$ pip install .

The main entry point for the script is pdpcdecision.py.

Usage

The script accepts the following actions and options.

Actions:

"all" Does all the actions (scraping the website, saving a CSV, downloading all files and creating a corpus).

"corpus" After downloading all the decisions from the website, converts them into text files.

"csv" Saves the items gathered by the scraper as a CSV file.

"files" Downloads all the decisions from the PDPC website into a folder.

"zeeker" Constructs or updates the zeeker database (internal use only).

Options:

--csv FILE Filename for saving the items gathered by scraper as a csv file. [default: scrape_results.csv]

--download DIRECTORY Destination folder for downloads of all PDF/web pages of PDPC decisions [default: download/]

--corpus DIRECTORY Destination folder for PDPC decisions converted to text files [default: corpus/]

-r, --root DIRECTORY Root directory for downloads and files [default: Your current working directory]

--extras/--no-extras Add extra features to the data collected. (Experimental and requires reading of actual decisions)

--help Show this message and exit.
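To make the defaults above concrete, the sketch below computes where the outputs would land for a given root. It rests on an assumption the option descriptions suggest but do not state outright: that the --csv, --download and --corpus values are resolved relative to --root. `output_paths` is a hypothetical helper written for illustration, not a function of the package.

```python
from pathlib import Path


def output_paths(root=None, csv="scrape_results.csv",
                 download="download/", corpus="corpus/"):
    """Resolve the documented default locations under a root directory.

    Assumption: relative option values are joined onto --root, which itself
    defaults to the current working directory.
    """
    base = Path(root) if root is not None else Path.cwd()
    return {
        "csv": base / csv,
        "download": base / download,
        "corpus": base / corpus,
    }


if __name__ == "__main__":
    for name, path in output_paths("/code/download").items():
        print(f"{name}: {path}")
```

If the assumption holds, the Docker example with `--root /code/download/` would place the CSV file and both folders inside the bind-mounted directory.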

Contact

Feel free to send me your suggestions, comments or issues through the issue tracker or by email.

I would also like to hear how you have used this corpus; you can reach me through the same channels.
