Library for scraping, parsing, and analyzing privacy policies.

polipy

PoliPy is a Python library that provides a command-line interface (CLI) and an API to scrape, parse, and analyze privacy policies of different services. It is a maintained library developed as a collaborative effort of researchers at the Berkeley Lab for Usable and Experimental Security (BLUES) at the University of California, Berkeley.

Please read the Citations and License sections carefully to learn how to cite this library and which GNU GPLv3 terms govern the use and modification of this software.

Installation

Install the library with pip:

pip install polipy

Example

This library can be used either from the command line (CLI):

$ cat policies.txt
https://docs.github.com/en/github/site-policy/github-privacy-statement

$ polipy policies.txt -s

or as an API imported by another module:

import polipy

url = 'https://docs.github.com/en/github/site-policy/github-privacy-statement'
result = polipy.get_policy(url, screenshot=True)

result.save(output_dir='.')

Both of these result in the creation of the following output folder:

├── docs_github_com_c0eb432555
│   ├── 20210511.html
│   ├── 20210511.png
│   ├── 20210511.json
│   └── 20210511.meta

where the base file name corresponds to the date the policy was scraped and the file extensions correspond to the following:

  • .html contains the (dynamic) source of the webpage where the privacy policy is hosted
  • .png contains the screenshot of the webpage
  • .meta contains information such as the URL of the privacy policy and the date of last scraping
  • .json contains the content extracted from the privacy policy.

For instance, the text key of the JSON in the .json file contains the extracted text from the scraped privacy policy:

  GitHub Privacy Statement - GitHub Docs
  GitHub Docs
  All products
  GitHub.com
  Getting started
  ...
  Effective date: December 19, 2020
  Thanks for entrusting GitHub Inc. (“GitHub”, “we”) with your source code, your projects, and your personal information. Holding on to your private information is a serious responsibility, and we want you to know how we're handling it.
  All capitalized terms have their definition in
  GitHub’s Terms of Service , unless otherwise noted here.
  ...
  Contact GitHub
  Pricing
  Developer API
  Training
  About
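As a minimal sketch of reading this output back, the helper below loads one .json file and returns the value stored under its text key. It uses only the Python standard library. The function name read_policy_text is hypothetical and not part of PoliPy, and the sketch assumes the file is UTF-8 encoded JSON with a top-level object.

```python
import json
from pathlib import Path


def read_policy_text(json_path):
    """Return the value of the "text" key from a PoliPy .json file, or None.

    Hypothetical helper, not part of PoliPy.
    """
    # Assumption: the file is UTF-8 encoded JSON with a top-level object.
    with Path(json_path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    # The "text" key holds the extracted policy text; other keys are ignored.
    return data.get("text")
```

For the output folder shown earlier, calling read_policy_text("docs_github_com_c0eb432555/20210511.json") would be expected to return the extracted text that starts with "GitHub Privacy Statement - GitHub Docs".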

Usage

The CLI usage manual is available with the --help or -h flag:

$ polipy --help
usage: __main__.py [-h] [--output_dir OUTPUT_DIR] [--timeout TIMEOUT] [--screenshot] [--extractors EXTRACTORS [EXTRACTORS ...]] [--workers WORKERS] [--force] [--raise_errors] [--verbose] input_file

Download privacy policies from URLs contained in the input_file.

positional arguments:
  input_file            Path to file containing a list of newline-separated URLs of privacy policies to scrape.

optional arguments:
  -h, --help            show this help message and exit
  --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                        Path to directory where policies will be saved (default is ...).
  --timeout TIMEOUT, -t TIMEOUT
                        The amount of time in seconds to wait for the HTTP request response (default is 30).
  --screenshot, -s      Capture and save the screenshot of the privacy policy page (default is False).
  --extractors EXTRACTORS [EXTRACTORS ...], -e EXTRACTORS [EXTRACTORS ...]
                        Extractors to use to capture information from the privacy policy (default is [text]).
  --workers WORKERS, -w WORKERS
                        Number of threading workers to use (default is 1).
  --force, -f           Scrape privacy policy again even if it is already scraped or has not been updated (default is False).
  --raise_errors, -r    Raise errors that occur during the scraping and parsing (default is False).
  --verbose, -v         Enable verbose logging (default is False).
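When the CLI is driven from another Python script, it can help to assemble the argument list programmatically. The sketch below does this using only the long flags listed in the help text above. The function build_polipy_command is a hypothetical helper, not part of PoliPy, and it assumes the polipy executable is on the PATH when the command is eventually run.

```python
def build_polipy_command(input_file, output_dir=None, timeout=None,
                         screenshot=False, extractors=None, workers=None,
                         force=False, raise_errors=False, verbose=False):
    """Build an argument list for the polipy CLI (hypothetical helper)."""
    # The positional input_file goes first so that the variable-length
    # --extractors option cannot swallow it.
    command = ["polipy", str(input_file)]
    if output_dir is not None:
        command += ["--output_dir", str(output_dir)]
    if timeout is not None:
        command += ["--timeout", str(timeout)]
    if screenshot:
        command.append("--screenshot")
    if extractors:
        command += ["--extractors", *extractors]
    if workers is not None:
        command += ["--workers", str(workers)]
    if force:
        command.append("--force")
    if raise_errors:
        command.append("--raise_errors")
    if verbose:
        command.append("--verbose")
    return command
```

The resulting list can be passed to subprocess.run(command, check=True). For example, build_polipy_command("policies.txt", screenshot=True) produces the long-flag equivalent of the polipy policies.txt -s call shown in the Example section.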

The following helper methods are available when the PoliPy library is imported:

  • get_policy: Helper method that returns a polipy.Policy object containing information about the policy, scraped and processed from the given URL.
  • download_policy: Helper method that scrapes, parses, and saves the privacy policy located at the provided url.

Additionally, you can create polipy.Policy objects directly. The helper methods and the Policy class are documented in the sections that follow.

get_policy

Helper method that returns a polipy.Policy object containing information about the policy, scraped and processed from the given URL.

Parameters:

  • url (str): The URL of the privacy policy.
  • screenshot (bool, optional): Flag that indicates whether to capture and save the screenshot of the privacy policy page (default is False).
  • timeout (int, optional): The amount of time in seconds to wait for the HTTP request response (default is 30).
  • extractors (list of str, optional): Extractors to use to capture information from the privacy policy (default is ["text"]).

Returns:

  • polipy.Policy: Object containing information about the privacy policy.

Raises:

  • polipy.NetworkIOException: Raised if an error has occurred while performing networking I/O.
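A minimal sketch of calling get_policy with the parameters documented above and pulling out the extracted text. The wrapper fetch_policy_text and its get_policy argument are hypothetical and exist only so the sketch can be exercised with a stub. It also assumes that the "text" extractor stores its result under the "text" key of Policy.content, mirroring the text key of the .json file.

```python
def fetch_policy_text(url, timeout=30, get_policy=None):
    """Return the extracted text for `url`, or None if no text is present.

    Hypothetical wrapper, not part of PoliPy.
    """
    if get_policy is None:
        # Imported lazily so the sketch can be run with a stand-in function.
        import polipy
        get_policy = polipy.get_policy
    policy = get_policy(url, screenshot=False, timeout=timeout,
                        extractors=["text"])
    # Assumption: the "text" extractor populates content["text"].
    return policy.content.get("text")
```

Any polipy.NetworkIOException raised by get_policy propagates to the caller, which can catch it to skip or retry unreachable URLs.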

download_policy

Helper method that scrapes, parses, and saves the privacy policy located at the provided url by creating the following directory structure:

├── <output_dir>
│   ├── <policy URL domain>_<hash of policy URL>
│   │   ├── <current UTC date>.html
│   │   ├── <current UTC date>.json
│   │   └── <current UTC date>.meta

Parameters:

  • url (str): The URL of the privacy policy.
  • output_dir (str, optional): Path to directory where the policy will be saved (default is the current working directory).
  • force (bool, optional): Flag that indicates whether to scrape privacy policy again even if it is already scraped or has not been updated (default is False).
  • raise_errors (bool, optional): Flag that indicates whether to raise errors that occur during the scraping and parsing (default is False).
  • logger (logging.Logger, optional): A logging.Logger object to handle the logging of events (default is None).
  • screenshot (bool, optional): Flag that indicates whether to capture and save the screenshot of the privacy policy page (default is False).
  • timeout (int, optional): The amount of time in seconds to wait for the HTTP request response (default is 30).
  • extractors (list of str, optional): Extractors to use to capture information from the privacy policy (default is ["text"]).

Raises:

  • polipy.NetworkIOException: Raised if an error has occurred while performing networking I/O.
  • polipy.ParserException: Raised if an error occurred while extracting text from page source.
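Because each policy folder accumulates one set of files per scrape date, a common follow-up task is finding the newest snapshot of every policy. The sketch below walks an output directory laid out as described above, using only the standard library. The function latest_snapshots is hypothetical and not part of PoliPy, and it assumes the date in each file name is formatted as YYYYMMDD (as in 20210511), so that sorting the names as strings also sorts them chronologically.

```python
from pathlib import Path


def latest_snapshots(output_dir):
    """Map each policy folder name to the path of its newest .json file.

    Hypothetical helper, not part of PoliPy.
    """
    latest = {}
    for policy_dir in sorted(Path(output_dir).iterdir()):
        if not policy_dir.is_dir():
            continue  # ignore stray files next to the policy folders
        # Assumption: stems are YYYYMMDD dates, so string order is date order.
        json_files = sorted(policy_dir.glob("*.json"), key=lambda p: p.stem)
        if json_files:
            latest[policy_dir.name] = json_files[-1]
    return latest
```

Folders that contain no .json file are left out of the result.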

Policy

A class representing a privacy policy. Attributes:

  • url (dict): Contains the URL to the privacy policy and additional information about the URL, such as domain, scheme, content-type, etc.
  • source (dict): Contains the information scraped from the webpage where the policy is hosted, such as the HTML (dynamic) source and the screenshot.
  • content (dict): Contains the content extracted from the privacy policy website, such as the text of the policy.

__init__

Constructor method. Populates the Policy.url attribute. Parameters:

  • url (str): The URL of the privacy policy.

scrape

Obtains the page source of the given privacy policy URL. Populates the Policy.source attribute. Parameters:

  • screenshot (bool, optional): Flag that indicates whether to capture and save the screenshot of the privacy policy page (default is False).
  • timeout (int, optional): The amount of time in seconds to wait for the HTTP request response (default is 30).

Returns:

  • polipy.Policy: Policy object with the populated attribute.

Raises:

  • polipy.NetworkIOException: Raised if an error has occurred while performing networking I/O.

extract

Extracts information from the scraped privacy policy. Populates the Policy.content attribute. Parameters:

  • extractors (list of str, optional): Extractors to use to capture information from the privacy policy (default is ["text"]).

Returns:

  • polipy.Policy: Policy object with the populated attribute.

Raises:

  • polipy.ParserException: Raised if an error occurred while extracting text from page source.

save

Saves the information contained in the Policy object. Parameters:

  • output_dir (str): Path to directory where the policy will be saved.

to_dict

Converts the Policy object to a dictionary. Returns:

  • dict: Dictionary containing policy attributes as key-value pairs.
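The methods above can be combined step by step when finer control is needed than the helper methods offer. The following is a sketch under stated assumptions, not a description of how the helpers are implemented: scrape_and_save and its policy_cls argument are hypothetical, and policy_cls exists only so that the sketch can be exercised with a stand-in class instead of polipy.Policy.

```python
def scrape_and_save(url, output_dir, screenshot=False, timeout=30,
                    policy_cls=None):
    """Scrape, extract, and save one policy, then return it as a dict.

    Hypothetical helper, not part of PoliPy.
    """
    if policy_cls is None:
        # Imported lazily so the sketch can be run with a stand-in class.
        import polipy
        policy_cls = polipy.Policy
    policy = policy_cls(url)                 # populates Policy.url
    policy.scrape(screenshot=screenshot,     # populates Policy.source
                  timeout=timeout)
    policy.extract(extractors=["text"])      # populates Policy.content
    policy.save(output_dir=output_dir)       # writes the files to disk
    return policy.to_dict()
```

Keeping the steps separate makes it possible to inspect Policy.source between scrape and extract, or to catch polipy.NetworkIOException and polipy.ParserException at the step that raises them.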

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

Citations

Any project that uses this library in part or in whole is required to acknowledge the usage of this library. For publications, the following citation can be used:

Samarin, N., Kothari, S., Siyed, Z., Wijesekera, P., Fischer, J., Hoofnagle, C. and Egelman, S., Investigating the Compliance of Android App Developers with the CCPA.

License

GNU GPLv3
