Skip to main content

Base data structures for PII Processing

Project description

pii-data

This package provides base data structures for the management of PII i.e. Personally Identifiable Information (it does not contain code for processing documents, or extracting PII from documents).

For the full specification embodied by these base data structures, check the PIISA Data Specification.

Data structures

Two main data types are defined: PII Source Documents and PII Collections

PII Source Document

A PII Source Document defines the raw data from which PII is detected. This document is modelled as a sequence of chunks, each one having an identifier and a raw text content. There could be two variants:

  • a linear PII Source Document is just a sequence of chunks
  • in a hierarchical PII Source document chunks might contain other chunks, in a nested fashion

The Python data structure for a chunk is just a Python dictionary; a PII Source Document could be a nested dictionary, a list of dictionaries or a combination of both.

The official dump representation of a PII Source Document is in the form of a YAML file.

PII Collection

A PII Collection contains a list of detected/extractied PII Entities. Each entity contains all the information needed to correctly identify one PII instance and locate it in the document it belongs to.

These are the data classes defined:

  • PiiCollection: the full collection of PII
  • PiiEntity: one PII instance
  • PiiDetector: an objecto to describe the module used to generate a given ç PiiEntity object

PiiCollection objects have a dump() method that allows writing them in a standard format, with two representations:

  • a JSON representation, adequate for storage
  • a NDJSON representation (newline-delimited JSON), intended for processing and streaming

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pii-data-0.0.1.tar.gz (9.4 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page