idscrub 🧽✨

[!WARNING] You must follow GDPR guidance when processing personal data using this package.

Specifically, you must:

  • Update privacy notices: Clearly state this processing activity in new or existing privacy notices before using the package.
  • Ensure secure deletion: Remove any temporary or intermediary files and outputs in a secure manner.
  • Ensure data subject rights are upheld: Ensure individuals can access, correct, or erase their data as required.
  • Maintain processing records: Document how personal data is handled and for what purpose.

Description

  • Names and other personally identifying information are often present in text.
  • In many cases this information needs to be removed before further analysis.
  • idscrub provides a standardised way to do this in the Department for Business and Trade.

Expected Outputs

  • A list of text with names and other identifying information removed.

[!WARNING]

  • This package has been designed as a first pass for standardised personal data removal.
  • Users are encouraged to check and confirm outputs and conduct manual reviews where necessary, e.g. when cleaning high risk datasets.
  • It is up to the user to assess whether this removal process needs to be supplemented by other methods for their given dataset and security requirements.
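
As one way to support the manual review recommended above, the sketch below draws a reproducible random sample of original and scrubbed pairs for a person to check. The function name `sample_for_review` and its parameters are hypothetical and not part of idscrub; the scrubbed strings here are hard-coded for illustration.

```python
import random


def sample_for_review(originals, scrubbed, k=2, seed=0):
    """Return up to k (index, original, scrubbed) triples for manual checking."""
    if len(originals) != len(scrubbed):
        raise ValueError("originals and scrubbed must have the same length")
    # A seeded generator makes the sample reproducible for audit purposes.
    rng = random.Random(seed)
    indices = sorted(rng.sample(range(len(originals)), min(k, len(originals))))
    return [(i, originals[i], scrubbed[i]) for i in indices]


originals = ["Call Jo Bloggs.", "Invoice attached.", "Regards, Sam"]
scrubbed = ["Call [PERSON].", "Invoice attached.", "Regards, [PERSON]"]

for index, before, after in sample_for_review(originals, scrubbed, k=2):
    print(index, "|", before, "->", after)
```

The sample size and any acceptance threshold depend on the risk level of the dataset and are left to the user.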

Data

  • This package is designed for text-based documents structured as a list of strings.
  • It performs best when contextual meaning can be inferred from the text.
  • For best results, input text should therefore resemble natural language.
  • Highly fragmented, informal, technical, or syntactically broken text may reduce detection accuracy and lead to incomplete or incorrect name detection.
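
A minimal sketch of preparing input in the expected shape, assuming the raw material is one long string with paragraphs separated by blank lines. The helper `to_text_list` is hypothetical and not part of idscrub; it only produces a list of strings that could then be passed to the package.

```python
import re


def to_text_list(raw: str) -> list[str]:
    """Split raw text into paragraph-sized strings with whitespace collapsed."""
    # Split on blank lines so each paragraph keeps its surrounding context.
    paragraphs = re.split(r"\n\s*\n", raw)
    # Collapse runs of whitespace (including single newlines) into one space.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    # Drop entries that are empty after cleaning.
    return [p for p in cleaned if p]


raw = "Dear Ms Jones,\nthank you for your letter.\n\n\nPlease call   me on Monday."
print(to_text_list(raw))
# ['Dear Ms Jones, thank you for your letter.', 'Please call me on Monday.']
```

Keeping whole paragraphs together, rather than splitting into single words or short fragments, preserves the context that the models rely on.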

Biases and evaluation

  • idscrub supports integration with spaCy and Hugging Face models for name cleaning.
  • These models are state-of-the-art, capable of identifying approximately 90% of named entities, but may not remove all names.
  • Biases present in these models due to their training data may affect performance. For example:
    • English names may be more reliably identified than names common in other languages.
    • Uncommon or non-Western naming conventions may be missed or misclassified.
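
To check how these biases affect a particular dataset, one option is to measure recall on a small hand-labelled sample. The sketch below is an assumption about how such a check could look, not the evaluation method described in the wiki: `name_recall` is a hypothetical helper, it uses plain case-insensitive substring matching, and the scrubbed texts are hard-coded rather than produced by idscrub.

```python
def name_recall(scrubbed_texts, known_names):
    """Share of hand-labelled names that no longer appear in the scrubbed text."""
    if len(scrubbed_texts) != len(known_names):
        raise ValueError("need one list of known names per text")
    total = 0
    removed = 0
    for text, names in zip(scrubbed_texts, known_names):
        for name in names:
            total += 1
            # A name counts as removed if it is absent from the scrubbed text.
            if name.lower() not in text.lower():
                removed += 1
    if total == 0:
        raise ValueError("no labelled names to evaluate")
    return removed / total


# Illustrative data: the second name in the first text was missed.
scrubbed = [
    "Our names are [PERSON] and Nguyễn Văn An.",
    "Contact [PERSON] for details.",
]
labels = [["Hamish McDonald", "Nguyễn Văn An"], ["Elena Suárez"]]

print(f"{name_recall(scrubbed, labels):.2f}")  # 0.67
```

Computing this separately for groups of names (for example by language of origin) would show whether some groups are missed more often than others.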

[!IMPORTANT]

  • See our wiki for further details and notes on our evaluation of idscrub.

Models and Memory

  • Only spaCy's en_core_web_trf has been formally evaluated; no Hugging Face models have been.
  • We therefore recommend that the current default en_core_web_trf is used for name scrubbing. Other models need to be evaluated by the user.

[!IMPORTANT] spaCy and Hugging Face models have high memory requirements. To avoid memory-related errors, clear the auto-generated huggingface folder when it is not in use. Do not push the huggingface folder (or user-defined equivalent) to GitHub.
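
Clearing the model folder can be scripted. The sketch below uses only the standard library; `clear_model_cache` is a hypothetical helper, and the folder name and location are assumptions that should be adjusted to wherever the folder is created in your project. It is demonstrated on a temporary directory so that nothing real is deleted.

```python
import shutil
import tempfile
from pathlib import Path


def clear_model_cache(folder: Path) -> bool:
    """Delete the folder and its contents; return True if something was removed."""
    if folder.is_dir():
        shutil.rmtree(folder)
        return True
    return False


# Demonstration on a throwaway directory standing in for the real folder.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "huggingface"
    (cache / "models").mkdir(parents=True)
    (cache / "models" / "weights.bin").write_bytes(b"\x00" * 16)
    print(clear_model_cache(cache))  # True: the folder existed and was removed
    print(clear_model_cache(cache))  # False: nothing left to remove
```

Note that `shutil.rmtree` performs an ordinary delete, not a secure erase. To keep the folder out of version control, a `huggingface/` entry in `.gitignore` excludes it from commits.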

Similar Python packages

  • Similar packages exist for undertaking this task, such as presidio, scrubadub and sanityze.
  • Development of idscrub was undertaken to: bring together different scrubbing methods across the department, adhere to infrastructure requirements, guarantee future stability and maintainability, and encourage future scrubbing methods to be added collaboratively and transparently.
  • To leverage the power of other packages, we have added methods that allow you to interact with them. These include: IDScrub.presidio() and IDScrub.google_phone_numbers(). See the usage example notebook and method docstrings for further information.

Installation

idscrub can be installed using pip into a Python >=3.12 environment. Example (with spaCy model installed):

pip install 'git+ssh://git@github.com/uktrade/idscrub.git#egg=idscrub[trf]'

or without spaCy installed (it will be installed automatically if name cleaning methods are called):

pip install 'git+ssh://git@github.com/uktrade/idscrub.git'

How to use the code

Basic usage example (see notebooks/basic_usage.ipynb for further examples):

from idscrub import IDScrub

scrub = IDScrub([
    'Our names are Hamish McDonald, L. Salah, and Elena Suárez.',
    'My number is +441111111111 and I live at AA11 1AA, Lapland.',
])
scrubbed_texts = scrub.all()

print(scrubbed_texts)

# Output: ['Our names are [PERSON], [PERSON], and [PERSON].', 'My number is [PHONENO] and I live at [POSTCODE], [LOCATION].']
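
To illustrate the placeholder-replacement idea behind the output above, here is a minimal sketch that uses only the standard `re` module. The patterns are deliberately simplified assumptions and are not idscrub's own regular expressions; `scrub_with_patterns` is a hypothetical function.

```python
import re

# Simplified, illustrative patterns (assumptions, not idscrub's own regexes).
PATTERNS = {
    "[PHONENO]": re.compile(r"\+\d{10,14}\b"),
    "[POSTCODE]": re.compile(r"\b[A-Z]{1,2}\d{1,2}[A-Z]?\s?\d[A-Z]{2}\b"),
}


def scrub_with_patterns(texts):
    """Replace every pattern match in each text with its placeholder."""
    scrubbed = []
    for text in texts:
        for placeholder, pattern in PATTERNS.items():
            text = pattern.sub(placeholder, text)
        scrubbed.append(text)
    return scrubbed


print(scrub_with_patterns(
    ["My number is +441111111111 and I live at AA11 1AA, Lapland."]
))
# ['My number is [PHONENO] and I live at [POSTCODE], Lapland.']
```

Regular expressions like these only catch rigidly formatted identifiers. In this sketch "Lapland" is left untouched, which is why names and locations are handled with language models rather than patterns.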

AI Declaration

AI has been used in the development of idscrub, primarily to develop regular expressions, suggest code refinements and draft documentation.

Development setup

This project is managed by uv.

To install all dependencies for this project, run:

uv sync --all-extras

If you do not have Python 3.12, run:

uv python install 3.12

To run tests:

uv run pytest

or

make test

Author

Analytical Data Science, Department for Business and Trade
