idscrub 🧽✨

Names and other personally identifiable information are often present in text, even when they are not obvious or were never requested.
In many cases this information must be removed before further analysis.
idscrub identifies and removes (✨scrubs✨) personal data from text using regular expressions and named-entity recognition.

[!IMPORTANT]

This package is undergoing frequent internal development. Major updates will be made public periodically.

Installation

idscrub can be installed using pip into a Python >=3.12 environment.

We recommend installing with the spaCy transformer model (en_core_web_trf) as a dependency:

pip install idscrub[trf]

If you do not need spaCy:

pip install idscrub
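
After installing, a quick check of the environment can confirm that the requirements above are met. The following is a minimal sketch using only the standard library; `environment_report` is a hypothetical helper, not part of idscrub, and it assumes the spaCy model is importable under the package name `en_core_web_trf`, which is how spaCy model packages are normally installed.

```python
import importlib.util
import sys


def environment_report(min_version=(3, 12)):
    """Report whether the interpreter and optional packages look ready."""
    return {
        # idscrub states that it needs Python >=3.12.
        "python_ok": sys.version_info >= min_version,
        # find_spec returns None when a top-level package is not installed.
        "idscrub": importlib.util.find_spec("idscrub") is not None,
        "en_core_web_trf": importlib.util.find_spec("en_core_web_trf") is not None,
    }


print(environment_report())
```

If `en_core_web_trf` is reported as missing, the `idscrub[trf]` extra was probably not installed.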

How to use the code

Basic usage example (see basic_usage.ipynb for further examples):

from idscrub import IDScrub

scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])
scrubbed_texts = scrub.scrub(
    pipeline=[
        {"method": "spacy_entities", "entity_types": ["PERSON"]},
        {"method": "uk_phone_numbers"},
        {"method": "uk_postcodes"},
    ]
)

print(scrubbed_texts)

# Output: ['Our names are [PERSON], [PERSON], and [PERSON].', 'My number is [PHONENO] and I live at [POSTCODE].']

This package will identify and scrub many types of data that you might not want to scrub, such as locations or context-relevant names. We therefore strongly recommend reviewing the data identified by idscrub and removing it from your original dataset manually, on a case-by-case basis.

Scrubbed data can be identified using the following methods (see the usage example notebook for further information):

import pandas as pd
from idscrub import IDScrub

# From lists of text:
scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])
scrubbed_texts = scrub.scrub(
    pipeline=[
        {"method": "spacy_entities", "entity_types": ["PERSON"]},
        {"method": "uk_phone_numbers"},
        {"method": "uk_postcodes"},
    ]
)
scrubbed_df = scrub.get_scrubbed_data()
print(scrubbed_df)

# From a Pandas DataFrame:
scrubbed_df, scrubbed_data = IDScrub.dataframe(
    df=pd.read_csv('path/to/csv'), 
    id_col="ID", 
    pipeline=[
        {"method": "spacy_entities", "entity_types": ["PERSON"]},
        {"method": "uk_phone_numbers"},
        {"method": "uk_postcodes"},
    ]
)
print(scrubbed_df)
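
One way to act on the case-by-case recommendation is to treat the identified strings as candidates, have a reviewer approve a subset, and remove only those from the original text. The sketch below uses plain Python. `candidates`, `approved` and `remove_approved` are hypothetical names, and the layout of the candidate records is an assumption rather than the format returned by `get_scrubbed_data()`.

```python
# Hypothetical review records: (text index, identified string, label).
# The real structure returned by idscrub may differ.
candidates = [
    (0, "Hamish McDonald", "[PERSON]"),
    (0, "Glasgow", "[GPE]"),  # a location the analyst wants to keep
    (1, "AA11 1AA", "[POSTCODE]"),
]

originals = [
    "Hamish McDonald moved to Glasgow.",
    "Deliver to AA11 1AA.",
]

# Rows the reviewer has approved for removal (indices into `candidates`).
approved = {0, 2}


def remove_approved(texts, candidates, approved):
    """Replace only the reviewer-approved strings with their labels."""
    cleaned = list(texts)  # copy so the original list is left untouched
    for index, (text_id, found, label) in enumerate(candidates):
        if index in approved:
            cleaned[text_id] = cleaned[text_id].replace(found, label)
    return cleaned


print(remove_approved(originals, candidates, approved))
# ['[PERSON] moved to Glasgow.', 'Deliver to [POSTCODE].']
```

Keeping the approval step separate from the replacement step makes it easy to record which removals were accepted and why.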

Personal data types supported

Method	Scrubs
`all`	All supported personal data types (see `IDScrub.all()` for further customisation)
`spacy_entities`	Entities detected by spaCy's `en_core_web_trf` or other user-selected spaCy models (e.g. persons (names), organisations)
`presidio_entities`	Entities supported by Microsoft Presidio (e.g. persons (names), URLs, NHS numbers, IBAN codes)
`huggingface_entities`	Entities detected by user-selected HuggingFace models
`email_addresses`	Email addresses (e.g. john@email.com)
`titles`	Titles (e.g. Mr., Mrs., Dr.)
`handles`	Social media handles (e.g. @username)
`urls`	URLs (e.g. www.bbc.co.uk)
`ip_addresses`	IP addresses (e.g. 8.8.8.8)
`uk_postcodes`	UK postal codes (e.g. SW1A 2AA)
`uk_addresses`	UK addresses (e.g. 10 Downing Street)
`uk_phone_numbers`	UK phone numbers (e.g. +441111111111)
`google_phone_numbers`	Phone numbers detected by Google's phonenumbers

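Arguments for further customisation are described in each method's docstring, which can be viewed with e.g. ?IDScrub.spacy_entities in IPython or Jupyter, or help(IDScrub.spacy_entities) in a standard Python session.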
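
To illustrate what a regular-expression method does conceptually, here is a minimal sketch using only the standard library. The patterns, the `scrub_with_regex` helper and the `[EMAIL]` label are simplified assumptions for illustration and are not idscrub's actual implementation. The `[POSTCODE]` label follows the output shown in the usage example.

```python
import re

# Simplified, illustrative patterns (assumptions; NOT the patterns idscrub uses).
PATTERNS = {
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "[POSTCODE]": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b"),
}


def scrub_with_regex(text: str) -> tuple[str, list[tuple[str, str]]]:
    """Replace each match with its label and record what was removed."""
    removed = []
    for label, pattern in PATTERNS.items():
        # Record the matches first, then substitute them in the text.
        for match in pattern.finditer(text):
            removed.append((label, match.group()))
        text = pattern.sub(label, text)
    return text, removed


clean, removed = scrub_with_regex("Write to john@email.com or visit SW1A 2AA.")
print(clean)    # Write to [EMAIL] or visit [POSTCODE].
print(removed)  # [('[EMAIL]', 'john@email.com'), ('[POSTCODE]', 'SW1A 2AA')]
```

Returning the removed strings alongside the cleaned text mirrors the idea of reviewing what was scrubbed before accepting it.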

Considerations before use

You must follow GDPR guidance when processing personal data using this package.
This package has been designed as a first pass for standardised personal data removal.
Users are encouraged to check and confirm outputs and conduct manual reviews where necessary, e.g. when cleaning high risk datasets.
It is up to the user to assess whether this removal process needs to be supplemented by other methods for their given dataset and security requirements.

Input data

This package is designed for text-based documents structured as a list of strings.
It performs best when contextual meaning can be inferred from the text.
For best results, input text should therefore resemble natural language.
Highly fragmented, informal, technical, or syntactically broken text may reduce detection accuracy and lead to incomplete or incorrect name detection.
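
Because the expected input is a list of strings, it can help to coerce mixed records into that shape before scrubbing. The following is a minimal sketch. `to_text_list` is a hypothetical helper, not part of idscrub, and collapsing whitespace is an illustrative choice rather than a documented requirement.

```python
def to_text_list(records) -> list[str]:
    """Coerce an iterable of values into a list of whitespace-normalised strings."""
    texts = []
    for value in records:
        if value is None:
            # Keep positions aligned with the source rows.
            texts.append("")
        else:
            # str() handles non-string values; split/join collapses runs of whitespace.
            texts.append(" ".join(str(value).split()))
    return texts


print(to_text_list(["My name is\n  Elena Suárez.", None, 42]))
# ['My name is Elena Suárez.', '', '42']
```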

Biases and evaluation

idscrub supports integration with spaCy and Hugging Face models for name cleaning.
These models are state-of-the-art and can identify approximately 90% of named entities, so they will not remove every name.
Biases present in these models due to their training data may affect performance. For example:
- English names may be more reliably identified than names common in other languages.
- Uncommon or non-Western naming conventions may be missed or misclassified.

[!IMPORTANT]

See our wiki for further details and notes on our evaluation of idscrub.

Models

Only spaCy's en_core_web_trf has been formally evaluated; no Hugging Face models have been evaluated.
We therefore recommend using the current default, en_core_web_trf, for name scrubbing. Other models need to be evaluated by the user.
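
A minimal sketch of how such an evaluation could be set up, assuming you have a small hand-labelled sample. `scrub_fn` stands in for whichever scrubbing call you are testing, and `toy_scrub` is a deliberately naive placeholder so that the example runs without any model. Neither is part of idscrub.

```python
from typing import Callable

# Hand-labelled sample: each text paired with the names a human found in it.
labelled = [
    ("Contact Elena Suárez or L. Salah.", ["Elena Suárez", "L. Salah"]),
    ("Hamish McDonald called.", ["Hamish McDonald"]),
]


def name_recall(scrub_fn: Callable[[str], str], sample) -> float:
    """Share of labelled names that no longer appear after scrubbing."""
    total = removed = 0
    for text, names in sample:
        scrubbed = scrub_fn(text)
        for name in names:
            total += 1
            if name not in scrubbed:
                removed += 1
    return removed / total if total else 0.0


def toy_scrub(text: str) -> str:
    """Naive placeholder scrubber that only knows two names."""
    for name in ("Elena Suárez", "Hamish McDonald"):
        text = text.replace(name, "[PERSON]")
    return text


print(f"{name_recall(toy_scrub, labelled):.2f}")  # 0.67
```

Recall alone ignores over-scrubbing, so it is worth also checking which non-name strings a model removes.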

Similar Python packages

Similar packages exist for undertaking this task, such as Presidio, Scrubadub and Sanityze.
Development of idscrub was undertaken to:
- Bring together different scrubbing methods across the Department for Business and Trade.
- Adhere to infrastructure requirements.
- Guarantee future stability and maintainability.
- Encourage future scrubbing methods to be added collaboratively and transparently.
- Allow for full flexibility depending on the use case and required outputs.

To leverage the power of other packages, we have added methods that allow you to interact with them. These include IDScrub.presidio_entities() and IDScrub.google_phone_numbers(). See the usage example notebook and method docstrings for further information.

AI declaration

AI has been used in the development of idscrub, primarily to develop regular expressions, suggest code refinements and draft documentation.

Development setup

This project is managed by uv.

To install all dependencies for this project, run:

uv sync

If you do not have Python 3.12, run:

uv python install 3.12

To run tests, use either:

uv run pytest

or:

make test

Author

Analytical Data Science, Department for Business and Trade

Project details

This version: 2.0.1, released Feb 3, 2026.

