idscrub 🧽✨
- Names and other personally identifying information are often present in text, even if they are not clearly visible or requested.
- This information may need to be removed prior to further analysis in many cases.
- idscrub identifies and removes (✨scrubs✨) personal data from text using regular expressions and named-entity recognition.
Installation
idscrub can be installed using pip into a Python >=3.12 environment. Example:
pip install idscrub
or with the spaCy transformer model (en_core_web_trf) already installed:
pip install idscrub[trf]
How to use the code
Basic usage example (see basic_usage.ipynb for further examples):
from idscrub import IDScrub
scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])
scrubbed_texts = scrub.scrub(scrub_methods=['spacy_persons', 'uk_phone_numbers', 'uk_postcodes'])
print(scrubbed_texts)
# Output: ['Our names are [PERSON], [PERSON], and [PERSON].', 'My number is [PHONENO] and I live at [POSTCODE].']
Personal data types supported
Personal data can be scrubbed either by calling a method with arguments for extra customisation, e.g. IDScrub.google_phone_numbers(region="GB"), or by passing the method name as a string argument to use its default configuration (see above). The method name and its string representation are the same.
| Argument | Scrubs |
|---|---|
| all | All supported personal data types (see IDScrub.all() for further customisation) |
| spacy_persons | Person names detected by spaCy's en_core_web_trf (or other user-selected spaCy models) |
| huggingface_persons | Person names detected by user-selected Hugging Face models |
| email_addresses | Email addresses |
| titles | Titles (e.g., Mr., Mrs., Dr.) |
| handles | Social media handles (e.g., @username) |
| ip_addresses | IP addresses |
| uk_postcodes | UK postal codes |
| uk_phone_numbers | UK phone numbers |
| google_phone_numbers | Phone numbers detected by Google’s phonenumbers |
| presidio | Entities supported by Microsoft Presidio (e.g., names, URLs, NHS numbers, IBAN codes) |
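
To illustrate the regular-expression side of scrubbing, here is a minimal standalone sketch. It does not use idscrub: the patterns, the placeholder labels ([EMAIL], [HANDLE], [IPADDRESS]) and the function name scrub_with_regex are assumptions made for this example, and the package's own expressions are likely to be more thorough.

```python
import re

# Hypothetical, simplified patterns for illustration only. They are not the
# expressions idscrub uses and will miss or over-match some edge cases.
# Order matters: emails are replaced first so that the "@domain" part of an
# address is not later treated as a social media handle.
PATTERNS = {
    "[EMAIL]": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "[HANDLE]": re.compile(r"(?<![\w@])@\w{2,}"),
    "[IPADDRESS]": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def scrub_with_regex(texts: list[str]) -> list[str]:
    """Replace every pattern match in each text with its placeholder."""
    scrubbed = []
    for text in texts:
        for placeholder, pattern in PATTERNS.items():
            text = pattern.sub(placeholder, text)
        scrubbed.append(text)
    return scrubbed


examples = ["Email jo@example.com or message @jo_smith.", "Server at 192.168.0.1."]
result = scrub_with_regex(examples)
print(result)
# ['Email [EMAIL] or message [HANDLE].', 'Server at [IPADDRESS].']
```

Patterns like these work well for data with a fixed shape (emails, postcodes, IP addresses). Person names have no fixed shape, which is why they are handled by named-entity recognition models instead.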
Considerations before use
- You must follow GDPR guidance when processing personal data using this package.
- This package has been designed as a first pass for standardised personal data removal.
- Users are encouraged to check and confirm outputs and conduct manual reviews where necessary, e.g. when cleaning high risk datasets.
- It is up to the user to assess whether this removal process needs to be supplemented by other methods for their given dataset and security requirements.
Input data
- This package is designed for text-based documents structured as a list of strings.
- It performs best when contextual meaning can be inferred from the text.
- For best results, input text should therefore resemble natural language.
- Highly fragmented, informal, technical, or syntactically broken text may reduce detection accuracy and lead to incomplete or incorrect name detection.
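
If your source is a single block of raw text, one way to get it into a list of strings is sketched below. The helper name to_text_list is hypothetical and the paragraph-splitting rule is an assumption; choose a split that keeps sentences intact so that context is preserved for name detection.

```python
import re


def to_text_list(document: str) -> list[str]:
    """Split on blank lines, collapse internal whitespace, drop empty chunks."""
    paragraphs = re.split(r"\n\s*\n", document)
    cleaned = (" ".join(paragraph.split()) for paragraph in paragraphs)
    return [paragraph for paragraph in cleaned if paragraph]


raw = "My number is\n+441111111111.\n\n\n  I live at AA11 1AA.  \n"
texts = to_text_list(raw)
print(texts)
# ['My number is +441111111111.', 'I live at AA11 1AA.']
```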
Biases and evaluation
- idscrub supports integration with spaCy and Hugging Face models for name cleaning.
- These models are state-of-the-art, capable of identifying approximately 90% of named entities, but may not remove all names.
- Biases present in these models due to their training data may affect performance. For example:
- English names may be more reliably identified than names common in other languages.
- Uncommon or non-Western naming conventions may be missed or misclassified.
- Important: see our wiki for further details and notes on our evaluation of idscrub.
Models
- Only spaCy's en_core_web_trf has been formally evaluated; no Hugging Face models have been.
- We therefore recommend that the current default en_core_web_trf is used for name scrubbing. Other models need to be evaluated by the user.
Similar Python packages
- Similar packages exist for undertaking this task, such as Presidio, Scrubadub and Sanityze.
- Development of idscrub was undertaken to:
  - Bring together different scrubbing methods across the Department for Business and Trade.
  - Adhere to infrastructure requirements.
  - Guarantee future stability and maintainability.
  - Encourage future scrubbing methods to be added collaboratively and transparently.
  - Allow for full flexibility depending on the use case and required outputs.
- To leverage the power of other packages, we have added methods that allow you to interact with them. These include IDScrub.presidio() and IDScrub.google_phone_numbers(). See the usage example notebook and method docstrings for further information.
AI declaration
AI has been used in the development of idscrub, primarily to develop regular expressions, suggest code refinements and draft documentation.
Development setup
This project is managed by uv.
To install all dependencies for this project, run:
uv sync --all-extras
If you do not have Python 3.12, run:
uv python install 3.12
To run tests:
uv run pytest
or
make test
Author
Analytical Data Science, Department for Business and Trade