A collection of functionality for data classification, data privacy risk assessment, and mitigation enforcement
🔒 READI - Risk Evaluation and De-Identification
Privacy-preserving AI made simple - A comprehensive toolkit for data privacy risk assessment and de-identification in Python-based ML pipelines.
READI augments the IBM Data Privacy Toolkit with state-of-the-art capabilities for detecting personal and sensitive information in unstructured documents. It is built for modern compliance frameworks and AI model training workflows.
✨ Features
- 🎯 Advanced PII Detection - Identify personal and sensitive information across multiple data types
- 🔄 Seamless Integration - Low-effort integration with existing ML pipelines
- 📊 Structured & Unstructured Data - Support for both data formats
- 🌐 REST API - Easy-to-use HTTP interface for remote processing
- 🧪 Extensible Framework - Modular design for custom privacy requirements
- 📝 Comprehensive Examples - Jupyter notebooks with real-world use cases
🚀 Quick Start
Prerequisites
- Python 3.8 or higher
- Git with git-lfs support (for large files >50 MB)
- uv (recommended) - A fast Python package installer
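The prerequisites above can be checked before installing. The following is a minimal sketch using only the Python standard library; it assumes the tools are exposed on PATH under the executable names git, git-lfs, and uv, and the helper name check_prerequisites is purely illustrative (it is not part of READI).

```python
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in the prerequisites


def check_prerequisites(tools=("git", "git-lfs", "uv")):
    """Return a dict mapping each prerequisite name to True (found) or False."""
    # Compare only (major, minor) against the required minimum
    report = {"python": sys.version_info[:2] >= MIN_PYTHON}
    for tool in tools:
        # shutil.which returns the executable path, or None if it is not on PATH
        report[tool] = shutil.which(tool) is not None
    return report


for name, ok in check_prerequisites().items():
    print(f"{name:8s} {'found' if ok else 'MISSING'}")
```

uv is only recommended, so a missing uv entry does not block the standard pip installation.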
Installation
Recommended: Using uv (10-100x faster)
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create and activate virtual environment
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install READI
uv pip install git+https://github.com/IBM/READI.git
Standard Installation with pip:
pip install git+https://github.com/IBM/READI.git
Clone Repository:
git clone https://github.com/IBM/READI.git
cd READI
# With uv (recommended)
uv pip install -e .
# Or with pip
pip install -e .
💻 Development Setup
For contributors and developers:
Recommended: Using uv
# Install in editable mode with development dependencies
uv pip install -e .
uv pip install -r requirements-dev.txt
# Set up pre-commit hooks (recommended)
pre-commit install
Alternative: Using pip
# Install in editable mode with development dependencies
pip install -e .
pip install -r requirements-dev.txt
# Set up pre-commit hooks (recommended)
pre-commit install
This installs the project in editable mode along with development tools (pytest, ruff, bandit, etc.).
💡 Tip: Using uv provides significantly faster dependency resolution and installation compared to traditional pip.
🌐 REST API Usage
READI provides a simple REST API for remote processing.
Setup
# Install with REST API support
pip install -e '.[rest]'
# Start the server
uvicorn risk_assessment.entry_points.rest.api:app
Example Request
curl -H 'Content-Type: application/json' \
http://localhost:8000/detect_phi \
--data-raw '{"text":"My text with email: john@gmail.com"}'
The API will be available at http://localhost:8000 with interactive documentation at /docs.
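The same request can be issued from Python. This is a minimal sketch using only the standard library; it mirrors the curl call above (POST to /detect_phi with a JSON body containing a text field). The function names build_detect_request and detect_phi are hypothetical helpers, not part of READI, and the response schema is not documented here, so the parsed JSON is returned unchanged.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # address of the local uvicorn server


def build_detect_request(text, base_url=BASE_URL):
    """Build the POST request for /detect_phi without sending it."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/detect_phi",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def detect_phi(text, base_url=BASE_URL, timeout=10.0):
    """Send the text to a running server and return the decoded JSON response."""
    request = build_detect_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (requires the server to be running):
# result = detect_phi("My text with email: john@gmail.com")
# print(result)
```

Separating request construction from sending makes it possible to inspect or log the outgoing payload without a live server.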
📚 Examples & Tutorials
Explore our comprehensive Jupyter notebooks in the notebooks/ directory:
| Notebook | Description |
|---|---|
| Unstructured Data Classification | General overview of READI API for free-text processing |
| Structured Data Classification | Working with tabular and structured datasets |
📖 Documentation
For detailed documentation, API references, and advanced usage patterns, please visit our documentation portal (coming soon).
🤝 Contributing
We welcome contributions! Please see our Contributing Guidelines for details on:
- Code style and standards
- Testing requirements
- Pull request process
- Development workflow
📄 License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
📌 How to Cite
If you use READI in academic work, please cite the most relevant publication from the references below. A general citation entry is:
@software{readi_ibm,
title = {READI: Risk Evaluation and De-Identification},
author = {Stefano Braghin and Liubov Nedoshivina and Anisa Halimi and Naoise Holohan and Kieran Fraser},
year = {2026},
url = {https://github.com/IBM/READI}
}
When your usage specifically relates to unstructured document de-identification, prefer citing:
@article{nedoshivina2024pragmatic,
title = {Pragmatic De-Identification of Cross-Domain Unstructured Documents: A Utility-Preserving Approach with Relation Extraction Filtering},
author = {Liubov Nedoshivina and Anisa Halimi and Joa Bettencourt-Silva and Stefano Braghin},
journal = {AMIA Summits on Translational Science Proceedings},
volume = {2024},
pages = {85},
year = {2024}
}
📚 Academic References
READI is built on years of privacy research. Key publications:
- Nedoshivina, L., Halimi, A., Bettencourt-Silva, J., & Braghin, S. (2024). Pragmatic De-Identification of Cross-Domain Unstructured Documents: A Utility-Preserving Approach with Relation Extraction Filtering. AMIA Summits on Translational Science Proceedings, 2024, 85.
- Pachilakis, M., Antonatos, S., Levacher, K., & Braghin, S. (2020). PrivLeAD: Privacy Leakage Detection on the Web. Intelligent Systems and Applications. IntelliSys 2020. Advances in Intelligent Systems and Computing, vol 1250. Springer, Cham. DOI: 10.1007/978-3-030-55180-3_32
- Braghin, S., Bettencourt-Silva, J. H., Levacher, K., & Antonatos, S. (2019). An Extensible De-Identification Framework for Privacy Protection of Unstructured Health Information: Creating Sustainable Privacy Infrastructures. MEDINFO 2019: Health and Wellbeing e-Networks for All (pp. 1140-1144). IOS Press. DOI: 10.3233/SHTI190404
- Antonatos, S., Braghin, S., Holohan, N., Gkoufas, Y., & Mac Aonghusa, P. (2018). PRIMA: An End-to-End Framework for Privacy at Scale. 2018 IEEE 34th International Conference on Data Engineering (ICDE), pp. 1531-1542. DOI: 10.1109/ICDE.2018.00171
- Gkoulalas-Divanis, A., & Braghin, S. (2016). IPV: A system for identifying privacy vulnerabilities in datasets. IBM Journal of Research and Development, vol. 60, no. 4, pp. 14:1-14:10. DOI: 10.1147/JRD.2016.2576818
- Gkoulalas-Divanis, A., Braghin, S., & Antonatos, S. (2016). FPVI: A scalable method for discovering privacy vulnerabilities in microdata. 2016 IEEE International Smart Cities Conference (ISC2), pp. 1-8. DOI: 10.1109/ISC2.2016.7580849
- Gkoulalas-Divanis, A., & Braghin, S. (2015). Efficient algorithms for identifying privacy vulnerabilities. 2015 IEEE First International Smart Cities Conference (ISC2), pp. 1-8. DOI: 10.1109/ISC2.2015.7366170
🙏 Acknowledgment
This project is partly supported by the Innovative Health Initiative Joint Undertaking (IHI JU) under grant agreement No. 101172997 – SEARCH.
💬 Support & Community
- 🐛 Issues: GitHub Issues
- 💡 Discussions: GitHub Discussions
- 📧 Contact: For enterprise support, please contact the IBM Research team
Built with ❤️ by IBM Research
Download files
File details
Details for the file readi_privacy-0.1.2.tar.gz.
File metadata
- Download URL: readi_privacy-0.1.2.tar.gz
- Upload date:
- Size: 15.8 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 11393d86de97932ca3b005543e7d21c456cfda3d22fdbb487c3329dc8067dd91 |
| MD5 | 64c7a14957e76d0fa6974eaec5cf4ab9 |
| BLAKE2b-256 | 24496f971c90c8f559e8021ade73c2f0d95dcb33e8cb9c205406ecde878ed0df |
Provenance
The following attestation bundles were made for readi_privacy-0.1.2.tar.gz:
Publisher: publish.yml on IBM/READI
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: readi_privacy-0.1.2.tar.gz
- Subject digest: 11393d86de97932ca3b005543e7d21c456cfda3d22fdbb487c3329dc8067dd91
- Sigstore transparency entry: 1548273497
- Permalink: IBM/READI@45bede639c22d42f908df970f5ea689b837fcc18
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/IBM
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@45bede639c22d42f908df970f5ea689b837fcc18
- Trigger Event: push
File details
Details for the file readi_privacy-0.1.2-py3-none-any.whl.
File metadata
- Download URL: readi_privacy-0.1.2-py3-none-any.whl
- Upload date:
- Size: 13.1 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 57dee5d1cdc82c6e5fd64a8bf1d403edce0bfa4da0e0e80f3deae283e813c085 |
| MD5 | 1f1c4b115059489bb0d9da6766881598 |
| BLAKE2b-256 | af3cc7027040ef43275e7c6f36932eeacf58f188eeb1d6b581026c7f85f917c6 |
Provenance
The following attestation bundles were made for readi_privacy-0.1.2-py3-none-any.whl:
Publisher: publish.yml on IBM/READI
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: readi_privacy-0.1.2-py3-none-any.whl
- Subject digest: 57dee5d1cdc82c6e5fd64a8bf1d403edce0bfa4da0e0e80f3deae283e813c085
- Sigstore transparency entry: 1548273682
- Permalink: IBM/READI@45bede639c22d42f908df970f5ea689b837fcc18
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/IBM
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@45bede639c22d42f908df970f5ea689b837fcc18
- Trigger Event: push
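A downloaded file can be checked against the SHA256 digests listed under File hashes. The following is a minimal sketch using the standard library's hashlib; the helper names sha256_of and verify_download are illustrative only, and the local file path is an assumption about where the download was saved.

```python
import hashlib

# SHA256 digests copied from the File hashes sections
EXPECTED_SHA256 = {
    "readi_privacy-0.1.2.tar.gz":
        "11393d86de97932ca3b005543e7d21c456cfda3d22fdbb487c3329dc8067dd91",
    "readi_privacy-0.1.2-py3-none-any.whl":
        "57dee5d1cdc82c6e5fd64a8bf1d403edce0bfa4da0e0e80f3deae283e813c085",
}


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (end of file)
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """Return True if the file at path matches the expected SHA256 digest."""
    return sha256_of(path) == expected_hex.lower()


# Example (assumes the sdist was saved to the current directory):
# name = "readi_privacy-0.1.2.tar.gz"
# print(verify_download(name, EXPECTED_SHA256[name]))
```

Reading in chunks keeps memory use flat, which matters for archives of this size (13 to 16 MB).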