Skip to main content

A collection of funcionality to perform data classification, data privacy risk assessment, and enforce mitigation

Project description

🔒 READI - Risk Evaluation and De-Identification

License Python Lint Testing Publish to PyPI PyPI uv Ruff

Privacy-preserving AI made simple - A comprehensive toolkit for data privacy risk assessment and de-identification in Python-based ML pipelines.

READI augments the functionalities provided by IBM Data Privacy Toolkit, offering state-of-the-art capabilities for detecting Personal and Sensitive Information in unstructured documents. Built for modern compliance frameworks and AI model training workflows.


✨ Features

  • 🎯 Advanced PII Detection - Identify personal and sensitive information across multiple data types
  • 🔄 Seamless Integration - Low-effort integration with existing ML pipelines
  • 📊 Structured & Unstructured Data - Support for both data formats
  • 🌐 REST API - Easy-to-use HTTP interface for remote processing
  • 🧪 Extensible Framework - Modular design for custom privacy requirements
  • 📝 Comprehensive Examples - Jupyter notebooks with real-world use cases

🚀 Quick Start

Prerequisites

  • Python 3.8 or higher
  • Git with git-lfs support (for large files >50 MB)
  • uv (recommended) - A fast Python package installer

Installation

Recommended: Using uv (10-100x faster)

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate virtual environment
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install READI
uv pip install git+https://github.com/IBM/READI.git

Standard Installation with pip:

pip install git+https://github.com/IBM/READI.git

Clone Repository:

git clone https://github.com/IBM/READI.git
cd READI

# With uv (recommended)
uv pip install -e .

# Or with pip
pip install -e .

💻 Development Setup

For contributors and developers:

Recommended: Using uv

# Install in editable mode with development dependencies
uv pip install -e .
uv pip install -r requirements-dev.txt

# Set up pre-commit hooks (recommended)
pre-commit install

Alternative: Using pip

# Install in editable mode with development dependencies
pip install -e .
pip install -r requirements-dev.txt

# Set up pre-commit hooks (recommended)
pre-commit install

This installs the project in editable mode along with development tools (pytest, ruff, bandit, etc.).

💡 Tip: Using uv provides significantly faster dependency resolution and installation compared to traditional pip.


🌐 REST API Usage

READI provides a simple REST API for remote processing.

Setup

# Install with REST API support
pip install -e '.[rest]'

# Start the server
uvicorn risk_assessment.entry_points.rest.api:app

Example Request

curl -H 'Content-Type: application/json' \
     http://localhost:8000/detect_phi \
     --data-raw '{"text":"My text with email: john@gmail.com"}'

The API will be available at http://localhost:8000 with interactive documentation at /docs.


📚 Examples & Tutorials

Explore our comprehensive Jupyter notebooks in the notebooks/ directory:

Notebook Description
Unstructured Data Classification General overview of READI API for free-text processing
Structured Data Classification Working with tabular and structured datasets

📖 Documentation

For detailed documentation, API references, and advanced usage patterns, please visit our documentation portal (coming soon).


🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details on:

  • Code style and standards
  • Testing requirements
  • Pull request process
  • Development workflow

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.


📌 How to Cite

If you use READI in academic work, please cite the most relevant publication from the references below. A general citation entry is:

@software{readi_ibm,
  title        = {READI: Risk Evaluation and De-Identification},
  author       = {Stefano Braghin and Liubov Nedoshivina and Anisa Halimi and Naoise Holohan and Kieran Fraser},
  year         = {2026},
  url          = {https://github.com/IBM/READI}
}

When your usage specifically relates to unstructured document de-identification, prefer citing:

@article{nedoshivina2024pragmatic,
  title   = {Pragmatic De-Identification of Cross-Domain Unstructured Documents: A Utility-Preserving Approach with Relation Extraction Filtering},
  author  = {Liubov Nedoshivina and Anisa Halimi and Joa Bettencourt-Silva and Stefano Braghin},
  journal = {AMIA Summits on Translational Science Proceedings},
  volume  = {2024},
  pages   = {85},
  year    = {2024}
}

📚 Academic References

READI is built on years of privacy research. Key publications:

  1. Nedoshivina, L., Halimi, A., Bettencourt-Silva, J., & Braghin, S. (2024). Pragmatic De-Identification of Cross-Domain Unstructured Documents: A Utility-Preserving Approach with Relation Extraction Filtering. AMIA Summits on Translational Science Proceedings, 2024, 85.

  2. Pachilakis, M., Antonatos, S., Levacher, K., & Braghin, S. (2020). PrivLeAD: Privacy Leakage Detection on the Web. Intelligent Systems and Applications. IntelliSys 2020. Advances in Intelligent Systems and Computing, vol 1250. Springer, Cham. DOI: 10.1007/978-3-030-55180-3_32

  3. Braghin, S., Bettencourt-Silva, J. H., Levacher, K., & Antonatos, S. (2019). An Extensible De-Identification Framework for Privacy Protection of Unstructured Health Information: Creating Sustainable Privacy Infrastructures. MEDINFO 2019: Health and Wellbeing e-Networks for All (pp. 1140-1144). IOS Press. DOI: 10.3233/SHTI190404

  4. Antonatos, S., Braghin, S., Holohan, N., Gkoufas, Y., & Mac Aonghusa, P. (2018). PRIMA: An End-to-End Framework for Privacy at Scale. 2018 IEEE 34th International Conference on Data Engineering (ICDE), pp. 1531-1542. DOI: 10.1109/ICDE.2018.00171

  5. Gkoulalas-Divanis, A., & Braghin, S. (2016). IPV: A system for identifying privacy vulnerabilities in datasets. IBM Journal of Research and Development, vol. 60, no. 4, pp. 14:1-14:10. DOI: 10.1147/JRD.2016.2576818

  6. Gkoulalas-Divanis, A., Braghin, S., & Antonatos, S. (2016). FPVI: A scalable method for discovering privacy vulnerabilities in microdata. 2016 IEEE International Smart Cities Conference (ISC2), pp. 1-8. DOI: 10.1109/ISC2.2016.7580849

  7. Gkoulalas-Divanis, A., & Braghin, S. (2015). Efficient algorithms for identifying privacy vulnerabilities. 2015 IEEE First International Smart Cities Conference (ISC2), pp. 1-8. DOI: 10.1109/ISC2.2015.7366170


🙏 Acknowledgment

This project is partly supported by the Innovative Health Initiative Joint Undertaking (IHI JU) under grant agreement No. 101172997 – SEARCH.


💬 Support & Community


Built with ❤️ by IBM Research

DocumentationExamplesContributingLicense

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

readi_privacy-0.1.1.tar.gz (15.8 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

readi_privacy-0.1.1-py3-none-any.whl (13.1 MB view details)

Uploaded Python 3

File details

Details for the file readi_privacy-0.1.1.tar.gz.

File metadata

  • Download URL: readi_privacy-0.1.1.tar.gz
  • Upload date:
  • Size: 15.8 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for readi_privacy-0.1.1.tar.gz
Algorithm Hash digest
SHA256 499f95f7dc405f648fab1aaf521a55aad0106d29b6273b0dfb6f51f175cbac87
MD5 ede7a902dd69f9075bee040795f4a38e
BLAKE2b-256 3dfa3e5e6223dbaaeede81fae9483845c4602ef1b91366a0749aaac0e68a9384

See more details on using hashes here.

Provenance

The following attestation bundles were made for readi_privacy-0.1.1.tar.gz:

Publisher: publish.yml on IBM/READI

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file readi_privacy-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: readi_privacy-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 13.1 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for readi_privacy-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 a29b53a714bdcb7ee58e208278ca942698e6e3cdd3466a45946eacbb3cf88a71
MD5 fc0311c71820f86ad13d826e42eb7b7d
BLAKE2b-256 1d6c0c873fda59d2f845a0284aa63b134002bf90e1d4ae963f633777d448aaae

See more details on using hashes here.

Provenance

The following attestation bundles were made for readi_privacy-0.1.1-py3-none-any.whl:

Publisher: publish.yml on IBM/READI

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page