A collection of functionality for data classification, data privacy risk assessment, and mitigation enforcement

🔒 READI - Risk Evaluation and De-Identification

Privacy-preserving AI made simple - A comprehensive toolkit for data privacy risk assessment and de-identification in Python-based ML pipelines.

READI augments the functionality of the IBM Data Privacy Toolkit with state-of-the-art detection of Personal and Sensitive Information in unstructured documents. It is built for modern compliance frameworks and AI model training workflows.


✨ Features

  • 🎯 Advanced PII Detection - Identify personal and sensitive information across multiple data types
  • 🔄 Seamless Integration - Low-effort integration with existing ML pipelines
  • 📊 Structured & Unstructured Data - Support for both data formats
  • 🌐 REST API - Easy-to-use HTTP interface for remote processing
  • 🧪 Extensible Framework - Modular design for custom privacy requirements
  • 📝 Comprehensive Examples - Jupyter notebooks with real-world use cases

🚀 Quick Start

Prerequisites

  • Python 3.11 or higher
  • Git with git-lfs support (for large files >50 MB)
  • uv (recommended) - A fast Python package installer
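
The prerequisites above can be checked from Python before installing. This is a minimal sketch and not part of READI: the helper name `check_prerequisites` is hypothetical, and it assumes that `git`, `git-lfs`, and `uv` are installed as executables of those names on your PATH.

```python
import shutil
import sys

MIN_PYTHON = (3, 11)  # minimum version listed in the prerequisites


def check_prerequisites(version_info=None, which=shutil.which):
    """Return a dict mapping each prerequisite to True or False.

    Hypothetical helper for illustration; not part of READI.
    """
    if version_info is None:
        version_info = sys.version_info
    return {
        "python>=3.11": tuple(version_info[:2]) >= MIN_PYTHON,
        "git": which("git") is not None,
        # Assumes git-lfs is exposed under its usual executable name.
        "git-lfs": which("git-lfs") is not None,
        # uv is recommended, not required.
        "uv (optional)": which("uv") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

The `version_info` and `which` parameters exist only so the check can be exercised without touching the real environment.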

Installation

Recommended: Using uv (10-100x faster than pip, according to the uv project)

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate virtual environment
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install READI
uv pip install git+https://github.com/IBM/READI.git

Standard Installation with pip:

pip install git+https://github.com/IBM/READI.git

Clone Repository:

git clone https://github.com/IBM/READI.git
cd READI

# With uv (recommended)
uv pip install -e .

# Or with pip
pip install -e .
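
After any of the installation routes above, a quick way to confirm that the package is visible to your interpreter is to look up its module without importing it. This is a sketch under one assumption: the top-level module is named `risk_assessment`, inferred from the `uvicorn` command in the REST API section. The helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name="risk_assessment"):
    """Return True if a top-level module can be located, without importing it.

    The default name is an assumption based on the uvicorn entry point
    shown in the REST API section.
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("READI importable:", is_installed())
```

If this prints `False`, check that the virtual environment you installed into is the one currently activated.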

💻 Development Setup

For contributors and developers:

Recommended: Using uv

# Install in editable mode with development dependencies
uv pip install -e .
uv pip install -r requirements-dev.txt

# Set up pre-commit hooks (recommended)
pre-commit install

Alternative: Using pip

# Install in editable mode with development dependencies
pip install -e .
pip install -r requirements-dev.txt

# Set up pre-commit hooks (recommended)
pre-commit install

This installs the project in editable mode along with development tools (pytest, ruff, bandit, etc.).

💡 Tip: Using uv provides significantly faster dependency resolution and installation compared to traditional pip.


🌐 REST API Usage

READI provides a simple REST API for remote processing.

Setup

# Install with REST API support (run from the root of the cloned repository)
pip install -e '.[rest]'

# Start the server
uvicorn risk_assessment.entry_points.rest.api:app

Example Request

curl -H 'Content-Type: application/json' \
     http://localhost:8000/detect_phi \
     --data-raw '{"text":"My text with email: john@gmail.com"}'

The API will be available at http://localhost:8000 with interactive documentation at /docs.
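
The same request can be sent from Python using only the standard library. This is a minimal sketch mirroring the curl call above: the `/detect_phi` path and the `{"text": ...}` payload come from that example, while the function names `build_detect_request` and `detect_phi` are hypothetical. The response schema is not described here, so the sketch returns the decoded JSON unchanged.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # address used in the curl example


def build_detect_request(text, base_url=BASE_URL):
    """Build a POST request for /detect_phi with a JSON body."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/detect_phi",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def detect_phi(text, base_url=BASE_URL, timeout=30):
    """Send the request and return the decoded JSON response as-is."""
    request = build_detect_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires the server from the Setup step to be running):
#   result = detect_phi("My text with email: john@gmail.com")
#   print(result)
```

Building the request separately from sending it makes it possible to inspect the URL, headers, and body without a running server.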


📚 Examples & Tutorials

Explore our comprehensive Jupyter notebooks in the notebooks/ directory:

  • Unstructured Data Classification - General overview of READI API for free-text processing
  • Structured Data Classification - Working with tabular and structured datasets

📖 Documentation

For detailed documentation, API references, and advanced usage patterns, please visit our documentation portal (coming soon).


🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details on:

  • Code style and standards
  • Testing requirements
  • Pull request process
  • Development workflow

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.


📌 How to Cite

If you use READI in academic work, please cite the most relevant publication from the references below. A general citation entry is:

@software{readi_ibm,
  title        = {READI: Risk Evaluation and De-Identification},
  author       = {Stefano Braghin and Liubov Nedoshivina and Anisa Halimi and Naoise Holohan and Kieran Fraser},
  year         = {2026},
  url          = {https://github.com/IBM/READI}
}

When your usage specifically relates to unstructured document de-identification, prefer citing:

@article{nedoshivina2024pragmatic,
  title   = {Pragmatic De-Identification of Cross-Domain Unstructured Documents: A Utility-Preserving Approach with Relation Extraction Filtering},
  author  = {Liubov Nedoshivina and Anisa Halimi and Joa Bettencourt-Silva and Stefano Braghin},
  journal = {AMIA Summits on Translational Science Proceedings},
  volume  = {2024},
  pages   = {85},
  year    = {2024}
}

📚 Academic References

READI is built on years of privacy research. Key publications:

  1. Nedoshivina, L., Halimi, A., Bettencourt-Silva, J., & Braghin, S. (2024). Pragmatic De-Identification of Cross-Domain Unstructured Documents: A Utility-Preserving Approach with Relation Extraction Filtering. AMIA Summits on Translational Science Proceedings, 2024, 85.

  2. Pachilakis, M., Antonatos, S., Levacher, K., & Braghin, S. (2020). PrivLeAD: Privacy Leakage Detection on the Web. Intelligent Systems and Applications. IntelliSys 2020. Advances in Intelligent Systems and Computing, vol 1250. Springer, Cham. DOI: 10.1007/978-3-030-55180-3_32

  3. Braghin, S., Bettencourt-Silva, J. H., Levacher, K., & Antonatos, S. (2019). An Extensible De-Identification Framework for Privacy Protection of Unstructured Health Information: Creating Sustainable Privacy Infrastructures. MEDINFO 2019: Health and Wellbeing e-Networks for All (pp. 1140-1144). IOS Press. DOI: 10.3233/SHTI190404

  4. Antonatos, S., Braghin, S., Holohan, N., Gkoufas, Y., & Mac Aonghusa, P. (2018). PRIMA: An End-to-End Framework for Privacy at Scale. 2018 IEEE 34th International Conference on Data Engineering (ICDE), pp. 1531-1542. DOI: 10.1109/ICDE.2018.00171

  5. Gkoulalas-Divanis, A., & Braghin, S. (2016). IPV: A system for identifying privacy vulnerabilities in datasets. IBM Journal of Research and Development, vol. 60, no. 4, pp. 14:1-14:10. DOI: 10.1147/JRD.2016.2576818

  6. Gkoulalas-Divanis, A., Braghin, S., & Antonatos, S. (2016). FPVI: A scalable method for discovering privacy vulnerabilities in microdata. 2016 IEEE International Smart Cities Conference (ISC2), pp. 1-8. DOI: 10.1109/ISC2.2016.7580849

  7. Gkoulalas-Divanis, A., & Braghin, S. (2015). Efficient algorithms for identifying privacy vulnerabilities. 2015 IEEE First International Smart Cities Conference (ISC2), pp. 1-8. DOI: 10.1109/ISC2.2015.7366170


🙏 Acknowledgment

This project is partly supported by the Innovative Health Initiative Joint Undertaking (IHI JU) under grant agreement No. 101172997 – SEARCH.


Built with ❤️ by IBM Research
