
A Python package for leveraging Presidio for anonymizing sensitive PII data using Spark.


RedactifyAI

RedactifyAI is a Python package for detecting and anonymizing sensitive Personally Identifiable Information (PII) in textual data using Microsoft's Presidio and Apache Spark.

Key Features

  • Integration with Presidio: Detects and anonymizes PII such as names, emails, phone numbers, and more.
  • Spark-powered processing: Handles large-scale data anonymization with PySpark.
  • Custom Recognizers: Extend PII detection with custom logic for your specific needs.

Installation

You can install RedactifyAI from PyPI or by building the wheel file locally.

Install from PyPI (if published)

pip install redactify-ai

Build Locally

  1. Clone the repository:
    git clone https://github.com/your-repo/redactify-ai.git
    cd redactify-ai
    
  2. Build the wheel:
    rm -rf build dist *.egg-info
    python setup.py sdist bdist_wheel
    
  3. Install the wheel:
    pip install dist/redactify_ai-0.0.1-py3-none-any.whl
    
  4. Upload the package to PyPI:
    1. Install Twine: pip install twine
    2. Generate an API token for your PyPI account
    3. Upload the package: twine upload dist/*

Usage

Step 1: Configuration

Prepare a config.yaml file for Presidio configuration (e.g., recognizers, anonymization rules). Example:

presidio:
   entities:
      - PERSON
      - PHONE_NUMBER
      - EMAIL_ADDRESS
      - LOCATION
      - DATE_TIME
      - CREDIT_CARD
   language: en
   score_threshold: 0.6
   mask_character: "*"
   spacy_model: en_core_web_lg # download the model of your choice, e.g. en_core_web_sm
   spacy_model_dir: /path/to/model/   # Custom model storage path
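
To make the roles of score_threshold and mask_character concrete, here is a minimal sketch in plain Python. It assumes that detections scoring at or above the threshold are masked and that each masked character is replaced one-for-one. The Detection class and mask_detections function are hypothetical illustrations, not part of RedactifyAI or Presidio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical detection result: a character span plus a confidence score."""
    entity_type: str
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    score: float


def mask_detections(text, detections, score_threshold=0.6, mask_character="*"):
    """Mask every span whose score is at or above the threshold (assumed rule)."""
    chars = list(text)
    for detection in detections:
        if detection.score >= score_threshold:
            for i in range(detection.start, min(detection.end, len(chars))):
                chars[i] = mask_character
    return "".join(chars)


text = "Call John at 555-0100."
detections = [
    Detection("PERSON", 5, 9, 0.85),         # confident match
    Detection("PHONE_NUMBER", 13, 21, 0.4),  # below the 0.6 threshold
]
masked = mask_detections(text, detections)
```

With the values above, only the name is masked. Lowering score_threshold would also mask the phone number, at the cost of more false positives.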

Step 2: Create a Processor

from redactify_ai.config import load_config
from redactify_ai.processor import PresidioDLPProcessor

# Load configuration
config = load_config("config.yaml")
processor = PresidioDLPProcessor(config)
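
Before handing a configuration to the processor, it can help to check it for obvious mistakes. The sketch below assumes the loaded configuration is a plain dictionary shaped like the YAML example and that mask_character should be a single character. validate_presidio_config is a hypothetical helper and not part of the redactify_ai API.

```python
def validate_presidio_config(config):
    """Return a list of problems found in a config dict; an empty list means none."""
    section = config.get("presidio")
    if not isinstance(section, dict):
        return ["missing top-level 'presidio' section"]

    problems = []

    entities = section.get("entities")
    if not isinstance(entities, list) or not entities:
        problems.append("'entities' must be a non-empty list")

    threshold = section.get("score_threshold")
    is_number = isinstance(threshold, (int, float)) and not isinstance(threshold, bool)
    if not is_number or not 0.0 <= threshold <= 1.0:
        problems.append("'score_threshold' must be a number between 0 and 1")

    # Assumption: the mask is a single character, as in the example config.
    mask = section.get("mask_character")
    if not isinstance(mask, str) or len(mask) != 1:
        problems.append("'mask_character' must be a single character")

    return problems


example = {
    "presidio": {
        "entities": ["PERSON", "EMAIL_ADDRESS"],
        "language": "en",
        "score_threshold": 0.6,
        "mask_character": "*",
    }
}
problems = validate_presidio_config(example)
```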

Step 3: Anonymize DataFrame with PySpark

from redactify_ai.utils import anonymize_text_udf
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("PresidioDLP").getOrCreate()

# Create mock DataFrame
data = [("Hi, I'm John Doe. Email me at john.doe@gmail.com.",)]
df = spark.createDataFrame(data, ["transcripts"])

# Apply anonymization
anonymize_udf = anonymize_text_udf(processor)
df_redacted = df.withColumn("transcripts_redacted", anonymize_udf(df["transcripts"]))
df_redacted.show(truncate=False)
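
Spark columns often contain null values, which reach Python functions as None. The sketch below shows one way a text-to-text function could be wrapped so that nulls pass through untouched. It is an illustration only: make_null_safe and toy_anonymize are hypothetical names, the regex is a stand-in for real PII detection, and whether anonymize_text_udf already handles nulls is not stated here.

```python
import re


def make_null_safe(anonymize):
    """Wrap a text -> text callable so that None passes through unchanged."""
    def anonymize_or_none(text):
        if text is None:
            return None
        return anonymize(text)
    return anonymize_or_none


# Stand-in anonymizer (assumption): masks anything that looks like an email address.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def toy_anonymize(text):
    return EMAIL_PATTERN.sub(lambda match: "*" * len(match.group()), text)


safe_anonymize = make_null_safe(toy_anonymize)
redacted = safe_anonymize("Hi, I'm John Doe. Email me at john.doe@gmail.com.")

# In a Spark job, a function like this could be registered with
# pyspark.sql.functions.udf(safe_anonymize, StringType()).
```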

Running the Pipeline

To run the pipeline script provided in this repository:

python run_pipeline.py

End-to-End Integration Testing

To verify that the RedactifyAI pipeline correctly redacts PII in a full environment, including Spark and real NLP models, an end-to-end test is provided.

Prerequisites

  • Docker installed on your system
  • test_config.yaml, the end-to-end test script (e.g., test_pipeline_integration.py), and the project source code in your working directory

Note: The test requires the en_core_web_lg spaCy model. The provided Dockerfile pre-installs this model so it is ready for use.

Running the End-to-End Test

  1. Build the Docker Image

    In your project directory, run:

    docker build -t redactify-test .
    
  2. Run the Test Suite

    Spawn a container from your image and execute the integration test:

    docker run --rm -v "$PWD:/app" -w /app redactify-test bash -c "
     python setup.py sdist bdist_wheel &&
     pip install dist/redactify_ai-0.0.1-py3-none-any.whl &&
     pytest tests/test_pipeline_integration.py --maxfail=1 --disable-warnings --tb=short
     "
    

    This builds the wheel, installs it, and runs the tests against a real Spark instance, the spaCy language model, and all necessary dependencies.

What This Test Does

  • Runs the pipeline using a sample DataFrame with mock PII.
  • Uses the actual PresidioDLPProcessor and redaction logic.
  • Asserts that PII is redacted and replaced with mask characters as defined in your configuration.
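
The kind of assertion described above can be sketched in plain Python. This is not the repository's actual test; check_redaction is a hypothetical helper, and the sample strings are made up for illustration.

```python
def check_redaction(redacted, pii_values, mask_character="*"):
    """Return (leaked_values, mask_present) for a redacted string."""
    leaked = [value for value in pii_values if value in redacted]
    mask_present = mask_character in redacted
    return leaked, mask_present


# Made-up sample: the original PII values and a redacted output to check.
pii_values = ["John Doe", "john.doe@gmail.com"]
redacted_text = "Hi, I'm ********. Email me at ******************."

leaked, mask_present = check_redaction(redacted_text, pii_values)
assert leaked == [], f"PII leaked into output: {leaked}"
assert mask_present, "expected mask characters in the redacted output"
```

Checking both conditions guards against two different failures: PII surviving in the output, and the text passing through without any masking applied at all.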

How to Interpret Results

  • The test will pass if the sensitive information is successfully redacted from the output.
  • The script will print the redacted text, which should show mask characters (e.g., *) in place of PII.

If you make changes to the model configuration or the pipeline logic, rerunning this end-to-end test will ensure your modifications continue to correctly anonymize sensitive data.

For any issues, please check that your Docker build completes successfully and that your working directory contains all the necessary files (test_config.yaml, test script, and source code).

Contributing

Contributions are welcome! Please create issues or pull requests if you find bugs or would like to add new features.

License

This project is licensed under the MIT License. See the LICENSE file for details.
