RedactifyAI
RedactifyAI is a Python package for detecting and anonymizing sensitive Personally Identifiable Information (PII) in textual data using Microsoft's Presidio and Apache Spark.
Key Features
- Integration with Presidio: Detects and anonymizes PII such as names, emails, phone numbers, and more.
- Spark-powered processing: Handle large-scale data anonymization with PySpark.
- Custom Recognizers: Extend PII detection with custom logic for your specific needs.
Installation
You can install RedactifyAI from PyPI or by building the wheel file locally.
Install from PyPI (if published)
pip install redactify-ai
Build Locally
- Clone the repository:
git clone https://github.com/your-repo/redactify-ai.git
cd redactify-ai
- Build the wheel:
rm -rf build dist *.egg-info
python setup.py sdist bdist_wheel
- Install the wheel:
pip install dist/redactify_ai-0.0.1-py3-none-any.whl
Upload the Package to PyPI
- Install Twine:
pip install twine
- Generate an API token for your PyPI account
- Upload the package:
twine upload dist/*
Usage
Step 1: Configuration
Prepare a config.yaml file for Presidio configuration (e.g., recognizers, anonymization rules).
Example:
presidio:
  entities:
    - PERSON
    - PHONE_NUMBER
    - EMAIL_ADDRESS
    - LOCATION
    - DATE_TIME
    - CREDIT_CARD
  language: en
  score_threshold: 0.6
  mask_character: "*"
  spacy_model: en_core_web_lg  # download the model of your choice, e.g. en_core_web_sm
  spacy_model_dir: /path/to/model/  # custom model storage path
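To illustrate how a `score_threshold` like the one above is typically applied, detections whose confidence score falls below the threshold are discarded before anonymization. A minimal sketch (the `filter_detections` helper and the detection dicts are hypothetical illustrations, not part of RedactifyAI's API):

```python
# Hypothetical illustration of how score_threshold filters detections
# before anonymization; the real package delegates this to Presidio.
SCORE_THRESHOLD = 0.6  # matches score_threshold in config.yaml

def filter_detections(detections, threshold=SCORE_THRESHOLD):
    """Keep only detections whose confidence meets the threshold."""
    return [d for d in detections if d["score"] >= threshold]

detections = [
    {"entity": "PERSON", "score": 0.85, "text": "John Doe"},
    {"entity": "PHONE_NUMBER", "score": 0.40, "text": "555-0100"},  # below 0.6, dropped
]
print(filter_detections(detections))
```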
Step 2: Create a Processor
from redactify_ai.config import load_config
from redactify_ai.processor import PresidioDLPProcessor
# Load configuration
config = load_config("config.yaml")
processor = PresidioDLPProcessor(config)
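The processor's output follows the `mask_character` setting from the configuration: each detected PII span is replaced by a run of mask characters. The following sketch mimics that behavior for email addresses only, using a simple regex; it is an illustration of the redaction format, not Presidio's actual recognizer logic (`mask_emails` is a hypothetical helper):

```python
import re

# Illustration only: RedactifyAI relies on Presidio recognizers, not this
# regex. The sketch shows the mask_character behavior from config.yaml.
MASK = "*"

def mask_emails(text, mask=MASK):
    """Replace each character of an email address with the mask character."""
    return re.sub(
        r"[\w.+-]+@[\w-]+\.[A-Za-z]{2,}",
        lambda m: mask * len(m.group()),
        text,
    )

# The email is replaced by a run of mask characters of equal length.
print(mask_emails("Email me at john.doe@gmail.com."))
```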
Step 3: Anonymize DataFrame with PySpark
from redactify_ai.utils import anonymize_text_udf
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("PresidioDLP").getOrCreate()
# Create mock DataFrame
data = [("Hi, I'm John Doe. Email me at john.doe@gmail.com.",)]
df = spark.createDataFrame(data, ["transcripts"])
# Apply anonymization
anonymize_udf = anonymize_text_udf(processor)
df_redacted = df.withColumn("transcripts_redacted", anonymize_udf(df["transcripts"]))
df_redacted.show(truncate=False)
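`anonymize_text_udf` presumably wraps the processor's per-row redaction in a Spark UDF. A hypothetical sketch of that pattern is below; the names (`make_anonymize_fn`, `StubProcessor`) are assumptions for illustration, and in the real package the plain function would be registered via `pyspark.sql.functions.udf(..., StringType())`:

```python
# Hypothetical sketch of the wrapper pattern behind anonymize_text_udf.
# The inner function is what Spark would call once per row; here we
# exercise it directly with a stub instead of a live Spark session.

class StubProcessor:
    """Stand-in for PresidioDLPProcessor; masks one fixed PII string."""
    def anonymize(self, text):
        return text.replace("john.doe@gmail.com", "*" * 18)

def make_anonymize_fn(processor):
    def _anonymize(text):
        # Spark passes NULL cells as None; pass them through untouched.
        if text is None:
            return None
        return processor.anonymize(text)
    return _anonymize

fn = make_anonymize_fn(StubProcessor())
print(fn("Email me at john.doe@gmail.com."))
```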
Running the Pipeline
To run the pipeline script provided in this repository:
python run_pipeline.py
End-to-End Integration Testing
If you want to verify that the RedactifyAI pipeline correctly redacts PII in a full environment, including Spark and real NLP models, an end-to-end test is provided.
Prerequisites
- Docker installed on your system
- test_config.yaml, the end-to-end test script (e.g., test_pipeline_integration.py), and the project source code in your working directory
Note: The test requires the en_core_web_lg SpaCy model to be present. The provided Dockerfile ensures this model is pre-installed and ready for use.
Running the End-to-End Test
- Build the Docker Image
In your project directory, run:
docker build -t redactify-test .
- Run the Test Suite
Spawn a container from your image and execute the integration test:
docker run --rm -v "$PWD:/app" -w /app redactify-test bash -c "
  python setup.py sdist bdist_wheel &&
  pip install dist/redactify_ai-0.0.1-py3-none-any.whl &&
  pytest tests/test_pipeline_integration.py --maxfail=1 --disable-warnings --tb=short
"
This builds the wheel, installs it, and runs the test suite against a real Spark instance, the SpaCy language model, and all necessary dependencies.
What This Test Does
- Runs the pipeline using a sample DataFrame with mock PII.
- Uses the actual PresidioDLPProcessor and redaction logic.
- Asserts that PII is redacted and replaced with mask characters as defined in your configuration.
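The assertions in such a test typically follow the pattern sketched below. The exact body of `test_pipeline_integration.py` may differ; the `redacted` string here is a hypothetical stand-in for the value pulled out of the Spark DataFrame:

```python
# Sketch of the kind of assertions an end-to-end redaction test makes.
# "redacted" stands in for the value read from the redacted DataFrame column.
original = "Hi, I'm John Doe. Email me at john.doe@gmail.com."
redacted = "Hi, I'm ********. Email me at ******************."

# Raw PII must not survive redaction.
assert "John Doe" not in redacted
assert "john.doe@gmail.com" not in redacted
# The mask character from config.yaml should appear in its place.
assert "*" in redacted
# Non-PII text is preserved.
assert redacted.startswith("Hi, I'm")
print("redaction assertions passed")
```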
How to Interpret Results
- The test will pass if the sensitive information is successfully redacted from the output.
- The script will print the redacted text, which should show mask characters (e.g., *) in place of PII.
If you make changes to the model configuration or the pipeline logic, rerunning this end-to-end test will ensure your modifications continue to correctly anonymize sensitive data.
For any issues, please check that your Docker build completes successfully and that your working directory contains all
the necessary files (test_config.yaml, test script, and source code).
Contributing
Contributions are welcome! Please create issues or pull requests if you find bugs or would like to add new features.
License
This project is licensed under the MIT License. See the LICENSE file for details.
File details
Details for the file redactify_ai-0.0.1.tar.gz.
File metadata
- Download URL: redactify_ai-0.0.1.tar.gz
- Upload date:
- Size: 8.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6018e97c02ecbcfa7b63de6ac905518857ae649b0bfa498b1f96f95b1b930a4d |
| MD5 | 4e7c45f6a975c74442fe867c87e069ee |
| BLAKE2b-256 | a989e772395cfc13847f2adb22350c36834c1ec9c1a914b4493499f23828e6dc |
File details
Details for the file redactify_ai-0.0.1-py3-none-any.whl.
File metadata
- Download URL: redactify_ai-0.0.1-py3-none-any.whl
- Upload date:
- Size: 7.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 44fe6609b881df3e425c96de5e2b3eb9c147261a07de1372cad2398b1203f7b9 |
| MD5 | fe1cd762862ee66a9f7e0c4a3907f576 |
| BLAKE2b-256 | fba621d6e15cf2e336918519cc364baab4ddb6542bc8d4bd0ae4bbeca96977b1 |