RedactifyAI
RedactifyAI is a Python package for detecting and anonymizing sensitive Personally Identifiable Information (PII) in textual data using Microsoft's Presidio and Apache Spark.
Key Features
- Integration with Presidio: Detects and anonymizes PII such as names, emails, phone numbers, and more.
- Spark-powered processing: Handle large-scale data anonymization with PySpark.
- Custom Recognizers: Extend PII detection with custom logic for your specific needs.
Installation
You can install RedactifyAI from PyPI or by building the wheel file locally.
Install from PyPI (if published)
pip install redactify-ai
Build Locally
- Clone the repository:
git clone https://github.com/your-repo/redactify-ai.git
cd redactify-ai
- Build the wheel:
rm -rf build dist *.egg-info
python setup.py sdist bdist_wheel
- Install the wheel:
pip install dist/redactify_ai-0.0.1-py3-none-any.whl
Upload the Package to PyPI
- Install Twine:
pip install twine
- Generate an API token for your PyPI account
- Upload the package:
twine upload dist/*
Usage
Step 1: Configuration
Prepare a config.yaml file for Presidio configuration (e.g., recognizers, anonymization rules).
Example:
presidio:
  entities:
    - PERSON
    - PHONE_NUMBER
    - EMAIL_ADDRESS
    - LOCATION
    - DATE_TIME
    - CREDIT_CARD
  language: en
  score_threshold: 0.6
  mask_character: "*"
  spacy_model: en_core_web_lg  # download the model of your choice, e.g. en_core_web_sm
  spacy_model_dir: /path/to/model/  # custom model storage path
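To illustrate how a `score_threshold` like the one above is typically applied, detections whose confidence score falls below the threshold are discarded before anonymization. A minimal sketch (the `filter_detections` helper and the detection dicts are hypothetical illustrations, not part of RedactifyAI's API):

```python
# Hypothetical illustration of how score_threshold filters detections
# before anonymization; the real package delegates this to Presidio.
SCORE_THRESHOLD = 0.6  # matches score_threshold in config.yaml

def filter_detections(detections, threshold=SCORE_THRESHOLD):
    """Keep only detections whose confidence meets the threshold."""
    return [d for d in detections if d["score"] >= threshold]

detections = [
    {"entity": "PERSON", "score": 0.85, "text": "John Doe"},
    {"entity": "PHONE_NUMBER", "score": 0.40, "text": "555-0100"},  # below 0.6, dropped
]
print(filter_detections(detections))
```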
Step 2: Create a Processor
from redactify_ai.config import load_config
from redactify_ai.processor import PresidioDLPProcessor
# Load configuration
config = load_config("config.yaml")
processor = PresidioDLPProcessor(config)
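The processor's output follows the `mask_character` setting from the configuration: each detected PII span is replaced by a run of mask characters. The following sketch mimics that behavior for email addresses only, using a simple regex; it is an illustration of the redaction format, not Presidio's actual recognizer logic (`mask_emails` is a hypothetical helper):

```python
import re

# Illustration only: RedactifyAI relies on Presidio recognizers, not this
# regex. The sketch shows the mask_character behavior from config.yaml.
MASK = "*"

def mask_emails(text, mask=MASK):
    """Replace each character of an email address with the mask character."""
    return re.sub(
        r"[\w.+-]+@[\w-]+\.[A-Za-z]{2,}",
        lambda m: mask * len(m.group()),
        text,
    )

# The email is replaced by a run of mask characters of equal length.
print(mask_emails("Email me at john.doe@gmail.com."))
```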
Step 3: Anonymize DataFrame with PySpark
from redactify_ai.utils import anonymize_text_udf
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("PresidioDLP").getOrCreate()
# Create mock DataFrame
data = [("Hi, I'm John Doe. Email me at john.doe@gmail.com.",)]
df = spark.createDataFrame(data, ["transcripts"])
# Apply anonymization
anonymize_udf = anonymize_text_udf(processor)
df_redacted = df.withColumn("transcripts_redacted", anonymize_udf(df["transcripts"]))
df_redacted.show(truncate=False)
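`anonymize_text_udf` presumably wraps the processor's per-row redaction in a Spark UDF. A hypothetical sketch of that pattern is below; the names (`make_anonymize_fn`, `StubProcessor`) are assumptions for illustration, and in the real package the plain function would be registered via `pyspark.sql.functions.udf(..., StringType())`:

```python
# Hypothetical sketch of the wrapper pattern behind anonymize_text_udf.
# The inner function is what Spark would call once per row; here we
# exercise it directly with a stub instead of a live Spark session.

class StubProcessor:
    """Stand-in for PresidioDLPProcessor; masks one fixed PII string."""
    def anonymize(self, text):
        return text.replace("john.doe@gmail.com", "*" * 18)

def make_anonymize_fn(processor):
    def _anonymize(text):
        # Spark passes NULL cells as None; pass them through untouched.
        if text is None:
            return None
        return processor.anonymize(text)
    return _anonymize

fn = make_anonymize_fn(StubProcessor())
print(fn("Email me at john.doe@gmail.com."))
```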
Running the Pipeline
To run the pipeline script provided in this repository:
python run_pipeline.py
End-to-End Integration Testing
If you want to verify that the RedactifyAI pipeline correctly redacts PII in a full environment, including Spark and real NLP models, an end-to-end test is provided.
Prerequisites
- Docker installed on your system
- test_config.yaml, the end-to-end test script (e.g., test_pipeline_integration.py), and the project source code in your working directory
Note: The test requires the en_core_web_lg SpaCy model to be present. The provided Dockerfile ensures this model is pre-installed and ready for use.
Running the End-to-End Test
- Build the Docker Image
In your project directory, run:
docker build -t redactify-test .
- Run the Test Suite
Spawn a container from your image and execute the integration test:
docker run --rm -v "$PWD:/app" -w /app redactify-test bash -c "
  python setup.py sdist bdist_wheel &&
  pip install dist/redactify_ai-0.0.1-py3-none-any.whl &&
  pytest tests/test_pipeline_integration.py --maxfail=1 --disable-warnings --tb=short
"
This builds the wheel, installs it, and runs the test suite against a real Spark instance, the SpaCy language model, and all necessary dependencies.
What This Test Does
- Runs the pipeline using a sample DataFrame with mock PII.
- Uses the actual PresidioDLPProcessor and redaction logic.
- Asserts that PII is redacted and replaced with mask characters as defined in your configuration.
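The assertions in such a test typically follow the pattern sketched below. The exact body of `test_pipeline_integration.py` may differ; the `redacted` string here is a hypothetical stand-in for the value pulled out of the Spark DataFrame:

```python
# Sketch of the kind of assertions an end-to-end redaction test makes.
# "redacted" stands in for the value read from the redacted DataFrame column.
original = "Hi, I'm John Doe. Email me at john.doe@gmail.com."
redacted = "Hi, I'm ********. Email me at ******************."

# Raw PII must not survive redaction.
assert "John Doe" not in redacted
assert "john.doe@gmail.com" not in redacted
# The mask character from config.yaml should appear in its place.
assert "*" in redacted
# Non-PII text is preserved.
assert redacted.startswith("Hi, I'm")
print("redaction assertions passed")
```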
How to Interpret Results
- The test will pass if the sensitive information is successfully redacted from the output.
- The script will print the redacted text, which should show mask characters (e.g., *) in place of PII.
If you make changes to the model configuration or the pipeline logic, rerunning this end-to-end test will ensure your modifications continue to correctly anonymize sensitive data.
For any issues, please check that your Docker build completes successfully and that your working directory contains all
the necessary files (test_config.yaml, test script, and source code).
Contributing
Contributions are welcome! Please create issues or pull requests if you find bugs or would like to add new features.
License
This project is licensed under the MIT License. See the LICENSE file for details.
File details
Details for the file redactify_ai-0.0.1.tar.gz.
File metadata
- Download URL: redactify_ai-0.0.1.tar.gz
- Upload date:
- Size: 8.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6018e97c02ecbcfa7b63de6ac905518857ae649b0bfa498b1f96f95b1b930a4d |
| MD5 | 4e7c45f6a975c74442fe867c87e069ee |
| BLAKE2b-256 | a989e772395cfc13847f2adb22350c36834c1ec9c1a914b4493499f23828e6dc |
File details
Details for the file redactify_ai-0.0.1-py3-none-any.whl.
File metadata
- Download URL: redactify_ai-0.0.1-py3-none-any.whl
- Upload date:
- Size: 7.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 44fe6609b881df3e425c96de5e2b3eb9c147261a07de1372cad2398b1203f7b9 |
| MD5 | fe1cd762862ee66a9f7e0c4a3907f576 |
| BLAKE2b-256 | fba621d6e15cf2e336918519cc364baab4ddb6542bc8d4bd0ae4bbeca96977b1 |