Scan, redact, and manage PII in your documents before they get uploaded to a Retrieval Augmented Generation (RAG) system.
Open-source PII Detection & Anonymization.
Installation
DataFog can be installed via pip:
pip install datafog
CLI
📚 Quick Reference
Command | Description |
---|---|
scan-text | Analyze text for PII |
scan-image | Extract and analyze text from images |
redact-text | Redact PII in text |
replace-text | Replace PII with anonymized values |
hash-text | Hash PII in text |
health | Check service status |
show-config | Display current settings |
download-model | Get a specific spaCy model |
list-spacy-models | Show available models |
list-entities | View supported PII entities |
🔍 Detailed Usage
Scanning Text
To scan and annotate text for PII entities:
datafog scan-text "Your text here"
Example:
datafog scan-text "Tim Cook is the CEO of Apple and is based out of Cupertino, California"
Scanning Images
To extract text from images and optionally perform PII annotation:
datafog scan-image "path/to/image.png" --operations extract
Example:
datafog scan-image "nokia-statement.png" --operations extract
To extract text and annotate PII:
datafog scan-image "nokia-statement.png" --operations scan
Redacting Text
To redact PII in text:
datafog redact-text "Tim Cook is the CEO of Apple and is based out of Cupertino, California"
which should output:
[REDACTED] is the CEO of [REDACTED] and is based out of [REDACTED], [REDACTED]
Replacing Text
To replace detected PII:
datafog replace-text "Tim Cook is the CEO of Apple and is based out of Cupertino, California"
which should return something like:
[PERSON_B86CACE6] is the CEO of [UNKNOWN_445944D7] and is based out of [UNKNOWN_32BA5DCA], [UNKNOWN_B7DF4969]
Note: A unique, randomly generated identifier is created for each detected entity.
Hashing Text
You can hash detected PII with the SHA256, SHA3-256, or MD5 algorithm; the default is SHA256. Currently, the hashed output does not match the length of the original entity, which is intended to preserve privacy.
datafog hash-text "Tim Cook is the CEO of Apple and is based out of Cupertino, California"
generating an output which looks like this:
5738a37f0af81594b8a8fd677e31b5e2cabd6d7791c89b9f0a1c233bb563ae39 is the CEO of f223faa96f22916294922b171a2696d868fd1f9129302eb41a45b2a2ea2ebbfd and is based out of ab5f41f04096cf7cd314357c4be26993eeebc0c094ca668506020017c35b7a9c, cad0535decc38b248b40e7aef9a1cfd91ce386fa5c46f05ea622649e7faf18fb
Utility Commands
🏥 Health Check
datafog health
⚙️ Show Configuration
datafog show-config
📥 Download Model
datafog download-model en_core_web_sm
📂 Show Model Directory
datafog show-spacy-model-directory en_core_web_sm
📋 List Models
datafog list-spacy-models
🏷️ List Entities
datafog list-entities
⚠️ Important Notes
- For scan-image and scan-text commands, use --operations to specify different operations. Default is scan.
- Process multiple images or text strings in a single command by providing multiple arguments.
- Ensure proper permissions and configuration of the DataFog service before running commands.
💡 Tip: For more detailed information on each command, use the --help option, e.g., datafog scan-text --help.
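When the CLI is driven from a script, the notes above suggest an argument list along the following lines. This is a minimal sketch: build_scan_text_command is a hypothetical helper, not part of DataFog, and placing --operations after the positional arguments is an assumption modeled on the scan-image examples above.

```python
def build_scan_text_command(texts, operations="scan"):
    """Build an argv list for `datafog scan-text` (hypothetical helper)."""
    if not texts:
        raise ValueError("at least one text string is required")
    # Each text becomes its own positional argument, so no shell quoting is needed.
    return ["datafog", "scan-text", *texts, "--operations", operations]


command = build_scan_text_command(
    ["Tim Cook is the CEO of Apple", "He is based out of Cupertino, California"]
)
print(command)
```

The resulting list can be handed to subprocess.run(command) on a machine where the datafog executable is installed; check datafog scan-text --help to confirm the accepted options first.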
Python SDK
Getting Started
To use DataFog, you'll need to create a DataFog client with the desired operations. Here's a basic setup:
from datafog import DataFog
# For text annotation
client = DataFog(operations="scan")
# For OCR (Optical Character Recognition)
ocr_client = DataFog(operations="extract")
Text PII Annotation
Here's an example of how to annotate PII in a text document:
import requests
# Fetch sample medical record
doc_url = "https://gist.githubusercontent.com/sidmohan0/b43b72693226422bac5f083c941ecfdb/raw/b819affb51796204d59987893f89dee18428ed5d/note1.txt"
response = requests.get(doc_url)
text_lines = [line for line in response.text.splitlines() if line.strip()]
# Run annotation
annotations = client.run_text_pipeline_sync(str_list=text_lines)
print(annotations)
OCR PII Annotation
For OCR capabilities, you can use the following:
import asyncio
import nest_asyncio
nest_asyncio.apply()
async def run_ocr_pipeline_demo():
    image_url = "https://s3.amazonaws.com/thumbnails.venngage.com/template/dc377004-1c2d-49f2-8ddf-d63f11c8d9c2.png"
    results = await ocr_client.run_ocr_pipeline(image_urls=[image_url])
    print("OCR Pipeline Results:", results)
loop = asyncio.get_event_loop()
loop.run_until_complete(run_ocr_pipeline_demo())
Note: The DataFog library uses asynchronous programming for OCR, so make sure to use the async/await syntax when calling the appropriate methods.
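For readers new to async/await, the calling pattern can be tried without DataFog installed. The sketch below uses fake_ocr, a hypothetical stand-in coroutine that is not part of the library, and plain asyncio.run, which is sufficient in a regular script; nest_asyncio is typically only needed where an event loop is already running, such as in a Jupyter notebook.

```python
import asyncio


async def fake_ocr(image_url):
    """Hypothetical stand-in for an async OCR call; returns a placeholder string."""
    await asyncio.sleep(0)  # yield control, as a real network or OCR call would
    return f"text extracted from {image_url}"


async def main():
    urls = ["invoice-1.png", "invoice-2.png"]
    # Await several coroutines concurrently; results keep the input order.
    return await asyncio.gather(*(fake_ocr(url) for url in urls))


results = asyncio.run(main())
print(results)
```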
Text Anonymization
DataFog provides various anonymization techniques to protect sensitive information. Here are examples of how to use them:
Redacting Text
To redact PII in text:
from datafog import DataFog
from datafog.config import OperationType
client = DataFog(operations=[OperationType.SCAN, OperationType.REDACT])
text = "Tim Cook is the CEO of Apple and is based out of Cupertino, California"
redacted_text = client.run_text_pipeline_sync([text])[0]
print(redacted_text)
Output:
[REDACTED] is the CEO of [REDACTED] and is based out of [REDACTED], [REDACTED]
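To make the redaction step concrete, here is a minimal sketch of span-based masking. It is not DataFog's implementation: redact_spans is a hypothetical helper, and the entity spans are located by hand rather than by an NER model.

```python
def redact_spans(text, spans, mask="[REDACTED]"):
    """Replace each (start, end) character span in `text` with `mask`."""
    # Work right to left so earlier offsets stay valid after each substitution.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + mask + text[end:]
    return text


text = "Tim Cook is the CEO of Apple and is based out of Cupertino, California"
# Spans are found by simple string search here; a real pipeline would detect them.
entities = ["Tim Cook", "Apple", "Cupertino", "California"]
spans = [(text.index(e), text.index(e) + len(e)) for e in entities]
print(redact_spans(text, spans))
```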
Replacing Text
To replace detected PII with unique identifiers:
from datafog import DataFog
from datafog.config import OperationType
client = DataFog(operations=[OperationType.SCAN, OperationType.REPLACE])
text = "Tim Cook is the CEO of Apple and is based out of Cupertino, California"
replaced_text = client.run_text_pipeline_sync([text])[0]
print(replaced_text)
Output:
[PERSON_B86CACE6] is the CEO of [UNKNOWN_445944D7] and is based out of [UNKNOWN_32BA5DCA], [UNKNOWN_B7DF4969]
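The replacement tokens above follow a label-plus-random-suffix shape. The sketch below reproduces that shape under stated assumptions: pseudonymize is a hypothetical helper, the entity labels are supplied by hand, and the returned token-to-original mapping is an addition for illustration, not something the output above shows DataFog providing.

```python
import secrets


def pseudonymize(text, entities):
    """Swap each (surface_text, label) pair for a token like [LABEL_1A2B3C4D]."""
    mapping = {}
    for surface, label in entities:
        # 4 random bytes give 8 hex characters, matching the token shape shown above.
        token = f"[{label}_{secrets.token_hex(4).upper()}]"
        mapping[token] = surface
        text = text.replace(surface, token)
    return text, mapping


replaced, mapping = pseudonymize(
    "Tim Cook is the CEO of Apple",
    [("Tim Cook", "PERSON"), ("Apple", "UNKNOWN")],
)
print(replaced)
```

Because the suffix is random, the printed tokens differ on every run. The plain str.replace call is a simplification; a span-based substitution would be safer for overlapping or repeated entities.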
Hashing Text
To hash detected PII:
from datafog import DataFog
from datafog.config import OperationType
from datafog.models.anonymizer import HashType
client = DataFog(operations=[OperationType.SCAN, OperationType.HASH], hash_type=HashType.SHA256)
text = "Tim Cook is the CEO of Apple and is based out of Cupertino, California"
hashed_text = client.run_text_pipeline_sync([text])[0]
print(hashed_text)
Output:
5738a37f0af81594b8a8fd677e31b5e2cabd6d7791c89b9f0a1c233bb563ae39 is the CEO of f223faa96f22916294922b171a2696d868fd1f9129302eb41a45b2a2ea2ebbfd and is based out of ab5f41f04096cf7cd314357c4be26993eeebc0c094ca668506020017c35b7a9c, cad0535decc38b248b40e7aef9a1cfd91ce386fa5c46f05ea622649e7faf18fb
You can choose from SHA256 (default), SHA3-256, and MD5 hashing algorithms by specifying the hash_type parameter.
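The effect of the algorithm choice on output length can be checked with Python's standard hashlib module. This is an illustrative sketch, not DataFog's code: the HASHERS mapping and hash_entity helper are hypothetical, and the UTF-8 encoding of the entity is an assumption.

```python
import hashlib

# Names mirror the three algorithms listed above; the mapping itself is illustrative.
HASHERS = {
    "SHA256": hashlib.sha256,
    "SHA3-256": hashlib.sha3_256,
    "MD5": hashlib.md5,
}


def hash_entity(entity, algorithm="SHA256"):
    """Return the hex digest of a UTF-8 encoded entity string."""
    return HASHERS[algorithm](entity.encode("utf-8")).hexdigest()


for name in HASHERS:
    digest = hash_entity("Tim Cook", name)
    print(f"{name:8} {len(digest)} hex chars")
```

SHA256 and SHA3-256 digests are 64 hex characters and MD5 digests are 32, regardless of how long the input entity is.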
Examples
For more detailed examples, check out our Jupyter notebooks in the examples/ directory:
- text_annotation_example.ipynb: Demonstrates text PII annotation
- image_processing.ipynb: Shows OCR capabilities and text extraction from images
These notebooks provide step-by-step guides on how to use DataFog for various tasks.
Dev Notes
For local development:
- Clone the repository.
- Navigate to the project directory:
  cd datafog-python
- Create a new virtual environment (using .venv is recommended as it is hardcoded in the justfile):
  python -m venv .venv
- Activate the virtual environment:
  - On Windows:
    .venv\Scripts\activate
  - On macOS/Linux:
    source .venv/bin/activate
- Install the package in editable mode:
  pip install -r requirements-dev.txt
- Set up the project:
  just setup
Now, you can develop and run the project locally.
Important Actions:
- Format the code:
  just format
  This runs isort to sort imports.
- Lint the code:
  just lint
  This runs flake8 to check for linting errors.
- Generate coverage report:
  just coverage-html
  This runs pytest and generates a coverage report in the htmlcov/ directory.
We use pre-commit to run checks locally before committing changes. Once installed, you can run:
pre-commit run --all-files
Dependencies
For OCR, we use Tesseract, which is incorporated into the build step. You can find the relevant configurations under .github/workflows/ in the following files:
- dev-cicd.yml
- feature-cicd.yml
- main-cicd.yml
Testing
- Python 3.10
License
This software is published under the MIT license.