Scan, redact, and manage PII in your documents before they get uploaded to a Retrieval Augmented Generation (RAG) system.

Open-source DevSecOps for Generative AI Systems.

Overview

DataFog is an open-source DevSecOps platform that lets you scan for and redact Personally Identifiable Information (PII) in the data that flows through your Generative AI applications.

Installation

DataFog can be installed via pip:

pip install datafog

Getting Started

To use DataFog, you'll need to create a DataFog client with the desired operations. Here's a basic setup:

from datafog import DataFog

# For text annotation
client = DataFog(operations="annotate_pii")

# For OCR (Optical Character Recognition)
ocr_client = DataFog(operations="extract_text")

Text PII Annotation

Here's an example of how to annotate PII in a text document:

import requests

# Fetch sample medical record
doc_url = "https://gist.githubusercontent.com/sidmohan0/b43b72693226422bac5f083c941ecfdb/raw/b819affb51796204d59987893f89dee18428ed5d/note1.txt"
response = requests.get(doc_url)
text_lines = [line for line in response.text.splitlines() if line.strip()]

# Run annotation
annotations = client.run_text_pipeline_sync(str_list=text_lines)
print(annotations)
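
The line filtering and pipeline call above can be wrapped in a small helper. This is a minimal sketch, not part of DataFog: `split_nonblank_lines` and `annotate_text` are hypothetical names, and the only DataFog call assumed is `run_text_pipeline_sync(str_list=...)`, exactly as used above. Returning an empty list for empty input is also an assumption about what callers would want.

```python
from typing import Any, List


def split_nonblank_lines(text: str) -> List[str]:
    """Return the lines of `text`, dropping empty and whitespace-only lines."""
    return [line for line in text.splitlines() if line.strip()]


def annotate_text(client: Any, text: str) -> Any:
    """Annotate the non-blank lines of `text` with the given client.

    Hypothetical helper: assumes `client` exposes
    `run_text_pipeline_sync(str_list=...)` as in the example above.
    """
    lines = split_nonblank_lines(text)
    if not lines:
        # Nothing to annotate, so skip the pipeline call entirely.
        return []
    return client.run_text_pipeline_sync(str_list=lines)
```

With the client from Getting Started, this would be called as `annotate_text(client, response.text)`. Passing the client in as an argument keeps the helper easy to test with a stand-in object.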

OCR PII Annotation

For OCR capabilities, you can use the following:

import asyncio
import nest_asyncio

# Allow nested event loops (needed in Jupyter, where a loop is already running)
nest_asyncio.apply()


async def run_ocr_pipeline_demo():
    image_url = "https://s3.amazonaws.com/thumbnails.venngage.com/template/dc377004-1c2d-49f2-8ddf-d63f11c8d9c2.png"
    results = await ocr_client.run_ocr_pipeline(image_urls=[image_url])
    print("OCR Pipeline Results:", results)


loop = asyncio.get_event_loop()
loop.run_until_complete(run_ocr_pipeline_demo())

Note: DataFog's OCR pipeline is asynchronous. run_ocr_pipeline is a coroutine, so call it with await from inside an async function, as in the example above.
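
Outside a notebook, where no event loop is already running, the same coroutine can be driven with `asyncio.run` and without `nest_asyncio`. The sketch below assumes only the `run_ocr_pipeline(image_urls=...)` call shown above; `ocr_images` and `ocr_images_sync` are hypothetical wrapper names, not DataFog APIs.

```python
import asyncio
from typing import Any, List


async def ocr_images(client: Any, image_urls: List[str]) -> Any:
    """Await the OCR pipeline for a list of image URLs.

    Assumes `client` exposes the coroutine `run_ocr_pipeline(image_urls=...)`
    used in the example above.
    """
    return await client.run_ocr_pipeline(image_urls=image_urls)


def ocr_images_sync(client: Any, image_urls: List[str]) -> Any:
    """Blocking wrapper for plain scripts where no event loop is running."""
    return asyncio.run(ocr_images(client, image_urls))
```

In a plain script this would be called as `ocr_images_sync(ocr_client, [image_url])`. Inside Jupyter a loop is already running and `asyncio.run` raises `RuntimeError`, so there either keep the `nest_asyncio` approach or `await ocr_images(...)` directly in a cell.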

Examples

For more detailed examples, check out our Jupyter notebooks in the examples/ directory:

  • text_annotation_example.ipynb: Demonstrates text PII annotation
  • image_processing.ipynb: Shows OCR capabilities and text extraction from images

These notebooks provide step-by-step guides on how to use DataFog for various tasks.

Dev Notes

For local development:

  1. Clone the repository.
  2. Navigate to the project directory:
    cd datafog-python
    
  3. Create a new virtual environment (name it .venv, since that path is hardcoded in the justfile):
    python -m venv .venv
    
  4. Activate the virtual environment:
    • On Windows:
      .venv\Scripts\activate
      
    • On macOS/Linux:
      source .venv/bin/activate
      
  5. Install the development dependencies:
    pip install -r requirements-dev.txt
    
  6. Set up the project:
    just setup
    

Now, you can develop and run the project locally.

Important Actions:

  • Format the code:
    just format
    
    This runs isort to sort imports.
  • Lint the code:
    just lint
    
    This runs flake8 to check for linting errors.
  • Generate coverage report:
    just coverage-html
    
    This runs pytest and generates a coverage report in the htmlcov/ directory.

We use pre-commit to run checks locally before committing changes. Once installed, you can run:

pre-commit run --all-files

Dependencies

For OCR, we use Tesseract, which is incorporated into the build step. You can find the relevant configurations under .github/workflows/ in the following files:

  • dev-cicd.yml
  • feature-cicd.yml
  • main-cicd.yml
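
When running OCR locally rather than in CI, it can help to confirm that Tesseract is available first. The following is a minimal sketch under the assumption that OCR relies on a system executable named `tesseract` on the PATH; the source does not state how DataFog locates it, and `find_tesseract` is a hypothetical helper.

```python
import shutil
from typing import Optional


def find_tesseract(binary_name: str = "tesseract") -> Optional[str]:
    """Return the full path of the executable, or None if it is not on PATH.

    Assumption: the OCR engine is invoked as a binary called `tesseract`.
    """
    return shutil.which(binary_name)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract not found on PATH; OCR may not work locally.")
    else:
        print(f"Tesseract found at {path}")
```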

Testing

  • Tested with Python 3.10

License

This software is published under the MIT license.

This version

3.4.0

Download files

Source Distribution

datafog-3.4.0.tar.gz (16.3 kB)

Built Distribution

datafog-3.4.0-py3-none-any.whl (18.4 kB)