
Scan, redact, and manage PII in your documents before they get uploaded to a Retrieval Augmented Generation (RAG) system.


Open-source DevSecOps for Generative AI Systems.


Overview

What is DataFog?

DataFog is an open-source DevSecOps platform that lets you scan and redact Personally Identifiable Information (PII) out of your Generative AI applications.

What problem are we solving?

Context

The primary use case today is Retrieval Augmented Generation (RAG) systems. As a refresher, a RAG system retrieves information from a custom knowledge base (built by you or your team) and uses it when generating a response, either by citing the source files directly or by drawing on them implicitly in the model's answer. The knowledge base is assembled deliberately: files are uploaded into a workflow, segmented into logical blocks of information, and tagged according to their contextual significance. There are a thousand ways to add nuance to this characterization, but it suffices for the vast majority of cases.

Problem

How do you keep:

  • Customer PII
  • Employee PII
  • Sensitive company information pertaining to org changes or restructurings
  • Pending M&A activity
  • Conversations with external counsel on material corporate matters (e.g. a product recall)
  • and more

from entering a Generative AI environment in the first place? What you need is a tool that scans and redacts your RAG-bound documents according to your organization's or team's needs.

That's where DataFog comes in. It addresses this problem in two ways:

  • PII Observability: take in your batch or streaming data and return a scan indicating character-level detection of entities.
  • Privacy Filter: DataFog can slot in as a pre-processor that redacts PII from your files before they get uploaded to a RAG database.
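To make the relationship between the two approaches concrete, here is a minimal sketch of how character-level scan results could drive redaction. It is not DataFog's implementation: the `redact` helper and the dictionary shape of each finding are assumptions made for illustration, and overlapping spans are not handled.

```python
from typing import Dict, List


def redact(text: str, findings: List[Dict]) -> str:
    """Replace each detected span with a <TYPE> placeholder.

    `findings` is a hypothetical list of dicts with "type", "start" and
    "end" keys; this shape is assumed here, not taken from DataFog.
    """
    # Work from the end of the string so earlier offsets stay valid
    # after each replacement changes the length of the text.
    for f in sorted(findings, key=lambda f: f["start"], reverse=True):
        text = text[: f["start"]] + f"<{f['type']}>" + text[f["end"]:]
    return text


chunk = "I'm announcing on Friday that Jeff is going to be CTO."
findings = [
    {"type": "PERSON", "start": 30, "end": 34, "score": 0.85},
    {"type": "CUSTOM_PII", "start": 50, "end": 53, "score": 1.0},
]
redacted = redact(chunk, findings)
print(redacted)
# I'm announcing on Friday that <PERSON> is going to be <CUSTOM_PII>.
```

The scan step answers "where is the PII?", and the filter step uses those offsets to rewrite the text before it reaches the RAG database.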

You can import the SDK into any Python environment (such as a Google Colab notebook; check out our Getting Started) and be up and running within a few lines of code.

How it works

[Figure: DataFog overview]

There are lots of PII tools out there; why DataFog?

If you look at the landscape of PII detection tools, their very existence was in many cases driven by regulatory requirements (e.g. 'comply with CCPA/GDPR/HIPAA'). In that scenario, there is a well-defined problem, a fixed set of entities to look for, and a relatively static universe of document schemas to work with. As a result, those products are purpose-built for the specific problem they solve.

However, Generative AI changes how we think about privacy. Privacy requirements now shift over time (new M&A deals and internal discussions mean new terms to scan and redact), and document sources are more varied. PII detection is no longer just about compliance; it's an active (and for some, new) internal security threat for CISOs and engineering leaders to contend with. We want DataFog to be built and driven to meet the needs of the open-source community as they tackle this challenge.

Installation

DataFog can be installed via pip:

pip install datafog

Then, in your Python environment:

# Import the DataFog class and the Presidio-based scanner used in the examples below
from datafog import DataFog, PresidioEngine as presidio
datafog = DataFog()

Examples

Here are some examples of datafog being used to redact information in business contexts. Please see '/examples' for our Getting Started notebook. We'll be regularly updating content and providing comprehensive guides to using DataFog in production contexts. If you have any ideas for a tutorial or guide that you would like to see, please let us know!

Scanning a single string

  ceo_email_chunk = "I'm announcing on Friday that Jeff is going to be CTO."

  scan_results1 = presidio.scan(ceo_email_chunk)
  print("PII Detected - base case:", scan_results1)
  # PII Detected - base case: [type: PERSON, start: 30, end: 34, score: 0.85]


  scan_results2 = presidio.scan(ceo_email_chunk, deny_list=['CTO'])
  print("PII Detected with deny list:", scan_results2)
  # PII Detected with deny list: [type: CUSTOM_PII, start: 50, end: 53, score: 1.0, type: PERSON, start: 30, end: 34, score: 0.85]
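As a rough illustration of what a deny list contributes, the sketch below finds exact occurrences of each listed term and reports them with a score of 1.0, mirroring the `CUSTOM_PII` entry in the output above. The `deny_list_scan` helper is hypothetical and uses only the standard library; it is not how `presidio.scan` is implemented.

```python
import re
from typing import Dict, List


def deny_list_scan(text: str, deny_list: List[str]) -> List[Dict]:
    """Return one finding per exact, whole-word match of a deny-list term."""
    findings = []
    for term in deny_list:
        # re.escape treats the term literally; \b keeps the match from
        # firing inside a longer word (e.g. "CTO" inside "CTOs").
        for m in re.finditer(rf"\b{re.escape(term)}\b", text):
            findings.append(
                {"type": "CUSTOM_PII", "start": m.start(), "end": m.end(), "score": 1.0}
            )
    return sorted(findings, key=lambda f: f["start"])


hits = deny_list_scan("I'm announcing on Friday that Jeff is going to be CTO.", ["CTO"])
print(hits)
# [{'type': 'CUSTOM_PII', 'start': 50, 'end': 53, 'score': 1.0}]
```

Because a deny-list match is an exact string hit rather than a model prediction, a fixed score of 1.0 is a natural choice. Note that the `\b` anchors assume each term starts and ends with a word character.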

Scanning a list of PDFs

from datafog import DataFog

# Absolute paths to the PDFs to process
file_dir = ["/Users/sidmohan/Desktop/datafog-v2.4.0/datafog-python/tests/files/input_files/agi-builder-meetup.pdf",
           "/Users/sidmohan/Desktop/datafog-v2.4.0/datafog-python/tests/files/input_files/pypdf-readthedocs-io-en-stable.pdf"]
datafog = DataFog()
# Returns a dictionary keyed by file name
result = datafog.upload_files(uploaded_files=file_dir)
print(result)

The output is a dictionary in which the keys are the file names and the values are the results for each file (in the example below, the text extracted from the PDF). For example: {'agi-builder-meetup.pdf': "2/26/24, 2:16 PM\nAGI Builders Meetup SF · Luma\nContact the HostReport Event29\nEvent FullIf youʼd like"}
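Given a dictionary of that shape, a natural next step is to run a scanner over each file's text. The sketch below assumes the values are plain strings, as in the sample above; `scan_files` and `find_term` are hypothetical helpers, and `find_term` is only a stand-in for a real scanner.

```python
import re
from typing import Callable, Dict, List


def scan_files(
    results: Dict[str, str], scanner: Callable[[str], List[Dict]]
) -> Dict[str, List[Dict]]:
    """Apply `scanner` to each file's text, keeping the file-name keys."""
    return {name: scanner(text) for name, text in results.items()}


def find_term(text: str, term: str = "Luma") -> List[Dict]:
    """Stand-in scanner: report every literal occurrence of `term`."""
    return [
        {"type": "CUSTOM_PII", "start": m.start(), "end": m.end(), "score": 1.0}
        for m in re.finditer(re.escape(term), text)
    ]


# Sample value copied from the example output above.
result = {
    "agi-builder-meetup.pdf": "2/26/24, 2:16 PM\nAGI Builders Meetup SF · Luma\n"
    "Contact the HostReport Event29\nEvent FullIf youʼd like"
}
per_file = scan_files(result, find_term)
for name, findings in per_file.items():
    print(name, findings)
```

In practice you might pass a scan function such as the `presidio.scan` call shown earlier in place of `find_term`, assuming it accepts a plain string.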

Contributing

DataFog is a community-driven open-source platform, and we've been fortunate to have a small but growing contributor base. We'd love to hear your ideas, feedback, and suggestions for improvement: anything you think would make DataFog better! Join our Discord to become part of the community.

Dev Notes

  • Justfile commands:
    • just format to apply formatting.
    • just lint to check formatting and style.

Testing

To run the datafog unit tests, check out this repository and run:

tox

License

This software is published under the MIT license.


This version

2.4.0
