GenAI DLP and Prompt generator
Project description
GenAI DLP Prompt Generator
GenAI DLP Prompt Generator is a Python tool designed to scrape DLP data and use it to generate GenAI prompt. It has three main modules:
- It fetches DLP test sample data from specified URLs, saves the data in text format, and then converts these text files into PDFs.
- It uses an OpenAI Assistant to generate DLP mock data
- It uses OpenAI Chat Completions to generate prompts for each DLP category
The output is suitable for benchmarking DLP systems or Generative AI Language Learning Models (GenAI LLMs).
Features
- Web scraping from specified URLs.
- Data extraction and saving in text format.
- Conversion of text data to PDF format, ideal for benchmarking DLP systems or GenAI LLMs.
Installing
To install DLP Data Scraper, clone the repository and install the required packages:
git clone https://github.com/BenderScript/DLPDataScraper.git
cd DLPDataScraper/dlp_data_scraper
pip3 install -r requirements.txt
Usage
Make sure you have a OpenAI API key and set it as an environment variable:
export OPENAI_API_KEY=<your key here>
The file with DLP categories currently under tests/dlp_categories.md
. Need to be copies to a new location
and the path passed to the OpenAIDLPAssistant
or OpenAIChat
classes.
To use the DLP Data Scraper:
The scraper access a URL with dynamic content, waits for it to load and extracts all DLP categories
from dlp_data_scraper.umbrella import Umbrella
from file_utils.FileUtils import FileUtils
pdf_data = "umbrella/pdf_data"
text_data = "umbrella/text_data"
file_utils = FileUtils()
url = (
'https://support.umbrella.com/hc/en-us/articles/4402023980692-Data-Loss-Prevention-DLP-Test-Sample-Data-for'
'-Built-In-Data-Identifiers')
scraper = Umbrella(url=url, text_data=text_data, pdf_data=pdf_data)
html_content = scraper.initialize_browser()
scraped_data = scraper.scrape_data()
scraper.save_data_to_files()
file_utils.convert_txt_to_pdf(text_data, pdf_data)
print("Scraping and conversion to PDF completed.")
After the run is over, the generated data under the umbrella/text_data
and umbrella/pdf_data
directory. There will be one file for each DLP category.
To use the OpenAI Assistant DLP generator:
from dlp_data_gen.openai_dlp_assistant import OpenAIDLPAssistant
dlp_gen_assistant = OpenAIDLPAssistant(text_data="openai_dlp/text_data", pdf_data="openai_dlp/pdf_data",
dlp_categories_file="dlp/dlp_categories.md")
dlp_gen_assistant.run()
After the run is over, the generated data will be under the openai_dlp/text_data
and openai_dlp/pdf_data
directory. There will be a single file with mock DLP data for
each category.
To use the OpenAI DLP Prompt Generator
from prompt_gen.openai_chat import OpenAIChat
chat_gen = OpenAIChat(text_data="openai_chat_prompt/text_data",
pdf_data="openai_chat_prompt/pdf_data",
dlp_categories_file="dlp/dlp_categories.md")
chat_gen.run()
Contributing
Contributions to DLP Data Scraper are welcome. Please feel free to submit pull requests or open issues to improve the project.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file genai_dlp_prompter-0.1.1.tar.gz
.
File metadata
- Download URL: genai_dlp_prompter-0.1.1.tar.gz
- Upload date:
- Size: 11.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.1 CPython/3.11.7 Darwin/23.2.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | d2a9019c14b55f1170e61c087ba2774395131e07246a2d71bc8dcbbb24e6a837 |
|
MD5 | 6d24ab680093161015422eb45fa4dc0b |
|
BLAKE2b-256 | 86edc639be5b39286c28e22f0ad699a7d04350c0b2cd432d16b8b90efec309c3 |
File details
Details for the file genai_dlp_prompter-0.1.1-py3-none-any.whl
.
File metadata
- Download URL: genai_dlp_prompter-0.1.1-py3-none-any.whl
- Upload date:
- Size: 14.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.1 CPython/3.11.7 Darwin/23.2.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | bce3549d68865da9973edb2a7d98c3b069e122922e181f78c4ce6f0e9c22c564 |
|
MD5 | 2361a3df619fc06fe04459f58f61bae5 |
|
BLAKE2b-256 | add358591efa13014245deeb2f197d84288e055ab665fcc4495aa26906c6f3d8 |