Generate mock documents in various formats (CSV, DOCX, PDF, TXT, and more) that embed seed data and can be used to test data classification software.

Mockingbird: Generate mock documents for data classification

About

Mockingbird is a Python library for generating mock documents in various formats. It accepts user-defined data and embeds it into the documents it generates. Developers can use Mockingbird to quickly build datasets, in particular for validating the efficacy of data classification software.

Installation

The easiest way to install Mockingbird is by using pip:

pip install mockingbird

For local development, clone the repository and run pip install .

Getting Started

Mockingbird can run as a functional Python library or as a CLI.

CLI Usage

On Unix-like systems, once Mockingbird is installed with pip, run mockingbird_cli --h to access its command line interface. Some sample CLI calls are:

mockingbird_cli --type dry -o ./output/dry_test/
mockingbird_cli --type csv -i ./samples/csv_sample.csv -o ./output/csv/
mockingbird_cli --type csv_curl -i <curl'able URL> -o ./output/csv_curl/
mockingbird_cli --type mockaroo -i ./samples/sample_schema.json --mockaroo_api <mockaroo API> -o ./output/mockaroo

As a Python Library

Starting from Code

Mockingbird can be used directly as a Python library. The basic example below generates documents from mock data. Each keyword is a string that maps to a list of string entries.

from mockingbird import Mockingbird

# Spawn a new Mockingbird session
fab = Mockingbird()

# Set which file extensions to output
fab.set_file_extensions(["html", "docx", "yaml", "xlsx", "odt"])

# Input the data we want to test / inject into the documents
fab.add_sensitive_data(keyword="ssn", entries=["000-000-0000", "999-999-9999"])
fab.add_sensitive_data(keyword="dob", entries=["01/01/1991", "02/02/1992"])

# Generate and save the fabricated documents
fab.save(save_path="./output_basic/")
fab.dump_meta_data(output_file="./output_basic/meta_data.json")
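After a run like the one above, it can help to check which files actually landed in the output directory. The following is a minimal sketch using only the standard library. `count_by_extension` is a hypothetical helper, not part of Mockingbird, and it assumes the generated documents are ordinary files somewhere under `save_path`.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(output_dir):
    """Count the files under output_dir, grouped by lowercase extension.

    Hypothetical helper for inspecting a Mockingbird output folder;
    it only walks the directory tree and does not open any file.
    """
    counts = Counter()
    for path in Path(output_dir).rglob("*"):
        if path.is_file():
            # "report.DOCX" -> "docx"; files without a suffix count under ""
            counts[path.suffix.lower().lstrip(".")] += 1
    return dict(counts)
```

Calling `count_by_extension("./output_basic/")` after the example above should list the extensions passed to `set_file_extensions`, plus `json` if the metadata file was dumped into the same directory.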

Starting from CSV

Mockingbird can be started using a CSV file, treating the column headers as keywords, and the remaining rows as entries.

The CSV is expected to be structured as follows:

FILE: mockingbird_data.csv

ssn, dob
000-000-000, 01/01/1991
999-999-999, 02/02/1992

from mockingbird.mb_wrappers import MockingbirdFromCSV


# This loads the data from the CSV and generates a session using each column
fab = MockingbirdFromCSV("csv_sample.csv")
fab.set_all_extensions()

fab.save(save_path="./output_csv/")
fab.dump_meta_data(output_file="./output_csv/meta_data.json")
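The input CSV can also be produced from Python before handing it to MockingbirdFromCSV. The sketch below uses only the standard library; `write_seed_csv` is a hypothetical helper, not part of Mockingbird. It writes values without a space after the comma, which is an assumption about what the loader accepts and differs slightly from the sample file shown above.

```python
import csv


def write_seed_csv(path, columns):
    """Write {keyword: [entries]} as one header row followed by data rows.

    Hypothetical helper. Rows are built with zip(), so if the columns
    have different lengths the extra entries of longer columns are dropped.
    """
    keywords = list(columns)
    rows = zip(*(columns[keyword] for keyword in keywords))
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(keywords)   # header row: one keyword per column
        writer.writerows(rows)      # one row per entry index
```

For instance, `write_seed_csv("mockingbird_data.csv", {"ssn": ["000-000-000", "999-999-999"], "dob": ["01/01/1991", "02/02/1992"]})` writes a file with the same header and values as the sample above.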

Optionally, multiple keywords can be defined in a single CSV column header, and Mockingbird will split them into separate keywords. For example, rather than testing only the keyword ssn, we can test both ssn and social security number. Multiple keywords are separated within a header by using ; as a delimiter.

For example,

FILE: mockingbird_data.csv

ssn;social security number,dob;date of birth;birth
000-000-000, 01/01/1991
999-999-999, 02/02/1992

This will generate documents for each keyword in each column header.
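To make the header convention concrete, the sketch below models the split described above with the standard library. `split_header_keywords` is a hypothetical function written for illustration; it is not Mockingbird's own parsing code, and the whitespace trimming is an assumption.

```python
import csv
import io


def split_header_keywords(header_line, delimiter=";"):
    """Return one list of keywords per CSV column in header_line.

    Illustrative only: columns are separated by commas, and keywords
    inside a column are separated by the given delimiter.
    """
    columns = next(csv.reader(io.StringIO(header_line)))
    return [
        [keyword.strip() for keyword in column.split(delimiter) if keyword.strip()]
        for column in columns
    ]


print(split_header_keywords("ssn;social security number,dob;date of birth;birth"))
# [['ssn', 'social security number'], ['dob', 'date of birth', 'birth']]
```

Under this reading, the first column's entries are tested against two keywords and the second column's entries against three.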

Starting Using Mockaroo

Using a Mockaroo API key, we can request mock data from Mockaroo's servers using JSON requests. Currently, the request has to be saved to a JSON file on disk and loaded at runtime. More documentation can be found on Mockaroo's website; a JSON example is shown below.

FILE: mockaroo_request.json

[
  {
    "name": "ssn;social security;social",
    "type": "SSN"
  },
  {
    "name": "cc;credit card",
    "type": "Credit Card #"
  },
  {
    "name": "phone;phone-number;number",
    "type": "Phone"
  },
  {
    "name": "name;fullname;full name",
    "type": "Full Name"
  }
]

In code, Mockingbird can use this request as a JSON payload:

import json
from mockingbird.mb_wrappers import MockingbirdFromMockaroo

with open("mockaroo_request.json") as json_file:
    schema_request = json.load(json_file)

fab = MockingbirdFromMockaroo(api_key="MOCKAROO_API_KEY", schema_request=schema_request)
fab.set_all_extensions()
fab.save(save_path="./output_mockaroo/")
fab.dump_meta_data(output_file="./output_mockaroo/meta_data.json")
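Because the request currently has to live in a JSON file on disk, it can be convenient to generate that file from Python. The sketch below uses only the standard library. `build_schema_request` and `write_schema_request` are hypothetical helpers, not part of Mockingbird; the "name" and "type" fields and the ; separator are taken from the example request above.

```python
import json


def build_schema_request(fields):
    """Build a Mockaroo-style request from (keywords, type) pairs.

    Hypothetical helper: each pair becomes one object whose "name" joins
    the keywords with ";" and whose "type" is the Mockaroo data type.
    """
    return [
        {"name": ";".join(keywords), "type": mockaroo_type}
        for keywords, mockaroo_type in fields
    ]


def write_schema_request(path, fields):
    """Save the request to a JSON file so it can be loaded at runtime."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(build_schema_request(fields), handle, indent=2)
```

For instance, `write_schema_request("mockaroo_request.json", [(["ssn", "social security", "social"], "SSN"), (["cc", "credit card"], "Credit Card #")])` writes the first two entries of the request shown above.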

License

Licensed under the Apache License, Version 2.0. See LICENSE for the full license text.

