datamock
Python-native mocking of realistic datasets by defining schemas for prototyping, testing, and demos
Installation
pip install datamock
Usage
Here's a moderately complex example demonstrating how to model an e-commerce system with customers, orders, and products. This showcases features like nested schemas, ListOf, and Derived fields.
import json
from datamock import Schema, ListOf, Derived, String, Float, Choice
from datamock.field import Name, Email
# Define a schema for a product
class Product(Schema):
    name = String(min_length=5, max_length=20)
    price = Float((10, 1000), round_to=2)
    category = Choice(choices=['electronics', 'books', 'clothing', 'home goods'])

# Define a schema for an order, which contains a list of products
class Order(Schema):
    order_id = String(regex=r'[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}')
    products = ListOf(Product(), min_items=1, max_items=3)
    # The total cost is derived from the sum of the prices of the products in the order
    total_cost = Derived(
        lambda context: sum(p['price'] for p in context['products'])
    )

# Define a schema for a customer, who can have multiple orders
class Customer(Schema):
    name = Name()
    email = Email()
    orders = ListOf(Order(), min_items=1, max_items=3)
    # The total amount spent by the customer is derived from the sum of the total costs of their orders
    total_spent = Derived(
        lambda context: sum(o['total_cost'] for o in context['orders'])
    )

# Generate a batch of 3 customers
customers = Customer.generate_batch(3)
print(json.dumps(customers, indent=2))
The above will generate random data satisfying the schema. An example output is:
[
  {
    "name": "Joseph May",
    "email": "ymathis@example.com",
    "orders": [
      {
        "order_id": "ca077fe5-a5a6-349d-d45d-83bd54c4ffb9",
        "products": [
          {
            "name": "x4_E'`\r68F~",
            "price": 82.26,
            "category": "electronics"
          },
          {
            "name": "B.J_$",
            "price": 457.28,
            "category": "clothing"
          }
        ],
        "total_cost": 539.54
      }
    ],
    "total_spent": 539.54
  },
  {
    "name": "Timothy Sanchez",
    "email": "connor13@example.org",
    "orders": [
      {
        "order_id": "ac51d095-155c-9f57-188b-a2d91034e06a",
        "products": [
          {
            "name": "HaYD;eK\\^i",
            "price": 814.76,
            "category": "clothing"
          }
        ],
        "total_cost": 814.76
      },
      {
        "order_id": "7cfca1f6-43af-8e4f-c31b-754a88e0b5c8",
        "products": [
          {
            "name": "D6eE<Y`AC2o",
            "price": 106.45,
            "category": "electronics"
          },
          {
            "name": "FUwcTh)hX\u000bb5]DeK",
            "price": 936.42,
            "category": "clothing"
          }
        ],
        "total_cost": 1042.87
      },
      {
        "order_id": "48104c52-3076-1e53-9070-91795c55afab",
        "products": [
          {
            "name": "z8bG3g*I7R#eyW",
            "price": 182.25,
            "category": "books"
          }
        ],
        "total_cost": 182.25
      }
    ],
    "total_spent": 2039.8799999999999
  },
  {
    "name": "Robert Lam",
    "email": "middletonamanda@example.org",
    "orders": [
      {
        "order_id": "af634dc5-3d67-e501-1b16-c4e2de49d66b",
        "products": [
          {
            "name": ";\\9^%u0Vt#'?Un\\( ;U6",
            "price": 499.4,
            "category": "books"
          },
          {
            "name": "@.,W(@nP-ZfOrq",
            "price": 373.37,
            "category": "clothing"
          }
        ],
        "total_cost": 872.77
      },
      {
        "order_id": "568d3603-14b7-7e20-7c8f-ec94f7fa98e9",
        "products": [
          {
            "name": ")GG,Tv]9m\"(\u000bn\r<5 ",
            "price": 640.68,
            "category": "home goods"
          },
          {
            "name": "pqy-Ze\u000bf9`PHde9\u000b00,`",
            "price": 184.34,
            "category": "clothing"
          }
        ],
        "total_cost": 825.02
      }
    ],
    "total_spent": 1697.79
  }
]
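The `total_spent` value `2039.8799999999999` in the sample output is ordinary binary floating-point error from summing prices that were each rounded to two decimals; it is not a fault in the generated data. The sketch below uses plain Python only (no datamock) and the order totals copied from the second customer. Applying `round` inside the `Derived` function is a suggestion, not something the original example does.

```python
import math

# Order totals copied from the second customer in the sample output.
order_totals = [814.76, 1042.87, 182.25]

# Summing floats can leave a tiny binary representation error in the result.
raw_total = sum(order_totals)

# Rounding to two decimal places gives a clean currency-style value.
rounded_total = round(raw_total, 2)

print(raw_total)
print(rounded_total)
print(math.isclose(raw_total, rounded_total))
```

If clean totals matter for a demo, the same `round(..., 2)` call could wrap the `sum(...)` expression inside the lambda passed to `Derived`.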
Field Types
Static
Field type that always generates the same fixed value. The value can be of any type.
from datamock import Static
static_value = Static(value='my-static-value')
generated_value = static_value.generate()
print(generated_value)
Choice
Choice field type to be used when generated values should come from a pre-defined set of choices.
from datamock import Choice
# No weights
my_choices = ['option1', 'option2', 'option3']
string_field = Choice(choices=my_choices)
generated_value = string_field.generate()
print(generated_value)
# Weighting
my_choices = [{'key1': 1}, {'key2': 'something'}, {'key3': 2.0, 'key4': 'this'}]
weights = [0.1, 0.8, 0.1]
string_field = Choice(choices=my_choices, weights=weights)
generated_value = string_field.generate()
print(generated_value)
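To make the effect of `weights` concrete, here is a minimal sketch of weighted selection built on the standard library's `random.choices`. The helper `weighted_choice` is hypothetical and is not datamock's implementation; it only illustrates the behaviour the `Choice` field is described as having.

```python
import random

def weighted_choice(choices, weights=None, rng=random):
    """Pick one element; uniform when no weights are given (hypothetical helper)."""
    if weights is None:
        return rng.choice(choices)
    if len(weights) != len(choices):
        raise ValueError("weights must have one entry per choice")
    # random.choices returns a list of k picks; take the single pick.
    return rng.choices(choices, weights=weights, k=1)[0]

my_choices = ['option1', 'option2', 'option3']
rng = random.Random(0)  # seeded so the tally is reproducible

# Tally 10,000 draws to see the weights reflected in the frequencies.
counts = {choice: 0 for choice in my_choices}
for _ in range(10_000):
    counts[weighted_choice(my_choices, [0.1, 0.8, 0.1], rng)] += 1

print(counts)
```

With weights of `[0.1, 0.8, 0.1]`, roughly 80% of the draws should land on the second option.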
String
String type that generates random strings, either with a length between min_length and max_length or matching a regular expression passed as regex.
from datamock import String
social_security_no_regex = r'\d{3}-\d{2}-\d{4}'
string_field = String(regex=social_security_no_regex)
generated_value = string_field.generate()
print(generated_value)
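When a regex-generated value feeds a test, it can be useful to assert that the whole string matches the pattern. A small sketch using only the standard `re` module follows; the helper name `matches_pattern` is hypothetical, and the sample strings stand in for values that `String(regex=...)` would produce.

```python
import re

# Same pattern as the social security number example.
SSN_PATTERN = re.compile(r'\d{3}-\d{2}-\d{4}')

def matches_pattern(value: str) -> bool:
    """Return True only if the entire string matches the pattern."""
    # fullmatch anchors at both ends, unlike match, which anchors only at the start.
    return SSN_PATTERN.fullmatch(value) is not None

print(matches_pattern('123-45-6789'))
print(matches_pattern('123-456-789'))
```

Using `fullmatch` rather than `match` avoids accepting strings that merely start with a valid prefix.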
Float
Float type to generate random floating point numbers. This can be controlled in various ways:
- Generate a float from $\mathcal{U}(\texttt{min}, \texttt{max})$
- Generate a float from $\mathcal{N}(\mu, \sigma)$
The generated value may also be optionally rounded to a specified number of decimal places.
from datamock import Float
# Uniform distribution
float_field = Float((0.0, 1.0))
generated_value = float_field.generate()
print(generated_value)
# Normal distribution
float_field = Float(
    distribution="normal",
    distribution_params={"mean": 0.5, "std": 0.1},
    round_to=4
)
generated_value = float_field.generate()
print(generated_value)
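The two sampling modes and the rounding step can be sketched with the standard library alone. The function `sample_float` below is hypothetical: its parameter names mirror the `Float` arguments shown above, but the default bounds of `(0.0, 1.0)` and the internal logic are assumptions, not datamock's implementation.

```python
import random

def sample_float(bounds=(0.0, 1.0), distribution="uniform",
                 distribution_params=None, round_to=None, rng=random):
    """Draw one float from a uniform or normal distribution (hypothetical helper)."""
    if distribution == "uniform":
        low, high = bounds
        value = rng.uniform(low, high)
    elif distribution == "normal":
        # gauss takes the mean and the standard deviation.
        value = rng.gauss(distribution_params["mean"], distribution_params["std"])
    else:
        raise ValueError(f"unsupported distribution: {distribution}")
    # Rounding is applied last, and only when requested.
    return round(value, round_to) if round_to is not None else value

rng = random.Random(42)
print(sample_float((0.0, 1.0), rng=rng))
print(sample_float(distribution="normal",
                   distribution_params={"mean": 0.5, "std": 0.1},
                   round_to=4, rng=rng))
```

Note that a normal sample is unbounded, so values outside any intended range are possible unless they are clipped separately.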
Int
Int type to generate uniform random integers within a range.
from datamock import Int
int_field = Int((0, 100))
generated_value = int_field.generate()
print(generated_value)
Maybe
Field type with behaviour like Optional (i.e. makes a field null with a specified probability).
from datamock import Maybe, Int
maybe_int_field = Maybe(Int(), probability=0.1)
generated_value = maybe_int_field.generate()
print(generated_value)
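The description above says the field becomes null with the specified probability, so `probability=0.1` is read here as a 10% chance of `None`. The sketch below illustrates that reading with a hypothetical `maybe` helper; it is not datamock's implementation, and the interpretation of `probability` is an assumption taken from the description.

```python
import random

def maybe(generate, probability, rng=random):
    """Return None with the given probability, otherwise a generated value."""
    # rng.random() is in [0.0, 1.0), so probability=1.0 always yields None
    # and probability=0.0 never does.
    if rng.random() < probability:
        return None
    return generate()

rng = random.Random(0)  # seeded so the count is reproducible
samples = [maybe(lambda: rng.randint(0, 100), 0.1, rng) for _ in range(10_000)]
null_count = sum(value is None for value in samples)
print(null_count)
```

Downstream code that consumes such a field needs to handle `None`, which is often the point of including it in test data.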
Boolean
Boolean field type, with optional weighting. If provided, the weights array has the format [<true_weight>, <false_weight>].
from datamock import Boolean
boolean_field = Boolean(weights=[0.8, 0.2])
generated_value = boolean_field.generate()
print(generated_value)
Date
Date field type to generate a random date within a specified date range.
from datamock import Date
date_field = Date(start='2000-01-01', end='2030-01-01', fmt='%Y-%m-%d')
generated_value = date_field.generate()
print(generated_value)
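As a rough illustration of drawing a date from a range, the sketch below uses only the standard `datetime` module. The helper `random_date` is hypothetical; treating both bounds as inclusive and using `fmt` for both parsing and output are assumptions, and datamock's actual behaviour may differ.

```python
import random
from datetime import datetime, timedelta

def random_date(start, end, fmt='%Y-%m-%d', rng=random):
    """Return a random date between start and end, formatted with fmt."""
    start_date = datetime.strptime(start, fmt).date()
    end_date = datetime.strptime(end, fmt).date()
    if end_date < start_date:
        raise ValueError("end must not be earlier than start")
    span_days = (end_date - start_date).days
    # randint is inclusive at both ends, so either bound can be returned.
    offset = rng.randint(0, span_days)
    return (start_date + timedelta(days=offset)).strftime(fmt)

print(random_date('2000-01-01', '2030-01-01'))
```

Picking a whole-day offset keeps the draw uniform over calendar days, including leap days.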
Custom
Custom field type that allows the user to specify how values are generated for the field. This is enabled by providing a callable that accepts no arguments. It can be used to generate custom values that do not conform to one of the existing field types. Possible use cases are making API calls, running inference on ML models, and calling external libraries.
import random
from datamock import Custom
def generate_random_embedding():
    return [random.random() for _ in range(512)]

custom_field = Custom(generate_random_embedding)
generated_value = custom_field.generate()
print(generated_value)
Derived
Field type that computes its value from the values of other fields in the schema.
from datamock import Derived, Schema, Float, Int
# Example 1 - single source field
class Example1Schema(Schema):
    base_price = Float((100, 200))
    price_with_vat = Derived(lambda ctx: ctx['base_price'] * 1.2)

schema = Example1Schema()
print(schema.generate())

# Example 2 - multiple source fields
class Example2Schema(Schema):
    quantity = Int((1, 100))
    unit_price = Float((100, 200))
    total = Derived(lambda ctx: ctx['quantity'] * ctx['unit_price'])

schema = Example2Schema()
print(schema.generate())
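One way to picture the `ctx` argument is as a dictionary of already-generated field values that each derived function reads from. The sketch below models that idea in plain Python. The function `generate_record` and its two-phase ordering are assumptions made for illustration; the source does not describe how datamock orders field evaluation internally.

```python
import random

def generate_record(base_fields, derived_fields):
    """Generate independent fields first, then derived ones (hypothetical helper)."""
    # Phase 1: every independent field produces its own value.
    context = {name: generate() for name, generate in base_fields.items()}
    # Phase 2: each derived field sees all values produced so far.
    for name, compute in derived_fields.items():
        context[name] = compute(context)
    return context

record = generate_record(
    base_fields={
        'quantity': lambda: random.randint(1, 100),
        'unit_price': lambda: random.uniform(100, 200),
    },
    derived_fields={
        'total': lambda ctx: ctx['quantity'] * ctx['unit_price'],
    },
)
print(record)
```

Because derived fields are computed after the values they depend on, the result is always internally consistent, which is what makes totals in generated data line up with their parts.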
FakerProvider
Field type that exposes any supported provider from the Faker library.
from datamock import FakeProvider
string_field = FakeProvider(faker_provider='name')
generated_value = string_field.generate()
print(generated_value)
Common providers are also available as lightweight fields and can be used as follows:
from datamock.field import City, Name, UUID, URL # and more...
# Example usage:
city_field = City()
generated_value = city_field.generate()
print(generated_value)
ListOf
List field type for creating lists of values, where each element is generated by a field type or by a (possibly nested) schema. ListOf can itself be nested as required.
from datamock import ListOf, Float
list_field = ListOf(Float())
generated_value = list_field.generate()
print(generated_value)
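The usage example earlier passes `min_items` and `max_items` to `ListOf`. The sketch below shows what a length-bounded list and a nested list look like, using only the standard library. The helper `list_of` is hypothetical, and its default bounds of 1 and 3 are assumptions chosen to match that example, not documented defaults.

```python
import random

def list_of(generate_item, min_items=1, max_items=3, rng=random):
    """Build a list whose length is drawn from [min_items, max_items]."""
    if min_items > max_items:
        raise ValueError("min_items must not exceed max_items")
    length = rng.randint(min_items, max_items)  # inclusive at both ends
    return [generate_item() for _ in range(length)]

# A flat list of floats.
flat = list_of(random.random, min_items=2, max_items=5)
print(flat)

# Nesting: each element is itself a list of exactly two floats.
nested = list_of(lambda: list_of(random.random, 2, 2), min_items=1, max_items=3)
print(nested)
```

Each element is produced by a fresh call to the item generator, so the values within a list are independent of one another.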
Schemas
Fields can be combined into schemas. Data can then be generated for schemas (on an instance or batch level). For example:
from datamock import Schema, Float, Int, Maybe, Boolean, ListOf
from datamock.field import Name
class Person(Schema):
    name = Name()
    salary = Maybe(Float((10_000, 100_000)))
    is_member = Boolean()
    device_ids = ListOf(Int())

person = Person()
print(person.generate())
print(person.generate_batch(10))
Schemas can also be nested:
from datamock import Schema, Float, Int, Maybe, Boolean, ListOf
from datamock.field import Name, Country
class Manufacturer(Schema):
    name = Name()
    country = Country()

class Product(Schema):
    manufacturer = Manufacturer()
    price = Float()

product = Product()
print(product.generate())
print(product.generate_batch(5))