Extension of mimesis library for data generation with statistical properties
Project description
mimesis_stats
This package exists to extend the capabilities of mimesis for use in statistical data pipelines.
mimesis
provides fast fake data generation, and comes with a wide range of data providers, formats, structures and localised options. In addition, it provides a schema structure which makes data generation for data frames very easy. Before using this package it is recommended to become familiar with the basics of mimesis
fake data generation, such as through this getting started page.
Due to the extensibility, custom data providers can be created for use within the framework. This mimesis_stats
package aims to use the framework for use in statistical pipelines, particularly for generating dummy data for surveys.
However, mimesis
data generation / providers have two primary limitations this package extension addresses:
- Uni-variable - each data provider method produces a single value. Often in practice there are dependencies and relationships between different variables / columns.
- Limited in statistical properties -
mimesis
draws samples using a uniform distribution. Real distributions are often weighted, or have specific properties (such as a gaussian)
mimesis_stats
uses a StatsSchema
object that allows multiple variables related to one another to be created using methods from MultiVariable
.
mimesis_stats
adds data providers for: discrete choice distributions, as well as the ability to pass in custom functions, such as those from numpy
or scipy
, or user defined functions.
To see an example use case of this package scroll to the bottom of this document in the "Working with pandas" section.
mimesis_stats providers
The package contains two supplementary providers, the main object of generating mimesis
data. One for producing discrete / continuous distributions and the other for dependent multi-variable samples.
Distribution
Ideal for generating categorical data with Distribution.discrete_distribution()
or a numerical variable using Distribution.generic_distribution()
with a user defined or numpy/scipy
function.
All mimesis_stats
providers have null_prop
and null_value
arguments to add in missing at random null values. For multi-variable producers this is done by passing in a list of propritionas and missing values corresponding to each variable made.
Categorical
General use for discrete distributions, the main addition from base mimesis
are the weighting and null options.
>>> from mimesis_stats.providers.distribution import Distribution
>>> Distribution.distrete_distribution(
... population=["First", "Second", "Third"],
... weights=[0.01, 0.01, 0.98]
... )
"Third"
>>> Distribution.distrete_distribution(
... population=["Apple", "Banana"],
... weights=[0.5, 0.5],
... null_prop=1.0,
... null_value=None
... )
None
MultiVariable
This provider allows multiple variables dependent or related to each other to be created through one provider call.
In practice, produced dictionary key-value pairs can be separated into different variables.
>>> from mimesis_stats.providers.multivariable import MultiVariable
>>> MultiVariable.dependent_variables(
... variable_names=["consent", "favourite_fruit"],
... options=[("Yes", "Lemon"), ("No", None)],
... weights=[0.7, 0.3]
... )
{"consent": "Yes", "favourit_fruit": "Lemon")
Within the possible combinations other provider calls can be made to extend the complexity of generation.
>>> from mimesis_stats.providers.multivariable import MultiVariable
>>> from mimesis import Food
>>> MultiVariable.dependent_variables(
... variable_names=["consent", "favourite_fruit"],
... options=[("Yes", Food.fruit()), ("No", None)],
... weights=[0.9, 0.1]
... )
{"consent": "Yes", "favourit_fruit": "Banana")
StatsSchema
For generating samples of many variables consistently it is recommended to use a schema. mimesis
has a Schema
object, however, in order to fully take advantage of the seeding and multi-variable nature of the mimesis_stats.providers
approaches StatsSchema
should be used instead to define a schema.
A StatsSchema
object requies a schema
to be passed to it.
A schema
/schema_blueprint
is a lambda
function that contains the code to generate each variable when called.
To define a schema_blueprint
a StatsField
(equivalent to Field
from mimesis
) needs to be declared. This sets a seed and a location basis for providers.
The schema_blueprint
is then passed to the StatsSchema
to define the generator.
Example mimesis_stats
schema:
>>> from mimesis_stats.stats_schema import StatsField, StatsSchema
>>> from numpy.random import pareto
>>> field = StatsField(seed=42)
>>> schema_blueprint = lambda: {
... "name": field("person.full_name"),
... "salary": field("generic_distribution", func=pareto, a=3)
... }
>>> schema = StatsSchema(schema=schema_blueprint)
>>> schema.create(iterations=1)
[{'name': 'Annika Reilly', 'salary': 0.16932036645405568}]
>>> schema.create(iterations=2)
[{'name': 'Hank Day', 'salary': 1.7274682836709054},
{'name': 'Crystle Osborn', 'salary': 0.5510238033601347}]
Working with pandas
Standard use of the package will be with a dataframe.
The code snippets below outline the suggested approach for generating a dataframe of random data, such as a survey responses.
Consider the following basic survey.
We collect the following information:
- An ID code identifying each respondant -
"ID"
- Their email address -
"email"
- Their occupation -
"occupation"
- Whether they are a parent or not -
"parent"
- How important they think schools are when buying a house (out of 10) -
"school_importance"
The # fmt: off/on
lines stop the black
formatter changing the schema blueprint.
import pandas as pd
from mimesis_stats.stats_schema import StatsField, StatsSchema
from scipy.stats import truncnorm
# Define parameters of truncated normal
lower = 0
upper = 10
mu_true = 7
mu_false = 4
sigma = 2.5
field = StatsField(seed=42)
# fmt: off
schema_blueprint = lambda: {
"ID": field("random.custom_code", mask='SCHL#####', digit="#"),
"email": field("person.email"),
"occupation": field("person.occupation"),
"parent_school_importance": field(
"dependent_variables",
variable_names=["parent", "school_importance"],
options=[
(True, round(truncnorm.rvs(a=(lower-mu_true)/sigma, b=(upper-mu_true)/sigma,
loc=mu_true, scale=sigma))),
(False, round(truncnorm.rvs(a=(lower-mu_false)/sigma, b=(upper-mu_false)/sigma,
loc=mu_false, scale=sigma)))
],
weights=[0.3, 0.7],
)
}
# fmt: on
schema = StatsSchema(schema_blueprint)
df = pd.DataFrame(schema.create(iterations=1000))
print(df.head())
Output:
ID email occupation parent school_importance
0 SCHL60227 pyoses1812@protonmail.com Milklady False 8
1 SCHL68040 dreep1871@yandex.com Choreographer True 7
2 SCHL25016 killing1844@protonmail.com Scientist False 7
3 SCHL52580 brach1847@gmail.com Leaflet Distributor False 0
4 SCHL86319 cyrenaic1813@yandex.com Yacht Master True 9
# Check the summary stats of the two distributions
# (remember mean of sample != mean of generation parameter due to truncation)
parent_breakdown = df.groupby("parent").agg(["min", "max", "median", "mean"])
print(parent_breakdown)
Output:
school_importance
min max median mean
parent
False 0 10 4 4.219477
True 0 10 7 6.432692
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.