fibber
(This is still under development)
Teaching machine learning things is hard. The idea behind this library is to generate data in such a way that certain principles can be highlighted without resorting to "finding" the perfect dataset to do so.
Currently the library can be installed using pip:
pip install fibberio
Once the library is installed in your Python environment, you can start generating data by running:
fibber -t .\tests\data\programmers.json -o .\sandbox\programmers.csv -c 10000
where -t is the Task Description file and -o is the output file. To specify the record count, the -c flag is used. Successfully running the command should show the following:
Generating 10000 items using "programmers.json"
-----------------------------------------------
       FirstName LastName           age  style          desc accept
count      10000    10000  10000.000000  10000  10000.000000  10000
unique       966     1000           NaN      2           NaN      2
top         Remy  Anthony           NaN   tabs           NaN  False
freq          29       21           NaN   6642           NaN   5378
mean         NaN      NaN     35.985700    NaN     21.736883    NaN
std          NaN      NaN      4.983832    NaN     10.526532    NaN
min          NaN      NaN     18.000000    NaN      5.010000    NaN
25%          NaN      NaN     33.000000    NaN     12.580000    NaN
50%          NaN      NaN     36.000000    NaN     20.070000    NaN
75%          NaN      NaN     39.000000    NaN     34.660000    NaN
max          NaN      NaN     57.000000    NaN     36.800000    NaN
Saving csv to C:\projects\fibberio\sandbox\programmers.csv
Task complete
The programmers.json file is a good starting point for understanding task descriptions.
Task Description
The best way to understand how it works is to look at a task description:
{
  "sources": [
    {
      "id": "names",
      "pandas": {
        "path": "./full_names.csv",
        "read_csv": {
          "encoding": "unicode_escape",
          "engine": "python"
        }
      }
    }
  ],
  "features": [
    {
      "id": "first_name",
      "source": {
        "id": "names",
        "target": "FirstName"
      }
    },
    {
      "id": "age",
      "normal": {
        "mean": 36,
        "stddev": 5,
        "precision": 0
      }
    },
    {
      "id": "style",
      "discrete": {
        "tabs": 2,
        "spaces": 1
      }
    }
  ]
}
There are two specific sections:
- Sources - external reference data
- Features - columns to generate
Sources
The sources section contains a list of references to external files whose data can be sampled later as features.
{
  "id": "names",
  "pandas": {
    "path": "./full_names.csv",
    "read_csv": {
      "encoding": "unicode_escape",
      "engine": "python"
    }
  }
}
The id is the identifier used to reference this data source later in the features. read_csv in this case names the pandas read_csv function, with the enclosed dictionary representing the **kwargs passed to that function. In theory, any pandas call to load any file type can be used here (although as of the time of this writing, read_csv is the only one that has been tried).
The path to the data file (in the case above ./full_names.csv) is relative to the task description file unless the full path is specified.
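The path rule above can be sketched with the standard library alone. This is only an illustration of the described behaviour, not fibber's own code: resolve_source_path is a hypothetical helper, and the closing comment simply restates how the read_csv dictionary would be handed to pandas.

```python
from pathlib import Path


def resolve_source_path(task_file: str, source_path: str) -> Path:
    """Resolve a source path against the task description file.

    Hypothetical helper: absolute paths are returned unchanged, relative
    paths are interpreted against the folder that holds the task file.
    """
    path = Path(source_path)
    if path.is_absolute():
        return path
    return Path(task_file).parent / path


source = {
    "id": "names",
    "pandas": {
        "path": "./full_names.csv",
        "read_csv": {"encoding": "unicode_escape", "engine": "python"},
    },
}

csv_path = resolve_source_path("tests/data/programmers.json", source["pandas"]["path"])
kwargs = source["pandas"]["read_csv"]
print(csv_path)
# The data frame would then come from: pandas.read_csv(csv_path, **kwargs)
```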
Features
The features section contains the features the system should generate along with their corresponding distributions:
"features": [
  {
    "id": "first_name",
    "source": {
      "id": "names",
      "target": "FirstName"
    }
  },
  {
    "id": "age",
    "normal": {
      "mean": 36,
      "stddev": 5,
      "precision": 0
    }
  },
  {
    "id": "style",
    "discrete": {
      "tabs": 2,
      "spaces": 1
    }
  }
]
In this example there are exactly three features:
- first_name - this references the names source and samples from the FirstName column
- age - this samples from the normal distribution with three parameters passed in to the Normal class as **kwargs
- style - this samples from a discrete distribution that will generate tabs and spaces in a 2 to 1 ratio
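To make the age and style features concrete, here is a minimal sketch of the two kinds of sampling using only Python's random module. It is not the library's implementation: sample_normal and sample_discrete are hypothetical helpers, and treating precision as the number of decimal places is an assumption based on the example values.

```python
import random

rng = random.Random(0)  # seeded only so the sketch is repeatable


def sample_normal(mean: float, stddev: float, precision: int = 0) -> float:
    """Draw from a normal distribution and round to `precision` decimals."""
    return round(rng.gauss(mean, stddev), precision)


def sample_discrete(weights: dict) -> str:
    """Pick one key with probability proportional to its weight."""
    labels = list(weights)
    return rng.choices(labels, weights=[weights[k] for k in labels], k=1)[0]


ages = [sample_normal(36, 5, 0) for _ in range(10_000)]
styles = [sample_discrete({"tabs": 2, "spaces": 1}) for _ in range(10_000)]

print(sum(ages) / len(ages))                # close to 36
print(styles.count("tabs") / len(styles))   # close to 2/3
```

A 2 to 1 weighting puts roughly two thirds of the rows on tabs, which is in line with the 6642 tabs out of 10000 in the sample output shown earlier.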
The standard definition for a feature therefore consists of:
{
  "id": "feature_id",
  "distribution_class": {
    [... distribution args ...]
  }
}
Where the feature_id represents the id of the feature and the column name (this can be overridden in certain samplers). The distribution_class is the name of a Distribution class which is instantiated with the corresponding args.
Essentially, if the Distribution class is instantiated by:
distribution_class(prop1=2, prop2="seismic")
then the corresponding kwargs should look like
{
"prop1": 2,
"prop2": "seismic"
}
and get instantiated by
distribution_class(**kwargs)
I am optimizing for readability as opposed to brevity. This requires the class to have an __init__() with named (keyword) parameters.
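The mapping from a feature definition to a class call can be sketched as follows. The Normal class here is a stand-in and the DISTRIBUTIONS registry and build_distribution function are hypothetical names chosen for illustration; the library may look classes up differently.

```python
class Normal:
    """Stand-in distribution class; not fibber's real implementation."""

    def __init__(self, mean: float = 0.0, stddev: float = 1.0, precision: int = 2):
        self.mean = mean
        self.stddev = stddev
        self.precision = precision


# Hypothetical registry mapping the JSON key to a class.
DISTRIBUTIONS = {"normal": Normal}


def build_distribution(feature: dict):
    """Find the single non-"id" key and instantiate its class with **kwargs."""
    names = [key for key in feature if key != "id"]
    if len(names) != 1:
        raise ValueError("expected exactly one distribution per feature")
    name = names[0]
    return DISTRIBUTIONS[name](**feature[name])


feature = {"id": "age", "normal": {"mean": 36, "stddev": 5, "precision": 0}}
dist = build_distribution(feature)
print(type(dist).__name__, dist.mean, dist.stddev, dist.precision)  # Normal 36 5 0
```

Because the dictionary keys become keyword arguments, a misspelled key fails loudly with a TypeError at construction time rather than being silently ignored.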
The optional conditional part of the feature is described next.
Conditionals
Feature conditionals allow for conditional sampling based on the parent distribution. Here's an example:
"features": [
  {
    "id": "age",
    "normal": {
      "mean": 36,
      "stddev": 5,
      "precision": 0
    }
  },
  {
    "id": "score",
    "conditional": {
      "marginal": "age",
      "posterior": [
        {
          "value": "[14, 65)",
          "uniform": {
            "low": 5,
            "high": 25,
            "itype": "float",
            "precision": 2
          }
        },
        {
          "value": "[65, *)",
          "normal": {
            "mean": 35,
            "stddev": 0.5
          }
        },
        {
          "value": "*",
          "uniform": {
            "low": 5,
            "high": 25,
            "itype": "float",
            "precision": 2
          }
        }
      ]
    }
  }
]
This describes a score feature conditioned on the age feature (as the marginal). Since the parent distribution is continuous, the conditional subdivisions should be represented by ranges:
- $[a, b]$ the closed interval $\{ x \in \mathbb{R}: a \le x \le b \}$
- $[a, b)$ the interval $\{ x \in \mathbb{R}: a \le x \lt b \}$
- $(a, b]$ the interval $\{ x \in \mathbb{R}: a \lt x \le b \}$
- $(a, b)$ the open interval $\{ x \in \mathbb{R}: a \lt x \lt b \}$
with * representing either an unbounded endpoint inside a range (as in [65, *)) or, on its own, the "catch-all". The entries are processed in order and an exception is raised if none of the criteria fit.
The task generates each top-level feature first and then passes the generated value to the conditional, which evaluates each range in turn and samples from the distribution whose range "catches" that value.
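The range notation and the in-order matching described above can be sketched like this. The functions parse_range, in_range and first_match are hypothetical names, and the parsing rules are inferred from the examples in this document rather than taken from the library's source.

```python
import math


def parse_range(spec: str):
    """Turn "[14, 65)" into (low, high, low_closed, high_closed).

    A bare "*" matches everything; "*" as an endpoint means unbounded.
    """
    spec = spec.strip()
    if spec == "*":
        return (-math.inf, math.inf, False, False)
    low_closed, high_closed = spec[0] == "[", spec[-1] == "]"
    low_text, high_text = (part.strip() for part in spec[1:-1].split(","))
    low = -math.inf if low_text == "*" else float(low_text)
    high = math.inf if high_text == "*" else float(high_text)
    return (low, high, low_closed, high_closed)


def in_range(value: float, spec: str) -> bool:
    """Check whether value falls inside the interval described by spec."""
    low, high, low_closed, high_closed = parse_range(spec)
    above = value >= low if low_closed else value > low
    below = value <= high if high_closed else value < high
    return above and below


def first_match(value: float, specs: list) -> str:
    """Return the first spec that catches the value, in listed order."""
    for spec in specs:
        if in_range(value, spec):
            return spec
    raise ValueError(f"no range matches {value}")


specs = ["[14, 65)", "[65, *)", "*"]
print(first_match(36, specs))   # [14, 65)
print(first_match(65, specs))   # [65, *)
print(first_match(10, specs))   # *
```

Because matching stops at the first hit, the catch-all belongs at the end of the list; placed first, it would shadow every other entry.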
The same applies to discrete probability distributions:
"features": [
  {
    "id": "style",
    "discrete": {
      "tabs": 234,
      "spaces": 2332,
      "agile": 21,
      "scrum": 128
    }
  },
  {
    "id": "score",
    "conditional": {
      "marginal": "style",
      "posterior": [
        {
          "value": "tabs",
          "uniform": {
            "low": 5,
            "high": 25,
            "itype": "float",
            "precision": 2
          }
        },
        {
          "value": "*",
          "normal": {
            "mean": 12,
            "stddev": 3
          }
        }
      ]
    }
  }
]
In this case, the conditional score feature will sample from the uniform distribution if "tabs" is generated for the style feature, otherwise the catch-all * will sample from the normal distribution.
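For a discrete parent, the selection step reduces to an exact comparison plus the catch-all. A minimal sketch, assuming entries are checked in listed order; pick_posterior is a hypothetical helper name, not part of the library.

```python
def pick_posterior(parent_value: str, posterior: list) -> dict:
    """Return the first entry whose "value" equals the parent or is "*"."""
    for entry in posterior:
        if entry["value"] in (parent_value, "*"):
            return entry
    raise ValueError(f"no posterior entry matches {parent_value!r}")


posterior = [
    {"value": "tabs", "uniform": {"low": 5, "high": 25, "itype": "float", "precision": 2}},
    {"value": "*", "normal": {"mean": 12, "stddev": 3}},
]

print("uniform" in pick_posterior("tabs", posterior))    # True
print("normal" in pick_posterior("spaces", posterior))   # True
```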
These dependencies can be chained:
"features": [
  {
    "id": "style",
    "discrete": {
      "tabs": 234,
      "spaces": 2332,
      "agile": 21,
      "scrum": 128
    }
  },
  {
    "id": "score",
    "conditional": {
      "marginal": "style",
      "posterior": [
        {
          "value": "tabs",
          "uniform": {
            "low": 5,
            "high": 25,
            "itype": "float",
            "precision": 2
          }
        },
        {
          "value": "*",
          "normal": {
            "mean": 12,
            "stddev": 3
          }
        }
      ]
    }
  },
  {
    "id": "accepted",
    "conditional": {
      "marginal": "score",
      "posterior": [
        {
          "value": "[14, 65)",
          "uniform": {
            "low": 5,
            "high": 25,
            "itype": "float",
            "precision": 2
          }
        },
        {
          "value": "[65, *)",
          "normal": {
            "mean": 35,
            "stddev": 0.5
          }
        },
        {
          "value": "*",
          "uniform": {
            "low": 5,
            "high": 25,
            "itype": "float",
            "precision": 2
          }
        }
      ]
    }
  }
]
Notice that in this case, the first conditional required discrete values while the second used ranges. An exception is raised if there is a mismatch.
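The chaining can be illustrated with a small row generator that walks the features in order, so each conditional can read the value its marginal already produced. Everything here is an assumption made for the sketch: draw, matches and generate_row are hypothetical names, the samplers use Python's random module, ranges are simplified to the half-open [low, high) form, and the mismatch check raises TypeError only as one plausible choice of exception.

```python
import random

rng = random.Random(1)  # seeded only so the sketch is repeatable


def draw(name: str, args: dict):
    """Tiny stand-in samplers keyed by the distribution name."""
    if name == "discrete":
        return rng.choices(list(args), weights=list(args.values()), k=1)[0]
    if name == "uniform":
        return round(rng.uniform(args["low"], args["high"]), args.get("precision", 2))
    if name == "normal":
        return round(rng.gauss(args["mean"], args["stddev"]), args.get("precision", 2))
    raise ValueError(f"unknown distribution {name!r}")


def matches(value, spec: str) -> bool:
    """Exact match for strings, half-open [low, high) match for numbers."""
    if spec == "*":
        return True
    is_range = spec[0] in "[("
    if isinstance(value, str):
        if is_range:
            raise TypeError("range given for a discrete parent")
        return value == spec
    if not is_range:
        raise TypeError("discrete value given for a continuous parent")
    low, high = (part.strip() for part in spec[1:-1].split(","))
    return (low == "*" or float(low) <= value) and (high == "*" or value < float(high))


def generate_row(features: list) -> dict:
    """Process features in order; conditionals look up their parent in the row."""
    row = {}
    for feature in features:
        name, args = next((k, v) for k, v in feature.items() if k != "id")
        if name == "conditional":
            parent = row[args["marginal"]]
            entry = next((e for e in args["posterior"] if matches(parent, e["value"])), None)
            if entry is None:
                raise ValueError(f"no posterior entry catches {parent!r}")
            name, args = next((k, v) for k, v in entry.items() if k != "value")
        row[feature["id"]] = draw(name, args)
    return row


features = [
    {"id": "style", "discrete": {"tabs": 234, "spaces": 2332, "agile": 21, "scrum": 128}},
    {"id": "score", "conditional": {"marginal": "style", "posterior": [
        {"value": "tabs", "uniform": {"low": 5, "high": 25, "itype": "float", "precision": 2}},
        {"value": "*", "normal": {"mean": 12, "stddev": 3}},
    ]}},
    {"id": "accepted", "conditional": {"marginal": "score", "posterior": [
        {"value": "[14, 65)", "uniform": {"low": 5, "high": 25, "itype": "float", "precision": 2}},
        {"value": "[65, *)", "normal": {"mean": 35, "stddev": 0.5}},
        {"value": "*", "uniform": {"low": 5, "high": 25, "itype": "float", "precision": 2}},
    ]}},
]

print(generate_row(features))
```

Listing a parent before its dependants is what makes the lookup in generate_row work; a conditional that names a feature defined later would fail with a KeyError in this sketch.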
The main idea is that every Feature has a distribution and an optional dependent conditional.