Generation and mutation of realistic data at scale.

Gecko is a Python library for the bulk generation and mutation of realistic personal data. It is a spiritual successor to the GeCo framework, which was initially published by Tran, Vatsalan and Christen. Gecko reimplements the most promising aspects of the original framework for modern Python with a simplified API, adds extra features and massively improves performance by building on NumPy and Pandas.

Installation

Install with pip:

pip install gecko-syndata

Install with Poetry:

poetry add gecko-syndata

Basic usage

Please see the docs for an in-depth guide on how to use the library.

Writing a data generation script with Gecko is usually split into two consecutive steps. In the first step, data is generated based on information that you provide. Most commonly, Gecko pulls the information it needs from frequency tables, although other means of generating data are possible. Gecko will then output a dataset to your specifications.
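To illustrate the idea behind frequency-based generation, here is a minimal sketch that uses only NumPy and Pandas. It is not Gecko's implementation. The helper `sample_from_frequencies`, the in-memory table and its column names are hypothetical and chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frequency table: one row per value with its absolute count.
freq = pd.DataFrame(
    {"last_name": ["Müller", "Schmidt", "Schneider"], "count": [6, 3, 1]}
)


def sample_from_frequencies(table, value_column, freq_column, n, rng):
    """Draw n values with probability proportional to their counts."""
    probabilities = (table[freq_column] / table[freq_column].sum()).to_numpy()
    values = table[value_column].to_numpy()
    return rng.choice(values, size=n, p=probabilities)


# a seeded generator keeps the draw reproducible
rng = np.random.default_rng(727)
sampled = pd.Series(
    sample_from_frequencies(freq, "last_name", "count", 10_000, rng),
    name="last_name",
)

# relative frequencies in the sample should be close to 0.6 / 0.3 / 0.1
print(sampled.value_counts(normalize=True))
```

Values with higher counts in the table appear proportionally more often in the output, which is what makes the generated data resemble the source distribution.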

In the second step, a copy of this dataset is mutated. Gecko provides functions which deliberately introduce errors into your dataset. These errors can take the form of typos, edit errors and other common error sources. By the end, you will have a generated dataset and a mutated copy thereof.
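The mutation step can be pictured with a small sketch in plain NumPy and Pandas. It does not use Gecko's mutator functions and makes no claim about how they work internally. The helpers `delete_random_char` and `mutate_fraction` are hypothetical, and treating the fraction as an exact share of rows is an assumption made for this example.

```python
import numpy as np
import pandas as pd


def delete_random_char(value, rng):
    """Remove one character at a random position; leave short values alone."""
    if len(value) < 2:
        return value
    pos = rng.integers(len(value))
    return value[:pos] + value[pos + 1:]


def mutate_fraction(series, fraction, rng):
    """Apply delete_random_char to a randomly chosen fraction of the rows."""
    mutated = series.copy()
    n_rows = int(round(len(series) * fraction))
    # pick distinct row positions so no row is mutated twice
    picked = rng.choice(len(series), size=n_rows, replace=False)
    mutated.iloc[picked] = [delete_random_char(v, rng) for v in series.iloc[picked]]
    return mutated


rng = np.random.default_rng(727)
original = pd.Series(
    ["Müller", "Schmidt", "Schneider", "Fischer"] * 250, name="last_name"
)
mutated = mutate_fraction(original, 0.01, rng)

# number of rows whose value differs from the original
changed = (original != mutated).sum()
print(changed)
```

The original series is left untouched, so both the clean and the mutated version remain available for comparison.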

Common workflow with Gecko

Gecko exposes two modules, generator and mutator, to help you write data generation scripts. Both contain built-in functions covering the most common use cases for generating data from frequency information and mutating data based on common error sources, such as typos, OCR errors and much more.

The following example gives a very brief overview of what a data generation script with Gecko might look like. It uses frequency tables from the Gecko data repository, which has been cloned into a directory next to the script itself.

from pathlib import Path

import numpy as np

from gecko import generator, mutator

# create a RNG with a set seed for reproducible results
rng = np.random.default_rng(727)
# path to the Gecko data repository
gecko_data_dir = Path(__file__).parent / "gecko-data"

# create a data frame with 10,000 rows and a single column called "last_name" 
# which sources its values from the frequency table with the same name
df_generated = generator.to_data_frame(
    [
        ("last_name", generator.from_frequency_table(
            gecko_data_dir / "de_DE" / "last-name.csv",
            value_column="last_name",
            freq_column="count",
            rng=rng,
        )),
    ],
    10_000,
)

# mutate this data frame by randomly deleting characters in 1% of all rows
df_mutated = mutator.mutate_data_frame(
    df_generated,
    [
        ("last_name", (.01, mutator.with_delete(rng))),
    ],
    rng,
)

# export both data frames using Pandas' to_csv function
df_generated.to_csv("german-generated.csv", index_label="id")
df_mutated.to_csv("german-mutated.csv", index_label="id")
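Once a generated data frame and its mutated copy exist, a quick sanity check is to see which rows actually differ. The sketch below uses Pandas only. The two small data frames are hand-written stand-ins for the output of a script like the one above, not real Gecko output.

```python
import pandas as pd

# Stand-ins for a generated data frame and its mutated copy.
df_generated = pd.DataFrame(
    {"last_name": ["Müller", "Schmidt", "Schneider", "Fischer"]}
)
df_mutated = pd.DataFrame(
    {"last_name": ["Müller", "Schmdt", "Schneider", "Fiscer"]}
)

# Rows where the mutated value no longer matches the generated one.
changed_mask = df_generated["last_name"] != df_mutated["last_name"]

# Side-by-side view of the affected rows only.
changed_rows = pd.DataFrame(
    {
        "generated": df_generated.loc[changed_mask, "last_name"],
        "mutated": df_mutated.loc[changed_mask, "last_name"],
    }
)

print(f"{changed_mask.mean():.0%} of rows were mutated")
print(changed_rows)
```

Because both data frames share the same index, the comparison lines up row by row without any explicit join.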

For a more extensive usage guide, refer to the docs.

Rationale

The GeCo framework was originally conceived to facilitate the generation and mutation of personal data to validate record linkage algorithms. In the field of record linkage, real-world personal data to test new algorithms on is hard to come by. Hence, GeCo took a synthetic approach using statistical models from publicly available data. GeCo was built for Python 2.7 and has not seen any active development since its last publication in 2013. The general idea of providing shareable and reproducible Python scripts to generate personal data, however, still holds a lot of promise. This has led to the development of the Gecko library.

Many of GeCo's weaknesses are rectified in this library. Vectorized functions from Pandas and NumPy provide significant performance boosts and aid integration into existing data science applications. A simplified API makes it much easier to develop custom generators and mutators. Using NumPy's random number generation routines instead of Python's built-in random module makes fine-tuned, reproducible results a breeze. Gecko therefore seeks to be GeCo's "bigger brother" and aims to provide a much more refined experience for generating realistic personal data.
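The reproducibility point can be shown with NumPy alone. This is a minimal sketch of seeded random number generation in general, not of anything Gecko does internally. The name list and the idea of using separate streams for generation and mutation are assumptions made for illustration.

```python
import numpy as np

names = np.array(["Müller", "Schmidt", "Schneider"])

# Two generators created with the same seed yield the same sequence of draws.
first = np.random.default_rng(727).choice(names, size=5)
second = np.random.default_rng(727).choice(names, size=5)
print((first == second).all())

# Independent but reproducible streams can be spawned from one seed sequence,
# for example one for a generation step and one for a mutation step.
gen_seed, mut_seed = np.random.SeedSequence(727).spawn(2)
rng_generate = np.random.default_rng(gen_seed)
rng_mutate = np.random.default_rng(mut_seed)
```

Passing explicit generator objects around, rather than relying on global random state, keeps each step's randomness under the script author's control.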

Disclaimer

Gecko is still very much in a "beta" state. As it stands, it satisfies our internal use cases within the Medical Data Science group, but we also seek wider adoption. If you find any issues with the library or have suggestions for improvements, do not hesitate to contact us.

License

Gecko is released under the MIT License.
