
Random dataset generation tool


RandomDataset


Generates random datasets for testing and fun.

This repository contains a simple library for generating random tabular datasets of virtually any size. It also serves as an example repository for a Python code base with basic CI/CD integration and tools.

Install this library from a git clone:

$ pip install .

Data is generated from a YAML schema describing the names of tables/datasets and the fields they have. The YAML file consists of a sequence of dictionaries used to instantiate objects from the library or from other libraries present in the Python path. This allows custom code to be injected into the generation process.
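The typename-driven instantiation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's actual loader: the helper names locate and instantiate are hypothetical, the parsed YAML is assumed to arrive as plain lists and dictionaries, and a standard-library type stands in for the randomdataset classes so the sketch runs on its own.

```python
import importlib
from typing import Any


def locate(typename: str) -> Any:
    """Resolve a dotted name such as 'fractions.Fraction' to the object it names."""
    module_name, _, attr = typename.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


def instantiate(spec: Any) -> Any:
    """Recursively build objects from dictionaries that carry a 'typename' key."""
    if isinstance(spec, list):
        return [instantiate(item) for item in spec]
    if isinstance(spec, dict):
        # Build nested objects first so they can be passed as constructor arguments
        kwargs = {k: instantiate(v) for k, v in spec.items() if k != "typename"}
        if "typename" in spec:
            return locate(spec["typename"])(**kwargs)
        return kwargs
    return spec  # scalars (str, int, ...) pass through unchanged


# Stand-in for a parsed schema: a sequence of dictionaries, as in the YAML file
spec = [{"typename": "fractions.Fraction", "numerator": 3, "denominator": 4}]
objects = instantiate(spec)
print(objects)
```

Because the type is looked up by its dotted name at run time, any importable class can appear in a schema, which is what makes it possible to inject custom code.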

An example schema, customerschema.yaml, describes a list of customer records:

- typename: randomdataset.generators.CSVGenerator
  num_lines: 10
  dataset:
    name: customers
    typename: randomdataset.Dataset
    fields:
    - name: id
      typename: randomdataset.UIDFieldGen
    - name: FirstName
      typename: randomdataset.StrFieldGen
      lmin: 6
      lmax: 14
    - name: LastName
      typename: randomdataset.StrFieldGen
      lmin: 6
      lmax: 14

This will create a single dataset "customers" stored in the CSV file customers.csv. The file is generated by invoking the included command:

$ generate_dataset customerschema.yaml .

The resulting customers.csv file contains:

id,FirstName,LastName
0,"QDFFgv4XBd5VW","O1Odro"
1,"Gp4mYq","82IPIChjBALg"
2,"LR7KVudB","HcAPBwM"
3,"6FfWGEYS0Q","5NbspSBJk"
4,"si1Tj0xSBB2","eChYKAaW5aa8R"
5,"DYP6OMerUUFOR","pYNXUTNLqdrv"
6,"ltfnhTgrJF","2Rctye"
7,"1tAoaDl57Lo5","xMkVKt6O"
8,"1yJImoqiwf","IJICD8W6B8k"
9,"XkYgS7","8owHyjR"
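One way to sanity-check the output is to read it back with Python's csv module and confirm that each string field respects the lmin and lmax values from the schema. The sketch below is not part of the library. The function name find_violations is hypothetical, the bounds are assumed to be inclusive, and the first rows of the file are embedded so the example is self-contained.

```python
import csv
import io

# First rows of the customers.csv shown above, embedded for a self-contained example
sample = '''id,FirstName,LastName
0,"QDFFgv4XBd5VW","O1Odro"
1,"Gp4mYq","82IPIChjBALg"
2,"LR7KVudB","HcAPBwM"
'''


def find_violations(text, fields=("FirstName", "LastName"), lmin=6, lmax=14):
    """Return (row id, field, value) for every value whose length is out of bounds."""
    violations = []
    for row in csv.DictReader(io.StringIO(text)):
        for field in fields:
            if not lmin <= len(row[field]) <= lmax:
                violations.append((row["id"], field, row[field]))
    return violations


violations = find_violations(sample)
print(violations)
```

To check a real file, pass the contents of open("customers.csv", newline="").read() instead of the embedded sample.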

Repository Setup

A relatively simple set of tools is integrated with this repo to encourage good coding practice:

  • Automatic documentation generation is done using ReadTheDocs, see README.md
  • CI/CD is implemented as flake8 linting and unit test execution using GitHub Actions, see python-app.yml
  • Code coverage is displayed using Codecov

Both ReadTheDocs and Codecov are integrated with the repo as webhooks. These can be set up through their respective sites, which require GitHub credentials to link with repos.

This repo mostly follows GitFlow with a master branch which is always the current release of the code, and a dev branch that is the development version of the code. Branch protection rules are in place for master which ensure that code can only be committed to the branch through reviewed PRs:

  • Require pull request reviews before merging
  • Require status checks to pass before merging ("build" action selected)
  • Require branches to be up to date before merging
  • Require linear history
  • Include administrators
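The rules listed above are normally ticked in the repository settings page, but they can also be written down as the JSON body for GitHub's "update branch protection" REST endpoint (PUT /repos/{owner}/{repo}/branches/{branch}/protection). The sketch below only builds and prints that body and sends nothing. The approval count of 1 is an assumption, since the text does not say how many reviews are required.

```python
import json

# Request body mirroring the five rules listed above
payload = {
    "required_pull_request_reviews": {
        # Assumption: one approving review; the README does not state a number
        "required_approving_review_count": 1,
    },
    "required_status_checks": {
        "strict": True,         # require branches to be up to date before merging
        "contexts": ["build"],  # the "build" action must pass
    },
    "required_linear_history": True,
    "enforce_admins": True,     # include administrators
    "restrictions": None,       # no extra limits on who may push
}

body = json.dumps(payload, indent=2)
print(body)
```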

PyPI Release

Whenever a new release is made, it is uploaded automatically to PyPI using the default GitHub workflow "Publish Python Package". The basic steps to set this up for this repo are:

  1. Create an account on pypi.org
  2. Create a wheel file with python setup.py bdist_wheel; this creates dist/RandomDataset-0.1.0-py3-none-any.whl
  3. Upload this package manually to PyPI with python -m twine upload dist/* (assuming you already have twine installed)
  4. Get the API token for the new package and store it as the secret PYPI_API_TOKEN in the repository's settings
  5. Add the workflow file .github/workflows/python-publish.yml (GitHub's default "Publish Python Package" workflow)
  6. Commit the changes and create a release for the project; this should upload to PyPI automatically

