A robust and simple library for generating synthetic datasets for ML/DL projects.

DataGenix

An advanced and robust library for generating synthetic datasets for machine learning and deep learning projects. Go from idea to prototype in seconds without data acquisition bottlenecks.

Installation

Install from PyPI:

pip install datagenix

Or install directly from the repository:

git clone https://github.com/yourusername/datagenix.git
cd datagenix
pip install .

Ultimate Usage Example

Generate a complex, realistic dataset for a binary classification task with a single call to generate():

from datagenix import DataGenerator

generator = DataGenerator(seed=42)

df = generator.generate(
    num_rows=1000,
    numerical_whole=3,
    decimal=2,
    categorical=2,
    boolean=1,
    text=1,
    uuid=1,
    object_types=['name', 'email'],
    target_type='binary',
    missing_numerical=0.05,
    missing_categorical=0.1,
    correlation_strength=0.7,
    group_by='customer_id',
    num_groups=50,
    time_series=True,
    numerical_whole_range=(100, 999),
    add_outliers=True,
    outlier_fraction=0.02,
    text_style='review'
)

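print(df.head())
# Assuming df is a pandas DataFrame: info() prints its own summary and
# returns None, so wrapping it in print() would add a stray "None" line.
df.info()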
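The seed and numerical_whole_range arguments in the example above suggest reproducible, range-bounded output. The following is a minimal standard-library sketch of that idea, not DataGenix's actual implementation; the helper name sample_column and its parameters are hypothetical and exist only for illustration.

```python
import random


def sample_column(seed, num_rows, value_range):
    """Draw num_rows whole numbers from an inclusive (min, max) range.

    Hypothetical helper: it only illustrates seeding, not DataGenix internals.
    """
    rng = random.Random(seed)  # private generator, leaves global state alone
    low, high = value_range
    return [rng.randint(low, high) for _ in range(num_rows)]


first = sample_column(seed=42, num_rows=5, value_range=(100, 999))
second = sample_column(seed=42, num_rows=5, value_range=(100, 999))

print(first == second)                      # True: same seed, same values
print(all(100 <= v <= 999 for v in first))  # True: values stay in range
```

Using a private random.Random instance rather than the module-level functions keeps the generated data independent of any other code that also draws random numbers.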

Advanced Features

  • Target Generation: Automatically create a target column for binary, multi-class, or regression tasks that is logically correlated with the features.
  • Missing Data: Inject missing values (NaN) into any feature type with precise fractional control (e.g., missing_numerical=0.1).
  • Feature Correlation: Create linear dependencies between numerical features with adjustable correlation_strength.
  • Grouped Data: Simulate real-world scenarios like customer data by grouping rows with a common ID using group_by and num_groups.
  • Time Series: Generate a chronologically sorted timestamp column for time-dependent modeling by setting time_series=True.
  • Outlier Injection: Introduce extreme values into numerical columns to test model robustness using add_outliers and outlier_fraction.
  • Custom Ranges: Define exact (min, max) ranges for numerical columns, e.g., numerical_whole_range=(100, 999).
  • Text Styles: Generate text in different styles, such as reviews, tweets, or standard sentences, using text_style.
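To make three of the features above concrete (feature correlation, outlier injection, and missing data), here is a rough standard-library sketch of how such effects can be produced. It is an illustration under stated assumptions, not DataGenix's own code: all function names are hypothetical, and the correlation formula assumes the base column is approximately standard normal.

```python
import math
import random
import statistics


def make_correlated(xs, strength, rng):
    """Return ys whose Pearson correlation with xs is roughly `strength`.

    Assumes xs is approximately standard normal (mean 0, variance 1).
    """
    noise_scale = math.sqrt(1.0 - strength ** 2)
    return [strength * x + noise_scale * rng.gauss(0.0, 1.0) for x in xs]


def inject_outliers(values, fraction, rng, n_sigmas=8.0):
    """Move a random fraction of entries n_sigmas deviations from the mean."""
    center = statistics.fmean(values)
    spread = statistics.pstdev(values)
    out = list(values)
    for i in rng.sample(range(len(out)), round(len(out) * fraction)):
        out[i] = center + rng.choice((-1.0, 1.0)) * n_sigmas * spread
    return out


def inject_missing(values, fraction, rng):
    """Replace a random fraction of entries with None."""
    out = list(values)
    for i in rng.sample(range(len(out)), round(len(out) * fraction)):
        out[i] = None
    return out


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)
base = [rng.gauss(0.0, 1.0) for _ in range(1000)]
linked = make_correlated(base, 0.7, rng)         # like correlation_strength=0.7
with_outliers = inject_outliers(linked, 0.02, rng)  # like outlier_fraction=0.02
with_gaps = inject_missing(with_outliers, 0.05, rng)  # like missing_numerical=0.05

print(round(pearson(base, linked), 2))    # close to 0.7, varies with the sample
print(sum(v is None for v in with_gaps))  # 50
```

Outliers are injected before missing values here so that the mean and standard deviation are computed on complete data; the order DataGenix itself uses is not documented above.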

Download files

Source Distribution

datagenix-0.1.2.tar.gz (5.7 kB)

Built Distribution

datagenix-0.1.2-py3-none-any.whl (6.0 kB, Python 3)
