A colorful and interactive CLI tool to generate customizable synthetic datasets.

Schema-driven synthetic data generation for developers & data scientists

⚡ Python API & CLI • 📊 DataFrame Ready • Strict Validation • CSV / JSON / Excel


🚀 DataGen CLI

DataGen CLI is a published Python library with 12K+ downloads, designed to solve a real problem faced by data science students and developers:
getting clean, structured, and usable datasets instantly.


✨ About

Most tools generate random, unrealistic data.

DataGen CLI is different.

👉 Define your own schema
👉 Generate meaningful, constraint-aware datasets
👉 Use directly in Python or via CLI
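
To make "schema-driven" and "constraint-aware" concrete, here is a minimal sketch using only the standard library. It is not the package's engine: `generate_rows` is a hypothetical stand-in, it handles only the `int` generator that appears in the Quick Demo, and it returns plain dicts rather than a DataFrame.

```python
import random

def generate_rows(schema, seed=None):
    """Hypothetical stand-in for a schema-driven engine: one dict per row."""
    rng = random.Random(seed)
    rows = []
    for _ in range(schema["rows"]):
        row = {}
        for col in schema["columns"]:
            params = col.get("params", {})
            if col.get("generator") == "int":
                # Constraint-aware: every value stays inside [min, max].
                row[col["name"]] = rng.randint(params["min"], params["max"])
            else:
                raise ValueError(f"unsupported generator: {col.get('generator')!r}")
        rows.append(row)
    return rows

schema = {
    "rows": 3,
    "columns": [
        {"name": "age", "generator": "int", "params": {"min": 18, "max": 60}},
    ],
}
rows = generate_rows(schema, seed=0)
print(rows)
```

The schema alone decides the column names, the row count, and the value range; the loop never hard-codes any of them.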


🔥 Key Features

  • 🧠 Schema-driven generation
  • 📊 Returns pandas DataFrame
  • 💻 Python + CLI support
  • Export to CSV, JSON, Excel
  • Strict validation system
  • ⚡ Lightweight & fast
  • 🧪 Fully tested (24/24 tests passed)

⚡ Quick Demo

Python

from datagen import generate_dataset

schema = {
    "rows": 5,
    "columns": [
        {"name": "name"},
        {"name": "age", "generator": "int", "params": {"min": 18, "max": 60}},
        {"name": "salary", "generator": "int", "params": {"min": 30000, "max": 100000}}
    ]
}

df = generate_dataset(schema)
print(df)

💻 CLI

datagen generate schema.json --rows 100 --output data.csv
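
The command above expects a `schema.json` file. One way to create it is sketched below, assuming the CLI reads the same structure as the Python dict in the demo (the source does not show the file's contents, so treat the layout as an assumption).

```python
import json
from pathlib import Path

# Same schema as the Python demo; assumed to be valid for the CLI as well.
schema = {
    "rows": 5,
    "columns": [
        {"name": "name"},
        {"name": "age", "generator": "int", "params": {"min": 18, "max": 60}},
        {"name": "salary", "generator": "int", "params": {"min": 30000, "max": 100000}},
    ],
}

path = Path("schema.json")
path.write_text(json.dumps(schema, indent=2), encoding="utf-8")

# Read it back to confirm the file round-trips cleanly.
loaded = json.loads(path.read_text(encoding="utf-8"))
print(loaded == schema)
```

Presumably `--rows 100` on the command line takes precedence over the `"rows": 5` stored in the file, but that behavior is not documented here.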

🧾 Example Output

           name  age  salary
0  Rahul Sharma   25   54000
1   Priya Verma   32   72000
2    Aman Gupta   28   61000

🧩 Architecture

datagen/
│
├── core/
│   ├── engine.py
│   ├── defaults.py
│   ├── schema_normalizer.py
│   └── schema_validator.py
│
├── cli/
│   ├── app.py
│   └── main.py
│
├── __init__.py
│
tests/
├── test_cli.py
└── test_core.py

⚙️ Phase 1: Core Engine (Completed ✅)

✔️ Implemented

  • Schema-driven engine (DataGenerationEngine)
  • Python API (generate() → DataFrame)
  • CLI (Typer-based interface)
  • Schema normalization
  • Strict validation system
  • File export support
  • Testing suite
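
Schema normalization usually means filling in whatever a column definition leaves out, such as the bare `{"name": "name"}` column in the Quick Demo. The sketch below shows one possible approach; `normalize_schema` and its defaulting rule are hypothetical and are not taken from `schema_normalizer.py`.

```python
from copy import deepcopy

def normalize_schema(schema):
    """Return a copy in which every column has 'generator' and 'params' keys."""
    normalized = deepcopy(schema)
    for col in normalized["columns"]:
        # Assumption: a column without a generator is generated by its own name.
        col.setdefault("generator", col["name"])
        col.setdefault("params", {})
    return normalized

schema = {
    "rows": 5,
    "columns": [
        {"name": "name"},
        {"name": "age", "generator": "int", "params": {"min": 18, "max": 60}},
    ],
}
normalized = normalize_schema(schema)
print(normalized["columns"][0])
```

Working on a deep copy leaves the caller's schema untouched, so the same dict can be validated or reused afterwards.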

🔒 Validation Rules

  • ✔️ min <= max for numeric fields
  • ✔️ Precision must be positive
  • ✔️ Date ranges must be valid
  • ✔️ Choice fields must not be empty
  • ✔️ Unsupported types are rejected
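
The rules above can be sketched as a single check per column. This is an illustration only: `validate_column`, the type names in `SUPPORTED_TYPES`, and the parameter keys `precision`, `start`, `end`, and `choices` are assumptions; only `min` and `max` appear in the documented schema.

```python
from datetime import date

# Hypothetical type names; the package's real generator names may differ.
SUPPORTED_TYPES = {"int", "float", "date", "choice"}

def validate_column(col):
    """Return a list of error messages; an empty list means the column passed."""
    gen = col.get("generator")
    params = col.get("params", {})
    if gen not in SUPPORTED_TYPES:
        return [f"unsupported type: {gen!r}"]
    errors = []
    if gen in {"int", "float"}:
        lo, hi = params.get("min"), params.get("max")
        if lo is not None and hi is not None and lo > hi:
            errors.append("min must be <= max")
    if gen == "float" and params.get("precision", 1) <= 0:
        errors.append("precision must be positive")
    if gen == "date":
        start, end = params.get("start"), params.get("end")
        if start is not None and end is not None and start > end:
            errors.append("date range must be valid")
    if gen == "choice" and not params.get("choices"):
        errors.append("choices must not be empty")
    return errors

ok = validate_column({"name": "age", "generator": "int", "params": {"min": 18, "max": 60}})
bad = validate_column({"name": "age", "generator": "int", "params": {"min": 60, "max": 18}})
late = validate_column({"name": "joined", "generator": "date",
                        "params": {"start": date(2024, 6, 1), "end": date(2024, 1, 1)}})
print(ok)    # []
print(bad)   # ['min must be <= max']
print(late)  # ['date range must be valid']
```

Collecting messages instead of raising on the first failure lets a `validate` command report every problem in a schema at once.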

🚀 Installation

pip install datagen-cli

💻 CLI Commands

Command    Description
generate   Generate dataset
preview    Preview sample data
validate   Validate schema

📤 Export Formats

  • CSV
  • JSON
  • Excel (.xlsx)
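
As a rough illustration of extension-based export, the sketch below writes rows to CSV or JSON with the standard library. `export_rows` is a hypothetical helper, not the package's exporter, and Excel is left out because it needs a third-party writer.

```python
import csv
import json
from pathlib import Path

# Sample rows taken from the Example Output section.
rows = [
    {"name": "Rahul Sharma", "age": 25, "salary": 54000},
    {"name": "Priya Verma", "age": 32, "salary": 72000},
]

def export_rows(rows, path):
    """Write rows to CSV or JSON, chosen by the file extension."""
    path = Path(path)
    if path.suffix == ".csv":
        with path.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
    elif path.suffix == ".json":
        path.write_text(json.dumps(rows, indent=2), encoding="utf-8")
    else:
        raise ValueError(f"unsupported export format: {path.suffix}")
    return path

export_rows(rows, "data.csv")
export_rows(rows, "data.json")
```

When the data is already a pandas DataFrame, the equivalent calls are `DataFrame.to_csv`, `DataFrame.to_json`, and `DataFrame.to_excel` (the last one requires an Excel engine such as openpyxl).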

⚡ Performance

  • ⚡ Lightweight package (26.6 kB wheel)
  • 🚀 Tested with 1 crore (10 million) rows × 17 columns
  • 🧠 Efficient memory usage
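
For a sense of scale, the back-of-the-envelope calculation below estimates the size of the largest tested dataset. The 8 bytes per cell is an assumption (a 64-bit number per cell); string columns would typically take more.

```python
ROWS = 10_000_000    # 1 crore
COLUMNS = 17
BYTES_PER_CELL = 8   # assumption: int64/float64 cells

cells = ROWS * COLUMNS
approx_gb = cells * BYTES_PER_CELL / 1e9
print(f"{cells:,} cells, about {approx_gb:.2f} GB")
# 170,000,000 cells, about 1.36 GB
```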

🔮 Roadmap

Phase 2 (Next)

  • Region-based datasets (India 🇮🇳, US 🇺🇸)
  • 🔗 Relationship-aware data (city → state, salary → role)
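
Relationship-aware generation means one column is derived from another instead of being drawn independently. The sketch below shows one possible approach to the city → state case; it is not the planned implementation, and `CITY_TO_STATE` and `generate_locations` are hypothetical names.

```python
import random

# Small illustrative lookup; a real region pack would be much larger.
CITY_TO_STATE = {
    "Mumbai": "Maharashtra",
    "Bengaluru": "Karnataka",
    "Jaipur": "Rajasthan",
}

def generate_locations(n, seed=None):
    """Pick a city at random, then derive its state so the pair is always consistent."""
    rng = random.Random(seed)
    cities = sorted(CITY_TO_STATE)
    rows = []
    for _ in range(n):
        city = rng.choice(cities)
        rows.append({"city": city, "state": CITY_TO_STATE[city]})
    return rows

rows = generate_locations(5, seed=1)
print(rows)
```

Generating the two columns independently would produce pairs such as Jaipur with Karnataka, which is exactly the kind of unrealistic data this roadmap item aims to avoid.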

Phase 3

  • 🤖 Smart generation (learning from real datasets)

🧪 Testing

24 passed, 0 failed ✅

Includes:

  • Core engine tests
  • CLI tests
  • Validation tests

๐Ÿค Contributing

git clone https://github.com/your-username/datagen-cli
cd datagen-cli
pip install -r requirements.txt

👨‍💻 Author

Rishabh Kumar

  • Built to solve a real-world problem faced during learning data science
  • Already helping thousands generate datasets instantly
  • Now evolving into a full-scale synthetic data engine

โญ Support

If this project helped you:

๐Ÿ‘‰ Star the repo
๐Ÿ‘‰ Share it
๐Ÿ‘‰ Use it in your projects


💡 Vision

Data shouldn't be the bottleneck.
It should be generated instantly.

Built Distribution
datagen_cli-0.2.0-py3-none-any.whl (26.6 kB, Python 3)