A colorful and interactive CLI tool to generate customizable synthetic datasets.
Schema-driven synthetic data generation for developers & data scientists
Python API & CLI • DataFrame Ready • Strict Validation • CSV / JSON / Excel
DataGen CLI
DataGen CLI is a Python library with 12K+ downloads, built to solve a real problem faced by data science students and developers: getting clean, structured, and usable datasets instantly.
About
Most tools generate random, unrealistic data.
DataGen CLI is different.
- Define your own schema
- Generate meaningful, constraint-aware datasets
- Use directly in Python or via CLI
Key Features
- Schema-driven generation
- Returns pandas DataFrame
- Python + CLI support
- Export to CSV, JSON, Excel
- Strict validation system
- Lightweight & fast
- Fully tested (24/24 tests passed)
Quick Demo
Python
from datagen import generate_dataset

# Describe the dataset: a row count plus one entry per column.
schema = {
    "rows": 5,
    "columns": [
        {"name": "name"},
        {"name": "age", "generator": "int", "params": {"min": 18, "max": 60}},
        {"name": "salary", "generator": "int", "params": {"min": 30000, "max": 100000}},
    ],
}

# The result is a pandas DataFrame.
df = generate_dataset(schema)
print(df)
CLI
datagen generate schema.json --rows 100 --output data.csv
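The command above reads its schema from schema.json. A minimal sketch for producing that file from Python, assuming the CLI accepts the same structure as the dictionary passed to generate_dataset (the file format is not spelled out here, so treat that as an assumption):

```python
import json
from pathlib import Path

# Same schema as the Python demo. Assumed (not confirmed) to be the
# structure the CLI expects inside schema.json.
schema = {
    "rows": 5,
    "columns": [
        {"name": "name"},
        {"name": "age", "generator": "int", "params": {"min": 18, "max": 60}},
        {"name": "salary", "generator": "int", "params": {"min": 30000, "max": 100000}},
    ],
}

schema_path = Path("schema.json")
schema_path.write_text(json.dumps(schema, indent=2), encoding="utf-8")

# Round-trip check: the file parses back to the same dictionary.
loaded = json.loads(schema_path.read_text(encoding="utf-8"))
print(loaded == schema)
```

Keeping the schema in a file makes the same definition reusable from both the CLI and the Python API.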
Example Output
           name  age  salary
0  Rahul Sharma   25   54000
1   Priya Verma   32   72000
2    Aman Gupta   28   61000
Architecture
datagen/
│
├── core/
│   ├── engine.py
│   ├── defaults.py
│   ├── schema_normalizer.py
│   └── schema_validator.py
│
├── cli/
│   ├── app.py
│   └── main.py
│
└── __init__.py

tests/
├── test_cli.py
└── test_core.py
Phase 1: Core Engine (Completed)
Implemented
- Schema-driven engine (DataGenerationEngine)
- Python API (generate() → DataFrame)
- CLI (Typer-based interface)
- Schema normalization
- Strict validation system
- File export support
- Testing suite
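The library's internals are not shown here, so the following is only an illustrative sketch of what schema-driven generation can look like: normalize each column by filling in defaults, then draw one value per column per row. The helper names (normalize_column, generate_value, generate_rows), the "str" fallback generator, and its placeholder values are hypothetical and are not DataGen's actual implementation. The sketch uses only the standard library and returns a list of dicts rather than a DataFrame.

```python
import random

# Hypothetical default: a column with no generator falls back to "str".
DEFAULT_COLUMN = {"generator": "str", "params": {}}


def normalize_column(column):
    """Fill in missing keys so later steps can rely on a uniform shape."""
    normalized = dict(DEFAULT_COLUMN)
    normalized.update(column)
    normalized["params"] = dict(normalized.get("params") or {})
    return normalized


def generate_value(column, rng):
    """Draw one value for a normalized column."""
    params = column["params"]
    if column["generator"] == "int":
        # randint is inclusive on both ends.
        return rng.randint(params.get("min", 0), params.get("max", 100))
    if column["generator"] == "str":
        # Placeholder text; a real engine would produce realistic values.
        return f"{column['name']}_{rng.randint(0, 9999)}"
    raise ValueError(f"unsupported generator: {column['generator']}")


def generate_rows(schema, seed=None):
    """Return a list of row dicts described by the schema."""
    rng = random.Random(seed)
    columns = [normalize_column(c) for c in schema["columns"]]
    return [
        {c["name"]: generate_value(c, rng) for c in columns}
        for _ in range(schema["rows"])
    ]


schema = {
    "rows": 3,
    "columns": [
        {"name": "name"},
        {"name": "age", "generator": "int", "params": {"min": 18, "max": 60}},
    ],
}
rows = generate_rows(schema, seed=42)
for row in rows:
    print(row)
```

Passing a seed makes the output reproducible, which is useful when generated data feeds into tests.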
Validation Rules
- min <= max for numeric fields
- Precision must be positive
- Date ranges must be valid
- Choice fields must not be empty
- Unsupported types are rejected
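The rules above translate naturally into a single validation function. The sketch below is one possible implementation under stated assumptions: the generator names "float", "date", and "choice" and the parameter keys precision, start, end, and choices are hypothetical (only "int", min, and max appear in the demo schema), and validate_column is not DataGen's actual validator.

```python
from datetime import date

# Assumed set of supported generators; only "int" is confirmed by the demo.
SUPPORTED = {"int", "float", "date", "choice"}


def validate_column(column):
    """Raise ValueError if the column breaks one of the five rules."""
    name = column.get("name", "<unnamed>")
    generator = column.get("generator")
    params = column.get("params", {})

    # Unsupported types are rejected.
    if generator not in SUPPORTED:
        raise ValueError(f"{name}: unsupported generator {generator!r}")

    # min <= max for numeric fields.
    if generator in {"int", "float"}:
        if "min" in params and "max" in params and params["min"] > params["max"]:
            raise ValueError(f"{name}: min must be <= max")

    # Precision must be positive.
    if generator == "float" and params.get("precision", 1) <= 0:
        raise ValueError(f"{name}: precision must be positive")

    # Date ranges must be valid: parseable, with start on or before end.
    if generator == "date":
        try:
            start = date.fromisoformat(params["start"])
            end = date.fromisoformat(params["end"])
        except (KeyError, ValueError) as exc:
            raise ValueError(f"{name}: start and end must be ISO dates") from exc
        if start > end:
            raise ValueError(f"{name}: start must not be after end")

    # Choice fields must not be empty.
    if generator == "choice" and not params.get("choices"):
        raise ValueError(f"{name}: choices must not be empty")


def is_valid(column):
    """Convenience wrapper that reports validity as a boolean."""
    try:
        validate_column(column)
    except ValueError:
        return False
    return True


good = {"name": "age", "generator": "int", "params": {"min": 18, "max": 60}}
bad = {"name": "age", "generator": "int", "params": {"min": 60, "max": 18}}
print(is_valid(good), is_valid(bad))
```

Raising on the first violation keeps the sketch short; collecting all errors into a list would give users a fuller report in one pass.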
Installation
pip install datagen-cli
CLI Commands
| Command | Description |
|---|---|
| generate | Generate dataset |
| preview | Preview sample data |
| validate | Validate schema |
Export Formats
- CSV
- JSON
- Excel (.xlsx)
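As a dependency-free illustration of what the CSV and JSON outputs might contain, the sketch below writes the three sample rows from the example output using only the standard library. This is not how DataGen itself exports, and the file names are arbitrary. Excel output needs a third-party writer and is left out.

```python
import csv
import json

# Rows copied from the example output shown earlier.
rows = [
    {"name": "Rahul Sharma", "age": 25, "salary": 54000},
    {"name": "Priya Verma", "age": 32, "salary": 72000},
    {"name": "Aman Gupta", "age": 28, "salary": 61000},
]

# CSV: one header line, then one line per row.
with open("data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)

# JSON: a list of objects, one per row.
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)

with open("data.csv", newline="", encoding="utf-8") as f:
    print(f.read())
```

Note that CSV stores every value as text, so numeric columns come back as strings unless the reader converts them, while JSON preserves the integer types.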
Performance
- Lightweight package (kilobytes only)
- Tested with 1 crore (10 million) rows × 17 columns
- Efficient memory usage
Roadmap
Phase 2 (Next)
- Region-based datasets (India, US)
- Relationship-aware data (city → state, salary → role)
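Relationship-aware generation is still on the roadmap, so no API exists for it yet. One way the city-to-state idea could work, sketched with hypothetical names (CITY_TO_STATE, generate_locations): sample the independent column first, then derive the dependent column from a lookup so the pair is always consistent.

```python
import random

# Hypothetical lookup table; a real implementation would need a larger one.
CITY_TO_STATE = {
    "Mumbai": "Maharashtra",
    "Bengaluru": "Karnataka",
    "Jaipur": "Rajasthan",
    "Chennai": "Tamil Nadu",
}


def generate_locations(n, seed=None):
    """Return n rows whose state always matches the sampled city."""
    rng = random.Random(seed)
    cities = list(CITY_TO_STATE)
    rows = []
    for _ in range(n):
        city = rng.choice(cities)  # independent column
        rows.append({"city": city, "state": CITY_TO_STATE[city]})  # derived column
    return rows


locations = generate_locations(5, seed=0)
for row in locations:
    print(row)
```

Generating the two columns independently would produce impossible pairs such as Mumbai in Karnataka, which is the kind of unrealistic output this roadmap item aims to avoid.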
Phase 3
- Smart generation (learning from real datasets)
Testing
24 passed, 0 failed
Includes:
- Core engine tests
- CLI tests
- Validation tests
Contributing
git clone https://github.com/your-username/datagen-cli
cd datagen-cli
pip install -r requirements.txt
Author
Rishabh Kumar
- Built to solve a real-world problem faced during learning data science
- Already helping thousands generate datasets instantly
- Now evolving into a full-scale synthetic data engine
Support
If this project helped you:
- Star the repo
- Share it
- Use it in your projects
Vision
Data shouldn't be the bottleneck.
It should be generated instantly.