A synthetic data generation library with 700+ built-in data fields across 22 categories.

🎲 Iki Data Generator

Generate realistic, diverse synthetic data with 700+ built-in fields across 22 categories. Perfect for testing, development, and prototyping, without the legal baggage of real data.


What Is This?

Iki Data Generator is a Python library that creates synthetic datasets on demand. Instead of wrestling with dummy data or copy-pasting fake records, you define a schema (which fields you want), call .many(n) to generate n records, and export them to CSV, JSON, SQL, Excel, Parquet, or one of several other formats. That's it.

It's built for developers who need:

  • Test data for unit/integration tests
  • Demo data for presentations or prototypes
  • Mock databases for local development
  • Privacy-friendly datasets with realistic properties but zero personal info
  • Performance testing with large datasets

Why Use Iki Data Generator?

✅ You Get

Benefit | What It Means
700+ Fields | First name, email, credit card, medical codes, stock prices, cryptocurrencies, ML metrics, etc.
22 Categories | Personal, Finance, Commerce, Healthcare, Location, Education, Legal, AI/ML, and more
Easy Schema | Simple string shortcuts or full control with dicts
Flexible Export | CSV, JSON, SQL, Parquet, DuckDB, Excel, XML, TSV, Firebase, more
Zero Dependencies on Real Data | No need to anonymize or worry about PII
Blazing Fast | Generates thousands of records instantly
Extensible | Add custom providers for domain-specific fields

❌ You Don't Get

  • Real people's data
  • A need for data anonymization lawyers
  • Internet calls to fake APIs
  • Massive CSV files to download and commit

Installation

From PyPI (recommended)

pip install iki-data-generator

From Source

git clone https://github.com/ikidevz/IkiDataGenerator.git
cd IkiDataGenerator
pip install -e .

Requirements

  • Python ≥ 3.10
  • Dependencies: duckdb, pandas, pyarrow, numpy, openpyxl, bcrypt, and a few others (installed automatically)

Quick Start (60 Seconds)

The Simplest Example

from ikidatagen import IkiDataGenerator

# Define what fields you want
schema = ["first_name", "last_name", "email_address", "gender_binary"]

# Generate 100 records
IkiDataGenerator(schema).many(100).export("users")  # export() returns None, so nothing is assigned

Result: You now have output/users.csv and output/users.json with 100 realistic user records.

A More Realistic Example

from ikidatagen import IkiDataGenerator

schema = [
    {
        "label": "User ID",
        "key_label": "row_number",
        "options": {"blank_percentage": 0}  # No blanks for ID
    },
    "first_name",
    "last_name",
    "email_address",
    {
        "label": "Account Created",
        "key_label": "current_timestamp",
        "options": {"blank_percentage": 5}  # 5% will be blank
    },
    {
        "label": "IP Address",
        "key_label": "ip_address_v4",
        "options": {"blank_percentage": 25}  # 25% will be blank
    },
    {
        "label": "Full Profile",
        "key_label": "template",
        "options": {
            "template": "{{first_name}} {{last_name}} ({{email_address}})"
        }
    },
]

# Generate 500 records and save to both CSV and JSON
IkiDataGenerator(schema).many(500).export("users", formats=["csv", "json"])

Result: output/users.csv and output/users.json with 500 user records, including the configured share of blank timestamps and IP addresses.


Schema Definition

The schema is the heart of Iki Data Generator. It tells the library what fields to generate.

Schema Entry Types

1. Simple String (Shorthand)

schema = ["first_name", "last_name", "email_address"]
# Generates fields with default settings, no options

2. Full Control (Dict)

schema = [
    {
        "key_label": "email_address",      # Required: which provider to use
        "label": "Email",                  # Optional: output column name (defaults to key_label)
        "group": "personal",               # Optional: provider category (auto-resolved if omitted)
        "options": {"blank_percentage": 10}  # Optional: provider-specific config
    }
]

Key Parameters

Parameter | Required? | Description
key_label | ✅ Yes | The provider name (e.g., first_name, credit_card_number)
label | ❌ No | How to name the output column (defaults to key_label)
group | ❌ No | Provider category (auto-resolved from registry; override if needed)
options | ❌ No | Provider-specific settings (e.g., blank_percentage, template)
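
The two entry forms can be read as one: a string is shorthand for a dict in which only key_label is set. The sketch below illustrates that normalization under stated assumptions. normalize_entry is a hypothetical helper written for this example, not part of the library, and its defaults simply mirror the table above (label falls back to key_label, group is None to stand for "auto-resolved").

```python
def normalize_entry(entry):
    """Return a schema entry in full dict form (hypothetical helper, not library API)."""
    if isinstance(entry, str):
        entry = {"key_label": entry}
    if "key_label" not in entry:
        raise ValueError("schema entry needs a key_label")
    return {
        "key_label": entry["key_label"],
        "label": entry.get("label", entry["key_label"]),  # column name defaults to key_label
        "group": entry.get("group"),  # None stands for "resolve from the registry"
        "options": dict(entry.get("options", {})),
    }


schema = ["first_name", {"key_label": "email_address", "label": "Email"}]
normalized = [normalize_entry(e) for e in schema]
print(normalized[0]["label"])
print(normalized[1]["label"])
```

Writing entries in the dict form only where you need a label or options keeps short schemas readable.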

Available Options (Common)

Option | Type | Example | Effect
blank_percentage | int/float | 10 | Percentage of records where field is empty
template | str | "{{first_name}} {{last_name}}" | Combine fields with template syntax
pattern | str | "[A-Z]{3}-\d{4}" | Regex pattern (for regular_expression provider)
min, max | int/float | min=0, max=100 | Range for numeric fields

22 Data Categories

Iki Data Generator organizes 700+ fields into 22 categories. Here's a quick overview:

🧑 Personal (Name, Gender, Passport, etc.)

first_name, last_name, middle_name, gender_binary, gender_spectrum, title, passport_number, ssn, nationality, etc.

💰 Finance (Credit Cards, Banking, Currency)

credit_card_number, credit_card_type, iban, bban, currency, currency_code, money, salary_range, stock_market, etc.

🛍️ Commerce (Products, Orders, Pricing)

product_name, product_category, product_price, barcode_ean13, order_status, payment_method, invoice_number, delivery_status, coupon_code, etc.

📧 Communication (Email, Phone, Social)

email_address, phone_number, username, social_media_handle, chat_message, contact_name, etc.

🏗️ Construction (Building, Materials, Codes)

construction_code, building_type, material_type, foundation_type, roof_type, door_type, etc.

💻 Tech/IT (Programming, Frameworks, Version)

programming_language, software_framework, version_number, log_level, http_status_code, file_extension, mime_type, etc.

🏥 Healthcare (Diseases, Medications, Medical Codes)

disease_name, symptom_name, medication_name, blood_type, vaccination_status, icd10_diagnosis, icd9_diagnosis, hcpcs_code, etc.

🌍 Location (Countries, Cities, Addresses)

country, state, city, street_address, postal_code, latitude, longitude, timezone, airport_code, etc.

📚 Education (Schools, Courses, Subjects)

university_name, degree, major, course_name, subject, educational_attainment, etc.

⚖️ Legal (Laws, Contracts, Jurisdictions)

legal_entity_type, contract_type, jurisdiction, court_type, legal_case_status, legal_term, etc.

🎬 Entertainment (Movies, Books, Games)

movie_title, movie_genre, book_title, author_name, video_game_title, music_genre, song_title, etc.

🌿 Nature (Plants, Animals, Weather)

plant_name, animal_name, tree_name, flower_name, weather_condition, season, biome, etc.

🚗 Automotive (Cars, VINs, Fuel)

car_make, car_model, car_vin, vehicle_type, license_plate, engine_type, fuel_type, transmission_type, etc.

💱 Cryptocurrency (Coins, Blockchain, Wallets)

crypto_currency, crypto_address, crypto_transaction_id, blockchain_type, smart_contract_language, etc.

🎮 Gaming (Characters, Items, Guilds)

character_class, character_race, game_genre, npc_name, item_type, quest_name, guild_name, etc.

🎵 Music (Artists, Albums, Genres)

artist_name, album_name, song_title, music_genre, instrument_name, music_production_software, etc.

📱 Marketing/Media (Campaigns, Analytics, Content)

campaign_name, social_media_platform, marketing_channel, content_type, target_audience, analytics_metric, etc.

🌍 Political (Countries, Parties, Elections)

political_party, political_ideology, election_type, government_structure, diplomatic_title, etc.

📝 Advanced (Templates, Regex, Lambdas)

template (combine fields), regular_expression (match patterns), lambda (custom Python), json_array, url, digit_sequence, character_sequence, etc.

🤖 AI/ML (Models, Metrics, Training)

model_type, model_framework, model_task, model_version, model_latency, model_confidence, cpu_utilization, gpu_utilization, data_drift_score, inference_result, etc.

✨ Basic (Utilities, Random, Generators)

row_number, blank, boolean, number, datetime, color, emoji, password, password_hash, isbn, ulid, sentiment, words, sentences, paragraphs, etc.

🎲 Miscellaneous (Random & Fun)

dice_roll, coin_flip, rating, frequency, priority_level, dimension, duration, height, weight, temperature, etc.


Export Formats

Generate data in your preferred format:

# Export to multiple formats at once
IkiDataGenerator(schema).many(1000).export("dataset", formats=[
    "csv",      # Comma-separated values
    "json",     # JSON array of objects
    "sql",      # SQL INSERT statements
    "parquet",  # Apache Parquet (columnar)
    "excel",    # Excel workbook (.xlsx)
    "duckdb",   # DuckDB database
    # "tsv", "xml", "cql", "firebase", "dbunit" also supported
])

Format | File Extension | Best For | Notes
CSV | .csv | Spreadsheets, import tools | Universal format
JSON | .json | APIs, JavaScript, NoSQL | Pretty-printed with indent=2
SQL | .sql | Databases | INSERT statements (specify table name)
TSV | .tsv | Tab-delimited data | Alternative to CSV
Excel | .xlsx | Business reports | Native Excel format
Parquet | .parquet | Big Data, Pandas, BI tools | Efficient columnar storage
DuckDB | .duckdb | Analytics, SQL queries | Embedded database
XML | .xml | Legacy systems, config | Structured XML export
Firestore | .json | Firebase/Firestore | Firebase-ready format
DBUnit | .xml | Testing frameworks | DBUnit test data format
CQL | .cql | Cassandra databases | CQL INSERT statements

API Reference

IkiDataGenerator(schema)

Initialize the generator with a schema.

Parameters:

  • schema (list): List of field names (strings) or field configs (dicts)

Returns: IkiDataGenerator instance

gen = IkiDataGenerator(["first_name", "email_address"])

.many(n)

Generate n records.

Parameters:

  • n (int): Number of records to generate

Returns: BaseGenerator instance

records = gen.many(100)

.export(name, formats=None)

Export records to file(s).

Parameters:

  • name (str): Output filename (without extension)
  • formats (list, optional): File formats to export. Defaults to ["csv", "json"]

Returns: None (files saved to output/ folder)

gen.many(100).export("users", formats=["csv", "json", "sql"])
# Creates: output/users.csv, output/users.json, output/users.sql
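
Because export() returns None, the generated records are read back from disk when a test or script needs them. The following is a minimal sketch using only the standard library. load_records is a hypothetical helper, and it assumes that the CSV export starts with a header row and that the JSON export is an array of objects (the latter is what the format table states). The demo writes two small stand-in files so it runs without the library installed.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_records(path):
    """Load a CSV or JSON export into a list of dicts (stdlib only)."""
    path = Path(path)
    if path.suffix == ".json":
        with path.open(encoding="utf-8") as fh:
            return json.load(fh)
    if path.suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as fh:
            return list(csv.DictReader(fh))
    raise ValueError(f"unsupported extension: {path.suffix}")


# Self-contained demo: write two tiny stand-in files, then read them back.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "users.csv"
    json_path = Path(tmp) / "users.json"
    csv_path.write_text("first_name,email_address\nAda,ada@example.com\n", encoding="utf-8")
    json_path.write_text('[{"first_name": "Ada", "email_address": "ada@example.com"}]', encoding="utf-8")
    csv_rows = load_records(csv_path)
    json_rows = load_records(json_path)

print(csv_rows == json_rows)
```

Note that csv.DictReader returns every value as a string, so numeric columns need converting after loading.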

KEY_LABEL_REGISTRY

Global dictionary mapping all 700+ field names to their categories.

from ikidatagen import KEY_LABEL_REGISTRY

print(KEY_LABEL_REGISTRY["email_address"])  # → "personal"
print(KEY_LABEL_REGISTRY["credit_card_number"])  # → "commerce"
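
A name-to-category mapping like this can be inverted to browse the available fields by category. The sketch below uses a small stand-in dictionary so that it runs on its own. fields_by_category is a hypothetical helper, and passing KEY_LABEL_REGISTRY to it assumes the registry behaves like a plain dict of field name to category, as described above.

```python
from collections import defaultdict


def fields_by_category(registry):
    """Invert a {field_name: category} mapping into {category: [field_names]}."""
    grouped = defaultdict(list)
    for field, category in registry.items():
        grouped[category].append(field)
    return {category: sorted(fields) for category, fields in grouped.items()}


# Stand-in data for illustration only; the real registry has 700+ entries.
sample_registry = {
    "first_name": "personal",
    "last_name": "personal",
    "product_name": "commerce",
}
grouped = fields_by_category(sample_registry)
print(grouped["personal"])
```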

ProviderFactory

Advanced: dynamically load providers.

from ikidatagen import ProviderFactory

provider = ProviderFactory.create("email_address")
email = provider.generate()

Advanced Examples

Example 1: E-Commerce Dataset with Templates

from ikidatagen import IkiDataGenerator

schema = [
    {"key_label": "row_number", "label": "Order ID"},
    {"key_label": "current_timestamp", "label": "Created At"},
    {"key_label": "customer_name", "label": "Customer"},
    {"key_label": "email_address", "label": "Email"},
    {"key_label": "product_name", "label": "Product"},
    {"key_label": "product_price", "label": "Price"},
    {
        "key_label": "template",
        "label": "Description",
        "options": {
            "template": "Order for {{product_name}} by {{customer_name}} ({{email_address}})"
        }
    },
    {"key_label": "order_status", "label": "Status"},
    {
        "key_label": "ip_address_v4",
        "label": "IP",
        "options": {"blank_percentage": 20}  # 20% missing
    }
]

IkiDataGenerator(schema).many(1000).export("orders", formats=["csv", "json"])

Example 2: Healthcare Records

from ikidatagen import IkiDataGenerator

schema = [
    "row_number",
    "first_name",
    "last_name",
    "date_of_birth",
    "blood_type",
    "disease_name",
    "medication_name",
    "icd10_diagnosis",
    {
        "key_label": "current_timestamp",
        "label": "last_visit",
        "options": {"blank_percentage": 10}
    },
]

IkiDataGenerator(schema).many(500).export("patients", formats=["csv", "json", "sql"])

Example 3: Test Data with Blanks and Validation

from ikidatagen import IkiDataGenerator

schema = [
    {"key_label": "username", "options": {"blank_percentage": 0}},      # No blanks
    {"key_label": "email_address", "options": {"blank_percentage": 0}},  # No blanks
    {
        "key_label": "phone_number",
        "options": {"blank_percentage": 30}  # 30% missing phones
    },
    {
        "key_label": "address_line_1",
        "options": {"blank_percentage": 5}
    },
]

IkiDataGenerator(schema).many(10000).export("test_users", formats=["json"])  # export() returns None

Example 4: AI/ML Metrics Dataset

from ikidatagen import IkiDataGenerator

schema = [
    "row_number",
    "model_type",
    "model_framework",
    "model_task",
    "model_version",
    "model_latency",
    "model_confidence",
    "cpu_utilization",
    "gpu_utilization",
    "memory_footprint",
    "inference_result",
    "inference_endpoint",
    "current_timestamp",
]

IkiDataGenerator(schema).many(5000).export("ml_metrics", formats=["parquet", "json"])

Example 5: Custom Schema with Explicit Groups

from ikidatagen import IkiDataGenerator

schema = [
    {"key_label": "username", "group": "personal"},
    {"key_label": "email_address", "group": "personal"},
    {"key_label": "product_name", "group": "commerce"},
    {"key_label": "currency", "group": "commerce"},
    {
        "key_label": "regular_expression",
        "group": "advanced",
        "label": "Custom Pattern",
        "options": {"pattern": "[A-Z]{2}[0-9]{4}"}
    }
]

IkiDataGenerator(schema).many(100).export("mixed_data")

Configuration & Options

Blank Percentage

Control how many records have empty values for a field:

{
    "key_label": "phone_number",
    "options": {"blank_percentage": 25}  # 25% of records will have empty phone
}
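
To make the effect concrete, here is a minimal sketch of one way a blank percentage could be applied to a column of values. It is an illustration, not the library's implementation: apply_blanks is a hypothetical function, and it blanks each value independently, so the share of blanks is approximate for small datasets. The library may instead blank an exact number of records.

```python
import random


def apply_blanks(values, blank_percentage, rng=None):
    """Blank out roughly blank_percentage percent of values (illustrative sketch)."""
    if not 0 <= blank_percentage <= 100:
        raise ValueError("blank_percentage must be between 0 and 100")
    rng = rng or random.Random()
    # Each value is blanked independently, so the share of blanks is
    # approximate for small n and approaches the target for large n.
    return ["" if rng.random() * 100 < blank_percentage else v for v in values]


phones = [f"555-01{i:02d}" for i in range(100)]
blanked = apply_blanks(phones, 25, random.Random(42))
print(sum(v == "" for v in blanked), "of", len(blanked), "values are blank")
```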

Templates

Combine fields with {{field_name}} syntax:

{
    "key_label": "template",
    "options": {
        "template": "Full Name: {{first_name}} {{last_name}}, Email: {{email_address}}"
    }
}
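
The substitution itself can be sketched in a few lines with a regular expression. This is an assumption-laden illustration, not the library's template engine: render_template is a hypothetical function, and details such as tolerating spaces inside the braces or leaving unknown names untouched are choices made for this example.

```python
import re

# Matches {{name}}, allowing optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template, record):
    """Replace {{field}} placeholders with values from record; leave unknown names as-is."""
    def substitute(match):
        name = match.group(1)
        return str(record[name]) if name in record else match.group(0)

    return PLACEHOLDER.sub(substitute, template)


record = {"first_name": "Ada", "last_name": "Lovelace", "email_address": "ada@example.com"}
full = render_template(
    "Full Name: {{first_name}} {{last_name}}, Email: {{email_address}}", record
)
print(full)  # Full Name: Ada Lovelace, Email: ada@example.com
```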

Regular Expressions

Generate data matching a pattern:

{
    "key_label": "regular_expression",
    "options": {
        "pattern": "[A-Z]{3}-[0-9]{5}"  # e.g. ABC-12345
    }
}
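
Generating a string from a pattern is the reverse of matching one. The sketch below shows the idea for a deliberately small subset of regex syntax: character classes with ranges, literal characters, and fixed {n} counts. It is not the library's generator, the function names are hypothetical, and anything beyond that subset (escapes such as \d, alternation, + or *) is not handled.

```python
import random
import re

# One atom (a [...] class or a single literal character) with an optional {n} count.
TOKEN = re.compile(r"(\[[^\]]+\]|[^\[\]{}])(?:\{(\d+)\})?")


def expand_class(body):
    """Expand the inside of a character class such as 'A-Z' or '0-9a-f'."""
    chars = []
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            chars.extend(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:
            chars.append(body[i])
            i += 1
    return chars


def generate(pattern, rng=None):
    """Build one random string for a pattern made of classes, literals and {n} counts."""
    rng = rng or random.Random()
    out = []
    for atom, count in TOKEN.findall(pattern):
        choices = expand_class(atom[1:-1]) if atom.startswith("[") else [atom]
        out.extend(rng.choice(choices) for _ in range(int(count or 1)))
    return "".join(out)


pattern = "[A-Z]{3}-[0-9]{5}"
value = generate(pattern, random.Random(0))
print(value, bool(re.fullmatch(pattern, value)))
```

Checking generated values with re.fullmatch, as the last line does, is also a convenient assertion for tests that consume pattern-based fields.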

Custom List

Pick from a list of values:

{
    "key_label": "custom_list",
    "options": {
        "values": ["Active", "Inactive", "Pending"]
    }
}

Number Range

Generate numbers within a range:

{
    "key_label": "number",
    "options": {
        "min": 0,
        "max": 100
    }
}

📚 Examples & Demonstrations

The examples/ folder contains 45+ ready-to-run scripts demonstrating all features and providers across 22 categories.

Quick Start Examples

# Run the absolute simplest example
python examples/00_quick_start.py

# Explore basic fields
python examples/01_basic_fields.py

# Test all export formats
python examples/02_export_formats.py

Run Examples by Category

Personal & Identity (3 examples)

python examples/10_personal_data.py          # Names, gender, dates
python examples/11_contact_info.py           # Email, phone, social
python examples/12_identity_documents.py     # Passports, SSN, IDs

E-Commerce & Shopping (4 examples)

python examples/20_ecommerce_shop.py         # Products with pricing
python examples/21_shopping_cart.py          # Complete orders
python examples/22_inventory_management.py   # Stock & inventory
python examples/23_payment_processing.py     # Payments & invoices

Finance & Banking (5 examples)

python examples/30_bank_accounts.py          # Bank accounts
python examples/31_credit_cards.py           # Credit card data
python examples/32_transactions.py           # Transfers & withdrawals
python examples/33_investment_portfolio.py   # Stocks & investments
python examples/34_crypto_blockchain.py      # Cryptocurrency wallets

Healthcare & Medical (3 examples)

python examples/40_patient_records.py        # Patient demographics
python examples/41_medical_diagnosis.py      # Diagnoses & ICD codes
python examples/42_medications.py            # Prescriptions & dosages

Location & Geography (2 examples)

python examples/50_addresses.py              # Addresses & coordinates
python examples/51_international_locations.py # Countries & cities

Education (1 example)

python examples/60_student_records.py        # Students & enrollment

Automotive (1 example)

python examples/70_car_inventory.py          # Cars, models, pricing

Entertainment & Gaming (2 examples)

python examples/90_gaming_players.py         # Gaming characters & guilds
python examples/92_entertainment.py          # Movies, books, music

Tech & Programming (3 examples)

python examples/100_programming_data.py      # Languages & frameworks
python examples/110_ml_models.py             # ML model metadata
python examples/111_ml_metrics.py            # Model performance metrics

Advanced Features (3 examples)

python examples/200_templates.py             # Combining fields with {{field}}
python examples/201_regex_patterns.py        # Custom regex patterns
python examples/203_blank_percentages.py     # Missing data simulation

Real-World Scenarios (7 complete systems)

python examples/300_saas_users.py            # SaaS with subscriptions
python examples/301_social_network.py        # Social media platform
python examples/302_analytics_events.py      # Event tracking (5000 events)
python examples/303_ecommerce_platform.py    # Complete e-commerce
python examples/304_travel_booking_system.py # Flights, hotels, bookings
python examples/308_hospital_system.py       # Hospital management
python examples/309_school_system.py         # University system

Batch & Large Datasets (3 examples)

python examples/400_mixed_categories.py      # Multiple categories mixed
python examples/401_batch_processing.py      # Batch generate 4 datasets
python examples/403_large_dataset.py         # Generate 1M+ records

Specialized Use Cases (4 examples)

python examples/500_test_data_unit_tests.py  # Unit test fixtures
python examples/501_load_testing_data.py     # Load testing (100K events)
python examples/502_demo_data.py             # Demo/presentation data
python examples/504_api_response_mocking.py  # Mock API responses

Complete Feature Showcase

python examples/999_showcase_all_features.py # All 12 features in one!

Run All Examples at Once

To generate data from all 45+ examples in one command:

# Generate everything
for file in examples/[0-9]*.py; do
    echo "Running $file..."
    python "$file"
done

Or on Windows (PowerShell):

Get-ChildItem examples\[0-9]*.py | ForEach-Object {
    Write-Host "Running $($_.Name)..."
    python $_.FullName
}

View Generated Output

All examples save data to the output/ folder:

output/
├── quick_start.csv
├── quick_start.json
├── personal_data.csv
├── ecommerce_products.parquet
├── medical_diagnosis.json
├── ml_metrics.parquet
├── large_dataset.parquet
└── ... (40+ more files)

Learning Path

Beginner: Start with simple examples and work up

00_quick_start → 01_basic_fields → 02_export_formats → 10_personal_data → 20_ecommerce_shop

Intermediate: Explore categories and features

30_bank_accounts → 40_patient_records → 50_addresses → 200_templates → 201_regex_patterns

Advanced: Complex real-world systems and large datasets

300_saas_users → 303_ecommerce_platform → 308_hospital_system → 400_mixed_categories → 403_large_dataset

Examples Summary

Category | Examples | Records | Topics
Getting Started | 3 | 50-100 | Basics, fields, formats
Personal | 3 | 50-300 | Names, IDs, documents
Commerce | 4 | 300-2000 | Products, orders, inventory
Finance | 5 | 200-1000 | Banking, cards, stocks, crypto
Healthcare | 3 | 300-500 | Patients, diagnoses, meds
Location | 2 | 500 | Addresses, coordinates, countries
Education | 1 | 400 | Students, courses, degrees
Automotive | 1 | 600 | Cars, models, registration
Entertainment | 2 | 500-2000 | Gaming, movies, books, music
Tech | 3 | 200-5000 | Languages, frameworks, ML
Advanced | 3 | 50-500 | Templates, regex, blanks
Real-World | 7 | 300-5000 | Complete systems
Batch/Large | 3 | 100K-1M | Performance, scale
Specialized | 4 | 50-100K | Testing, mocking, load tests
Total | 45+ | 50 to 1M+ | All features

Modify Examples for Your Needs

All examples are templates; feel free to copy and modify:

# Copy an example as a starting point
cp examples/20_ecommerce_shop.py my_custom_dataset.py

# Edit and run your custom version
python my_custom_dataset.py

Example Structure

Every example follows this simple pattern:

from ikidatagen import IkiDataGenerator

# 1. Define schema
schema = [
    "first_name",
    "last_name",
    "email_address",
    # ... more fields
]

# 2. Generate data
IkiDataGenerator(schema).many(100).export("my_data", formats=["csv", "json"])

# 3. Check output/ folder

Project Structure

Iki-Data-Generator/
├── examples/                    # 45+ example scripts
│   ├── README.md                # Examples guide
│   ├── 00_quick_start.py        # Simplest example
│   ├── 01_basic_fields.py
│   ├── 20_ecommerce_shop.py
│   ├── 300_saas_users.py
│   ├── 403_large_dataset.py
│   └── 999_showcase_all_features.py  # All features!
├── src/ikidatagen/              # Main package
│   ├── __init__.py              # Public API
│   ├── core.py                  # Main IkiDataGenerator class
│   ├── base_generator.py        # Data generation logic
│   ├── exporters.py             # Export to CSV, JSON, SQL, etc.
│   ├── provider_factory.py      # Dynamic provider loading
│   ├── schema_registry.py       # Maps field names to categories
│   ├── payload.py               # Data payload handling
│   ├── dataset_manager.py       # Dataset management
│   ├── external_datasets/       # External data files
│   │   ├── csv/                 # 30+ CSV files (countries, airlines, etc.)
│   │   └── json/                # 25+ JSON files (advanced data)
│   └── providers/               # Data providers (700+ fields)
│       ├── advanced/            # Template, Regex, Lambda, etc.
│       ├── ai/                  # ML/AI metrics
│       ├── basic/               # Names, dates, colors, etc.
│       ├── car/                 # Vehicle data
│       ├── commerce/            # Products, orders, payments
│       ├── communication/       # Email, phone, social
│       ├── construction/        # Building codes, materials
│       ├── crypto/              # Cryptocurrency data
│       ├── education/           # Schools, degrees, subjects
│       ├── finance/             # Credit cards, banking
│       ├── gaming/              # Characters, items, guilds
│       ├── health/              # Medical codes, symptoms
│       ├── it/                  # Programming, frameworks
│       ├── legal/               # Laws, contracts
│       ├── location/            # Countries, cities, addresses
│       ├── marketing/           # Campaigns, channels
│       ├── misc/                # Miscellaneous data
│       ├── music/               # Artists, albums, genres
│       ├── nature/              # Plants, animals, weather
│       ├── personal/            # Names, gender, documents
│       ├── political/           # Parties, elections
│       ├── products/            # Product categories
│       ├── sports/              # Athletes, teams, leagues
│       └── travel/              # Airlines, hotels, destinations
├── output/                      # Generated data (CSV, JSON, etc.)
├── main.py                      # Example usage
├── pyproject.toml               # Package metadata
├── requirements.txt             # Dependencies
└── README.md                    # This file

How It Works (Behind the Scenes)

  1. Schema Parsing: You provide a list of fields (strings or dicts)
  2. Provider Resolution: Each field name is looked up in KEY_LABEL_REGISTRY to find its category
  3. Dynamic Loading: The appropriate provider class is loaded from providers/{category}/{field}.py
  4. Generation: Each provider generates realistic data for n records
  5. Template Processing: Template fields combine other fields using {{field}} syntax
  6. Blank Handling: Records marked for blanks are cleared based on blank_percentage
  7. Export: Data is serialized to your chosen format(s) and saved to output/
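
The steps above can be sketched as a toy pipeline that runs without the library. Everything here is a stand-in chosen for illustration: REGISTRY, PROVIDERS, generate, and to_csv are hypothetical names, the providers are two lambdas rather than dynamically loaded classes, and the template and blank-handling steps (5 and 6) are left out to keep the example short.

```python
import csv
import io
import random

rng = random.Random(7)

# Hypothetical stand-ins: a registry mapping field names to categories,
# and one tiny provider function per field.
REGISTRY = {"row_number": "basic", "first_name": "personal"}
PROVIDERS = {
    "row_number": lambda i: i + 1,
    "first_name": lambda i: rng.choice(["Ada", "Grace", "Alan"]),
}


def generate(schema, n):
    """Resolve every field against the registry, then build n records."""
    for name in schema:  # steps 1-2: parse the schema and resolve each field
        if name not in REGISTRY:
            raise KeyError(f"Unknown key_label {name!r}")
    # steps 3-4: look up each provider and generate one value per record
    return [{name: PROVIDERS[name](i) for name in schema} for i in range(n)]


def to_csv(rows):
    """Step 7: serialize the records, here to CSV text held in memory."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = generate(["row_number", "first_name"], 3)
csv_text = to_csv(rows)
print(csv_text)
```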

Common Issues & Solutions

❌ "Unknown key_label 'xxx'"

Problem: You used a field name that doesn't exist.

Solution: Check KEY_LABEL_REGISTRY or review the 22 categories above. Did you spell it correctly? (Use underscores, lowercase.)

# ❌ Wrong
schema = ["firstName"]  # camelCase? No!

# ✅ Correct
schema = ["first_name"]  # snake_case? Yes!
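
When a name is rejected, a quick way to find the intended field is a fuzzy lookup against the known names. The sketch below uses the standard library's difflib for that. to_snake_case and suggest are hypothetical helpers, and the short list of names is a stand-in; with the library installed, the keys of KEY_LABEL_REGISTRY could be passed instead, assuming it behaves like a dict.

```python
import difflib
import re


def to_snake_case(name):
    """Convert camelCase or PascalCase to snake_case."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def suggest(name, known_names, limit=3):
    """Return the known names closest to a (possibly misspelled) name, best first."""
    return difflib.get_close_matches(to_snake_case(name), known_names, n=limit, cutoff=0.6)


# Stand-in list for illustration only.
known = ["first_name", "last_name", "email_address"]
print(suggest("firstName", known))
print(suggest("frist_name", known))
```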

❌ "No data to export"

Problem: .many(0) or empty schema.

Solution: Generate at least 1 record.

# ❌ Wrong
IkiDataGenerator(schema).many(0).export("data")

# ✅ Correct
IkiDataGenerator(schema).many(100).export("data")

❌ Export folder not found

Problem: output/ directory doesn't exist.

Solution: The library creates it automatically. Make sure you have write permissions.

❌ Template field not rendering

Problem: {{field_name}} not being replaced.

Solution: Ensure the referenced field exists in your schema and the spelling matches exactly.

# ❌ Wrong
{
    "key_label": "template",
    "options": {"template": "Name: {{first_name}} {{FirstName}}"}  # FirstName ≠ first_name
}

# ✅ Correct
{
    "key_label": "template",
    "options": {"template": "Name: {{first_name}} {{last_name}}"}
}
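
A small check run before generation can catch this kind of mismatch. The sketch below lists placeholders that have no matching field. missing_template_fields is a hypothetical helper, and it assumes placeholders refer to fields by key_label, which is how the examples in this README use them.

```python
import re


def missing_template_fields(template, schema_fields):
    """Return placeholder names used in template that are not in schema_fields."""
    used = re.findall(r"\{\{\s*(\w+)\s*\}\}", template)
    known = set(schema_fields)
    return [name for name in used if name not in known]


fields = ["first_name", "last_name"]
print(missing_template_fields("Name: {{first_name}} {{FirstName}}", fields))  # ['FirstName']
print(missing_template_fields("Name: {{first_name}} {{last_name}}", fields))  # []
```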

Performance Tips

Generating Large Datasets

  • Use Parquet or DuckDB formats for large datasets (smaller file sizes, faster I/O)
  • DuckDB exports can be queried right away: open the file with duckdb.connect("output/data.duckdb") and run SQL against the exported table
  • For 1M+ records, generate in batches to manage memory

# ✅ Generate in chunks
for i in range(10):
    IkiDataGenerator(schema).many(100_000).export(f"chunk_{i}")
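
The loop above works when the total divides evenly into chunks. For arbitrary totals, the chunk sizes can be computed first. This is a minimal sketch of that arithmetic; chunk_sizes is a hypothetical helper and uses nothing from the library.

```python
def chunk_sizes(total, chunk_size):
    """Split total into chunk sizes that sum to total; the last chunk may be smaller."""
    if total < 0 or chunk_size <= 0:
        raise ValueError("total must be >= 0 and chunk_size must be > 0")
    full, remainder = divmod(total, chunk_size)
    return [chunk_size] * full + ([remainder] if remainder else [])


sizes = chunk_sizes(250_000, 100_000)
for i, size in enumerate(sizes):
    # With the library installed, each iteration could call
    # IkiDataGenerator(schema).many(size).export(f"chunk_{i}") as in the loop above.
    print(f"chunk_{i}: {size} records")
```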

Field Selection

  • Only include fields you need (reduces generation time)
  • Skip fields with expensive generation (e.g., password_hash)

Export Efficiency

# ✅ Smart exports
IkiDataGenerator(schema).many(1_000_000).export("big_data", formats=["parquet"])

# ❌ Avoid exporting to many formats at once
# IkiDataGenerator(schema).many(1_000_000).export("data", formats=["csv", "json", "sql", "excel"])

Contributing

Have ideas? Want to add new providers or categories? Open a PR!

  • New Provider: Add a file to src/ikidatagen/providers/{category}/{field_name}.py
  • New Category: Create a folder in providers/ and add your providers
  • Update Registry: Edit schema_registry.py to register new fields
  • Tests: Add tests for new providers

License

MIT License: use it freely in personal and commercial projects.


Links & Resources


FAQ

Q: Can I use this data for production?

A: This is synthetic data, well suited to development, testing, and demos. For production, consider anonymizing real data or using this as a base.

Q: Can I extend it with custom fields?

A: Yes! Create a custom provider class in providers/{your_category}/ and register it in KEY_LABEL_REGISTRY.

Q: What's the difference between blank_percentage and nullable?

A: The library uses blank_percentage (0–100) rather than a nullable flag. It sets the percentage of records in which the field is left empty.

Q: How do I query generated data?

A: Export to DuckDB, then query with SQL:

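import duckdb
con = duckdb.connect("output/users.duckdb")  # open the exported database file
# Table and column names are assumptions; run con.sql("SHOW TABLES") to check them
results = con.sql("SELECT * FROM users WHERE age > 25").fetchall()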

Q: Can I regenerate the exact same data?

A: Not yet. Each run generates different data. (Seed support is planned for future releases.)

Q: What if I need a field that doesn't exist?

A: Use the lambda provider for custom logic:

import random  # needed by the lambda below

{
    "key_label": "lambda",
    "options": {
        "function": lambda: f"CUSTOM_{random.randint(1000, 9999)}"
    }
}

Roadmap

  • 🔄 Seed support for reproducible datasets
  • 🔗 Foreign key support for relational data
  • 📊 Better performance for 100M+ records
  • 🤖 AI-powered schema suggestions
  • 🎨 GUI for schema builder
  • 📈 Dataset profiling and statistics

Thanks

Built with ❤️ for developers who hate dummy data.

Happy generating! 🎲


Last updated: June 2026
