A flexible test data generation toolkit

TestDataX

TestDataX is a command-line application for generating customizable test data in a range of formats. It supports two data providers (Mimesis and Faker) for realistic values, flexible schema definitions, and output to several file types and SQL dialects. For each data set you can control the row count, field types, and constraints.

Requirements

  • Python 3.11+

Quick Start

# Install from PyPI
pip install testdatax

# Generate sample data
testdatax --rows 1000 --format json --output data.json

Features

  • Generate realistic test data using multiple data providers (Mimesis, Faker)
  • Support for multiple output formats (CSV, JSON, SQL, etc.)
  • Customizable schema definitions
  • Configurable data generation parameters
  • CLI tool for easy test data generation

Supported Formats

  • JSON
  • CSV
  • ORC
  • Parquet
  • MySQL
  • MSSQL
  • Oracle

CLI Usage

testdatax -o <output_file> -f <format> -s <schema_file> -r <num_rows> -p <provider> [-d]

Options:

  • -o, --output: Output file path (table_name for sql exports)
  • -f, --format: Output format (csv, json, orc, parquet, mysql, mssql, oracle)
  • -r, --rows: Number of rows to generate (default: 10)
  • -s, --schema: Path to schema file
  • -p, --provider: Data provider (mimesis, faker) - default: mimesis
  • --seed: Seed for reproducible output (optional)
  • --null-rate: Default NULL probability (0-1) for nullable fields - default: 0.1
  • -d, --debug: Enable debug output

Reproducibility: passing --seed makes generation deterministic. The same schema, row count, provider and seed produce identical output on every run, which suits stable test fixtures.
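
The idea behind seeded generation can be sketched with Python's standard random module. This only illustrates determinism in general and is not TestDataX's actual implementation; sample_rows is a hypothetical helper.

```python
import random


def sample_rows(seed, rows=3):
    """Return `rows` pseudo-random (age, status) pairs from a seeded generator.

    Hypothetical helper: it mimics the effect of --seed, not TestDataX internals.
    """
    rng = random.Random(seed)  # private generator, global random state is untouched
    statuses = ["active", "inactive", "pending"]
    return [(rng.randint(18, 99), rng.choice(statuses)) for _ in range(rows)]


first = sample_rows(seed=42)
second = sample_rows(seed=42)
print(first == second)  # True: same seed, same sequence
```

Using a private random.Random instance rather than the module-level functions keeps the sequence independent of any other code that draws random numbers.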

Usage Examples

Generate 10 rows of CSV data:

testdatax -o users.csv -f csv -s schema.json -r 10

Generate 10 rows of CSV data using Faker provider:

testdatax -o users.csv -f csv -s schema.json -r 10 -p faker

Generate 1000 rows of Parquet data with debug output:

testdatax -o large_dataset.parquet -f parquet -s users_schema.json -r 1000 -d

Generate 1000 rows of Parquet data using Mimesis provider:

testdatax -o large_dataset.parquet -f parquet -s users_schema.json -r 1000 -p mimesis

Generate JSON data with default row count (10):

testdatax -o data.json -f json -s schema.json

Generate ORC file with specific schema:

testdatax -o analytics.orc -f orc -s analytics_schema.json -r 100

Generate MySQL output with 1000 rows, table_name as 'default':

testdatax -o default.sql -f mysql -r 1000

Generate MSSQL output with 1000 rows, table_name as 'mstest':

testdatax -o mstest.sql -f mssql -r 1000

Generate Oracle output with 1000 rows, table_name as 'oracle':

testdatax -o oracle.sql -f oracle -r 1000

The commands above combine the following options:

  • -o, --output: Specify the output file path and name
  • -f, --format: Output format (csv, json, orc, parquet, mysql, mssql, oracle)
  • -s, --schema: Path to your schema definition file
  • -r, --rows: Number of rows to generate (optional, defaults to 10)
  • -p, --provider: Data provider (mimesis, faker) - default: mimesis
  • -d, --debug: Enable debug logging (optional)
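
When driving the CLI from a test suite, it can help to assemble the argument list in code. The sketch below uses only the flags documented above; build_command is a hypothetical helper, and actually running the command assumes testdatax is installed and on the PATH.

```python
import shlex


def build_command(output, fmt, schema=None, rows=None, provider=None,
                  seed=None, debug=False):
    """Assemble a testdatax argument list from the documented options.

    Hypothetical helper: it only builds the list and does not run anything.
    """
    argv = ["testdatax", "-o", output, "-f", fmt]
    if schema is not None:
        argv += ["-s", schema]
    if rows is not None:
        argv += ["-r", str(rows)]
    if provider is not None:
        argv += ["-p", provider]
    if seed is not None:
        argv += ["--seed", str(seed)]
    if debug:
        argv.append("-d")
    return argv


argv = build_command("users.csv", "csv", schema="schema.json", rows=10,
                     provider="faker")
print(shlex.join(argv))
# testdatax -o users.csv -f csv -s schema.json -r 10 -p faker

# To execute it (requires testdatax to be installed):
#   import subprocess
#   subprocess.run(argv, check=True)
```

Passing a list to subprocess.run avoids shell quoting problems with file names that contain spaces.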

Schema Example

{
  "username": {
    "type": "string",
    "provider_field": "name"
  },
  "date_joined": {
    "type": "datetime"
  },
  "date": {
    "type": "date"
  },
  "age": {
    "type": "integer",
    "min": 18,
    "max": 99
  },
  "is_active": {
    "type": "boolean"
  },
  "float": {
    "type": "float"
  },
  "uuid": {
    "type": "uuid"
  },
  "status": {
    "type": "enum",
    "values": ["active", "inactive", "pending"]
  }
}

Schema Configuration

The schema file defines the structure and constraints of your generated data. Each field in the schema can have the following properties:

Basic Field Properties

  • type: (required) The data type of the field
  • nullable: (optional) Boolean to allow null values (default: false)
  • unique: (optional) Boolean to ensure unique values (default: false)

Type-Specific Properties

String Fields

{
  "username": {
    "type": "string",
    "min_length": 5,
    "max_length": 20,
    "provider_field": "user_name"
  },
  "description": {
    "type": "text",
    "min_length": 100,
    "max_length": 500
  }
}

Note: provider_field selects a provider-specific generator so that the field contains realistic data.

Numeric Fields

{
  "age": {
    "type": "integer",
    "min": 18,
    "max": 99
  },
  "score": {
    "type": "float",
    "min": 0.0,
    "max": 100.0,
    "precision": 2
  }
}
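
The numeric constraints above can be read as follows. This is a sketch of the semantics only: it assumes min and max are inclusive and that precision is the number of decimal places, neither of which is confirmed here. make_integer and make_float are hypothetical helpers.

```python
import random


def make_integer(rng, minimum, maximum):
    """Draw an integer with both bounds inclusive (assumed semantics)."""
    return rng.randint(minimum, maximum)


def make_float(rng, minimum, maximum, precision):
    """Draw a float in [minimum, maximum], rounded to `precision` places."""
    return round(rng.uniform(minimum, maximum), precision)


rng = random.Random(2024)
age = make_integer(rng, 18, 99)
score = make_float(rng, 0.0, 100.0, 2)
print(age, score)
```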

Date and Time Fields

{
  "created_at": {
    "type": "datetime",
    "start_date": "2020-01-01",
    "end_date": "2023-12-31"
  },
  "birth_date": {
    "type": "date",
    "format": "%Y-%m-%d"
  }
}

Note: start_date/end_date bound the generated range (inclusive). format applies a strftime pattern to date/datetime values in the CSV and JSON outputs only; the SQL, Parquet and ORC exporters keep native date types and ignore format.
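
As a rough illustration of the inclusive range and of strftime formatting, the sketch below picks a random day between two ISO dates with the standard library. random_date is a hypothetical helper, not part of TestDataX.

```python
import random
from datetime import date, timedelta


def random_date(rng, start_date, end_date):
    """Pick a date between two ISO dates, both endpoints included."""
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    span_days = (end - start).days
    # randint includes both ends, so offsets 0 and span_days are possible.
    return start + timedelta(days=rng.randint(0, span_days))


rng = random.Random(0)
picked = random_date(rng, "2020-01-01", "2023-12-31")
print(picked.strftime("%Y-%m-%d"))

# The same date rendered with a different strftime pattern:
print(date(2023, 12, 31).strftime("%d/%m/%Y"))  # 31/12/2023
```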

Enum Fields

{
  "status": {
    "type": "enum",
    "values": ["pending", "active", "suspended"],
    "weights": [0.2, 0.7, 0.1]
  }
}

Note: weights is optional and assigns probability weights to the entries in values.
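
Weighted selection of this kind can be reproduced with random.choices from the standard library. The sketch assumes each weight applies to the value at the same position; pick_statuses is a hypothetical helper and says nothing about how TestDataX samples internally.

```python
import random
from collections import Counter

VALUES = ["pending", "active", "suspended"]
WEIGHTS = [0.2, 0.7, 0.1]


def pick_statuses(rng, count):
    """Draw `count` enum values, weighted position by position."""
    return rng.choices(VALUES, weights=WEIGHTS, k=count)


rng = random.Random(123)
sample = pick_statuses(rng, 10_000)
counts = Counter(sample)
print(counts.most_common())
```

With a large sample the observed frequencies settle close to the configured weights, so "active" should clearly dominate.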

Using Data Providers

Both the Mimesis and Faker providers accept the same schema format. Set provider_field on a field to choose a provider-specific generator; it works with either provider:

{
  "name": {
    "type": "string",
    "provider_field": "name"
  },
  "email": {
    "type": "string",
    "provider_field": "email"
  },
  "address": {
    "type": "string",
    "provider_field": "address"
  },
  "company": {
    "type": "string",
    "provider_field": "company"
  }
}

Complete Example

{
  "user_id": {
    "type": "uuid",
    "unique": true
  },
  "username": {
    "type": "string",
    "provider_field": "user_name",
    "unique": true
  },
  "email": {
    "type": "string",
    "provider_field": "email",
    "unique": true
  },
  "age": {
    "type": "integer",
    "min": 18,
    "max": 99
  },
  "status": {
    "type": "enum",
    "values": ["active", "inactive"],
    "weights": [0.8, 0.2]
  },
  "created_at": {
    "type": "datetime",
    "start_date": "2020-01-01",
    "end_date": "2023-12-31"
  },
  "is_verified": {
    "type": "boolean",
    "nullable": true
  }
}

Data Providers

TestDataX supports two powerful data providers for generating realistic test data:

Mimesis (Default)

Mimesis is a high-performance Python library for generating synthetic data. It provides:

  • Fast data generation with excellent performance
  • Support for multiple locales and languages
  • Wide variety of data providers for different domains
  • Lightweight and efficient implementation

Faker

Faker is a popular Python library for generating fake data. It offers:

  • Extensive provider ecosystem with community contributions
  • Rich set of localized providers
  • Well-established and widely used in the Python community
  • Comprehensive documentation and examples

You can specify the provider using the -p or --provider option:

# Use Mimesis (default)
testdatax -o data.csv -f csv -p mimesis

# Use Faker
testdatax -o data.csv -f csv -p faker

Both providers support the same schema format and generate compatible data types.

Note: For backward compatibility, the legacy faker field name is still supported, but provider_field is recommended for new schemas.
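
If you have older schema files, a small script can rename the legacy key. The sketch below assumes the legacy key is spelled faker and sits inside each field definition next to type, which this README does not show explicitly. migrate_schema is a hypothetical helper.

```python
def migrate_schema(schema):
    """Return a copy of `schema` with legacy `faker` keys renamed.

    Assumption: the legacy key is literally "faker" inside a field definition.
    Fields that already define provider_field are left untouched.
    """
    migrated = {}
    for name, spec in schema.items():
        spec = dict(spec)  # shallow copy so the input is not mutated
        if "faker" in spec and "provider_field" not in spec:
            spec["provider_field"] = spec.pop("faker")
        migrated[name] = spec
    return migrated


old = {"email": {"type": "string", "faker": "email"}}
print(migrate_schema(old))
# {'email': {'type': 'string', 'provider_field': 'email'}}
```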

Supported Data Types

  • string
  • text
  • integer
  • bigint
  • float
  • decimal
  • boolean
  • date
  • datetime
  • blob
  • uuid
  • enum
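
A schema can be checked against this list before it is passed to the CLI. The following is a minimal sketch, not a validator shipped with TestDataX; validate_schema is a hypothetical helper that only checks the type property.

```python
SUPPORTED_TYPES = {
    "string", "text", "integer", "bigint", "float", "decimal",
    "boolean", "date", "datetime", "blob", "uuid", "enum",
}


def validate_schema(schema):
    """Return a list of problems found in the `type` of each field."""
    errors = []
    for name, spec in schema.items():
        field_type = spec.get("type")
        if field_type is None:
            errors.append(f"{name}: missing required 'type'")
        elif field_type not in SUPPORTED_TYPES:
            errors.append(f"{name}: unsupported type {field_type!r}")
    return errors


schema = {
    "age": {"type": "integer", "min": 18, "max": 99},
    "payload": {"type": "json"},
    "note": {"nullable": True},
}
for problem in validate_schema(schema):
    print(problem)
```

Returning a list rather than raising on the first problem lets you report every invalid field in one pass.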

Database Type Mappings

Generic Type   MySQL          MSSQL             Oracle
string         VARCHAR(255)   NVARCHAR(255)     VARCHAR2(255)
text           TEXT           NVARCHAR(MAX)     CLOB
integer        INT            INT               NUMBER(10)
bigint         BIGINT         BIGINT            NUMBER(19)
float          FLOAT          FLOAT             FLOAT
decimal        DECIMAL(18,2)  DECIMAL(18,2)     NUMBER(18,2)
boolean        TINYINT(1)     BIT               NUMBER(1)
date           DATE           DATE              DATE
datetime       DATETIME       DATETIME2         TIMESTAMP
blob           LONGBLOB       VARBINARY(MAX)    BLOB
uuid           VARCHAR(36)    UNIQUEIDENTIFIER  VARCHAR2(36)
enum           ENUM           NVARCHAR(255)     VARCHAR2(255)
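
The mapping table can be expressed as a lookup, which is handy if you want to predict the column types of an export. The type names are copied from the table; sql_type and column_definitions are hypothetical helpers, and the real exporters may emit different DDL.

```python
DIALECTS = ("mysql", "mssql", "oracle")

# generic type -> (MySQL, MSSQL, Oracle), copied from the mapping table
TYPE_MAP = {
    "string": ("VARCHAR(255)", "NVARCHAR(255)", "VARCHAR2(255)"),
    "text": ("TEXT", "NVARCHAR(MAX)", "CLOB"),
    "integer": ("INT", "INT", "NUMBER(10)"),
    "bigint": ("BIGINT", "BIGINT", "NUMBER(19)"),
    "float": ("FLOAT", "FLOAT", "FLOAT"),
    "decimal": ("DECIMAL(18,2)", "DECIMAL(18,2)", "NUMBER(18,2)"),
    "boolean": ("TINYINT(1)", "BIT", "NUMBER(1)"),
    "date": ("DATE", "DATE", "DATE"),
    "datetime": ("DATETIME", "DATETIME2", "TIMESTAMP"),
    "blob": ("LONGBLOB", "VARBINARY(MAX)", "BLOB"),
    "uuid": ("VARCHAR(36)", "UNIQUEIDENTIFIER", "VARCHAR2(36)"),
    "enum": ("ENUM", "NVARCHAR(255)", "VARCHAR2(255)"),
}


def sql_type(generic_type, dialect):
    """Look up the column type for a generic type in one dialect."""
    return TYPE_MAP[generic_type][DIALECTS.index(dialect)]


def column_definitions(schema, dialect):
    """Return 'name TYPE' strings for every field in the schema."""
    return [f"{name} {sql_type(spec['type'], dialect)}"
            for name, spec in schema.items()]


schema = {"user_id": {"type": "uuid"}, "age": {"type": "integer"}}
print(column_definitions(schema, "oracle"))
# ['user_id VARCHAR2(36)', 'age NUMBER(10)']
```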

License

This project is licensed under the MIT License - see the LICENSE file for details.

