Generate mock data for Milvus collections based on schema definitions

Milvus Fake Data Generator

A Python tool for generating realistic mock data for Milvus vector databases. It builds test data from schema definitions, supports all Milvus field types, and ships with built-in schemas for common use cases.

✨ Key Features

🎯 Ready-to-use schemas - Pre-built schemas for e-commerce, documents, images, users, news, and videos
📚 Schema management - Add, organize, and reuse custom schemas with metadata
🚀 Flexible generation - Support for JSON/YAML schema files with comprehensive field types
🔧 Complete Milvus support - All field types including vectors, arrays, JSON, and primitive types
✅ Smart validation - Pydantic-based validation with detailed error messages and suggestions
📊 Multiple formats - Output as Parquet, CSV, JSON, or NumPy arrays
🌱 Reproducible results - Seed support for consistent data generation
🎨 Rich customization - Field constraints, nullable fields, auto-generated IDs
🔍 Schema exploration - Validation, help commands, and schema details
🏠 Unified interface - Use custom and built-in schemas interchangeably

Installation

# Install from PyPI
pip install milvus-fake-data

# Or install from source
git clone https://github.com/your-org/milvus-fake-data.git
cd milvus-fake-data
pdm install

🚀 Quick Start

1. Use Built-in Schemas (Recommended)

Get started instantly with pre-built schemas for common use cases:

# List all available built-in schemas
milvus-fake-data schema list

# Generate data using a built-in schema
milvus-fake-data generate --builtin simple --rows 1000 --preview

# Generate e-commerce product data to output directory
milvus-fake-data generate --builtin ecommerce --rows 5000 --out products/

Available Built-in Schemas:

Schema	Description	Use Cases
`simple`	Basic example with common field types	Learning, testing
`ecommerce`	Product catalog with search embeddings	Online stores, recommendations
`documents`	Document search with semantic embeddings	Knowledge bases, document search
`images`	Image gallery with visual similarity	Media platforms, image search
`users`	User profiles with behavioral embeddings	User analytics, personalization
`videos`	Video library with multimodal embeddings	Video platforms, content discovery
`news`	News articles with sentiment analysis	News aggregation, content analysis

2. Create Custom Schemas

Define your own collection structure with JSON or YAML:

{
  "collection_name": "my_collection",
  "fields": [
    {
      "name": "id",
      "type": "Int64",
      "is_primary": true
    },
    {
      "name": "title",
      "type": "VarChar",
      "max_length": 256
    },
    {
      "name": "embedding",
      "type": "FloatVector",
      "dim": 128
    }
  ]
}
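
For scripted setups, the same schema can be written from Python. The sketch below uses only the standard library. The helper name `write_schema` is hypothetical, and the primary-key check is an illustrative sanity check, not the tool's own validation.

```python
import json
from pathlib import Path

# Same structure as the JSON schema above
schema = {
    "collection_name": "my_collection",
    "fields": [
        {"name": "id", "type": "Int64", "is_primary": True},
        {"name": "title", "type": "VarChar", "max_length": 256},
        {"name": "embedding", "type": "FloatVector", "dim": 128},
    ],
}


def write_schema(schema: dict, path: str) -> Path:
    """Write a schema dict to a JSON file after a basic primary-key check."""
    primaries = [f["name"] for f in schema["fields"] if f.get("is_primary")]
    if len(primaries) != 1:
        raise ValueError(f"expected exactly one primary key, found {primaries}")
    target = Path(path)
    target.write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return target
```

Calling `write_schema(schema, "my_schema.json")` produces a file that can be passed to `--schema`.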

# Generate mock data from custom schema
milvus-fake-data generate --schema my_schema.json --rows 1000 --format csv --preview

Note: Output is always a directory containing data files (in the specified format) and a meta.json file with collection metadata.
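
A minimal sketch for inspecting such an output directory from Python. It assumes only what the note states: a meta.json file with data files alongside it. The helper name `inspect_output` and the choice to treat every other file as a data file are assumptions, and because the keys inside meta.json are not documented here, the sketch simply returns the parsed JSON.

```python
import json
from pathlib import Path


def inspect_output(out_dir: str) -> tuple[dict, list[Path]]:
    """Return the parsed meta.json and the sorted list of other files."""
    root = Path(out_dir)
    meta = json.loads((root / "meta.json").read_text(encoding="utf-8"))
    # Assumption: everything except meta.json is a data file
    data_files = sorted(
        p for p in root.iterdir() if p.is_file() and p.name != "meta.json"
    )
    return meta, data_files
```

After the e-commerce example above, `inspect_output("products/")` would return the collection metadata and the paths of the generated files.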

3. Schema Management

Store and organize your schemas for reuse:

# Add a custom schema to your library
milvus-fake-data schema add my_products product_schema.json

# List all schemas (built-in + custom)
milvus-fake-data schema list

# Use your custom schema like a built-in one
milvus-fake-data generate --builtin my_products --rows 1000

# Show detailed schema information
milvus-fake-data schema show my_products

4. Python API

from milvus_fake_data.generator import generate_mock_data
from milvus_fake_data.schema_manager import get_schema_manager
from milvus_fake_data.builtin_schemas import load_builtin_schema
from tempfile import NamedTemporaryFile
import json

# Use the schema manager to work with schemas
manager = get_schema_manager()

# List all available schemas
all_schemas = manager.list_all_schemas()
print("Available schemas:", list(all_schemas.keys()))

# Load any schema (built-in or custom)
schema = manager.load_schema("ecommerce")  # Built-in
# schema = manager.load_schema("my_products")  # Custom

# Generate data from schema
with NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
    json.dump(schema, f)

# Leaving the with block closes and flushes the file, so the full JSON is on disk
df = generate_mock_data(f.name, rows=1000, seed=42)

print(df.head())

# Add a custom schema programmatically
custom_schema = {
    "collection_name": "my_collection",
    "fields": [
        {"name": "id", "type": "Int64", "is_primary": True},
        {"name": "text", "type": "VarChar", "max_length": 100},
        {"name": "vector", "type": "FloatVector", "dim": 256}
    ]
}

manager.add_schema("my_custom", custom_schema, "Custom schema", ["testing"])
print("Added custom schema!")

📋 Schema Reference

Supported Field Types

Type	Description	Required Parameters	Optional Parameters
Numeric Types
`Int8`, `Int16`, `Int32`, `Int64`	Integer types	-	`min`, `max`
`Float`, `Double`	Floating point	-	`min`, `max`
`Bool`	Boolean values	-	-
Text Types
`VarChar`, `String`	Variable length string	`max_length`	-
`JSON`	JSON objects	-	-
Vector Types
`FloatVector`	32-bit float vectors	`dim`	-
`BinaryVector`	Binary vectors	`dim`	-
`Float16Vector`	16-bit float vectors	`dim`	-
`BFloat16Vector`	Brain float vectors	`dim`	-
`Int8Vector`	8-bit integer vectors	`dim`	-
`SparseFloatVector`	Sparse float vectors	`dim`	-
Complex Types
`Array`	Array of elements	`element_type`, `max_capacity`	`max_length` (for string elements)
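
To gauge how large a generated dataset may get, the dense vector types above can be translated into raw bytes per vector. The sketch below uses the standard element widths (32 bits for FloatVector, 16 bits for Float16Vector and BFloat16Vector, 8 bits for Int8Vector, 1 bit per dimension for BinaryVector). It ignores file-format overhead and compression, and it leaves out SparseFloatVector because sparse storage depends on the number of non-zero entries. The function names are illustrative and not part of the package.

```python
# Raw payload size per dimension, in bits, for dense vector types
BITS_PER_DIM = {
    "FloatVector": 32,
    "Float16Vector": 16,
    "BFloat16Vector": 16,
    "Int8Vector": 8,
    "BinaryVector": 1,
}


def vector_bytes(field_type: str, dim: int) -> int:
    """Raw size in bytes of one vector, rounding bits up to whole bytes."""
    if field_type not in BITS_PER_DIM:
        raise ValueError(f"no fixed size for {field_type}")
    return (BITS_PER_DIM[field_type] * dim + 7) // 8


def estimate_vector_payload(field_type: str, dim: int, rows: int) -> int:
    """Raw size in bytes of one vector column with the given row count."""
    return vector_bytes(field_type, dim) * rows


print(vector_bytes("FloatVector", 768))                      # 3072
print(vector_bytes("BinaryVector", 128))                     # 16
print(estimate_vector_payload("FloatVector", 768, 100_000))  # 307200000
```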

Field Properties

Property	Description	Applicable Types
`is_primary`	Mark field as primary key (exactly one required)	All types
`auto_id`	Auto-generate ID values	Int64 primary keys only
`nullable`	Allow null values (10% probability)	All types
`min`, `max`	Value constraints	Numeric types
`max_length`	String/element length limit	String and Array types
`dim`	Vector dimension (1-32768)	Vector types
`element_type`	Array element type	Array type
`max_capacity`	Array capacity (1-4096)	Array type
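
The rules in the two tables above can be condensed into a small checker. This is an illustrative sketch in plain Python, not the package's Pydantic models. The function name and the error wording are assumptions, and only constraints listed in the tables are checked.

```python
VECTOR_TYPES = {
    "FloatVector", "BinaryVector", "Float16Vector",
    "BFloat16Vector", "Int8Vector", "SparseFloatVector",
}


def check_schema(schema: dict) -> list[str]:
    """Return a list of problems; an empty list means none were found."""
    problems = []
    fields = schema.get("fields", [])
    primaries = [f for f in fields if f.get("is_primary")]
    if len(primaries) != 1:
        problems.append(f"expected exactly one primary key, found {len(primaries)}")
    for f in fields:
        name, ftype = f.get("name"), f.get("type")
        if ftype in VECTOR_TYPES and not 1 <= f.get("dim", 0) <= 32768:
            problems.append(f"{name}: dim must be between 1 and 32768")
        if ftype in {"VarChar", "String"} and "max_length" not in f:
            problems.append(f"{name}: max_length is required")
        if ftype == "Array":
            if "element_type" not in f:
                problems.append(f"{name}: element_type is required")
            if not 1 <= f.get("max_capacity", 0) <= 4096:
                problems.append(f"{name}: max_capacity must be between 1 and 4096")
        if f.get("auto_id") and not (f.get("is_primary") and ftype == "Int64"):
            problems.append(f"{name}: auto_id applies to Int64 primary keys only")
    return problems


demo = {
    "collection_name": "demo",
    "fields": [
        {"name": "id", "type": "Int64", "is_primary": True, "auto_id": True},
        {"name": "title", "type": "VarChar"},                    # missing max_length
        {"name": "embedding", "type": "FloatVector", "dim": 0},  # dim out of range
    ],
}
for problem in check_schema(demo):
    print(problem)
```

For real schemas, `milvus-fake-data generate --schema my_schema.json --validate-only` remains the authoritative check.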

Complete Example

collection_name: "advanced_catalog"
fields:
  # Primary key with auto-generated IDs
  - name: "id"
    type: "Int64"
    is_primary: true
    auto_id: true
  
  # Text fields with constraints
  - name: "title"
    type: "VarChar"
    max_length: 200
  
  - name: "description"
    type: "VarChar"
    max_length: 1000
    nullable: true
  
  # Numeric fields with ranges
  - name: "price"
    type: "Float"
    min: 0.01
    max: 9999.99
  
  - name: "rating"
    type: "Int8"
    min: 1
    max: 5
  
  # Vector for semantic search
  - name: "embedding"
    type: "FloatVector"
    dim: 768
  
  # Array of tags
  - name: "tags"
    type: "Array"
    element_type: "VarChar"
    max_capacity: 10
    max_length: 50
  
  # Structured metadata
  - name: "metadata"
    type: "JSON"
    nullable: true
  
  # Boolean flags
  - name: "in_stock"
    type: "Bool"
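
To make the constraints concrete, here is a minimal sketch of rows that satisfy most of these fields (the JSON `metadata` field is left out for brevity). It uses only Python's `random` module with a seeded generator and is not the package's generator. The helper names are hypothetical; the 10% null rate comes from the `nullable` row in the Field Properties table.

```python
import random
import string


def random_text(rng: random.Random, max_length: int) -> str:
    """Random lowercase string with a length between 1 and max_length."""
    length = rng.randint(1, max_length)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def make_row(rng: random.Random, row_id: int) -> dict:
    """Build one row that respects the constraints in the YAML schema above."""
    return {
        "id": row_id,  # stands in for auto_id; sequential in this sketch
        "title": random_text(rng, 200),
        # nullable: None with 10% probability
        "description": None if rng.random() < 0.1 else random_text(rng, 1000),
        "price": rng.uniform(0.01, 9999.99),
        "rating": rng.randint(1, 5),
        "embedding": [rng.random() for _ in range(768)],
        # Array: up to max_capacity elements, each up to max_length characters
        "tags": [random_text(rng, 50) for _ in range(rng.randint(0, 10))],
        "in_stock": rng.random() < 0.5,
    }


def make_rows(n: int, seed: int) -> list[dict]:
    """Generate n rows from a generator seeded with the given value."""
    rng = random.Random(seed)
    return [make_row(rng, i) for i in range(n)]


rows = make_rows(100, seed=42)
assert rows == make_rows(100, seed=42)  # same seed, same data
print(len(rows), len(rows[0]["embedding"]))
```

Seeding a dedicated `random.Random` instance, rather than the module-level generator, keeps the output reproducible even when other code draws random numbers.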

📚 CLI Reference

Command Structure

The CLI is organized into three command groups:

# Main command groups
milvus-fake-data generate [options]  # Data generation
milvus-fake-data schema [command]    # Schema management
milvus-fake-data clean [options]     # Utility commands

Data Generation Commands

Option	Description	Example
`--schema PATH`	Generate from custom schema file	`milvus-fake-data generate --schema my_schema.json`
`--builtin SCHEMA_ID`	Use built-in or managed schema	`milvus-fake-data generate --builtin ecommerce`
`--rows INTEGER`	Number of rows to generate	`milvus-fake-data generate --rows 5000`
`--format FORMAT`	Output format (parquet, csv, json, npy)	`milvus-fake-data generate --format csv`
`--out DIRECTORY`	Output directory path	`milvus-fake-data generate --out my_data/`
`--preview`	Show first 5 rows	`milvus-fake-data generate --preview`
`--seed INTEGER`	Random seed for reproducibility	`milvus-fake-data generate --seed 42`
`--validate-only`	Validate schema without generating	`milvus-fake-data generate --validate-only`
`--no-progress`	Disable progress bar display	`milvus-fake-data generate --no-progress`
`--batch-size INTEGER`	Batch size for memory efficiency	`milvus-fake-data generate --batch-size 5000`
`--yes`	Auto-confirm prompts	`milvus-fake-data generate --yes`
`--chunk-size INTEGER`	Chunk size in MB for segments	`milvus-fake-data generate --chunk-size 256`
`--force`	Force overwrite output directory	`milvus-fake-data generate --force`
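
When generation is driven from a script, the options in the table can be assembled into an argument list. The sketch below is one way to do that with the standard library. The helper name is hypothetical, only flags listed in the table are used, and the requirement to pass exactly one of `builtin` or `schema` is an assumption of this sketch rather than documented CLI behavior. It builds the command without running it.

```python
import shlex


def build_generate_command(*, builtin=None, schema=None, rows=None, fmt=None,
                           out=None, seed=None, preview=False, force=False):
    """Build an argument list for `milvus-fake-data generate`."""
    # Assumption: a run names exactly one schema source
    if (builtin is None) == (schema is None):
        raise ValueError("pass exactly one of builtin or schema")
    cmd = ["milvus-fake-data", "generate"]
    cmd += ["--builtin", builtin] if builtin else ["--schema", schema]
    for flag, value in (("--rows", rows), ("--format", fmt),
                        ("--out", out), ("--seed", seed)):
        if value is not None:
            cmd += [flag, str(value)]
    if preview:
        cmd.append("--preview")
    if force:
        cmd.append("--force")
    return cmd


cmd = build_generate_command(builtin="ecommerce", rows=5000, fmt="csv",
                             out="products/", seed=42)
print(shlex.join(cmd))
# milvus-fake-data generate --builtin ecommerce --rows 5000 --format csv --out products/ --seed 42
```

To execute the command, pass the list to `subprocess.run(cmd, check=True)`, assuming `milvus-fake-data` is on the PATH.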

Schema Management Commands

Command	Description	Example
`schema list`	List all schemas (built-in + custom)	`milvus-fake-data schema list`
`schema show SCHEMA_ID`	Show schema details	`milvus-fake-data schema show ecommerce`
`schema add SCHEMA_ID FILE`	Add custom schema	`milvus-fake-data schema add products schema.json`
`schema remove SCHEMA_ID`	Remove custom schema	`milvus-fake-data schema remove products`
`schema help`	Show schema format help	`milvus-fake-data schema help`

Utility Commands

Command	Description	Example
`clean`	Clean up generated output files	`milvus-fake-data clean --yes`
`--help`	Show help message	`milvus-fake-data --help`

Common Usage Patterns

# Quick start with built-in schema
milvus-fake-data generate --builtin simple --rows 1000 --preview

# Generate large dataset with custom format
milvus-fake-data generate --builtin ecommerce --rows 100000 --format csv --out products/

# Test custom schema
milvus-fake-data generate --schema my_schema.json --validate-only

# Reproducible data generation
milvus-fake-data generate --builtin users --rows 5000 --seed 42 --out users/

# Schema management workflow
milvus-fake-data schema list
milvus-fake-data schema show ecommerce
milvus-fake-data schema add my_ecommerce ecommerce_base.json

# Clean up generated output files
milvus-fake-data clean --yes

🛠️ Development

This project uses PDM for dependency management, ruff for formatting and linting, mypy for type checking, and pytest for testing.

Setup Development Environment

# Clone and setup
git clone https://github.com/your-org/milvus-fake-data.git
cd milvus-fake-data
pdm install  # Install development dependencies

Development Workflow

# Code formatting and linting
pdm run ruff format src tests    # Format code
pdm run ruff check src tests     # Check linting
pdm run mypy src                 # Type checking

# Testing
pdm run pytest                           # Run all tests
pdm run pytest --cov=src --cov-report=html  # With coverage
pdm run pytest tests/test_generator.py   # Specific test file

# Combined quality checks
make lint test                   # Run linting and tests together

Project Structure

src/milvus_fake_data/
├── cli.py              # Command-line interface
├── generator.py        # Core data generation logic
├── models.py           # Pydantic validation models
├── schema_manager.py   # Schema management system
├── builtin_schemas.py  # Built-in schema definitions
├── rich_display.py     # Terminal formatting
├── logging_config.py   # Structured logging
└── schemas/            # Built-in schema files
    ├── simple.json
    ├── ecommerce.json
    └── ...

🤝 Contributing

We welcome contributions! Please follow these steps:

1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Make your changes with tests
4. Ensure quality checks pass: `make lint test`
5. Commit changes: `git commit -m 'Add amazing feature'`
6. Push to branch: `git push origin feature/amazing-feature`
7. Open a Pull Request

Contribution Guidelines

Add tests for new functionality
Update documentation for API changes
Follow existing code style (ruff + mypy)
Include helpful error messages for user-facing features

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Built for the Milvus vector database
Uses Faker for realistic data generation
Powered by Pandas and NumPy

Release history

0.2.0 - Jun 30, 2025
0.1.2 - Jun 26, 2025 (this version)
0.1.1 - Jun 26, 2025
0.1.0 - Jun 26, 2025
