Generate mock data for Milvus collections based on schema definitions

Milvus Fake Data Generator

A Python tool for generating realistic mock data for Milvus vector databases. It builds test data from schema definitions, supports all Milvus field types, and ships with built-in schemas for common use cases.

Key Features

  • Ready-to-use schemas - Pre-built schemas for e-commerce, documents, images, users, news, and videos
  • Schema management - Add, organize, and reuse custom schemas with metadata
  • Flexible generation - Support for JSON/YAML schema files with comprehensive field types
  • Complete Milvus support - All field types including vectors, arrays, JSON, and primitive types
  • Smart validation - Pydantic-based validation with detailed error messages and suggestions
  • Multiple formats - Output as Parquet, CSV, JSON, or NumPy arrays
  • Reproducible results - Seed support for consistent data generation
  • Rich customization - Field constraints, nullable fields, auto-generated IDs
  • Schema exploration - Validation, help commands, and schema details
  • Unified interface - Use custom and built-in schemas interchangeably

Installation

# Install from PyPI
pip install milvus-fake-data

# Or install from source
git clone https://github.com/your-org/milvus-fake-data.git
cd milvus-fake-data
pdm install

Quick Start

1. Use Built-in Schemas (Recommended)

Get started instantly with pre-built schemas for common use cases:

# List all available built-in schemas
milvus-fake-data schema list

# Generate data using a built-in schema
milvus-fake-data generate --builtin simple --rows 1000 --preview

# Generate e-commerce product data to output directory
milvus-fake-data generate --builtin ecommerce --rows 5000 --out products/

Available Built-in Schemas:

| Schema | Description | Use Cases |
| --- | --- | --- |
| simple | Basic example with common field types | Learning, testing |
| ecommerce | Product catalog with search embeddings | Online stores, recommendations |
| documents | Document search with semantic embeddings | Knowledge bases, document search |
| images | Image gallery with visual similarity | Media platforms, image search |
| users | User profiles with behavioral embeddings | User analytics, personalization |
| videos | Video library with multimodal embeddings | Video platforms, content discovery |
| news | News articles with sentiment analysis | News aggregation, content analysis |

2. Create Custom Schemas

Define your own collection structure with JSON or YAML:

{
  "collection_name": "my_collection",
  "fields": [
    {
      "name": "id",
      "type": "Int64",
      "is_primary": true
    },
    {
      "name": "title",
      "type": "VarChar",
      "max_length": 256
    },
    {
      "name": "embedding",
      "type": "FloatVector",
      "dim": 128
    }
  ]
}

# Generate mock data from custom schema
milvus-fake-data generate --schema my_schema.json --rows 1000 --format csv --preview

Note: Output is always a directory containing data files (in the specified format) and a meta.json file with collection metadata.
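
Because a run produces a directory rather than a single file, a small helper can be handy for checking what was written. The following is a minimal sketch using only the standard library. The function name `inspect_output` is hypothetical, and since the keys inside meta.json are not described here, the file is loaded as an opaque dictionary.

```python
import json
from pathlib import Path


def inspect_output(out_dir):
    """Return (metadata, data_file_names) for a generated output directory.

    Assumes only what the note above states: the directory holds data files
    plus a meta.json file. Nothing is assumed about the keys in meta.json.
    """
    out_path = Path(out_dir)
    meta_path = out_path / "meta.json"
    if not meta_path.is_file():
        raise FileNotFoundError(f"no meta.json found in {out_path}")

    with meta_path.open(encoding="utf-8") as fh:
        metadata = json.load(fh)

    # Every regular file other than meta.json is treated as a data file.
    data_files = sorted(
        p.name for p in out_path.iterdir()
        if p.is_file() and p.name != "meta.json"
    )
    return metadata, data_files
```

For example, `inspect_output("products/")` could be called after the `--out products/` run shown earlier to list the generated files.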

3. Schema Management

Store and organize your schemas for reuse:

# Add a custom schema to your library
milvus-fake-data schema add my_products product_schema.json

# List all schemas (built-in + custom)
milvus-fake-data schema list

# Use your custom schema like a built-in one
milvus-fake-data generate --builtin my_products --rows 1000

# Show detailed schema information
milvus-fake-data schema show my_products

4. Python API

from milvus_fake_data.generator import generate_mock_data
from milvus_fake_data.schema_manager import get_schema_manager
from milvus_fake_data.builtin_schemas import load_builtin_schema
from tempfile import NamedTemporaryFile
import json

# Use the schema manager to work with schemas
manager = get_schema_manager()

# List all available schemas
all_schemas = manager.list_all_schemas()
print("Available schemas:", list(all_schemas.keys()))

# Load any schema (built-in or custom)
schema = manager.load_schema("ecommerce")  # Built-in
# schema = manager.load_schema("my_products")  # Custom

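# Generate data from schema (the generator takes a schema file path)
with NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
    json.dump(schema, f)

# Call the generator after the file is closed so the JSON is fully written to disk
df = generate_mock_data(f.name, rows=1000, seed=42)

print(df.head())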

# Add a custom schema programmatically
custom_schema = {
    "collection_name": "my_collection",
    "fields": [
        {"name": "id", "type": "Int64", "is_primary": True},
        {"name": "text", "type": "VarChar", "max_length": 100},
        {"name": "vector", "type": "FloatVector", "dim": 256}
    ]
}

manager.add_schema("my_custom", custom_schema, "Custom schema", ["testing"])
print("Added custom schema!")
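
Since `generate_mock_data` is called with a file path in the example above, and the temporary file is created with `delete=False`, the file stays on disk afterwards. One possible way to tidy this up is a small context manager that writes the schema, closes the file, and removes it when done. This is a sketch under that assumption; `schema_file` is a hypothetical helper, not part of the package.

```python
import json
import os
from contextlib import contextmanager
from tempfile import NamedTemporaryFile


@contextmanager
def schema_file(schema):
    """Write a schema dict to a temporary JSON file and yield its path.

    The file is closed (and therefore flushed) before the path is yielded,
    and it is deleted when the with-block exits.
    """
    with NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f:
        json.dump(schema, f)
    try:
        yield f.name
    finally:
        os.unlink(f.name)


# Usage with the API shown above (requires milvus-fake-data to be installed):
# with schema_file(custom_schema) as path:
#     df = generate_mock_data(path, rows=1000, seed=42)
```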

Schema Reference

Supported Field Types

| Type | Description | Required Parameters | Optional Parameters |
| --- | --- | --- | --- |
| **Numeric Types** | | | |
| Int8, Int16, Int32, Int64 | Integer types | - | min, max |
| Float, Double | Floating point | - | min, max |
| Bool | Boolean values | - | - |
| **Text Types** | | | |
| VarChar, String | Variable length string | max_length | - |
| JSON | JSON objects | - | - |
| **Vector Types** | | | |
| FloatVector | 32-bit float vectors | dim | - |
| BinaryVector | Binary vectors | dim | - |
| Float16Vector | 16-bit float vectors | dim | - |
| BFloat16Vector | Brain float vectors | dim | - |
| Int8Vector | 8-bit integer vectors | dim | - |
| SparseFloatVector | Sparse float vectors | dim | - |
| **Complex Types** | | | |
| Array | Array of elements | element_type, max_capacity | max_length (for string elements) |

Field Properties

| Property | Description | Applicable Types |
| --- | --- | --- |
| is_primary | Mark field as primary key (exactly one required) | All types |
| auto_id | Auto-generate ID values | Int64 primary keys only |
| nullable | Allow null values (10% probability) | All types |
| min, max | Value constraints | Numeric types |
| max_length | String/element length limit | String and Array types |
| dim | Vector dimension (1-32768) | Vector types |
| element_type | Array element type | Array type |
| max_capacity | Array capacity (1-4096) | Array type |
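
To make the rules in the two tables above concrete, here is a standalone sketch that checks a schema dictionary against them. It is not the package's Pydantic-based validation; `check_schema` is a hypothetical function that encodes only the constraints listed above (one primary key, `auto_id` on Int64 primary keys, required parameters, and the stated ranges for `dim` and `max_capacity`).

```python
VECTOR_TYPES = {
    "FloatVector", "BinaryVector", "Float16Vector",
    "BFloat16Vector", "Int8Vector", "SparseFloatVector",
}
STRING_TYPES = {"VarChar", "String"}


def check_schema(schema):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    fields = schema.get("fields", [])

    # Exactly one field must be marked as the primary key.
    primaries = [f for f in fields if f.get("is_primary")]
    if len(primaries) != 1:
        problems.append(f"expected exactly one primary key, found {len(primaries)}")

    for f in fields:
        name, ftype = f.get("name", "<unnamed>"), f.get("type")

        # auto_id applies to Int64 primary keys only.
        if f.get("auto_id") and not (f.get("is_primary") and ftype == "Int64"):
            problems.append(f"{name}: auto_id is only valid on an Int64 primary key")

        # String types require max_length.
        if ftype in STRING_TYPES and "max_length" not in f:
            problems.append(f"{name}: {ftype} requires max_length")

        # Vector types require dim in the range 1-32768.
        if ftype in VECTOR_TYPES:
            dim = f.get("dim")
            if not isinstance(dim, int) or not 1 <= dim <= 32768:
                problems.append(f"{name}: dim must be an integer in 1-32768")

        # Arrays require element_type and max_capacity in the range 1-4096.
        if ftype == "Array":
            if "element_type" not in f:
                problems.append(f"{name}: Array requires element_type")
            cap = f.get("max_capacity")
            if not isinstance(cap, int) or not 1 <= cap <= 4096:
                problems.append(f"{name}: max_capacity must be an integer in 1-4096")

    return problems
```

A check like this can serve as a quick pre-flight step in scripts. For the authoritative result, the `--validate-only` option described in the CLI reference runs the tool's own validation.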

Complete Example

collection_name: "advanced_catalog"
fields:
  # Primary key with auto-generated IDs
  - name: "id"
    type: "Int64"
    is_primary: true
    auto_id: true
  
  # Text fields with constraints
  - name: "title"
    type: "VarChar"
    max_length: 200
  
  - name: "description"
    type: "VarChar"
    max_length: 1000
    nullable: true
  
  # Numeric fields with ranges
  - name: "price"
    type: "Float"
    min: 0.01
    max: 9999.99
  
  - name: "rating"
    type: "Int8"
    min: 1
    max: 5
  
  # Vector for semantic search
  - name: "embedding"
    type: "FloatVector"
    dim: 768
  
  # Array of tags
  - name: "tags"
    type: "Array"
    element_type: "VarChar"
    max_capacity: 10
    max_length: 50
  
  # Structured metadata
  - name: "metadata"
    type: "JSON"
    nullable: true
  
  # Boolean flags
  - name: "in_stock"
    type: "Bool"

CLI Reference

Command Structure

The CLI is organized into three command groups:

# Main command groups
milvus-fake-data generate [options]  # Data generation
milvus-fake-data schema [command]    # Schema management
milvus-fake-data clean [options]     # Utility commands

Data Generation Commands

| Option | Description | Example |
| --- | --- | --- |
| --schema PATH | Generate from custom schema file | milvus-fake-data generate --schema my_schema.json |
| --builtin SCHEMA_ID | Use built-in or managed schema | milvus-fake-data generate --builtin ecommerce |
| --rows INTEGER | Number of rows to generate | milvus-fake-data generate --rows 5000 |
| --format FORMAT | Output format (parquet, csv, json, npy) | milvus-fake-data generate --format csv |
| --out DIRECTORY | Output directory path | milvus-fake-data generate --out my_data/ |
| --preview | Show first 5 rows | milvus-fake-data generate --preview |
| --seed INTEGER | Random seed for reproducibility | milvus-fake-data generate --seed 42 |
| --validate-only | Validate schema without generating | milvus-fake-data generate --validate-only |
| --no-progress | Disable progress bar display | milvus-fake-data generate --no-progress |
| --batch-size INTEGER | Batch size for memory efficiency | milvus-fake-data generate --batch-size 5000 |
| --yes | Auto-confirm prompts | milvus-fake-data generate --yes |
| --chunk-size INTEGER | Chunk size in MB for segments | milvus-fake-data generate --chunk-size 256 |
| --force | Force overwrite output directory | milvus-fake-data generate --force |
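
When driving the CLI from a script, it can help to assemble the argument list in one place. The sketch below builds a `generate` command from a handful of the options listed above. `build_generate_command` is a hypothetical helper, and the rule that exactly one of `--schema` or `--builtin` must be given is an assumption of this sketch rather than documented behavior.

```python
import shlex


def build_generate_command(schema=None, builtin=None, rows=None, fmt=None,
                           out=None, seed=None, preview=False):
    """Assemble an argument list for `milvus-fake-data generate`.

    Requiring exactly one of schema/builtin is an assumption of this sketch.
    """
    if (schema is None) == (builtin is None):
        raise ValueError("pass exactly one of schema or builtin")

    cmd = ["milvus-fake-data", "generate"]
    if schema is not None:
        cmd += ["--schema", str(schema)]
    else:
        cmd += ["--builtin", str(builtin)]
    if rows is not None:
        cmd += ["--rows", str(rows)]
    if fmt is not None:
        cmd += ["--format", fmt]
    if out is not None:
        cmd += ["--out", str(out)]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if preview:
        cmd.append("--preview")
    return cmd


cmd = build_generate_command(builtin="users", rows=5000, seed=42, out="users/")
print(shlex.join(cmd))  # shell-quoted form, handy for logging

# To actually run it (requires the CLI to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list to `subprocess.run` avoids shell quoting issues with paths that contain spaces.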

Schema Management Commands

| Command | Description | Example |
| --- | --- | --- |
| schema list | List all schemas (built-in + custom) | milvus-fake-data schema list |
| schema show SCHEMA_ID | Show schema details | milvus-fake-data schema show ecommerce |
| schema add SCHEMA_ID FILE | Add custom schema | milvus-fake-data schema add products schema.json |
| schema remove SCHEMA_ID | Remove custom schema | milvus-fake-data schema remove products |
| schema help | Show schema format help | milvus-fake-data schema help |

Utility Commands

| Command | Description | Example |
| --- | --- | --- |
| clean | Clean up generated output files | milvus-fake-data clean --yes |
| --help | Show help message | milvus-fake-data --help |

Common Usage Patterns

# Quick start with built-in schema
milvus-fake-data generate --builtin simple --rows 1000 --preview

# Generate large dataset with custom format
milvus-fake-data generate --builtin ecommerce --rows 100000 --format csv --out products/

# Test custom schema
milvus-fake-data generate --schema my_schema.json --validate-only

# Reproducible data generation
milvus-fake-data generate --builtin users --rows 5000 --seed 42 --out users/

# Schema management workflow
milvus-fake-data schema list
milvus-fake-data schema show ecommerce
milvus-fake-data schema add my_ecommerce ecommerce_base.json

# Clean up generated output files
milvus-fake-data clean --yes

Development

This project uses PDM for dependency management and follows modern Python development practices.

Setup Development Environment

# Clone and setup
git clone https://github.com/your-org/milvus-fake-data.git
cd milvus-fake-data
pdm install  # Install development dependencies

Development Workflow

# Code formatting and linting
pdm run ruff format src tests    # Format code
pdm run ruff check src tests     # Check linting
pdm run mypy src                 # Type checking

# Testing
pdm run pytest                           # Run all tests
pdm run pytest --cov=src --cov-report=html  # With coverage
pdm run pytest tests/test_generator.py   # Specific test file

# Combined quality checks
make lint test                   # Run linting and tests together

Project Structure

src/milvus_fake_data/
├── cli.py              # Command-line interface
├── generator.py        # Core data generation logic
├── models.py           # Pydantic validation models
├── schema_manager.py   # Schema management system
├── builtin_schemas.py  # Built-in schema definitions
├── rich_display.py     # Terminal formatting
├── logging_config.py   # Structured logging
└── schemas/            # Built-in schema files
    ├── simple.json
    ├── ecommerce.json
    └── ...

Contributing

We welcome contributions! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make your changes with tests
  4. Ensure quality checks pass: make lint test
  5. Commit changes: git commit -m 'Add amazing feature'
  6. Push to branch: git push origin feature/amazing-feature
  7. Open a Pull Request

Contribution Guidelines

  • Add tests for new functionality
  • Update documentation for API changes
  • Follow existing code style (ruff + mypy)
  • Include helpful error messages for user-facing features

License

This project is licensed under the MIT License - see the LICENSE file for details.
