# High-Performance Milvus Data Generator
Ultra-fast mock data generator for Milvus vector databases, built for large-scale data generation with vectorized operations, parallel processing, and optimized file I/O. Generate millions of rows in minutes with automatic file partitioning and intelligent memory management.
## Performance Highlights

- **10,000+ rows/sec** - Vectorized NumPy operations for maximum speed
- **Large-scale optimized** - Designed for datasets >100K rows with intelligent batching
- **Smart file partitioning** - Automatic splitting (256MB chunks, 1M rows/file)
- **Memory efficient** - Streaming generation prevents memory exhaustion
- **Direct PyArrow I/O** - Optimized Parquet writing with Snappy compression
- **Parallel processing** - Multi-core CPU utilization with configurable workers
## Key Features

- **Ready-to-use schemas** - Pre-built schemas for e-commerce, documents, images, users, news, and videos
- **Schema management** - Add, organize, and reuse custom schemas with metadata
- **High-performance generation** - Vectorized operations optimized for large datasets
- **Complete Milvus support** - All field types including vectors, arrays, JSON, and primitive types
- **Smart validation** - Pydantic-based validation with detailed error messages and suggestions
- **High-performance formats** - Parquet (fastest I/O), JSON (structured data)
- **Reproducible results** - Seed support for consistent data generation
- **Rich customization** - Field constraints, nullable fields, auto-generated IDs
- **Schema exploration** - Validation, help commands, and schema details
- **Unified interface** - Use custom and built-in schemas interchangeably
## Installation

```bash
# Install from PyPI (when published)
pip install milvus-fake-data

# Or install from source
git clone https://github.com/your-org/milvus-fake-data.git
cd milvus-fake-data
pdm install
```
## Quick Start

### 1. Use Built-in Schemas (Recommended)

Get started instantly with pre-built schemas optimized for large-scale generation:

```bash
# List all available built-in schemas
milvus-fake-data schema list

# Generate data using a built-in schema (high-performance by default)
milvus-fake-data generate --builtin simple --rows 100000 --preview

# Generate a large e-commerce dataset with automatic file partitioning
milvus-fake-data generate --builtin ecommerce --rows 2500000 --out products/
```
**Available Built-in Schemas:**

| Schema | Description | Use Cases |
|---|---|---|
| `simple` | Basic example with common field types | Learning, testing |
| `ecommerce` | Product catalog with search embeddings | Online stores, recommendations |
| `documents` | Document search with semantic embeddings | Knowledge bases, document search |
| `images` | Image gallery with visual similarity | Media platforms, image search |
| `users` | User profiles with behavioral embeddings | User analytics, personalization |
| `videos` | Video library with multimodal embeddings | Video platforms, content discovery |
| `news` | News articles with sentiment analysis | News aggregation, content analysis |
### 2. Create Custom Schemas

Define your own collection structure with JSON or YAML:

```json
{
  "collection_name": "my_collection",
  "fields": [
    {
      "name": "id",
      "type": "Int64",
      "is_primary": true
    },
    {
      "name": "title",
      "type": "VarChar",
      "max_length": 256
    },
    {
      "name": "embedding",
      "type": "FloatVector",
      "dim": 128
    }
  ]
}
```

```bash
# Generate a large dataset from a custom schema in high-performance mode
milvus-fake-data generate --schema my_schema.json --rows 1000000 --format parquet --preview
```

> **Note:** Output is always a directory containing data files (in the specified format) and a `meta.json` file with collection metadata.
### 3. Schema Management

Store and organize your schemas for reuse:

```bash
# Add a custom schema to your library
milvus-fake-data schema add my_products product_schema.json

# List all schemas (built-in + custom)
milvus-fake-data schema list

# Use your custom schema like a built-in one
milvus-fake-data generate --builtin my_products --rows 500000

# Show detailed schema information
milvus-fake-data schema show my_products
```
### 4. Python API

```python
import json
from tempfile import NamedTemporaryFile

from milvus_fake_data.generator import generate_mock_data
from milvus_fake_data.schema_manager import get_schema_manager

# Use the schema manager to work with schemas
manager = get_schema_manager()

# List all available schemas
all_schemas = manager.list_all_schemas()
print("Available schemas:", list(all_schemas.keys()))

# Load any schema (built-in or custom)
schema = manager.load_schema("ecommerce")  # Built-in
# schema = manager.load_schema("my_products")  # Custom

# Write the schema to a temporary file, then generate from it.
# Closing the file first ensures the JSON is fully flushed to disk.
with NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f:
    json.dump(schema, f)

df = generate_mock_data(f.name, rows=100000, seed=42)
print(df.head())

# Add a custom schema programmatically
custom_schema = {
    "collection_name": "my_collection",
    "fields": [
        {"name": "id", "type": "Int64", "is_primary": True},
        {"name": "text", "type": "VarChar", "max_length": 100},
        {"name": "vector", "type": "FloatVector", "dim": 256},
    ],
}
manager.add_schema("my_custom", custom_schema, "Custom schema", ["testing"])
print("Added custom schema!")
```
## Schema Reference

### Supported Field Types

| Type | Description | Required Parameters | Optional Parameters |
|---|---|---|---|
| **Numeric Types** | | | |
| `Int8`, `Int16`, `Int32`, `Int64` | Integer types | - | `min`, `max` |
| `Float`, `Double` | Floating point | - | `min`, `max` |
| `Bool` | Boolean values | - | - |
| **Text Types** | | | |
| `VarChar`, `String` | Variable-length string | `max_length` | - |
| `JSON` | JSON objects | - | - |
| **Vector Types** | | | |
| `FloatVector` | 32-bit float vectors | `dim` | - |
| `BinaryVector` | Binary vectors | `dim` | - |
| `Float16Vector` | 16-bit float vectors | `dim` | - |
| `BFloat16Vector` | Brain float vectors | `dim` | - |
| `Int8Vector` | 8-bit integer vectors | `dim` | - |
| `SparseFloatVector` | Sparse float vectors | `dim` | - |
| **Complex Types** | | | |
| `Array` | Array of elements | `element_type`, `max_capacity` | `max_length` (for string elements) |
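As a rough illustration of how the main vector types are laid out in memory (standard NumPy conventions, not this package's internals):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 128

# FloatVector: dim float32 values, commonly L2-normalized for cosine search.
fv = rng.standard_normal(dim).astype(np.float32)
fv /= np.linalg.norm(fv)

# BinaryVector: dim bits packed into dim // 8 bytes (dim divisible by 8).
bits = rng.integers(0, 2, size=dim, dtype=np.uint8)
bv = np.packbits(bits)  # 16 bytes for dim=128

# Float16Vector: same layout as FloatVector, narrower dtype.
hv = fv.astype(np.float16)
```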
### Field Properties

| Property | Description | Applicable Types |
|---|---|---|
| `is_primary` | Mark field as primary key (exactly one required) | All types |
| `auto_id` | Auto-generate ID values | Int64 primary keys only |
| `nullable` | Allow null values (10% probability) | All types |
| `min`, `max` | Value constraints | Numeric types |
| `max_length` | String/element length limit | String and Array types |
| `dim` | Vector dimension (1-32768) | Vector types |
| `element_type` | Array element type | Array type |
| `max_capacity` | Array capacity (1-4096) | Array type |
### Complete Example

```yaml
collection_name: "advanced_catalog"
fields:
  # Primary key with auto-generated IDs
  - name: "id"
    type: "Int64"
    is_primary: true
    auto_id: true

  # Text fields with constraints
  - name: "title"
    type: "VarChar"
    max_length: 200
  - name: "description"
    type: "VarChar"
    max_length: 1000
    nullable: true

  # Numeric fields with ranges
  - name: "price"
    type: "Float"
    min: 0.01
    max: 9999.99
  - name: "rating"
    type: "Int8"
    min: 1
    max: 5

  # Vector for semantic search
  - name: "embedding"
    type: "FloatVector"
    dim: 768

  # Array of tags
  - name: "tags"
    type: "Array"
    element_type: "VarChar"
    max_capacity: 10
    max_length: 50

  # Structured metadata
  - name: "metadata"
    type: "JSON"
    nullable: true

  # Boolean flags
  - name: "in_stock"
    type: "Bool"
```
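Constraints like the ones above (min/max ranges, ~10% nullable) map naturally onto vectorized draws; a sketch of that translation (illustrative, not the package's generator code):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Float with min/max constraint: one uniform draw over the whole range.
price = rng.uniform(0.01, 9999.99, size=n).astype(np.float32)

# Int8 rating in [1, 5]: note `high` is exclusive for integers().
rating = rng.integers(1, 6, size=n, dtype=np.int8)

# Nullable field: mask roughly 10% of values as null.
null_mask = rng.random(n) < 0.10
description = np.where(null_mask, None, "sample text")
```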
## CLI Reference

### Command Structure

The CLI uses a clean grouped structure:

```bash
milvus-fake-data generate [options]   # Data generation
milvus-fake-data schema [command]     # Schema management
milvus-fake-data clean [options]      # Utility commands
```
### Data Generation Options

| Option | Description | Example |
|---|---|---|
| `--schema PATH` | Generate from custom schema file | `milvus-fake-data generate --schema my_schema.json` |
| `--builtin SCHEMA_ID` | Use built-in or managed schema | `milvus-fake-data generate --builtin ecommerce` |
| `--rows INTEGER` | Number of rows to generate | `milvus-fake-data generate --rows 5000` |
| `--format FORMAT` | Output format (parquet, json) | `milvus-fake-data generate --format json` |
| `--out DIRECTORY` | Output directory path | `milvus-fake-data generate --out my_data/` |
| `--preview` | Show first 5 rows | `milvus-fake-data generate --preview` |
| `--seed INTEGER` | Random seed for reproducibility | `milvus-fake-data generate --seed 42` |
| `--validate-only` | Validate schema without generating | `milvus-fake-data generate --validate-only` |
| `--no-progress` | Disable progress bar display | `milvus-fake-data generate --no-progress` |
| `--batch-size INTEGER` | Batch size for memory efficiency (default: 50000) | `milvus-fake-data generate --batch-size 100000` |
| `--max-file-size INTEGER` | Maximum size per file in MB (default: 256) | `milvus-fake-data generate --max-file-size 100` |
| `--max-rows-per-file INTEGER` | Maximum rows per file (default: 1000000) | `milvus-fake-data generate --max-rows-per-file 500000` |
| `--yes` | Auto-confirm prompts | `milvus-fake-data generate --yes` |
| `--force` | Force overwrite output directory | `milvus-fake-data generate --force` |
### Schema Management Commands

| Command | Description | Example |
|---|---|---|
| `schema list` | List all schemas (built-in + custom) | `milvus-fake-data schema list` |
| `schema show SCHEMA_ID` | Show schema details | `milvus-fake-data schema show ecommerce` |
| `schema add SCHEMA_ID FILE` | Add custom schema | `milvus-fake-data schema add products schema.json` |
| `schema remove SCHEMA_ID` | Remove custom schema | `milvus-fake-data schema remove products` |
| `schema help` | Show schema format help | `milvus-fake-data schema help` |
### Utility Commands

| Command | Description | Example |
|---|---|---|
| `clean` | Clean up generated output files | `milvus-fake-data clean --yes` |
| `--help` | Show help message | `milvus-fake-data --help` |
### Common Usage Patterns

```bash
# Quick start with a built-in schema (high-performance by default)
milvus-fake-data generate --builtin simple --rows 100000 --preview

# Generate massive datasets with automatic file partitioning
milvus-fake-data generate --builtin ecommerce --rows 5000000 --format parquet --out products/

# Test custom schema validation
milvus-fake-data generate --schema my_schema.json --validate-only

# Reproducible large-scale data generation
milvus-fake-data generate --builtin users --rows 2000000 --seed 42 --out users/

# Control file partitioning (smaller files for easier handling)
milvus-fake-data generate --builtin ecommerce --rows 5000000 --max-file-size 128 --max-rows-per-file 500000

# Schema management workflow
milvus-fake-data schema list
milvus-fake-data schema show ecommerce
milvus-fake-data schema add my_ecommerce ecommerce_base.json

# Clean up generated output files
milvus-fake-data clean --yes
```
## Development

This project uses PDM for dependency management and follows modern Python development practices.

### Setup Development Environment

```bash
# Clone and setup
git clone https://github.com/your-org/milvus-fake-data.git
cd milvus-fake-data
pdm install  # Install development dependencies
```

### Development Workflow

```bash
# Code formatting and linting
pdm run ruff format src tests   # Format code
pdm run ruff check src tests    # Check linting
pdm run mypy src                # Type checking

# Testing
pdm run pytest                               # Run all tests
pdm run pytest --cov=src --cov-report=html   # With coverage
pdm run pytest tests/test_generator.py       # Specific test file

# Combined quality checks
make lint test  # Run linting and tests together
```
### Project Structure

```
src/milvus_fake_data/
├── cli.py               # High-performance CLI interface
├── generator.py         # Core data generation logic
├── optimized_writer.py  # Vectorized data generation with file partitioning
├── models.py            # Pydantic validation models
├── schema_manager.py    # Schema management system
├── builtin_schemas.py   # Built-in schema definitions
├── rich_display.py      # Terminal formatting
├── logging_config.py    # Structured logging
└── schemas/             # Built-in schema files
    ├── simple.json
    ├── ecommerce.json
    └── ...
```
## Performance Benchmarks

The high-performance engine sustains roughly 7,500-11,500 rows/sec for large-scale data generation:
| Dataset Size | Time | Throughput | Memory Usage | File Output |
|---|---|---|---|---|
| 100K rows | ~13s | 7,500 rows/sec | <1GB | Single file |
| 1M rows | ~87s | 11,500 rows/sec | <2GB | Single file |
| 2.5M rows | ~217s | 11,500 rows/sec | <3GB | 5 files (auto-partitioned) |
| 10M rows | ~870s | 11,500 rows/sec | <4GB | 10 files (auto-partitioned) |
**Key Performance Features:**

- **Vectorized Operations**: NumPy-based batch processing for maximum CPU efficiency
- **Smart Memory Management**: Streaming generation prevents memory exhaustion
- **Automatic File Partitioning**: Files split at 256MB/1M rows for optimal handling
- **Optimized I/O**: Direct PyArrow integration with Snappy compression
- **Parallel Processing**: Multi-core utilization for vector generation and normalization

**Recommended Settings for Large Datasets:**

- Use `--format parquet` for fastest I/O (default)
- Use a batch size of 50K-100K rows for the best memory/speed balance
- Enable automatic file partitioning for datasets >1M rows
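A back-of-envelope helper for estimating how many files a run will produce under the 256MB/1M-row defaults (illustrative only; the real writer partitions on actual serialized size, and `plan_partitions` and the per-row byte estimate are assumptions):

```python
def plan_partitions(total_rows: int, avg_row_bytes: int,
                    max_file_mb: int = 256,
                    max_rows_per_file: int = 1_000_000) -> int:
    """Estimate the number of output files: whichever limit binds first."""
    rows_by_size = (max_file_mb * 1024 * 1024) // avg_row_bytes
    rows_per_file = min(max_rows_per_file, rows_by_size)
    return -(-total_rows // rows_per_file)  # ceiling division

# e.g. 2.5M rows at an assumed ~512 bytes/row (a 128-dim float32 vector
# plus a few scalar columns):
print(plan_partitions(2_500_000, 512))  # → 5 files
```

For narrow schemas the row-count cap usually binds (one file per million rows); for wide vector schemas the 256MB size cap kicks in first.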
## Contributing

We welcome contributions! Please follow these steps:

1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Make your changes with tests
4. Ensure quality checks pass: `make lint test`
5. Commit changes: `git commit -m 'Add amazing feature'`
6. Push to branch: `git push origin feature/amazing-feature`
7. Open a Pull Request

### Contribution Guidelines

- Add tests for new functionality
- Update documentation for API changes
- Follow existing code style (ruff + mypy)
- Include helpful error messages for user-facing features
## License

This project is licensed under the MIT License - see the LICENSE file for details.