Milvus Dataset

Milvus Dataset is a Python library for managing and processing large-scale datasets. It is optimized for integration with the Milvus vector database, but it also works as a standalone dataset management tool. The library provides a simple interface for creating, writing, reading, and managing datasets, and it is particularly suited to large-scale vector data.

Key Features

Flexible Storage Support
- Local storage support
- Object storage support (S3/MinIO)
- Easy migration between different storage types

Rich Data Type Support
- Basic data types (INT64, VARCHAR, etc.)
- Vector data types (FLOAT_VECTOR)
- JSON fields
- Sparse vectors
- Binary vectors

Dataset Management
- Training and test set split support
- Dataset metadata management
- Dataset statistics and analytics
- Schema definition and validation

Integration Capabilities
- Import to Milvus database
- Upload to Hugging Face Hub
- Seamless pandas DataFrame integration
- Built-in nearest neighbor computation
- Built-in mock data generation

Installation

pip install milvus-dataset

Quick Start Guide

1. Basic Configuration

from milvus_dataset import ConfigManager, StorageType

# Initialize local storage
ConfigManager().init_storage(
    root_path="./data/my-dataset",
    storage_type=StorageType.LOCAL,
)

# Initialize S3 storage
ConfigManager().init_storage(
    root_path="s3://bucket/path",
    storage_type=StorageType.S3,
    options={
        "aws_access_key_id": "your_key",
        "aws_secret_access_key": "your_secret",
        "endpoint_url": "your_endpoint"  # Optional, for MinIO
    }
)
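
The S3 example above hardcodes credentials for brevity. A minimal sketch of building the same `options` dict from environment variables is shown below. `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the conventional AWS variable names; `S3_ENDPOINT_URL` and the helper `s3_options_from_env` are hypothetical names chosen for this illustration and are not part of the library.

```python
import os
from typing import Dict, Mapping, Optional


def s3_options_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build the `options` dict for S3 storage from environment variables.

    Hypothetical helper, not part of milvus_dataset.
    """
    env = os.environ if env is None else env
    options = {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }
    # The endpoint is only needed for S3-compatible services such as MinIO.
    # S3_ENDPOINT_URL is an assumed variable name for this sketch.
    endpoint = env.get("S3_ENDPOINT_URL")
    if endpoint:
        options["endpoint_url"] = endpoint
    return options
```

The returned dict can then be passed as `options=` to `init_storage`, which keeps secrets out of source control.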

2. Creating a Dataset

from pymilvus import CollectionSchema, DataType, FieldSchema
from milvus_dataset import load_dataset

# Define Schema
schema = CollectionSchema(
    fields=[
        FieldSchema("id", DataType.INT64, is_primary=True),
        FieldSchema("text", DataType.VARCHAR, max_length=65535),
        FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=1024)
    ],
    description="Text vector dataset"
)

# Load dataset
dataset = load_dataset("my-dataset", schema=schema)

3. Writing Data

import pandas as pd
import numpy as np

# Prepare data
df = pd.DataFrame({
    "id": range(1000),
    "text": ["text_" + str(i) for i in range(1000)],
    "embedding": [np.random.rand(1024) for _ in range(1000)]
})

# Write to training set
with dataset["train"].get_writer(mode="append") as writer:
    writer.write(df)
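
Before writing, it can help to check that the rows match the schema defined earlier. The sketch below is a standalone pre-check and does not use the library's own schema validation, whose behavior is not described here. The helper name `validate_rows` is hypothetical, and it assumes the VARCHAR `max_length` limit is measured in UTF-8 bytes.

```python
from typing import Any, Dict, List


def validate_rows(
    rows: List[Dict[str, Any]], dim: int = 1024, max_length: int = 65535
) -> List[str]:
    """Return a list of problems; an empty list means the rows look consistent.

    Hypothetical helper, not part of milvus_dataset.
    """
    problems: List[str] = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Primary keys must be unique.
        if row["id"] in seen_ids:
            problems.append(f"row {i}: duplicate id {row['id']}")
        seen_ids.add(row["id"])
        # Assumption: the VARCHAR limit is counted in UTF-8 bytes.
        if len(row["text"].encode("utf-8")) > max_length:
            problems.append(f"row {i}: text exceeds {max_length} bytes")
        # Every vector must have exactly `dim` components.
        if len(row["embedding"]) != dim:
            problems.append(
                f"row {i}: embedding has {len(row['embedding'])} values, expected {dim}"
            )
    return problems
```

With a pandas DataFrame, the rows can be obtained with `df.to_dict("records")`.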

4. Dataset Operations

# View dataset information
print(dataset.summary())

# Compute neighbors
dataset.compute_neighbors(
    vector_field_name="embedding",
    pk_field_name="id",
    top_k=100
)

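# Import to Milvus
# NOTE: StorageConfig is assumed to be importable from milvus_dataset,
# alongside StorageType; adjust the import if your version differs.
from milvus_dataset import StorageConfig, StorageType

dataset.to_milvus(
    milvus_config={
        "host": "localhost",
        "port": 19530
    },
    milvus_storage=StorageConfig(
        root_path="s3://bucket/path",
        storage_type=StorageType.S3,
        options={
            "aws_access_key_id": "your_key",
            "aws_secret_access_key": "your_secret",
            "endpoint_url": "your_endpoint"  # Optional, for MinIO
        }
    )
)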

# Upload to Hugging Face
dataset.to_hf(repo_name="username/dataset-name")
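
To make the `top_k` parameter of `compute_neighbors` concrete, here is a brute-force sketch of what a nearest neighbor computation produces. It is not the library's implementation: the distance metric (Euclidean), the exclusion of each vector from its own neighbor list, and the function name `brute_force_neighbors` are all assumptions made for illustration.

```python
import heapq
import math
from typing import Dict, List, Sequence


def brute_force_neighbors(
    ids: Sequence[int], vectors: Sequence[Sequence[float]], top_k: int
) -> Dict[int, List[int]]:
    """Map each id to the ids of its top_k nearest vectors (Euclidean distance).

    Illustrative only: O(n^2) comparisons, suitable for small inputs.
    """
    neighbors: Dict[int, List[int]] = {}
    for query_id, query in zip(ids, vectors):
        # Distance from the query to every other vector, skipping itself.
        candidates = (
            (math.dist(query, vec), other_id)
            for other_id, vec in zip(ids, vectors)
            if other_id != query_id
        )
        # nsmallest keeps only the top_k closest candidates, nearest first.
        neighbors[query_id] = [oid for _, oid in heapq.nsmallest(top_k, candidates)]
    return neighbors
```

Results like these are typically used as ground truth when measuring the recall of an approximate vector index.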

Advanced Usage

Performance Optimization

File Size Configuration

with dataset["train"].get_writer(
    mode="append",
    target_file_size_mb=512,  # Adjust file size
    num_buffers=15,           # Adjust buffer number
    queue_size=30             # Adjust queue size
) as writer:
    writer.write(df)
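
A rough way to reason about `target_file_size_mb` is to estimate how many rows fit in one file. The sketch below assumes vectors are stored as 4-byte floats and ignores compression, file format overhead, and the other columns, so treat the result as an upper-bound estimate rather than what the writer actually produces. The helper name `rows_per_file` is hypothetical.

```python
def rows_per_file(target_file_size_mb: int, dim: int, bytes_per_value: int = 4) -> int:
    """Estimate how many vector rows fit in one file of the target size.

    Assumes uncompressed storage with `bytes_per_value` bytes per component.
    """
    target_bytes = target_file_size_mb * 1024 * 1024
    bytes_per_row = dim * bytes_per_value
    return target_bytes // bytes_per_row


# 512 MB files with 1024-dimensional float32 vectors
print(rows_per_file(512, 1024))  # 131072
```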

Batch Processing

# Read in batches; process_batch is a placeholder for your own handler
for batch in dataset["train"].read(mode="batch", batch_size=1000):
    process_batch(batch)
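
The batch reading pattern above keeps memory use bounded because only one batch is held at a time. The same idea can be sketched for any iterable with a small generator. This is a generic illustration, not the library's reader, and `iter_batches` is a hypothetical name.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items from rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(rows)
    while True:
        # islice pulls the next batch_size items without loading the rest.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

The final batch may be shorter than `batch_size`, so downstream code should not assume a fixed batch length.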

Storage Migration

# Move data from local to S3
dataset.to_storage(StorageConfig(
    storage_type=StorageType.S3,
    root_path="s3://bucket/path",
    options={...}
))

Common Issues and Solutions

Storage Type Selection
- Use local storage for development and testing
- Use object storage for production environments

Handling Large-Scale Data
- Use batch writing
- Set an appropriate buffer size and queue size
- Consider parallel processing

Ensuring Data Quality
- Define a comprehensive schema
- Enable schema validation
- Regularly check dataset statistics

Performance Optimization Tips
- Set a reasonable file size (target_file_size_mb)
- Adjust buffer parameters (num_buffers, queue_size)
- Process data in batches instead of one row at a time
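
The suggestion to consider parallel processing can be sketched with the standard library. The example below runs a per-batch function on a thread pool; it is independent of milvus_dataset, and `process_in_parallel` is a hypothetical helper. Threads mainly help when the per-batch work is I/O-bound (for example, calling an embedding service); CPU-bound work would generally need a process pool instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

B = TypeVar("B")
R = TypeVar("R")


def process_in_parallel(
    batches: Iterable[B], worker: Callable[[B], R], max_workers: int = 4
) -> List[R]:
    """Apply worker to each batch on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(worker, batches))
```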

Contributing

We welcome contributions! Please feel free to submit a Pull Request.
