
A package for managing datasets with Milvus integration


Milvus Dataset

Milvus Dataset is a Python library for managing and processing large-scale datasets, built for integration with the Milvus vector database. It provides a simple interface for creating, writing, reading, and managing datasets, and is particularly suited to large-scale vector data.

Key Features

  • Intelligent File Splitting: Automatically splits large datasets into multiple files of a manageable size for storage and querying.
  • Flexible Data Format Support: Accepts pandas DataFrames, PyArrow Tables, dictionaries, and lists of dictionaries.
  • Efficient Data Writing: Uses Dask to write data in parallel, which speeds up processing of large datasets.
  • Dynamic File Size Adjustment: Adjusts file sizes automatically toward a target size to keep storage and query performance consistent.
  • Seamless Milvus Integration: Designed specifically for the Milvus vector database, supporting efficient vector data management and querying.
  • Multiple Reading Modes: Supports streaming, batch, and full reads to suit different use cases.
  • Data Validation: Offers optional schema validation for training datasets to help ensure data quality.
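
To illustrate what supporting several input formats involves, the sketch below normalizes a dictionary of columns and a list of dictionaries into one common row form. It is not the library's implementation: `to_rows` is a hypothetical helper, and DataFrame and PyArrow inputs are left out so the example needs only the standard library.

```python
from typing import Any, Dict, List, Union

Rows = List[Dict[str, Any]]


def to_rows(data: Union[Dict[str, List[Any]], Rows]) -> Rows:
    """Normalize column-oriented or row-oriented input into a list of row dicts.

    Hypothetical helper for illustration; not part of milvus_dataset.
    """
    if isinstance(data, dict):
        # Column-oriented input: every column must have the same length.
        lengths = {len(column) for column in data.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        n_rows = lengths.pop() if lengths else 0
        return [{key: column[i] for key, column in data.items()} for i in range(n_rows)]
    if isinstance(data, list):
        # Row-oriented input: each element must already be a dictionary.
        if not all(isinstance(row, dict) for row in data):
            raise TypeError("expected a list of dictionaries")
        return list(data)
    raise TypeError(f"unsupported input type: {type(data).__name__}")


columns = {"id": [1, 2], "label": ["cat", "dog"]}
print(to_rows(columns))
```

Converting everything to one internal form early means the later writing and validation steps only have to handle a single shape of data.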

Installation

Install Milvus Dataset using pip:

pip install milvus-dataset

Quick Start

Here's a simple usage example:

from milvus_dataset import Dataset, configure_logger

# Configure logging level
configure_logger(level="INFO")

# Initialize the dataset
dataset = Dataset("my_dataset", root_path="/path/to/data")

# Write data
data = {...}  # Your data: a DataFrame, a dictionary, a list of dictionaries, etc.
dataset.write(data, mode='append')

# Read data
train_data = dataset.read(split='train')

Detailed Usage

Writing Data

# Use DatasetWriter for more granular control
from milvus_dataset import DatasetWriter

writer = DatasetWriter(dataset, target_file_size_mb=5)
writer.write(data, mode='append')
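
As a rough illustration of how a target file size can translate into a number of rows per file, here is a minimal sketch. The helpers `rows_per_file` and `split_rows` are hypothetical and the per-row byte estimate is an assumption; the library's actual sizing logic may differ.

```python
from typing import Iterator, List, Sequence


def rows_per_file(target_file_size_mb: float, bytes_per_row: int) -> int:
    """Estimate how many rows fit into one file of the target size (hypothetical helper)."""
    if bytes_per_row <= 0:
        raise ValueError("bytes_per_row must be positive")
    target_bytes = int(target_file_size_mb * 1024 * 1024)
    return max(1, target_bytes // bytes_per_row)


def split_rows(rows: Sequence, chunk_size: int) -> Iterator[List]:
    """Yield consecutive chunks of at most chunk_size rows."""
    for start in range(0, len(rows), chunk_size):
        yield list(rows[start:start + chunk_size])


# Assumption: a 128-dimensional float32 vector (512 bytes) plus an 8-byte id,
# ignoring file format overhead and compression.
bytes_per_row = 128 * 4 + 8
chunk_size = rows_per_file(target_file_size_mb=5, bytes_per_row=bytes_per_row)
n_files = len(list(split_rows(range(25_000), chunk_size)))
print(chunk_size, n_files)
```

In practice the on-disk size also depends on the file format and compression, so an estimate like this would typically be refined after the first files are written.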

Reading Data

# Full read
full_data = dataset.read(mode='full')

# Stream read
for batch in dataset.read(mode='stream'):
    process_batch(batch)

# Batch read
for batch in dataset.read(mode='batch', batch_size=1000):
    process_batch(batch)
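
The three modes differ mainly in how much data is held in memory at once. The sketch below mimics them with plain Python generators over in-memory "files". The function names are hypothetical, and it assumes that stream mode yields one file's rows at a time, which the library may not do.

```python
from typing import Any, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def read_stream(files: Iterable[List[Row]]) -> Iterator[List[Row]]:
    """Yield the rows of one file at a time (assumed streaming behavior)."""
    for rows in files:
        yield rows


def read_batches(files: Iterable[List[Row]], batch_size: int) -> Iterator[List[Row]]:
    """Yield fixed-size batches that may span file boundaries."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    buffer: List[Row] = []
    for rows in files:
        for row in rows:
            buffer.append(row)
            if len(buffer) == batch_size:
                yield buffer
                buffer = []
    if buffer:
        # The final batch may be smaller than batch_size.
        yield buffer


def read_full(files: Iterable[List[Row]]) -> List[Row]:
    """Load every row into memory at once."""
    return [row for rows in files for row in rows]


files = [[{"id": 0}, {"id": 1}, {"id": 2}], [{"id": 3}, {"id": 4}]]
print([len(batch) for batch in read_batches(files, batch_size=2)])
```

Streaming and batch reads keep memory use bounded, while a full read is simplest when the dataset fits in memory.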

Schema Validation

# Set schema for training data
dataset.set_schema({
    "id": (int, ...),
    "vector": ([float], 128),  # 128-dimensional vector
    "label": (str, ...)
})
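
To make the schema format above more concrete, the following sketch checks a single row against a schema of the same shape. `validate_row` is a hypothetical function, not part of the library. It assumes that `...` means "required, no further constraint" and that an integer in the second position is the expected vector length.

```python
from typing import Any, Dict, List


def validate_row(row: Dict[str, Any], schema: Dict[str, tuple]) -> List[str]:
    """Return a list of validation error messages; empty if the row is valid.

    Hypothetical validator under the assumptions stated above.
    """
    errors: List[str] = []
    for field, (expected, constraint) in schema.items():
        if field not in row:
            errors.append(f"{field}: missing")
            continue
        value = row[field]
        if isinstance(expected, list):
            # A one-element list such as [float] describes a list of that type.
            elem_type = expected[0]
            if not isinstance(value, list) or not all(isinstance(x, elem_type) for x in value):
                errors.append(f"{field}: expected list of {elem_type.__name__}")
            elif constraint is not Ellipsis and len(value) != constraint:
                errors.append(f"{field}: expected length {constraint}, got {len(value)}")
        elif not isinstance(value, expected):
            errors.append(f"{field}: expected {expected.__name__}, got {type(value).__name__}")
    return errors


schema = {"id": (int, ...), "vector": ([float], 4), "label": (str, ...)}
print(validate_row({"id": 1, "vector": [0.1, 0.2], "label": "cat"}, schema))
```

Returning a list of messages instead of raising on the first problem lets a caller report every issue in a row at once.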

Configuration

Milvus Dataset can be configured through environment variables or a configuration file. Key configuration items include:

  • MILVUS_DATASET_ROOT: Root directory for datasets
  • MILVUS_DATASET_LOG_LEVEL: Logging level
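
As an illustration of how the two environment variables above might be consumed, here is a minimal sketch using only the standard library. The `Settings` class, the `load_settings` function, and the fallback values `./data` and `INFO` are assumptions made for this example, not documented defaults of the library.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the two documented configuration items."""
    root: str
    log_level: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from a mapping, defaulting to the process environment."""
    env = os.environ if env is None else env
    return Settings(
        # Fallback values are illustrative assumptions, not library defaults.
        root=env.get("MILVUS_DATASET_ROOT", "./data"),
        log_level=env.get("MILVUS_DATASET_LOG_LEVEL", "INFO").upper(),
    )


settings = load_settings({"MILVUS_DATASET_ROOT": "/path/to/data"})
print(settings.root, settings.log_level)
```

Passing the environment as an argument keeps the function easy to test without modifying `os.environ`.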

Contributing

We welcome contributions of all kinds. If you find a bug or have a feature suggestion, please create an issue. If you'd like to contribute code, please submit a pull request.

License

Milvus Dataset is licensed under the Apache License 2.0.

Contact Us

If you have any questions or suggestions, please contact us through GitHub Issues.

