A package for managing datasets with Milvus integration
Milvus Dataset
Milvus Dataset is a Python library for managing and processing large-scale datasets, built for integration with the Milvus vector database. It provides a simple interface for creating, writing, reading, and managing datasets, and is particularly suited to large-scale vector data.
Key Features
- Intelligent File Splitting: Automatically splits large datasets into appropriately sized files, optimizing storage and query efficiency.
- Flexible Data Format Support: Supports various data formats including pandas DataFrame, PyArrow Table, dictionaries, and lists of dictionaries.
- Efficient Data Writing: Utilizes Dask for parallelized data writing, significantly enhancing large-scale data processing speed.
- Dynamic File Size Adjustment: Automatically adjusts file sizes to ensure optimal storage and query performance.
- Seamless Milvus Integration: Designed specifically for the Milvus vector database, supporting efficient vector data management and querying.
- Multiple Reading Modes: Supports streaming, batch, and full data reading, adapting to different use cases.
- Data Validation: Offers optional schema validation for training datasets, ensuring data quality.
Installation
Install Milvus Dataset using pip:
pip install milvus-dataset
Quick Start
Here's a simple usage example:
from milvus_dataset import Dataset, configure_logger
# Configure logging level
configure_logger(level="INFO")
# Initialize the dataset
dataset = Dataset("my_dataset", root_path="/path/to/data")
# Write data
data = {...} # Your data, can be a DataFrame, dictionary, etc.
dataset.write(data, mode='append')
# Read data
train_data = dataset.read(split='train')
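The `data = {...}` placeholder above is left unspecified. As one possibility, a list of dictionaries (one of the input formats named under Key Features) could be built as in the sketch below. The field names `id`, `vector`, and `label` and the 128-dimensional vector are borrowed from the Schema Validation example; `make_records` is a hypothetical helper for illustration, not part of milvus_dataset.

```python
import random


def make_records(n, dim=128, seed=0):
    """Build n sample records as a list of dictionaries (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return [
        {
            "id": i,
            "vector": [rng.random() for _ in range(dim)],
            "label": f"item-{i}",
        }
        for i in range(n)
    ]


# A list of dictionaries, ready to be passed wherever `data` is expected.
data = make_records(1000)
```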
Detailed Usage
Writing Data
# Use DatasetWriter for more granular control
from milvus_dataset import DatasetWriter
writer = DatasetWriter(dataset, target_file_size_mb=5)
writer.write(data, mode='append')
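To make the idea behind `target_file_size_mb` concrete, here is a minimal sketch of size-based splitting in plain Python. It is not the library's implementation: `split_by_target_size` is a hypothetical function, and estimating record size from its JSON encoding is an assumption made only so the example is self-contained.

```python
import json


def split_by_target_size(records, target_file_size_mb=5):
    """Group records into chunks whose estimated size stays within the target.

    Size is estimated from the JSON encoding of each record, which is an
    assumption of this sketch; a real writer would measure the on-disk format.
    """
    target_bytes = int(target_file_size_mb * 1024 * 1024)
    chunks, current, current_bytes = [], [], 0
    for record in records:
        size = len(json.dumps(record).encode("utf-8"))
        # Start a new chunk when adding this record would exceed the target.
        if current and current_bytes + size > target_bytes:
            chunks.append(current)
            current, current_bytes = [], 0
        current.append(record)
        current_bytes += size
    if current:
        chunks.append(current)
    return chunks


records = [{"id": i, "vector": [0.5] * 128} for i in range(100)]
chunks = split_by_target_size(records, target_file_size_mb=0.01)
```

Each chunk would then be written to its own file. A single record larger than the target still gets a chunk of its own rather than being dropped.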
Reading Data
# Full read
full_data = dataset.read(mode='full')
# Stream read
for batch in dataset.read(mode='stream'):
    process_batch(batch)
# Batch read
for batch in dataset.read(mode='batch', batch_size=1000):
    process_batch(batch)
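The snippets above call `process_batch`, which is not defined here, and do not say what a batch contains. As a rough illustration of how fixed-size batching differs from a full read, the sketch below batches any iterable using only the standard library. Both `iter_batches` and this `process_batch` are hypothetical stand-ins, not milvus_dataset functions.

```python
from itertools import islice


def iter_batches(rows, batch_size=1000):
    """Yield lists of up to batch_size rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    """Placeholder processing step: report how many rows were handled."""
    return len(batch)


# Only one batch is held in memory at a time, unlike a full read.
rows = range(2500)
handled = sum(process_batch(batch) for batch in iter_batches(rows, batch_size=1000))
```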
Schema Validation
# Set schema for training data
dataset.set_schema({
"id": (int, ...),
"vector": ([float], 128), # 128-dimensional vector
"label": (str, ...)
})
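The schema format above is not explained further. One plausible reading is that each entry is a `(type, constraint)` pair, where `...` means "no extra constraint" and an integer next to `[float]` gives the required vector length. Under that assumption, a check could look like the sketch below; `validate_record` is a hypothetical function and may differ from what the library actually does.

```python
def validate_record(record, schema):
    """Return a list of error strings; an empty list means the record passed."""
    errors = []
    for field, (expected, constraint) in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
            continue
        value = record[field]
        if isinstance(expected, list):
            # [float] is read here as "a list whose items are all floats".
            item_type = expected[0]
            if not isinstance(value, list) or not all(
                isinstance(v, item_type) for v in value
            ):
                errors.append(f"{field}: expected list of {item_type.__name__}")
            elif constraint is not ... and len(value) != constraint:
                errors.append(
                    f"{field}: expected length {constraint}, got {len(value)}"
                )
        elif not isinstance(value, expected):
            errors.append(f"{field}: expected {expected.__name__}")
    return errors


schema = {
    "id": (int, ...),
    "vector": ([float], 128),  # 128-dimensional vector
    "label": (str, ...),
}

good = {"id": 1, "vector": [0.0] * 128, "label": "cat"}
bad = {"id": "1", "vector": [0.0] * 64}  # wrong id type, short vector, no label

good_errors = validate_record(good, schema)
bad_errors = validate_record(bad, schema)
```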
Configuration
Milvus Dataset can be configured through environment variables or a configuration file. Key configuration items include:
- MILVUS_DATASET_ROOT: Root directory for datasets
- MILVUS_DATASET_LOG_LEVEL: Logging level
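As an illustration of how these two variables might be consumed, the sketch below reads them with fallbacks. The variable names come from the list above; the `load_settings` helper and the default values (`./data`, `INFO`) are assumptions, since the actual defaults are not stated here.

```python
import os


def load_settings(environ=None):
    """Read the two documented variables, falling back to assumed defaults."""
    env = os.environ if environ is None else environ
    return {
        # "./data" is an assumed default, not a documented one.
        "root": env.get("MILVUS_DATASET_ROOT", "./data"),
        # "INFO" is an assumed default; the level is normalized to upper case.
        "log_level": env.get("MILVUS_DATASET_LOG_LEVEL", "INFO").upper(),
    }


# Passing a mapping explicitly keeps the example independent of the real environment.
settings = load_settings(
    {"MILVUS_DATASET_ROOT": "/path/to/data", "MILVUS_DATASET_LOG_LEVEL": "debug"}
)
```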
Contributing
We welcome contributions of all forms! If you find a bug or have a feature suggestion, please create an issue. If you'd like to contribute code, please submit a pull request.
License
Milvus Dataset is licensed under the Apache 2.0 License.
Contact Us
If you have any questions or suggestions, please contact us through GitHub Issues.
File details
Details for the file milvus-dataset-0.1.0.tar.gz.
File metadata
- Download URL: milvus-dataset-0.1.0.tar.gz
- Upload date:
- Size: 139.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: pdm/2.18.1 CPython/3.12.5 Darwin/22.6.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 253827f82adb5683e4ac744ff40946cc4415a2caa11aa3d5294155dd76496a7d
MD5 | 1295b459d6294c7d9808b39c86dcb032
BLAKE2b-256 | e6c9b1e332a700694262fec5e15f51ce5182c80753ce3cc12a7a33ad16a738fd
File details
Details for the file milvus_dataset-0.1.0-py3-none-any.whl.
File metadata
- Download URL: milvus_dataset-0.1.0-py3-none-any.whl
- Upload date:
- Size: 24.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: pdm/2.18.1 CPython/3.12.5 Darwin/22.6.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0ecdef9f93cbfd7f1f6042383e94cf7a7572eb4b01e54917e5bb4ab7ce0c2f60
MD5 | 005f31e1910af544b94d927b1868ef3e
BLAKE2b-256 | 588c719b252602cc2ca1d8c1e2ebc79425af41839decd4f7849b4cc6f79a0af9