A tool for managing and uploading datasets to Argilla
Argilla Dataset Manager
A Python tool for creating, configuring, and uploading text datasets to Argilla, with predefined templates for common dataset types and configurable fields and metadata.
Features
- Easy dataset creation with predefined templates
- Flexible dataset configuration for different use cases
- Dataset migration and versioning
- Workspace management
- Robust error handling and logging
Installation
From PyPI (Recommended)
pip install argilla-dataset-manager
From Source
- Clone the repository:
git clone https://github.com/jordanrburger/argilla_dataset_manager.git
cd argilla_dataset_manager
- Install in development mode:
pip install -e .
Configuration
Create a .env file in your project directory with your Argilla credentials:
ARGILLA_API_URL=your_argilla_api_url
ARGILLA_API_KEY=your_api_key
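Both variables need to be set before a client can be created. The following is a minimal sketch of reading and checking them with only the standard library. `parse_env` and `require_credentials` are hypothetical helper names used for illustration; they are not part of argilla-dataset-manager, the package may load the file differently, and the URL in the sample is a placeholder.

```python
from typing import Dict

REQUIRED_KEYS = ("ARGILLA_API_URL", "ARGILLA_API_KEY")


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_credentials(values: Dict[str, str]) -> Dict[str, str]:
    """Return the required settings, raising if any is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not values.get(key)]
    if missing:
        raise ValueError("Missing Argilla settings: " + ", ".join(missing))
    return {key: values[key] for key in REQUIRED_KEYS}


# Placeholder content standing in for the text of a real .env file.
sample = "ARGILLA_API_URL=https://argilla.example.com\nARGILLA_API_KEY=secret\n"
creds = require_credentials(parse_env(sample))
```

Failing early with the names of the missing variables tends to be easier to debug than a connection error raised later by the client.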
Quick Start
1. Create a Text Classification Dataset
import argilla as rg  # Argilla SDK, provides rg.Record

from argilla_dataset_manager import DatasetManager, get_argilla_client, SettingsManager

# Initialize the client and the two managers
client = get_argilla_client()
dataset_manager = DatasetManager(client)
settings_manager = SettingsManager()

# Create settings for text classification
settings = settings_manager.create_text_classification(
    labels=['positive', 'negative', 'neutral'],
    guidelines="Sentiment analysis dataset",
    include_metadata=True,
    metadata_fields=['source', 'confidence']
)

# Create dataset
dataset = dataset_manager.create_dataset(
    workspace="my_workspace",
    dataset="sentiment_analysis",
    settings=settings
)

# Add records
record = rg.Record(
    fields={
        "text": "This product is amazing!"
    },
    metadata={
        "source": "reviews",
        "confidence": 0.95
    }
)
dataset.records.log([record])
2. Create a Q&A Dataset
# Create settings for Q&A dataset
settings = settings_manager.create_qa_dataset(
    include_context=True,
    include_keywords=True,
    include_references=True,
    guidelines="Customer support Q&A dataset"
)

# Create dataset
dataset = dataset_manager.create_dataset(
    workspace="support_workspace",
    dataset="customer_qa",
    settings=settings
)

# Add a Q&A record
record = rg.Record(
    fields={
        "question": "How do I reset my password?",
        "answer": "Click on 'Forgot Password' and follow the instructions.",
        "context": "User authentication flow",
        "keywords": "password,reset,auth",
        "references": "docs/auth.md"
    },
    metadata={
        "source": "support_tickets",
        "date": "2023-12-01"
    }
)
dataset.records.log([record])
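In the record above, keywords and references are stored as plain strings, with keywords joined by commas. When the source data holds them as lists, a small helper can flatten them before the record is built. This is a sketch only: `build_qa_fields` is a hypothetical name, not part of the package, and it assumes that optional fields may be left out when empty.

```python
from typing import Dict, Iterable, Optional


def build_qa_fields(
    question: str,
    answer: str,
    context: Optional[str] = None,
    keywords: Iterable[str] = (),
    references: Iterable[str] = (),
) -> Dict[str, str]:
    """Build a fields dict shaped like the Q&A example record."""
    if not question.strip() or not answer.strip():
        raise ValueError("question and answer must be non-empty")
    fields = {"question": question.strip(), "answer": answer.strip()}
    if context:
        fields["context"] = context
    # Lists are flattened to comma-separated strings, as in the example record.
    cleaned_keywords = [k.strip() for k in keywords if k.strip()]
    if cleaned_keywords:
        fields["keywords"] = ",".join(cleaned_keywords)
    cleaned_references = [r.strip() for r in references if r.strip()]
    if cleaned_references:
        fields["references"] = ",".join(cleaned_references)
    return fields


fields = build_qa_fields(
    "How do I reset my password?",
    "Click on 'Forgot Password' and follow the instructions.",
    context="User authentication flow",
    keywords=["password", "reset", "auth"],
    references=["docs/auth.md"],
)
```

The resulting dict could then be passed as the `fields` argument of `rg.Record`.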
3. Dataset Migration and Versioning
# updated_settings is assumed to be a settings object built beforehand,
# for example with one of the settings_manager.create_* calls shown earlier

# Create a new version of an existing dataset with updated settings
new_version = dataset_manager.update_dataset_settings(
    workspace="my_workspace",
    dataset="customer_qa",
    new_settings=updated_settings,
    create_new_version=True
)

# Clone the dataset to a different workspace
cloned_dataset = dataset_manager.clone_dataset(
    workspace="development",
    dataset="customer_qa",
    new_name="customer_qa_prod",
    new_workspace="production"
)
Available Dataset Templates
The SettingsManager provides several predefined templates:
- Text Classification
  - Basic text classification with customizable labels
  - Optional metadata fields
- Q&A Datasets
  - Question and answer fields
  - Optional context, keywords, and references
  - Configurable metadata
- Text Generation
  - Prompt and response fields
  - Optional prompt templates
  - Model-specific metadata
- Text Summarization
  - Text and summary fields
  - Length and compression ratio tracking
  - Source tracking
- Custom Datasets
  - Create datasets with custom fields
  - Flexible metadata configuration
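The summarization template tracks length and compression ratio. As an illustration of what such metadata could look like, the sketch below counts words and defines the ratio as summary length divided by text length. Both choices, the metadata key names, and the `summary_metadata` helper are assumptions made for this example; the package may compute or name these values differently.

```python
from typing import Dict, Union


def summary_metadata(
    text: str, summary: str, source: str
) -> Dict[str, Union[int, float, str]]:
    """Compute word-count lengths and a compression ratio for one record."""
    text_length = len(text.split())
    summary_length = len(summary.split())
    if text_length == 0:
        raise ValueError("text must contain at least one word")
    return {
        "text_length": text_length,
        "summary_length": summary_length,
        # Summary length over source length; lower means stronger compression.
        "compression_ratio": round(summary_length / text_length, 3),
        "source": source,
    }


meta = summary_metadata(
    "The quick brown fox jumps over the lazy dog near the river bank",
    "A fox jumps over a dog",
    source="demo",
)
# 13 source words and 6 summary words give a ratio of 0.462
```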
Development
Setup Development Environment
- Clone the repository:
git clone https://github.com/jordanrburger/argilla_dataset_manager.git
cd argilla_dataset_manager
- Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install development dependencies:
pip install -e ".[dev]"
Running Tests
pytest tests/
Code Style
This project uses:
- Black for code formatting
- isort for import sorting
- mypy for type checking
To format code and run type checks:
black .
isort .
mypy .
Project Structure
argilla_dataset_manager/
├── __init__.py              # Package initialization
├── utils/
│   ├── argilla_client.py    # Argilla API interaction
│   ├── dataset_manager.py   # Dataset management
│   └── logger.py            # Logging configuration
└── datasets/
    └── settings_manager.py  # Dataset settings and templates
Error Handling
The library's error handling covers:
- Connection validation
- Workspace existence checks
- Dataset creation validation
- Record format validation
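To make the last item concrete, here is a rough sketch of what a record format check might involve: required fields must be non-empty strings, and metadata keys must be among those declared in the settings. `validate_record` is a hypothetical function written for this example and does not reflect the library's actual implementation.

```python
import logging
from typing import Any, Dict, Iterable, List

logger = logging.getLogger("record_validation")


def validate_record(
    fields: Dict[str, Any],
    metadata: Dict[str, Any],
    required_fields: Iterable[str],
    allowed_metadata: Iterable[str],
) -> List[str]:
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []
    for name in required_fields:
        value = fields.get(name)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"field '{name}' is missing or empty")
    allowed = set(allowed_metadata)
    for key in metadata:
        if key not in allowed:
            problems.append(f"metadata key '{key}' is not declared")
    # Log every problem so that failures are visible even if the caller
    # only checks whether the list is empty.
    for problem in problems:
        logger.warning(problem)
    return problems


errors = validate_record(
    fields={"text": "This product is amazing!"},
    metadata={"source": "reviews", "confidence": 0.95, "reviewer": "anna"},
    required_fields=["text"],
    allowed_metadata=["source", "confidence"],
)
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a record at once.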
Contributing
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
MIT License - see the LICENSE file for details