A tool for managing and uploading datasets to Argilla

Argilla Dataset Manager

A Python tool for creating, configuring, and uploading text datasets to Argilla, with predefined templates for common dataset types and configurable fields and metadata.

Features

  • Easy dataset creation with predefined templates
  • Flexible dataset configuration for different use cases
  • Dataset migration and versioning
  • Workspace management
  • Robust error handling and logging

Installation

From PyPI (Recommended)

pip install argilla-dataset-manager

From Source

  1. Clone the repository:
git clone https://github.com/jordanrburger/argilla_dataset_manager.git
cd argilla_dataset_manager
  2. Install in development mode:
pip install -e .

Configuration

Create a .env file in your project directory with your Argilla credentials:

ARGILLA_API_URL=your_argilla_api_url
ARGILLA_API_KEY=your_api_key
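
How the package itself reads these values is not shown here. As a minimal sketch, assuming the file contains plain KEY=VALUE lines, the following stdlib-only helpers parse the file contents and fail early if either credential is missing. The names parse_env and require_credentials are hypothetical and not part of argilla_dataset_manager.

```python
REQUIRED_KEYS = ("ARGILLA_API_URL", "ARGILLA_API_KEY")


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_credentials(values):
    """Return (url, key), or raise ValueError naming each missing setting."""
    missing = [k for k in REQUIRED_KEYS if not values.get(k)]
    if missing:
        raise ValueError("Missing Argilla settings: " + ", ".join(missing))
    return values["ARGILLA_API_URL"], values["ARGILLA_API_KEY"]
```

For example, require_credentials(parse_env(open(".env").read())) returns the URL and key as a tuple, or raises before any connection is attempted.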

Quick Start

1. Create a Text Classification Dataset

import argilla as rg  # provides rg.Record, used below
from argilla_dataset_manager import DatasetManager, SettingsManager, get_argilla_client

# Initialize
client = get_argilla_client()
dataset_manager = DatasetManager(client)
settings_manager = SettingsManager()

# Create settings for text classification
settings = settings_manager.create_text_classification(
    labels=['positive', 'negative', 'neutral'],
    guidelines="Sentiment analysis dataset",
    include_metadata=True,
    metadata_fields=['source', 'confidence']
)

# Create dataset
dataset = dataset_manager.create_dataset(
    workspace="my_workspace",
    dataset="sentiment_analysis",
    settings=settings
)

# Add records
record = rg.Record(
    fields={
        "text": "This product is amazing!"
    },
    metadata={
        "source": "reviews",
        "confidence": 0.95
    }
)
dataset.records.log([record])
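
When logging many records, it can help to split each input row into the fields and metadata parts used above before constructing the records. The following is a minimal sketch, assuming rows arrive as plain dictionaries. split_row is a hypothetical helper, not part of the package, and the metadata keys mirror the metadata_fields passed to create_text_classification.

```python
METADATA_FIELDS = ("source", "confidence")  # same names as metadata_fields above


def split_row(row, text_key="text", metadata_fields=METADATA_FIELDS):
    """Split one input row into the fields/metadata layout of a record."""
    if text_key not in row or not str(row[text_key]).strip():
        raise ValueError(f"row has no usable {text_key!r} value: {row!r}")
    fields = {"text": str(row[text_key]).strip()}
    # Keep only the declared metadata keys that are actually present.
    metadata = {k: row[k] for k in metadata_fields if row.get(k) is not None}
    return {"fields": fields, "metadata": metadata}


rows = [
    {"text": "This product is amazing!", "source": "reviews", "confidence": 0.95},
    {"text": "Arrived late.", "source": "reviews"},
]
payloads = [split_row(r) for r in rows]
```

Each payload can then be turned into a record with rg.Record(**payload) and passed to dataset.records.log(...) as a list, in the same way as the single record above.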

2. Create a Q&A Dataset

# Reuses rg, dataset_manager, and settings_manager from the first example.
# Create settings for Q&A dataset
settings = settings_manager.create_qa_dataset(
    include_context=True,
    include_keywords=True,
    include_references=True,
    guidelines="Customer support Q&A dataset"
)

# Create dataset
dataset = dataset_manager.create_dataset(
    workspace="support_workspace",
    dataset="customer_qa",
    settings=settings
)

# Add a Q&A record
record = rg.Record(
    fields={
        "question": "How do I reset my password?",
        "answer": "Click on 'Forgot Password' and follow the instructions.",
        "context": "User authentication flow",
        "keywords": "password,reset,auth",
        "references": "docs/auth.md"
    },
    metadata={
        "source": "support_tickets",
        "date": "2023-12-01"
    }
)
dataset.records.log([record])
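
In the record above, keywords are stored as a single comma-separated string and the optional fields are included only when available. A minimal sketch of a helper that builds such a fields dictionary is shown here. build_qa_fields is hypothetical and not part of the package, and lowercasing and deduplicating keywords is an assumption made for this illustration.

```python
def build_qa_fields(question, answer, context=None, keywords=None, references=None):
    """Build the fields dict for a Q&A record, omitting empty optional fields."""
    fields = {"question": question.strip(), "answer": answer.strip()}
    if context:
        fields["context"] = context.strip()
    if keywords:
        # Normalize, drop empties, and deduplicate while preserving order.
        cleaned = list(dict.fromkeys(k.strip().lower() for k in keywords if k.strip()))
        if cleaned:
            fields["keywords"] = ",".join(cleaned)
    if references:
        fields["references"] = references.strip()
    return fields


fields = build_qa_fields(
    "How do I reset my password?",
    "Click on 'Forgot Password' and follow the instructions.",
    context="User authentication flow",
    keywords=["password", "reset", "Password ", "auth"],
    references="docs/auth.md",
)
```

The resulting dictionary can be passed as the fields argument of rg.Record, as in the example above.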

3. Dataset Migration and Versioning

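# Create a new version of an existing dataset with updated settings.
# updated_settings is a settings object built with SettingsManager, as in the examples above.
new_version = dataset_manager.update_dataset_settings(
    workspace="support_workspace",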
    dataset="customer_qa",
    new_settings=updated_settings,
    create_new_version=True
)

# Clone dataset to different workspace
cloned_dataset = dataset_manager.clone_dataset(
    workspace="development",
    dataset="customer_qa",
    new_name="customer_qa_prod",
    new_workspace="production"
)
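
This page does not say how create_new_version=True names the new dataset. If you choose version names yourself, for example when passing new_name to clone_dataset, one possible convention is a numeric _vN suffix. The sketch below is purely illustrative: next_version_name and the suffix scheme are assumptions, not the package's behavior.

```python
import re

# Matches names such as "customer_qa_v3" and captures the base and the number.
_VERSION_SUFFIX = re.compile(r"^(?P<base>.+)_v(?P<num>\d+)$")


def next_version_name(name, existing_names):
    """Return the next unused '<base>_vN' name given the existing dataset names."""
    match = _VERSION_SUFFIX.match(name)
    base = match.group("base") if match else name
    # An unsuffixed dataset counts as version 1.
    highest = 1 if base in existing_names else 0
    for other in existing_names:
        m = _VERSION_SUFFIX.match(other)
        if m and m.group("base") == base:
            highest = max(highest, int(m.group("num")))
    return f"{base}_v{highest + 1}"
```

With this convention, an existing customer_qa dataset would be followed by customer_qa_v2, then customer_qa_v3, and so on.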

Available Dataset Templates

The SettingsManager provides several predefined templates:

  1. Text Classification

    • Basic text classification with customizable labels
    • Optional metadata fields
  2. Q&A Datasets

    • Question and answer fields
    • Optional context, keywords, and references
    • Configurable metadata
  3. Text Generation

    • Prompt and response fields
    • Optional prompt templates
    • Model-specific metadata
  4. Text Summarization

    • Text and summary fields
    • Length and compression ratio tracking
    • Source tracking
  5. Custom Datasets

    • Create datasets with custom fields
    • Flexible metadata configuration
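
The field layout of the fixed templates above can be summarized as a small lookup table, which is handy for checking input data before building records. This is a sketch of the data structure only. TemplateSpec, TEMPLATES, and check_fields are hypothetical names, and the lowercase field names are assumed from the list and from the Quick Start examples rather than taken from the SettingsManager source. Custom datasets are left out because their fields are user-defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateSpec:
    required_fields: tuple
    optional_fields: tuple = ()


# Assumed field names, derived from the template descriptions above.
TEMPLATES = {
    "text_classification": TemplateSpec(("text",)),
    "qa": TemplateSpec(("question", "answer"), ("context", "keywords", "references")),
    "text_generation": TemplateSpec(("prompt", "response")),
    "text_summarization": TemplateSpec(("text", "summary")),
}


def check_fields(template, fields):
    """Return (missing required fields, unexpected fields) for a record's fields."""
    spec = TEMPLATES[template]
    missing = [f for f in spec.required_fields if f not in fields]
    allowed = set(spec.required_fields) | set(spec.optional_fields)
    unexpected = sorted(set(fields) - allowed)
    return missing, unexpected
```

A record whose fields produce two empty lists matches the assumed layout for that template.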

Development

Setup Development Environment

  1. Clone the repository:
git clone https://github.com/jordanrburger/argilla_dataset_manager.git
cd argilla_dataset_manager
  2. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install development dependencies:
pip install -e ".[dev]"

Running Tests

pytest tests/

Code Style

This project uses:

  • Black for code formatting
  • isort for import sorting
  • mypy for type checking

To format and type-check the code:

black .
isort .
mypy .

Project Structure

argilla_dataset_manager/
├── __init__.py            # Package initialization
├── utils/
│   ├── argilla_client.py  # Argilla API interaction
│   ├── dataset_manager.py # Dataset management
│   └── logger.py          # Logging configuration
└── datasets/
    └── settings_manager.py # Dataset settings and templates

Error Handling

The library checks for errors at several points:

  • Connection validation
  • Workspace existence checks
  • Dataset creation validation
  • Record format validation
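
The exception types raised by these checks are not documented on this page. As a caller-side sketch under that assumption, the wrapper below logs the outcome of each step and re-raises any failure unchanged. run_step and the logger name are hypothetical and not part of the package.

```python
import logging

logger = logging.getLogger("dataset_upload")  # hypothetical logger name


def run_step(description, func, *args, **kwargs):
    """Run one step, log success or failure, and re-raise any exception."""
    try:
        result = func(*args, **kwargs)
    except Exception:
        # logger.exception records the traceback alongside the message.
        logger.exception("%s failed", description)
        raise
    logger.info("%s succeeded", description)
    return result
```

For example, run_step("create dataset", dataset_manager.create_dataset, workspace="my_workspace", dataset="sentiment_analysis", settings=settings) would log the step and return the dataset object produced by create_dataset.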

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

MIT License - see the LICENSE file for details
