A tool for managing and uploading datasets to Argilla

Argilla Dataset Manager

A Python tool for managing and uploading datasets to Argilla. It targets text datasets (classification, Q&A, generation, and summarization) and provides configurable templates for each.

Features

  • Easy dataset creation with predefined templates
  • Flexible dataset configuration for different use cases
  • Dataset migration and versioning
  • Workspace management
  • Robust error handling and logging

Installation

From PyPI (Recommended)

pip install argilla-dataset-manager

From Source

  1. Clone the repository:
git clone https://github.com/jordanrburger/argilla_dataset_manager.git
cd argilla_dataset_manager
  2. Install in development mode:
pip install -e .

Configuration

Create a .env file in your project directory with your Argilla credentials:

ARGILLA_API_URL=your_argilla_api_url
ARGILLA_API_KEY=your_api_key
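
Before creating a client, it can help to confirm that both variables are actually set. The following is a minimal sketch using only the standard library; it is not part of the package, and the helper names parse_env_file and missing_credentials are hypothetical. It assumes the .env file uses plain KEY=VALUE lines as shown above.

```python
import os
from pathlib import Path

# The two variable names come from the configuration example above.
REQUIRED_KEYS = ("ARGILLA_API_URL", "ARGILLA_API_KEY")


def parse_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_credentials(values):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    values = parse_env_file(env_path) if env_path.exists() else {}
    # Variables already set in the process environment override the file.
    values.update({k: os.environ[k] for k in REQUIRED_KEYS if k in os.environ})
    missing = missing_credentials(values)
    print("Missing:", missing if missing else "none")
```

Running this in the project directory reports which of the two variables, if any, still need a value.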

Quick Start

1. Create a Text Classification Dataset

import argilla as rg  # provides rg.Record, used below
from argilla_dataset_manager import DatasetManager, get_argilla_client, SettingsManager

# Initialize
client = get_argilla_client()
dataset_manager = DatasetManager(client)
settings_manager = SettingsManager()

# Create settings for text classification
settings = settings_manager.create_text_classification(
    labels=['positive', 'negative', 'neutral'],
    guidelines="Sentiment analysis dataset",
    include_metadata=True,
    metadata_fields=['source', 'confidence']
)

# Create dataset
dataset = dataset_manager.create_dataset(
    workspace="my_workspace",
    dataset="sentiment_analysis",
    settings=settings
)

# Add records
record = rg.Record(
    fields={
        "text": "This product is amazing!"
    },
    metadata={
        "source": "reviews",
        "confidence": 0.95
    }
)
dataset.records.log([record])

2. Create a Q&A Dataset

# Create settings for Q&A dataset
# (reuses rg, settings_manager, and dataset_manager from the first example)
settings = settings_manager.create_qa_dataset(
    include_context=True,
    include_keywords=True,
    include_references=True,
    guidelines="Customer support Q&A dataset"
)

# Create dataset
dataset = dataset_manager.create_dataset(
    workspace="support_workspace",
    dataset="customer_qa",
    settings=settings
)

# Add a Q&A record
record = rg.Record(
    fields={
        "question": "How do I reset my password?",
        "answer": "Click on 'Forgot Password' and follow the instructions.",
        "context": "User authentication flow",
        "keywords": "password,reset,auth",
        "references": "docs/auth.md"
    },
    metadata={
        "source": "support_tickets",
        "date": "2023-12-01"
    }
)
dataset.records.log([record])
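
The record above stores keywords as a single comma-separated string. When source data holds keywords as a list, a small helper can build the fields dictionary in that shape. This is a sketch only: build_qa_fields is a hypothetical helper, not part of the package, and joining multiple references with commas is an assumption that mirrors the keywords format.

```python
def build_qa_fields(question, answer, context=None, keywords=None, references=None):
    """Build a fields dict shaped like the Q&A record example.

    Optional fields are omitted when no value is given. Lists are joined
    into comma-separated strings (an assumption based on the keywords
    value in the example record).
    """
    fields = {"question": question.strip(), "answer": answer.strip()}
    if context:
        fields["context"] = context.strip()
    if keywords:
        fields["keywords"] = ",".join(k.strip() for k in keywords if k.strip())
    if references:
        fields["references"] = ",".join(r.strip() for r in references if r.strip())
    return fields


fields = build_qa_fields(
    question="How do I reset my password?",
    answer="Click on 'Forgot Password' and follow the instructions.",
    context="User authentication flow",
    keywords=["password", "reset", "auth"],
    references=["docs/auth.md"],
)
print(fields["keywords"])
```

The resulting dictionary can be passed as the fields argument of rg.Record in the example above.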

3. Dataset Migration and Versioning

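# Build the updated settings, here with revised guidelines
updated_settings = settings_manager.create_qa_dataset(
    include_context=True, include_keywords=True, include_references=True,
    guidelines="Customer support Q&A dataset (revised)"
)

# Create a new version of the existing dataset with the updated settings
new_version = dataset_manager.update_dataset_settings(
    workspace="support_workspace",
    dataset="customer_qa",
    new_settings=updated_settings,
    create_new_version=True
)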

# Clone dataset to different workspace
cloned_dataset = dataset_manager.clone_dataset(
    workspace="development",
    dataset="customer_qa",
    new_name="customer_qa_prod",
    new_workspace="production"
)

Available Dataset Templates

The SettingsManager provides several predefined templates:

  1. Text Classification
    • Basic text classification with customizable labels
    • Optional metadata fields
  2. Q&A Datasets
    • Question and answer fields
    • Optional context, keywords, and references
    • Configurable metadata
  3. Text Generation
    • Prompt and response fields
    • Optional prompt templates
    • Model-specific metadata
  4. Text Summarization
    • Text and summary fields
    • Length and compression ratio tracking
    • Source tracking
  5. Custom Datasets
    • Create datasets with custom fields
    • Flexible metadata configuration
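
The summarization template mentions length and compression ratio tracking. The sketch below shows one way such values could be computed before being attached as record metadata. It is illustrative only: summary_stats is a hypothetical helper, and the definition used here (summary word count divided by source word count) is an assumption, since the package's own definition is not documented in this README.

```python
def summary_stats(text, summary):
    """Compute word-based length metadata for a text/summary pair.

    Assumption: compression ratio = summary words / source words,
    so a lower value means a more aggressive summary.
    """
    text_words = len(text.split())
    summary_words = len(summary.split())
    if text_words == 0:
        raise ValueError("text must contain at least one word")
    return {
        "text_length": text_words,
        "summary_length": summary_words,
        "compression_ratio": round(summary_words / text_words, 3),
    }


stats = summary_stats(
    text="The quarterly report shows revenue grew while costs stayed flat",
    summary="Revenue grew, costs flat",
)
print(stats)
```

Word counts are used for simplicity; a character or token count would work the same way.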

Development

Setup Development Environment

  1. Clone the repository:
git clone https://github.com/jordanrburger/argilla_dataset_manager.git
cd argilla_dataset_manager
  2. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install development dependencies:
pip install -e ".[dev]"

Running Tests

pytest tests/

Code Style

This project uses:

  • Black for code formatting
  • isort for import sorting
  • mypy for type checking

To format and type-check the code:

black .
isort .
mypy .

Project Structure

argilla_dataset_manager/
├── __init__.py            # Package initialization
├── utils/
│   ├── argilla_client.py  # Argilla API interaction
│   ├── dataset_manager.py # Dataset management
│   └── logger.py          # Logging configuration
└── datasets/
    └── settings_manager.py # Dataset settings and templates

Error Handling

The library's error handling covers:

  • Connection validation
  • Workspace existence checks
  • Dataset creation validation
  • Record format validation
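
As an illustration of what record format validation can look like on the caller's side, the sketch below checks field dictionaries before they are turned into records, logging and skipping any that fail. It does not reproduce the library's internal checks; validate_fields and filter_valid are hypothetical names, and the rule that every field must be a non-empty string is an assumption.

```python
import logging

logger = logging.getLogger("record_validation")


def validate_fields(fields, required):
    """Raise TypeError or ValueError if a fields dict is malformed."""
    if not isinstance(fields, dict):
        raise TypeError("fields must be a dict")
    missing = [name for name in required if name not in fields]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    # Assumption: every field value must be a non-empty string.
    bad = [
        name
        for name, value in fields.items()
        if not isinstance(value, str) or not value.strip()
    ]
    if bad:
        raise ValueError(f"fields must be non-empty strings: {', '.join(bad)}")


def filter_valid(records, required):
    """Return only the well-formed field dicts, logging the rest."""
    valid = []
    for index, fields in enumerate(records):
        try:
            validate_fields(fields, required)
        except (TypeError, ValueError) as exc:
            logger.warning("Skipping record %d: %s", index, exc)
        else:
            valid.append(fields)
    return valid


candidates = [{"text": "This product is amazing!"}, {"text": ""}, {"label": "positive"}]
print(filter_valid(candidates, required=["text"]))
```

Filtering up front keeps one malformed row from interrupting a larger upload.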

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

MIT License - see the LICENSE file for details
