Distilabel Steps Library

Note: This README was automatically generated by Claude-3.5-Sonnet. If you spot any errors or confusing sections, please file an issue or submit a PR!

A collection of utility steps for processing and manipulating data in distilabel pipelines.

Installation

pip install distilabel-steps-library

Available Steps

Chat Processing Steps

FormatPlaintextChatTranscript

Formats chat messages into a plaintext transcript in which each message appears on its own line as "<role>: <content>".

Input Columns:

  • messages (List[Dict[str, str]]): List of message dictionaries with 'role' and 'content' keys

Output Columns:

  • transcript (str): Plaintext representation of the chat messages

Example:

from distilabel_steps_library.chat import FormatPlaintextChatTranscript

format_transcript = FormatPlaintextChatTranscript()
result = next(
    format_transcript.process([{
        "messages": [
            {"role": "user", "content": "What's 2+2?"},
            {"role": "assistant", "content": "4"}
        ]
    }])
)
# Result includes: "transcript": "user: What's 2+2?\nassistant: 4"
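
To make the transcript format concrete, here is a minimal plain-Python sketch of the transformation described above. It does not use distilabel and is not the library's implementation; the function name format_plaintext_transcript is hypothetical and exists only for illustration.

```python
from typing import Dict, List


def format_plaintext_transcript(messages: List[Dict[str, str]]) -> str:
    """Render each message as 'role: content' and join them with newlines.

    Hypothetical helper: it mirrors the behaviour described for
    FormatPlaintextChatTranscript, not the step's actual code.
    """
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


transcript = format_plaintext_transcript(
    [
        {"role": "user", "content": "What's 2+2?"},
        {"role": "assistant", "content": "4"},
    ]
)
print(transcript)
# user: What's 2+2?
# assistant: 4
```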

FlipMessageRoles

Swaps the 'user' and 'assistant' roles in chat messages while leaving system messages unchanged.

Input Columns:

  • messages (List[Dict[str, str]]): List of message dictionaries

Output Columns:

  • flipped_messages (List[Dict[str, str]]): Messages with swapped roles

Example:

from distilabel_steps_library.chat import FlipMessageRoles

flip_roles = FlipMessageRoles()
result = next(
    flip_roles.process([{
        "messages": [
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Hi"}
        ]
    }])
)
# Result includes flipped roles: user->assistant, assistant->user
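
The role swap can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the step's real code: the function name flip_message_roles is hypothetical, and the sketch assumes that any role other than 'user' or 'assistant' (such as 'system') passes through untouched.

```python
from typing import Dict, List

# Mapping for the two roles that get swapped; every other role is kept as-is.
ROLE_SWAP = {"user": "assistant", "assistant": "user"}


def flip_message_roles(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return new message dicts with 'user' and 'assistant' exchanged.

    Hypothetical helper that mirrors the description of FlipMessageRoles.
    The input list and its dicts are not mutated.
    """
    return [{**m, "role": ROLE_SWAP.get(m["role"], m["role"])} for m in messages]


flipped = flip_message_roles(
    [
        {"role": "system", "content": "Be helpful"},
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi"},
    ]
)
print([m["role"] for m in flipped])
# ['system', 'assistant', 'user']
```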

InsertMessage

Inserts a new message into the chat messages at a specified index.

Input Columns:

  • messages (List[Dict[str, str]]): List of message dictionaries
  • content (str): Content for the message to be inserted

Output Columns:

  • messages (List[Dict[str, str]]): Modified list with the new message

Example:

from distilabel_steps_library.chat import InsertMessage

insert = InsertMessage(index=0, role="system")
result = next(
    insert.process([{
        "messages": [
            {"role": "user", "content": "Hi"}
        ],
        "content": "Be helpful"
    }])
)
# Inserts system message at the beginning
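
As a rough illustration of the insertion, the sketch below reproduces the described behaviour in plain Python. The function name insert_message is hypothetical; the index and role parameters are taken from the example above, and the sketch assumes the index follows the semantics of Python's list.insert, which the source does not confirm.

```python
from typing import Dict, List


def insert_message(
    messages: List[Dict[str, str]],
    content: str,
    index: int = 0,
    role: str = "system",
) -> List[Dict[str, str]]:
    """Return a copy of `messages` with a new message inserted at `index`.

    Hypothetical helper that mirrors the description of InsertMessage.
    Assumes list.insert semantics for `index`.
    """
    updated = list(messages)  # shallow copy so the caller's list is untouched
    updated.insert(index, {"role": role, "content": content})
    return updated


with_system = insert_message(
    [{"role": "user", "content": "Hi"}],
    content="Be helpful",
    index=0,
    role="system",
)
print(with_system)
# [{'role': 'system', 'content': 'Be helpful'}, {'role': 'user', 'content': 'Hi'}]
```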

Data Cleaning Steps

DropEmpty

Filters out rows containing empty values in specified columns.

Input Columns:

  • Any columns specified in the columns parameter (or all columns if none specified)

Output Columns:

  • All input columns (for non-empty rows)

Example:

from distilabel_steps_library import DropEmpty

# Drop rows with empty values in specific columns
drop_step = DropEmpty(columns=["instruction", "response"])
result = next(
    drop_step.process([
        {"instruction": "Task", "response": ""},  # Will be dropped
        {"instruction": "Task 2", "response": "Answer"}  # Will be kept
    ])
)
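
The filtering logic can be sketched in plain Python as shown below. This is only an approximation: the README does not define what counts as "empty", so the sketch assumes None, empty strings, and empty collections qualify, and it treats a missing column as empty. The names is_empty and drop_empty are hypothetical.

```python
from typing import Any, Dict, List, Optional


def is_empty(value: Any) -> bool:
    """Assumed definition of 'empty': None, '' or an empty collection."""
    if value is None:
        return True
    if isinstance(value, (str, list, tuple, dict, set)):
        return len(value) == 0
    return False


def drop_empty(
    rows: List[Dict[str, Any]], columns: Optional[List[str]] = None
) -> List[Dict[str, Any]]:
    """Keep only rows whose checked columns are all non-empty.

    Hypothetical helper that mirrors the description of DropEmpty.
    When `columns` is None, every column of the row is checked.
    """
    kept = []
    for row in rows:
        to_check = columns if columns is not None else list(row.keys())
        # row.get returns None for a missing column, which counts as empty here.
        if not any(is_empty(row.get(col)) for col in to_check):
            kept.append(row)
    return kept


kept_rows = drop_empty(
    [
        {"instruction": "Task", "response": ""},
        {"instruction": "Task 2", "response": "Answer"},
    ],
    columns=["instruction", "response"],
)
print(kept_rows)
# [{'instruction': 'Task 2', 'response': 'Answer'}]
```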

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

WTFPL.
