A bunch of steps for distilabel.

Distilabel Steps Library

Note: This README was automatically generated by Claude-3.5-Sonnet. If you spot any errors or confusing sections, please file an issue or submit a PR!

A collection of utility steps for processing and manipulating data in distilabel pipelines.

Installation

pip install distilabel-steps-library

Available Steps

Chat Processing Steps

FormatPlaintextChatTranscript

Formats chat messages into a plaintext transcript in which each message is rendered as "role: content" on its own line.

Input Columns:

  • messages (List[Dict[str, str]]): List of message dictionaries with 'role' and 'content' keys

Output Columns:

  • transcript (str): Plaintext representation of the chat messages

Example:

from distilabel_steps_library.chat import FormatPlaintextChatTranscript

format_transcript = FormatPlaintextChatTranscript()
result = next(
    format_transcript.process([{
        "messages": [
            {"role": "user", "content": "What's 2+2?"},
            {"role": "assistant", "content": "4"}
        ]
    }])
)
# Result includes: 'transcript': "user: What's 2+2?\nassistant: 4"
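To make the transcript format concrete, here is a minimal standalone sketch of the transformation in plain Python. It is not the library's implementation: the function name `to_plaintext_transcript` is hypothetical, and the ": " separator, newline joining, and lack of a trailing newline are assumptions inferred from the example output above.

```python
from typing import Dict, List


def to_plaintext_transcript(messages: List[Dict[str, str]]) -> str:
    """Render each message as 'role: content', one message per line."""
    # Assumption: ": " separates role and content, "\n" separates messages,
    # and there is no trailing newline.
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


messages = [
    {"role": "user", "content": "What's 2+2?"},
    {"role": "assistant", "content": "4"},
]
transcript = to_plaintext_transcript(messages)
print(transcript)
```

Under these assumptions, an empty message list produces an empty string.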

FlipMessageRoles

Flips the roles in chat messages between 'user' and 'assistant' while preserving system messages.

Input Columns:

  • messages (List[Dict[str, str]]): List of message dictionaries

Output Columns:

  • flipped_messages (List[Dict[str, str]]): Messages with swapped roles

Example:

from distilabel_steps_library.chat import FlipMessageRoles

flip_roles = FlipMessageRoles()
result = next(
    flip_roles.process([{
        "messages": [
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Hi"}
        ]
    }])
)
# Result includes flipped roles: user->assistant, assistant->user
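The role swap can be illustrated with a small standalone sketch in plain Python. The helper `flip_roles` and the `ROLE_SWAP` table are hypothetical names, not part of the library, and the sketch assumes that any role other than 'user' or 'assistant' (such as 'system') passes through unchanged.

```python
from typing import Dict, List

# Hypothetical lookup table: roles not listed here (e.g. "system") stay as they are.
ROLE_SWAP = {"user": "assistant", "assistant": "user"}


def flip_roles(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return new message dicts with user/assistant roles swapped."""
    # Build copies so the input list and its dicts are not mutated.
    return [{**m, "role": ROLE_SWAP.get(m["role"], m["role"])} for m in messages]


original = [
    {"role": "system", "content": "Be helpful"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi"},
]
flipped = flip_roles(original)
print([m["role"] for m in flipped])
```

Returning copies rather than editing in place keeps the original messages column intact, which matches the step writing to a separate flipped_messages column.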

InsertMessage

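Inserts a new message into the chat messages at a specified index. As the example below shows, the index and role are passed to the constructor, while the message text is read from the content column.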

Input Columns:

  • messages (List[Dict[str, str]]): List of message dictionaries
  • content (str): Content for the message to be inserted

Output Columns:

  • messages (List[Dict[str, str]]): Modified list with the new message

Example:

from distilabel_steps_library.chat import InsertMessage

insert = InsertMessage(index=0, role="system")
result = next(
    insert.process([{
        "messages": [
            {"role": "user", "content": "Hi"}
        ],
        "content": "Be helpful"
    }])
)
# Inserts system message at the beginning
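As a rough illustration of the insertion logic, the following standalone sketch does the same thing in plain Python. The function `insert_message` is a hypothetical helper, and the sketch assumes the index follows the semantics of Python's `list.insert` (so 0 prepends and an index past the end appends); the library may treat indices differently.

```python
from typing import Dict, List


def insert_message(
    messages: List[Dict[str, str]], content: str, index: int, role: str
) -> List[Dict[str, str]]:
    """Return a copy of `messages` with a new message inserted at `index`."""
    updated = list(messages)  # shallow copy so the input list is left untouched
    # Assumption: index semantics follow list.insert.
    updated.insert(index, {"role": role, "content": content})
    return updated


chat = [{"role": "user", "content": "Hi"}]
with_system = insert_message(chat, content="Be helpful", index=0, role="system")
print(with_system)
```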

Data Cleaning Steps

DropEmpty

Filters out rows containing empty values in specified columns.

Input Columns:

  • Any columns specified in the columns parameter (or all columns if none specified)

Output Columns:

  • All input columns (for non-empty rows)

Example:

from distilabel_steps_library import DropEmpty

# Drop rows with empty values in specific columns
drop_step = DropEmpty(columns=["instruction", "response"])
result = next(
    drop_step.process([
        {"instruction": "Task", "response": ""},  # Will be dropped
        {"instruction": "Task 2", "response": "Answer"}  # Will be kept
    ])
)
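The filtering rule can be sketched in plain Python as follows. This is a standalone illustration, not the library's code: the helpers `is_empty` and `drop_empty` are hypothetical, and the README does not define what counts as "empty", so the sketch assumes it means None, an empty string, or an empty container, and that a missing column is also treated as empty.

```python
from typing import Any, Dict, List, Optional


def is_empty(value: Any) -> bool:
    """Assumed definition of 'empty': None, '' or an empty container."""
    if value is None:
        return True
    return isinstance(value, (str, list, tuple, dict, set)) and len(value) == 0


def drop_empty(
    rows: List[Dict[str, Any]], columns: Optional[List[str]] = None
) -> List[Dict[str, Any]]:
    """Keep only rows whose checked columns are all non-empty."""
    kept = []
    for row in rows:
        # With no columns given, check every column present in the row.
        cols = columns if columns is not None else list(row.keys())
        # row.get returns None for a missing column, which counts as empty here.
        if not any(is_empty(row.get(c)) for c in cols):
            kept.append(row)
    return kept


rows = [
    {"instruction": "Task", "response": ""},
    {"instruction": "Task 2", "response": "Answer"},
]
kept_rows = drop_empty(rows, columns=["instruction", "response"])
print(kept_rows)
```

Note that under this assumed definition, values such as 0 or False are kept, since they are not empty strings or containers.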

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

WTFPL.


