A bunch of steps for distilabel.
Distilabel Steps Library
Note: This README was automatically generated by Claude-3.5-Sonnet. If you spot any errors or confusing sections, please file an issue or submit a PR!
A collection of utility steps for processing and manipulating data in distilabel pipelines.
Installation
pip install distilabel-steps-library
Available Steps
Chat Processing Steps
FormatPlaintextChatTranscript
Formats chat messages into a plaintext transcript in which each message appears as "role: content" on its own line.
Input Columns:
- messages (List[Dict[str, str]]): List of message dictionaries with 'role' and 'content' keys
Output Columns:
- transcript (str): Plaintext representation of the chat messages
Example:
from distilabel_steps_library.chat import FormatPlaintextChatTranscript

format_transcript = FormatPlaintextChatTranscript()
result = next(
    format_transcript.process([{
        "messages": [
            {"role": "user", "content": "What's 2+2?"},
            {"role": "assistant", "content": "4"},
        ]
    }])
)
# Result includes: "transcript": "user: What's 2+2?\nassistant: 4"
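To make the transcript format concrete without installing the package, here is a minimal plain-Python sketch of the same transformation. The function name format_plaintext_transcript is hypothetical and this is not the library's implementation; it only assumes that every message has 'role' and 'content' keys, as described above.

```python
from typing import Dict, List


def format_plaintext_transcript(messages: List[Dict[str, str]]) -> str:
    """Hypothetical helper: join messages as "role: content" lines."""
    # One line per message, separated by newlines, no trailing newline.
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


transcript = format_plaintext_transcript([
    {"role": "user", "content": "What's 2+2?"},
    {"role": "assistant", "content": "4"},
])
print(transcript)
```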
FlipMessageRoles
Swaps the 'user' and 'assistant' roles in chat messages, leaving system messages unchanged.
Input Columns:
- messages (List[Dict[str, str]]): List of message dictionaries
Output Columns:
- flipped_messages (List[Dict[str, str]]): Messages with swapped roles
Example:
from distilabel_steps_library.chat import FlipMessageRoles

flip_roles = FlipMessageRoles()
result = next(
    flip_roles.process([{
        "messages": [
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Hi"},
        ]
    }])
)
# Result includes flipped roles: user->assistant, assistant->user
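The role swap can be illustrated with a short plain-Python sketch. The helper flip_message_roles and the ROLE_SWAP table are hypothetical names, not part of the library. The sketch assumes that any role other than 'user' or 'assistant' (for example 'system') passes through unchanged, which matches the description above but may not cover every case the real step handles.

```python
from typing import Dict, List

# Hypothetical lookup table: only these two roles are swapped.
ROLE_SWAP = {"user": "assistant", "assistant": "user"}


def flip_message_roles(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return new message dicts with user/assistant swapped; other roles are kept."""
    return [
        {**m, "role": ROLE_SWAP.get(m["role"], m["role"])}
        for m in messages
    ]


flipped = flip_message_roles([
    {"role": "system", "content": "Be helpful"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi"},
])
print([m["role"] for m in flipped])
```

The sketch builds new dictionaries rather than mutating the input, so the original messages list stays intact.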
InsertMessage
Inserts a new message into the chat messages at a specified index.
Input Columns:
- messages (List[Dict[str, str]]): List of message dictionaries
- content (str): Content for the message to be inserted
Output Columns:
- messages (List[Dict[str, str]]): Modified list with the new message
Example:
from distilabel_steps_library.chat import InsertMessage

insert = InsertMessage(index=0, role="system")
result = next(
    insert.process([{
        "messages": [
            {"role": "user", "content": "Hi"},
        ],
        "content": "Be helpful",
    }])
)
# Inserts system message at the beginning
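As a rough illustration of what the insertion does to a single row, the following plain-Python sketch mirrors the index and role parameters from the example above. The function insert_message is a hypothetical stand-in, not the library's code, and it assumes insertion follows Python's list.insert semantics.

```python
from typing import Dict, List


def insert_message(
    messages: List[Dict[str, str]],
    content: str,
    index: int = 0,
    role: str = "system",
) -> List[Dict[str, str]]:
    """Hypothetical helper: return a copy of messages with one message inserted."""
    updated = list(messages)  # shallow copy so the input list is not mutated
    updated.insert(index, {"role": role, "content": content})
    return updated


row = {"messages": [{"role": "user", "content": "Hi"}], "content": "Be helpful"}
new_messages = insert_message(row["messages"], row["content"], index=0, role="system")
print(new_messages)
```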
Data Cleaning Steps
DropEmpty
Filters out rows containing empty values in specified columns.
Input Columns:
- Any columns specified in the columns parameter (or all columns if none are specified)
Output Columns:
- All input columns (for non-empty rows)
Example:
from distilabel_steps_library import DropEmpty

# Drop rows with empty values in specific columns
drop_step = DropEmpty(columns=["instruction", "response"])
result = next(
    drop_step.process([
        {"instruction": "Task", "response": ""},  # Will be dropped
        {"instruction": "Task 2", "response": "Answer"},  # Will be kept
    ])
)
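The filtering rule can be sketched in plain Python as follows. The names is_empty and drop_empty are hypothetical, and the definition of "empty" used here (None, or a zero-length string, list, tuple, or dict, with a missing column also counted as empty) is an assumption for illustration; the library's own check may differ.

```python
from typing import Any, Dict, List, Optional


def is_empty(value: Any) -> bool:
    """Assumed definition of empty: None or a zero-length str/list/tuple/dict."""
    return value is None or (
        isinstance(value, (str, list, tuple, dict)) and len(value) == 0
    )


def drop_empty(
    rows: List[Dict[str, Any]],
    columns: Optional[List[str]] = None,
) -> List[Dict[str, Any]]:
    """Keep only rows whose checked columns are all non-empty."""
    kept = []
    for row in rows:
        # With no columns given, check every column present in the row.
        to_check = columns if columns is not None else list(row.keys())
        if not any(is_empty(row.get(col)) for col in to_check):
            kept.append(row)
    return kept


rows = [
    {"instruction": "Task", "response": ""},
    {"instruction": "Task 2", "response": "Answer"},
]
kept_rows = drop_empty(rows, columns=["instruction", "response"])
print(kept_rows)
```

Note that values such as 0 or False are not treated as empty in this sketch, since they are usually meaningful data.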
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
WTFPL.