Instructify 📝 for easy Fine-Tuning preparation

License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Instructify 📝

Instructify is a Python library that converts CSV files into Hugging Face Datasets formatted for fine-tuning large language models (LLMs). Inspired by the instruction-based dataset approach described in OpenAI's InstructGPT paper (arXiv:2203.02155), this package prepares your data for instruction-based tasks using a chat-like format.

Features ✨

CSV to Hugging Face Dataset: Convert CSV files into Hugging Face Dataset objects ready for model fine-tuning.
Customizable Message Formatting: Supports user, assistant, and system messages with flexible column names.
Tokenizer Integration: Automatically integrates with a pre-trained tokenizer to format messages.
Easy Fine-Tuning Preparation: Prepares data for instruction tuning, similar to the InstructGPT format.

Installation 📦

pip install instructify

Usage 🚀

import pandas as pd
from instructify import to_train_dataset

# Example custom template
custom_template = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Example data
data = {
    "input": ["When was the Library of Alexandria burned down?", "What is the capital of France?"],
    "output": ["I-I think that was in 48 BC, b-but I'm not sure.", "The capital of France is Paris."],
    "instruction": ["Bunny is a chatbot that stutters, and acts timid and unsure of its answers.", None]
}

# Convert data to CSV
df = pd.DataFrame(data)
df.to_csv("data.csv", index=False)

# Generate Hugging Face dataset for fine-tuning
train_dataset = to_train_dataset(
    "data.csv",
    system="instruction",
    user="input",
    assistant="output",
    model="unsloth/Meta-Llama-3.1-8B-Instruct",
    custom_template=custom_template,
)

# Inspect the formatted dataset
print(train_dataset["text"])

Output Example 📄

The function converts the rows of a CSV file, such as the one below, into a structured template ready for fine-tuning:

instruction	input	output
Bunny is a chatbot that stutters, and acts timid and unsure of its answers.	When was the Library of Alexandria burned down?	I-I think that was in 48 BC, b-but I'm not sure.
None	What is the capital of France?	The capital of France is Paris.

By default, train_dataset["text"] contains the following instruction-style format:

[
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nBunny is a chatbot that stutters, and acts timid and unsure of its answers.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhen was the Library of Alexandria burned down?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nI-I think that was in 48 BC, b-but I'm not sure.<|eot_id|>",
    
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe capital of France is Paris.<|eot_id|>"
]

If a custom template is used, train_dataset["text"] contains the following format instead:

[
    "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nBunny is a chatbot that stutters, and acts timid and unsure of its answers.\n\n### Input:\nWhen was the Library of Alexandria burned down?\n\n### Response:\nI-I think that was in 48 BC, b-but I'm not sure.<|eot_id|>",
    
    "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n\n\n### Input:\nWhat is the capital of France?\n\n### Response:\nThe capital of France is Paris.<|eot_id|>"
]
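
The custom-template output above is consistent with plain positional formatting: the three `{}` slots are filled in the order instruction, input, response, a missing instruction becomes an empty string, and an end-of-sequence token is appended. The sketch below reproduces that behavior under those assumptions. `fill_template` and the hard-coded `EOS_TOKEN` are illustrative names, not part of Instructify's API.

```python
# Hypothetical helper that mimics the custom-template output shown above.
# Assumption: the end-of-sequence token is "<|eot_id|>"; a real pipeline
# would normally read it from the tokenizer instead of hard-coding it.
EOS_TOKEN = "<|eot_id|>"

TEMPLATE = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


def fill_template(template, instruction, user_input, response, eos_token=EOS_TOKEN):
    """Fill the three positional slots and append the end-of-sequence token."""
    # A missing value (None) is rendered as an empty string.
    fields = ["" if value is None else str(value)
              for value in (instruction, user_input, response)]
    return template.format(*fields) + eos_token


text = fill_template(
    TEMPLATE,
    None,
    "What is the capital of France?",
    "The capital of France is Paris.",
)
print(text)
```

Appending the end-of-sequence token matters for fine-tuning because it marks where a response ends in each training example.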

Functionality Overview 🔍

`to_train_dataset`

This function is the core of the library, enabling CSV-to-dataset conversion for LLM fine-tuning.

Parameters:

csv_path: Path to the input CSV file.
system (optional): Column name for system messages (e.g., instructions for the model).
user: Column name for user messages (default: 'user').
assistant: Column name for assistant messages (default: 'assistant').
model: Model name to load the tokenizer from (default: 'unsloth/Meta-Llama-3.1-8B-Instruct').
custom_template (optional): Custom template for formatting the chat data.

Returns:

Dataset: A Hugging Face Dataset, ready for LLM fine-tuning.
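
To make the column parameters concrete, the sketch below shows one way CSV rows could be mapped to chat messages with system, user, and assistant roles. It uses only the standard library. `rows_to_messages` is a hypothetical helper, and skipping empty system cells is an assumption, not confirmed behavior of `to_train_dataset`.

```python
import csv
import io


def rows_to_messages(csv_text, user="user", assistant="assistant", system=None):
    """Map each CSV row to a list of {"role", "content"} chat messages."""
    conversations = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        messages = []
        # Assumption: an empty system cell means "no system message" for that row.
        if system is not None and row.get(system):
            messages.append({"role": "system", "content": row[system]})
        messages.append({"role": "user", "content": row[user]})
        messages.append({"role": "assistant", "content": row[assistant]})
        conversations.append(messages)
    return conversations


# Small sample in the same column layout as the usage example (shortened text).
SAMPLE_CSV = """input,output,instruction
What is the capital of France?,The capital of France is Paris.,
When was the Library of Alexandria burned down?,"I-I think that was in 48 BC, b-but I'm not sure.",Bunny is a timid chatbot.
"""

conversations = rows_to_messages(
    SAMPLE_CSV, user="input", assistant="output", system="instruction"
)
for messages in conversations:
    print([message["role"] for message in messages])
```

A list of role/content dictionaries in this shape is what a Hugging Face tokenizer's `apply_chat_template` method accepts, which is presumably how the default output format is produced.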

License ⚖️

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.

Contributing 🤝

We welcome contributions! Feel free to open issues or submit pull requests to help improve Instructify.

Release history

0.0.4: Sep 8, 2024
0.0.3: Sep 8, 2024
0.0.2: Sep 7, 2024 (this version)
0.0.1: Sep 7, 2024

Download files

Source Distribution

instructify-0.0.2.tar.gz (8.5 kB)

Uploaded Sep 7, 2024 Source

Built Distribution

instructify-0.0.2-py3-none-any.whl (9.5 kB)

Uploaded Sep 7, 2024 Python 3

File details

Details for the file instructify-0.0.2.tar.gz.

File metadata

Download URL: instructify-0.0.2.tar.gz
Upload date: Sep 7, 2024
Size: 8.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.5

File hashes

Hashes for instructify-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`65a470e63e676520603ea6624bb85311276b710a2050d064b10dab1c2ce4b120`
MD5	`4ff04a0e9ac12ab0fbc7cdf7d9d01c90`
BLAKE2b-256	`70d8c84ef4ccf196b57c99a0c65949a903a0de945eccdb92e4c391103920db3a`

File details

Details for the file instructify-0.0.2-py3-none-any.whl.

File metadata

Download URL: instructify-0.0.2-py3-none-any.whl
Upload date: Sep 7, 2024
Size: 9.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.5

File hashes

Hashes for instructify-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`78976faa1af921f2492cbccad04d473de0cb2f22d893b34c26a42bda671c8eec`
MD5	`5f907bd20731c20c3c2402068c505f9b`
BLAKE2b-256	`c95a38dca5a333f97ab1f0f924931e8f908eaa879aa29c32517afa7705f1983b`
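
The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using Python's standard `hashlib` module. The helper names are illustrative, and the commented usage assumes the wheel has already been downloaded to the current directory.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Example usage (assumes the file exists locally):
# matches_expected(
#     "instructify-0.0.2-py3-none-any.whl",
#     "78976faa1af921f2492cbccad04d473de0cb2f22d893b34c26a42bda671c8eec",
# )
```

A mismatch means the local file differs from the one uploaded to the index and should not be installed.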
