Instructify 📝 for easy Fine-Tuning preparation

Instructify 📝

Instructify is a Python library that converts CSV files or Hugging Face datasets into Hugging Face Dataset objects formatted for fine-tuning large language models (LLMs). Inspired by the instruction-based dataset approach described in OpenAI's InstructGPT paper (arXiv:2203.02155), it prepares your data for instruction tuning in a chat-like format.

Features ✨

  • CSV or Hugging Face Dataset Support: Automatically detects whether the input is a CSV file or a Hugging Face dataset.
  • Customizable Message Formatting: Supports user, assistant, and system messages with flexible column names.
  • Tokenizer Integration: Automatically integrates with a pre-trained tokenizer to format messages.
  • Custom Templates: Apply a custom template or use the tokenizer's default chat format.
  • Tokenizer Visualization: Shows how a piece of text is tokenized by different models' tokenizers, along with the total token count.
  • Easy Fine-Tuning Preparation: Prepares data for instruction tuning, similar to the InstructGPT format.

Installation 📦

pip install instructify

Usage 🚀

CSV Input

import pandas as pd
from instructify import to_train_dataset

# Example custom template
custom_template = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Example data
data = {
    "input": ["When was the Library of Alexandria burned down?", "What is the capital of France?"],
    "output": ["I-I think that was in 48 BC, b-but I'm not sure.", "The capital of France is Paris."],
    "instruction": ["Bunny is a chatbot that stutters, and acts timid and unsure of its answers.", None]
}

# Convert data to CSV
df = pd.DataFrame(data)
df.to_csv("data.csv", index=False)

# Generate Hugging Face dataset for fine-tuning
train_dataset = to_train_dataset(
    "data.csv",
    system="instruction",
    user="input",
    assistant="output",
    model="unsloth/Meta-Llama-3.1-8B-Instruct",
    custom_template=custom_template,
)

# Inspect the formatted dataset
print(train_dataset["text"])

Hugging Face Dataset Input

from instructify import to_train_dataset

# Example custom template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Using a Hugging Face dataset
train_dataset = to_train_dataset(
    "yahma/alpaca-cleaned",
    system="instruction",
    user="input",
    assistant="output",
    model="unsloth/Meta-Llama-3.1-8B-Instruct",
    custom_template=alpaca_prompt,
)

# Inspect the formatted dataset
print(train_dataset["text"])

Tokenizer Visualization 👁️

In addition to converting datasets, you can now visualize how different tokenizers process chat messages. The visualization displays the tokenized text with the following special symbols:

  • 🤜 (Right-Facing Fist): Represents spaces between words.
  • 💧 (Droplet): Represents newline characters.
  • 💔 (Broken Heart): Marks token boundaries.

from instructify import compare_tokenizers

# Compare tokenizers from different models
compare_tokenizers(["unsloth/Meta-Llama-3.1-8B-Instruct", "unsloth/gemma-2-9b-it"])

This shows how each tokenizer splits a piece of text, along with the total token count.
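To make the symbol scheme concrete, here is a minimal sketch of how such a rendering could work. It does not use Instructify or a real tokenizer: `render_tokens` is a hypothetical helper, the token list is hand-written, and placing 💔 between (rather than after) tokens is an assumption.

```python
# Hypothetical sketch: not Instructify's implementation.
SPACE, NEWLINE, BOUNDARY = "🤜", "💧", "💔"


def render_tokens(tokens):
    """Join decoded token strings, making whitespace and boundaries visible."""
    visible = [t.replace(" ", SPACE).replace("\n", NEWLINE) for t in tokens]
    return BOUNDARY.join(visible)


# Hand-written tokens standing in for a real tokenizer's output.
tokens = ["Hello", " world", "!", "\n"]
print(render_tokens(tokens))
print(f"Total tokens: {len(tokens)}")
```

Different tokenizers split the same text into different pieces, so both the rendered string and the token count vary by model.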

Output Example 📄

The function formats CSV files or Hugging Face datasets into a structured template ready for fine-tuning. For example, given this input data:

| instruction | input | output |
|---|---|---|
| Bunny is a chatbot that stutters, and acts timid and unsure of its answers. | When was the Library of Alexandria burned down? | I-I think that was in 48 BC, b-but I'm not sure. |
| None | What is the capital of France? | The capital of France is Paris. |

Default Output Format

With the tokenizer's default chat template, train_dataset["text"] contains strings in the following instruction-style format:

[
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nBunny is a chatbot that stutters, and acts timid and unsure of its answers.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhen was the Library of Alexandria burned down?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nI-I think that was in 48 BC, b-but I'm not sure.<|eot_id|>",
    
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe capital of France is Paris.<|eot_id|>"
]
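The strings above follow the usual chat convention of role-tagged messages. As a rough sketch (an assumption about the general approach, not Instructify's actual code), one row can be turned into a message list like this. `row_to_messages` is a hypothetical helper, and dropping an empty system prompt is a design choice made here for illustration.

```python
# Hypothetical sketch: builds the role/content list that chat templates consume.
def row_to_messages(row, system="instruction", user="input", assistant="output"):
    """Build a role-tagged message list from one row (a dict of column -> value)."""
    messages = []
    system_text = row.get(system)
    # Assumption: skip the system message when it is missing or blank.
    if isinstance(system_text, str) and system_text.strip():
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": str(row[user])})
    messages.append({"role": "assistant", "content": str(row[assistant])})
    return messages


row = {
    "instruction": None,
    "input": "What is the capital of France?",
    "output": "The capital of France is Paris.",
}
for message in row_to_messages(row):
    print(message)
```

A list in this shape is what the `apply_chat_template(messages, tokenize=False)` method of a Hugging Face tokenizer accepts, and the model's own template then adds the special tokens shown above.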

Custom Template Output

With a custom template, train_dataset["text"] contains strings in the following format:

[
    "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nBunny is a chatbot that stutters, and acts timid and unsure of its answers.\n\n### Input:\nWhen was the Library of Alexandria burned down?\n\n### Response:\nI-I think that was in 48 BC, b-but I'm not sure.<|eot_id|>",
    
    "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n\n\n### Input:\nWhat is the capital of France?\n\n### Response:\nThe capital of France is Paris.<|eot_id|>"
]
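The output above is consistent with filling the three `{}` placeholders in order (system, user, assistant) and appending the tokenizer's end-of-sequence token. A minimal sketch under that assumption follows. `fill_template` is a hypothetical helper rather than part of Instructify, and the `<|eot_id|>` value is copied from the example output.

```python
# Hypothetical sketch: reproduces the custom-template output shown above.
ALPACA_TEMPLATE = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "<|eot_id|>"  # assumed: taken from the example output


def fill_template(template, system, user, assistant, eos_token=EOS_TOKEN):
    """Fill the placeholders in order; a missing value becomes an empty string."""
    fields = ["" if value is None else str(value) for value in (system, user, assistant)]
    return template.format(*fields) + eos_token


text = fill_template(
    ALPACA_TEMPLATE,
    None,
    "What is the capital of France?",
    "The capital of France is Paris.",
)
print(text)
```

Because this sketch relies on `str.format`, a template containing literal braces would need them doubled (`{{` and `}}`).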

Functionality Overview 🔍

to_train_dataset

This function is the core of the library, enabling both CSV and Hugging Face dataset conversion for LLM fine-tuning.

Parameters:

  • data_source: Path to the input CSV file or Hugging Face dataset identifier.
  • system (optional): Column name for system messages (e.g., instructions for the model).
  • user: Column name for user messages (default: 'user').
  • assistant: Column name for assistant messages (default: 'assistant').
  • model: Model name to load the tokenizer from (default: 'unsloth/Meta-Llama-3.1-8B-Instruct').
  • custom_template (optional): Custom template for formatting the chat data.

Returns:

  • Dataset: A Hugging Face Dataset, ready for LLM fine-tuning.
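Since data_source accepts either a CSV path or a dataset identifier, something has to tell the two apart. The library's actual detection logic is not documented here. The sketch below shows one plausible rule (a `.csv` suffix means a local file) and `detect_source_kind` is a hypothetical name.

```python
# Hypothetical sketch: one simple way to distinguish the two input kinds.
def detect_source_kind(data_source):
    """Return "csv" for a path ending in .csv, otherwise "hub"."""
    if data_source.lower().endswith(".csv"):
        return "csv"
    # Assumption: anything else is treated as a Hugging Face dataset identifier.
    return "hub"


for source in ["data.csv", "yahma/alpaca-cleaned"]:
    print(source, "->", detect_source_kind(source))
```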

License ⚖️

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.

Contributing 🤝

We welcome contributions! Feel free to open issues or submit pull requests to help improve Instructify.
