A simple and useful data processing tool for LLM fine-tuning: process your json and jsonline data

Data4LLM

A simple and useful data processing tool for LLM fine-tuning. It currently processes json and jsonline data and outputs jsonlines.

It runs well on datasets with millions of rows.

install

pip install data4llm

API

1. File level

(1) merge files

Merge all the jsonlines files into one, shuffling the rows.

from glob import glob

from data4llm.Data4LLM import Data4LLM

# collect every jsonlines file in the directory
files = glob("dir/*.jsonl")
Data4LLM.merge_files(files=files)
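
To make the idea of "merge with shuffle" concrete, here is a minimal standard-library sketch. It is not the library's implementation: the function name merge_jsonl_with_shuffle and its output_path and seed parameters are hypothetical, and it assumes every input file holds one json object per line.

```python
import json
import random
from pathlib import Path


def merge_jsonl_with_shuffle(paths, output_path, seed=None):
    """Read every jsonlines file in `paths`, shuffle the rows, write one file.

    Hypothetical helper for illustration; all rows are held in memory.
    """
    rows = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    rows.append(json.loads(line))
    # a seeded Random instance keeps the shuffle reproducible
    random.Random(seed).shuffle(rows)
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(rows)
```

Because the rows are collected in a list before shuffling, memory use grows with the total size of the inputs.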

(2) split a file into train and test files

from data4llm.Data4LLM import Data4LLM

Data4LLM.split_train_test(file_input="data/test.json", train_test_ratio=3 / 5)
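
The text above does not say how train_test_ratio is interpreted. The sketch below assumes it is the fraction of rows that go to the train set (so 3 / 5 keeps 60% for training); that reading, the helper name split_rows, and the seed parameter are assumptions, not the library's documented behavior.

```python
import random


def split_rows(rows, train_fraction, seed=0):
    """Shuffle a copy of `rows` and cut it into (train, test).

    Hypothetical helper: `train_fraction` is assumed to be the share of
    rows assigned to the train set.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = [{"input": f"question {i}", "output": f"answer {i}"} for i in range(10)]
train, test = split_rows(rows, 3 / 5)
print(len(train), len(test))
```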

2. Sample level

Every sample is a json object in key-value form, dict[str, str], like:

 {"input":"hello!","output":"Hi, I'm an AI assistant, how can I help you?"}
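
Before running the sample-level functions, it can help to check that each line really has this shape. The following is a small sketch using only the standard library; is_valid_sample is a hypothetical helper and not part of data4llm.

```python
import json


def is_valid_sample(line):
    """Return True if `line` parses to a json object with str keys and str values."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and all(
        isinstance(key, str) and isinstance(value, str)
        for key, value in obj.items()
    )


sample = '{"input":"hello!","output":"Hi, I\'m an AI assistant, how can I help you?"}'
print(is_valid_sample(sample))
```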

(1) shuffle

Shuffle all the json rows in a file. Memory usage is not optimized yet: all the data has to be loaded into memory.

from data4llm.Data4LLM import Data4LLM

Data4LLM.shuffle(file_input="data/test.jsonl", file_output="result/sh_test.jsonl")

def shuffle(cls, file_input, file_output):
    '''
        shuffle: shuffle all the data in the input file. Warning: it loads all the data into memory
        file_input: input file path
        file_output: output file path
    '''

(2) remove duplicated data

Remove duplicate data by sim_hash. There are two functions: remove_duplicate_BloomFilter and remove_duplicate.

remove_duplicate_BloomFilter: removes duplicate data by sim_hash using a bloom filter; very fast

from data4llm.Data4LLM import Data4LLM

Data4LLM.remove_duplicate_BloomFilter(file_input="data/test.json", file_output="result/rm_dup_test.json", length=64)

def remove_duplicate_BloomFilter(cls, file_input, file_output, max_row_limit=1000, skip_hash=False, length=64,
                                 log_path="result.log"):
    '''
        remove_duplicate_BloomFilter: remove duplicate data by sim_hash using a bloom filter, very fast
        file_input: input file path with duplicated data
        file_output: result file path
        max_row_limit: the max number of rows kept in memory, useful to save memory
        skip_hash: default False. Keep it False the first time the function is called, because that pass computes the simhash of all the data
        length: the simhash length
        log_path: log file path
        :return: result data number, removed data number
    '''
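
The combination of a simhash fingerprint and a bloom filter can be sketched as follows. This is an illustration of the general technique, not the library's code: the tokenization (whitespace split), the hash functions (md5 and sha256), the filter size, and the names simhash, BloomFilter and dedup are all assumptions.

```python
import hashlib


def simhash(text, length=64):
    """Compute a `length`-bit simhash over whitespace tokens (length <= 128)."""
    weights = [0] * length
    for token in text.split():
        digest = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest(), "big")
        for bit in range(length):
            weights[bit] += 1 if (digest >> bit) & 1 else -1
    value = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            value |= 1 << bit
    return value


class BloomFilter:
    """Fixed-size bit array with several hash positions per key."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def dedup(texts, length=64):
    """Keep the first text for each simhash fingerprint seen."""
    seen = BloomFilter()
    kept = []
    for text in texts:
        fingerprint = simhash(text, length)
        if fingerprint in seen:  # may rarely be a false positive
            continue
        seen.add(fingerprint)
        kept.append(text)
    return kept
```

Each row costs a constant number of hash lookups, which is why this style of deduplication is fast. The trade-off is that only rows with an identical fingerprint are treated as duplicates, and a bloom filter can occasionally report a fingerprint as seen when it was not.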

remove_duplicate: removes duplicate data by sim_hash, comparing the data one by one. The result is more accurate and fine-grained, but it costs a lot of time.

from data4llm.Data4LLM import Data4LLM

Data4LLM.remove_duplicate(file_input="data/test.json", file_output="result/rm_dup_test.json", length=64)

def remove_duplicate(cls, file_input, file_output, ratio=1, max_row_limit=1000, skip_hash=False, length=64,
                     log_path="result.log"):
    '''
        remove_duplicate: remove duplicate data by sim_hash, comparing the data one by one; more accurate and fine-grained, but it costs a lot of time
        file_input: input file path with duplicated data
        file_output: result file path
        ratio: threshold for duplication, which is actually the distance between two simhash values
        max_row_limit: the max number of rows kept in memory, useful to save memory
        skip_hash: default False. Keep it False the first time the function is called, because that pass computes the simhash of all the data
        length: the simhash length
        log_path: log file path
        :return: result data number, removed data number
    '''
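
The one-by-one comparison can be pictured as a Hamming-distance check between fingerprints. The sketch below works on integer fingerprints directly. It assumes two rows count as duplicates when their distance is less than or equal to the threshold; whether the library uses "less than" or "less than or equal" is not stated above, and the names hamming_distance and dedup_by_distance are hypothetical.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two integer fingerprints differ."""
    return bin(a ^ b).count("1")


def dedup_by_distance(fingerprints, max_distance=1):
    """Return the indexes of fingerprints to keep.

    A fingerprint is kept only if it is farther than `max_distance`
    from every fingerprint kept so far (assumed duplicate rule).
    """
    kept = []
    for index, fingerprint in enumerate(fingerprints):
        if all(
            hamming_distance(fingerprint, fingerprints[k]) > max_distance
            for k in kept
        ):
            kept.append(index)
    return kept


print(dedup_by_distance([0b1011, 0b1010, 0b0100, 0b1011], max_distance=1))
```

Every new row is compared with all rows kept so far, so the cost grows roughly quadratically with the number of distinct rows. That matches the note above that this function is more accurate but much slower.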

(3) process property in json

Process the json rows one by one. This covers renaming properties, removing properties, and processing content (removing or replacing chars).

from data4llm.Data4LLM import Data4LLM, F


# define a process function to process every json row
def process_fn(row: dict[str, str]):
    '''
        row is a json in dict[str, str] form. You can process it with the usual dict operations yourself; some useful functions are also defined in Data4LLM.F
    '''
    # replace chars
    F.replace(row, "#", "")  # use regex to replace every '#' with '', i.e. remove every '#'
    F.replace(row, r"https?://\S+", "")  # use regex to remove urls
    # rename keys
    # rename json properties: 'input' to 'prompt', 'output' to 'chosen'
    # {"input":"hello!","output":"Hi, I'm an AI assistant, how can I help you?"} => {"prompt":"hello!","chosen":"Hi, I'm an AI assistant, how can I help you?"}
    F.rename(row, {"input": "prompt", "output": "chosen"})
    # you can also process the row: dict[str, str] yourself:
    # row['key'] = 'value'
    # row['key'] = row.pop('key1') + row.pop('key2')
    # ...
    return row


Data4LLM.process_property(file_input="data/test.jsonl", file_output="result/result_test.jsonl", process_fun=process_fn)

def process_property(cls, file_input, file_output, process_fun, max_row_limit=1000, json=None):
    '''
        process_property: process the json rows one by one, including: rename property, remove property, process content (remove chars, replace chars)
        file_input: input file path
        file_output: output file path
        process_fun: process function
        max_row_limit: default=1000, the number of rows written per step and the max number of rows kept in memory
        json: default=None, whether the file is json or jsonline, given as True/False
    '''
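
For readers who want to see what the replace and rename steps amount to on a plain dict, here is a standard-library sketch. The helpers replace_in_values and rename_keys are hypothetical stand-ins written for this example; they are not the implementation of F.replace and F.rename, and they assume the replacement is applied to every value in the row.

```python
import re


def replace_in_values(row, pattern, replacement):
    """Apply a regex substitution to every value of the row, in place."""
    for key, value in row.items():
        row[key] = re.sub(pattern, replacement, value)
    return row


def rename_keys(row, mapping):
    """Rename keys according to `mapping` (old name -> new name), in place."""
    for old, new in mapping.items():
        if old in row:
            row[new] = row.pop(old)
    return row


row = {"input": "welcome to https://www.baidu.com #LLM world", "output": "I like #LLM"}
replace_in_values(row, "#", "")
replace_in_values(row, r"https?://\S+", "")
rename_keys(row, {"input": "prompt", "output": "chosen"})
print(row)
```

Removing the url leaves the spaces that surrounded it, so the result contains a double space ("welcome to  LLM world").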

(4) show_example

It is very useful to preview the result with show_example before actually running the processing:

from data4llm.Data4LLM import Data4LLM

Data4LLM.show_example(file_input="data/test.jsonl", process_fun=process_fn)

examples:

##### No 1 #####
== Before ==
{'input': 'welcome to https://www.baidu.com #LLM world', 'output': 'I like #LLM'}
== After ==
{'prompt': 'welcome to  LLM world', 'chosen': 'I like LLM'}
##### No 2 #####
== Before ==
{'input': 'hello!', 'output': "Hi, I'm an AI assistant, how can I help you?"}
== After ==
{'prompt': 'hello!', 'chosen': "Hi, I'm an AI assistant, how can I help you?"}

def show_example(cls, file_input, process_fun, json=None, s=0, e=5):
    '''
        file_input: input file path
        process_fun: process function
        json: whether the file is json or jsonline; default None means it is decided by the extension of file_input
        s: default 0, the start row num
        e: default 5, the end row num
        :return: None
    '''
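
A before/after preview of this kind can be sketched with the standard library. The function below is an illustration only: the name preview is hypothetical, it handles jsonlines input only, and it assumes the start row is inclusive and the end row exclusive, which the description above does not confirm.

```python
import copy
import itertools
import json


def preview(path, process_fn, start=0, end=5):
    """Print rows [start, end) of a jsonlines file before and after process_fn."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in itertools.islice(f, start, end):
            if not line.strip():
                continue  # skip blank lines
            before = json.loads(line)
            # deep-copy so an in-place process_fn cannot alter `before`
            after = process_fn(copy.deepcopy(before))
            pairs.append((before, after))
    for number, (before, after) in enumerate(pairs, start=1):
        print(f"##### No {number} #####")
        print("== Before ==")
        print(before)
        print("== After ==")
        print(after)
    return pairs
```

Only the requested slice of the file is read, so the preview stays cheap even for very large inputs.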

Download files

Source Distribution

data4llm-0.1.2.tar.gz (15.1 kB)

Uploaded Source

Built Distribution

data4llm-0.1.2-py3-none-any.whl (13.4 kB)

Uploaded Python 3