Data4LLM
A simple and useful data-processing tool for LLM fine-tuning, now supporting json & jsonline input and jsonline output.
It runs well on datasets in the millions of rows.
Install
pip install data4llm
API
1. For file level
(1) merge files
merge all the jsonline files and shuffle the result:
from glob import glob
from data4llm.Data4LLM import Data4LLM

files = glob("dir/*.jsonl")
Data4LLM.merge_files(files=files)
(2) split a file into train and test files
from data4llm.Data4LLM import Data4LLM

Data4LLM.split_train_test(file_input="data/test.json", train_test_ratio=3 / 5)
With train_test_ratio=3 / 5, roughly 60% of the samples presumably go to the train file and the remaining 40% to the test file.
2. For sample level
Every sample is a json object in key-value form (dict[str, str]), like:
{"input":"hello!","output":"Hi, I'm an AI assistant, how can I help you?"}
(1) shuffle
shuffle all the json rows in a file. Memory usage is not yet optimized: the whole file is loaded into memory.
from data4llm.Data4LLM import Data4LLM
Data4LLM.shuffle(file_input="data/test.jsonl", file_output="result/sh_test.jsonl")
def shuffle(cls, file_input, file_output):
shuffle: shuffle all the data in the input file. Warning: it loads all the data into memory
file_input: input file path
file_output: output file path
(2) remove duplicated data
remove duplicate data by simhash. There are two functions: remove_duplicate_BloomFilter and remove_duplicate.
remove_duplicate_BloomFilter: removes duplicate data by simhash using a Bloom filter; very fast
from data4llm.Data4LLM import Data4LLM
Data4LLM.remove_duplicate_BloomFilter(file_input="data/test.json", file_output="result/rm_dup_test.json", length=64)
def remove_duplicate_BloomFilter(cls, file_input, file_output, max_row_limit=1000, skip_hash=False, length=64,
log_path="result.log"):
'''
remove_duplicate_BloomFilter: remove duplicate data by simhash, using a Bloom filter; very fast
file_input: input file path containing duplicated data
file_output: result file path
max_row_limit: the maximum number of rows held in memory, useful for saving memory
skip_hash: default False. Leave it False the first time the function is called, so the simhash of every row is computed
length: the simhash length
log_path: log file path
:return: number of rows kept, number of rows removed
'''
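Based on our reading of the skip_hash docstring above, a first run computes the simhash values and a repeat run over the same input can skip that step (a sketch, not confirmed by the library docs):

from data4llm.Data4LLM import Data4LLM

# first run: leave skip_hash=False (the default) so every row's simhash is computed
Data4LLM.remove_duplicate_BloomFilter(file_input="data/test.jsonl",
                                      file_output="result/rm_dup_test.jsonl",
                                      length=64)
# a repeat run over the same input can skip the hashing step
Data4LLM.remove_duplicate_BloomFilter(file_input="data/test.jsonl",
                                      file_output="result/rm_dup_test.jsonl",
                                      skip_hash=True,
                                      length=64)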
remove_duplicate: removes duplicate data by simhash, comparing rows one by one; more accurate and fine-grained results, but far slower
from data4llm.Data4LLM import Data4LLM
Data4LLM.remove_duplicate(file_input="data/test.json", file_output="result/rm_dup_test.json", length=64)
def remove_duplicate(cls, file_input, file_output, ratio=1, max_row_limit=1000, skip_hash=False, length=64,
log_path="result.log"):
remove_duplicate: remove duplicate data by simhash, comparing rows one by one; more accurate and fine-grained results, but far slower
file_input: input file path containing duplicated data
file_output: result file path
ratio: the duplication threshold, expressed as the distance between two simhash values
max_row_limit: the maximum number of rows held in memory, useful for saving memory
skip_hash: default False. Leave it False the first time the function is called, so the simhash of every row is computed
length: the simhash length
log_path: log file path
:return: number of rows kept, number of rows removed
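Since the two functions trade speed for accuracy, one plausible pattern (a sketch, not an official recipe; the file paths are hypothetical) is to run the fast Bloom-filter pass first and then the slower pairwise pass on its output, capturing the counts the docstrings say are returned:

from data4llm.Data4LLM import Data4LLM

# fast coarse pass: drop simhash collisions with the Bloom filter
Data4LLM.remove_duplicate_BloomFilter(file_input="data/test.jsonl",
                                      file_output="result/coarse.jsonl",
                                      length=64)
# slow fine pass: pairwise simhash comparison on the much smaller file
kept, removed = Data4LLM.remove_duplicate(file_input="result/coarse.jsonl",
                                          file_output="result/dedup.jsonl",
                                          ratio=1,
                                          length=64)
print(f"kept {kept} rows, removed {removed} near-duplicates")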
(3) process properties in a json row
process the json rows one by one: rename properties, remove properties, and process content (remove or replace characters)
from data4llm.Data4LLM import Data4LLM, F

# define a process function that is applied to every json row
def process_fn(row: dict[str, str]):
    '''
    row is a json object in dict[str, str] form; you can process it yourself with dict
    operations, and some useful helpers are also defined in Data4LLM.F
    replace chars
    '''
    F.replace(row, "#", "")  # use a regex to replace every '#' with '', i.e. remove all '#'
    F.replace(row, r"https?://\S+", "")  # use a regex to remove URLs
    '''
    rename keys
    rename json properties: 'input' to 'prompt', 'output' to 'chosen'
    {"input":"hello!","output":"Hi, I'm an AI assistant, how can I help you?"} => {"prompt":"hello!","chosen":"Hi, I'm an AI assistant, how can I help you?"}
    '''
    F.rename(row, {"input": "prompt", "output": "chosen"})
    '''
    you can also process the row (a dict[str, str]) yourself:
    row['key'] = 'value'
    row['key'] = row.pop('key1') + row.pop('key2')
    ...
    '''
    return row

Data4LLM.process_property(file_input="data/test.jsonl", file_output="result/result_test.jsonl", process_fun=process_fn)
def process_property(cls, file_input, file_output, process_fun, max_row_limit=1000, json=None):
process_property: process the json rows one by one: rename properties, remove properties, and process content (remove or replace chars)
file_input: input file path
file_output: output file path
process_fun: the process function applied to every row
max_row_limit: default 1000; rows are written out in batches of this size, which also caps the number of rows held in memory
json: default None, meaning the format (json or jsonline) is inferred from the file suffix; pass True/False to set it explicitly
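The section above also mentions removing properties; since each row is a plain dict, standard dict operations inside the process function are enough (a sketch; the 'id' field is hypothetical):

from data4llm.Data4LLM import Data4LLM

def drop_id_fn(row: dict):
    # remove a property you don't want in the output ('id' is a hypothetical field)
    row.pop("id", None)
    return row

Data4LLM.process_property(file_input="data/test.jsonl",
                          file_output="result/no_id.jsonl",
                          process_fun=drop_id_fn)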
(4) show_example
it is very useful to preview the result before actually applying the changes, using show_example:
from data4llm.Data4LLM import Data4LLM
Data4LLM.show_example(file_input="data/test.jsonl", process_fun=process_fn)
example output:
##### No 1 #####
== Before ==
{'input': 'welcome to https://www.baidu.com #LLM world', 'output': 'I like #LLM'}
== After ==
{'prompt': 'welcome to LLM world', 'chosen': 'I like LLM'}
##### No 2 #####
== Before ==
{'input': 'hello!', 'output': "Hi, I'm an AI assistant, how can I help you?"}
== After ==
{'prompt': 'hello!', 'chosen': "Hi, I'm an AI assistant, how can I help you?"}
def show_example(cls, file_input, process_fun, json=None, s=0, e=5):
file_input: input file path
process_fun: the process function applied to every row
json: whether the file is json or jsonline; default None means it is decided by the suffix of file_input
s: default 0, the start row number
e: default 5, the end row number
:return: None
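To preview a different slice of the file, pass s and e (a sketch; whether e is inclusive is not documented, so this assumes the usual half-open convention):

from data4llm.Data4LLM import Data4LLM

# preview rows 10..14 instead of the first five, reusing process_fn from section (3)
Data4LLM.show_example(file_input="data/test.jsonl", process_fun=process_fn, s=10, e=15)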