Data4LLM
The simple and useful data processing tool for LLMs.
data4llm is a JSON & JSON Lines processing tool that runs well at the scale of millions of samples, facilitating the construction of datasets to continue-pretrain and fine-tune your LLM.
install
pip install data4llm
API
SFT
from data4llm.Data4LLM import SFT
1. For file level
(1) merge files
Merge all the JSON Lines files, shuffling the result:
from glob import glob
from data4llm.Data4LLM import SFT

files = glob("dir/*.jsonl")
SFT.merge_files(files=files)
(2) split a file into train and test files
from data4llm.Data4LLM import SFT
SFT.split_train_test(file_input="data/test.jsonl", train_ratio=3 / 5)
2. For sample level
Every sample is a JSON object with string keys and values (dict[str, str]), for example:
{"input":"hello!","output":"Hi, I'm an AI assistant, how can I help you?"}
(1) shuffle
Shuffle all the samples in a file. Memory usage is not yet optimized: all the data is loaded into memory at once.
from data4llm.Data4LLM import SFT
SFT.shuffle(file_input="data/test.txt", file_output="result/sh_test.jsonl")
def shuffle(cls, file_input, file_output):
'''
shuffle: shuffle all the data in the input file. Warning: it loads all the data into memory
file_input: input file path
file_output: output file path
'''
(2) remove duplicated data
Remove duplicate data by SimHash. There are two functions: remove_duplicate_BloomFilter and remove_duplicate.
remove_duplicate_BloomFilter: removes duplicate data by SimHash, filtering candidates through a Bloom filter; very fast.
from data4llm.Data4LLM import SFT
SFT.remove_duplicate_BloomFilter(file_input="data/test.jsonl", file_output="result/rm_dup_test.json", length=64)
def remove_duplicate_BloomFilter(cls, file_input, file_output, max_row_limit=1000, skip_hash=False, length=64, log_path="result.log"):
'''
remove_duplicate_BloomFilter: remove duplicate data by SimHash, filtering candidates through a Bloom filter; very fast
file_input: input file path containing duplicated data
file_output: result file path
max_row_limit: maximum number of rows kept in memory at once, useful to bound memory usage
skip_hash: default False. Leave it False on the first call so the SimHash of all the data is computed; set it True to skip recomputation on later calls
length: the SimHash length
log_path: log file path
:return: number of rows kept, number of rows removed
'''
remove_duplicate: removes duplicate data by SimHash, comparing rows one by one; more accurate and fine-grained, but much slower.
from data4llm.Data4LLM import SFT
SFT.remove_duplicate(file_input="data/test.jsonl", file_output="result/rm_dup_test.json", length=64)
def remove_duplicate(cls, file_input, file_output, ratio=1, max_row_limit=1000, skip_hash=False, length=64, log_path="result.log"):
'''
remove_duplicate: remove duplicate data by SimHash, comparing rows one by one; more accurate and fine-grained, but much slower
file_input: input file path containing duplicated data
file_output: result file path
ratio: duplication threshold, i.e. the allowed distance between two SimHash values
max_row_limit: maximum number of rows kept in memory at once, useful to bound memory usage
skip_hash: default False. Leave it False on the first call so the SimHash of all the data is computed; set it True to skip recomputation on later calls
length: the SimHash length
log_path: log file path
:return: number of rows kept, number of rows removed
'''
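A fuller call; treating the "rows kept, rows removed" counts described in the docstring as a return value is a sketch, not a guaranteed API:

from data4llm.Data4LLM import SFT

# Per the docstring above, the call reports (rows kept, rows removed)
kept, removed = SFT.remove_duplicate(
    file_input="data/test.jsonl",
    file_output="result/rm_dup_test.jsonl",
    ratio=1,    # allowed distance between two SimHash values
    length=64,  # SimHash length
)
print(f"kept {kept} rows, removed {removed} duplicates")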
(3) apply or apply_multi
The most powerful function in this project. With it you can apply any processing rule, including: processing properties (rename, remove, add), processing content (remove chars, replace chars), filtering samples by some rule, and deriving several samples from one sample. There are three typical uses: I. filter, II. process attributes, III. from one to several:
I. filter by length
Filter a sample by returning None:
from data4llm.Data4LLM import SFT, F

def fn(row: dict[str, str]) -> dict[str, str]:
    if F.len(row) > 1000:
        return None
    return row

SFT.apply(file_input="data/test.txt", file_output="result/result_test.jsonl", fn=fn)
# or apply_multi, which processes the file with several worker threads; useful when fn involves network calls or other slow operations, such as waiting for an LLM's response
SFT.apply_multi(file_input="data/test.txt", file_output="result/result_test.jsonl", fn=fn, num_work=3)
II. concat two properties into one
Apply a transformation to every sample:
from data4llm.Data4LLM import SFT, F

def fn(row: dict[str, str]) -> dict[str, str]:
    row['input'] = row['instruction'] + row['prompt']
    row.pop("instruction")
    row.pop("prompt")
    return row
SFT.apply(file_input="data/test.txt", file_output="result/result_test.jsonl", fn=fn)
III. from one to several
Generate several samples from one sample by returning a list of dicts:
from typing import List
from data4llm.Data4LLM import SFT

def fn(row: dict[str, str]) -> List[dict[str, str]]:
    arrs = row['input'].split(";")[:-1]
    rows = []
    temp_str = ""
    for i, item in enumerate(arrs):
        if i % 2 != 0:
            # odd positions hold responses: emit one sample per response
            output = item.split(":")[0]
            temp_row = {"input": temp_str + "assistant:", "output": output}
            rows.append(temp_row)
        else:
            # even positions accumulate the dialogue context
            temp_str += item + ";"
    return rows
SFT.apply(file_input="data/test.txt", file_output="result/result_test.jsonl", fn=fn)
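For instance, tracing fn above on a hypothetical dialogue row (the user:/assistant: layout is assumed):

row = {"input": "user:hi;hello;user:how are you;fine;"}
fn(row)
# -> [{"input": "user:hi;assistant:", "output": "hello"},
#     {"input": "user:hi;user:how are you;assistant:", "output": "fine"}]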
The apply function:
def apply(cls, file_input, file_output, fn, max_row_limit=1000, json=None):
'''
apply: apply fn to the JSON rows one by one, e.g. renaming properties, removing properties, processing content (removing chars, replacing chars)
file_input: input file path
file_output: output file path
fn: apply function
max_row_limit: default=1000; the file is written every max_row_limit rows, which also bounds the number of rows kept in memory
json: default=None, meaning JSON vs. JSON Lines is decided by the file suffix; otherwise pass True/False
'''
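When the suffix is ambiguous (e.g. the .txt inputs in the examples above), the format can be stated explicitly. A minimal sketch, assuming json=False selects JSON Lines:

from data4llm.Data4LLM import SFT

# .txt reveals nothing about the format, so force it;
# json=False is assumed here to mean "treat the file as JSON Lines"
SFT.apply(file_input="data/test.txt", file_output="result/result_test.jsonl",
          fn=fn, json=False)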
(4) show_example
It is very useful to preview the result of a transformation before actually running it, via show_example:
from data4llm.Data4LLM import SFT, F

def fn(row):
    # drop the URL, strip '#' everywhere, then rename the properties
    row['input'] = row['input'].replace("https://www.baidu.com ", "")
    F.replace(row, pattern="#", repl="")
    F.rename(row, {"input": "prompt", "output": "chosen"})
    return row

SFT.show_example(file_input="data/test.txt", fn=fn)
Example output:
##### No 1 #####
== Before ==
{'input': 'welcome to https://www.baidu.com #LLM world', 'output': 'I like #LLM'}
== After ==
{'prompt': 'welcome to LLM world', 'chosen': 'I like LLM'}
##### No 2 #####
== Before ==
{'input': 'hello!', 'output': "Hi, I'm an AI assistant, how can I help you?"}
== After ==
{'prompt': 'hello!', 'chosen': "Hi, I'm an AI assistant, how can I help you?"}
def show_example(cls, file_input, fn, json=None, s=0, e=5):
'''
file_input: input file path
fn: apply function
json: whether the file is JSON or JSON Lines; default None means it is decided by the suffix of file_input
s: default 0, the start row number
e: default 5, the end row number
:return: None
'''
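A usage sketch with an explicit preview window (reusing the fn defined above):

from data4llm.Data4LLM import SFT

# Preview the effect of fn on rows 10 through 15 only; nothing is written to disk
SFT.show_example(file_input="data/test.jsonl", fn=fn, s=10, e=15)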
PT
from data4llm.Data4LLM import PT
(1) show_properties
Show the JSON structure of the input files:
def show_properties(cls, files, s=0, e=5):
'''
show the JSON structure
:param files: input file paths
:param s: start row number, default 0
:param e: end row number, default 5
:return: None
'''
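A minimal usage sketch (the raw/*.jsonl path is hypothetical):

from glob import glob
from data4llm.Data4LLM import PT

# Peek at the JSON structure of the first five rows of each file
files = glob("raw/*.jsonl")
PT.show_properties(files=files, s=0, e=5)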
(2) parse_pages
Parse the semi-structured JSON and gather all the tokens needed for PT (pretraining):
def parse_pages(cls, files, fn, output_dir):
'''
parse the semi-structured JSON and gather all the tokens needed for PT
:param files: input file paths
:param fn: apply function that extracts the text needed for PT from each row
:param output_dir: output directory
:return:
'''
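A sketch of a parse_pages call; the title/content field names and the assumption that fn returns the extracted text are hypothetical:

from glob import glob
from data4llm.Data4LLM import PT

def fn(row):
    # Hypothetical extractor: concatenate the fields whose text
    # should end up in the pretraining corpus
    return row["title"] + "\n" + row["content"]

files = glob("raw/*.jsonl")
PT.parse_pages(files=files, fn=fn, output_dir="parsed")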
(3) merge_files
merge all the txt files
def merge_files(cls, files, output_file="merge_file.txt", max_limit_num=100):
'''
merge all the txt files
:param files: input file paths
:param output_file: merged output file path, default "merge_file.txt"
:param max_limit_num:
:return:
'''
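A usage sketch (the parsed/*.txt path is hypothetical):

from glob import glob
from data4llm.Data4LLM import PT

files = glob("parsed/*.txt")
PT.merge_files(files=files, output_file="merge_file.txt", max_limit_num=100)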
(4) split_train_test
split a file into train and test files
def split_train_test(cls, file_input, train_test_ratio, file_train_output="train.txt", file_test_output="test.txt"):
'''
split a file into train and test files
:param file_input: input file path
:param train_test_ratio: train/test split ratio
:param file_train_output: train split output path, default "train.txt"
:param file_test_output: test split output path, default "test.txt"
:return:
'''
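A usage sketch, assuming train_test_ratio is the train fraction (by analogy with SFT.split_train_test above):

from data4llm.Data4LLM import PT

# Hold out 10% of the lines for testing
PT.split_train_test(file_input="merge_file.txt", train_test_ratio=0.9,
                    file_train_output="train.txt", file_test_output="test.txt")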
(5) sample
Sample rows from a file.
F
A tool class with some useful functions
from data4llm.Data4LLM import F
(1) count
get the sample number of a file
def count(cls, file_input):
"""
get the sample number of a file
:param file_input: input file path
:return: the number of samples
"""
(2) functions used in apply fn
rename(): rename properties of a sample
replace(): replace chars in a whole JSON row or in a single property of the row
len(): get the length of the JSON (values only) or of part of the JSON (specify a property such as "chosen" to count only {"chosen"})
def rename(cls, row, mapping: dict[str, str]) -> None
def replace(cls, row, pattern, repl, property=None) -> None
def len(cls, row, property=None) -> int
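The helpers combine naturally inside an apply fn. A sketch, assuming pattern/repl follow re.sub semantics:

from data4llm.Data4LLM import SFT, F

def fn(row):
    # rename keys in place
    F.rename(row, {"input": "prompt", "output": "chosen"})
    # strip URLs from the "prompt" property only (regex semantics assumed)
    F.replace(row, pattern=r"https?://\S+", repl="", property="prompt")
    # drop overly long samples, measured over all values
    if F.len(row) > 2000:
        return None
    return row

SFT.apply(file_input="data/test.jsonl", file_output="result/result_test.jsonl", fn=fn)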