scarabs: LLM training paradigm
scarabs is a general-purpose model training framework built on transformers.
It supports training on tabular data, text data, and image data, as well as LLM training.
📘 core:
- ✅ Training on tabular data, for example CTR prediction in recommendation systems
- Training on text data, for example text classification
- Training on image data, for example image classification
- Training of LLMs, for example LLM pretraining
📘 very easy to use
pip install scarabs
📘 In detail
- ✅ Tabular data: refer to tabular_ctr in the examples folder
- Text data: refer to llm_classification in the examples folder
- LLM: refer to llm_pretrain in the examples folder
- For everything else, refer to GitHub: https://github.com/zhu2856061/scarabs
📘 arguments
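ℹ️ task_name_or_path: Task name. All intermediate and final training outputs are written under this directory.
ℹ️ data_format: Data format, one of [text, csv, json, parquet]. Parquet is recommended for tabular data (prepare your data as parquet files ahead of time); JSON is recommended for text data.
ℹ️ train_file: Path to the training data. Either a folder (the files inside it are read) or a single file path.
ℹ️ valid_file: Path to the evaluation data. Either a folder (the files inside it are read) or a single file path.
ℹ️ test_file: Path to the test data. Either a folder (the files inside it are read) or a single file path.
ℹ️ preprocessing_num_workers: Number of worker processes started to preprocess the data in parallel.
ℹ️ labels: The target (Y) columns. ⚠️ This is a list, so that multi-objective models are supported.
ℹ️ load_resume_from_checkpoint: Path to a checkpoint folder. The checkpoint is loaded first, then training continues from it.
ℹ️ incremental_resume_from_checkpoint: Path to a checkpoint folder, used for incremental training of the embedding layer. A previously trained model has a fixed set of feature values/tokens, so a brand-new feature value/token that appears when training continues from that model cannot be recognized and is treated as UNK. Setting this path enables incremental training.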
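To make the "folder or single file" behavior of train_file, valid_file, and test_file concrete, here is a minimal sketch of how such a path could be expanded into a list of files. The helper name `resolve_data_files` and the suffix mapping are assumptions for illustration only; this is not the scarabs implementation.

```python
from pathlib import Path

# Hypothetical mapping from data_format to a file suffix (an assumption).
SUFFIXES = {"text": ".txt", "csv": ".csv", "json": ".json", "parquet": ".parquet"}


def resolve_data_files(path, data_format):
    """Return a sorted list of data files for a folder or a single file path."""
    p = Path(path)
    if p.is_file():
        # A single file path is used as-is.
        return [str(p)]
    if p.is_dir():
        # A folder is expanded to the files inside it with the matching suffix.
        suffix = SUFFIXES[data_format]
        return sorted(
            str(f) for f in p.iterdir() if f.is_file() and f.suffix == suffix
        )
    raise FileNotFoundError(f"{path} is neither a file nor a folder")
```

With a layout like the one used in the examples, `resolve_data_files("../data/movielens/train", "csv")` would return every `.csv` file in that folder.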
🔔 CTR training [update]
Normal training
[See examples/tabular]
|_ arguments.yaml  parameters required for training
|_ config.json  model parameters
|_ main.py  main training program
The parameters in arguments.yaml are set as follows:
task_name_or_path: "encode"
overwrite_output_dir: true
output_dir: "model"
# data
data_format: "csv"
train_file: "../data/movielens/train"
valid_file: "../data/movielens/valid"
preprocessing_num_workers: 2
# model
# load_resume_from_checkpoint: "./encode/model/checkpoint-1029"
# incremental_resume_from_checkpoint: "./encode/model/checkpoint-1029"
# runtimes metric
do_train: true
seed: 2025
use_cpu: false
report_to: "tensorboard"
save_safetensors: true
save_total_limit: 1
early_stopping_patience: 3
early_stopping_threshold: 1.0e-7
remove_unused_columns: false
metric_for_best_model: "eval_roc_auc"
greater_is_better: true
# optim
optim: "adamw_torch"
learning_rate: 1.0e-3
lr_scheduler_type: "reduce_lr_on_plateau"
lr_scheduler_kwargs:
  mode: "max"
  factor: 0.1
  patience: 1
  verbose: true
weight_decay: 0
max_grad_norm: 10.0
gradient_accumulation_steps: 1
# data
label_names: ["label"]
per_device_train_batch_size: 4096
per_device_eval_batch_size: 4096
dataloader_num_workers: 4
# view
eval_strategy: "epoch"
logging_strategy: "epoch"
save_strategy: "epoch"
load_best_model_at_end: True
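The scheduler settings above (mode "max", factor 0.1, patience 1) shrink the learning rate when the monitored metric stops improving. The following is a simplified pure-Python sketch of that rule, written only to show what the three values mean. The class name `PlateauScheduler` is hypothetical, and PyTorch's own ReduceLROnPlateau has additional options (for example a threshold and a cooldown) that are left out here.

```python
class PlateauScheduler:
    """Simplified reduce-on-plateau rule (illustrative, not the PyTorch class)."""

    def __init__(self, lr, mode="max", factor=0.1, patience=1):
        self.lr = lr
        self.mode = mode
        self.factor = factor
        self.patience = patience
        self.best = None
        self.bad_epochs = 0

    def step(self, metric):
        """Record one evaluation result and return the learning rate to use next."""
        if self.best is None:
            improved = True
        elif self.mode == "max":
            improved = metric > self.best
        else:
            improved = metric < self.best
        if improved:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Reduce only after more than `patience` epochs without improvement.
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


scheduler = PlateauScheduler(lr=1.0e-3, mode="max", factor=0.1, patience=1)
for auc in [0.70, 0.72, 0.71, 0.71, 0.73]:
    print(auc, scheduler.step(auc))
```

In this run the learning rate stays at 1.0e-3 until the metric has failed to improve for two evaluations in a row, and is then multiplied by 0.1.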
For config.json, refer to the config of the specific model under [scarabs/nova/models].
main.py is as follows:
from __future__ import absolute_import, division, print_function

import os

from transformers.hf_argparser import HfArgumentParser

from scarabs.nova.models.ctr_with_dnn import CtrWithDNN, CtrWithDNNConfig
from scarabs.task_factory import TaskArguments, TaskFactoryWithTabularCtr

def feature_engineering(args):
    config = CtrWithDNNConfig.from_pretrained("config.json")
    task = TaskFactoryWithTabularCtr(args, config=config)
    task.create_feature2meta_in_config()

def train(args):
    # Train
    config = CtrWithDNNConfig.from_pretrained(
        os.path.join(
            args.task_name_or_path,
            "data/meta/config.json",
        )
    )
    task = TaskFactoryWithTabularCtr(args, config)
    task.train(model=CtrWithDNN(config))

def continue_train(args):
    # Train
    config = CtrWithDNNConfig.from_pretrained(
        os.path.join(
            args.load_resume_from_checkpoint,
            "config.json",
        )
    )
    task = TaskFactoryWithTabularCtr(args, config)
    task.train(model=CtrWithDNN(config))

def incremental_continue_feature_engineering(args):
    config = CtrWithDNNConfig.from_pretrained(
        os.path.join(
            args.incremental_resume_from_checkpoint,
            "config.json",
        )
    )
    task = TaskFactoryWithTabularCtr(args, config=config)
    task.create_feature2meta_in_config()

def incremental_continue_train(args):
    # Train
    config = CtrWithDNNConfig.from_pretrained(
        os.path.join(
            args.task_name_or_path,
            "data/meta/config.json",
        )
    )
    task = TaskFactoryWithTabularCtr(args, config)
    task.train(model=CtrWithDNN(config))

def eval():
    # Predict
    task = TaskFactoryWithTabularCtr()
    model_path = "./encode/model"
    task.inference_with_load_model(model_path, CtrWithDNN)
    import pandas as pd
    from sklearn.metrics import roc_auc_score
    preds = []
    label = []
    ds = pd.read_csv("../../data/movielens/valid/valid.csv")
    for line in ds.to_dict("records"):
        label.append(line["label"])
        res = task.inference(X=line)
        preds.append(res["logits"][0].item())
    print(roc_auc_score(label, preds))

if __name__ == "__main__":
    parser = HfArgumentParser(TaskArguments)  # type: ignore
    args = parser.parse_yaml_file("arguments.yaml")[0]
    # feature
    feature_engineering(args)
    # Train
    train(args)
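The configuration selects the best model by eval_roc_auc, and eval() above reports the same metric through scikit-learn. As a reminder of what that number measures, here is a small pure-Python sketch of ROC AUC as the fraction of (positive, negative) pairs that the model ranks correctly, with ties counted as half. The function name `roc_auc` is hypothetical, and this quadratic-time version is meant for illustration, not for large validation sets.

```python
def roc_auc(labels, scores):
    """ROC AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    # A pair counts 1 if the positive scores higher, 0.5 on a tie, 0 otherwise.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
```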
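Continue training
[See examples/tabular]
|_ arguments.yaml  parameters required for training
|_ model/  model folder, holding the checkpoint checkpoint-**** with config.json and models.safetensors
|_ main.py  main training program
The parameters in arguments.yaml are set as follows: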
task_name_or_path: "encode"
overwrite_output_dir: true
output_dir: "model"
# data
data_format: "csv"
train_file: "../data/movielens/train"
valid_file: "../data/movielens/valid"
preprocessing_num_workers: 2
# model
load_resume_from_checkpoint: "./model/checkpoint-1029"
# incremental_resume_from_checkpoint: "./encode/model/checkpoint-1029"
# runtimes metric
do_train: true
seed: 2025
use_cpu: false
report_to: "tensorboard"
save_safetensors: true
save_total_limit: 1
early_stopping_patience: 3
early_stopping_threshold: 1.0e-7
remove_unused_columns: false
metric_for_best_model: "eval_roc_auc"
greater_is_better: true
# optim
optim: "adamw_torch"
learning_rate: 1.0e-3
lr_scheduler_type: "reduce_lr_on_plateau"
lr_scheduler_kwargs:
  mode: "max"
  factor: 0.1
  patience: 1
  verbose: true
weight_decay: 0
max_grad_norm: 10.0
gradient_accumulation_steps: 1
# data
label_names: ["label"]
per_device_train_batch_size: 4096
per_device_eval_batch_size: 4096
dataloader_num_workers: 4
# view
eval_strategy: "epoch"
logging_strategy: "epoch"
save_strategy: "epoch"
load_best_model_at_end: True
main.py is as follows:
from __future__ import absolute_import, division, print_function

import os

from transformers.hf_argparser import HfArgumentParser

from scarabs.nova.models.ctr_with_dnn import CtrWithDNN, CtrWithDNNConfig
from scarabs.task_factory import TaskArguments, TaskFactoryWithTabularCtr

def feature_engineering(args):
    config = CtrWithDNNConfig.from_pretrained("config.json")
    task = TaskFactoryWithTabularCtr(args, config=config)
    task.create_feature2meta_in_config()

def train(args):
    # Train
    config = CtrWithDNNConfig.from_pretrained(
        os.path.join(
            args.task_name_or_path,
            "data/meta/config.json",
        )
    )
    task = TaskFactoryWithTabularCtr(args, config)
    task.train(model=CtrWithDNN(config))

def continue_train(args):
    # Train
    config = CtrWithDNNConfig.from_pretrained(
        os.path.join(
            args.load_resume_from_checkpoint,
            "config.json",
        )
    )
    task = TaskFactoryWithTabularCtr(args, config)
    task.train(model=CtrWithDNN(config))

def incremental_continue_feature_engineering(args):
    config = CtrWithDNNConfig.from_pretrained(
        os.path.join(
            args.incremental_resume_from_checkpoint,
            "config.json",
        )
    )
    task = TaskFactoryWithTabularCtr(args, config=config)
    task.create_feature2meta_in_config()

def incremental_continue_train(args):
    # Train
    config = CtrWithDNNConfig.from_pretrained(
        os.path.join(
            args.task_name_or_path,
            "data/meta/config.json",
        )
    )
    task = TaskFactoryWithTabularCtr(args, config)
    task.train(model=CtrWithDNN(config))

def eval():
    # Predict
    task = TaskFactoryWithTabularCtr()
    model_path = "./encode/model"
    task.inference_with_load_model(model_path, CtrWithDNN)
    import pandas as pd
    from sklearn.metrics import roc_auc_score
    preds = []
    label = []
    ds = pd.read_csv("../../data/movielens/valid/valid.csv")
    for line in ds.to_dict("records"):
        label.append(line["label"])
        res = task.inference(X=line)
        preds.append(res["logits"][0].item())
    print(roc_auc_score(label, preds))

if __name__ == "__main__":
    parser = HfArgumentParser(TaskArguments)  # type: ignore
    args = parser.parse_yaml_file("arguments.yaml")[0]
    # Train
    continue_train(args)
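The checkpoint folder in load_resume_from_checkpoint is written by hand in the YAML above. If the step number is not known in advance, a small helper can pick the newest checkpoint-<step> folder inside the model directory. The sketch below is a hypothetical convenience that uses only the standard library; the function name `latest_checkpoint` is an assumption and is not part of scarabs.

```python
import re
from pathlib import Path


def latest_checkpoint(model_dir):
    """Return the checkpoint-<step> folder with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for child in Path(model_dir).iterdir():
        match = pattern.match(child.name)
        # Only directories named checkpoint-<digits> are considered.
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return str(best_path) if best_path is not None else None
```

The returned path could then be placed in the load_resume_from_checkpoint field before starting the continued run.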
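Continue training with incremental input-ids embeddings
[See examples/tabular]
|_ arguments.yaml  parameters required for training
|_ model/  model folder, holding the checkpoint checkpoint-**** with config.json and models.safetensors
|_ main.py  main training program
The parameters in arguments.yaml are set as follows: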
task_name_or_path: "encode"
overwrite_output_dir: true
output_dir: "model"
# data
data_format: "csv"
train_file: "../data/movielens/train"
valid_file: "../data/movielens/valid"
preprocessing_num_workers: 2
# model
# load_resume_from_checkpoint: "./model/checkpoint-1029"
incremental_resume_from_checkpoint: "./encode/model/checkpoint-1029"
# runtimes metric
do_train: true
seed: 2025
use_cpu: false
report_to: "tensorboard"
save_safetensors: true
save_total_limit: 1
early_stopping_patience: 3
early_stopping_threshold: 1.0e-7
remove_unused_columns: false
metric_for_best_model: "eval_roc_auc"
greater_is_better: true
# optim
optim: "adamw_torch"
learning_rate: 1.0e-3
lr_scheduler_type: "reduce_lr_on_plateau"
lr_scheduler_kwargs:
  mode: "max"
  factor: 0.1
  patience: 1
  verbose: true
weight_decay: 0
max_grad_norm: 10.0
gradient_accumulation_steps: 1
# data
label_names: ["label"]
per_device_train_batch_size: 4096
per_device_eval_batch_size: 4096
dataloader_num_workers: 4
# view
eval_strategy: "epoch"
logging_strategy: "epoch"
save_strategy: "epoch"
load_best_model_at_end: True
main.py is as follows:
from __future__ import absolute_import, division, print_function

import os

from transformers.hf_argparser import HfArgumentParser

from scarabs.nova.models.ctr_with_dnn import CtrWithDNN, CtrWithDNNConfig
from scarabs.task_factory import TaskArguments, TaskFactoryWithTabularCtr

def feature_engineering(args):
    config = CtrWithDNNConfig.from_pretrained("config.json")
    task = TaskFactoryWithTabularCtr(args, config=config)
    task.create_feature2meta_in_config()

def train(args):
    # Train
    config = CtrWithDNNConfig.from_pretrained(
        os.path.join(
            args.task_name_or_path,
            "data/meta/config.json",
        )
    )
    task = TaskFactoryWithTabularCtr(args, config)
    task.train(model=CtrWithDNN(config))

def continue_train(args):
    # Train
    config = CtrWithDNNConfig.from_pretrained(
        os.path.join(
            args.load_resume_from_checkpoint,
            "config.json",
        )
    )
    task = TaskFactoryWithTabularCtr(args, config)
    task.train(model=CtrWithDNN(config))

def incremental_continue_feature_engineering(args):
    config = CtrWithDNNConfig.from_pretrained(
        os.path.join(
            args.incremental_resume_from_checkpoint,
            "config.json",
        )
    )
    task = TaskFactoryWithTabularCtr(args, config=config)
    task.create_feature2meta_in_config()

def incremental_continue_train(args):
    # Train
    config = CtrWithDNNConfig.from_pretrained(
        os.path.join(
            args.task_name_or_path,
            "data/meta/config.json",
        )
    )
    task = TaskFactoryWithTabularCtr(args, config)
    task.train(model=CtrWithDNN(config))

def eval():
    # Predict
    task = TaskFactoryWithTabularCtr()
    model_path = "./encode/model"
    task.inference_with_load_model(model_path, CtrWithDNN)
    import pandas as pd
    from sklearn.metrics import roc_auc_score
    preds = []
    label = []
    ds = pd.read_csv("../../data/movielens/valid/valid.csv")
    for line in ds.to_dict("records"):
        label.append(line["label"])
        res = task.inference(X=line)
        preds.append(res["logits"][0].item())
    print(roc_auc_score(label, preds))

if __name__ == "__main__":
    parser = HfArgumentParser(TaskArguments)  # type: ignore
    args = parser.parse_yaml_file("arguments.yaml")[0]
    # Train
    incremental_continue_feature_engineering(args)
    incremental_continue_train(args)
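During incremental training, the training log prints messages about which matrices of the incremental model changed. Watch for them:
logger.warning(f"{v} shape mismatched, current: {model_dict[v].shape} != history:{state_dict[k].shape}")
This reports that the shape of the current model matrix differs from the shape of the historical model matrix.
logger.warning(f"{key} is updated from history:{history_size} to current:{current_size}")
This reports that the historical model matrix has been resized to the new matrix size.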
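Conceptually, the resize reported by these log messages amounts to growing an embedding table so that feature values seen before keep their trained rows while new values get freshly initialized rows. The following pure-Python sketch illustrates that idea under stated assumptions: the function name `expand_embedding`, the list-of-lists representation, and the Gaussian initialization with standard deviation 0.02 are all hypothetical and are not taken from the scarabs source.

```python
import random


def expand_embedding(history_rows, history_vocab, current_vocab, dim, seed=2025):
    """Build a table for current_vocab, reusing rows of values seen before."""
    rng = random.Random(seed)
    history_index = {value: i for i, value in enumerate(history_vocab)}
    current_rows = []
    for value in current_vocab:
        if value in history_index:
            # Known feature value: copy its trained row from the old table.
            current_rows.append(list(history_rows[history_index[value]]))
        else:
            # New feature value: start from a small random row.
            current_rows.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return current_rows


old_vocab = ["<UNK>", "movie_1", "movie_2"]
old_rows = [[0.0, 0.0], [0.5, -0.1], [0.3, 0.2]]
new_vocab = old_vocab + ["movie_3"]
new_rows = expand_embedding(old_rows, old_vocab, new_vocab, dim=2)
print(f"updated from history:{len(old_rows)} to current:{len(new_rows)}")
```

Without this step, "movie_3" would fall back to the `<UNK>` row, which is the situation incremental_resume_from_checkpoint is described as avoiding.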
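🔔 LLM training [update]
1 Pure pretraining, from 0 to 1 (raising a new peak of your own), using the training of a qwen3-0.1b model as an example
Step 1: Pick a model, for example qwen3-0.6b or qwen3-7b. Taking qwen3-0.6b as the example, locate tokenizer.json, tokenizer_config.json, and config.json among the model files.
Step 2: Create a folder, for example qwen3-0.1b, then edit config.json and reduce the parameters that determine the model size, for example: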
{
  "architectures": [
    "Qwen3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "head_dim": 32,
  "hidden_act": "silu",
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 4096,
  "max_window_layers": 28,
  "model_type": "qwen3",
  "num_attention_heads": 8,
  "num_hidden_layers": 6,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000,
  "sliding_window": null,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.51.0",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151936
}
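To see how much a config like the one above shrinks the model, the parameter count can be estimated from the config values alone. The sketch below assumes the usual Qwen3 decoder layout: bias-free q/k/v/o projections, an RMSNorm over head_dim for queries and for keys, a gated MLP with three projections, two RMSNorms per layer, one final norm, and an output head shared with the embedding table when tie_word_embeddings is true. The function name `estimate_qwen3_params` is hypothetical, and the result is an estimate, not a value reported by the framework.

```python
def estimate_qwen3_params(cfg):
    """Rough parameter count from config values (assumed Qwen3 layer layout)."""
    hidden = cfg["hidden_size"]
    head_dim = cfg["head_dim"]
    q_out = cfg["num_attention_heads"] * head_dim
    kv_out = cfg["num_key_value_heads"] * head_dim
    # q_proj + k_proj + v_proj + o_proj, all without bias.
    attention = hidden * q_out + 2 * hidden * kv_out + q_out * hidden
    # One RMSNorm weight vector each for queries and keys.
    qk_norm = 2 * head_dim
    # gate_proj + up_proj + down_proj.
    mlp = 3 * hidden * cfg["intermediate_size"]
    # input_layernorm + post_attention_layernorm.
    layer_norms = 2 * hidden
    per_layer = attention + qk_norm + mlp + layer_norms
    embeddings = cfg["vocab_size"] * hidden
    lm_head = 0 if cfg["tie_word_embeddings"] else embeddings
    final_norm = hidden
    return embeddings + lm_head + cfg["num_hidden_layers"] * per_layer + final_norm


config = {
    "head_dim": 32,
    "hidden_size": 128,
    "intermediate_size": 3072,
    "num_attention_heads": 8,
    "num_hidden_layers": 6,
    "num_key_value_heads": 4,
    "tie_word_embeddings": True,
    "vocab_size": 151936,
}
print(f"{estimate_qwen3_params(config):,} parameters")
```

Under these assumptions the values above come to roughly 27 million parameters, most of them in the embedding table, so hidden_size and num_hidden_layers are the natural knobs to adjust toward a specific target size.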
Step 3: Prepare the data. Each line is one JSON record, for example:
{"text": "According to the description, ..."}
{"text": "For a 60-year-old male patient, ..."}
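Records in the {"text": ...} style shown above can be written with the standard json module, one object per line. This is a minimal sketch; the function name `write_pretrain_jsonl` and the choice to skip blank texts are assumptions for illustration.

```python
import json


def write_pretrain_jsonl(texts, path):
    """Write one {"text": ...} JSON object per line and return the record count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            text = text.strip()
            if not text:
                # Skip empty documents so every line is a usable record.
                continue
            # ensure_ascii=False keeps non-ASCII text readable in the file.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count
```

The resulting file can then be referenced from train_file with data_format set to "json".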
Step 4: Prepare the training arguments file; see arguments.yaml
Step 5: Write the training script; see train.py
Step 6: Run the training:
torchrun --standalone --nnodes=1 --nproc_per_node=1 main.py
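2 Continued pretraining, using the training of a qwen3-0.1b model as an example
Step 1: Pick a model, for example qwen3-0.6b or qwen3-7b. Taking qwen3-0.6b as the example, download all of the model's files and save them in a directory of your choice. For instance, the qwen3-0.1b model produced by pure pretraining above can be used as the starting point for continued pretraining. The remaining steps are the same as above, with Step 2 left out. Note that train.py must be modified; see train.py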
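3 Fine-tuning, using the training of a qwen3-0.1b model as an example
Step 1: Pick a model, for example qwen3-0.6b or qwen3-7b. Taking qwen3-0.6b as the example, download all of the model's files and save them in a directory of your choice. For instance, the qwen3-0.1b model produced by pure pretraining above can be used as the starting point for fine-tuning.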
Step 2: Prepare the data. The prompt + completion format is used here (it is the easiest to manage), one record per line:
{"prompt": [{"role": "user", "content": "What color is the sky?"}],"completion": [{"role": "assistant", "content": "It is blue."}]}
{"prompt": [{"role": "user", "content": "What color is the sky?"}],"completion": [{"role": "assistant", "content": "It is blue."}]}
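If the source data is a list of question/answer pairs, records in the prompt + completion format shown above can be built and checked with a few lines of Python. This is a minimal sketch; the helper names `to_prompt_completion` and `is_valid_record` are hypothetical and only mirror the structure of the sample records.

```python
import json


def to_prompt_completion(question, answer):
    """Wrap one question/answer pair in the prompt + completion structure."""
    return {
        "prompt": [{"role": "user", "content": question}],
        "completion": [{"role": "assistant", "content": answer}],
    }


def is_valid_record(record):
    """Check that both fields are non-empty lists of role/content messages."""
    for field in ("prompt", "completion"):
        messages = record.get(field)
        if not isinstance(messages, list) or not messages:
            return False
        for message in messages:
            if not isinstance(message, dict) or set(message) != {"role", "content"}:
                return False
    return True


record = to_prompt_completion("What color is the sky?", "It is blue.")
print(json.dumps(record, ensure_ascii=False))
```

Writing one such JSON object per line gives a file in the same shape as the sample data.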
Step 3: Prepare the training arguments file; see arguments.yaml
Step 4: Write the training script; see train.py
Step 5: Run the training:
torchrun --standalone --nnodes=1 --nproc_per_node=1 main.py