scarab: llm training paradigm

scarabs platform: a general-purpose model training framework built on transformers.

It supports training on tabular data, text data, and image data, as well as LLM training.

📘 core:

✅ Training on tabular data, for example CTR prediction in recommendation systems
✅ Training on text data, for example text classification
✅ Training on image data, for example image classification
✅ Training of LLMs, for example LLM pretraining

📘 very easy to use

pip install scarabs

📘 In detail

✅ Tabular data: see tabular_ctr in the examples folder
✅ Text data: see llm_classification in the examples folder
✅ LLM: see llm_pretrain in the examples folder

For the full examples, see the GitHub repository: https://github.com/zhu2856061/scarabs

📘 arguments

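ℹ️ task_name_or_path: Task name. All intermediate and final training outputs are written under this directory.

ℹ️ data_format: Data format, one of [text, csv, json, parquet]. Parquet is recommended for tabular data (prepare your data as parquet files ahead of time); JSON is recommended for text data.

ℹ️ train_file: Path to the training data. This can be a directory (all files inside it are read) or a single file.

ℹ️ valid_file: Path to the evaluation data. This can be a directory (all files inside it are read) or a single file.

ℹ️ test_file: Path to the test data. This can be a directory (all files inside it are read) or a single file.

ℹ️ preprocessing_num_workers: Number of worker processes used to preprocess the data in parallel.

ℹ️ labels: The target (Y) columns. ⚠️ This is a list, so that multi-objective models are supported.

ℹ️ load_resume_from_checkpoint: Path to a checkpoint directory. The model is loaded from the checkpoint first, and training then continues from it.

ℹ️ incremental_resume_from_checkpoint: Path to a checkpoint directory, used for incremental training of the embedding layer. A previously trained model has a fixed set of feature values/tokens. When training continues from that model and brand-new feature values/tokens appear, they cannot be recognized and are treated as UNK. Setting this path enables incremental training so that the new values get their own embeddings.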
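The UNK behaviour described for incremental_resume_from_checkpoint can be illustrated with a small, library-independent sketch. The helper names (build_vocab, encode, extend_vocab) and the choice of id 0 for UNK are assumptions made for illustration; they are not part of the scarabs API.

```python
UNK_ID = 0  # assumed convention for this sketch: id 0 is reserved for unknown values


def build_vocab(values):
    """Assign a stable integer id to each distinct feature value."""
    vocab = {"<UNK>": UNK_ID}
    for value in values:
        vocab.setdefault(value, len(vocab))
    return vocab


def encode(vocab, values):
    """Map values to ids; anything unseen collapses to UNK_ID."""
    return [vocab.get(value, UNK_ID) for value in values]


def extend_vocab(vocab, new_values):
    """Return a copy of vocab with unseen values appended after the old ids."""
    extended = dict(vocab)
    for value in new_values:
        extended.setdefault(value, len(extended))
    return extended


old_vocab = build_vocab(["action", "comedy", "drama"])
new_batch = ["comedy", "horror", "drama", "anime"]

# With the frozen vocabulary, the two new genres are indistinguishable.
print(encode(old_vocab, new_batch))  # [2, 0, 3, 0]

# With an extended vocabulary, old ids are unchanged and new values get fresh ids.
new_vocab = extend_vocab(old_vocab, new_batch)
print(encode(new_vocab, new_batch))  # [2, 4, 3, 5]
```

Keeping the old ids stable is what lets the previously learned embedding rows be reused; only the rows for the new ids need to be learned.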

🔔 CTR training [update]

Normal training

[See examples/tabular]
|_ arguments.yaml  parameters required for training
|_ config.json  model parameters
|_ main.py  main training program

The parameters in arguments.yaml are set as follows:

task_name_or_path: "encode"

overwrite_output_dir: true
output_dir: "model"

# data
data_format: "csv"
train_file: "../data/movielens/train"
valid_file: "../data/movielens/valid"
preprocessing_num_workers: 2

# model
# load_resume_from_checkpoint: "./encode/model/checkpoint-1029"
# incremental_resume_from_checkpoint: "./encode/model/checkpoint-1029"

# runtimes metric
do_train: true
seed: 2025
use_cpu: false
report_to: "tensorboard"
save_safetensors: true
save_total_limit: 1
early_stopping_patience: 3
early_stopping_threshold: 1.0e-7
remove_unused_columns: false
metric_for_best_model: "eval_roc_auc"
greater_is_better: true

# optim
optim: "adamw_torch"
learning_rate: 1.0e-3
lr_scheduler_type: "reduce_lr_on_plateau"
lr_scheduler_kwargs: 
  mode: "max"
  factor: 0.1
  patience: 1
  verbose: true
weight_decay: 0
max_grad_norm: 10.0
gradient_accumulation_steps: 1

# data
label_names: ["label"]
per_device_train_batch_size: 4096
per_device_eval_batch_size: 4096
dataloader_num_workers: 4

# view
eval_strategy: "epoch"
logging_strategy: "epoch"
save_strategy: "epoch"
load_best_model_at_end: True
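To make the lr_scheduler_kwargs above concrete (mode "max", factor 0.1, patience 1), here is a simplified pure-Python model of a reduce-on-plateau rule. It is a sketch only: it ignores the threshold, cooldown, and minimum learning rate options of the real PyTorch scheduler, and the function name simulate_plateau_schedule and the metric values are made up for illustration.

```python
def simulate_plateau_schedule(metrics, lr=1.0e-3, factor=0.1, patience=1):
    """Return the learning rate in effect after each evaluation (mode="max")."""
    best = float("-inf")
    bad_epochs = 0
    history = []
    for metric in metrics:
        if metric > best:
            # The monitored metric improved: remember it and reset the counter.
            best = metric
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            # More than `patience` evaluations without improvement: shrink the rate.
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


# Hypothetical eval_roc_auc values, one per epoch.
aucs = [0.70, 0.72, 0.71, 0.71, 0.73, 0.72, 0.72]
for epoch, rate in enumerate(simulate_plateau_schedule(aucs), start=1):
    print(f"epoch {epoch}: lr={rate:.0e}")
```

With patience 1, the rate drops only after two consecutive evaluations without a new best value, so in this example it falls to 1e-4 after epoch 4 and to 1e-5 after epoch 7.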

For config.json, refer to the config of the specific model you are using [scarabs/nova/models].

main.py is as follows:

from __future__ import absolute_import, division, print_function
import os
from transformers.hf_argparser import HfArgumentParser
from scarabs.nova.models.ctr_with_dnn import CtrWithDNN, CtrWithDNNConfig
from scarabs.task_factory import TaskArguments, TaskFactoryWithTabularCtr


def feature_engineering(args):
    config = CtrWithDNNConfig.from_pretrained("config.json")
    task = TaskFactoryWithTabularCtr(args, config=config)
    task.create_feature2meta_in_config()

def train(args):
    # # Train
    config = CtrWithDNNConfig.from_pretrained(
        os.path.join(
            args.task_name_or_path,
            "data/meta/config.json",
        )
    )
    task = TaskFactoryWithTabularCtr(args, config)
    task.train(model=CtrWithDNN(config))

def continue_train(args):
    # # Train
    config = CtrWithDNNConfig.from_pretrained(
        os.path.join(
            args.load_resume_from_checkpoint,
            "config.json",
        )
    )
    task = TaskFactoryWithTabularCtr(args, config)
    task.train(model=CtrWithDNN(config))

def incremental_continue_feature_engineering(args):
    config = CtrWithDNNConfig.from_pretrained(
        os.path.join(
            args.incremental_resume_from_checkpoint,
            "config.json",
        )
    )
    task = TaskFactoryWithTabularCtr(args, config=config)
    task.create_feature2meta_in_config()

def incremental_continue_train(args):
    # # Train
    config = CtrWithDNNConfig.from_pretrained(
        os.path.join(
            args.task_name_or_path,
            "data/meta/config.json",
        )
    )
    task = TaskFactoryWithTabularCtr(args, config)
    task.train(model=CtrWithDNN(config))

def eval():
    # Predict
    task = TaskFactoryWithTabularCtr()
    model_path = "./encode/model"
    task.inference_with_load_model(model_path, CtrWithDNN)

    import pandas as pd
    from sklearn.metrics import roc_auc_score

    preds = []
    label = []
    ds = pd.read_csv("../../data/movielens/valid/valid.csv")
    for line in ds.to_dict("records"):
        label.append(line["label"])
        res = task.inference(X=line)
        preds.append(res["logits"][0].item())
    print(roc_auc_score(label, preds))

if __name__ == "__main__":
    parser = HfArgumentParser(TaskArguments)  # type: ignore
    args = parser.parse_yaml_file("arguments.yaml")[0]
    # # feature
    feature_engineering(args)
    # # Train
    train(args)

Continued training

[See examples/tabular]
|_ arguments.yaml  parameters required for training
|_ model/  model folder
   - checkpoint-****  checkpoint
     - config.json and models.safetensors
|_ main.py  main training program

The parameters in arguments.yaml are set as follows:

task_name_or_path: "encode"

overwrite_output_dir: true
output_dir: "model"

# data
data_format: "csv"
train_file: "../data/movielens/train"
valid_file: "../data/movielens/valid"
preprocessing_num_workers: 2

# model
load_resume_from_checkpoint: "./model/checkpoint-1029"
# incremental_resume_from_checkpoint: "./encode/model/checkpoint-1029"

# runtimes metric
do_train: true
seed: 2025
use_cpu: false
report_to: "tensorboard"
save_safetensors: true
save_total_limit: 1
early_stopping_patience: 3
early_stopping_threshold: 1.0e-7
remove_unused_columns: false
metric_for_best_model: "eval_roc_auc"
greater_is_better: true

# optim
optim: "adamw_torch"
learning_rate: 1.0e-3
lr_scheduler_type: "reduce_lr_on_plateau"
lr_scheduler_kwargs: 
  mode: "max"
  factor: 0.1
  patience: 1
  verbose: true
weight_decay: 0
max_grad_norm: 10.0
gradient_accumulation_steps: 1

# data
label_names: ["label"]
per_device_train_batch_size: 4096
per_device_eval_batch_size: 4096
dataloader_num_workers: 4

# view
eval_strategy: "epoch"
logging_strategy: "epoch"
save_strategy: "epoch"
load_best_model_at_end: True
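The early_stopping_patience and early_stopping_threshold values in the configuration above can be read as follows, assuming they act like the patience and threshold of a typical early-stopping callback with greater_is_better set to true. This is a simplified sketch, not the scarabs or transformers implementation; epochs_until_stop and the sample metric values are hypothetical.

```python
def epochs_until_stop(metrics, patience=3, threshold=1.0e-7):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = None
    stale = 0
    for epoch, metric in enumerate(metrics, start=1):
        if best is None or metric - best > threshold:
            # Improvement larger than the threshold: reset the patience counter.
            best = metric
            stale = 0
        else:
            stale += 1
        if stale >= patience:
            return epoch
    return None


# Hypothetical eval_roc_auc values: the best value (0.74) is reached at epoch 2.
print(epochs_until_stop([0.70, 0.74, 0.74, 0.73, 0.739]))  # 5
print(epochs_until_stop([0.70, 0.71, 0.72]))               # None
```

In the first run, epochs 3 to 5 all fail to beat 0.74 by more than the threshold, so training stops at epoch 5.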

main.py is as follows:

from __future__ import absolute_import, division, print_function
import os
from transformers.hf_argparser import HfArgumentParser
from scarabs.nova.models.ctr_with_dnn import CtrWithDNN, CtrWithDNNConfig
from scarabs.task_factory import TaskArguments, TaskFactoryWithTabularCtr


def feature_engineering(args):
    config = CtrWithDNNConfig.from_pretrained("config.json")
    task = TaskFactoryWithTabularCtr(args, config=config)
    task.create_feature2meta_in_config()

def train(args):
    # # Train
    config = CtrWithDNNConfig.from_pretrained(
        os.path.join(
            args.task_name_or_path,
            "data/meta/config.json",
        )
    )
    task = TaskFactoryWithTabularCtr(args, config)
    task.train(model=CtrWithDNN(config))

def continue_train(args):
    # # Train
    config = CtrWithDNNConfig.from_pretrained(
        os.path.join(
            args.load_resume_from_checkpoint,
            "config.json",
        )
    )
    task = TaskFactoryWithTabularCtr(args, config)
    task.train(model=CtrWithDNN(config))

def incremental_continue_feature_engineering(args):
    config = CtrWithDNNConfig.from_pretrained(
        os.path.join(
            args.incremental_resume_from_checkpoint,
            "config.json",
        )
    )
    task = TaskFactoryWithTabularCtr(args, config=config)
    task.create_feature2meta_in_config()

def incremental_continue_train(args):
    # # Train
    config = CtrWithDNNConfig.from_pretrained(
        os.path.join(
            args.task_name_or_path,
            "data/meta/config.json",
        )
    )
    task = TaskFactoryWithTabularCtr(args, config)
    task.train(model=CtrWithDNN(config))

def eval():
    # Predict
    task = TaskFactoryWithTabularCtr()
    model_path = "./encode/model"
    task.inference_with_load_model(model_path, CtrWithDNN)

    import pandas as pd
    from sklearn.metrics import roc_auc_score

    preds = []
    label = []
    ds = pd.read_csv("../../data/movielens/valid/valid.csv")
    for line in ds.to_dict("records"):
        label.append(line["label"])
        res = task.inference(X=line)
        preds.append(res["logits"][0].item())
    print(roc_auc_score(label, preds))

if __name__ == "__main__":
    parser = HfArgumentParser(TaskArguments)  # type: ignore
    args = parser.parse_yaml_file("arguments.yaml")[0]
    # # Train
    continue_train(args)

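Continued training with incremental input-id embeddings

[See examples/tabular]
|_ arguments.yaml  parameters required for training
|_ model/  model folder
   - checkpoint-****  checkpoint
     - config.json and models.safetensors
|_ main.py  main training program

The parameters in arguments.yaml are set as follows: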

task_name_or_path: "encode"

overwrite_output_dir: true
output_dir: "model"

# data
data_format: "csv"
train_file: "../data/movielens/train"
valid_file: "../data/movielens/valid"
preprocessing_num_workers: 2

# model
# load_resume_from_checkpoint: "./model/checkpoint-1029"
incremental_resume_from_checkpoint: "./encode/model/checkpoint-1029"

# runtimes metric
do_train: true
seed: 2025
use_cpu: false
report_to: "tensorboard"
save_safetensors: true
save_total_limit: 1
early_stopping_patience: 3
early_stopping_threshold: 1.0e-7
remove_unused_columns: false
metric_for_best_model: "eval_roc_auc"
greater_is_better: true

# optim
optim: "adamw_torch"
learning_rate: 1.0e-3
lr_scheduler_type: "reduce_lr_on_plateau"
lr_scheduler_kwargs: 
  mode: "max"
  factor: 0.1
  patience: 1
  verbose: true
weight_decay: 0
max_grad_norm: 10.0
gradient_accumulation_steps: 1

# data
label_names: ["label"]
per_device_train_batch_size: 4096
per_device_eval_batch_size: 4096
dataloader_num_workers: 4

# view
eval_strategy: "epoch"
logging_strategy: "epoch"
save_strategy: "epoch"
load_best_model_at_end: True

main.py is as follows:

from __future__ import absolute_import, division, print_function
import os
from transformers.hf_argparser import HfArgumentParser
from scarabs.nova.models.ctr_with_dnn import CtrWithDNN, CtrWithDNNConfig
from scarabs.task_factory import TaskArguments, TaskFactoryWithTabularCtr


def feature_engineering(args):
    config = CtrWithDNNConfig.from_pretrained("config.json")
    task = TaskFactoryWithTabularCtr(args, config=config)
    task.create_feature2meta_in_config()

def train(args):
    # # Train
    config = CtrWithDNNConfig.from_pretrained(
        os.path.join(
            args.task_name_or_path,
            "data/meta/config.json",
        )
    )
    task = TaskFactoryWithTabularCtr(args, config)
    task.train(model=CtrWithDNN(config))

def continue_train(args):
    # # Train
    config = CtrWithDNNConfig.from_pretrained(
        os.path.join(
            args.load_resume_from_checkpoint,
            "config.json",
        )
    )
    task = TaskFactoryWithTabularCtr(args, config)
    task.train(model=CtrWithDNN(config))

def incremental_continue_feature_engineering(args):
    config = CtrWithDNNConfig.from_pretrained(
        os.path.join(
            args.incremental_resume_from_checkpoint,
            "config.json",
        )
    )
    task = TaskFactoryWithTabularCtr(args, config=config)
    task.create_feature2meta_in_config()

def incremental_continue_train(args):
    # # Train
    config = CtrWithDNNConfig.from_pretrained(
        os.path.join(
            args.task_name_or_path,
            "data/meta/config.json",
        )
    )
    task = TaskFactoryWithTabularCtr(args, config)
    task.train(model=CtrWithDNN(config))

def eval():
    # Predict
    task = TaskFactoryWithTabularCtr()
    model_path = "./encode/model"
    task.inference_with_load_model(model_path, CtrWithDNN)

    import pandas as pd
    from sklearn.metrics import roc_auc_score

    preds = []
    label = []
    ds = pd.read_csv("../../data/movielens/valid/valid.csv")
    for line in ds.to_dict("records"):
        label.append(line["label"])
        res = task.inference(X=line)
        preds.append(res["logits"][0].item())
    print(roc_auc_score(label, preds))

if __name__ == "__main__":
    parser = HfArgumentParser(TaskArguments)  # type: ignore
    args = parser.parse_yaml_file("arguments.yaml")[0]
    # # Train
    incremental_continue_feature_engineering(args)
    incremental_continue_train(args)

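During incremental training, the training log reports which matrices of the incremental model have changed. Watch for the following messages:

logger.warning(f"{v} shape mismatched, current: {model_dict[v].shape} != history:{state_dict[k].shape}")

This message reports that the current model matrix and the history model matrix have different shapes.

logger.warning(f"{key} is updated from history:{history_size} to current:{current_size}")

This message reports that the history model matrix has been resized to the new matrix size.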
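One way to picture the resize reported in these log messages is the sketch below: the rows learned by the history model are copied unchanged, and rows for the new ids are freshly initialized. This is an illustration under assumptions, not the scarabs implementation; grow_embedding, the Gaussian initialization, and its 0.02 standard deviation are all hypothetical choices.

```python
import random


def grow_embedding(history_rows, current_size, dim, seed=2025):
    """Copy history rows into a table of current_size rows; new rows are random."""
    rng = random.Random(seed)
    history_size = len(history_rows)
    if current_size < history_size:
        raise ValueError("current table must not be smaller than the history table")
    # Rows for ids the history model already knew are kept as they were.
    table = [list(row) for row in history_rows]
    # Rows for newly seen ids start from a small random initialization.
    for _ in range(current_size - history_size):
        table.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    if current_size != history_size:
        print(f"embedding is updated from history:{history_size} to current:{current_size}")
    return table


history = [[0.1, 0.2], [0.3, 0.4]]  # 2 known ids, embedding dimension 2
current = grow_embedding(history, current_size=4, dim=2)
print(len(current), current[:2] == history)
```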

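🔔 LLM training [update]

1 Pure pretraining, from 0 to 1 (starting a new model from scratch), using a qwen3-0.1b model as the example

Step one: pick a base model, for example qwen3-0.6b or qwen3-7b. Taking qwen3-0.6b as the example, locate tokenizer.json, tokenizer_config.json, and config.json among the model files.

Step two: create a folder, for example qwen3-0.1b, then edit config.json and reduce the parameters that determine the model size, for example: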

{
  "architectures": [
    "Qwen3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "head_dim": 32,
  "hidden_act": "silu",
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 4096,
  "max_window_layers": 28,
  "model_type": "qwen3",
  "num_attention_heads": 8,
  "num_hidden_layers": 6,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000,
  "sliding_window": null,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.51.0",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151936
}
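To see how these settings translate into model size, the following sketch estimates the parameter count from the config values. It assumes the usual Qwen3 decoder layout (bias-free q/k/v/o projections, per-head RMSNorm on queries and keys, a gated MLP with three projections, two RMSNorms per layer, one final norm, and tied input/output embeddings); count_qwen3_params is a hypothetical helper, and the result is an estimate rather than an official figure.

```python
def count_qwen3_params(cfg):
    """Estimate the parameter count of a Qwen3-style decoder from its config."""
    h = cfg["hidden_size"]
    d = cfg["head_dim"]
    q_dim = cfg["num_attention_heads"] * d
    kv_dim = cfg["num_key_value_heads"] * d
    attention = h * q_dim + 2 * h * kv_dim + q_dim * h  # q, k, v, o projections
    qk_norm = 2 * d                                     # RMSNorm on queries and keys
    mlp = 3 * h * cfg["intermediate_size"]              # gate, up, down projections
    layer_norms = 2 * h                                 # input and post-attention norms
    per_layer = attention + qk_norm + mlp + layer_norms
    embeddings = cfg["vocab_size"] * h
    lm_head = 0 if cfg["tie_word_embeddings"] else embeddings
    final_norm = h
    return embeddings + cfg["num_hidden_layers"] * per_layer + final_norm + lm_head


config = {
    "head_dim": 32,
    "hidden_size": 128,
    "intermediate_size": 3072,
    "num_attention_heads": 8,
    "num_hidden_layers": 6,
    "num_key_value_heads": 4,
    "tie_word_embeddings": True,
    "vocab_size": 151936,
}

total = count_qwen3_params(config)
print(f"{total:,}")  # 27,117,568
```

Under these assumptions the example config comes to roughly 27 million parameters, about 19.4 million of which sit in the embedding table, so the folder name qwen3-0.1b is best read as a label rather than an exact size.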

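Step three: prepare the data. Each line is one JSON record of the following form:

{"text": "According to the description, ..."}
{"text": "For a 60-year-old male patient, ..."}

Step four: prepare the training arguments file; see arguments.yaml.

Step five: write the training script; see train.py.

Step six: run the training: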

torchrun --standalone --nnodes=1 --nproc_per_node=1 main.py
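Before launching a run, it can help to check that the pretraining data really has one JSON object per line with a string "text" field, as described in the data preparation step. The sketch below uses only the standard library; to_jsonl and read_jsonl are hypothetical helper names and are not part of scarabs.

```python
import json


def to_jsonl(texts):
    """Serialize raw strings as one {"text": ...} JSON object per line."""
    return "\n".join(json.dumps({"text": t}, ensure_ascii=False) for t in texts)


def read_jsonl(payload):
    """Parse JSON Lines and check that every record has a string "text" field."""
    records = []
    for line_no, line in enumerate(payload.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if not isinstance(record, dict) or not isinstance(record.get("text"), str):
            raise ValueError(f"line {line_no}: expected an object with a string 'text' field")
        records.append(record)
    return records


payload = to_jsonl(["According to the description, ...", "For a 60-year-old male patient, ..."])
print(payload)
print(len(read_jsonl(payload)))
```

Passing ensure_ascii=False keeps non-ASCII text readable in the file instead of escaping it.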

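2 Continued pretraining, using a qwen3-0.1b model as the example

Step one: pick a model, for example qwen3-0.6b or qwen3-7b. Taking qwen3-0.6b as the example, download all of the model's files and save them in a directory of your choice. The qwen3-0.1b model produced by pure pretraining can also be used here as the starting point for continued pretraining.

The remaining steps are the same as for pure pretraining, except that step two (editing config.json) is skipped. Note that train.py needs to be modified for this case; see train.py.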

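3 Fine-tuning, using a qwen3-0.1b model as the example

Step one: pick a model, for example qwen3-0.6b or qwen3-7b. Taking qwen3-0.6b as the example, download all of the model's files and save them in a directory of your choice. The qwen3-0.1b model produced by pure pretraining can also be used here as the starting point for fine-tuning.

Step two: prepare the data. The prompt + completion format is used here because it is the easiest to manage. Each line is one JSON record:

{"prompt": [{"role": "user", "content": "What color is the sky?"}],"completion": [{"role": "assistant", "content": "It is blue."}]}
{"prompt": [{"role": "user", "content": "What color is the sky?"}],"completion": [{"role": "assistant", "content": "It is blue."}]}

Step three: prepare the training arguments file; see arguments.yaml.

Step four: write the training script; see train.py.

Step five: run the training: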

torchrun --standalone --nnodes=1 --nproc_per_node=1 main.py
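The prompt + completion records used for fine-tuning are easy to get subtly wrong, for example with a missing role or an empty message list. A small standard-library check is sketched below; validate_record and the allowed role set are assumptions for illustration and are not part of scarabs.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}  # assumed role set for this sketch


def validate_record(record):
    """Check one prompt/completion record; raise ValueError on the first problem."""
    for field in ("prompt", "completion"):
        messages = record.get(field)
        if not isinstance(messages, list) or not messages:
            raise ValueError(f"'{field}' must be a non-empty list of messages")
        for message in messages:
            if not isinstance(message, dict):
                raise ValueError(f"'{field}' entries must be objects")
            if message.get("role") not in ALLOWED_ROLES:
                raise ValueError(f"unexpected role in '{field}': {message.get('role')!r}")
            if not isinstance(message.get("content"), str):
                raise ValueError(f"'{field}' message content must be a string")
    if record["completion"][-1]["role"] != "assistant":
        raise ValueError("the completion should end with an assistant message")
    return True


line = (
    '{"prompt": [{"role": "user", "content": "What color is the sky?"}],'
    '"completion": [{"role": "assistant", "content": "It is blue."}]}'
)
print(validate_record(json.loads(line)))  # True
```

Running such a check over every line of the training file before starting torchrun catches malformed records early, before any GPU time is spent.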

Release history

0.1.2 (this version): Jul 7, 2025
0.1.1: Jul 7, 2025
0.1.0: Jul 7, 2025
0.0.6: Apr 25, 2025
0.0.5: Mar 31, 2025
0.0.4: Mar 14, 2025
0.0.3: Feb 26, 2025
0.0.2: Sep 19, 2024
0.0.1: Sep 19, 2024

Download files

No source distribution files are available for this release.

Built Distribution

scarabs-0.1.2-py3-none-any.whl (306.6 kB view details)

Uploaded Jul 7, 2025 Python 3

File details

Details for the file scarabs-0.1.2-py3-none-any.whl.

File metadata

Download URL: scarabs-0.1.2-py3-none-any.whl
Upload date: Jul 7, 2025
Size: 306.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.2

File hashes

Hashes for scarabs-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d155112094ed2ca73b61a1b3ab8ac75768493065c0aa2f058abaa3de50d3b487`
MD5	`c1928bd02c181c1fe68fa57a011abe08`
BLAKE2b-256	`829c0f87be70b7b173dc3ac06160c40ab420313a743047757750f25b26addcdf`
