將中文純文字切割並透過 Ollama LLM 自動生成 QA 問答對,供 LLM 微調使用
Project description
zhqa-generator
將中文純文字切割並透過 Ollama LLM 自動生成 QA 問答對,供 LLM 微調使用。
安裝
pip install zhqa-generator
前置需求:需要在本機執行 Ollama 並拉取所需模型:
ollama pull gemma3:4b
快速開始
Python API
from zhqa_generator import chunk_text, generate_qa_pairs
# 讀取文章
text = open("article.txt", encoding="utf-8").read()
# Step 1:切割文本(每塊 200~500 字)
chunks = chunk_text(text, chunk_min=200, chunk_max=500)
print(f"共切割成 {len(chunks)} 個區塊")
# Step 2:呼叫 Ollama 生成 QA 對
pairs = generate_qa_pairs(chunks, model="gemma3:4b")
# Step 3:儲存為 JSONL
import json
with open("output.jsonl", "w", encoding="utf-8") as f:
for pair in pairs:
f.write(json.dumps(pair, ensure_ascii=False) + "\n")
print(f"共生成 {len(pairs)} 組問答對")
輸出範例:
{"prompt": "手沖咖啡需要哪些器具?", "completion": "手沖咖啡需要濾杯、濾紙、手沖壺、電子秤等器具。"}
{"prompt": "如何控制手沖咖啡的水溫?", "completion": "建議水溫在 80-90°C 之間,淺焙用較高溫,深焙用較低溫。"}
命令列(CLI)
# 基本用法
zhqa-generate --input article.txt --output output.jsonl
# 自訂模型與切割大小
zhqa-generate --input article.txt --output output.jsonl \
--model llama3:8b \
--chunk-min 300 --chunk-max 600
CLI 參數一覽
| 參數 | 短名 | 預設值 | 說明 |
|---|---|---|---|
--input |
-i |
(必填) | 輸入純文字檔 |
--output |
-o |
output.jsonl |
輸出 JSONL 檔路徑 |
--model |
-m |
gemma3:4b |
Ollama 模型名稱 |
--chunk-min |
— | 200 |
區塊最小字元數 |
--chunk-max |
— | 500 |
區塊最大字元數 |
API 參考
chunk_text(text, chunk_min=200, chunk_max=500) → list[str]
將長文本依段落、句子智慧切割為適合 LLM 的區塊。
generate_qa_for_chunk(chunk, model="gemma3:4b", system_prompt=...) → list[dict]
對單一區塊呼叫 Ollama,回傳 [{"prompt": ..., "completion": ...}]。
generate_qa_pairs(chunks, model="gemma3:4b", show_progress=True) → list[dict]
批次處理所有區塊,附帶 tqdm 進度條。
使用情境
- 將書籍、文章、論文轉換為 LLM 微調資料集(JSONL 格式)
- 搭配
transformers+peft做 LoRA 微調 - 快速產生繁體中文問答對做評測資料
License
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
zhqa_generator-0.1.0.tar.gz
(7.7 kB
view details)
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file zhqa_generator-0.1.0.tar.gz.
File metadata
- Download URL: zhqa_generator-0.1.0.tar.gz
- Upload date:
- Size: 7.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
74906e3e780a629ed679fab060b6c203c688b2d0e8b6ecf3f9395c05d8a044e1
|
|
| MD5 |
3d03fe5c19884f7173f60f72fba2eb2d
|
|
| BLAKE2b-256 |
ca66d36c5185bc49af8a137fe3de373238db7413554a476e886d29ab06770df4
|
File details
Details for the file zhqa_generator-0.1.0-py3-none-any.whl.
File metadata
- Download URL: zhqa_generator-0.1.0-py3-none-any.whl
- Upload date:
- Size: 8.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
855d259925e06f6a4df1c59843d35920ba43bb80c6c351a2a80d8c2d93045d88
|
|
| MD5 |
5181853f3d014fb7d2d1d55e46fbbbec
|
|
| BLAKE2b-256 |
e4d8549ce23a84044e00fca6d09b04c6fe2935c375cd596e89b90bbe144960e5
|