Skip to main content

將中文純文字切割並透過 Ollama LLM 自動生成 QA 問答對,供 LLM 微調使用

Project description

zhqa-generator

將中文純文字切割並透過 Ollama LLM 自動生成 QA 問答對,供 LLM 微調使用。

PyPI version Python License: MIT

安裝

pip install zhqa-generator

前置需求:需要在本機執行 Ollama 並拉取所需模型:

ollama pull gemma3:4b

快速開始

Python API

from zhqa_generator import chunk_text, generate_qa_pairs

# 讀取文章
text = open("article.txt", encoding="utf-8").read()

# Step 1:切割文本(每塊 200~500 字)
chunks = chunk_text(text, chunk_min=200, chunk_max=500)
print(f"共切割成 {len(chunks)} 個區塊")

# Step 2:呼叫 Ollama 生成 QA 對
pairs = generate_qa_pairs(chunks, model="gemma3:4b")

# Step 3:儲存為 JSONL
import json
with open("output.jsonl", "w", encoding="utf-8") as f:
    for pair in pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")

print(f"共生成 {len(pairs)} 組問答對")

輸出範例:

{"prompt": "手沖咖啡需要哪些器具?", "completion": "手沖咖啡需要濾杯、濾紙、手沖壺、電子秤等器具。"}
{"prompt": "如何控制手沖咖啡的水溫?", "completion": "建議水溫在 80-90°C 之間,淺焙用較高溫,深焙用較低溫。"}

命令列(CLI)

# 基本用法
zhqa-generate --input article.txt --output output.jsonl

# 自訂模型與切割大小
zhqa-generate --input article.txt --output output.jsonl \
              --model llama3:8b \
              --chunk-min 300 --chunk-max 600

CLI 參數一覽

參數 短名 預設值 說明
--input -i (必填) 輸入純文字檔
--output -o output.jsonl 輸出 JSONL 檔路徑
--model -m gemma3:4b Ollama 模型名稱
--chunk-min 200 區塊最小字元數
--chunk-max 500 區塊最大字元數

API 參考

chunk_text(text, chunk_min=200, chunk_max=500) → list[str]

將長文本依段落、句子智慧切割為適合 LLM 的區塊。

generate_qa_for_chunk(chunk, model="gemma3:4b", system_prompt=...) → list[dict]

對單一區塊呼叫 Ollama,回傳 [{"prompt": ..., "completion": ...}]

generate_qa_pairs(chunks, model="gemma3:4b", show_progress=True) → list[dict]

批次處理所有區塊,附帶 tqdm 進度條。


使用情境

  • 將書籍、文章、論文轉換為 LLM 微調資料集(JSONL 格式)
  • 搭配 transformers + peft 做 LoRA 微調
  • 快速產生繁體中文問答對做評測資料

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

zhqa_generator-0.1.0.tar.gz (7.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

zhqa_generator-0.1.0-py3-none-any.whl (8.4 kB view details)

Uploaded Python 3

File details

Details for the file zhqa_generator-0.1.0.tar.gz.

File metadata

  • Download URL: zhqa_generator-0.1.0.tar.gz
  • Upload date:
  • Size: 7.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for zhqa_generator-0.1.0.tar.gz
Algorithm Hash digest
SHA256 74906e3e780a629ed679fab060b6c203c688b2d0e8b6ecf3f9395c05d8a044e1
MD5 3d03fe5c19884f7173f60f72fba2eb2d
BLAKE2b-256 ca66d36c5185bc49af8a137fe3de373238db7413554a476e886d29ab06770df4

See more details on using hashes here.

File details

Details for the file zhqa_generator-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: zhqa_generator-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 8.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for zhqa_generator-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 855d259925e06f6a4df1c59843d35920ba43bb80c6c351a2a80d8c2d93045d88
MD5 5181853f3d014fb7d2d1d55e46fbbbec
BLAKE2b-256 e4d8549ce23a84044e00fca6d09b04c6fe2935c375cd596e89b90bbe144960e5

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page