A configuration-driven management manual generation framework based on Kedro pipelines with Polars and Typst.

ManualForge

A configuration-driven framework for generating management manuals. Define your data sources, fields, and templates in YAML, and get a formatted report.

Built on Kedro pipelines, with Polars for data processing and Typst for document rendering.

ManualForge is a reusable Python package (pip install manualforge). For a real-world downstream application, see modelmanual, a Chinese regulatory manual generator that extends ManualForge with rule-code field mapping and reconciliation reporting.

Philosophy

ManualForge separates what you want to produce from how it is produced.

  • What: Defined in conf/base/parameters_manualforge.yml: your data sources, expected columns, standardization rules, sort orders, summary dimensions, and report templates.
  • How: Implemented by the pipeline nodes, which are reusable data processing functions that read their behavior from your config.

To create a new manual for a different domain, you only need to edit the config file (and optionally provide new templates). No Python code changes are required.
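
The split can be illustrated with a minimal sketch. It is not ManualForge's own code: the `clean` function and the `config` dictionary are hypothetical, and the two keys merely borrow the names `fill_null` and `deduplicate` from the cleaning options in the configuration guide. The point is that the same function behaves differently depending only on the config it receives.

```python
# Hypothetical "what": in ManualForge this would come from YAML, not Python.
config = {"fill_null": "forward", "deduplicate": True}


def clean(rows, config):
    """Hypothetical "how": a generic node whose behavior is set by config."""
    out = []
    last_seen = {}
    for row in rows:
        row = dict(row)  # copy so the caller's data is left untouched
        if config.get("fill_null") == "forward":
            for key, value in row.items():
                if value is None:
                    row[key] = last_seen.get(key)  # reuse the previous value
                else:
                    last_seen[key] = value
        if config.get("deduplicate") and row in out:
            continue  # skip exact duplicates of an earlier row
        out.append(row)
    return out


rows = [
    {"dept": "HR", "field": "salary"},
    {"dept": None, "field": "grade"},   # e.g. a merged Excel cell read as null
    {"dept": "HR", "field": "salary"},  # exact duplicate of the first row
]
cleaned = clean(rows, config)   # forward-filled and deduplicated
untouched = clean(rows, {})     # empty config: rows pass through unchanged
```

Switching behavior here means editing `config`, not `clean`, which is the property the framework aims for.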

Features

| Capability | Description |
| --- | --- |
| Multi-sheet Excel ingestion | Auto-detects headers, filters out cover sheets, and merges sheets into structured DataFrames. |
| Field standardization | Mapping files, exact matching, and fuzzy matching (difflib or duckdb). |
| Config-driven summaries | Group-by dimensions, sort orders, ability categories, and output paths are defined in YAML. |
| Typst report generation | Jinja2 templates → Typst source → PDF compilation. |
| Pipeline hooks | Shell command hooks at pipeline or node granularity for pre- and post-processing. |
| Auto-backup | Pre-run config snapshot and post-run data backup, triggered via hooks. |
| Config deploy | cfg-backup and cfg-deploy back up, restore, and deploy configs from templates. |

Quick Start

# 1. Install dependencies
pip install -r requirements.txt

# 2. Copy and customize configuration
#    Option A: interactive deployment
./scripts/cfg-deploy --from-examples

#    Option B: manual copy
cp conf/examples/parameters_manualforge.yml.example conf/base/parameters_manualforge.yml
cp conf/examples/catalog.yml.example          conf/base/catalog.yml
cp conf/examples/hooks.yml.example            conf/base/hooks.yml
cp conf/examples/parameters.yml.example       conf/base/parameters.yml
cp conf/examples/credentials.yml.example      conf/local/credentials.yml

# 3. Edit the config files to point to your data sources
#    (conf/base/ is gitignored, so your real configs stay local)

# 4. Run the pipeline
kedro run

# Run specific node groups
kedro run --tags conversion        # Excel → Parquet only
kedro run --tags standardization   # Standardization only
kedro run --tags csv               # Summary tables only

Backup & Config Management

Auto-backup via hooks (runs on every kedro run):

kedro run
  ├─ [before_pipeline]  cfg-backup      ← snapshot conf/base/
  └─ [after_pipeline]   backup_data.sh  ← snapshot pipeline output data
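
The before/after wrapping shown above can be sketched as follows. This is an illustration only: the schema of hooks.yml is not shown in this README, so the `HOOKS` table, `run_hooks`, and `run_pipeline` are hypothetical names, and small Python subprocesses stand in for the real cfg-backup and backup_data.sh scripts so that the sketch runs anywhere.

```python
import subprocess
import sys

# Hypothetical hook table; the real hooks.yml schema is not shown here.
# Each command is an argument list, so no shell quoting is involved.
HOOKS = {
    "before_pipeline": [[sys.executable, "-c", "print('snapshot conf/base/')"]],
    "after_pipeline": [[sys.executable, "-c", "print('snapshot output data')"]],
}


def run_hooks(stage, hooks=HOOKS):
    """Run every command registered for a stage and collect its stdout."""
    outputs = []
    for command in hooks.get(stage, []):
        # check=True makes a failing hook raise instead of passing silently.
        result = subprocess.run(command, capture_output=True, text=True, check=True)
        outputs.append(result.stdout.strip())
    return outputs


def run_pipeline(body):
    """Wrap a pipeline body with the before and after hooks."""
    log = run_hooks("before_pipeline")
    log.append(body())
    log.extend(run_hooks("after_pipeline"))
    return log


log = run_pipeline(lambda: "pipeline ran")
```

With `check=True`, a failing pre-run backup stops the run before any data is touched; whether ManualForge makes the same choice is not stated here.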

Manual backup/restore/deploy:

# Config backup
./scripts/cfg-backup              # backup conf/base/ → conf/.backups/
./scripts/cfg-backup -l           # list existing backups

# Config restore / deploy from examples
./scripts/cfg-deploy                        # interactive menu
./scripts/cfg-deploy -l                     # list config backups
./scripts/cfg-deploy -r 20260617_105645     # restore specific backup
./scripts/cfg-deploy --from-examples         # deploy fresh templates
./scripts/cfg-deploy --from-examples --dry-run  # preview

# Data backup
./scripts/backup_data.sh          # backup pipeline output → data/.backups/
./scripts/backup_data.sh -k 5     # keep only last 5 backups
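
A retention policy like the one behind -k can be sketched in a few lines. This is not the logic of backup_data.sh, whose source is not shown here. The sketch assumes that each backup is a directory named with a sortable timestamp such as 20260617_105645, and `prune_backups` is a hypothetical helper.

```python
import shutil
import tempfile
from pathlib import Path


def prune_backups(backup_root, keep):
    """Delete all but the `keep` newest backup directories.

    Assumes directory names are timestamps like 20260617_105645, so that
    lexicographic order equals chronological order.
    """
    backups = sorted(p for p in Path(backup_root).iterdir() if p.is_dir())
    stale = backups[:-keep] if keep > 0 else backups
    for path in stale:
        shutil.rmtree(path)
    return [p.name for p in stale]


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    for stamp in ["20260615_090000", "20260616_090000", "20260617_105645"]:
        (Path(root) / stamp).mkdir()
    removed = prune_backups(root, keep=2)
    remaining = sorted(p.name for p in Path(root).iterdir())
```

Sorting by name instead of by modification time keeps the result stable even if a backup directory is copied or touched later.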

Project Structure

├── conf/
│   ├── base/                          # ★ Gitignored; copy from examples/
│   │   ├── parameters_manualforge.yml # Central project configuration
│   │   ├── catalog.yml                # Kedro data catalog
│   │   ├── hooks.yml                  # Pipeline hooks (shell commands)
│   │   └── parameters.yml             # Pipeline parameters
│   ├── examples/                      # ★ Tracked example templates
│   │   ├── parameters_manualforge.yml.example
│   │   ├── catalog.yml.example
│   │   ├── hooks.yml.example
│   │   ├── parameters.yml.example
│   │   └── credentials.yml.example
│   ├── local/                         # Local-only (gitignored)
│   │   └── credentials.yml
│   └── logging.yml
├── data/                              # Gitignored except .gitkeep
│   ├── 01_raw/                        # Raw Excel/CSV + mapping files
│   ├── 02_intermediate/              # Parquet, reconcile reports
│   ├── 03_primary/                   # Standardized data
│   ├── 04_feature/                   # Summary tables (CSV + Markdown)
│   └── 08_reporting/                 # Typst sources & compiled PDFs
├── scripts/                          # Auxiliary scripts
│   ├── backup_data.sh                # ★ Backup pipeline output data
│   ├── cfg-backup                    # ★ Backup conf/base/ config
│   ├── cfg-deploy                    # ★ Deploy/restore configs
│   ├── main.sh                       # ★ Kedro runner with auto-backup
│   ├── convert_csv_to_md.py          # CSV → Markdown conversion
│   ├── extract_rule_field_mapping.py # Rule field extraction
│   ├── extract_rule_overview.py      # Rule overview extraction
│   └── render_with_forge.py          # Markdown → DOCX/PDF rendering
├── src/manualforge/                  # Framework source code
│   ├── config.py                     # Configuration helper utilities
│   ├── hooks.py                      # Kedro pipeline hooks (PipelineHooks base class)
│   ├── io/                           # Custom Kedro datasets (PolarsExcelDataset)
│   ├── pipelines/
│   │   └── data_processing_pl/       # Core pipeline: 12 reusable nodes
│   │       ├── nodes.py              #   Node functions
│   │       ├── pipeline.py           #   Pipeline definition
│   │       ├── rulecsv2typ.py        #   CSV → Typst/Jinja conversion
│   │       └── standardize_fields.py #   Field standardization engine
│   ├── pipeline_registry.py          # Pipeline registration
│   ├── settings.py                   # Kedro project settings
│   └── __main__.py                   # CLI entry point
├── templates/                        # Jinja2 Typst templates
│   └── recipe.typ.j2
├── pyproject.toml                    # Project metadata & dependencies
└── requirements.txt

Configuration Guide

The central configuration file is conf/base/parameters_manualforge.yml. Copy it from conf/examples/ and customize it.

1. Data Sources

Define your Excel files, expected headers, and sheet filtering rules:

datasources:
  primary_data:
    filepath: "data/01_raw/your_data.xlsx"
    sheet:
      exclude_names: ["封面", "封皮"]
      name_becomes_column: "sheet_name"
    header_detection:
      mode: keyword_match
      expected_headers:
        - "column_a"
        - "column_b"
    cleaning:
      drop_rows_where:
        column_a: ["column_a"]   # drop residual header rows
      fill_null: forward
      deduplicate: true
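
One plausible reading of `header_detection` with `mode: keyword_match` and of `drop_rows_where` is sketched below. It is a guess at the behavior, not ManualForge's implementation: `find_header_row` and `read_table` are hypothetical helpers, and the sheet is modeled as a plain list of rows instead of a Polars DataFrame.

```python
def find_header_row(rows, expected_headers):
    """Return the index of the first row that contains every expected header."""
    expected = set(expected_headers)
    for index, row in enumerate(rows):
        cells = {str(cell).strip() for cell in row if cell is not None}
        if expected <= cells:
            return index
    return None


def read_table(rows, expected_headers, drop_rows_where=None):
    """Skip rows above the header, then drop rows matching drop_rows_where."""
    start = find_header_row(rows, expected_headers)
    if start is None:
        raise ValueError("no row contains all expected headers")
    header = [str(cell).strip() for cell in rows[start]]
    records = [dict(zip(header, row)) for row in rows[start + 1:]]
    drop = drop_rows_where or {}
    return [
        rec for rec in records
        if not any(rec.get(col) in values for col, values in drop.items())
    ]


sheet = [
    ["Quarterly export", None],      # title row above the real header
    ["column_a", "column_b"],        # header row found by keyword match
    ["x1", "y1"],
    ["column_a", "column_b"],        # residual header repeated in the body
    ["x2", "y2"],
]
records = read_table(
    sheet,
    expected_headers=["column_a", "column_b"],
    drop_rows_where={"column_a": ["column_a"]},
)
```

Here `drop_rows_where` maps a column name to the values that mark a row for removal, which matches the shape of the YAML above.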

2. Field Standardization

Define which fields to standardize, their mapping files, and any special corrections:

standardization:
  fields:
    - name: "dept_name"
      mapping_file: "data/01_raw/dept_list"
      case_corrections:
        wrong_name: "correct_name"
      special_mappings:
        alias: "canonical_name"
      fuzzy:
        enabled: true
        threshold: 0.8
        method: difflib             # difflib | duckdb
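
The lookup order implied by this config (alias mapping, then exact match, then fuzzy match above a threshold) can be sketched with the standard library's `difflib.get_close_matches`. The `standardize` function and the sample names are hypothetical, and whether ManualForge applies the steps in exactly this order is an assumption. The duckdb method is not shown.

```python
import difflib


def standardize(value, canonical, special_mappings=None, threshold=0.8):
    """Map a raw value to a canonical name: alias, then exact, then fuzzy."""
    special_mappings = special_mappings or {}
    if value in special_mappings:          # explicit alias wins
        return special_mappings[value]
    if value in canonical:                 # exact match needs no scoring
        return value
    # Fuzzy match: best candidate whose similarity ratio is >= threshold.
    matches = difflib.get_close_matches(value, canonical, n=1, cutoff=threshold)
    return matches[0] if matches else None


canonical = ["Human Resources", "Finance", "Engineering"]
aliases = {"HR": "Human Resources"}

by_alias = standardize("HR", canonical, aliases)
by_exact = standardize("Finance", canonical, aliases)
by_fuzzy = standardize("Finanse", canonical, aliases)    # typo, ratio about 0.86
unmatched = standardize("Marketing", canonical, aliases)  # below the threshold
```

Returning `None` for unmatched values leaves it to the caller to report them, for example in a reconciliation report.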

3. Sort Orders

Define reusable sort order lists that summaries can reference by name:

sort_orders:
  model_names:
    - "Model A"
    - "Model B"
  dep_names:
    - "HR"
    - "Finance"
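
Sorting by a reference list instead of alphabetically can be sketched as below. `sort_by_reference` is a hypothetical helper. Placing values that are missing from the list at the end, in alphabetical order, is an assumption, since this README does not say how ManualForge treats them.

```python
def sort_by_reference(values, order):
    """Sort values by their position in `order`; unlisted values go last."""
    rank = {name: index for index, name in enumerate(order)}
    # Unlisted values share the rank len(order) and fall back to name order.
    return sorted(values, key=lambda v: (rank.get(v, len(order)), v))


dep_names = ["HR", "Finance"]
departments = ["Finance", "Legal", "HR", "Audit"]
ordered = sort_by_reference(departments, dep_names)
```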

4. Summaries

Define which summary tables to generate:

summaries:
  my_summary:
    description: "Fields grouped by model and department"
    group_by: ["model", "department"]
    struct_columns: ["module", "system", "field_name"]
    sort_by:
      department: dep_names
    output:
      csv: "data/04_feature/my_summary.csv"
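
How `group_by`, `struct_columns`, and `sort_by` might interact is sketched below on plain dictionaries. This is an interpretation of the config keys, not the framework's Polars implementation: `build_summary` and `to_csv` are hypothetical, the struct columns are shortened to two, and the CSV reports only a count per group to keep the output flat.

```python
import csv
import io

sort_orders = {"dep_names": ["HR", "Finance"]}
summary_cfg = {
    "group_by": ["model", "department"],
    "struct_columns": ["module", "field_name"],
    "sort_by": {"department": "dep_names"},
}


def build_summary(rows, cfg, sort_orders):
    """Group rows, collect the struct columns per group, then sort the groups."""
    groups = {}
    for row in rows:
        key = tuple(row[c] for c in cfg["group_by"])
        struct = {c: row[c] for c in cfg["struct_columns"]}
        groups.setdefault(key, []).append(struct)
    records = [
        dict(zip(cfg["group_by"], key), items=items) for key, items in groups.items()
    ]
    for column, order_name in cfg["sort_by"].items():
        order = sort_orders[order_name]          # resolve the named sort order
        rank = {value: i for i, value in enumerate(order)}
        records.sort(key=lambda r: rank.get(r[column], len(order)))
    return records


def to_csv(records, cfg):
    """Write one line per group with the number of collected fields."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(cfg["group_by"] + ["n_fields"])
    for rec in records:
        writer.writerow([rec[c] for c in cfg["group_by"]] + [len(rec["items"])])
    return buffer.getvalue()


rows = [
    {"model": "Model A", "department": "Finance", "module": "m1", "field_name": "budget"},
    {"model": "Model A", "department": "HR", "module": "m2", "field_name": "salary"},
    {"model": "Model A", "department": "HR", "module": "m2", "field_name": "grade"},
]
records = build_summary(rows, summary_cfg, sort_orders)
csv_text = to_csv(records, summary_cfg)
```

The `sort_by` entry holds the name of a list in `sort_orders`, not the list itself, which is what lets several summaries share one ordering.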

5. Reports

Define report templates and output paths:

reports:
  my_report:
    description: "Rules cookbook"
    template_source: inline
    data_source: rules_data
    output_typ: "data/08_reporting/output.typ"
    typst_compile:
      enabled: true
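
The template → Typst source → PDF chain can be sketched as follows. To stay within the standard library, `string.Template` stands in for Jinja2 here, so this is not the framework's templating code. `render_typst`, `compile_command`, and `write_and_compile` are hypothetical names, and the rule records are invented sample data. The only external call is `typst compile <input> <output>`, which is skipped when the Typst CLI is not on PATH.

```python
import shutil
import subprocess
from pathlib import Path
from string import Template

# Stand-in for a Jinja2 template: string.Template keeps the sketch stdlib-only.
TYPST_TEMPLATE = Template("= $title\n\n$body\n")


def render_typst(title, rules):
    """Render rule records into Typst markup, one bullet per rule."""
    # Note: values are inserted verbatim; Typst markup characters in the
    # data (such as * or #) are not escaped in this sketch.
    body = "\n".join(f"- *{r['code']}*: {r['text']}" for r in rules)
    return TYPST_TEMPLATE.substitute(title=title, body=body)


def compile_command(typ_path):
    """Build the Typst CLI call that turns a .typ file into a PDF."""
    typ_path = Path(typ_path)
    return ["typst", "compile", str(typ_path), str(typ_path.with_suffix(".pdf"))]


def write_and_compile(source, typ_path, enabled=True):
    """Write the Typst source and compile it if the CLI is available."""
    Path(typ_path).write_text(source, encoding="utf-8")
    if enabled and shutil.which("typst"):
        subprocess.run(compile_command(typ_path), check=True)


source = render_typst("Rules cookbook", [{"code": "R1", "text": "Check totals"}])
command = compile_command("output.typ")
```

Keeping the .typ file on disk, as the `output_typ` key suggests, makes it possible to inspect or recompile the source without rerunning the pipeline.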

Data Layers

| Layer | Directory | Description |
| --- | --- | --- |
| Raw | data/01_raw/ | Source Excel/CSV files and mapping files |
| Intermediate | data/02_intermediate/ | Parquet files and reconcile reports |
| Primary | data/03_primary/ | Standardized data |
| Feature | data/04_feature/ | Summary tables (CSV + Markdown) |
| Reporting | data/08_reporting/ | Typst sources and PDF output |

Requirements

  • Python >= 3.10
  • Typst CLI (for PDF compilation)

Recent Changes

2026-06-29

  • Template extraction (rulecsv2typ.py): Extracted inline RECIPE_TEMPLATE_LEVEL2 into standalone templates/recipe.typ.j2; loaded via _load_recipe_template() using Path(__file__).parents[4]. Removed orphaned templates/report.typ.j2.

2026-06-26

  • Rule code parsing (nodes.py): Added _parse_rule_codes and _RULE_CODE_RE for GZW rule code extraction; rule-code column saved as List(String) without forward-fill; added "规则代码" to EXPECTED_HEADER in convert_excel_to_parquet_fj1 and process_attachment1_excel.
  • Model name standardization (nodes.py): Added _load_model_mapping() and _normalize_model_name() for fuzzy-matching model names against a reference table; 概览 sheet now uses its own 模型名称 column instead of the sheet name.
  • Department field (nodes.py): Added 主研部门 to fj2 EXPECTED_HEADER.
  • Path resolution (nodes.py, standardize_fields.py): Changed Path(__file__).parents[4] to Path.cwd() so paths resolve correctly when ManualForge is used as an installed package (e.g., from modelmanual).
  • Direct data access (rulecsv2typ.py): convert_rules_to_typst_jinja now reads from the Kedro catalog as a Polars DataFrame directly instead of round-tripping through CSV on disk.
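
The path resolution change can be illustrated with a small sketch. Both file locations are hypothetical examples: the first follows the project structure above, and the second is a typical virtual environment layout. `PurePosixPath` is used so the result is the same on every platform.

```python
from pathlib import Path, PurePosixPath

# Hypothetical location of nodes.py in a source checkout.
source_checkout = PurePosixPath(
    "/repo/src/manualforge/pipelines/data_processing_pl/nodes.py"
)
# Hypothetical location of the same file after pip install.
installed = PurePosixPath(
    "/venv/lib/python3.13/site-packages/manualforge/pipelines/data_processing_pl/nodes.py"
)

# Walking up four levels finds the project root only in the checkout.
root_in_checkout = source_checkout.parents[4]   # /repo
root_when_installed = installed.parents[4]      # /venv/lib/python3.13

# Resolving against the working directory follows the caller's project,
# provided the command is launched from that project's root.
data_dir = Path.cwd() / "data" / "01_raw"
```

The trade-off is that `Path.cwd()` depends on where the command is launched, so it assumes commands such as kedro run are started from the project root.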

Development

pip install -e ".[dev]"
ruff check src/
pytest
