A configuration-driven management manual generation framework based on Kedro pipelines with Polars and Typst.

ManualForge

Configuration-driven management manual generation framework. Define your data sources, fields, and templates in YAML, and get a formatted report.

Built on Kedro pipelines, with Polars for data processing and Typst for document rendering.

ManualForge is a reusable Python package (pip install manualforge). For a real-world downstream application, see modelmanual, a Chinese regulatory manual generator that extends ManualForge with rule-code field mapping and reconciliation reporting.

Philosophy

ManualForge separates what you want to produce from how it is produced.

  • What: Defined in conf/base/parameters_manualforge.yml, which holds your data sources, expected columns, standardization rules, sort orders, summary dimensions, and report templates.
  • How: Implemented by the pipeline nodes, which are reusable data processing functions that take their behavior from your config.

To create a manual for a different domain, you only need to edit the config file (and optionally provide new templates). No Python code changes are required.

Features

Capability | Description
Multi-sheet Excel ingestion | Auto-detect headers, filter cover sheets, merge into structured DataFrames.
Field standardization | Mapping files + exact matching + fuzzy matching (difflib / duckdb).
Config-driven summaries | Define group-by dimensions, sort orders, ability categories, and output paths in YAML.
Typst report generation | Jinja2 templates → Typst source → PDF compilation.
Pipeline hooks | Shell command hooks at pipeline/node granularity for pre/post processing.

Quick Start

# 1. Install dependencies
pip install -r requirements.txt

# 2. Copy and customize configuration
cp conf/examples/parameters_manualforge.yml.example conf/base/parameters_manualforge.yml
cp conf/examples/catalog.yml.example          conf/base/catalog.yml
cp conf/examples/hooks.yml.example            conf/base/hooks.yml
cp conf/examples/parameters.yml.example       conf/base/parameters.yml
cp conf/examples/credentials.yml.example      conf/local/credentials.yml

# 3. Edit the config files to point to your data sources
#    (conf/base/ is gitignored, so your real configs stay local)

# 4. Run the pipeline
kedro run

# Run specific node groups
kedro run --tags conversion        # Excel → Parquet only
kedro run --tags standardization   # Standardization only
kedro run --tags csv               # Summary tables only

Project Structure

├── conf/
│   ├── base/                          # ★ Gitignored; copy from examples/
│   │   ├── parameters_manualforge.yml # Central project configuration
│   │   ├── catalog.yml                # Kedro data catalog
│   │   ├── hooks.yml                  # Pipeline hooks (shell commands)
│   │   └── parameters.yml             # Pipeline parameters
│   ├── examples/                      # ★ Tracked example templates
│   │   ├── parameters_manualforge.yml.example
│   │   ├── catalog.yml.example
│   │   ├── hooks.yml.example
│   │   ├── parameters.yml.example
│   │   └── credentials.yml.example
│   ├── local/                         # Local-only (gitignored)
│   │   └── credentials.yml
│   └── logging.yml
├── data/                              # Gitignored except .gitkeep
│   ├── 01_raw/                        # Raw Excel/CSV + mapping files
│   ├── 02_intermediate/               # Parquet, reconcile reports
│   ├── 03_primary/                    # Standardized data
│   ├── 04_feature/                    # Summary tables (CSV + Markdown)
│   └── 08_reporting/                  # Typst sources & compiled PDFs
├── scripts/                           # Auxiliary scripts
│   ├── convert_csv_to_md.py           # CSV → Markdown conversion
│   ├── extract_rule_field_mapping.py  # Rule field extraction
│   ├── extract_rule_overview.py       # Rule overview extraction
│   └── render_with_forge.py           # Markdown → DOCX/PDF rendering
├── src/manualforge/                   # Framework source code
│   ├── config.py                      # Configuration helper utilities
│   ├── hooks.py                       # Kedro pipeline hooks (PipelineHooks base class)
│   ├── io/                            # Custom Kedro datasets (PolarsExcelDataset)
│   ├── pipelines/
│   │   └── data_processing_pl/        # Core pipeline: 12 reusable nodes
│   │       ├── nodes.py               #   Node functions
│   │       ├── pipeline.py            #   Pipeline definition
│   │       ├── rulecsv2typ.py         #   CSV → Typst/Jinja conversion
│   │       └── standardize_fields.py  #   Field standardization engine
│   ├── pipeline_registry.py           # Pipeline registration
│   ├── settings.py                    # Kedro project settings
│   └── __main__.py                    # CLI entry point
├── templates/                         # Jinja2 Typst templates
│   └── report.typ.j2
├── pyproject.toml                     # Project metadata & dependencies
└── requirements.txt

Configuration Guide

The central configuration file is conf/base/parameters_manualforge.yml. Copy it from conf/examples/ and customize it.

1. Data Sources

Define your Excel files, expected headers, and sheet filtering rules:

datasources:
  primary_data:
    filepath: "data/01_raw/your_data.xlsx"
    sheet:
      exclude_names: ["封面", "封皮"]   # cover-sheet names to skip
      name_becomes_column: "sheet_name"
    header_detection:
      mode: keyword_match
      expected_headers:
        - "column_a"
        - "column_b"
    cleaning:
      drop_rows_where:
        column_a: ["column_a"]   # drop residual header rows
      fill_null: forward
      deduplicate: true
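
The README does not spell out how these settings are applied, so the following is only a minimal sketch of one plausible reading: skip excluded sheets, treat the first row that contains every expected header as the header row, tag each record with its sheet name, and drop residual header rows. The functions `find_header_row` and `load_sheets` are hypothetical and are not part of ManualForge's API. Plain Python lists stand in for the Polars DataFrames the framework actually uses.

```python
from typing import Optional, Sequence


def find_header_row(rows: Sequence[Sequence[object]],
                    expected_headers: Sequence[str]) -> Optional[int]:
    """Return the index of the first row containing every expected header."""
    wanted = {h.strip() for h in expected_headers}
    for idx, row in enumerate(rows):
        cells = {str(cell).strip() for cell in row if cell is not None}
        if wanted <= cells:
            return idx
    return None


def load_sheets(sheets: dict, cfg: dict) -> list:
    """Merge sheets into one list of records (hypothetical helper)."""
    records = []
    for name, rows in sheets.items():
        if name in cfg["sheet"]["exclude_names"]:
            continue  # skip cover sheets
        header_idx = find_header_row(
            rows, cfg["header_detection"]["expected_headers"])
        if header_idx is None:
            continue  # no recognizable header on this sheet
        header = rows[header_idx]
        for row in rows[header_idx + 1:]:
            record = dict(zip(header, row))
            record[cfg["sheet"]["name_becomes_column"]] = name
            records.append(record)
    # Drop rows whose value in a column is listed under drop_rows_where.
    for column, bad_values in cfg["cleaning"]["drop_rows_where"].items():
        records = [r for r in records if r.get(column) not in bad_values]
    return records


cfg = {
    "sheet": {"exclude_names": ["封面", "封皮"],
              "name_becomes_column": "sheet_name"},
    "header_detection": {"expected_headers": ["column_a", "column_b"]},
    "cleaning": {"drop_rows_where": {"column_a": ["column_a"]}},
}
sheets = {
    "封面": [["Cover page"]],
    "Sheet1": [["Report title", None],
               ["column_a", "column_b"],
               ["x", 1],
               ["column_a", "column_b"],  # residual header row
               ["y", 2]],
}
records = load_sheets(sheets, cfg)
```

The sketch leaves out `fill_null` and `deduplicate`; how they interact with the steps above is not specified here.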

2. Field Standardization

Define which fields to standardize, their mapping files, and special corrections:

standardization:
  fields:
    - name: "dept_name"
      mapping_file: "data/01_raw/dept_list"
      case_corrections:
        wrong_name: "correct_name"
      special_mappings:
        alias: "canonical_name"
      fuzzy:
        enabled: true
        threshold: 0.8
        method: difflib             # difflib | duckdb
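
To illustrate how exact and fuzzy matching could combine, here is a minimal sketch using the standard-library `difflib`. The order of steps (corrections, then aliases, then exact match, then fuzzy match) is an assumption, and `standardize_value` is a hypothetical helper, not the function in standardize_fields.py. The `canonical` list stands in for the contents of a mapping file.

```python
import difflib
from typing import Mapping, Optional, Sequence


def standardize_value(
    value: str,
    canonical: Sequence[str],
    case_corrections: Optional[Mapping[str, str]] = None,
    special_mappings: Optional[Mapping[str, str]] = None,
    fuzzy_threshold: Optional[float] = 0.8,
) -> Optional[str]:
    """Map a raw value to a canonical name, or None if nothing matches."""
    value = value.strip()
    # 1. Explicit corrections and aliases take priority (assumed order).
    value = (case_corrections or {}).get(value, value)
    value = (special_mappings or {}).get(value, value)
    # 2. Exact match against the canonical list.
    if value in canonical:
        return value
    # 3. Fuzzy match: best candidate whose similarity ratio meets the cutoff.
    if fuzzy_threshold is None:
        return None
    matches = difflib.get_close_matches(
        value, list(canonical), n=1, cutoff=fuzzy_threshold)
    return matches[0] if matches else None


canonical = ["Finance", "Human Resources"]
print(standardize_value("Finanse", canonical))    # Finance
print(standardize_value("HR", canonical,
                        special_mappings={"HR": "Human Resources"}))
print(standardize_value("Logistics", canonical))  # None
```

Returning `None` for unmatched values keeps them visible for manual review instead of silently forcing a wrong match; whether ManualForge does the same is not stated here.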

3. Sort Orders

Define reusable sort order lists that summaries can reference:

sort_orders:
  model_names:
    - "Model A"
    - "Model B"
  dep_names:
    - "HR"
    - "Finance"

4. Summaries

Define which summary tables to generate:

summaries:
  my_summary:
    description: "Fields grouped by model and department"
    group_by: ["model", "department"]
    struct_columns: ["module", "system", "field_name"]
    sort_by:
      department: dep_names
    output:
      csv: "data/04_feature/my_summary.csv"
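
One way to read the `group_by`, `struct_columns`, and `sort_by` settings is sketched below in plain Python. It assumes that each `sort_by` entry maps a grouped column to a named list under `sort_orders`, and that values missing from that list sort last. The function `summarize` and the output key `fields` are hypothetical; the real nodes operate on Polars DataFrames.

```python
from collections import defaultdict


def summarize(records, group_by, struct_columns, sort_by, sort_orders):
    """Group records, collect struct columns, and apply custom sort orders."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[col] for col in group_by)
        groups[key].append({col: rec[col] for col in struct_columns})

    def rank(key):
        # Position of each sorted column's value in its named order list;
        # values absent from the list are ranked after all listed values.
        ranks = []
        for column, order_name in sort_by.items():
            order = sort_orders[order_name]
            value = key[group_by.index(column)]
            ranks.append(order.index(value) if value in order else len(order))
        return tuple(ranks)

    return [
        {**dict(zip(group_by, key)), "fields": groups[key]}
        for key in sorted(groups, key=rank)
    ]


records = [
    {"model": "Model A", "department": "Finance",
     "module": "m1", "system": "s1", "field_name": "f1"},
    {"model": "Model A", "department": "HR",
     "module": "m2", "system": "s2", "field_name": "f2"},
    {"model": "Model A", "department": "Finance",
     "module": "m3", "system": "s3", "field_name": "f3"},
]
summary = summarize(
    records,
    group_by=["model", "department"],
    struct_columns=["module", "system", "field_name"],
    sort_by={"department": "dep_names"},
    sort_orders={"dep_names": ["HR", "Finance"]},
)
```

With the `dep_names` order above, the HR group comes before Finance even though Finance appears first in the input.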

5. Reports

Define report templates and output:

reports:
  my_report:
    description: "Rules cookbook"
    template_source: inline
    data_source: rules_data
    output_typ: "data/08_reporting/output.typ"
    typst_compile:
      enabled: true
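
The compile step behind `typst_compile` can be pictured as a call to the Typst CLI (`typst compile <input> <output>`). The sketch below is an assumption about how such a step might look, not ManualForge's implementation: `typst_compile_command` and `compile_report` are hypothetical names, and writing the PDF next to the .typ source is an assumed convention.

```python
import shutil
import subprocess
from pathlib import Path


def typst_compile_command(output_typ: str) -> list[str]:
    """Build the Typst CLI call; the PDF path is assumed to mirror the .typ path."""
    typ_path = Path(output_typ)
    return ["typst", "compile", str(typ_path), str(typ_path.with_suffix(".pdf"))]


def compile_report(report_cfg: dict) -> bool:
    """Compile the report if enabled and the Typst CLI is on PATH."""
    if not report_cfg.get("typst_compile", {}).get("enabled", False):
        return False  # compilation switched off in the config
    if shutil.which("typst") is None:
        return False  # Typst CLI not installed
    subprocess.run(typst_compile_command(report_cfg["output_typ"]), check=True)
    return True


command = typst_compile_command("data/08_reporting/output.typ")
```

Checking for the CLI with `shutil.which` before compiling turns a missing Typst installation into a skipped step rather than a crash; a real pipeline might prefer to fail loudly instead.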

Data Layers

Layer | Directory | Description
Raw | data/01_raw/ | Source Excel/CSV files, mapping files
Intermediate | data/02_intermediate/ | Parquet, reconcile reports
Primary | data/03_primary/ | Standardized data
Feature | data/04_feature/ | Summary tables (CSV + Markdown)
Reporting | data/08_reporting/ | Typst sources & PDF output

Requirements

  • Python >= 3.10
  • Typst CLI (for PDF compilation)

Development

pip install -e ".[dev]"
ruff check src/
pytest
