
A customized set of pandoc filters that can be used to build useful pandoc Python filters.

pandoc-filter

This project is a customized set of pandoc filters that can be used to build useful pandoc Python filters. Currently, it supports only some markdown-to-markdown features (normalizing markdown files) and markdown-to-html features (generating web pages). More features will be added later, driven by my own use cases and user feedback.

Background

I usually take notes in markdown and prefer clean markdown syntax. I then publish these notes on my site as web pages, so I need to convert markdown to html. Many tools can do this conversion, and I eventually chose pandoc for its powerful features.

Sometimes, though, I need more features when converting from md to html, and that is where pandoc filters come in. I have written several pandoc Python filters with advanced features, using panflute and other tools. Now I think it is time to gather these filters into a combined toolset, which is this project.

Please see Main Features for the concrete features.

Please see Usage for the recommended usage.

Main Features

The main goal is converting markdown to html. I divide this process into two stages: markdown-to-markdown (normalizing markdown files) and markdown-to-html (generating web pages).

  • markdown-to-markdown supports:
    • math filter
      • Apply AMS rules to math formulas. (Auto-number formulas inside \begin{equation} \end{equation}, as Typora does.)
      • Allow multiple tags, but keep only the first one.
      • Allow multiple labels, but keep only the first one.
    • figure filter
      • Manage local pictures, sync them to Aliyun OSS, and replace the original src with the new one.
    • footnote filter
      • Normalize footnotes. (Remove \n in the footnote content.)
    • internal link filter
      • Normalize internal links with a very special rule. (Decode the URL-encoded links.)
  • markdown-to-html supports:
    • anchor filter
      • Normalize anchors with a very special rule. (Replace each anchor's id with its hash, as Notion does, and number it with -x.)
    • internal link recorder and filter
      • Globally manage and normalize internal links. (Make them match the behavior of the anchor filter.)
    • link-like filter
      • Process a string that looks like a link. (Turn it into a link.)
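
As a rough illustration of the link-like filter idea, the sketch below wraps bare URLs found in plain text in autolink brackets. It is not this project's implementation: `URL_PATTERN` and `linkify` are hypothetical names, and the pattern only covers bare http(s) URLs that are not already inside `<...>` or a markdown link target.

```python
import re

# Hypothetical pattern: a bare http(s) URL that is not preceded by '<' or '('
# (so existing autolinks and markdown link targets are left alone).
URL_PATTERN = re.compile(r'(?<![<(])https?://[^\s<>"]+')


def linkify(text: str) -> str:
    """Wrap each bare URL in `text` in autolink brackets: <url>."""
    return URL_PATTERN.sub(lambda m: f"<{m.group(0)}>", text)


if __name__ == "__main__":
    print(linkify("see https://pandoc.org for details"))
```

A real filter would work on the pandoc AST (for example on `Str` elements) rather than on raw text, so that code spans and existing links are never touched.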

Note: This division of filters only reflects how I organize the code. It does not mean that a filter can be used for only one kind of conversion. As long as you understand what a filter does, you can use it in any scenario. For that reason, reading a filter's source code before using it is recommended.

Installation

pip install -i https://pypi.org/simple/ --pre -U pandoc-filter

Usage

Here are basic examples for the two conversions.

Convert markdown to markdown (Normalization)

Normalize internal link

  • Inputs(./input.md): refer to test_md2md_internal_link.md.

    ## 带空格 和`特殊字符` [链接](http://typora.io) 用于%%%%¥¥¥¥跳转测试        空格
    
    ### aAa-b cC `Dd`, a#%&[xxx](yyy) Zzz [xx]  (yy)
    
    [带空格 和`特殊字符` [链接](http://typora.io) 用于%%%%¥¥¥¥跳转测试        空格](#####带空格 和`特殊字符` [链接](http://typora.io) 用于%%%%¥¥¥¥跳转测试        空格)
    
    [aAa-b cC `Dd`, a#%&[xxx](yyy) Zzz [xx]  (yy)](#####aAa-b cC `Dd`, a#%&[xxx](yyy) Zzz [xx]  (yy))
    
    <a href="###带空格 和`特殊字符` [链接](http://typora.io) 用于%%%%¥¥¥¥跳转测试        空格">带空格 和`特殊字符`...</a>
    
    <a href="#aAa-b cC `Dd`, a#%&[xxx](yyy) Zzz [xx]  (yy)">aAa-b...</a>
    
  • Code:

    import pathlib
    import logging
    import panflute as pf
    
    from pandoc_filter.utils import TracingLogger
    from pandoc_filter.md2md_filters import internal_link_filter
    
    pathlib.Path("./logs").mkdir(parents=True, exist_ok=True)
    tracing_logger = TracingLogger(name="./logs/pf_log",level=logging.INFO)
    
    file_path = pathlib.Path("./input.md")
    with open(file_path,'r',encoding='utf-8') as f:
        markdown_content = f.read()
    output_path = pathlib.Path("./output.md")
    
    doc = pf.convert_text(markdown_content,input_format='markdown',output_format='panflute',standalone=True)
    doc = pf.run_filter(action=internal_link_filter,doc=doc,tracing_logger=tracing_logger)
    
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(pf.convert_text(doc,input_format='panflute',output_format='gfm',standalone=True))
    
  • Outputs(./output.md): refer to test_md2md_internal_link.md.

    ## 带空格 和`特殊字符` [链接](http://typora.io) 用于%%%%¥¥¥¥跳转测试 空格
    
    ### aAa-b cC `Dd`, a#%&[xxx](yyy) Zzz \[xx\] (yy)
    
    [带空格 和`特殊字符` \[链接\](http://typora.io) 用于%%%%¥¥¥¥跳转测试
    空格](#带空格 和`特殊字符` [链接](http://typora.io) 用于%%%%¥¥¥¥跳转测试 空格)
    
    [aAa-b cC `Dd`, a#%&\[xxx\](yyy) Zzz \[xx\]
    (yy)](#aAa-b cC `Dd`, a#%&[xxx](yyy) Zzz \[xx\] (yy))
    
    <a href="#带空格 和`特殊字符` [链接](http://typora.io) 用于%%%%¥¥¥¥跳转测试 空格">带空格
    和`特殊字符`…</a>
    
    <a href="#aAa-b cC `Dd`, a#%&[xxx](yyy) Zzz \[xx\] (yy)">aAa-b…</a>
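
The outputs above suggest three effects on an internal link target: repeated leading `#` characters collapse to one, URL-encoded characters are decoded, and runs of whitespace collapse to a single space. A minimal standard-library sketch of that normalization follows. It is an assumption about the observable behavior, not the code of `internal_link_filter`, and `normalize_fragment` is a hypothetical name.

```python
import re
from urllib.parse import unquote


def normalize_fragment(url: str) -> str:
    """Normalize an internal link target such as '###My%20Heading'."""
    if not url.startswith("#"):
        return url  # not an internal link, leave it untouched
    # Drop every leading '#', then decode %XX escapes.
    decoded = unquote(url.lstrip("#"))
    # Collapse whitespace runs so the target matches a normalized heading.
    collapsed = re.sub(r"\s+", " ", decoded).strip()
    return "#" + collapsed


if __name__ == "__main__":
    print(normalize_fragment("###My%20%20Heading"))
```

`urllib.parse.unquote` leaves malformed escapes such as a lone `%` unchanged, which matters for headings that contain literal percent signs.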
    

Normalize footnotes

  • Inputs(./input.md): refer to test_md2md_footnote.md.

    which1.[^1]
    
    which2.[^2]
    
    which3.[^3]
    
    [^1]: Deep Learning with Intel® AVX-512 and Intel® DL Boost
    https://www.intel.cn/content/www/cn/zh/developer/articles/guide/deep-learning-with-avx512-and-dl-boost.html
    www.intel.cn
    
    [^2]: Deep Learning with Intel® AVX-512222 and Intel® DL Boost https://www.intel.cn/content/www/cn/zh/developer/articles/guide/deep-learning-with-avx512-and-dl-boost.html www.intel.cn
    
    [^3]: Deep Learning with Intel®     AVX-512 and Intel® DL Boost https://www.intel.cn/content/www/cn/zh/developer/articles/guide/deep-learning-with-avx512-and-dl-boost.html www.intel.cn
    
  • Code:

    import pathlib
    import logging
    import panflute as pf
    
    from pandoc_filter.utils import TracingLogger
    from pandoc_filter.md2md_filters import footnote_filter
    
    pathlib.Path("./logs").mkdir(parents=True, exist_ok=True)
    tracing_logger = TracingLogger(name="./logs/pf_log",level=logging.INFO)
    
    file_path = pathlib.Path("./input.md")
    with open(file_path,'r',encoding='utf-8') as f:
        markdown_content = f.read()
    output_path = pathlib.Path("./output.md")
    
    doc = pf.convert_text(markdown_content,input_format='markdown',output_format='panflute',standalone=True)
    doc = pf.run_filter(action=footnote_filter,doc=doc,tracing_logger=tracing_logger)
    
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(pf.convert_text(doc,input_format='panflute',output_format='gfm',standalone=True))
    
  • Outputs(./output.md): refer to test_md2md_footnote.md.

    which1.[^1]
    
    which2.[^2]
    
    which3.[^3]
    
    [^1]: Deep Learning with Intel® AVX-512 and Intel® DL Boost https://www.intel.cn/content/www/cn/zh/developer/articles/guide/deep-learning-with-avx512-and-dl-boost.html www.intel.cn
    
    [^2]: Deep Learning with Intel® AVX-512222 and Intel® DL Boost https://www.intel.cn/content/www/cn/zh/developer/articles/guide/deep-learning-with-avx512-and-dl-boost.html www.intel.cn
    
    [^3]: Deep Learning with Intel® AVX-512 and Intel® DL Boost https://www.intel.cn/content/www/cn/zh/developer/articles/guide/deep-learning-with-avx512-and-dl-boost.html www.intel.cn
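
The effect shown above (line breaks and repeated spaces inside a footnote become single spaces) can be sketched on plain text as follows. This only illustrates the result; `footnote_filter` itself works on the pandoc document, and `normalize_footnote_text` is a hypothetical helper name.

```python
import re


def normalize_footnote_text(text: str) -> str:
    """Collapse newlines and repeated whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    raw = "Deep Learning with Intel®     AVX-512\nhttps://example.com\nwww.intel.cn"
    print(normalize_footnote_text(raw))
```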
    

Apply AMS rules to math formulas

  • Inputs(./input.md): refer to test_md2md_math.md.

    $$
    \begin{equation}\tag{abcd}\label{lalla}
    e=mc^2
    \end{equation}
    $$
    
    $$
    \begin{equation}
    e=mc^2
    \end{equation}
    $$
    
    $$
    e=mc^2
    $$
    
    $$
    \begin{equation}\label{eq1}
    e=mc^2
    \end{equation}
    $$
    
  • Code:

    import pathlib
    import logging
    import panflute as pf
    
    from pandoc_filter.utils import TracingLogger
    from pandoc_filter.md2md_filters import math_filter
    
    
    pathlib.Path("./logs").mkdir(parents=True, exist_ok=True)
    tracing_logger = TracingLogger(name="./logs/pf_log",level=logging.INFO)
    
    file_path = pathlib.Path("./input.md")
    with open(file_path,'r',encoding='utf-8') as f:
        markdown_content = f.read()
    output_path = pathlib.Path("./output.md")
    
    doc = pf.convert_text(markdown_content,input_format='markdown',output_format='panflute',standalone=True)
    doc = pf.run_filter(action=math_filter,doc=doc,tracing_logger=tracing_logger)
    
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(pf.convert_text(doc,input_format='panflute',output_format='gfm',standalone=True))
    
  • Outputs(./output.md): refer to test_md2md_math.md.

    $$
    \begin{equation}\label{lalla}\tag{abcd}
    e=mc^2
    \end{equation}
    $$
    
    $$
    \begin{equation}\tag{1}
    e=mc^2
    \end{equation}
    $$
    
    $$
    e=mc^2
    $$
    
    $$
    \begin{equation}\label{eq1}\tag{2}
    e=mc^2
    \end{equation}
    $$
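
The numbering behavior in this example can be approximated with regular expressions on the raw markdown. The sketch below keeps the first `\tag` and first `\label` of each `equation` environment, assigns the next number when no tag is present, and leaves formulas outside `equation` untouched. It is an assumption-based illustration rather than the logic of `math_filter`, and `number_equations` is a hypothetical name.

```python
import re

EQUATION = re.compile(r"\\begin\{equation\}(.*?)\\end\{equation\}", re.DOTALL)
TAG = re.compile(r"\\tag\{([^}]*)\}")
LABEL = re.compile(r"\\label\{([^}]*)\}")


def number_equations(markdown: str) -> str:
    """Give every equation environment exactly one tag and at most one label."""
    counter = 0

    def rewrite(match: re.Match) -> str:
        nonlocal counter
        body = match.group(1)
        tags = TAG.findall(body)
        labels = LABEL.findall(body)
        # Remove all tags and labels, then re-emit only the first of each.
        body = LABEL.sub("", TAG.sub("", body)).strip()
        if tags:
            tag = tags[0]  # a user-supplied tag does not consume a number
        else:
            counter += 1
            tag = str(counter)
        label = f"\\label{{{labels[0]}}}" if labels else ""
        return f"\\begin{{equation}}{label}\\tag{{{tag}}}\n{body}\n\\end{{equation}}"

    return EQUATION.sub(rewrite, markdown)


if __name__ == "__main__":
    src = "$$\n\\begin{equation}\\label{eq1}\ne=mc^2\n\\end{equation}\n$$\n"
    print(number_equations(src))
```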
    

Sync local images to Aliyun OSS

  • Prerequisites:

    • Assume the bucket domain is raw.little-train.com.

    • Assume the following environment variables are set:

      • OSS_ENDPOINT_NAME = "oss-cn-taiwan.aliyuncs.com"

      • OSS_BUCKET_NAME = "test"

      • OSS_ACCESS_KEY_ID = "123456781234567812345678"

      • OSS_ACCESS_KEY_SECRET = "123456123456123456123456123456"

    • Assume the images are located in ./input.assets/.

  • Inputs(./input.md): refer to test_md2md_figure.md.

    ![自定义头像](./input.assets/自定义头像.png)
    
    ![Level-of-concepts](./input.assets/Level-of-concepts.svg)
    
  • Code:

    import os
    import pathlib
    import logging
    import panflute as pf
    
    from pandoc_filter.utils import TracingLogger
    from pandoc_filter.utils import OssHelper
    from pandoc_filter.md2md_filters import figure_filter
    
    pathlib.Path("./logs").mkdir(parents=True, exist_ok=True)
    tracing_logger = TracingLogger(name="./logs/pf_log",level=logging.INFO)
    
    file_path = pathlib.Path("./input.md")
    with open(file_path,'r',encoding='utf-8') as f:
        markdown_content = f.read()
    output_path = pathlib.Path("./output.md")
    
    # Read the OSS settings from the environment (see Prerequisites).
    oss_endpoint_name = os.environ['OSS_ENDPOINT_NAME']
    oss_bucket_name = os.environ['OSS_BUCKET_NAME']
    # Fail early if the credentials are missing or empty.
    assert os.environ['OSS_ACCESS_KEY_ID']
    assert os.environ['OSS_ACCESS_KEY_SECRET']
    oss_helper = OssHelper(oss_endpoint_name,oss_bucket_name)
    
    doc = pf.convert_text(markdown_content,input_format='markdown',output_format='panflute',standalone=True)
    doc.doc_path = file_path
    doc = pf.run_filter(action=figure_filter,doc=doc,tracing_logger=tracing_logger,oss_helper=oss_helper)
    
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(pf.convert_text(doc,input_format='panflute',output_format='gfm',standalone=True))
    
  • Outputs(./output.md): refer to test_md2md_figure.md.

    <figure>
    <img
    src="https://raw.little-train.com/111199e36daf608352089b12cec935fc5cbda5e3dcba395026d0b8751a013d1d.png"
    alt="自定义头像" />
    <figcaption aria-hidden="true">自定义头像</figcaption>
    </figure>
    
    <figure>
    <img
    src="https://raw.little-train.com/20061af9ba13d3b92969dc615b9ba91abb4c32c695f532a70a6159d7b806241c.svg"
    alt="Level-of-concepts" />
    <figcaption aria-hidden="true">Level-of-concepts</figcaption>
    </figure>
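
The uploaded file names above are 64 hexadecimal characters plus the original extension, which is consistent with naming each object after a SHA-256 digest. The sketch below shows that naming scheme under the assumption that the digest is taken over the file's bytes. The project's `OssHelper` may derive names differently, and `hashed_object_name` and `remote_url` are hypothetical helpers that perform no upload.

```python
import hashlib
import pathlib


def hashed_object_name(path: pathlib.Path) -> str:
    """Return '<sha256 of the file bytes><original suffix>' (assumed scheme)."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return f"{digest}{path.suffix}"


def remote_url(path: pathlib.Path, bucket_domain: str) -> str:
    """Build the URL that would replace the local src in the document."""
    return f"https://{bucket_domain}/{hashed_object_name(path)}"


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        image = pathlib.Path(tmp) / "avatar.png"
        image.write_bytes(b"not really a png")
        print(remote_url(image, "raw.little-train.com"))
```

Content-based names make repeated syncs idempotent: an unchanged image always maps to the same object, and two identical images share one.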
    

Convert markdown to html

Normalize anchors, internal links and link-like strings

  • Inputs(./input.md):

    Refer to test_md2html_anchor_and_link.md.

  • Code:

    import pathlib
    import logging
    import functools
    import panflute as pf
    
    from pandoc_filter.utils import TracingLogger
    from pandoc_filter.md2html_filters import anchor_filter,internal_link_recorder,link_like_filter
    from pandoc_filter.md2md_filters import internal_link_filter
    
    pathlib.Path("./logs").mkdir(parents=True, exist_ok=True)
    tracing_logger = TracingLogger(name="./logs/pf_log",level=logging.INFO)
    
    def finalize(doc:pf.Doc,**kwargs):
        tracing_logger = kwargs['tracing_logger']
        # Build the set of valid anchor ids: "<key>-1" ... "<key>-<count>".
        id_set = set()
        for k,v in doc.anchor_count.items():
            for i in range(1,v+1):
                id_set.add(f"{k}-{i}")
        # Point each recorded internal link at an existing anchor id.
        for patched_elem,url,guessed_url_with_num in doc.internal_link_record:
            if f"{url}-1" in id_set:
                patched_elem.sub(f"{url}-1",tracing_logger)
            elif guessed_url_with_num in id_set: # None is not in id_set
                patched_elem.sub(guessed_url_with_num,tracing_logger)
            else:
                tracing_logger.logger.warning(f"{patched_elem.elem}")
                tracing_logger.logger.warning(f"The internal link `{url}` is invalid and will not be changed because no target header is found.")
    
    file_path = pathlib.Path("./input.md")
    with open(file_path,'r',encoding='utf-8') as f:
        markdown_content = f.read()
    output_path = pathlib.Path("./output.html")
    
    doc = pf.convert_text(markdown_content,input_format='markdown',output_format='panflute',standalone=True)
    doc = pf.run_filter(action=internal_link_filter,doc=doc,tracing_logger=tracing_logger)
    
    _finalize = functools.partial(finalize,tracing_logger=tracing_logger)
    doc = pf.run_filters(actions=[anchor_filter,internal_link_recorder,link_like_filter],doc=doc,finalize=_finalize,tracing_logger=tracing_logger)
    
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(pf.convert_text(doc,input_format='panflute',output_format='html',standalone=True))
    
  • Outputs(./output.html):

    Refer to test_md2html_anchor_and_link.html.
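
The `finalize` step above relies on anchor ids of the form `<key>-<n>`, where `n` counts repeated headings. The following standalone sketch shows how such ids could be assigned and how a link could be resolved to the first matching anchor. It assumes the key is a SHA-256 hex digest of the heading text, which may differ from what `anchor_filter` computes, and `heading_key`, `assign_anchor_ids`, and `resolve_link` are hypothetical names.

```python
import hashlib
from collections import Counter
from typing import List, Optional, Tuple


def heading_key(text: str) -> str:
    """Hash a heading's text (assumed scheme: SHA-256 hex digest)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def assign_anchor_ids(headings: List[str]) -> Tuple[List[str], Counter]:
    """Return one id per heading; duplicates get -1, -2, ... suffixes."""
    counts: Counter = Counter()
    ids = []
    for text in headings:
        key = heading_key(text)
        counts[key] += 1
        ids.append(f"{key}-{counts[key]}")
    return ids, counts


def resolve_link(target_text: str, counts: Counter) -> Optional[str]:
    """Point a link at the first heading with this text, or None if absent."""
    key = heading_key(target_text)
    return f"#{key}-1" if counts.get(key, 0) > 0 else None


if __name__ == "__main__":
    ids, counts = assign_anchor_ids(["Intro", "Usage", "Intro"])
    print(ids[0] == resolve_link("Intro", counts).lstrip("#"))
```

Resolving to the `-1` suffix mirrors the first branch of `finalize`, and returning `None` corresponds to the warning branch for links without a target header.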

Contribution

Contributions are welcome. However, the introduction and documentation are currently incomplete, so please wait for a while.

A simple way to contribute is to open an issue to report bugs or request new features.
