A fast parser for reStructuredText

rst_fast_parse

A fast, spec compliant*, concrete syntax parser for reStructuredText.

In development, use at your own risk

Features:

  • Fault tolerant parsing; designed to never raise an exception
  • Concrete syntax tokens with full source mapping
  • Diagnostics for common issues
  • No required dependencies
  • Functional parsing design, with no modifiable global state (thread safe)
  • Fully typed with "strict" mypy settings

This parser is NOT intended to be a full replacement for the docutils/Sphinx rST parser. The initial goal is to parse the "outline" of a reStructuredText document, without necessarily knowing everything about all roles and directives, into a structure that can serve as a foundation for tools like linters, formatters and language servers (as opposed to having to wait for a full Sphinx build).

Incremental parsing and formatting are also planned.

* spec compliant for all rST syntaxes (tested extensively against docutils), but no spec exists for all directive/role content, due to their highly dynamic nature.

Usage

To parse a string, use the parse_string function.

from rst_fast_parse import parse_string

nodes, diagnostics = parse_string("""
Title
-----
hallo
there *world!*
""",
inline_sourcemaps=True
)
assert nodes.debug_repr() == """\
<title style='-'> 1-2
  <inline> 1-1
    <text> 1:0-1:5
<paragraph> 3-4
  <inline> 3-4
    <text> 3:0-4:6
    <emphasis> 4:6-4:14
"""

Improving performance

If only block-level parsing is required, pass parse_inlines=False for a reasonable speed-up.

from rst_fast_parse import parse_string

nodes, diagnostics = parse_string("""
Hello
-----
*world!*
""",
parse_inlines=False)
assert nodes.debug_repr() == """\
<title style='-'> 1-2
  <inline> 1-1
<paragraph> 3-3
  <inline> 3-3
"""

The inline_sourcemaps option, which computes and adds source mappings to inline nodes, is also disabled by default because it has a performance cost.

For comparison, parsing the reStructuredText specification file (>3000 lines) currently takes:

  • 25ms with parse_inlines=False
  • 35ms with parse_inlines=True
  • 44ms with parse_inlines=True, inline_sourcemaps=True
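
Timings like these depend on the machine and the input, so it can be worth measuring on your own documents. The following is a minimal sketch of a timing harness using only the standard library; time_call is a hypothetical helper name, and the stand-in workload would be replaced by a call to parse_string with the options of interest.

```python
import time
from typing import Any, Callable


def time_call(func: Callable[..., Any], *args: Any, repeat: int = 5, **kwargs: Any) -> float:
    """Return the best wall-clock time (in milliseconds) over `repeat` calls."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best * 1000.0


# Stand-in workload so the sketch runs without rst_fast_parse installed.
# With the package available, one would instead time something like
#   time_call(parse_string, text, parse_inlines=False)
elapsed_ms = time_call(str.splitlines, "Title\n-----\nhallo\n" * 1000)
print(f"best of 5: {elapsed_ms:.3f}ms")
```

Taking the best of several runs reduces noise from other processes; the absolute numbers will differ from those listed above.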

Nesting sections

Unlike docutils, the parser does not automatically nest sections based on title underline/overline styles. Nesting is not generally needed for linting or formatting tools, and keeping the tree flat makes incremental parsing possible.

If you wish to nest sections, you can use the nest_sections function:

from rst_fast_parse import parse_string, nest_sections

nodes, diagnostics = parse_string("""
Header 1
========
Header 1.1
----------
""")
nodes = nest_sections(nodes)
assert nodes.debug_repr() == """\
<section> 1-4
  <title style='='> 1-2
    <inline> 1-1
      <text>
  <section> 3-4
    <title style='-'> 3-4
      <inline> 3-3
        <text>
"""
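
The idea behind nesting can be illustrated without the parser: in docutils, each title style gets a section level in order of first appearance, and a title is nested under the nearest preceding title with a lower level. The following is a minimal sketch over plain (text, style) tuples. The nest_titles function and the dict layout are hypothetical, it ignores overlines and inconsistent levels, and it is not a description of how nest_sections is implemented.

```python
from typing import Any, Dict, List, Tuple


def nest_titles(titles: List[Tuple[str, str]]) -> List[Dict[str, Any]]:
    """Nest (text, style) titles; a style's level is its order of first appearance."""
    levels: Dict[str, int] = {}
    root: List[Dict[str, Any]] = []
    # Stack of (level, children list) for the currently open sections.
    stack: List[Tuple[int, List[Dict[str, Any]]]] = [(0, root)]
    for text, style in titles:
        level = levels.setdefault(style, len(levels) + 1)
        # Close any open sections at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        section: Dict[str, Any] = {"title": text, "style": style, "children": []}
        stack[-1][1].append(section)
        stack.append((level, section["children"]))
    return root


tree = nest_titles([("Header 1", "="), ("Header 1.1", "-"), ("Header 2", "=")])
print([s["title"] for s in tree])                 # top-level sections
print([s["title"] for s in tree[0]["children"]])  # nested under "Header 1"
```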

Directive parsing

Due to the highly dynamic nature of directives, and their tight coupling to docutils/sphinx, the parser does not attempt to parse all directives.

Instead, there is a default mapping from standard directive names to simple declarative definitions. These definitions can be modified and passed to the parser as needed:

from rst_fast_parse import parse_string, get_default_directives

print(get_default_directives())

nodes, diagnostics = parse_string("""
.. note:: This is a note
     :class: my-note
""",
directives={
    'note': {
      "argument": False,  # can have an argument
      "options": True,  # can have an options block
      "content": True,  # can have a content block
      "parse_content": True,  # parse content as rST
    }
})
assert nodes.debug_repr() == """\
<directive name='note'> 1-2
  <options>
    <option name='class'> 2-2
  <body>
    <paragraph> 1-1
      <inline> 1-1
        <text>
"""
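
Because the definitions are plain declarative data, customising them can amount to copying and updating a mapping. The following is a minimal sketch, assuming the mapping is a dict keyed by directive name with the four keys shown above. with_directive is a hypothetical helper, 'my-admonition' is an invented directive name, and the exact type returned by get_default_directives is an assumption here.

```python
from typing import Dict, Mapping

DirectiveDef = Dict[str, bool]


def with_directive(
    base: Mapping[str, DirectiveDef], name: str, **overrides: bool
) -> Dict[str, DirectiveDef]:
    """Return a copy of `base` with the definition for `name` added or updated."""
    # Copy each definition so the caller's mapping is left untouched.
    result = {key: dict(value) for key, value in base.items()}
    definition = result.setdefault(
        name,
        {"argument": False, "options": False, "content": False, "parse_content": False},
    )
    definition.update(overrides)
    return result


# Stand-in for the default mapping (assumption: a dict of this shape).
base = {"note": {"argument": False, "options": True, "content": True, "parse_content": True}}
custom = with_directive(base, "my-admonition", argument=True, content=True)
print(custom["my-admonition"])
```

The resulting mapping is intended to be passed as the directives argument of parse_string, as in the example above.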

Diagnostics

Diagnostics are returned for any known issues found during parsing.

from rst_fast_parse import parse_string

nodes, diagnostics = parse_string("""
- list `no role name`
no blank line
""")
assert nodes.debug_repr() == """\
<bullet_list symbol='-'> 1-1
  <list_item> 1-1
    <paragraph> 1-1
      <inline> 1-1
        <text>
        <role>
<paragraph> 2-2
  <inline> 2-2
    <text>
"""
assert [d.as_dict() for d in diagnostics] == [
  {
    'code': 'block.blank_line',
    'message': 'Blank line expected after Bullet list',
    'line_start': 1,
    'character_end': 21
  },
  {
    'code': 'inline.role_no_name',
    'message': 'Inline role without name.',
    'line_start': 1,
    'character_start': 7,
    'character_end': 21
  }
]
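
Dictionaries of this shape are easy to turn into editor-style messages. The following is a minimal sketch, assuming line and character offsets are zero-based (which the examples in this document suggest but do not state) and that character_start may be absent. format_diagnostic is a hypothetical helper, not part of the package.

```python
from typing import Any, Dict


def format_diagnostic(diag: Dict[str, Any], path: str = "<stdin>") -> str:
    """Format a diagnostic dict as 'path:line:col: message [code]' (1-based)."""
    line = diag["line_start"] + 1
    # Fall back to column 1 when no character offset is available.
    column = (diag.get("character_start") or 0) + 1
    return f"{path}:{line}:{column}: {diag['message']} [{diag['code']}]"


example = {
    "code": "inline.role_no_name",
    "message": "Inline role without name.",
    "line_start": 1,
    "character_start": 7,
    "character_end": 21,
}
print(format_diagnostic(example))
```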

Available diagnostic codes:

  • source.tab_in_line: Warns on tabs in a line, which can degrade performance of source mapping.
  • block.blank_line: Warns on missing blank lines between syntax blocks.
  • block.title_line: Warns on issues with title under/over lines.
  • block.title_disallowed: Warns on unexpected titles in a context where they are not allowed.
  • block.paragraph_indentation: Warns on unexpected indentation of a paragraph line.
  • block.literal_no_content: Warns on literal blocks with no content.
  • block.target_malformed: Warns on malformed hyperlink targets.
  • block.substitution_malformed: Warns on malformed substitution definition.
  • block.table_malformed: Warns on malformed tables.
  • block.inconsistent_title_level: Warns on inconsistent title levels, e.g. a level 1 title style followed by a level 3 style.
  • block.directive_indented_options: Warns if the second line of a directive starts with an indented :.
  • block.directive_malformed: Warns on malformed directives.
  • inline.no_closing_marker: Warns on inline markup with no closing marker.
  • inline.role_malformed: Warns on malformed inline roles.
  • inline.role_no_name: Warns on inline roles with no name.
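
Since the codes are namespaced by prefix (source., block., inline.), a tool can select or silence whole categories with a simple prefix match. The following is a minimal sketch over plain dicts shaped like the as_dict() output; select_by_prefix is a hypothetical helper name.

```python
from typing import Any, Dict, Iterable, List


def select_by_prefix(
    diagnostics: Iterable[Dict[str, Any]], *prefixes: str
) -> List[Dict[str, Any]]:
    """Keep diagnostics whose code starts with any of the given prefixes."""
    # str.startswith accepts a tuple of candidate prefixes.
    return [d for d in diagnostics if d["code"].startswith(prefixes)]


found = [
    {"code": "block.blank_line", "line_start": 1},
    {"code": "inline.role_no_name", "line_start": 1},
    {"code": "source.tab_in_line", "line_start": 4},
]
print([d["code"] for d in select_by_prefix(found, "block.", "inline.")])
```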

Walking the node tree

Use the walk_children function to walk a node's (block) children. A built-in use of this is the walk_line_inside function, which yields all nodes that contain a given line number.

from rst_fast_parse import parse_string
from rst_fast_parse.nodes import walk_line_inside

nodes, diagnostics = parse_string("""
- a

  1. content

- b
""")
assert [e.tagname for e in walk_line_inside(nodes, 3)] == [
  'bullet_list', 'list_item', 'enum_list', 'list_item', 'paragraph', 'inline'
]
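
Conceptually, such a walk is a depth-first descent that only follows nodes whose line range contains the target line. The following is a minimal sketch on a toy tree. The Node class, walk_containing and the line ranges are hypothetical stand-ins chosen to mirror the example above; they are not the package's node types or implementation.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Toy stand-in for a parsed node: a tag name plus an inclusive line range."""
    tagname: str
    start: int
    end: int
    children: List["Node"] = field(default_factory=list)


def walk_containing(nodes: List[Node], line: int) -> Iterator[Node]:
    """Yield, outermost first, every node whose line range contains `line`."""
    for node in nodes:
        if node.start <= line <= node.end:
            yield node
            yield from walk_containing(node.children, line)


def para(line: int) -> Node:
    """A single-line paragraph holding one inline node."""
    return Node("paragraph", line, line, [Node("inline", line, line)])


# Assumed line ranges for the document parsed in the example above.
tree = [
    Node("bullet_list", 1, 5, [
        Node("list_item", 1, 3, [
            para(1),
            Node("enum_list", 3, 3, [Node("list_item", 3, 3, [para(3)])]),
        ]),
        Node("list_item", 5, 5, [para(5)]),
    ])
]
print([n.tagname for n in walk_containing(tree, 3)])
```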

Command line usage

There is also a simple CLI for linting reStructuredText stdin/files:

$ printf '%s\n' '- a' '1. *b' | python -m rst_fast_parse.cli.lint --print-ast --ast-maps -
<bullet_list symbol='-'> 0-0
  <list_item> 0-0
    <paragraph> 0-0
      <inline> 0-0
        <text> 0:2-0:3
<enum_list ptype='period' etype='arabic'> 1-1
  <list_item> 1-1
    <paragraph> 1-1
      <inline> 1-1
        <problematic> 1:3-1:4
        <text> 1:4-1:5

<stdin>:1:1: Blank line expected after Bullet list [block.blank_line]
<stdin>:2:4: Inline emphasis no closing marker. [inline.no_closing_marker]

Found 2 error.

Design decisions

Unlike docutils, the parser does not automatically nest sections based on title underline styles. This allows for incremental parsing, as well as a simpler design.

We want to try to avoid any user-defined "dynamic" code execution, e.g. for parsing directive content, since this limits the future ability to convert the codebase to a different language, to configure using a declarative format, or to run in a sandboxed environment.

Licensing

For now the project is under a fairly strict license, and the distributed code is relatively obscured.

This is to mitigate "bad faith" copying of the codebase, especially whilst in development, which unfortunately has happened to me in the past 😒

Changelog

0.0.16

  • 🎉 Add inline parsing
  • 🎉 Add character-level source mappings for diagnostics
  • Refactor elements to nodes

0.0.15

  • 🎉 Add directive parsing
  • Replace ElementProtocol.line_inside with walk_line_inside function.
  • Replace ElementList with RootElement
  • Add InlineElement, ParagraphElement, BulletListElement, EnumListElement, FieldListElement, FieldItemElement, DefinitionListElement, DefinitionItemElement
