A modern lexer/parser generator for a growing number of languages

Mage: Text Analysis Made Easy

Mage is an experimental tool for performing text analysis. It does so by generating a lexer, parser and parse tree for you. Whether it is a piece of programming code or some tabular data in a fringe format, Mage has got you covered!

Features

  • ✅ A simple yet expressive DSL to write your grammars in
  • ✅ Full support for Python type annotations. Avoid runtime errors while building your language!
  • 🚧 Lots of unit tests to ensure your code does what you expect it to do
  • 🚧 An intermediate language that makes it easy to add support for other programming languages

👀 Mage is written in itself. Check out the generated code for part of our Python generator!

Implementation Status

Feature   Python   Rust   C   C++   JavaScript
CST
AST
Lexer     🚧
Parser
Emitter

Installation

$ pip3 install --user -U magelang

Usage

Mage currently requires Python 3.12 or later to run.

mage generate <lang> <filename>

Generate a parser for the given grammar in a language that you specify.

Example

mage generate python foo.mage --prefix foo --out-dir src/foolang

🚧 mage test <filename..>

[!WARNING]

This command is under construction.

Run all tests inside the documentation of the given grammar.

Grammar

pub <name> = <expr>

Define a new node or token that must be parsed according to the given expression.

You can use both inline rules and other node rules inside expr. When a rule refers to another node, that node becomes a field of the referring node. Nodes that have no fields are converted to a special token type that is more efficient to represent.

pub var_decl = 'var' name:ident '=' type_expr

<name> = <expr>

Define a new inline rule that can be used inside other rules.

As the name suggests, this type of rule is merely syntactic sugar and gets inlined whenever it is referred to inside another rule.

digits = [0-9]+

extern <name>

Defines a new parsing rule that is defined somewhere else, possibly in a different language.

extern token <name>

Defines a new lexing rule that is defined somewhere else, possibly in a different language.

pub token <name> = <expr>

Like pub <name> = <expr> but forces the rule to be a token.

Mage will report an error when the rule cannot be converted to a token rule. This usually means that the rule references another rule that is pub.

pub token float_expression
  = digits? '.' digits

expr1 expr2

First parse expr1 and continue to parse expr2 immediately after it.

pub two_column_csv_line
  = text ',' text '\n'

expr1 | expr2

First try to parse expr1. If that fails, try to parse expr2. If neither expression matches, the parser fails.

pub declaration
  = function_declaration
  | let_declaration
  | const_declaration
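Under the Python target, an alternation rule like this typically surfaces as a union over the variant node types. The following is only a hedged sketch with invented class names, not the actual generated API:

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical stand-ins for the nodes generated from the variants above.
@dataclass
class FunctionDeclaration: ...

@dataclass
class LetDeclaration: ...

@dataclass
class ConstDeclaration: ...

# The alternation itself becomes a type alias over its variants.
Declaration = Union[FunctionDeclaration, LetDeclaration, ConstDeclaration]

def describe(decl: Declaration) -> str:
    # Code consuming the CST can dispatch on the concrete variant.
    if isinstance(decl, FunctionDeclaration):
        return "function"
    if isinstance(decl, LetDeclaration):
        return "let"
    return "const"

print(describe(LetDeclaration()))  # prints: let
```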

expr?

Parse the given expression if it matches; otherwise skip it.

pub singleton_or_pair
  = value (',' value)?

expr*

Parse the given expression zero or more times, as many times as possible.

skip = (multiline_comment | whitespace)*

expr+

Parse the given expression one or more times.

For example, in Python, there must always be at least one statement in the body of a class or function:

body = stmt+

\expr

Escape an expression by making it hidden. The expression will be parsed, but not be visible in the resulting CST/AST.

expr{n,m}

Parse the expression at least n times and at most m times.

unicode_char = 'U+' hex_digit{4,4}

@keyword

Treat the given rule as being a potential source for keywords.

String literals matching this rule are given the special _keyword suffix during transformation. The lexer also takes into account that the rule conflicts with keywords and generates code accordingly.

@keyword
pub token ident
  = [a-zA-Z_] [a-zA-Z_0-9]*

@skip

Register the chosen rule as a special rule that the lexer uses to lex 'gibberish'.

The rule remains available for use in other rules, e.g. in rules annotated with @noskip.

@skip
whitespace = [\n\r\t ]*

🚧 @noskip

[!WARNING]

This decorator is under construction.

Disable automatic injection of the @skip rule for the chosen rule.

This can be useful for e.g. parsing indentation in a context where whitespace is normally discarded.

@skip
__ = [\n\r\t ]*

@noskip
pub body
  = ':' __ stmt
  | ':' \indent stmt* \dedent

@wrap

Adding this decorator to a rule ensures that a real CST node is emitted for that rule, instead of possibly a variant.

This decorator makes the CST heavier, but this might be warranted in the name of robustness and forward compatibility. Use this decorator if you plan to add more fields to the rule.

@wrap
pub lit_expr
   = literal:(string | integer | boolean)

keyword

A special rule that matches any keyword present in the grammar.

The generated CST will contain predicates to check for a keyword:

print_bold = False
if is_py_keyword(token):
    print_bold = True

token

A rule that matches any token in the grammar.

pub macro_call
  = name:ident '{' token* '}'

node

A special rule that matches any parseable node in the grammar, excluding tokens.

syntax

A special rule that matches any rule in the grammar, including tokens.

Python API

This section documents the API that is generated by taking a Mage grammar as input and specifying python as the output language.

In what follows, Node is the name of an arbitrary CST node (such as PyReturnStmt or MageRepeatExpr) and foo and bar are the names of fields of such a node. Examples of field names are expr, return_keyword, min, max, and so on.

Node(...)

Construct a node with the fields specified in the ... part of the expression.

First come all required fields, i.e. those that were not suffixed with ?, * or something similar in the grammar. They may be specified as positional arguments or as keyword arguments.

Next come all optional fields. They must be specified as keyword arguments. When omitted, the corresponding field is either set to None or filled with a newly created empty token/node.

Examples

Creating a new CST node by providing positional arguments for required fields:

PyInfixExpr(
    PyNamedExpr('value'),
    PyIsKeyword(),
    PyNamedExpr('None')
)

The same example but now with keyword arguments:

PyInfixExpr(
    left=PyNamedExpr('value'),
    op=PyIsKeyword(),
    right=PyNamedExpr('None')
)

Omitting fields that are trivial to construct:

# Note that `return_keyword` is not specified
stmt = PyReturnStmt(expr=PyConstExpr(42))

# stmt.return_keyword was automatically created
assert isinstance(stmt.return_keyword, PyReturnKeyword)

Node.count_foos() -> int

This member is generated when field foo contains a repetition, such as the Mage expression '.'+

It returns the number of elements actually present in the CST node.
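As an illustration, here is a hedged sketch of what such a generated counter could look like for a hypothetical rule pub path = segment ('.' segment)* — the class and field names below are invented, not taken from the real generated code:

```python
from dataclasses import dataclass, field

@dataclass
class Dot:
    """Hypothetical token node for the '.' separator."""

@dataclass
class Path:
    """Hypothetical CST node with a repeated field `dots`."""
    dots: list[Dot] = field(default_factory=list)

    def count_dots(self) -> int:
        # The generated counter simply reports how many elements
        # of the repetition are present in this node.
        return len(self.dots)

p = Path(dots=[Dot(), Dot(), Dot()])
print(p.count_dots())  # prints: 3
```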

FAQ

What is a CST, AST, visitor, and so on?

A CST is a collection of structures and enumerations that completely represent the source code that needs to be parsed/emitted.

An AST is an abstract representation of the CST. Mage can automatically derive a good AST from a CST.

A visitor is (usually) a function that traverses the AST/CST in a particular way. It is useful for various things, such as code analysis and evaluation.

A rewriter is similar to a visitor in that it traverses that AST/CST but also creates new nodes during this traversal.
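The two concepts can be sketched over a toy tree; the classes and traversals below are illustrative only and do not mirror the Mage-generated API:

```python
from dataclasses import dataclass

@dataclass
class Lit:
    value: int

@dataclass
class Add:
    left: object
    right: object

def count_literals(node) -> int:
    # A visitor: walk the tree and collect information.
    if isinstance(node, Lit):
        return 1
    return count_literals(node.left) + count_literals(node.right)

def double_literals(node):
    # A rewriter: walk the tree and build new nodes as we go.
    if isinstance(node, Lit):
        return Lit(node.value * 2)
    return Add(double_literals(node.left), double_literals(node.right))

tree = Add(Lit(1), Add(Lit(2), Lit(3)))
print(count_literals(tree))  # prints: 3
```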

A lexer or scanner is, at its core, a program that splits the input stream into separate tokens that are easy for the parser to digest.

A parser converts a stream of tokens into CST/AST nodes. Which parts of the input stream are converted to which nodes usually depends on how the parser is invoked.

How do I assign a list of nodes to another node in Python without type errors?

This is most likely due to list invariance in the Python type checker, which prevents a list of a subclass from being assigned to a list of a more general type.

For small lists, we recommend making a copy of the list, like so:

defn = PyFuncDef(body=list([ ... ]))

See also this issue in the Pyright repository.
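A minimal sketch of the problem and the workaround, using invented Stmt classes rather than the generated API:

```python
from typing import List

class Stmt: ...

class RetStmt(Stmt): ...  # hypothetical subclass, for illustration

def set_body(body: List[Stmt]) -> None:
    # The callee is allowed to insert any Stmt, which is exactly
    # why the checker must treat List as invariant.
    body.append(Stmt())

stmts: List[RetStmt] = [RetStmt()]

# set_body(stmts)      # a strict checker such as Pyright rejects this call
set_body(list(stmts))  # the fresh copy is inferred as List[Stmt] from context

assert len(stmts) == 1  # the original list is also left untouched
```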

Contributing

Run the following command in a terminal to link the mage command to your checkout:

pip3 install -e '.[dev]'

License

This code is generously licensed under the MIT license.
