An ASD-STE100 (Simplified Technical English) parser.
biz.dfch.AsdSte100Parser
This library implements:
- An EBNF grammar for Lark (earley)
- A multi-pass transformer
- A tokenizer
- A serializer.
You must use a special structure of Markdown as the input text.
Installation
biz-dfch-ste100parser is on PyPI. Create a virtual environment and install the library with pip:
pip install biz-dfch-ste100parser
Usage
from biz.dfch.ste100parser import ContainerTransformer
from biz.dfch.ste100parser import GrammarType
from biz.dfch.ste100parser import Parser
from biz.dfch.ste100parser import Token
value = ""  # Specify the text (see below for example content and output).
parser = Parser(GrammarType.CONTAINER)
assert parser.is_valid(value)
# This parses the text into a tree according to the CONTAINER grammar.
initial_tree = parser.invoke(value)
# This transforms the tree to the tokens described in the "Format" section.
transformer = ContainerTransformer()
transformed_tree = transformer.invoke(initial_tree)
# This prints the resulting AST.
print(transformed_tree.pretty())
Input text
# This is a heading *level 1*
This is the start of the _first_ paragraph. This is the second sentence.
Third sentence, after a LINEBREAK. The fourth sentence starts a list:
  1 This is the first list item.
  2 Another item
  3 Last item.
The paragraph continues.
This is para2. And, this is a new paragraph with only a single sentence.
## This is our procedure
1. Do this
2. Do that:
   a This is a list with item 1
   b The next item
   c The last item.
3. And then, do this one last time.
This is para3. Here, we have another paragraph.
This is para4. Here, we have another paragraph.
This continues para4 after a LINEBREAK.
This is para5. Here, we have another paragraph.
    a This is a list with item 1
    b The next item
    c The last item.
1. Another proc (without heading)
2. Last step.
> Line1. This-is-some-cite-text-1.1. This-is-some-cite-text-2.1.
> Line2. This-is-some-cite-text-2.1. This-is-some-cite-text-2.2.
And yet another, paragraph.
> LineA. This-is-some-cite-text-A.1. This-is-some-cite-text-A.1.
> LineB. This-is-some-cite-text-B.1. This-is-some-cite-text-B.2.
Transformed tree
start
heading
HEADING_LEVEL 1
TEXT This
WS 1
TEXT is
WS 1
TEXT a
WS 1
TEXT heading
WS 1
bold
TEXT level
WS 1
TEXT 1
paragraph
TEXT This
WS 1
TEXT is
WS 1
TEXT the
WS 1
TEXT start
WS 1
TEXT of
WS 1
TEXT the
WS 1
emph
TEXT first
WS 1
TEXT paragraph.
WS 1
TEXT This
WS 1
TEXT is
WS 1
TEXT the
WS 1
TEXT second
WS 1
TEXT sentence.
LINEBREAK
TEXT Third
WS 1
TEXT sentence,
WS 1
TEXT after
WS 1
TEXT a
WS 1
TEXT LINEBREAK.
WS 1
TEXT The
WS 1
TEXT fourth
WS 1
TEXT sentence
WS 1
TEXT starts
WS 1
TEXT a
WS 1
TEXT list:
list_item
LIST_MARKER 1
LIST_INDENT 2
TEXT This
WS 1
TEXT is
WS 1
TEXT the
WS 1
TEXT first
WS 1
TEXT list
WS 1
TEXT item.
list_item
LIST_MARKER 2
LIST_INDENT 2
TEXT Another
WS 1
TEXT item
list_item
LIST_MARKER 3
LIST_INDENT 2
TEXT Last
WS 1
TEXT item.
TEXT The
WS 1
TEXT paragraph
WS 1
TEXT continues.
paragraph
TEXT This
WS 1
TEXT is
WS 1
TEXT para2.
WS 1
TEXT And,
WS 1
TEXT this
WS 1
TEXT is
WS 1
TEXT a
WS 1
TEXT new
WS 1
TEXT paragraph
WS 1
TEXT with
WS 1
TEXT only
WS 1
TEXT a
WS 1
TEXT single
WS 1
TEXT sentence.
heading
HEADING_LEVEL 2
TEXT This
WS 1
TEXT is
WS 1
TEXT our
WS 1
TEXT procedure
proc_item
PROC_STEP 1
PROC_DELIMITER .
TEXT Do
WS 1
TEXT this
proc_item
PROC_STEP 2
PROC_DELIMITER .
TEXT Do
WS 1
TEXT that:
list_item
LIST_MARKER a
LIST_INDENT 3
TEXT This
WS 1
TEXT is
WS 1
TEXT a
WS 1
TEXT list
WS 1
TEXT with
WS 1
TEXT item
WS 1
TEXT 1
list_item
LIST_MARKER b
LIST_INDENT 3
TEXT The
WS 1
TEXT next
WS 1
TEXT item
list_item
LIST_MARKER c
LIST_INDENT 3
TEXT The
WS 1
TEXT last
WS 1
TEXT item.
proc_item
PROC_STEP 3
PROC_DELIMITER .
TEXT And
WS 1
TEXT then,
WS 1
TEXT do
WS 1
TEXT this
WS 1
TEXT one
WS 1
TEXT last
WS 1
TEXT time.
paragraph
TEXT This
WS 1
TEXT is
WS 1
TEXT para3.
WS 1
TEXT Here,
WS 1
TEXT we
WS 1
TEXT have
WS 1
TEXT another
WS 1
TEXT paragraph.
paragraph
TEXT This
WS 1
TEXT is
WS 1
TEXT para4.
WS 1
TEXT Here,
WS 1
TEXT we
WS 1
TEXT have
WS 1
TEXT another
WS 1
TEXT paragraph.
LINEBREAK
TEXT This
WS 1
TEXT continues
WS 1
TEXT para4
WS 1
TEXT after
WS 1
TEXT a
WS 1
TEXT LINEBREAK.
paragraph
TEXT This
WS 1
TEXT is
WS 1
TEXT para5.
WS 1
TEXT Here,
WS 1
TEXT we
WS 1
TEXT have
WS 1
TEXT another
WS 1
TEXT paragraph.
list_item
LIST_MARKER a
LIST_INDENT 4
TEXT This
WS 1
TEXT is
WS 1
TEXT a
WS 1
TEXT list
WS 1
TEXT with
WS 1
TEXT item
WS 1
TEXT 1
list_item
LIST_MARKER b
LIST_INDENT 4
TEXT The
WS 1
TEXT next
WS 1
TEXT item
list_item
LIST_MARKER c
LIST_INDENT 4
TEXT The
WS 1
TEXT last
WS 1
TEXT item.
proc_item
PROC_STEP 1
PROC_DELIMITER .
TEXT Another
WS 1
TEXT proc
WS 1
paren
TEXT without
WS 1
TEXT heading
proc_item
PROC_STEP 2
PROC_DELIMITER .
TEXT Last
WS 1
TEXT step.
cite
TEXT Line1.
WS 1
TEXT This-is-some-cite-text-1.1.
WS 1
TEXT This-is-some-cite-text-2.1.
cite
TEXT Line2.
WS 1
TEXT This-is-some-cite-text-2.1.
WS 1
TEXT This-is-some-cite-text-2.2.
paragraph
TEXT And
WS 1
TEXT yet
WS 1
TEXT another,
WS 1
TEXT paragraph.
cite
TEXT LineA.
WS 1
TEXT This-is-some-cite-text-A.1.
WS 1
TEXT This-is-some-cite-text-A.1.
cite
TEXT LineB.
WS 1
TEXT This-is-some-cite-text-B.1.
WS 1
TEXT This-is-some-cite-text-B.2.
Format
- There are top-level tokens. These are tokens that must be at the top-most hierarchical level of the text.
- There are tokens that can only appear inside other tokens.
- A text must end with two `NEWLINE` tokens.
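It is easy to forget the last rule when the text comes from a file or from code. This is a minimal sketch of a helper that makes sure the input ends with two newlines before you give it to the parser. `ensure_trailing_newlines` is a hypothetical name and is not part of the library.

```python
def ensure_trailing_newlines(text: str) -> str:
    # Hypothetical helper, not part of biz.dfch.ste100parser.
    # Remove all line endings at the end of the text, then add two newlines.
    return text.rstrip("\r\n") + "\n\n"


print(repr(ensure_trailing_newlines("# Heading level 1")))
```

You can call the helper on `value` before you call `parser.is_valid(value)`.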
Whitespace (WS)
- Whitespace is a sequence of either `\t` or space tokens.
- `\t` is the same as eight space tokens.
This is TEXT with whitespace.
This is TEXT with multiple whitespace.
And\tthis\tis\talso\ttext\twith\twhitespace.
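The rule for tabs can be shown with a short sketch that calculates the width of a whitespace run. This is not the library code. `ws_width` is a hypothetical name, and it is an assumption that the number after `WS` in the transformed tree is this width.

```python
TAB_WIDTH = 8  # From the rule above: one tab is the same as eight spaces.


def ws_width(whitespace: str) -> int:
    """Return the width of a whitespace run, counted in single spaces."""
    width = 0
    for char in whitespace:
        if char == "\t":
            width += TAB_WIDTH
        elif char == " ":
            width += 1
        else:
            raise ValueError(f"not a whitespace character: {char!r}")
    return width


print(ws_width(" \t "))
```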
Single space (SPACE)
- A `SPACE` is a delimiter token that is only inside other tokens.
- For example, in `1) Text` the `SPACE` is the delimiter after `1)`.
1) A work step.
NEWLINE
- A `NEWLINE` is a top-level token.
- This is a `\r\n` or `\n`.
TEXT
Any character sequence that does not contain these characters: ^"'*_()\s` (regex).
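This is a minimal sketch of the `TEXT` rule as a Python regular expression. It assumes that the leading `^` in the rule is the negation of a character class. The grammar of the library can be different. `TEXT_PATTERN` and `find_text_runs` are hypothetical names.

```python
import re

# Assumption: "^" negates the character class. The class excludes the double
# quote, the single quote, "*", "_", "(", ")", whitespace and the backtick.
TEXT_PATTERN = re.compile(r"""[^"'*_()\s`]+""")


def find_text_runs(line: str) -> list[str]:
    """Return all character runs in a line that obey the TEXT rule."""
    return TEXT_PATTERN.findall(line)


print(find_text_runs("This is TEXT (with parentheses)."))
```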
APOSTROPHE
- An `APOSTROPHE` is either `'s` or `'` when it comes directly after `TEXT`.
- You must not put an `APOSTROPHE` in a `squote`.
Heading
- A `heading` is a top-level token.
- A `heading` is a line that starts with a `#` (or a multiple of `#`) followed by one or more `TEXT` tokens.
- Two `NEWLINE` tokens stop a `heading`.
# Heading level 1
## Heading level 2
### Heading level 3
#### Heading level 4
##### Heading level 5
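The heading level is the number of `#` characters at the start of the line. This is a minimal sketch that reads the level from one line. It does not use the library. `heading_level` is a hypothetical name, and the pattern is an assumption made from the examples above.

```python
import re
from typing import Optional

# One or more "#", then whitespace, then text that is not empty.
HEADING_PATTERN = re.compile(r"^(#+)[ \t]+(\S.*)$")


def heading_level(line: str) -> Optional[int]:
    """Return the heading level, or None if the line is not a heading."""
    match = HEADING_PATTERN.match(line)
    return len(match.group(1)) if match else None


print(heading_level("### Heading level 3"))
```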
Paragraph
- A `paragraph` is a top-level token.
- A `paragraph` starts after a `NEWLINE`, when `TEXT` directly comes after the `NEWLINE` token.
- Two `NEWLINE` tokens stop a `paragraph`.
- A `paragraph` can have a `NEWLINE` token between `TEXT` tokens.
This is a paragraph. This is still the paragraph.
This is another paragraph. This is still the second paragraph.
This is still the second paragraph (after a LINEBREAK).
This is a new and the last paragraph.
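The rules for two `NEWLINE` tokens and for a single `NEWLINE` can be sketched as a function that splits text into blocks and lines. This is an illustration only. The library uses a grammar and not this function, and `split_paragraphs` is a hypothetical name.

```python
import re


def split_paragraphs(text: str) -> list[list[str]]:
    """Split text into blocks. Each block is a list of lines."""
    # Use "\n" for all line endings.
    normalized = text.replace("\r\n", "\n")
    # Two or more newlines stop a block. Discard empty blocks.
    blocks = [block for block in re.split(r"\n{2,}", normalized) if block]
    # A single newline in a block is a line break in the same paragraph.
    return [block.split("\n") for block in blocks]


sample = "First. Still first.\n\nSecond.\nSecond after a LINEBREAK.\n\nLast.\n\n"
for lines in split_paragraphs(sample):
    print(lines)
```

The sketch does not know the difference between a paragraph and other top-level tokens such as a heading or a cite.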
Procedure (list of work steps)
- A `procedure` is a top-level token.
- A `procedure` is one or more work steps (`proc_item`).
- A `procedure` starts after a `NEWLINE` token, when `[a-zA-Z0-9]+` (`proc_marker`) and `[.)]` (`PROC_DELIMITER`) directly come after the `NEWLINE` token.
- A `proc_item` can contain a vertical list.
- A `proc_item` can contain a `NOTE` or a safety instruction (`WARNING`, `CAUTION`).
- In contrast to other Markdown, there are no two `NEWLINE` tokens to stop the vertical list, `NOTE` or safety instruction. There is only a single `NEWLINE` to stop one of these.
1. This is the first work step.
2. This is the second work step.
* This is a list item in a work step.
* Another list item in a work step.
3. This is the third work step.
NOTE: This is a note for the work step.
4. This is the fourth work step.
WARNING: This is a safety instruction for this work step of the type 'WARNING'.
5. This is the fifth work step.
CAUTION: This is a safety instruction for this work step of the type 'CAUTION'.
6. A work step can contain multiple:
* 'NOTE'
* 'WARNING'
* 'CAUTION'.
NOTE: This is a note for the work step.
WARNING: This is a safety instruction for this work step of the type 'WARNING'.
CAUTION: This is a safety instruction for this work step of the type 'CAUTION'.
7. This is the last work step.
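The start of a work step can be sketched with the two patterns from the rules, `[a-zA-Z0-9]+` and `[.)]`. This is a minimal sketch and not the grammar of the library. `parse_proc_item` is a hypothetical name, and the whitespace after the delimiter is an assumption made from the examples.

```python
import re
from typing import Optional

# proc_marker, then PROC_DELIMITER, then whitespace, then the text of the step.
PROC_ITEM_PATTERN = re.compile(r"^([a-zA-Z0-9]+)([.)])[ \t]+(.*)$")


def parse_proc_item(line: str) -> Optional[tuple[str, str, str]]:
    """Return (marker, delimiter, text), or None if the line is not a work step."""
    match = PROC_ITEM_PATTERN.match(line)
    if match is None:
        return None
    marker, delimiter, text = match.groups()
    return marker, delimiter, text


print(parse_proc_item("2. This is the second work step."))
```

A line such as `NOTE: This is a note` does not match, because `:` is not a delimiter. A paragraph line that starts with one word and a period does match in this sketch. How the library handles that case is not shown here.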
Vertical list (list_item)
- A vertical list can occur in a `paragraph` or a procedure (`proc_item`).
- A vertical list is one or more `list_item`.
- A `NEWLINE` starts a `list_item` when `WS+`, a `list_marker` and a `SPACE` come directly after the `NEWLINE` token.
- Before the `list_item`, there is `TEXT` that has a `:` as the last token.
- A numeric `list_marker` cannot contain a `.` or `)`. This is only correct for `proc_item`.
- A `list_marker` has `WS` (indentation).
- You must not put a vertical list inside another vertical list.
This is a paragraph, that starts a list:
* Indented list item with "*" as the list marker
* Another list item.
This is another paragraph, that starts a list:
* More indented list item with "*" as the list marker
* Another list item.
This is a paragraph, that starts a list:
1 Indented list item with a numeric as the list marker
2 Another list item.
This is a paragraph, that starts a list:
a Indented list item with a lower alpha as the list marker
b Another list item.
This is a paragraph, that starts a list:
A Indented list item with an upper alpha as the list marker
B Another list item.
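The difference between a `list_item` and a `proc_item` is the indentation and the marker without a delimiter. This is a minimal sketch of the check for one line. `parse_list_item` is a hypothetical name. The set of markers (`*`, `-`, digits, one letter) is an assumption made from the examples in this document.

```python
import re
from typing import Optional

# WS+, then a list marker, then one SPACE, then the text of the item.
LIST_ITEM_PATTERN = re.compile(r"^([ \t]+)([*-]|[0-9]+|[a-zA-Z]) (.*)$")


def parse_list_item(line: str) -> Optional[tuple[int, str, str]]:
    """Return (indent, marker, text), or None if the line is not a list item."""
    match = LIST_ITEM_PATTERN.match(line)
    if match is None:
        return None
    indent, marker, text = match.groups()
    # One tab is the same as eight spaces.
    width = sum(8 if char == "\t" else 1 for char in indent)
    return width, marker, text


print(parse_list_item("  1 This is the first list item."))
```

A line without indentation, or a numeric marker with `.` or `)`, gives `None`.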
Parentheses (paren)
- This container can
- A `paragraph` can contain `paren`.
- A `list_item` can contain `paren`.
- A `proc_item` can contain `paren`.
- A `cite` can contain `paren`.
- `paren` must not contain `NEWLINE` tokens.
- Parentheses can be nested.
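The last two rules can be sketched as a check that counts the nesting depth. This is an illustration and not the library code. `paren_is_valid` is a hypothetical name. Use it only on text without a `proc_item` marker, because a delimiter such as `1)` also contains `)`.

```python
def paren_is_valid(text: str) -> bool:
    """Return True if the parentheses are balanced and contain no line ending."""
    depth = 0
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            if depth == 0:
                return False  # A ")" without a "(" before it.
            depth -= 1
        elif char in "\r\n" and depth > 0:
            return False  # A line ending inside parentheses.
    return depth == 0


print(paren_is_valid("Another proc (without heading)"))
```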
Quote and cite
Double quote (dquote)
- This formatter shows text in "double quote" (`dquote`).
- This token cannot contain `NEWLINE`.
- You must not nest `dquote`.
- `dquote` can contain `squote`.
- `squote` can contain "formatters".
"this is text in double quote"
Single quote (squote)
- This formatter shows text in 'single quote' (`squote`).
- This token cannot contain `NEWLINE`.
- You must not nest `squote`.
- `squote` can contain `dquote`.
- `squote` can contain "formatters".
'this is text in single quote'
Citation (cite)
- A
- A `cite` is a top-level token.
- This formatter shows text as a "citation" (`cite`).
- A `NEWLINE` starts a `cite`, when a `>` comes directly after the `NEWLINE` token.
- A `cite` must not be empty. It must contain `TEXT` or `WS`.
- This token cannot contain `NEWLINE`.
- You must not nest `cite`.
> This is a citation line.
> This is another citation line.
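The `cite` rules for one line can be sketched as follows. This is a minimal sketch and not the grammar of the library. `parse_cite` is a hypothetical name. It is an assumption that one space after `>` is a delimiter and not part of the content.

```python
import re
from typing import Optional

# A ">" at the start of the line, then content that is not empty.
CITE_PATTERN = re.compile(r"^>(.+)$")


def parse_cite(line: str) -> Optional[str]:
    """Return the content of a cite line, or None if the line is not a cite."""
    match = CITE_PATTERN.match(line)
    if match is None:
        return None
    content = match.group(1)
    if content.lstrip().startswith(">"):
        return None  # You must not nest cite.
    # Assumption: one space after ">" is a delimiter.
    return content[1:] if content.startswith(" ") else content


print(parse_cite("> This is a citation line."))
```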
Formatters
Bold
- This formatter shows text in bold (`bold`).
- This token cannot contain `NEWLINE`.
*this is text in bold*
Emphasis
- This formatter shows text in emphasis (`emph`).
- This token cannot contain `NEWLINE`.
_this is text in emphasis_
Bold emphasis
- This formatter shows text in bold emphasis (`boldemph`).
- This token cannot contain `NEWLINE`.
*_this is text in bold emphasis_*
Code
- This formatter shows text in monospace (`code`).
- This token can contain `NEWLINE`.
`this is text in monospace`
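The four formatters use different delimiters. This is a minimal sketch that finds them in a line with one regular expression. It is an illustration only: the library uses a grammar, and `find_formatters` is a hypothetical name. The sketch does not handle formatters inside quotes or parentheses.

```python
import re

# The order is important: "*_..._*" must be tested before "*...*" and "_..._".
FORMATTER_PATTERN = re.compile(
    r"\*_(?P<boldemph>[^*_\n]+)_\*"
    r"|\*(?P<bold>[^*\n]+)\*"
    r"|_(?P<emph>[^_\n]+)_"
    r"|`(?P<code>[^`]+)`"  # code can contain a line ending
)


def find_formatters(text: str) -> list[tuple[str, str]]:
    """Return (formatter name, content) for each formatter found in the text."""
    return [
        (match.lastgroup, match.group(match.lastgroup))
        for match in FORMATTER_PATTERN.finditer(text)
    ]


print(find_formatters("In *this* paragraph and in _that_ paragraph."))
```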
Examples:
You can find examples in ./test/test_data/.
Heading with paragraph
# This is a heading level 1
This is the start of a paragraph. And this is the end of the paragraph.
This is a new paragraph. A paragraph continues after a single NEWLINE.
This is still the same paragraph.
Paragraph with vertical lists
# This is a heading level 1
This is the start of a paragraph. This will start a new vertical list:
* Note, that the list delimiter '*' is indented by a minimum of one `WS`.
* The next list item.
This continues the paragraph. This is not standard 'Github'-flavored Markdown.
This is a new paragraph. This will start a new vertical list:
- This is another list delimiter.
- Another list item.
This is a new paragraph. This will start a new vertical list:
1 This is another list delimiter.
2 Another list item.
This is another paragraph.
Paragraph with formatters, quotes and cite
# This is a heading level 1
## Text in quotes
This is a paragraph. In *this* paragraph we have "text in double quotes".
> Here is a citation. This is similar to a full line in "double quotes".
This is another paragraph. In _that_ paragraph we have 'text in single quotes'.
At last, this is another paragraph. In *_that_* paragraph we have "text in 'double' quotes" that contains "'single' quotes".