Tools for parsing text content and creating data models for the content found.

fileparse is a package for reading the contents of a file and populating a data model with the information found.


pip install fileparse-tobiasli


Say you have some text and an idea of its structure:

nested_text = """# This is a title.
This is contents.
And some more.

## This is a subtitle.
with subtitle contents.

# This is another title.
With some contents."""

You can then define simple classes describing this content structure, along with patterns that match each content type. Finally, we define a parse.ContentFinder for each type, which lets us search for that content type in the text.

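import re

from fileparse import parse, read

# NOTE: the class bodies and the trailing ContentFinder arguments were cut off in
# this listing; `pass` and `content_type=...` below are assumed completions.
class Text(parse.Content):
    pass
text_match = re.compile('^(?P<text>[^#].+)$')
text_finder = parse.ContentFinder(start_pattern=text_match,
                                  content_type=Text)

class SubTitle(parse.Content):
    pass
subtitle_match = re.compile('^## ?(?P<subtitle>[^#].+)$')
subtitle_finder = parse.ContentFinder(start_pattern=subtitle_match,
                                      content_type=SubTitle,
                                      sub_content_finders=[text_finder])

class Title(parse.Content):
    pass
title_match = re.compile('^# ?(?P<title>[^#].+)$')
title_finder = parse.ContentFinder(start_pattern=title_match,
                                   content_type=Title,
                                   sub_content_finders=[subtitle_finder, text_finder])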

Notice two things:

  • The regex patterns use named capture groups. Each named group is added as a property on its content type, i.e. a SubTitle instance receives a subtitle property.
  • Text content can be found within both a Title and a SubTitle, whereas a SubTitle can only be found within a Title.
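
To see the first point in isolation, here is a minimal sketch that uses only the standard re module, without fileparse. It tries each pattern against a single line and copies the named groups onto a plain object, which is roughly the behaviour described above. The classify helper and the use of SimpleNamespace are illustrative assumptions, not part of fileparse.

```python
import re
from types import SimpleNamespace

# Same patterns as above; each one is matched against a single line.
PATTERNS = {
    "Title": re.compile('^# ?(?P<title>[^#].+)$'),
    "SubTitle": re.compile('^## ?(?P<subtitle>[^#].+)$'),
    "Text": re.compile('^(?P<text>[^#].+)$'),
}


def classify(line):
    """Return (type_name, namespace_of_named_groups), or None if nothing matches."""
    for name, pattern in PATTERNS.items():
        match = pattern.match(line)
        if match:
            # groupdict() maps each named group to the text it captured.
            return name, SimpleNamespace(**match.groupdict())
    return None


kind, content = classify("## This is a subtitle.")
print(kind, content.subtitle)
```

Running the patterns on single lines like this is also a quick way to check them before handing them to a finder.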

Finally, we define the Parser.

file_finder = parse.Parser(finders=[title_finder])

The file_finder is now ready to parse text content.

For this specific content, we need a text stream that can read from a string. We define it like this:

stream = read.TextStream(reader=read.StringReader(string=nested_text))

We can now parse the text with the rules defined in file_finder and see what comes out. To get information out of the resulting file object, use its get_contents_by_type(content_type) method.

file = file_finder.parse_stream(stream)

print(file.get_contents_by_type(SubTitle)[0].subtitle == 'This is a subtitle.')
print(file.get_contents_by_type(SubTitle)[0].contents[0].text == 'with subtitle contents.')
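
The SubTitle above is returned even though it sits inside a Title, which suggests that the lookup walks nested contents. A toy model of that idea, independent of fileparse, might look like the sketch below. The Node class and its fields are hypothetical and only illustrate the recursion; they are not fileparse's implementation.

```python
class Node:
    """Hypothetical stand-in for a parsed content object with nested contents."""

    def __init__(self, *contents):
        self.contents = list(contents)

    def get_contents_by_type(self, content_type):
        """Collect every nested node of the given type, depth first."""
        found = []
        for child in self.contents:
            if isinstance(child, content_type):
                found.append(child)
            found.extend(child.get_contents_by_type(content_type))
        return found


class Title(Node):
    pass


class SubTitle(Node):
    pass


class Text(Node):
    pass


# Mirrors the sample text: a title holding two text lines and a subtitle,
# followed by a second title with one text line.
root = Node(Title(Text(), Text(), SubTitle(Text())), Title(Text()))
print(len(root.get_contents_by_type(SubTitle)))
print(len(root.get_contents_by_type(Text)))
```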

Happy parsing.

