
Extract Structured Data from text


Textmater

Don't need to know where you're going, just need to know where you've been

Extract structured data (key-value pairs, grouped into sections) from text. Textmater runs backwards through the text, hence the name. It is useful for building a configuration that extracts data from one kind of file, which can then be applied to large numbers of similar documents.

Overview

The general approach is to construct a configuration of the Textmater class that pulls details out of a particular text format. The configured instance can then be fed further instances of that text to build up a structure of data, which can be saved to .json or .csv.

Example

Say we have an example of text like this:

example_text =

-Shops-
Pete's: Grocers
KFC: Fast Food
Newsman: Newsagents
-Sports-
Football: Round Ball
AFL: Egg Ball
Cricket:Round Ball

We want to extract every key and value, with keys being anything before a : and values being anything after it. We also want them grouped under their headers, and we want the output in JSON. We could create an instance with

resource = Textmater(section_header_regex = '-[a-zA-Z]*-')

Then run resource.drive(example_text). The resulting resource.section_dict would look like this:

{
    '-Shops-': [{"Pete's": "Grocers", "KFC": "Fast Food", "Newsman": "Newsagents"}],
    '-Sports-': [{"Football": "Round Ball", "AFL": "Egg Ball", "Cricket": "Round Ball"}]
}

If you ran it again on a similarly formatted piece of text, the '-Shops-' and '-Sports-' lists would each have another record appended.

resource.write_results_to_json() then saves the results as JSON, one file per section (that is, per key in section_dict).
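The grouping described in this example can be sketched in plain Python. This is an illustrative stand-in, not Textmater's implementation (it reads the text forwards, for one thing). The function name group_by_section and its parameters are hypothetical, and it returns a single record per section rather than the list-per-section shape of section_dict.

```python
import re


def group_by_section(text, header_pattern=r"-[a-zA-Z]*-", delimiter=":"):
    """Group 'key<delimiter>value' lines under the most recent header line.

    Hypothetical helper for illustration only; not part of Textmater.
    """
    record = {}
    current_section = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if re.fullmatch(header_pattern, line):
            # A header line starts a new section.
            current_section = line
            record[current_section] = {}
        elif current_section is not None and delimiter in line:
            # Split on the first delimiter only, so values may contain it.
            key, value = line.split(delimiter, 1)
            record[current_section][key.strip()] = value.strip()
    return record


example_text = """-Shops-
Pete's: Grocers
KFC: Fast Food
Newsman: Newsagents
-Sports-
Football: Round Ball
AFL: Egg Ball
Cricket:Round Ball
"""

grouped = group_by_section(example_text)
```

Stripping whitespace around keys and values is a choice made for this sketch so that 'Cricket:Round Ball' and 'AFL: Egg Ball' come out in the same form.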

Importing

from textmater import Textmater, tools

(tools is optional but has useful functions for working with text)

Configuring and running

resource = Textmater() will instantiate the class. There are a lot of options here, and all of them are optional. The options that take functions are applied in the order they appear below.

  • filter_functions: [function] takes a list of functions used to decide whether to skip an instance of text passed in. Each must take a string and return True or False. E.g. you pass in a function that returns False if 'denied' is present anywhere in the text. Then, when you run drive with this resource over a corpus of documents, the ones containing 'denied' are skipped.
  • transformation_functions: [function] takes a list of functions that are applied to transform the incoming text before further processing. Functions must take a string and return a string
  • section_header_regex: str(regex_pattern) 1st of 2 ways of specifying section headers. Provided pattern is run through the text to build the list of headers. Not to be used in conjunction with the next argument
  • section_header_list: [str] 2nd of 2 ways of specifying section headers. Direct values that if found in the text will be used to divide items found in the text. In the example, the same effect could have been achieved by passing in ['-Shops-', '-Sports-'] to this parameter instead
  • sections_to_skip: [str] list of section headers that, if found, will prompt Textmater to skip over the values in that section. Useful for improving output when a large section of the text holds contents you don't require.
  • cleanup_functions: [function] list of functions applied to each record before it is added to the section_dict. Must take a current_record_dict (<section header>: {dict of items within it}) and return the same. No need to make deepcopies as this is done automatically before passing the dict in.
  • overwrite_duplicate_keys: bool If set to false, Textmater will generate a unique version of any key that is already present when trying to add to the current_record_dict. It appends _i, where i is an integer starting at 2. In the unlikely event that <key>_i is also a collision, it increments i until it is not.
  • spread_keys: [(str, str)] list of tuples naming keys that you want to spread across sections (e.g. you find a value in one section and want it present in all of them, perhaps as an identifier). Each tuple is (section name, key). For example, if you have a key 'patient id' in a section 'identifiers' and want this id shared across all the sections as a primary key, your value for spread_keys would be [('identifiers', 'patient id')].
    If you don't know which section a key is in but still want to spread it when it's found, leave the section name empty, as in ('', 'patient id'). Textmater will then search for the key across all sections and spread it.
  • delimiter: str the character(s) used as the delimiter between keys and values.
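The function-valued options above accept ordinary Python callables. Below is a minimal sketch of one callable for each of filter_functions, transformation_functions and cleanup_functions. The three function names are hypothetical, and the sketch assumes (as the 'denied' example suggests) that a filter returning False causes the text to be skipped.

```python
import re


def not_denied(text: str) -> bool:
    """Filter: return False when 'denied' appears anywhere in the text."""
    return "denied" not in text


def collapse_whitespace(text: str) -> str:
    """Transformation: squeeze runs of spaces and tabs, keeping newlines."""
    return re.sub(r"[ \t]+", " ", text)


def lowercase_keys(current_record_dict: dict) -> dict:
    """Cleanup: lowercase every key inside every section.

    Takes and returns the {section header: {key: value}} shape.
    """
    return {
        section: {key.lower(): value for key, value in items.items()}
        for section, items in current_record_dict.items()
    }


# Passing them in would then look like this (parameter names as listed above):
# resource = Textmater(
#     filter_functions=[not_denied],
#     transformation_functions=[collapse_whitespace],
#     cleanup_functions=[lowercase_keys],
# )
```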

Appendix

current_record_dict:

a dict where keys are section headers and values are dicts of items in that section:

{
    'section 1': {'key1' : 'value1', 'key2': 'value2', 'primary_key': '0'},
    'section 2': {'other key 1': 'value 1', 'other key 2': 'value 2', 'primary_key': '0'} 
}

resource.current_record_dict stores the result of the most recent extraction in this format
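To make the behaviour described for overwrite_duplicate_keys and spread_keys more concrete, here is a plain-Python sketch that operates on a dict of this shape. It is not Textmater's code. The helper names unique_key and spread are hypothetical, and the choice not to overwrite a key that a section already holds is an assumption of this sketch.

```python
def unique_key(key, items):
    """Return key unchanged if unused, else key_2, key_3, ... (first free one)."""
    if key not in items:
        return key
    i = 2
    while f"{key}_{i}" in items:
        i += 1
    return f"{key}_{i}"


def spread(record, spread_keys):
    """Copy each (section name, key) value into every section of the record.

    An empty section name means: search all sections for the key.
    """
    for section_name, key in spread_keys:
        candidates = [section_name] if section_name else list(record)
        for candidate in candidates:
            if key in record.get(candidate, {}):
                value = record[candidate][key]
                for items in record.values():
                    # Assumption: existing values are left untouched.
                    items.setdefault(key, value)
                break
    return record


record = {
    "identifiers": {"patient id": "42"},
    "results": {"test": "ok"},
}
spread(record, [("identifiers", "patient id")])
```

After the call, record["results"] also carries 'patient id', which is what makes it usable as a primary key across sections.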

section_dict:

A dict for storing combined current_record_dicts. Keys are section headers and values are lists of dicts, one dict per extracted record:

{
    'section 1' : [{'key1' : 'value1', 'key2': 'value2', 'primary_key': '0'},
                {'key1' : 'value3', 'key2': 'value4', 'primary_key': '1'}],
    'section 2' : [{'other key 1' : 'value 1', 'primary_key': '0'},
                    {'other key 1': 'value z', 'primary_key': '1'}] 
}

resource.section_dict stores this
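The relationship between the two structures, and the one-file-per-section output, can be sketched as follows. This is an illustration under assumptions, not the library's implementation: add_record and write_sections_to_json are hypothetical helpers, and the file naming scheme is invented for this sketch.

```python
import json
from pathlib import Path


def add_record(section_dict, current_record_dict):
    """Append each section's items to that section's list in section_dict."""
    for section, items in current_record_dict.items():
        section_dict.setdefault(section, []).append(dict(items))
    return section_dict


def write_sections_to_json(section_dict, out_dir):
    """Write one JSON file per section and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for section, records in section_dict.items():
        # Hypothetical naming: replace non-alphanumeric characters with '_'.
        safe = "".join(c if c.isalnum() else "_" for c in section).strip("_")
        path = out_dir / f"{safe or 'section'}.json"
        path.write_text(json.dumps(records, indent=2), encoding="utf-8")
        paths.append(path)
    return paths


section_dict = {}
add_record(section_dict, {"section 1": {"key1": "value1", "primary_key": "0"}})
add_record(section_dict, {"section 1": {"key1": "value3", "primary_key": "1"}})
```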


This version

0.1

Download files


Source Distribution

textmater-0.1.tar.gz (27.3 kB)

Uploaded Source

Built Distribution

textmater-0.1-py3-none-any.whl (17.9 kB)

Uploaded Python 3
