Python tool to match phrases listed in the taxonomy

taxonomy-matcher

Description

Given a gazetteer/taxonomy and some input text, taxonomy-matcher can be used to find all matched phrases.

CI Status

Travis CI build status: https://travis-ci.org/tilaboy/taxonomy-matcher.svg?branch=master

Requirements

Python 3.6+

Usage

Use the taxonomy-match script:

usage: taxonomy-match input_file
                      (--json_tax JSON_TAX | --xml_tax XML_TAX | --gz_tax GZ_TAX)


find matched phrases from input text

positional arguments:
  input_file           input text file

optional arguments:
  --json_tax JSON_TAX  normalization taxonomy in json form
  --xml_tax XML_TAX    taxonomy in xml form
  --gz_tax GZ_TAX      a list of keywords in txt form
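
The parenthesised group in the usage line means that exactly one taxonomy source must be given. As a rough illustration only (this is not the script's actual source), an equivalent parser could be sketched with argparse as below; the file names `resume.txt` and `skills.txt` are hypothetical.

```python
import argparse


def build_parser():
    """Sketch of a parser that mirrors the usage text above (an assumption)."""
    parser = argparse.ArgumentParser(
        prog="taxonomy-match",
        description="find matched phrases from input text",
    )
    parser.add_argument("input_file", help="input text file")
    # Exactly one of the three taxonomy options has to be supplied.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--json_tax", help="normalization taxonomy in json form")
    source.add_argument("--xml_tax", help="taxonomy in xml form")
    source.add_argument("--gz_tax", help="a list of keywords in txt form")
    return parser


# Hypothetical file names, used only to show how the arguments are parsed.
args = build_parser().parse_args(["resume.txt", "--gz_tax", "skills.txt"])
print(args.input_file, args.gz_tax)
```

With a parser built this way, omitting all three options or passing two of them at once makes argparse exit with a usage error.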

Use the taxonomy-matcher module:

  • From normalization table in JSON format:

from taxonomy_matcher.matcher import Matcher
taxonomy_matcher = Matcher(normtable=json_file)
for matched in taxonomy_matcher.matching(text):
    print(matched)

And an example of the normalization table in JSON:

{
  "meta": {
    "concept_type": "skills",
    "release_datetime": "2019-xx-xx"
  },
  "concepts": [
    {
      "display_name": "Risk Analysis",
      "category": "Financial Skill",
      "id": "ABCDEFG001",
      "surface_forms": [
        {
          "surface_form": "risk analysis",
          "skill_likelihood": 0.9
        },
        {
          "surface_form": "quantitative risk assessment",
          "skill_likelihood": 1.0
        },
        {
          "surface_form": "risk assessment",
          "skill_likelihood": 0.7
        }
      ]
    },
    .......
    {
      "display_name": "Mobile Data",
      "category": "Computer Skill",
      "id": "ABCDEFG002",
      "surface_forms": [
        {
          "surface_form": "mobile data"
        }
      ]
    }
  ]
}
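
To make the structure of the normalization table concrete, here is a minimal sketch that reads a shortened copy of the example and maps every surface form to the id of its concept. The helper `surface_form_index` is hypothetical and is not part of taxonomy-matcher; it only shows how the nested `concepts` and `surface_forms` lists relate to each other.

```python
import json

# Shortened copy of the example table above.
NORMTABLE_JSON = """
{
  "meta": {"concept_type": "skills"},
  "concepts": [
    {
      "display_name": "Risk Analysis",
      "category": "Financial Skill",
      "id": "ABCDEFG001",
      "surface_forms": [
        {"surface_form": "risk analysis", "skill_likelihood": 0.9},
        {"surface_form": "risk assessment", "skill_likelihood": 0.7}
      ]
    },
    {
      "display_name": "Mobile Data",
      "category": "Computer Skill",
      "id": "ABCDEFG002",
      "surface_forms": [{"surface_form": "mobile data"}]
    }
  ]
}
"""


def surface_form_index(normtable):
    """Map each surface form to the id of the concept it belongs to."""
    index = {}
    for concept in normtable["concepts"]:
        for form in concept["surface_forms"]:
            index[form["surface_form"]] = concept["id"]
    return index


index = surface_form_index(json.loads(NORMTABLE_JSON))
print(index["risk assessment"])  # ABCDEFG001
```
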
  • From gazetteer:

from taxonomy_matcher.matcher import Matcher
taxonomy_matcher = Matcher(gazetteer=gz_file)
for matched in taxonomy_matcher.matching(text):
    print(matched)

And an example of the gazetteer:

# gazetteer
mobile data
risk analysis
quantitative risk assessment
risk assessment
.....
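
For intuition about what matching against a gazetteer involves, the following is a naive, standard-library-only sketch. It is not how taxonomy-matcher is implemented: the helpers `load_gazetteer` and `find_phrases` are hypothetical, treating lines that start with `#` as comments is an assumption, and this version reports only the longest non-overlapping hit at each position.

```python
import re


def load_gazetteer(lines):
    """Keep non-empty lines; skipping '#' lines as comments is an assumption."""
    stripped = (line.strip() for line in lines)
    return [s for s in stripped if s and not s.startswith("#")]


def find_phrases(text, phrases):
    """Yield (start, end, phrase) for each case-insensitive whole-word hit."""
    # Longest phrases first, so the alternation prefers the longer match.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b",
        re.IGNORECASE,
    )
    for m in pattern.finditer(text):
        yield m.start(), m.end(), m.group(0).lower()


gazetteer = load_gazetteer([
    "# gazetteer",
    "mobile data",
    "risk analysis",
    "quantitative risk assessment",
    "risk assessment",
])
text = "Experience in quantitative risk assessment and mobile data."
for start, end, phrase in find_phrases(text, gazetteer):
    print(start, end, phrase)
```

In this sketch "risk assessment" is not reported separately, because it lies inside the longer "quantitative risk assessment" match.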
  • From Taxonomy Codetable:

from taxonomy_matcher.matcher import Matcher
ct_matcher = Matcher(codetable=ct_file)
for matched in ct_matcher.matching(text):
    print(matched)

CodeTable is an XML version of the JSON example given above.

Other functions

  • Context words:

When the context is needed for matched phrases, e.g. for follow-up validation functions, enable the with_context option:

from taxonomy_matcher.matcher import Matcher
taxonomy_matcher = Matcher(normtable=json_file, with_context=True)
for matched in taxonomy_matcher.matching(text):
    print(matched.left_context, matched.right_context)
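
The exact shape of `left_context` and `right_context` is not described here. As an illustration of the idea only, the sketch below takes a few whitespace-separated tokens on each side of a matched span; the helper `context_window` and its `size` parameter are hypothetical and not part of taxonomy-matcher.

```python
def context_window(text, start, end, size=3):
    """Return up to `size` whitespace tokens on each side of text[start:end]."""
    if size <= 0:
        return [], []
    left = text[:start].split()[-size:]
    right = text[end:].split()[:size]
    return left, right


text = "Strong background in risk analysis for retail banks"
start = text.index("risk analysis")
end = start + len("risk analysis")
print(context_window(text, start, end, size=2))
# (['background', 'in'], ['for', 'retail'])
```
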
  • Code Property lookup

To look up the properties of a code in the taxonomy, use the Matcher class property 'code_property_mapping'. It is a dictionary mapping each code id to its description and category, in the form:

dict[code_id] = {
    'desc':code_description,
    'type':code_category
}

For example, to get the description of a code id:

codeid = 12345
from taxonomy_matcher.matcher import Matcher
taxonomy_matcher = Matcher(normtable=json_file)
if codeid in taxonomy_matcher.code_property_mapping:
    print(taxonomy_matcher.code_property_mapping[codeid]['desc'])
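
To show how a mapping of this shape relates to the JSON normalization table, here is a minimal sketch that builds one from a list of concepts. It assumes that 'desc' corresponds to `display_name` and 'type' to `category`, which this document does not state explicitly; the helper `build_code_property_mapping` is hypothetical.

```python
def build_code_property_mapping(concepts):
    """Build {id: {'desc': ..., 'type': ...}} from normalization-table concepts.

    Assumption: 'desc' comes from display_name and 'type' from category.
    """
    mapping = {}
    for concept in concepts:
        mapping[concept["id"]] = {
            "desc": concept.get("display_name"),
            "type": concept.get("category"),
        }
    return mapping


concepts = [
    {"id": "ABCDEFG001", "display_name": "Risk Analysis", "category": "Financial Skill"},
    {"id": "ABCDEFG002", "display_name": "Mobile Data", "category": "Computer Skill"},
]
mapping = build_code_property_mapping(concepts)
print(mapping["ABCDEFG001"]["desc"])  # Risk Analysis
```

Dictionary keys are compared by value and type, so an integer id such as 12345 will not find an entry stored under the string "12345"; look up ids in the same form the taxonomy uses.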

  • Check the meta info of the taxonomy or gazetteer:

Note: currently only available for the normalization table in JSON format.

The meta info can be stored in the meta part of the JSON document. For example, if the following information is listed in the JSON meta section:

"meta": {
  "language": "EN",
  "release_datetime": "2019-04-17T12:22:10.729673",
  "concept_type": "skills",
  "purpose": "normalization"
},

We can fetch it via the matcher object:

from taxonomy_matcher.matcher import Matcher
taxonomy_matcher = Matcher(normtable=json_file)
print(taxonomy_matcher['meta_info'])

The output will be:

{
  'language': 'EN',
  'release_datetime': '2019-04-17T12:22:10.729673',
  'concept_type': 'skills',
  'purpose': 'normalization'
}

Matched phrase object: MatchedPhrase

matcher.matching returns an iterable of MatchedPhrase instances. Each instance has the following attributes:

  • normalized pattern form: matched_pattern

  • surface form: surface_form

  • start position and end position: start_pos, end_pos

  • code_id and code_description (None if not set in the pattern file)

  • left context and right context of the matched skills (only available if with_context=True)

for matched in matcher.matching(text):
    # left_context and right_context are only available if with_context=True
    print("found pattern [{}] in the form of [{}] at position ({}:{}), code: {} {} {}, context: [{}] [{}]".format(
        matched.matched_pattern,
        matched.surface_form,
        matched.start_pos,
        matched.end_pos,
        matched.code_id,
        matched.code_description,
        matched.category,
        matched.left_context,
        matched.right_context,
    ))

Development

To install the package and its dependencies, run the following from the project root directory:

python setup.py install

Testing

To run unit tests, execute the following from the project root directory:

python setup.py test

0.0.5 (2019-07-30)

Added the taxonomy-match script, which finds all matches given the taxonomy and the input text file.

0.0.4 (2019-07-30)

Renamed the package to taxonomy-matcher. Reorganized the structure of the package.

0.0.3 (2019-07-28)

Tested the CI framework, added Travis support, and added automatic documentation generation.

0.0.2 (2019-07-27)

Added a working version of Matcher, which can create a matcher with the gazetteer from either a txt, json, or xml format, and find matched phrases in input text.

0.0.1 (2019-07-27)

Initiated the package.
