Python tool to match phrases listed in the taxonomy

taxonomy-matcher

CI Status

https://travis-ci.org/tilaboy/taxonomy-matcher.svg?branch=master

Description

Given a gazetteer/taxonomy and input text, taxonomy-matcher finds all phrases in the text that match the codes/instances/keywords in the gazetteer or taxonomy.

For each match, it returns:

  • surface_form

  • matched position

  • Code ID and Code Description

  • and other code related information
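
To make the idea concrete, here is a minimal, self-contained sketch of phrase matching against a small gazetteer. It is not the package's implementation: `find_phrases` and the tuple layout are hypothetical names chosen for illustration, and the matching is a plain case-insensitive, whole-word regular expression search that reports only non-overlapping matches.

```python
import re


def find_phrases(text, phrases):
    """Yield (surface_form, start_pos, end_pos) for each phrase found in text.

    Hypothetical helper for illustration only; not part of taxonomy-matcher.
    """
    # Put longer phrases first so the alternation prefers the longest
    # candidate when several phrases could start at the same position.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b",
        re.IGNORECASE,
    )
    for m in pattern.finditer(text):
        yield m.group(0), m.start(), m.end()


gazetteer = ["risk analysis", "risk assessment", "quantitative risk assessment"]
text = "Experience with quantitative risk assessment and Risk Analysis."
for surface_form, start, end in find_phrases(text, gazetteer):
    print(surface_form, start, end)
```

Because `re.finditer` never returns overlapping matches, the shorter phrase "risk assessment" nested inside the longer one is not reported separately in this sketch.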

Requirements

Python 3.6+

Usage

Use taxonomy-match script:

usage: taxonomy-match input_file taxonomy_file [--output_file OUTPUT_FILE]


Load taxonomy phrases from the taxonomy file, and find all matched phrases
from the input text. The result is either written to an output file or printed
to the screen.

positional arguments:
  input_file            input text file, text to mine phrases
  taxonomy_file         taxonomy file, support json/xml/txt, see documentation
                        for more details

optional arguments:
  --output_file         output file of matched phrases, supports
                        jsonl/csv/tsv/txt format, print matched phrases to
                        the screen if not defined

Use taxonomy-matcher module

  • From normalization table in JSON format:

from taxonomy_matcher.matcher import Matcher
taxonomy_matcher = Matcher(normtable=json_file)
for matched in taxonomy_matcher.matching(text):
    print(matched)

And an example of the normalization table in JSON:

{
  "meta": {
    "concept_type": "skills",
    "release_datetime": "2019-xx-xx"
  },
  "concepts": [
    {
      "display_name": "Risk Analysis",
      "category": "Financial Skill",
      "id": "ABCDEFG001",
      "surface_forms": [
        {
          "surface_form": "risk analysis",
          "skill_likelihood": 0.9
        },
        {
          "surface_form": "quantitative risk assessment",
          "skill_likelihood": 1.0
        },
        {
          "surface_form": "risk assessment",
          "skill_likelihood": 0.7
        }
      ]
    },
    .......
    {
      "display_name": "Mobile Data",
      "category": "Computer Skill",
      "id": "ABCDEFG002",
      "surface_forms": [
        {
          "surface_form": "mobile data"
        }
      ]
    }
  ]
}
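
As a rough illustration of how a normalization table with this layout could be turned into a lookup structure, the sketch below indexes every surface form by its lower-cased text. The function name `build_surface_form_index` and the shape of the returned dictionary are assumptions made for this example; the package may organize its data differently.

```python
def build_surface_form_index(normtable):
    """Map each lower-cased surface form to details of its concept.

    Hypothetical helper for illustration only; not part of taxonomy-matcher.
    """
    index = {}
    for concept in normtable.get("concepts", []):
        for entry in concept.get("surface_forms", []):
            index[entry["surface_form"].lower()] = {
                "id": concept["id"],
                "display_name": concept["display_name"],
                "category": concept.get("category"),
                # skill_likelihood is optional, so fall back to None
                "skill_likelihood": entry.get("skill_likelihood"),
            }
    return index


normtable = {
    "meta": {"concept_type": "skills"},
    "concepts": [
        {
            "display_name": "Risk Analysis",
            "category": "Financial Skill",
            "id": "ABCDEFG001",
            "surface_forms": [
                {"surface_form": "risk analysis", "skill_likelihood": 0.9},
                {"surface_form": "risk assessment", "skill_likelihood": 0.7},
            ],
        },
        {
            "display_name": "Mobile Data",
            "category": "Computer Skill",
            "id": "ABCDEFG002",
            "surface_forms": [{"surface_form": "mobile data"}],
        },
    ],
}

index = build_surface_form_index(normtable)
print(index["risk assessment"]["display_name"])
print(index["mobile data"]["skill_likelihood"])
```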
  • From gazetteer:

from taxonomy_matcher.matcher import Matcher
taxonomy_matcher = Matcher(gazetteer=gz_file)
for matched in taxonomy_matcher.matching(text):
    print(matched)

And an example of the gazetteer:

# gazetteer
mobile data
risk analysis
quantitative risk assessment
risk assessment
.....
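
A gazetteer in this form is just one phrase per line, so reading it is straightforward. The sketch below is an assumption-laden illustration rather than the package's loader: `read_gazetteer` is a hypothetical name, and treating lines that start with `#` as comments is inferred from the example, not confirmed by the documentation.

```python
def read_gazetteer(lines):
    """Return the phrases from an iterable of gazetteer lines.

    Hypothetical helper for illustration only. Blank lines are skipped, and
    lines starting with '#' are assumed to be comments.
    """
    phrases = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        phrases.append(line)
    return phrases


sample = ["# gazetteer\n", "mobile data\n", "\n", "risk analysis\n"]
print(read_gazetteer(sample))
```

Since the function accepts any iterable of strings, an open text file can be passed in directly.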
  • From Taxonomy Codetable:

from taxonomy_matcher.matcher import Matcher
ct_matcher = Matcher(codetable=ct_file)
for matched in ct_matcher.matching(text):
    print(matched)

CodeTable is an XML version of the JSON example given above.

Other functions

  • Context words:

When context is needed for matched phrases, e.g. for follow-up validation functions, enable the with_context option:

from taxonomy_matcher.matcher import Matcher
taxonomy_matcher = Matcher(normtable=json_file, with_context=True)
for matched in taxonomy_matcher.matching(text):
    print(matched.left_context, matched.right_context)
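
To show what left and right context can mean in practice, here is a small sketch that derives a few words on each side of a match from its start and end positions. The helper `context_words` and its `window` parameter are hypothetical; how the package itself computes or sizes the context is not stated here.

```python
def context_words(text, start_pos, end_pos, window=3):
    """Return up to `window` words to the left and right of text[start_pos:end_pos].

    Hypothetical helper for illustration only; not part of taxonomy-matcher.
    """
    left_words = text[:start_pos].split()
    right_words = text[end_pos:].split()
    # Guard against window == 0, because a [-0:] slice would return everything.
    left = left_words[-window:] if window > 0 else []
    right = right_words[:window] if window > 0 else []
    return " ".join(left), " ".join(right)


text = "Strong background in risk analysis for retail banking clients"
start = text.index("risk analysis")
end = start + len("risk analysis")
left_context, right_context = context_words(text, start, end, window=2)
print(left_context, "|", right_context)
```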
  • Code property lookup

If you need to look up the properties of a code in the taxonomy, check the Matcher class property ‘code_property_mapping’. It is a dictionary mapping each code id to its description and category, in the form:

dict[code_id] = {
    'desc':code_description,
    'type':code_category
}

For example, to get the description of a code id:

codeid = 12345
from taxonomy_matcher.matcher import Matcher
taxonomy_matcher = Matcher(normtable=json_file)
if codeid in taxonomy_matcher.code_property_mapping:
    print(taxonomy_matcher.code_property_mapping[codeid]['desc'])
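
For intuition, a mapping of this shape could be derived from a normalization table roughly as follows. This is a sketch under assumptions: `build_code_property_mapping` is a hypothetical function, and it assumes that 'desc' corresponds to the concept's display_name and 'type' to its category, which the documentation does not spell out.

```python
def build_code_property_mapping(normtable):
    """Map each concept id to its description and category.

    Hypothetical helper for illustration only. Assumes 'desc' comes from
    display_name and 'type' comes from category.
    """
    return {
        concept["id"]: {
            "desc": concept["display_name"],
            "type": concept.get("category"),
        }
        for concept in normtable.get("concepts", [])
    }


normtable = {
    "concepts": [
        {"id": "ABCDEFG001", "display_name": "Risk Analysis", "category": "Financial Skill"},
        {"id": "ABCDEFG002", "display_name": "Mobile Data", "category": "Computer Skill"},
    ]
}

mapping = build_code_property_mapping(normtable)
codeid = "ABCDEFG001"
if codeid in mapping:
    print(mapping[codeid]["desc"])
```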

  • Check the meta info of the taxonomy or gazetteer

Note: currently only available for the normalized code JSON.

The meta info can be stored in the meta part of the JSON document, e.g. if the following information is listed in the JSON meta section:

"meta": {
  "language": "EN",
  "release_datetime": "2019-04-17T12:22:10.729673",
  "concept_type": "skills",
  "purpose": "normalization"
},

We can fetch it via the matcher object:

from taxonomy_matcher.matcher import Matcher
taxonomy_matcher = Matcher(normtable=json_file)
print(taxonomy_matcher['meta_info'])

The output will be:

{
  'language': 'EN',
  'release_datetime': '2019-04-17T12:22:10.729673',
  'concept_type': 'skills',
  'purpose': 'normalization'
}

Matched phrase object: MatchedPhrase

matcher.matching returns an iterable of MatchedPhrase instances. Each instance has the following attributes:

  • normalized pattern form: matched_pattern

  • surface form: surface_form

  • start position and end position: start_pos, end_pos

  • code_id and code_description (None if not set in the pattern file)

  • left context and right context of the matched skills (only available if with_context=True)

for matched in matcher.matching(text):
    # left_context and right_context are only set when with_context=True
    print("found pattern [{}] in the form of [{}] at position ({}:{}), code:{} {} {}, context: [{}] [{}]".format(
        matched.matched_pattern,
        matched.surface_form,
        matched.start_pos,
        matched.end_pos,
        matched.code_id,
        matched.code_description,
        matched.category,
        matched.left_context,
        matched.right_context,
    ))
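
The attribute list above can be pictured as a simple record type. The sketch below is a stand-in, not the package's MatchedPhrase class: `MatchedPhraseSketch` is a hypothetical name, the field types are guesses, and the optional fields default to None to mirror the "None if not set" behaviour described above.

```python
from typing import NamedTuple, Optional


class MatchedPhraseSketch(NamedTuple):
    """Hypothetical stand-in for MatchedPhrase, for illustration only."""

    matched_pattern: str
    surface_form: str
    start_pos: int
    end_pos: int
    # The remaining fields are optional and default to None.
    code_id: Optional[str] = None
    code_description: Optional[str] = None
    category: Optional[str] = None
    left_context: Optional[str] = None
    right_context: Optional[str] = None


example = MatchedPhraseSketch(
    matched_pattern="risk analysis",
    surface_form="Risk Analysis",
    start_pos=49,
    end_pos=62,
    code_id="ABCDEFG001",
    code_description="Risk Analysis",
)
print(example.surface_form, example.start_pos, example.end_pos, example.left_context)
```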

Development

To install the package and its dependencies, run the following from the project root directory:

python setup.py install

Testing

To run unit tests, execute the following from the project root directory:

python setup.py test

0.0.9 (2019-11-14)

Update the likelihood option; normalize strings for both the surface forms from the taxonomy and the matched phrases.

0.0.8 (2019-08-05)

Add option “output to file” to taxonomy-match script, support jsonl/csv/tsv/txt, and STDOUT

0.0.7 (2019-07-30)

Add a script taxonomy-match to find all matches from input_text, with a given taxonomy

0.0.6 (2019-07-30)

test the travis-ci

0.0.5 (2019-07-30)

test the travis-ci

0.0.4 (2019-07-30)

Rename the package to taxonomy-matcher. Reorganize the structure of the package.

0.0.3 (2019-07-28)

Test the CI framework, add Travis support, add automatic documentation generation.

0.0.2 (2019-07-27)

Added a working version of Matcher, which can create a matcher with the gazetteer from either a txt, json, or xml format, and find matched phrases in input text.

0.0.1 (2019-07-27)

Initiate the package.
