Python tool to match phrases listed in the taxonomy
taxonomy-matcher
Description
Given a gazetteer/taxonomy and some input text, taxonomy-matcher can be used to find all matched phrases.
Requirements
Python 3.6+
Usage
Use the taxonomy-match script:
    usage: taxonomy-match input_file
                          (--json_tax JSON_TAX | --xml_tax XML_TAX | --gz_tax GZ_TAX)

    find matched phrases from input text

    positional arguments:
      input_file           input text file

    optional arguments:
      --json_tax JSON_TAX  normalization taxonomy in json form
      --xml_tax XML_TAX    taxonomy in xml form
      --gz_tax GZ_TAX      a list of keywords in txt form
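The usage text above requires exactly one of the three taxonomy options. As a minimal sketch of how such an interface can be declared with the standard library's argparse (this is an illustration, not the script's actual source; the function name build_parser is hypothetical):

```python
import argparse


def build_parser():
    """Build a parser that mirrors the usage text shown above (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="taxonomy-match",
        description="find matched phrases from input text",
    )
    parser.add_argument("input_file", help="input text file")
    # Exactly one taxonomy source must be given, as in the usage line.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--json_tax", help="normalization taxonomy in json form")
    source.add_argument("--xml_tax", help="taxonomy in xml form")
    source.add_argument("--gz_tax", help="a list of keywords in txt form")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["input.txt", "--json_tax", "taxonomy.json"])
print(args.input_file, args.json_tax, args.xml_tax, args.gz_tax)
```

Passing two taxonomy options at once, or none, makes argparse exit with a usage error.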
Use the taxonomy-matcher module:
From a normalization table in JSON format:
    from taxonomy_matcher.matcher import Matcher

    taxonomy_matcher = Matcher(normtable=json_file)
    for matched in taxonomy_matcher.matching(text):
        print(matched)
And an example of the normalization table in JSON:

    {
      "meta": {
        "concept_type": "skills",
        "release_datetime": "2019-xx-xx"
      },
      "concepts": [
        {
          "display_name": "Risk Analysis",
          "category": "Financial Skill",
          "id": "ABCDEFG001",
          "surface_forms": [
            { "surface_form": "risk analysis", "skill_likelihood": 0.9 },
            { "surface_form": "quantitative risk assessment", "skill_likelihood": 1.0 },
            { "surface_form": "risk assessment", "skill_likelihood": 0.7 }
          ]
        },
        .......
        {
          "display_name": "Mobile Data",
          "category": "Computer Skill",
          "id": "ABCDEFG002",
          "surface_forms": [
            { "surface_form": "mobile data" }
          ]
        }
      ]
    }
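To make the structure of the table above concrete, here is a small standard-library sketch that indexes every surface form by the id of its concept. It does not use taxonomy-matcher itself and is not how the package necessarily stores the table; the helper name index_surface_forms and the shortened sample table are assumptions for illustration.

```python
import json

# Shortened copy of the example table, with the elided concepts left out.
NORMTABLE = """
{
  "meta": {"concept_type": "skills", "release_datetime": "2019-xx-xx"},
  "concepts": [
    {
      "display_name": "Risk Analysis",
      "category": "Financial Skill",
      "id": "ABCDEFG001",
      "surface_forms": [
        {"surface_form": "risk analysis", "skill_likelihood": 0.9},
        {"surface_form": "risk assessment", "skill_likelihood": 0.7}
      ]
    },
    {
      "display_name": "Mobile Data",
      "category": "Computer Skill",
      "id": "ABCDEFG002",
      "surface_forms": [{"surface_form": "mobile data"}]
    }
  ]
}
"""


def index_surface_forms(table):
    """Map each surface form to the id of the concept that lists it (hypothetical helper)."""
    index = {}
    for concept in table["concepts"]:
        for form in concept["surface_forms"]:
            index[form["surface_form"]] = concept["id"]
    return index


surface_index = index_surface_forms(json.loads(NORMTABLE))
print(surface_index["risk assessment"])  # ABCDEFG001
```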
From gazetteer:
    from taxonomy_matcher.matcher import Matcher

    taxonomy_matcher = Matcher(gazetteer=gz_file)
    for matched in taxonomy_matcher.matching(text):
        print(matched)
And an example of the gazetteer:

    # gazetteer
    mobile data
    risk analysis
    quantitative risk assessment
    risk assessment
    .....
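A gazetteer in this form is one phrase per line. The sketch below reads such a list with the standard library only. It assumes that lines starting with "#" are comments and that blank lines can be skipped; neither rule is confirmed by the package, and read_gazetteer is a hypothetical name.

```python
import io


def read_gazetteer(lines):
    """Collect one phrase per line, skipping blanks and '#' comment lines (assumed rules)."""
    phrases = []
    for line in lines:
        phrase = line.strip()
        if not phrase or phrase.startswith("#"):
            continue
        phrases.append(phrase)
    return phrases


# An in-memory file stands in for a gazetteer on disk.
sample = io.StringIO("# gazetteer\nmobile data\nrisk analysis\n\nrisk assessment\n")
gazetteer_phrases = read_gazetteer(sample)
print(gazetteer_phrases)  # ['mobile data', 'risk analysis', 'risk assessment']
```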
From Taxonomy Codetable:
    from taxonomy_matcher.matcher import Matcher

    ct_matcher = Matcher(codetable=ct_file)
    for matched in ct_matcher.matching(text):
        print(matched)
The CodeTable is an XML version of the JSON example given above.
Other functions
Context words:
When context is needed for matched phrases, e.g. for follow-up validation functions, enable the with_context option:
    from taxonomy_matcher.matcher import Matcher

    taxonomy_matcher = Matcher(normtable=json_file, with_context=True)
    for matched in taxonomy_matcher.matching(text):
        print(matched.left_context, matched.right_context)
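To illustrate what left and right context around a match can look like, the following sketch takes a few whitespace-separated tokens on each side of a matched span. The window width, the tokenization, and the function name context_windows are all assumptions; the form in which taxonomy-matcher actually returns context may differ.

```python
def context_windows(text, start, end, width=3):
    """Return up to `width` whitespace-separated tokens on each side of text[start:end]."""
    left = text[:start].split()[-width:]
    right = text[end:].split()[:width]
    return left, right


text = "She has strong risk analysis skills in banking"
start = text.index("risk analysis")
end = start + len("risk analysis")
left, right = context_windows(text, start, end, width=2)
print(left, right)  # ['has', 'strong'] ['skills', 'in']
```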
Code Property lookup
To look up the properties of a code in the taxonomy, use the Matcher class property code_property_mapping. It is a dictionary that maps each code id to its description and category, in the form:
    dict[code_id] = {
        'desc': code_description,
        'type': code_category
    }
For example, to get the description of a code id:
    from taxonomy_matcher.matcher import Matcher

    codeid = 12345
    taxonomy_matcher = Matcher(normtable=json_file)
    if codeid in taxonomy_matcher.code_property_mapping:
        print(taxonomy_matcher.code_property_mapping[codeid]['desc'])
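A mapping of this shape can be derived from the "concepts" list of the normalization table. The sketch below assumes that 'desc' corresponds to "display_name" and 'type' to "category"; that correspondence is a guess based on the JSON example, and build_code_property_mapping is a hypothetical helper, not part of the package.

```python
def build_code_property_mapping(concepts):
    """Build {code_id: {'desc': ..., 'type': ...}} from a list of concept dicts.

    Assumes 'desc' comes from "display_name" and 'type' from "category".
    """
    mapping = {}
    for concept in concepts:
        mapping[concept["id"]] = {
            "desc": concept["display_name"],
            "type": concept["category"],
        }
    return mapping


concepts = [
    {"display_name": "Risk Analysis", "category": "Financial Skill", "id": "ABCDEFG001"},
    {"display_name": "Mobile Data", "category": "Computer Skill", "id": "ABCDEFG002"},
]
code_property_mapping = build_code_property_mapping(concepts)

codeid = "ABCDEFG001"
if codeid in code_property_mapping:
    print(code_property_mapping[codeid]["desc"])  # Risk Analysis
```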
Check the meta information of the taxonomy or gazetteer:
Note: currently only available for the normalization table in JSON format.
The meta information can be stored in the meta part of the JSON document, e.g. if the following information is listed in the JSON meta section:

    "meta": {
      "language": "EN",
      "release_datetime": "2019-04-17T12:22:10.729673",
      "concept_type": "skills",
      "purpose": "normalization"
    },
We can fetch it via the matcher object:
    from taxonomy_matcher.matcher import Matcher

    taxonomy_matcher = Matcher(normtable=json_file)
    print(taxonomy_matcher['meta_info'])
The output will be:
    {
      'language': 'EN',
      'release_datetime': '2019-04-17T12:22:10.729673',
      'concept_type': 'skills',
      'purpose': 'normalization'
    }
Matched phrase object: MatchedPhrase
matcher.matching returns an iterable of MatchedPhrase instances. Each instance has the following attributes:
normalized pattern form: matched_pattern
surface form: surface_form
start position and end position: start_pos, end_pos
code id and code description: code_id, code_description (None if not set in the pattern file)
left context and right context of the matched phrase: left_context, right_context (only available if with_context=True)
    for matched in matcher.matching(text):
        # left_context and right_context are only available if with_context=True
        print(
            "found pattern [{}] in the form of [{}] at position ({}:{}), "
            "code:{} {} {}, context: [{}] [{}]".format(
                matched.matched_pattern,
                matched.surface_form,
                matched.start_pos,
                matched.end_pos,
                matched.code_id,
                matched.code_description,
                matched.category,
                matched.left_context,
                matched.right_context,
            )
        )
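For readers who want a feel for what such match objects carry, here is a self-contained sketch of a naive phrase finder built on the standard library's re module. It is not the algorithm used by taxonomy-matcher. The Match tuple, the find_phrases function, the lowercase normalization, and the exclusive end position are all assumptions made for illustration.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for MatchedPhrase, with only four of its attributes.
Match = namedtuple("Match", "matched_pattern surface_form start_pos end_pos")


def find_phrases(text, phrases):
    """Yield case-insensitive whole-word matches, trying longer phrases first."""
    if not phrases:
        return
    # Longest first, so "quantitative risk assessment" wins over "risk assessment"
    # when both could start at the same position.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b",
        re.IGNORECASE,
    )
    for m in pattern.finditer(text):
        # Lowercasing is an assumed normalization; end_pos is exclusive here.
        yield Match(m.group(0).lower(), m.group(0), m.start(), m.end())


text = "Quantitative risk assessment and Mobile Data"
phrases = ["risk assessment", "quantitative risk assessment", "mobile data"]
found = list(find_phrases(text, phrases))
for match in found:
    print(match.matched_pattern, match.start_pos, match.end_pos)
```

Because regex matches do not overlap, the shorter "risk assessment" is not reported separately once the longer phrase has consumed that span.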
Development
To install the package and its dependencies, run the following from the project root directory:
python setup.py install
Testing
To run unit tests, execute the following from the project root directory:
python setup.py test
0.0.5 (2019-07-30)
Added the taxonomy-match script, which finds all matches given a taxonomy and an input text file.
0.0.4 (2019-07-30)
Renamed the package to taxonomy-matcher. Reorganized the package structure.
0.0.3 (2019-07-28)
Tested the CI framework, added Travis support, and added automatic documentation generation.
0.0.2 (2019-07-27)
Added a working version of Matcher, which can create a matcher from a gazetteer in either txt, JSON, or XML format and find matched phrases in input text.
0.0.1 (2019-07-27)
Initialized the package.