quantity extractor
CQE
This is Version 2 and all updates will be applied to this branch.
The evaluation and data are in CQE_Evaluation.
A Framework for Comprehensive Quantity Extraction. This repository contains code for the paper:
CQE: A Framework for Comprehensive Quantity Extraction
Satya Almasian*, Vivian Kazakova*, Philipp Göldner, Michael Gertz
Institute of Computer Science, Heidelberg University
(* indicates equal contribution)
If you found this useful, consider citing us:
@misc{almasian2023cqe,
title={CQE: A Comprehensive Quantity Extractor},
author={Satya Almasian and Vivian Kazakova and Philip Göldner and Michael Gertz},
year={2023},
eprint={2305.08853},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Prerequisites
You can also install the package by running the following command in the root directory of the package:
pip install .
Usage
Create a NumParser and parse some text or a sentence:
from CQE import CQE
parser = CQE.CQE()
text = "The sp 500 was down 2.1% and nasdaq fell 2.5%."
result = parser.parse(text)
print(result)
>>> [(=,2.1,[%],percentage,[sp, 500]), (=,2.5,[%],percentage,[nasdaq])]
Use the overload option for additional functionality. With it, the NumParser also computes the span indices of the Quantity, the normalized input sentence, the long and the simplified scientific notation of the Value, whether the unit is scientific or noun-based, and the unit surface forms.
parser = CQE.CQE(overload=True)
text = "The sp 500 was down 2.1% and nasdaq fell 2.5%."
result = parser.parse(text)
for res in result:
    print(f"""
Quantity: {res}
=====
indices = {res.get_char_indices()}
normalized text = {res.get_normalized_text()}
pre processed text = {res.get_preprocessed_text()}
scientific notation = {res.value.scientific_notation}
simplified scientific notation = {res.value.simplified_scientific_notation}
scientific unit = {res.unit.scientific}
unit surfaces forms = {res.unit.unit_surfaces_forms}""")
>>> Quantity: (down,2.1,[%],percentage,{0: [sp, 500]})
=====
indices = [5, 6]
normalized text = The sp 500 was down 2.1 percentage and nasdaq fell 2.5 percentage .
pre processed text = The sp 500 was down 2.1% and nasdaq fell 2.5% .
scientific notation = 2.100000e+00
simplified scientific notation = 2.1e+00
scientific unit = True
unit surfaces forms = ['percentage', 'percent', 'pc', '%', 'pct', 'pct.']
Quantity: (down,2.5,[%],percentage,{0: [nasdaq]})
=====
indices = [10, 11]
normalized text = The sp 500 was down 2.1 percentage and nasdaq fell 2.5 percentage .
pre processed text = The sp 500 was down 2.1% and nasdaq fell 2.5% .
scientific notation = 2.500000e+00
simplified scientific notation = 2.5e+00
scientific unit = True
unit surfaces forms = ['percentage', 'percent', 'pc', '%', 'pct', 'pct.']
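The two notations in the output above differ only in how the mantissa is printed. The following is a minimal sketch of how such strings can be produced with Python's standard float formatting. The function names are hypothetical and this is not taken from the CQE source; in particular, how CQE renders whole numbers such as 2.0 in the simplified form is an assumption here.

```python
def scientific_notation(value: float) -> str:
    """Long form: one digit before and six digits after the decimal point."""
    return f"{value:e}"


def simplified_scientific_notation(value: float) -> str:
    """Short form: trailing zeros of the mantissa are dropped.

    Assumption: at least one decimal digit is kept (2.0 -> '2.0e+00').
    """
    mantissa, exponent = f"{value:e}".split("e")
    mantissa = mantissa.rstrip("0")
    if mantissa.endswith("."):
        mantissa += "0"
    return f"{mantissa}e{exponent}"


print(scientific_notation(2.1))             # 2.100000e+00
print(simplified_scientific_notation(2.1))  # 2.1e+00
```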
See also the example in CQE/example.py, which you can run with:
python3 CQE/example.py
Evaluation and Data
To replicate the results in the paper and compare against other systems, make sure CQE is installed and use the CQE_Evaluation repo. The evaluation scripts and the data used for evaluation and for training the unit disambiguators are in that repository.
File and folder structure
The main files for CQE are under the CQE package. unit_classifer contains the code for unit disambiguation, based on a BERT classifier trained using spacy-transformers. The unit.json file is used for the normalization of units, and unit_models.zip contains the trained models for the disambiguation, which will be unzipped on the first run of the NumParser class.
File | Description |
---|---|
CQE/NumberNormalizer.py | Bound, Number and Unit Normalization script |
CQE/NumberParser.py | Quantity Extraction script |
CQE/rules.py | Rules for DependencyMatcher |
CQE/unit.json | 531 units used for the Unit Normalization |
CQE/classes.py | Definition of the Bound, Range, Number, Unit, Noun and Quantity classes |
CQE/number_lookup.py | Number-word to number mappings |
CQE/example.py | Usage example |
CQE/unit_classifer/unit_disambiguator.py | Class for the unit disambiguator based on the BERT-based classifiers. |
CQE/unit_classifer/train_classifier_bert.py | Script for generating spaCy-based training data and training commands to create classifiers for disambiguation. |
CQE/unit_classifer/sample_usage.py | Usage example for disambiguation class. |
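The first-run extraction of unit_models.zip mentioned above follows a common "unpack once, reuse afterwards" pattern. The sketch below illustrates that pattern with the standard library only. The helper name ensure_unzipped and the target directory name are hypothetical, and the actual CQE code may handle this differently.

```python
import tempfile
import zipfile
from pathlib import Path


def ensure_unzipped(archive: Path, target_dir: Path) -> bool:
    """Extract `archive` into `target_dir` unless the directory already exists.

    Returns True if extraction happened, False if it was skipped.
    """
    if target_dir.is_dir():
        return False
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
    return True


# Demo with a throwaway archive standing in for unit_models.zip.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "unit_models.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("model/meta.json", "{}")
    target = Path(tmp) / "unit_models"
    first = ensure_unzipped(archive, target)   # extracts the archive
    second = ensure_unzipped(archive, target)  # directory exists, so skipped
    print(first, second)
```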
Units
The units used to normalize the unit of an extracted quantity are stored in unit.json. Each of the 531 units has surfaces, symbols, prefixes, entity, URI, dimensions and currency_code. The file was composed from the list of units from quantulum3, the list of units from Wikipedia, the surfaces from Microsoft.Recognizers.Text, the UCUM units and surfaces, and the Wikipedia page of units (https://en.wikipedia.org/wiki/Template:Convert/list_of_units).
Example:
"light-year": {
    "surfaces": [
        "light-year",
        "light year",
        "light years"
    ],
    "entity": "length",
    "URI": "Light-year",
    "dimensions": [],
    "symbols": [
        "ly",
        "[ly]"
    ]
}
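To illustrate how entries of this shape can be used for normalization, the sketch below builds a reverse index from surface forms and symbols to unit names. It is only an illustration under assumptions: the function name build_surface_index is hypothetical, the entry is copied inline rather than loaded from unit.json, and CQE's own lookup logic may differ. A form that maps to more than one unit is the kind of case that calls for disambiguation.

```python
from collections import defaultdict

# Inline copy of the example entry; the real file holds 531 such units.
units = {
    "light-year": {
        "surfaces": ["light-year", "light year", "light years"],
        "entity": "length",
        "URI": "Light-year",
        "dimensions": [],
        "symbols": ["ly", "[ly]"],
    }
}


def build_surface_index(units: dict) -> dict:
    """Map each surface form or symbol to the list of units that use it."""
    index = defaultdict(list)
    for name, entry in units.items():
        # Surfaces are matched case-insensitively; symbols stay case-sensitive
        # because case can be meaningful in unit symbols.
        forms = [s.lower() for s in entry.get("surfaces", [])]
        forms += entry.get("symbols", [])
        for form in forms:
            if name not in index[form]:
                index[form].append(name)
    return dict(index)


index = build_surface_index(units)
print(index["light years"])
print(index["ly"])
```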
Rules
There are more than 50 rules for the DependencyMatcher defined in rules.py. We use the spaCy model en_core_web_sm to create a Doc object with linguistic annotations. The key point is that the rules do not simply match patterns over the individual words of a sentence; they operate on those annotations and exploit the structure of the sentence.
Existing rules can be changed and new ones added by editing the file. Pay attention to the DependencyMatcher syntax.
Example:
"num_symbol" : [
    {
        "RIGHT_ID": "number",
        "RIGHT_ATTRS": {"POS": "NUM"}
    },
    {
        "LEFT_ID": "number",
        "REL_OP": ">",
        "RIGHT_ID": "symbol",
        "RIGHT_ATTRS": {"DEP": {"IN": ["quantmod", "nmod"]}, "POS": "SYM"}
    },
]
Input: "The September crude contract was up 19 cents at US $58.24 per barrel and the September natural gas contract was up 10.4 cents to US $2.24 per mmBTU."
Matches:
NUM_SYMBOL [58.24, US$]
NUM_SYMBOL [2.24, US$]
NOUN_NUM [cents, 19]
NOUN_NUM [cents, 10.4]
NUM_RIGHT_NOUN [58.24, barrel]
NUM_RIGHT_NOUN [2.24, mmBTU]
NOUN_NOUN [contract, gas, natural]
UNIT_FRAC [58.24, per, barrel]
UNIT_FRAC [58.24, per, gas]
UNIT_FRAC [58.24, per, contract]
UNIT_FRAC [58.24, per, cents]
UNIT_FRAC [58.24, per, mmBTU]
UNIT_FRAC [2.24, per, mmBTU]
UNIT_FRAC_2 [58.24, per, gas, natural]
LONELY_NUM [19]
LONELY_NUM [58.24]
LONELY_NUM [10.4]
LONELY_NUM [2.24]
Candidates: [[US$, 58.24, per, barrel, 10], [US$, 2.24, per, mmBTU, 25], [19, cents, 6], [10.4, cents, 21]]
Quadruples: [([], [58.24], [US$, per, barrel], 10), ([], [2.24], [US$, per, mmBTU], 25), ([], [19], [cents], 6), ([], [10.4], [cents], 21)]
Output: [(=,58.24,[US$, per, barrel],united states dollar / barrel,[September, crude, contract]), (=,2.24,[US$, per, mmBTU],united states dollar / mmBTU,[September, natural, gas, contract]), (=,19.0,[cents],cent,[September, crude, contract]), (=,10.4,[cents],cent,[September, natural, gas, contract])]
Note that the numbers 6, 10, 21 and 25 indicate the position of the quantity in the text.
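When editing or adding rules, it can help to check the pattern structure before handing it to spaCy. The sketch below checks the structural requirements of a DependencyMatcher pattern: the first node is an anchor with only RIGHT_ID and RIGHT_ATTRS, and every later node needs LEFT_ID, REL_OP, RIGHT_ID and RIGHT_ATTRS, with LEFT_ID naming a node defined earlier. The helper check_rule is hypothetical and not part of CQE or spaCy; it does not validate the attribute values or the operator itself.

```python
def check_rule(pattern: list) -> list:
    """Return a list of structural problems in a DependencyMatcher-style pattern."""
    problems = []
    known_ids = set()
    for i, node in enumerate(pattern):
        keys = set(node)
        required = {"RIGHT_ID", "RIGHT_ATTRS"}
        if i > 0:
            required |= {"LEFT_ID", "REL_OP"}
        missing = required - keys
        if missing:
            problems.append(f"node {i}: missing {sorted(missing)}")
        if i == 0 and keys & {"LEFT_ID", "REL_OP"}:
            problems.append("node 0: anchor must not define LEFT_ID or REL_OP")
        if i > 0 and "LEFT_ID" in node and node["LEFT_ID"] not in known_ids:
            problems.append(
                f"node {i}: LEFT_ID {node['LEFT_ID']!r} not defined earlier"
            )
        if "RIGHT_ID" in node:
            known_ids.add(node["RIGHT_ID"])
    return problems


# The num_symbol rule from the example above.
num_symbol = [
    {"RIGHT_ID": "number", "RIGHT_ATTRS": {"POS": "NUM"}},
    {
        "LEFT_ID": "number",
        "REL_OP": ">",
        "RIGHT_ID": "symbol",
        "RIGHT_ATTRS": {"DEP": {"IN": ["quantmod", "nmod"]}, "POS": "SYM"},
    },
]

# A deliberately broken variant: LEFT_ID refers to a misspelled node name.
broken = [
    {"RIGHT_ID": "number", "RIGHT_ATTRS": {"POS": "NUM"}},
    {
        "LEFT_ID": "numbr",
        "REL_OP": ">",
        "RIGHT_ID": "symbol",
        "RIGHT_ATTRS": {"POS": "SYM"},
    },
]

print(check_rule(num_symbol))
print(check_rule(broken))
```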