Utils for processing natural language expressions of quantities including efficient rule-based quantity and unit parsers.

NLP Utilities for Processing Quantities, Numbers, and Units

A collection of utilities for natural language processing (NLP) tasks that involve quantities, numbers, and units.

Features

  • Rule-based quantity parser
    • Various writing styles of quantities
    • Single quantities, lists, intervals, ratios, and multidimensional quantities
    • Imprecise quantities (e.g., 'a few km')
    • Parses simple uncertainty expressions (e.g., tolerances, standard deviations, and confidence intervals)
    • Resolves ellipses of units and orders of magnitude
    • Uses unit parser with features below
    • Lookups for imprecise quantities, number words, etc. can be customized
  • Rule-based unit parser
    • Links units to the QUDT ontology
    • Attempts to find a single QUDT unit for compound units
    • Handles unknown compound units
    • Unit symbol and label lookups can be customized
  • Quantity modifier extraction based on dictionary-matching
    • Lists of considered quantity modifiers can be customized
  • str2num converts number strings to a numeric datatype and accounts for many different ways numbers can be expressed:
    • cardinals (e.g., "27")
    • ordinals (e.g., "27." or "27th")
    • fractions (e.g., "1/27")
    • with suffixes (e.g., "27-year")
    • spelled out (e.g., "twenty-seven")
    • with different thousands separators (e.g., "1'234" vs. "1234")
    • powers of ten (e.g., "2.7×10^6" or "2.6M" or "2.6 million")
    • etc.
  • Lookup tables for number words, imprecise quantities, quantity modifiers, physical constants, currencies
  • REGEX patterns to identify numbers in text

Installation

Create and activate a virtual environment.
Then, install the package via pip and download the spaCy pipeline.

pip install quinex-utils
python3 -m spacy download en_core_web_md

Usage

Convert numbers from strings to numeric data types

>>> from quinex_utils.functions import str2num

>>> str2num("3.23 10^3")
3230

>>> str2num("one million")
1e6
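
To make a few of the notations listed under Features concrete, here is a minimal standard-library sketch. It is not the package's implementation: `toy_str2num` and `MAGNITUDE_SUFFIXES` are hypothetical names, and only apostrophe thousands separators, simple fractions, and a handful of order-of-magnitude suffixes (short scale) are covered.

```python
import re
from fractions import Fraction

# Hypothetical suffix table for this sketch; not the package's lookup.
MAGNITUDE_SUFFIXES = {"k": 10**3, "M": 10**6, "million": 10**6, "billion": 10**9}


def toy_str2num(text):
    """Convert a few simple number spellings to int, float, or Fraction."""
    text = text.strip().replace("'", "")  # drop apostrophe thousands separators
    # Fractions such as "1/27"
    if re.fullmatch(r"\d+/\d+", text):
        return Fraction(text)
    # A value followed by an order-of-magnitude suffix, e.g. "2.6M" or "2.6 million"
    match = re.fullmatch(r"(-?\d+(?:\.\d+)?)\s*([A-Za-z]+)", text)
    if match and match.group(2) in MAGNITUDE_SUFFIXES:
        return float(match.group(1)) * MAGNITUDE_SUFFIXES[match.group(2)]
    # Plain cardinals and decimals
    return int(text) if re.fullmatch(r"-?\d+", text) else float(text)


print(toy_str2num("1'234"), toy_str2num("1/27"), toy_str2num("2.6 million"))
```

Anything outside these cases (ordinals, spelled-out numbers, math expressions) raises a ValueError in this sketch, whereas str2num is described as handling them.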

Use REGEX patterns

>>> from quinex_utils.patterns.imprecise_quantities import IMPRECISE_VALUE_PATTERN

>>> IMPRECISE_VALUE_PATTERN.findall("They harvested a few apples and oodles of blueberries.")
['a few', 'oodles']

Use boolean checks

>>> from quinex_utils.functions.boolean_checks import contains_any_number

>>> if contains_any_number("..."):
...     ...

Use lookups

>>> from quinex_utils.lookups.physical_constants import PHYSICAL_CONSTANTS 

>>> string = "..."
>>> if any(c.lower() in string.lower() for c in PHYSICAL_CONSTANTS):
...     ...

Parse unit strings

>>> from quinex_utils.parsers.unit_parser import FastSymbolicUnitParser
>>> unit_parser = FastSymbolicUnitParser()
>>> unit_parser.parse("$2021/kWh")
[
    ('$', 1, 'http://qudt.org/vocab/currency/USD', 2021),
    ('kWh', -1, 'http://qudt.org/vocab/unit/KiloW-HR', None)
]
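
The following sketch shows one way to consume such a result. It assumes, based only on the example output above, that each tuple holds (symbol, exponent, QUDT URI, year); `describe_unit` is a hypothetical helper and not part of the package.

```python
# Output of unit_parser.parse("$2021/kWh") as shown above, copied verbatim.
parsed_unit = [
    ("$", 1, "http://qudt.org/vocab/currency/USD", 2021),
    ("kWh", -1, "http://qudt.org/vocab/unit/KiloW-HR", None),
]


def describe_unit(parts):
    """Render parsed unit parts as dot-separated QUDT labels with exponents."""
    rendered = []
    # Assumed tuple layout: (symbol, exponent, QUDT URI, year)
    for symbol, exponent, qudt_uri, year in parts:
        label = qudt_uri.rsplit("/", 1)[-1]  # last URI segment, e.g. 'USD'
        if year is not None:
            label += f"_{year}"
        rendered.append(label if exponent == 1 else f"{label}^{exponent}")
    return ".".join(rendered)


print(describe_unit(parsed_unit))  # prints USD_2021.KiloW-HR^-1
```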

Parse quantity strings

>>> from quinex_utils.parsers.quantity_parser import FastSymbolicQuantityParser
>>> quantity_parser = FastSymbolicQuantityParser()
>>> quantity_parser.parse("above -120.123/-5 to 10.3 * 10^5 TWh kg*s^2/(m^2 per year)^3 at least")
{
    'nbr_quantities': 2,
    'normalized_quantities': [
        {
            'prefixed_modifier': {'normalized': '>', 'text': 'above'},
            'prefixed_unit': None,
            'value': {
                'normalized': {'is_imprecise': False, 'numeric_value': 2402460.0},
                'text': '-120.123/-5'
                },
            'suffixed_modifier': {'normalized': None, 'text': None},
            'suffixed_unit': {
                'ellipsed_text': 'TWh kg*s^2/(m^2 per year)^3',
                'normalized': [
                    ('TWh', 1, 'http://qudt.org/vocab/unit/TeraW-HR', None),
                    ('kg', 1, 'http://qudt.org/vocab/unit/KiloGM', None),
                    ('s', 2, 'http://qudt.org/vocab/unit/SEC', None),
                    ('m', -6, 'http://qudt.org/vocab/unit/M', None),
                    ('year', 3, 'http://qudt.org/vocab/unit/YR', None)],
                'text': None
                },
        },
        {
            'prefixed_modifier': {'normalized': None, 'text': None},
            'prefixed_unit': None,
            'value': {
                'normalized': {'is_imprecise': False, 'numeric_value': 1030000.0000000001},
                'text': '10.3 * 10^5'
                },
            'suffixed_modifier': {'normalized': '>=', 'text': 'at least'},
            'suffixed_unit': {
                'ellipsed_text': None,
                'normalized': [
                    ('TWh', 1, 'http://qudt.org/vocab/unit/TeraW-HR', None),
                    ('kg', 1, 'http://qudt.org/vocab/unit/KiloGM', None),
                    ('s', 2, 'http://qudt.org/vocab/unit/SEC', None),
                    ('m', -6, 'http://qudt.org/vocab/unit/M', None),
                    ('year', 3, 'http://qudt.org/vocab/unit/YR', None)],
                'text': 'TWh kg*s^2/(m^2 per year)^3'
                },
            }],
    'separators': [('to', 'range_separator')],
    'success': True,
    'text': 'above -120.123/-5 to 10.3 * 10^5 TWh kg*s^2/(m^2 per year)^3 at least',
    'type': 'range'
}
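
A result of this shape can be traversed with plain dictionary access. The sketch below works on a trimmed copy of the output above (unit details omitted) so that it runs without the package; `summarize` is a hypothetical helper.

```python
# Trimmed copy of the parser output shown above (unit details omitted).
result = {
    "type": "range",
    "success": True,
    "normalized_quantities": [
        {
            "prefixed_modifier": {"normalized": ">", "text": "above"},
            "value": {"normalized": {"is_imprecise": False, "numeric_value": 2402460.0}},
            "suffixed_modifier": {"normalized": None, "text": None},
        },
        {
            "prefixed_modifier": {"normalized": None, "text": None},
            "value": {"normalized": {"is_imprecise": False, "numeric_value": 1030000.0000000001}},
            "suffixed_modifier": {"normalized": ">=", "text": "at least"},
        },
    ],
}


def summarize(result):
    """Return (normalized modifier, numeric value) pairs for each quantity."""
    pairs = []
    for quantity in result["normalized_quantities"]:
        # Use whichever modifier is present, prefixed first.
        modifier = (quantity["prefixed_modifier"]["normalized"]
                    or quantity["suffixed_modifier"]["normalized"])
        pairs.append((modifier, quantity["value"]["normalized"]["numeric_value"]))
    return pairs


if result["success"] and result["type"] == "range":
    first, second = summarize(result)
    print(first, second)
```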

Convert quantities from one unit to another (this is an experimental feature)

from quinex_utils.parsers.unit_parser import FastSymbolicUnitParser

unit_parser = FastSymbolicUnitParser()

# Convert 9.5 kWh/kg to MJ/kg.
value = 9.5
from_unit = unit_parser.parse("kWh/kg")
to_unit = unit_parser.parse("MJ/kg")
conv_value, conv_unit = unit_parser.unit_conversion(
    value=value,
    from_compound_unit=from_unit,
    to_compound_unit=to_unit,
)
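
Because unit conversion is experimental, it can help to check a result against a hand calculation. For the kWh/kg to MJ/kg example, 1 kWh equals 3.6 MJ, so 9.5 kWh/kg should come out at about 34.2 MJ/kg. The sketch below does this independently of the package; `kwh_per_kg_to_mj_per_kg` is a hypothetical helper.

```python
import math

KWH_TO_MJ = 3.6  # 1 kWh = 3600 kJ = 3.6 MJ


def kwh_per_kg_to_mj_per_kg(value):
    """Convert a specific energy from kWh/kg to MJ/kg."""
    return value * KWH_TO_MJ


expected = kwh_per_kg_to_mj_per_kg(9.5)
# Compare with a tolerance rather than ==, since floats are inexact.
print(math.isclose(expected, 34.2))
```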

You can adjust for inflation and exchange rates when converting currency.

# Reuses unit_parser from the example above.
value = 56
from_unit = unit_parser.parse("€/kWh")
to_unit = unit_parser.parse("$_2025/kWh")
conv_value, conv_unit = unit_parser.unit_conversion(
    value=value,
    from_compound_unit=from_unit,
    to_compound_unit=to_unit,
    from_default_year=2020,
    to_default_year=2025,
)

Rule-based quantity and unit parsers

Parses strings such as '3 m/s', '16-20.000 Hz', '4,186 kJ/(kg·K)', or 'five kilograms' to structured data.

The quantity parser dissects quantity strings into individual quantities and their components (i.e., numeric values, units, and modifiers). The type of the quantity expression is determined (i.e., single quantity, lists, intervals, ratios, and multidimensional quantities) and values, units and modifiers are normalized. Values are normalized to numeric datatypes if applicable. Units are parsed and normalized using a rule-based unit parser that links units to their corresponding QUDT unit class. Rather than dividing compound units into their smallest parts, the aim is to return as few parts as possible, ideally a single QUDT class.
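
As a rough illustration of the dissection step only, the following toy splits a single quantity string into modifier, value, and unit with one regular expression. It is a sketch under strong assumptions and not how the package works: `ToyQuantity`, `TOY_MODIFIERS`, and `toy_parse` are hypothetical, the unit is left as raw text without QUDT linking, and lists, ranges, and number words are not handled.

```python
import re
from typing import NamedTuple, Optional


class ToyQuantity(NamedTuple):
    modifier: Optional[str]
    value: float
    unit: str


# Hypothetical modifier table; '>' and '>=' mirror the normalized forms
# shown in the usage example, '<' is an assumption for this sketch.
TOY_MODIFIERS = {"above": ">", "at least": ">=", "below": "<"}

TOY_PATTERN = re.compile(
    r"^(?:(?P<modifier>above|at least|below)\s+)?"  # optional leading modifier
    r"(?P<value>-?\d+(?:\.\d+)?)\s*"                # numeric value
    r"(?P<unit>.*)$"                                # rest is treated as the unit
)


def toy_parse(text):
    """Split a simple quantity string into modifier, value, and unit."""
    match = TOY_PATTERN.match(text.strip())
    if match is None:
        return None
    modifier = match.group("modifier")
    return ToyQuantity(
        modifier=TOY_MODIFIERS[modifier] if modifier else None,
        value=float(match.group("value")),
        unit=match.group("unit").strip(),
    )


print(toy_parse("above 3 m/s"))
```

Inputs such as 'five kilograms' return None here, which is exactly the kind of case the rule-based parser is described as covering.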

Please note

Important: The functions are not fully tested and should not be used in high-stakes or safety-critical applications without careful validation and verification.

  • Short scale is assumed (e.g., a billion is interpreted as 10⁹ and not 10¹²)
  • '-' and '+' are not considered quantity modifiers but directly included in the quantity span. However, 'minus' and 'negative' are considered quantity modifiers. Hence, for '-5%' the numeric value would be -5 and the quantity modifier empty, but for 'minus 5%' the numeric value would be 5 and the normalized quantity modifier '-'.
  • Third, fourth, fifth, etc. are interpreted as ordinals and not as fractions unless they are preceded by a number word smaller than twenty (e.g., "one third" is 1/3 and "twenty third" is 23rd)
  • No floating-point arithmetic error mitigation (e.g., '10.3 * 10^5' is normalized to 1030000.0000000001).
  • As the maximum integer value is unbounded in Python 3, the parser's integer results have no length limit. However, floats larger than sys.float_info.max are normalized to inf.
  • The result of the quantity parser depends on quantity modifier detection. For example, '25 and 30 km/h' will be interpreted as a list of quantities, whereas 'between 25 and 30 km/h' will be interpreted as a quantity range.
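
The two notes on floating-point behavior can be reproduced in plain Python, independent of the package. The sketch below shows the rounding artifact for '10.3 * 10^5', two exact alternatives a caller could apply to the original string (Decimal and Fraction from the standard library), and the difference between unbounded integers and floats that overflow to inf.

```python
import math
import sys
from decimal import Decimal
from fractions import Fraction

# Binary floats cannot represent 10.3 exactly, so the product is slightly off.
as_float = 10.3 * 10**5
print(as_float == 1030000)  # prints False

# Exact alternatives when the original string is still available.
as_decimal = Decimal("10.3") * Decimal(10) ** 5
as_fraction = Fraction("10.3") * 10**5
print(as_decimal == 1030000, as_fraction == 1030000)  # prints True True

# Python ints are unbounded, but float strings beyond the maximum become inf.
huge_int = 10**400
too_big = float("1e400")
print(len(str(huge_int)), math.isinf(too_big), too_big > sys.float_info.max)
```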

Limitations

  • Only English-language support
  • Unit disambiguation is based on hard-coded priorities and does not consider context
  • Only adjacent quantity modifiers considered
  • Cannot deal well with OCR errors or spelling mistakes
  • Cannot deal with unit modifiers (e.g., CO2 in "kgCO2")
  • Cannot deal with quantity expressions containing additional information (e.g., italy and spain in "2 million (italy) to 5 million (spain)")
  • Repeating units will be detected twice (e.g., in 'kilometers per hour (km/h)')
  • Constants like speed of light in vacuum not considered
  • Cannot distinguish between ordinals and fractions based on context (e.g., fourth could be 1/4 or 4th)
  • The unit lookups may contain errors, as they have been automatically compiled from different sources
  • In particular, cents could be incorrectly mapped to a currency without considering its order of magnitude
  • Unit conversion is an experimental feature and may return incorrect results in some cases
  • Can only perform unit ellipsis resolution for suffixed and not prefixed units (e.g., for "10, 20, and 30 km/h" but not for "EUR 10, 20, and 30")
  • OCR and spelling errors matter (e.g., '6.5 EUR/kW h' will be parsed to EUR.kW-1.h not EUR.kW-1.h-1, and '0.8 −0.3' will be normalized to 0.5)

Evaluation

We evaluated a previous version of the parser on the Grobid-quantities test set (see quinex_utils/benchmark/quantity_parser). The results are summarized in the Appendix of our paper Quinex: Quantitative Information Extraction from Text using Open and Lightweight LLMs (search for "rule-based quantity parser"). Since then, we have updated the unit lookups and fixed some minor bugs, but we have not yet performed a new evaluation.

Update unit lookups

You can update the unit lookups by following the instructions in src/quinex_utils/parsers/scripts/README.md.

Contents

  • Lookups
    • Number words
    • Imprecise quantities
    • Quantity modifiers
    • Physical constants
    • Character mapping
  • Patterns
    • Contains
    • Imprecise quantities
    • Number words
    • Number
    • Numeric value
    • Order of magnitude
    • Split
  • Parsers
    • Unit parser
    • Quantity parser
  • Functions
    • boolean_checks
      • contains_any_number
      • is_imprecise_quantity
      • is_relative_quantity
      • is_small_int
    • normalize
      • normalize_unicode_string
      • normalize_unit_span
      • normalize_num_span
      • normalize_quantity_span
      • rectify_quantity_annotation
    • num2str
      • num2str
      • get_fraction_str
      • get_number_spellings
      • get_digit_notations
    • str2num
      • str2num
        • cast_str_as_int
        • cast_str_as_float
        • cast_str_as_fraction_sum
        • cast_str_as_number_words
        • cast_str_as_num_with_order_of_magnitude
        • cast_str_as_math_expr
        • cast_str_as_digits_and_number_words
        • cast_str_as_power
    • Units
      • remove_exponent_from_ucum_code_of_single_unit

Contribute

We welcome contributions.

License

This project is licensed under the MIT License -- see the LICENSE file for details.

The unit lookups in src/quinex_utils/parsers/static_resources/ are compiled from the following sources:

Citation

If you use quinex in your research, please cite the following paper:

@article{quinex2025,
    title = {{Quinex: Quantitative Information Extraction from Text using Open and Lightweight LLMs}},	
    author = {Göpfert, Jan and Kuckertz, Patrick and Müller, Gian and Lütz, Luna and Körner, Celine and Khuat, Hang and Stolten, Detlef and Weinand, Jann M.},
    month = oct,
    year = {2025},
}

About Us

We are the Institute of Climate and Energy Systems (ICE) - Jülich Systems Analysis belonging to the Forschungszentrum Jülich. Our interdisciplinary department's research focuses on energy-related process and systems analyses. Data searches and system simulations are used to determine energy and mass balances, as well as to evaluate performance, emissions and costs of energy systems. The results are used for performing comparative assessment studies between the various systems. Our current priorities include the development of energy strategies, in accordance with the German Federal Government’s greenhouse gas reduction targets, by designing new infrastructures for sustainable and secure energy supply chains and by conducting cost analysis studies for integrating new technologies into future energy market frameworks.

Acknowledgements

The authors would like to thank the German Federal Government, the German state governments, and the Joint Science Conference (GWK) for their funding and support as part of the NFDI4Ing consortium. Funded by the German Research Foundation (DFG) – project number: 442146713. Furthermore, this work was supported by the Helmholtz Association under the program "Energy System Design".
