Extract quantities from unstructured text.

quantulum3

Python library for information extraction of quantities, measurements and their units from unstructured text. It is a Python 3 compatible fork of recastrodiaz' fork of grhawks' fork of the original by Marco Lagi. The compatibility with the newest version of sklearn is based on the fork of sohrabtowfighi.

Installation

First, install numpy, scipy and sklearn. Quantulum would still work without them, but it wouldn't be able to disambiguate between units with the same name (e.g. pound as currency or as unit of mass).

Then,

$ pip install quantulum3

If you want to train the classifier yourself, in addition to the packages above, you'll also need the packages stemming and wikipedia. Use the method train in quantulum3.classifier to train the classifier.

You could also download requirements_classifier.txt and run pip install -r requirements_classifier.txt.

Contributing

If you'd like to contribute, follow these steps:

  1. Clone a fork of this project into your workspace
  2. pip install pipenv
  3. Inside the project folder run pipenv install --dev
  4. Make your changes
  5. Run scripts/format.sh
  6. Test your changes with coverage run --source=quantulum3 --omit="*test*" setup.py test (Optional, will be done automatically after pushing)
  7. Create a Pull Request once you have committed and pushed your changes

Usage

>>> from quantulum3 import parser
>>> quants = parser.parse('I want 2 liters of wine')
>>> quants
[Quantity(2, 'litre')]

The Quantity class stores the surface of the original text it was extracted from, as well as the (start, end) positions of the match:

>>> quants[0].surface
'2 liters'
>>> quants[0].span
(7, 15)
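
Because span holds offsets into the original string, slicing the input with it gives back surface. The sketch below uses that to mark up matches in the text. The helper name mark_quantities is hypothetical, and the stand-in object only mirrors the surface and span values shown above so that the snippet runs without the library installed.

```python
from types import SimpleNamespace


def mark_quantities(text, quantities):
    """Wrap each matched span of `text` in square brackets.

    `quantities` is any iterable of objects exposing a `span` attribute
    holding (start, end) offsets, as the parsed quantities above do.
    """
    pieces = []
    cursor = 0
    for quantity in sorted(quantities, key=lambda q: q.span[0]):
        start, end = quantity.span
        pieces.append(text[cursor:start])
        pieces.append("[" + text[start:end] + "]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "I want 2 liters of wine"
# Stand-in for quants[0]; with the library installed, pass parser.parse(text).
stand_in = SimpleNamespace(surface="2 liters", span=(7, 15))

# Slicing the input with the span recovers the surface string.
assert text[slice(*stand_in.span)] == stand_in.surface
print(mark_quantities(text, [stand_in]))  # I want [2 liters] of wine
```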

An inline parser that embeds the parsed quantities in the text is also available (especially useful for debugging):

>>> print(parser.inline_parse('I want 2 liters of wine'))
I want 2 liters {Quantity(2, "litre")} of wine

Because the parser also handles dimensionless numbers, this library can be used for simple number extraction as well.

>>> print(parser.parse('I want two'))
[Quantity(2, 'dimensionless')]

Units and entities

All units (e.g. litre) and the entities they are associated with (e.g. volume) are reconciled against Wikipedia:

>>> quants[0].unit
Unit(name="litre", entity=Entity("volume"), uri=https://en.wikipedia.org/wiki/Litre)

>>> quants[0].unit.entity
Entity(name="volume", uri=https://en.wikipedia.org/wiki/Volume)

This library includes more than 290 units and 75 entities. It also parses spelled-out numbers, ranges and uncertainties:

>>> parser.parse('I want a gallon of beer')
[Quantity(1, 'gallon')]

>>> parser.parse('The LHC smashes proton beams at 12.8–13.0 TeV')
[Quantity(12.8, "teraelectronvolt"), Quantity(13, "teraelectronvolt")]

>>> quant = parser.parse('The LHC smashes proton beams at 12.9±0.1 TeV')
>>> quant[0].uncertainty
0.1
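
An uncertainty can be turned into explicit bounds with a few lines of code. This is a minimal sketch, not part of the library: to_interval is a hypothetical helper, and it assumes the number is exposed as a value attribute next to the uncertainty attribute shown above, with None meaning no uncertainty was parsed. The stand-in object mirrors the example so the snippet runs on its own.

```python
from types import SimpleNamespace


def to_interval(quantity):
    """Return (low, high) bounds for a quantity with an uncertainty.

    Assumes `quantity.value` holds the number and `quantity.uncertainty`
    the +/- margin; a missing uncertainty (None) is treated as zero.
    """
    margin = quantity.uncertainty or 0.0
    return (quantity.value - margin, quantity.value + margin)


# Stand-in mirroring the 12.9±0.1 TeV example above.
measured = SimpleNamespace(value=12.9, uncertainty=0.1)
low, high = to_interval(measured)
print(f"{low:.1f} to {high:.1f}")  # 12.8 to 13.0
```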

Non-standard units usually don't have a Wikipedia page. The parser will still try to guess their underlying entity based on their dimensionality:

>>> parser.parse('Sound travels at 0.34 km/s')[0].unit
Unit(name="kilometre per second", entity=Entity("speed"), uri=None)
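
To illustrate the idea of guessing an entity from dimensionality, here is a toy lookup. It is not the library's implementation: the tables, signature and guess_entity are all hypothetical, and only the dimensions layout (a list of base/power dictionaries) follows the JSON entries described in the Extension section.

```python
# Toy tables in the shape of the units.json / entities.json entries;
# the real files are much larger and this lookup is only an illustration.
UNIT_ENTITY = {"kilometre": "length", "metre": "length", "second": "time"}

ENTITY_DIMENSIONS = {
    "speed": [{"base": "length", "power": 1}, {"base": "time", "power": -1}],
    "area": [{"base": "length", "power": 2}],
}


def signature(dimensions):
    """Collapse a dimensions list into an order-independent key."""
    totals = {}
    for dim in dimensions:
        totals[dim["base"]] = totals.get(dim["base"], 0) + dim["power"]
    return frozenset((b, p) for b, p in totals.items() if p != 0)


def guess_entity(unit_dimensions):
    """Map unit bases to their entities, then match an entity signature."""
    as_entities = [
        {"base": UNIT_ENTITY[d["base"]], "power": d["power"]}
        for d in unit_dimensions
    ]
    target = signature(as_entities)
    for name, dims in ENTITY_DIMENSIONS.items():
        if signature(dims) == target:
            return name
    return "unknown"


km_per_s = [{"base": "kilometre", "power": 1}, {"base": "second", "power": -1}]
print(guess_entity(km_per_s))  # speed
```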

Disambiguation

If the parser detects an ambiguity, a classifier based on the Wikipedia pages of the ambiguous units or entities tries to guess the right one:

>>> parser.parse('I spent 20 pounds on this!')
[Quantity(20, "pound sterling")]

>>> parser.parse('It weighs no more than 20 pounds')
[Quantity(20, "pound-mass")]

or:

>>> text = 'The average density of the Earth is about 5.5x10-3 kg/cm³'
>>> parser.parse(text)[0].unit.entity
Entity(name="density", uri=https://en.wikipedia.org/wiki/Density)

>>> text = 'The amount of O₂ is 2.98e-4 kg per liter of atmosphere'
>>> parser.parse(text)[0].unit.entity
Entity(name="concentration", uri=https://en.wikipedia.org/wiki/Concentration)
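
The intuition behind context-based disambiguation can be shown with a much simpler stand-in than the trained classifier: score each candidate by how many of its keywords appear in the sentence. This is only a keyword-overlap sketch. The keyword sets and the disambiguate function are made up for illustration and are not how the library's classifier works internally.

```python
# Hypothetical keyword lists standing in for text drawn from each
# candidate's Wikipedia page; a real model would learn these from data.
CANDIDATE_KEYWORDS = {
    "pound sterling": {"spent", "cost", "price", "paid", "money", "currency"},
    "pound-mass": {"weighs", "weight", "heavy", "mass", "scale", "lifted"},
}


def disambiguate(sentence, candidates=CANDIDATE_KEYWORDS):
    """Pick the candidate whose keywords overlap most with the sentence."""
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    scores = {name: len(words & keys) for name, keys in candidates.items()}
    # On a tie, max() returns the first candidate in dictionary order.
    return max(scores, key=scores.get)


print(disambiguate("I spent 20 pounds on this!"))        # pound sterling
print(disambiguate("It weighs no more than 20 pounds"))  # pound-mass
```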

Manipulation

While quantities cannot be manipulated within this library, other Python libraries such as pint, natu and quantities are designed for that.

Extension

See units.json for the complete list of units and entities.json for the complete list of entities. The criteria for adding units have been:

It's easy to extend these two files to the units/entities of interest. Here is an example of an entry in entities.json:

{
    "name": "speed",
    "dimensions": [{"base": "length", "power": 1}, {"base": "time", "power": -1}],
    "URI": "https://en.wikipedia.org/wiki/Speed"
}
  • name and URI are self-explanatory.
  • dimensions is the dimensionality, a list of dictionaries each having a base (the name of another entity) and a power (an integer, can be negative).

Here is an example of an entry in units.json:

{
    "name": "metre per second",
    "surfaces": ["metre per second", "meter per second"],
    "entity": "speed",
    "URI": "https://en.wikipedia.org/wiki/Metre_per_second",
    "dimensions": [{"base": "metre", "power": 1}, {"base": "second", "power": -1}],
    "symbols": ["mps"]
}
  • name and URI are self-explanatory.
  • surfaces is a list of strings that refer to that unit. The library takes care of plurals, no need to specify them.
  • entity is the name of an entity in entities.json.
  • dimensions follows the same schema as in entities.json, but the base is the name of another unit, not of another entity.
  • symbols is a list of possible symbols and abbreviations for that unit.

All fields are case sensitive.
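
Before adding an entry to units.json, it can help to check it against the rules listed above. The following is a minimal sketch under stated assumptions: check_unit_entry is a hypothetical helper, not part of quantulum3, and it assumes that all six keys from the example entry are required.

```python
import json

# Assumption: every key shown in the example entry is mandatory.
REQUIRED_UNIT_KEYS = {"name", "surfaces", "entity", "URI", "dimensions", "symbols"}


def check_unit_entry(entry, entity_names, unit_names):
    """Return a list of problems found in one units.json-style entry."""
    problems = []
    missing = REQUIRED_UNIT_KEYS - entry.keys()
    if missing:
        problems.append("missing keys: " + ", ".join(sorted(missing)))
    # Names are case sensitive, so compare them exactly.
    if "entity" in entry and entry["entity"] not in entity_names:
        problems.append("unknown entity: " + entry["entity"])
    for dim in entry.get("dimensions", []):
        if dim.get("base") not in unit_names:
            problems.append("unknown base unit: " + str(dim.get("base")))
        if not isinstance(dim.get("power"), int):
            problems.append("power must be an integer: " + repr(dim.get("power")))
    return problems


entry = json.loads("""
{
    "name": "metre per second",
    "surfaces": ["metre per second", "meter per second"],
    "entity": "speed",
    "URI": "https://en.wikipedia.org/wiki/Metre_per_second",
    "dimensions": [{"base": "metre", "power": 1}, {"base": "second", "power": -1}],
    "symbols": ["mps"]
}
""")

print(check_unit_entry(entry, {"speed"}, {"metre", "second"}))  # []
print(check_unit_entry(entry, {"Speed"}, {"metre", "second"}))
# ['unknown entity: speed']
```

The second call shows the effect of case sensitivity: "Speed" and "speed" are treated as different entity names.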
