Extract quantities from unstructured text.

quantulum3

Python library for information extraction of quantities, measurements and their units from unstructured text. It is able to disambiguate between similar looking units based on their k-nearest neighbours in their GloVe vector representation and their Wikipedia page.

It is a Python 3 compatible fork of recastrodiaz' fork of grhawks' fork of the original by Marco Lagi. Compatibility with the newest version of sklearn is based on the fork by sohrabtowfighi.

Installation

First, install numpy, scipy and sklearn. Quantulum still works without them, but it cannot then disambiguate between units with the same name (e.g. pound as a currency or as a unit of mass).

Then,

$ pip install quantulum3
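
To see whether the optional packages named above can be imported in the current environment, a small check like the following can help. It is a minimal sketch that uses only the standard library; the helper name missing_optional_packages is made up for this example and is not part of quantulum3.

```python
from importlib.util import find_spec

# Import names of the optional packages mentioned above
# (scikit-learn is imported under the name "sklearn").
OPTIONAL_PACKAGES = ("numpy", "scipy", "sklearn")


def missing_optional_packages(packages=OPTIONAL_PACKAGES):
    """Return the names of the packages that cannot be found for import."""
    return [name for name in packages if find_spec(name) is None]


missing = missing_optional_packages()
if missing:
    print("Disambiguation will not be available, missing:", ", ".join(missing))
else:
    print("All optional packages were found.")
```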

Usage

>>> from quantulum3 import parser
>>> quants = parser.parse('I want 2 liters of wine')
>>> quants
[Quantity(2, 'litre')]

The Quantity class stores the surface of the original text it was extracted from, as well as the (start, end) positions of the match:

>>> quants[0].surface
'2 liters'
>>> quants[0].span
(7, 15)
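
The span is a pair of string offsets, so slicing the original text with it gives back the surface. The sketch below uses only the text and the span from the example above; the highlight helper is hypothetical and assumes the spans do not overlap.

```python
def highlight(text, spans):
    """Wrap each (start, end) span of text in square brackets."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        pieces.append(text[cursor:start])           # text before the match
        pieces.append("[" + text[start:end] + "]")  # the match itself
        cursor = end
    pieces.append(text[cursor:])                    # remaining tail
    return "".join(pieces)


text = "I want 2 liters of wine"
span = (7, 15)  # the span reported in the example above

print(text[span[0]:span[1]])    # 2 liters
print(highlight(text, [span]))  # I want [2 liters] of wine
```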

An inline parser that embeds the parsed quantities in the text is also available (especially useful for debugging):

>>> print(parser.inline_parse('I want 2 liters of wine'))
I want 2 liters {Quantity(2, "litre")} of wine

As the parser is also able to parse dimensionless numbers, this library can also be used for simple number extraction.

>>> print(parser.parse('I want two'))
[Quantity(2, 'dimensionless')]
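
For plain number extraction, one possible approach is to keep only the quantities whose unit is named dimensionless. The sketch below relies on the value and unit.name attributes of the parsed quantities; the helper names are invented for this example, and the demonstration at the bottom uses stand-in objects so that it runs without quantulum3 installed.

```python
from types import SimpleNamespace


def plain_numbers(quantities):
    """Return the values of all quantities whose unit is dimensionless."""
    return [q.value for q in quantities if q.unit.name == "dimensionless"]


def numbers_in(text):
    """Parse text with quantulum3 (must be installed) and keep plain numbers."""
    from quantulum3 import parser  # lazy import: only needed for real parsing
    return plain_numbers(parser.parse(text))


# Stand-in objects that mimic the two attributes used above.
fake = [
    SimpleNamespace(value=2, unit=SimpleNamespace(name="dimensionless")),
    SimpleNamespace(value=3, unit=SimpleNamespace(name="litre")),
]
print(plain_numbers(fake))  # [2]
```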

Units and entities

All units (e.g. litre) and the entities they are associated with (e.g. volume) are reconciled against Wikipedia:

>>> quants[0].unit
Unit(name="litre", entity=Entity("volume"), uri=https://en.wikipedia.org/wiki/Litre)

>>> quants[0].unit.entity
Entity(name="volume", uri=https://en.wikipedia.org/wiki/Volume)

This library includes more than 290 units and 75 entities. It also parses spelled-out numbers, ranges and uncertainties:

>>> parser.parse('I want a gallon of beer')
[Quantity(1, 'gallon')]

>>> parser.parse('The LHC smashes proton beams at 12.8–13.0 TeV')
[Quantity(12.8, "teraelectronvolt"), Quantity(13, "teraelectronvolt")]

>>> quant = parser.parse('The LHC smashes proton beams at 12.9±0.1 TeV')
>>> quant[0].uncertainty
0.1
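
A value with an uncertainty can be turned into a lower and an upper bound. This is a small sketch that only reads the value and uncertainty attributes shown above; the interval helper is hypothetical, it treats a missing uncertainty (assumed here to be None) as zero, and it is demonstrated on a stand-in object rather than on real parser output.

```python
from types import SimpleNamespace


def interval(quantity):
    """Return (low, high) for a quantity; a missing uncertainty counts as 0."""
    uncertainty = quantity.uncertainty or 0.0
    return (quantity.value - uncertainty, quantity.value + uncertainty)


# Stand-in for the quantity parsed from '12.9±0.1 TeV' above.
measured = SimpleNamespace(value=12.9, uncertainty=0.1)
low, high = interval(measured)
print(round(low, 3), round(high, 3))  # 12.8 13.0
```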

Non-standard units usually don't have a Wikipedia page. The parser will still try to guess their underlying entity based on their dimensionality:

>>> parser.parse('Sound travels at 0.34 km/s')[0].unit
Unit(name="kilometre per second", entity=Entity("speed"), uri=None)
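
To illustrate the idea of guessing an entity from dimensionality, here is a minimal sketch. It is not the library's implementation: the two lookup tables are hand-written assumptions for this example (the real data lives in units.json and entities.json), and the matching is a plain equality test on the derived dimensions.

```python
# Hypothetical, hand-written tables used only for this illustration.
UNIT_ENTITY = {"kilometre": "length", "metre": "length", "second": "time"}
ENTITY_DIMENSIONS = {
    "speed": {"length": 1, "time": -1},
    "acceleration": {"length": 1, "time": -2},
}


def guess_entity(unit_dimensions):
    """Translate unit dimensions into entity dimensions and look for a match."""
    derived = {}
    for dim in unit_dimensions:
        base_entity = UNIT_ENTITY[dim["base"]]
        derived[base_entity] = derived.get(base_entity, 0) + dim["power"]
    # Drop bases whose powers cancelled out (e.g. metre per metre).
    derived = {base: power for base, power in derived.items() if power != 0}
    for name, dims in ENTITY_DIMENSIONS.items():
        if dims == derived:
            return name
    return "unknown"


km_per_s = [{"base": "kilometre", "power": 1}, {"base": "second", "power": -1}]
print(guess_entity(km_per_s))  # speed
```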

Disambiguation

If the parser detects an ambiguity, a classifier based on the Wikipedia pages of the ambiguous units or entities tries to guess the right one:

>>> parser.parse('I spent 20 pounds on this!')
[Quantity(20, "pound sterling")]

>>> parser.parse('It weighs no more than 20 pounds')
[Quantity(20, "pound-mass")]

or:

>>> text = 'The average density of the Earth is about 5.5x10-3 kg/cm³'
>>> parser.parse(text)[0].unit.entity
Entity(name="density", uri=https://en.wikipedia.org/wiki/Density)

>>> text = 'The amount of O₂ is 2.98e-4 kg per liter of atmosphere'
>>> parser.parse(text)[0].unit.entity
Entity(name="concentration", uri=https://en.wikipedia.org/wiki/Concentration)

In addition, the classifier is trained on the words most similar to each of the units' surfaces, according to their distance in the GloVe vector representation.

Training the classifier

If you want to train the classifier yourself, in addition to the packages above, you'll also need the packages stemming and wikipedia.

You could also download requirements_classifier.txt and run

$ pip install -r requirements_classifier.txt

Use the script scripts/train.py or the method train_classifier in quantulum3.classifier to train the classifier.

If you want to create a new or different similars.json, install pymagnitude.

For the extraction of nearest neighbours from a vector word representation file, use scripts/extract_vere.py. It automatically extracts the k nearest neighbours in vector space of the vector representation for each of the possible surfaces of the ambiguous units. The resulting neighbours are stored in quantulum3/similars.json and automatically included for training.
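
The notion of k nearest neighbours in vector space can be sketched in a few lines. The example below is only an illustration with invented three-dimensional toy vectors and cosine similarity; it does not use pymagnitude, GloVe data or scripts/extract_vere.py, and the function names are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equally long, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest(word, vectors, k):
    """Return the k words whose vectors are most similar to that of word."""
    target = vectors[word]
    scored = [
        (cosine(target, vector), other)
        for other, vector in vectors.items()
        if other != word
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:k]]


# Invented toy vectors; real embeddings have hundreds of dimensions.
TOY = {
    "pound": [0.9, 0.8, 0.1],
    "kilogram": [1.0, 0.1, 0.0],
    "dollar": [0.1, 1.0, 0.0],
    "banana": [0.0, 0.1, 1.0],
}
print(nearest("pound", TOY, k=2))  # ['kilogram', 'dollar']
```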

The file provided should be in .magnitude format, as other formats are first converted to a .magnitude file on the fly. See the pre-formatted Magnitude word embeddings and the Magnitude project for more information.

Manipulation

While quantities cannot be manipulated within this library, there are many great options out there.

Spoken version

Quantulum classes include methods to convert them to a spoken form.

>>> parser.parse("Gimme 10e9 GW now!")[0].to_spoken()
ten billion gigawatts
>>> parser.inline_parse_and_expand("Gimme $1e10 now and also 1 TW and 0.5 J!")
Gimme ten billion dollars now and also one terawatt and zero point five joules!

Extension

See units.json for the complete list of units and entities.json for the complete list of entities. The criteria for adding units have been:

the unit has (or is redirected to) a Wikipedia page
the unit is in common use (e.g. not the premetric Swedish units of measurement).

It's easy to extend these two files to the units/entities of interest. Here is an example of an entry in entities.json:

{
    "name": "speed",
    "dimensions": [{"base": "length", "power": 1}, {"base": "time", "power": -1}],
    "URI": "https://en.wikipedia.org/wiki/Speed"
}

name is the name of the entity.
URI is the name of the Wikipedia page of the entity (e.g. https://en.wikipedia.org/wiki/Speed => Speed).
dimensions is the dimensionality: a list of dictionaries, each with a base (the name of another entity) and a power (an integer, which can be negative).
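
The dimensions list reads as a product of base entities raised to powers. As a small sketch, the entry shown above can be loaded with the standard json module and rendered in that form; format_dimensions is a hypothetical helper for this example, not a quantulum3 function.

```python
import json

# The entities.json example entry from above, loaded as a Python dict.
ENTRY = json.loads("""
{
    "name": "speed",
    "dimensions": [{"base": "length", "power": 1}, {"base": "time", "power": -1}],
    "URI": "https://en.wikipedia.org/wiki/Speed"
}
""")


def format_dimensions(dimensions):
    """Render a dimensions list as a product, e.g. 'length^1 * time^-1'."""
    if not dimensions:
        return "(none)"
    return " * ".join("{}^{}".format(d["base"], d["power"]) for d in dimensions)


print(ENTRY["name"], "=", format_dimensions(ENTRY["dimensions"]))
# speed = length^1 * time^-1
```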

Here is an example of an entry in units.json:

{
    "name": "metre per second",
    "surfaces": ["metre per second", "meter per second"],
    "entity": "speed",
    "URI": "Metre_per_second",
    "dimensions": [{"base": "metre", "power": 1}, {"base": "second", "power": -1}],
    "symbols": ["mps"]
},
{
    "name": "year",
    "surfaces": [ "year", "annum" ],
    "entity": "time",
    "URI": "Year",
    "dimensions": [],
    "symbols": [ "a", "y", "yr" ],
    "prefixes": [ "k", "M", "G", "T", "P", "E" ]
}

name is the name of the unit.
URI follows the same scheme as in entities.json.
surfaces is a list of strings that refer to that unit. The library takes care of plurals, so there is no need to specify them.
entity is the name of an entity in entities.json.
dimensions follows the same schema as in entities.json, but the base is the name of another unit, not of another entity.
symbols is a list of possible symbols and abbreviations for that unit.
prefixes is an optional list. It can contain metric and binary prefixes, and the corresponding prefixed units are generated automatically. If you want to add specifics (like different surfaces), you need to create a separate entry for that prefixed version.

All fields are case sensitive.
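
Before adding an entry to units.json, it can be worth checking it against the field descriptions above. The following is a rough sketch, not part of quantulum3: it assumes that the six fields of the first example are required and that prefixes is the only optional one, whereas the real files may use further fields. Because field names are case sensitive, a wrongly cased key shows up as both missing and unexpected.

```python
# Assumption for this sketch: these six fields are required, prefixes is optional.
REQUIRED_KEYS = {"name", "surfaces", "entity", "URI", "dimensions", "symbols"}
OPTIONAL_KEYS = {"prefixes"}


def check_unit_entry(entry, known_entities):
    """Return a list of problems found in one units.json-style entry."""
    problems = []
    missing = REQUIRED_KEYS - set(entry)
    if missing:
        problems.append("missing keys: " + ", ".join(sorted(missing)))
    unexpected = set(entry) - REQUIRED_KEYS - OPTIONAL_KEYS
    if unexpected:
        problems.append("unexpected keys: " + ", ".join(sorted(unexpected)))
    if "entity" in entry and entry["entity"] not in known_entities:
        problems.append("unknown entity: " + str(entry["entity"]))
    for dim in entry.get("dimensions", []):
        if not isinstance(dim.get("power"), int):
            problems.append("power must be an integer: " + repr(dim))
    return problems


year = {
    "name": "year",
    "surfaces": ["year", "annum"],
    "entity": "time",
    "URI": "Year",
    "dimensions": [],
    "symbols": ["a", "y", "yr"],
    "prefixes": ["k", "M", "G", "T", "P", "E"],
}
print(check_unit_entry(year, known_entities={"time", "speed"}))  # []

wrong_case = dict(year)
wrong_case["uri"] = wrong_case.pop("URI")  # lower-case key is not recognised
print(check_unit_entry(wrong_case, known_entities={"time", "speed"}))
# ['missing keys: URI', 'unexpected keys: uri']
```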

Contributing

If you'd like to contribute, follow these steps:

Clone a fork of this project into your workspace
Run pip install -e . at the root of your development folder.
Run pip install pipenv, then pipenv shell.
Inside the project folder, run pipenv install --dev.
Make your changes.
Run scripts/format.sh and scripts/build.py from the package root directory.
Test your changes with coverage run --source=quantulum3 setup.py test (optional, this will be done automatically after pushing).
Create a pull request once you have committed and pushed your changes.

Language support

There is a branch for language support, namely language_support. By inspecting the README file in the _lang subdirectory and the functions and values given in the new _lang.en_US submodule, you should be able to create your own language submodules. New language modules should be picked up automatically and be available both through the lang= keyword argument of the parser functions and in the automated unit tests.

No changes outside your own language submodule folder (e.g. _lang.de_DE) should be necessary. If there are problems implementing a new language, don't hesitate to open an issue.
