Extract quantities from unstructured text.
Project description
quantulum3 |Travis master build state| |Coverage Status|
========================================================
Python library for information extraction of quantities, measurements
and their units from unstructured text. It is Python 3 compatible fork
of `recastrodiaz' fork <https://github.com/recastrodiaz/quantulum>`__ of
`grhawks' fork <https://github.com/grhawk/quantulum>`__ of `the original
by Marco Lagi <https://github.com/marcolagi/quantulum>`__. The
compatability with the newest version of sklearn is based on the fork of
`sohrabtowfighi <https://github.com/sohrabtowfighi/quantulum>`__.
Installation
------------
First, install
`sklearn <http://scikit-learn.org/stable/install.html>`__. Quantulum
would still work without it, but it wouldn't be able to disambiguate
between units with the same name (e.g. *pound* as currency or as unit of
mass).
Then,
.. code:: bash
$ pip install quantulum3
Contributing
------------
If you’d like to contribute follow these steps: 1. Clone a fork of this
project into your workspace 2. ``pip install pipenv yapf`` 3. Inside the
project folder run ``pipenv install`` 4. Make your changes 5. Run
``format.sh`` 6. Create a Pull Request when having commited your changes
``dev`` build:
|Travis dev build state| |Coverage Status|
Usage
-----
.. code:: python
>>> from quantulum3 import parser
>>> quants = parser.parse('I want 2 liters of wine')
>>> quants
[Quantity(2, 'litre')]
The *Quantity* class stores the surface of the original text it was
extracted from, as well as the (start, end) positions of the match:
.. code:: python
>>> quants[0].surface
u'2 liters'
>>> quants[0].span
(7, 15)
An inline parser that embeds the parsed quantities in the text is also
available (especially useful for debugging):
.. code:: python
>>> print parser.inline_parse('I want 2 liters of wine')
I want 2 liters {Quantity(2, "litre")} of wine
As the parser is also able to parse dimensionless numbers, this library
can also be used for simple number extraction.
.. code:: python
>>> print parser.parse('I want two')
[Quantity(2, 'dimensionless')]
Units and entities
------------------
All units (e.g. *litre*) and the entities they are associated to (e.g.
*volume*) are reconciled against WikiPedia:
.. code:: python
>>> quants[0].unit
Unit(name="litre", entity=Entity("volume"), uri=https://en.wikipedia.org/wiki/Litre)
>>> quants[0].unit.entity
Entity(name="volume", uri=https://en.wikipedia.org/wiki/Volume)
This library includes more than 290 units and 75 entities. It also
parses spelled-out numbers, ranges and uncertainties:
.. code:: python
>>> parser.parse('I want a gallon of beer')
[Quantity(1, 'gallon')]
>>> parser.parse('The LHC smashes proton beams at 12.8–13.0 TeV')
[Quantity(12.8, "teraelectronvolt"), Quantity(13, "teraelectronvolt")]
>>> quant = parser.parse('The LHC smashes proton beams at 12.9±0.1 TeV')
>>> quant[0].uncertainty
0.1
Non-standard units usually don't have a WikiPedia page. The parser will
still try to guess their underlying entity based on their
dimensionality:
.. code:: python
>>> parser.parse('Sound travels at 0.34 km/s')[0].unit
Unit(name="kilometre per second", entity=Entity("speed"), uri=None)
Disambiguation
--------------
If the parser detects an ambiguity, a classifier based on the WikiPedia
pages of the ambiguous units or entities tries to guess the right one:
.. code:: python
>>> parser.parse('I spent 20 pounds on this!')
[Quantity(20, "pound sterling")]
>>> parser.parse('It weighs no more than 20 pounds')
[Quantity(20, "pound-mass")]
or:
.. code:: python
>>> text = 'The average density of the Earth is about 5.5x10-3 kg/cm³'
>>> parser.parse(text)[0].unit.entity
Entity(name="density", uri=https://en.wikipedia.org/wiki/Density)
>>> text = 'The amount of O₂ is 2.98e-4 kg per liter of atmosphere'
>>> parser.parse(text)[0].unit.entity
Entity(name="concentration", uri=https://en.wikipedia.org/wiki/Concentration)
Manipulation
------------
While quantities cannot be manipulated within this library, there are
many great options out there:
- `pint <https://pint.readthedocs.org/en/latest/>`__
- `natu <http://kdavies4.github.io/natu/>`__
- `quantities <http://python-quantities.readthedocs.org/en/latest/>`__
Extension
---------
See *units.json* for the complete list of units and *entities.json* for
the complete list of entities. The criteria for adding units have been:
- the unit has (or is redirected to) a WikiPedia page
- the unit is in common use (e.g. not the `premetric Swedish units of
measurement <https://en.wikipedia.org/wiki/Swedish_units_of_measurement#Length>`__).
It's easy to extend these two files to the units/entities of interest.
Here is an example of an entry in *entities.json*:
.. code:: python
{
"name": "speed",
"dimensions": [{"base": "length", "power": 1}, {"base": "time", "power": -1}],
"URI": "https://en.wikipedia.org/wiki/Speed"
}
- *name* and *URI* are self explanatory.
- *dimensions* is the dimensionality, a list of dictionaries each
having a *base* (the name of another entity) and a *power* (an
integer, can be negative).
Here is an example of an entry in *units.json*:
.. code:: python
{
"name": "metre per second",
"surfaces": ["metre per second", "meter per second"],
"entity": "speed",
"URI": "https://en.wikipedia.org/wiki/Metre_per_second",
"dimensions": [{"base": "metre", "power": 1}, {"base": "second", "power": -1}],
"symbols": ["mps"]
}
- *name* and *URI* are self explanatory.
- *surfaces* is a list of strings that refer to that unit. The library
takes care of plurals, no need to specify them.
- *entity* is the name of an entity in *entities.json*
- *dimensions* follows the same schema as in *entities.json*, but the
*base* is the name of another unit, not of another entity.
- *symbols* is a list of possible symbols and abbreviations for that
unit.
All fields are case sensitive.
.. |Travis master build state| image:: https://travis-ci.com/nielstron/quantulum3.svg?branch=master
:target: https://travis-ci.com/nielstron/quantulum3
.. |Coverage Status| image:: https://coveralls.io/repos/github/nielstron/quantulum3/badge.svg?branch=master
:target: https://coveralls.io/github/nielstron/quantulum3?branch=master
.. |Travis dev build state| image:: https://travis-ci.com/nielstron/quantulum3.svg?branch=dev
:target: https://travis-ci.com/nielstron/quantulum3
.. |Coverage Status| image:: https://coveralls.io/repos/github/nielstron/quantulum3/badge.svg?branch=dev
:target: https://coveralls.io/github/nielstron/quantulum3?branch=dev
========================================================
Python library for information extraction of quantities, measurements
and their units from unstructured text. It is Python 3 compatible fork
of `recastrodiaz' fork <https://github.com/recastrodiaz/quantulum>`__ of
`grhawks' fork <https://github.com/grhawk/quantulum>`__ of `the original
by Marco Lagi <https://github.com/marcolagi/quantulum>`__. The
compatability with the newest version of sklearn is based on the fork of
`sohrabtowfighi <https://github.com/sohrabtowfighi/quantulum>`__.
Installation
------------
First, install
`sklearn <http://scikit-learn.org/stable/install.html>`__. Quantulum
would still work without it, but it wouldn't be able to disambiguate
between units with the same name (e.g. *pound* as currency or as unit of
mass).
Then,
.. code:: bash
$ pip install quantulum3
Contributing
------------
If you’d like to contribute follow these steps: 1. Clone a fork of this
project into your workspace 2. ``pip install pipenv yapf`` 3. Inside the
project folder run ``pipenv install`` 4. Make your changes 5. Run
``format.sh`` 6. Create a Pull Request when having commited your changes
``dev`` build:
|Travis dev build state| |Coverage Status|
Usage
-----
.. code:: python
>>> from quantulum3 import parser
>>> quants = parser.parse('I want 2 liters of wine')
>>> quants
[Quantity(2, 'litre')]
The *Quantity* class stores the surface of the original text it was
extracted from, as well as the (start, end) positions of the match:
.. code:: python
>>> quants[0].surface
u'2 liters'
>>> quants[0].span
(7, 15)
An inline parser that embeds the parsed quantities in the text is also
available (especially useful for debugging):
.. code:: python
>>> print parser.inline_parse('I want 2 liters of wine')
I want 2 liters {Quantity(2, "litre")} of wine
As the parser is also able to parse dimensionless numbers, this library
can also be used for simple number extraction.
.. code:: python
>>> print parser.parse('I want two')
[Quantity(2, 'dimensionless')]
Units and entities
------------------
All units (e.g. *litre*) and the entities they are associated to (e.g.
*volume*) are reconciled against WikiPedia:
.. code:: python
>>> quants[0].unit
Unit(name="litre", entity=Entity("volume"), uri=https://en.wikipedia.org/wiki/Litre)
>>> quants[0].unit.entity
Entity(name="volume", uri=https://en.wikipedia.org/wiki/Volume)
This library includes more than 290 units and 75 entities. It also
parses spelled-out numbers, ranges and uncertainties:
.. code:: python
>>> parser.parse('I want a gallon of beer')
[Quantity(1, 'gallon')]
>>> parser.parse('The LHC smashes proton beams at 12.8–13.0 TeV')
[Quantity(12.8, "teraelectronvolt"), Quantity(13, "teraelectronvolt")]
>>> quant = parser.parse('The LHC smashes proton beams at 12.9±0.1 TeV')
>>> quant[0].uncertainty
0.1
Non-standard units usually don't have a WikiPedia page. The parser will
still try to guess their underlying entity based on their
dimensionality:
.. code:: python
>>> parser.parse('Sound travels at 0.34 km/s')[0].unit
Unit(name="kilometre per second", entity=Entity("speed"), uri=None)
Disambiguation
--------------
If the parser detects an ambiguity, a classifier based on the WikiPedia
pages of the ambiguous units or entities tries to guess the right one:
.. code:: python
>>> parser.parse('I spent 20 pounds on this!')
[Quantity(20, "pound sterling")]
>>> parser.parse('It weighs no more than 20 pounds')
[Quantity(20, "pound-mass")]
or:
.. code:: python
>>> text = 'The average density of the Earth is about 5.5x10-3 kg/cm³'
>>> parser.parse(text)[0].unit.entity
Entity(name="density", uri=https://en.wikipedia.org/wiki/Density)
>>> text = 'The amount of O₂ is 2.98e-4 kg per liter of atmosphere'
>>> parser.parse(text)[0].unit.entity
Entity(name="concentration", uri=https://en.wikipedia.org/wiki/Concentration)
Manipulation
------------
While quantities cannot be manipulated within this library, there are
many great options out there:
- `pint <https://pint.readthedocs.org/en/latest/>`__
- `natu <http://kdavies4.github.io/natu/>`__
- `quantities <http://python-quantities.readthedocs.org/en/latest/>`__
Extension
---------
See *units.json* for the complete list of units and *entities.json* for
the complete list of entities. The criteria for adding units have been:
- the unit has (or is redirected to) a WikiPedia page
- the unit is in common use (e.g. not the `premetric Swedish units of
measurement <https://en.wikipedia.org/wiki/Swedish_units_of_measurement#Length>`__).
It's easy to extend these two files to the units/entities of interest.
Here is an example of an entry in *entities.json*:
.. code:: python
{
"name": "speed",
"dimensions": [{"base": "length", "power": 1}, {"base": "time", "power": -1}],
"URI": "https://en.wikipedia.org/wiki/Speed"
}
- *name* and *URI* are self explanatory.
- *dimensions* is the dimensionality, a list of dictionaries each
having a *base* (the name of another entity) and a *power* (an
integer, can be negative).
Here is an example of an entry in *units.json*:
.. code:: python
{
"name": "metre per second",
"surfaces": ["metre per second", "meter per second"],
"entity": "speed",
"URI": "https://en.wikipedia.org/wiki/Metre_per_second",
"dimensions": [{"base": "metre", "power": 1}, {"base": "second", "power": -1}],
"symbols": ["mps"]
}
- *name* and *URI* are self explanatory.
- *surfaces* is a list of strings that refer to that unit. The library
takes care of plurals, no need to specify them.
- *entity* is the name of an entity in *entities.json*
- *dimensions* follows the same schema as in *entities.json*, but the
*base* is the name of another unit, not of another entity.
- *symbols* is a list of possible symbols and abbreviations for that
unit.
All fields are case sensitive.
.. |Travis master build state| image:: https://travis-ci.com/nielstron/quantulum3.svg?branch=master
:target: https://travis-ci.com/nielstron/quantulum3
.. |Coverage Status| image:: https://coveralls.io/repos/github/nielstron/quantulum3/badge.svg?branch=master
:target: https://coveralls.io/github/nielstron/quantulum3?branch=master
.. |Travis dev build state| image:: https://travis-ci.com/nielstron/quantulum3.svg?branch=dev
:target: https://travis-ci.com/nielstron/quantulum3
.. |Coverage Status| image:: https://coveralls.io/repos/github/nielstron/quantulum3/badge.svg?branch=dev
:target: https://coveralls.io/github/nielstron/quantulum3?branch=dev
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
quantulum3-0.2.4.tar.gz
(2.0 MB
view details)
Built Distribution
File details
Details for the file quantulum3-0.2.4.tar.gz
.
File metadata
- Download URL: quantulum3-0.2.4.tar.gz
- Upload date:
- Size: 2.0 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/40.0.0 requests-toolbelt/0.8.0 tqdm/4.24.0 CPython/3.6.5
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | e5225a2db8e8e6ae6d83f23a6d5fdf1509c849d6d157ca72d309a833c5a60024 |
|
MD5 | 4189fb9f088c818bdbe7186d044edbed |
|
BLAKE2b-256 | 3f3c4939241c5efd942463bf452917e827c49568efa9ae6c90c08b8d047ab4c5 |
Provenance
File details
Details for the file quantulum3-0.2.4-py3-none-any.whl
.
File metadata
- Download URL: quantulum3-0.2.4-py3-none-any.whl
- Upload date:
- Size: 2.0 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/40.0.0 requests-toolbelt/0.8.0 tqdm/4.24.0 CPython/3.6.5
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 37fcb600eaacbb54ecb5a673741bec202ef55aa1a43427db505a122e256dd335 |
|
MD5 | 035c91ab687b3c19d3353f657eda06b1 |
|
BLAKE2b-256 | 8f84249591dac4e45be87a358e657229b473255327068b480fb0add40e8ff542 |