Python package to carry out entity disambiguation based on string matching

=======
disamby
=======


.. image:: https://img.shields.io/pypi/v/disamby.svg
    :target: https://pypi.python.org/pypi/disamby

.. image:: https://img.shields.io/travis/verginer/disamby.svg
    :target: https://travis-ci.org/verginer/disamby

.. image:: https://readthedocs.org/projects/disamby/badge/?version=latest
    :target: https://disamby.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status

.. image:: https://pyup.io/repos/github/verginer/disamby/shield.svg
    :target: https://pyup.io/repos/github/verginer/disamby/
    :alt: Updates

* Free software: MIT license
* Documentation: https://disamby.readthedocs.io.

``disamby`` is a Python package designed to carry out entity disambiguation based on fuzzy
string matching.

It works best for entities whose strings are very similar whenever they refer to the same thing.
For example, the algorithm works fairly well on company names and addresses that contain
typos, alternative spellings or composite names.
Other use cases include identifying people in a database where names might be misspelled.

The algorithm exploits how informative a given word/token is, based on its observed
frequency in the whole corpus of strings. For example, in the case of firm names the word
"inc" is not very informative, whereas "Solomon" is, since the former appears
repeatedly and the latter only rarely.
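The idea of weighting tokens by how rare they are can be sketched in a few lines.
The snippet below is only an illustration and not the scoring formula disamby uses
internally: the names ``token_weights`` and ``weighted_overlap`` and the logarithmic
weight are assumptions made for this example.

```python
import math
from collections import Counter


def token_weights(records):
    """Map each token to an inverse-frequency weight over the whole corpus."""
    # Count in how many records each token occurs (hypothetical weighting scheme).
    counts = Counter(tok for rec in records for tok in set(rec.lower().split()))
    n = len(records)
    # Rare tokens get a large weight, tokens present in every record get zero.
    return {tok: math.log(n / c) for tok, c in counts.items()}


def weighted_overlap(a, b, weights):
    """Share of a's token weight that is also found in b (asymmetric)."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    total = sum(weights.get(t, 0.0) for t in tokens_a)
    if total == 0:
        return 0.0
    shared = sum(weights.get(t, 0.0) for t in tokens_a & tokens_b)
    return shared / total


firms = ["Solomon Inc", "Solomon Brothers Inc", "Acme Inc", "Globex Inc"]
weights = token_weights(firms)

# "inc" occurs in every record and carries no weight; "solomon" is rarer.
print(weights["inc"], weights["solomon"])
# Sharing only "inc" does not make two firms similar.
print(weighted_overlap("Acme Inc", "Globex Inc", weights))
```

Note that this score is not symmetric: all of the informative tokens of "Solomon Inc"
appear in "Solomon Brothers Inc", but not the other way round. An asymmetric score of
this kind is one reason a directed graph is a natural fit for the matching step.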

With these frequencies the algorithm computes how similar a given pair of instances is;
if the similarity is above a chosen threshold, the two are connected in an
"alias graph" (i.e. a directed network in which an entity is connected to another
if it is similar enough). After all records have been connected in this way, disamby
returns sets of entities which are strongly connected [2]_. Strongly connected means
in this case that within the component there exists a path from every node to every other node.
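To make the graph step concrete, here is a minimal, standard-library-only sketch of
thresholding pairwise scores into directed edges and extracting strongly connected
components with Kosaraju's algorithm. The scores are made up for illustration, and the
function name ``strongly_connected_components`` is an assumption of this example; it
does not show how disamby builds or traverses its graph internally.

```python
from collections import defaultdict


def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm: return a list of sets of mutually reachable nodes."""
    forward, backward = defaultdict(list), defaultdict(list)
    for src, dst in edges:
        forward[src].append(dst)
        backward[dst].append(src)

    # First pass: record nodes in order of DFS completion on the original graph.
    visited, finish_order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(forward[start]))]
        while stack:
            node, neighbours = stack[-1]
            nxt = next((n for n in neighbours if n not in visited), None)
            if nxt is None:
                finish_order.append(node)
                stack.pop()
            else:
                visited.add(nxt)
                stack.append((nxt, iter(forward[nxt])))

    # Second pass: walk the reversed graph in decreasing finish order.
    assigned, components = set(), []
    for start in reversed(finish_order):
        if start in assigned:
            continue
        component, todo = set(), [start]
        assigned.add(start)
        while todo:
            node = todo.pop()
            component.add(node)
            for prev in backward[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        components.append(component)
    return components


# Hypothetical directed similarity scores between three records.
scores = {("L1", "L2"): 0.9, ("L2", "L1"): 0.8, ("O1", "L1"): 0.6, ("L1", "O1"): 0.1}
threshold = 0.5
edges = [pair for pair, score in scores.items() if score > threshold]

components = strongly_connected_components(["L1", "L2", "O1"], edges)
print(components)
```

Here ``O1`` points to ``L1`` but not the other way round, so it ends up in a component
of its own, while ``L1`` and ``L2`` point at each other and are grouped together.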


Example
-------

To use disamby in a project::

    import pandas as pd
    import disamby.preprocessors as pre
    from disamby import Disamby

    # create a dataframe with the fields you intend to match on as columns
    df = pd.DataFrame({
        'name': ['Luca Georger', 'Luca Geroger', 'Adrian Sulzer'],
        'address': ['Mira, 34, Augsburg', 'Miri, 34, Augsburg', 'Milano, 34']},
        index=['L1', 'L2', 'O1']
    )

    # define the pipeline to process the strings; note that the last step must
    # return a tuple of strings
    pipeline = [
        pre.normalize_whitespace,
        pre.remove_punctuation,
        lambda x: pre.trigram(x) + pre.split_words(x)  # any python function is allowed
    ]

    # instantiate the disamby object; it applies the given pre-processing pipeline
    # and computes the token frequencies
    dis = Disamby(df, pipeline)

    # let disamby compute the disambiguated sets. Note that a threshold should be
    # given, otherwise it defaults to 0.
    dis.disambiguated_sets(threshold=0.5)
    # output: [{'L2', 'L1'}, {'O1'}]

    # to check whether the sets are accurate, get the rows from the
    # pandas dataframe like so:
    df.loc[['L2', 'L1']]
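Since any Python function can be used as a pipeline step, it may help to see what such
steps could look like. The sketch below uses only the standard library; every function
in it is a hypothetical stand-in written for this example, not the implementation found
in ``disamby.preprocessors``, and ``run_pipeline`` merely assumes that the steps are
applied one after the other.

```python
import re


def collapse_whitespace(text):
    """Replace every run of whitespace with a single space."""
    return " ".join(text.split())


def strip_punctuation(text):
    """Drop every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of text as a tuple."""
    return tuple(text[i:i + n] for i in range(len(text) - n + 1))


def run_pipeline(text, steps):
    """Feed the output of each step into the next one (assumed chaining)."""
    for step in steps:
        text = step(text)
    return text


steps = [
    collapse_whitespace,
    strip_punctuation,
    str.upper,
    # the last step returns a tuple of strings: trigrams plus whole words
    lambda s: char_ngrams(s) + tuple(s.split()),
]

tokens = run_pipeline("Mira,  34, Augsburg", steps)
print(tokens)
```

Combining character trigrams with whole words, as the last step does, keeps a record
matchable when a single word contains a typo, because most of its trigrams still agree.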


Credits
---------
I got the inspiration for this package from the seminar "The SearchEngine - A Tool for
Matching by Fuzzy Criteria" by Thorsten Doherr at the CISS [1]_ Summer School 2017.

This package was created with Cookiecutter_ and the `audreyr/cookiecutter-pypackage`_ project template.

.. [1] http://www.euro-ciss.eu/ciss/home.html
.. [2] https://en.wikipedia.org/wiki/Strongly_connected_component
.. _Cookiecutter: https://github.com/audreyr/cookiecutter
.. _`audreyr/cookiecutter-pypackage`: https://github.com/audreyr/cookiecutter-pypackage


=======
History
=======

0.2.2 (2017-06-30)
------------------

* working release with minimal documentation
* works with multiple field matching
* carries out all steps autonomously from string pre-processing to
  identifying the strongly connected components


0.1.0 (2017-06-24)
------------------

* First release on PyPI.
