Transform entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.

Entity Embed

Entity Embed allows you to transform entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.

Using Entity Embed, you can train a deep learning model to transform records into vectors in an N-dimensional embedding space. Because the model is trained with a contrastive loss, those vectors are organized so that similar records are close together and dissimilar records are far apart in this embedding space. Embedding records enables scalable Approximate Nearest Neighbors (ANN) search, which means finding thousands of candidate duplicate pairs of records per second per CPU.
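To make the idea of candidate retrieval concrete, here is a minimal sketch in plain Python. It is not Entity Embed's API. The record names and 2-D vectors are hypothetical and written by hand (a trained model would produce them), and the exhaustive cosine-similarity scan is only a stand-in for a real ANN index, which avoids comparing every pair of records. The `threshold` value is likewise an arbitrary choice for illustration.

```python
import math
from itertools import combinations

# Hypothetical, hand-written embeddings; a trained model would produce these.
vectors = {
    "Acme Corp": (0.98, 0.10),
    "ACME Corporation": (0.95, 0.15),
    "Globex Inc": (0.10, 0.99),
    "Globex Incorporated": (0.12, 0.97),
    "Initech": (-0.70, 0.70),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def candidate_pairs(vectors, threshold):
    """Compare every pair of records (exact search, not ANN) and keep
    the pairs whose similarity is at or above the threshold."""
    pairs = set()
    # Sorting by name makes each pair's element order deterministic.
    for (name_a, vec_a), (name_b, vec_b) in combinations(sorted(vectors.items()), 2):
        if cosine_similarity(vec_a, vec_b) >= threshold:
            pairs.add((name_a, name_b))
    return pairs


found = candidate_pairs(vectors, threshold=0.9)
print(sorted(found))
```

With these toy vectors, the two "Acme" records and the two "Globex" records end up as candidate pairs, while "Initech" is paired with nothing. The exhaustive scan grows quadratically with the number of records, which is the cost an ANN index is meant to avoid.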

Entity Embed achieves a Recall of ~0.99 with a Pair-Entity ratio below 100 on a variety of datasets. It aims for high recall at the expense of precision, so it is suited for the Blocking/Indexing stage of an Entity Resolution pipeline. A scalable and noise-tolerant Blocking procedure is often the main bottleneck for performance and quality in Entity Resolution pipelines, and this library aims to solve that. Note that the ANN search on embedded records returns several candidate pairs, which must then be filtered to find the best matching pairs, possibly with a pairwise classifier.
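As a rough illustration of the metrics mentioned above, the sketch below computes recall, precision, and a pair-entity ratio on made-up data. It assumes that the Pair-Entity ratio is the number of candidate pairs divided by the number of records; the function names, record IDs, and pair sets are hypothetical and are not taken from the library.

```python
def recall(found_pairs, true_pairs):
    """Fraction of true duplicate pairs that appear among the candidates."""
    return len(found_pairs & true_pairs) / len(true_pairs)


def precision(found_pairs, true_pairs):
    """Fraction of candidate pairs that are true duplicates."""
    return len(found_pairs & true_pairs) / len(found_pairs)


def pair_entity_ratio(found_pairs, record_count):
    """Candidate pairs per record (assumed definition, see the text above)."""
    return len(found_pairs) / record_count


# Made-up example: 8 records with IDs 1..8 and 4 true duplicate pairs.
record_count = 8
true_pairs = {(1, 2), (3, 4), (5, 6), (7, 8)}

# Made-up blocking output: 12 candidate pairs, 3 of which are true duplicates.
found_pairs = {
    (1, 2), (3, 4), (5, 6),
    (1, 3), (2, 4), (5, 7), (6, 8), (1, 8), (2, 7), (3, 6), (4, 5), (2, 3),
}

print("recall:", recall(found_pairs, true_pairs))
print("precision:", precision(found_pairs, true_pairs))
print("pair-entity ratio:", pair_entity_ratio(found_pairs, record_count))
```

In this toy case the blocking step recovers 3 of 4 true pairs (recall 0.75) while only a quarter of its candidates are real duplicates (precision 0.25). A later filtering step, such as a pairwise classifier, is what would raise precision.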

Entity Embed is based on and is a special case of the AutoBlock model described by Amazon.

⚠️ Warning: this project is under heavy development.

[Figure: Embedding Space Example]

Documentation

https://entity-embed.readthedocs.io

Requirements

System

  • macOS or Linux (tested on the latest macOS and Ubuntu via GitHub Actions).
  • Entity Embed can train and run on a powerful laptop. It was tested on a system with 32 GB of RAM, an RTX 2070 Mobile (8 GB VRAM), and an i7-10750H (12 threads). With batch sizes smaller than 32 and few field types, it's possible to train and run even with 2 GB of VRAM.

Libraries

And others, see requirements.txt.

Installation

pip install entity-embed

Examples

Run:

pip install -r requirements-examples.txt

Then check the example Jupyter Notebooks:

Releases

See CHANGELOG.md.

Credits

This project is maintained by open-source contributors and Vinta Software.

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

Commercial Support

Vinta Software is always looking for exciting work, so if you need any commercial support, feel free to get in touch: contact@vinta.com.br

Citations

If you use Entity Embed in your research, please consider citing it.

BibTeX entry:

@software{entity-embed,
  title = {{Entity Embed}: Scalable Entity Resolution using Approximate Nearest Neighbors.},
  author = {Juvenal, Flávio and Vieira, Renato},
  url = {https://github.com/vintasoftware/entity-embed},
  version = {0.0.2},
  date = {2021-04-06},
  year = {2021}
}


