High Performance Fuzzy Business Entity Matching

Seamster

Motivation

This package supports a broader goal: centralizing and standardizing publicly available data on businesses. Juniper is doing this because we believe that the key to innovation in commercial insurance underwriting lies in making public data accessible, reliable, and complete.

Features

  • Built on pandas and SciPy, with parallelized computation of string similarities.
  • An extensible Join class supports custom joins.
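
To make "string similarity" concrete, here is a minimal standard-library sketch that scores two names by the cosine similarity of their character trigram counts. This README does not state which metric Seamster uses, so the trigram approach and the helper names `char_ngrams` and `cosine_similarity` are assumptions made for illustration only and are not part of the Seamster API.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a: str, b: str, n: int = 3) -> float:
    """Cosine similarity between the n-gram count vectors of two strings."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    # Dot product over the n-grams of `a`; a Counter returns 0 for missing keys.
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


# Near-duplicate names score high; unrelated names score at or near zero.
print(cosine_similarity("mcdonalds hamburgers", "macdonalds hamburgers"))
print(cosine_similarity("subway", "burger king"))
```

A score like this lies between 0 and 1, so it can be compared against a threshold to decide which pairs count as matches.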

Installation

Seamster requires Python 3.5 or newer to run.

Python package

Install Seamster with pip:

pip3 install seamster

Manual

Alternatively, to get the latest development version, you can clone this repository and then manually install it:

git clone git@gitlab.com:juniperlabs-foss/seamster.git
cd seamster
python3 setup.py install

Usage

import pandas as pd
from seamster.join_side import JoinSide
from seamster.join import NameZipEntTypeJoin

source1 = {
    "id": [1, 2, 3, 4],
    "names": [
        "Subway",
        "Blimpies",
        "McDonalds Hamburguesas, Inc.",
        "MacDonalds Hamburgers",
    ],
    "zip": [80238, 80238, 80230, 80238],
    "entity_type": ["llc", "llc", "corporation", "corporation"],
}

source2 = pd.DataFrame(
    {
        "id": [5, 6, 7],
        "names": ["McDonalds Hamburgers Inc", "Burger King", "Wendys"],
        "zip": [80238, 80238, 80230],
        "entity_type": ["corporation", "llc", "inc"],
    }
)

js_a = JoinSide(
    data=pd.DataFrame(source1),
    source="a",
    entity_name_field="names",
    id_field="id",
    zip_field="zip",
    entity_type_field="entity_type",
)
js_b = JoinSide(
    data=pd.DataFrame(source2),
    source="b",
    entity_name_field="names",
    id_field="id",
    zip_field="zip",
    entity_type_field="entity_type",
)

bs = NameZipEntTypeJoin(join_sides=(js_a, js_b))

df = bs.join(lower_bound=0.8)

print(df.to_dict(orient="records"))
# [
#     {
#         "id_a": 4,
#         "names_a": "MacDonalds Hamburgers",
#         "zip_a": 80238,
#         "entity_type_a": "corporation",
#         "source_a": "a",
#         "clean_names_a": "macdonalds hamburgers",
#         "clean_entity_type_a": "corp",
#         "id_b": 5,
#         "names_b": "McDonalds Hamburgers Inc",
#         "zip_b": 80238,
#         "entity_type_b": "corporation",
#         "source_b": "b",
#         "clean_names_b": "mcdonalds hamburgers",
#         "clean_entity_type_b": "corp",
#         "similarity": 0.86529,
#     }
# ]
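
The output above includes cleaned name and entity-type columns, and the single match shares a zip code and entity type across both sides. The following standard-library sketch shows one way such a result could be produced: normalize the names, compare only records that agree on zip and cleaned entity type, and keep pairs scoring at or above `lower_bound`. Everything here is an assumption for illustration: the suffix list, the alias map, the helper names, and the use of `difflib.SequenceMatcher` are hypothetical and are not taken from Seamster's implementation, so the score differs from the 0.86529 shown above.

```python
import re
from difflib import SequenceMatcher
from itertools import product

# Hypothetical suffix list and abbreviation map, chosen for this sketch only.
ENTITY_SUFFIXES = {"inc", "llc", "corp", "corporation", "co", "ltd"}
ENTITY_TYPE_ALIASES = {"corporation": "corp", "incorporated": "inc"}


def clean_name(name):
    """Lowercase, drop punctuation, and strip trailing entity-type suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    while tokens and tokens[-1] in ENTITY_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def clean_entity_type(entity_type):
    """Map an entity type to a short canonical form, if an alias is known."""
    key = entity_type.strip().lower()
    return ENTITY_TYPE_ALIASES.get(key, key)


def fuzzy_join(left, right, lower_bound):
    """Return (left id, right id, score) for pairs that pass all three checks."""
    matches = []
    for a, b in product(left, right):
        # Exact agreement on zip and entity type is required before scoring.
        if a["zip"] != b["zip"]:
            continue
        if clean_entity_type(a["entity_type"]) != clean_entity_type(b["entity_type"]):
            continue
        score = SequenceMatcher(None, clean_name(a["names"]), clean_name(b["names"])).ratio()
        if score >= lower_bound:
            matches.append((a["id"], b["id"], round(score, 3)))
    return matches


left = [
    {"id": 1, "names": "Subway", "zip": 80238, "entity_type": "llc"},
    {"id": 2, "names": "Blimpies", "zip": 80238, "entity_type": "llc"},
    {"id": 3, "names": "McDonalds Hamburguesas, Inc.", "zip": 80230, "entity_type": "corporation"},
    {"id": 4, "names": "MacDonalds Hamburgers", "zip": 80238, "entity_type": "corporation"},
]
right = [
    {"id": 5, "names": "McDonalds Hamburgers Inc", "zip": 80238, "entity_type": "corporation"},
    {"id": 6, "names": "Burger King", "zip": 80238, "entity_type": "llc"},
    {"id": 7, "names": "Wendys", "zip": 80230, "entity_type": "inc"},
]

print(fuzzy_join(left, right, lower_bound=0.8))
```

With the sample data from the usage example, only ids 4 and 5 pair up in this sketch. Record 3 has a similar name but a different zip code, so it is never scored against record 5.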

TODO

  • Create a transform class that can permute and enrich the dataframe (e.g., geolocation)
  • Support multiple fuzzy joins

Contributing

For information on how to contribute to the project, please check the Contributor's Guide.

Contact

support@juniperlabs.io

incoming+juniperlabs-foss/seamster@gitlab.com

License

Apache 2.0

Credits

This package was created with Cookiecutter and the python-cookiecutter project template.
