High Performance Fuzzy Business Entity Matching

These details have not been verified by PyPI

Development Status
- 3 - Alpha
Environment
- Console
Intended Audience
- End Users/Desktop
License
- OSI Approved :: Apache Software License
Natural Language
- English
Operating System
- POSIX :: Linux

Project description

Seamster

High Performance Fuzzy Business Entity Matching

Motivation

This package supports a broader goal: centralizing and standardizing publicly available data on businesses. Juniper is pursuing this because we believe the key to innovation in commercial insurance underwriting lies in making public data accessible, reliable, and complete.

Features

- Built on pandas and SciPy to compute string similarities in parallel.
- Extensible Join class that allows for custom joins.
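
To make "string similarity" concrete, here is a minimal sketch of one common measure: cosine similarity over character trigrams. This is an illustration only. The README does not say which measure Seamster uses, and the function names `char_ngrams` and `cosine_similarity` are hypothetical, not part of the Seamster API. It uses only the standard library rather than pandas or SciPy.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lowercased, space-padded string."""
    padded = " " + text.lower().strip() + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a, b, n=3):
    """Cosine similarity between the n-gram count vectors of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    # Counter returns 0 for n-grams that are missing from grams_b.
    dot = sum(count * grams_b[gram] for gram, count in grams_a.items())
    norm_a = math.sqrt(sum(c * c for c in grams_a.values()))
    norm_b = math.sqrt(sum(c * c for c in grams_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty string has no n-grams to compare
    return dot / (norm_a * norm_b)


# Near-duplicate names share most trigrams; unrelated names share none.
close = cosine_similarity("McDonalds Hamburgers", "MacDonalds Hamburgers")
far = cosine_similarity("Subway", "Wendys")
print(close > far)
```

Scores from this sketch are not expected to match the `similarity` values Seamster reports.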

Installation

Seamster requires Python 3.5 or newer to run.

Python package

You can easily install Seamster using pip:

pip3 install seamster

Manual

Alternatively, to get the latest development version, you can clone this repository and then manually install it:

git clone git@gitlab.com:juniperlabs-foss/seamster.git
cd seamster
python3 setup.py install

Usage

import pandas as pd
from seamster.join_side import JoinSide
from seamster.join import NameZipEntTypeJoin

# First source: a plain dict, converted to a DataFrame when building the JoinSide.
source1 = {
    "id": [1, 2, 3, 4],
    "names": [
        "Subway",
        "Blimpies",
        "McDonalds Hamburguesas, Inc.",
        "MacDonalds Hamburgers",
    ],
    "zip": [80238, 80238, 80230, 80238],
    "entity_type": ["llc", "llc", "corporation", "corporation"],
}

# Second source: already a DataFrame.
source2 = pd.DataFrame(
    {
        "id": [5, 6, 7],
        "names": ["McDonalds Hamburgers Inc", "Burger King", "Wendys"],
        "zip": [80238, 80238, 80230],
        "entity_type": ["corporation", "llc", "inc"],
    }
)

# Each JoinSide wraps one dataset and names the columns the join relies on.
js_a = JoinSide(
    data=pd.DataFrame(source1),
    source="a",
    entity_name_field="names",
    id_field="id",
    zip_field="zip",
    entity_type_field="entity_type",
)
js_b = JoinSide(
    data=source2,
    source="b",
    entity_name_field="names",
    id_field="id",
    zip_field="zip",
    entity_type_field="entity_type",
)

# Fuzzy-match entity names across the two sides.
bs = NameZipEntTypeJoin(join_sides=(js_a, js_b))

# Keep only pairs whose similarity score is at least 0.8.
df = bs.join(lower_bound=0.8)

print(df.to_dict(orient="records"))
# [
#     {
#         "id_a": 4,
#         "names_a": "MacDonalds Hamburgers",
#         "zip_a": 80238,
#         "entity_type_a": "corporation",
#         "source_a": "a",
#         "clean_names_a": "macdonalds hamburgers",
#         "clean_entity_type_a": "corp",
#         "id_b": 5,
#         "names_b": "McDonalds Hamburgers Inc",
#         "zip_b": 80238,
#         "entity_type_b": "corporation",
#         "source_b": "b",
#         "clean_names_b": "mcdonalds hamburgers",
#         "clean_entity_type_b": "corp",
#         "similarity": 0.86529,
#     }
# ]
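
The output above contains a single pair, and the `clean_names_*` columns show names that have been lowercased with the trailing "Inc" removed. The sketch below shows one way such a result could arise: normalize the names, compare only records that share a zip code and entity type, then keep pairs above a threshold. It is an assumption-laden illustration, not Seamster's implementation. `ENTITY_SUFFIXES`, `clean_name`, and `blocked_fuzzy_join` are hypothetical names, and `difflib.SequenceMatcher` stands in for whatever similarity measure Seamster actually uses, so the scores will differ from the `similarity` value shown.

```python
import re
from difflib import SequenceMatcher

# Hypothetical suffix list; Seamster's real cleaning rules are not documented here.
ENTITY_SUFFIXES = {"inc", "llc", "corp", "corporation", "co", "ltd"}


def clean_name(name):
    """Lowercase, strip punctuation, and drop trailing entity-type words."""
    words = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    while words and words[-1] in ENTITY_SUFFIXES:
        words.pop()
    return " ".join(words)


def blocked_fuzzy_join(left, right, lower_bound=0.8):
    """Compare names only for records that share a zip code and entity type."""
    matches = []
    for a in left:
        for b in right:
            if (a["zip"], a["entity_type"]) != (b["zip"], b["entity_type"]):
                continue  # different block: skip the name comparison
            score = SequenceMatcher(
                None, clean_name(a["names"]), clean_name(b["names"])
            ).ratio()
            if score >= lower_bound:
                matches.append((a["id"], b["id"], score))
    return matches


# The same records as in the usage example, as plain dicts.
left = [
    {"id": 1, "names": "Subway", "zip": 80238, "entity_type": "llc"},
    {"id": 2, "names": "Blimpies", "zip": 80238, "entity_type": "llc"},
    {"id": 3, "names": "McDonalds Hamburguesas, Inc.", "zip": 80230, "entity_type": "corporation"},
    {"id": 4, "names": "MacDonalds Hamburgers", "zip": 80238, "entity_type": "corporation"},
]
right = [
    {"id": 5, "names": "McDonalds Hamburgers Inc", "zip": 80238, "entity_type": "corporation"},
    {"id": 6, "names": "Burger King", "zip": 80238, "entity_type": "llc"},
    {"id": 7, "names": "Wendys", "zip": 80230, "entity_type": "inc"},
]

matches = blocked_fuzzy_join(left, right)
print([(id_a, id_b) for id_a, id_b, _ in matches])
```

Under these assumptions, record 3 is never compared with record 5 because their zip codes differ, which would explain why only records 4 and 5 are paired.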

TODO

- Create a transform class that can permute and enrich the DataFrame (e.g., geolocation).
- Support multiple fuzzy joins.

Contributing

For information on how to contribute to the project, please check the Contributor's Guide.

Contact

support@juniperlabs.io

incoming+juniperlabs-foss/seamster@gitlab.com

License

Apache 2.0

Credits

This package was created with Cookiecutter and the python-cookiecutter project template.

Release history

0.0.1 (Nov 28, 2019)

Download files

Source Distribution

seamster-0.0.1.tar.gz (16.0 kB)

Uploaded Nov 28, 2019 Source

Built Distribution

seamster-0.0.1-py3-none-any.whl (12.9 kB)

Uploaded Nov 28, 2019 Python 3

File details

Details for the file seamster-0.0.1.tar.gz.

File metadata

Download URL: seamster-0.0.1.tar.gz
Upload date: Nov 28, 2019
Size: 16.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.6.0 requests-toolbelt/0.9.1 tqdm/4.39.0 CPython/3.7.5

File hashes

Hashes for seamster-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`03d0fef6bc07a14d44ccf2241c425b9cbf2dea63121f57a2c91631dc8c1876bf`
MD5	`e2f19ac545944baef4c8420115c5c213`
BLAKE2b-256	`59fd7f8329fc4821a641e3e5940a5b6a72a4fb00379d055823baffe4036b14b4`

File details

Details for the file seamster-0.0.1-py3-none-any.whl.

File metadata

Download URL: seamster-0.0.1-py3-none-any.whl
Upload date: Nov 28, 2019
Size: 12.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.6.0 requests-toolbelt/0.9.1 tqdm/4.39.0 CPython/3.7.5

File hashes

Hashes for seamster-0.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a6bbbcf84a5d8e39aa51493493e43d8f5174662dae894310c40b476c9fe91552`
MD5	`01252cb34bdf01957599d51cc0a9a482`
BLAKE2b-256	`fc160543936144dde744e73f533b3a97a15d3bb0bbbdf1936f46e8670b2c832c`
