The SQL/Ibis powered sklearn of record linkage.

Mismo

Still in alpha stage. Breaking changes will happen frequently and without warning. Once things have stabilized I will come up with a stability policy. Any suggestions on how you would like the API to look would be greatly appreciated. I do use this in my work, so at least I do a decent job of ensuring correctness.


Goals

Mismo tries to be the sklearn of record linkage, backed by the scalability and power of SQL and Ibis. It is made of many small data structures and functions, each with a well-defined and standard API that allows them to be composed together and extended easily. None of the other record linkage packages I have seen, such as Splink, Dedupe, or Record Linkage Toolkit, had all of these properties, so I decided to make my own.

See Goals and Alternatives for a more detailed discussion of the goals of Mismo and how it compares to other record linkage packages.

Features

  • Supports larger-than-memory datasets, executed on powerful SQL engines. Use DuckDB for prototyping and for jobs up to roughly 10M records, or Spark or other distributed backends for larger tasks, without needing to change your code!
  • Use the clean, strongly typed, Pythonic dataframe APIs of Ibis.
  • Small, modular functions and data structures that are easy to plug together and extend.
  • Layered API: Use top-level APIs if your task is common enough that it is supported out of the box.
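To make the idea of small, composable pieces concrete, here is a minimal pure-Python sketch of the two classic record linkage steps: blocking (generating candidate pairs) and comparing (scoring each pair). This is not Mismo's API. The names `block_by`, `exact_match`, and `score_pairs` are hypothetical and exist only for illustration, and the sketch runs in memory on plain dicts rather than on Ibis tables.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable

Record = dict[str, str]
Comparer = Callable[[Record, Record], float]


def block_by(records: list[Record], key: Callable[[Record], str]) -> list[tuple[Record, Record]]:
    """Return candidate pairs: only records sharing a blocking key are paired."""
    buckets: dict[str, list[Record]] = defaultdict(list)
    for record in records:
        buckets[key(record)].append(record)
    pairs: list[tuple[Record, Record]] = []
    for bucket in buckets.values():
        pairs.extend(combinations(bucket, 2))
    return pairs


def exact_match(field: str) -> Comparer:
    """Build a comparer that scores 1.0 when a field matches, ignoring case and outer whitespace."""
    def compare(a: Record, b: Record) -> float:
        return 1.0 if a[field].strip().lower() == b[field].strip().lower() else 0.0
    return compare


def score_pairs(pairs: list[tuple[Record, Record]], comparers: list[Comparer]) -> list[tuple[str, str, float]]:
    """Average the comparer scores for each candidate pair."""
    return [
        (a["id"], b["id"], sum(c(a, b) for c in comparers) / len(comparers))
        for a, b in pairs
    ]


# Hypothetical toy data, not taken from the Mismo docs.
records = [
    {"id": "1", "name": "Ann Lee", "zip": "99501"},
    {"id": "2", "name": "ann lee ", "zip": "99501"},
    {"id": "3", "name": "Bob Ray", "zip": "99501"},
    {"id": "4", "name": "Ann Lee", "zip": "10001"},
]

pairs = block_by(records, key=lambda r: r["zip"])
scores = score_pairs(pairs, [exact_match("name")])
```

Each step is an ordinary function with a narrow input and output, so a different blocking key or an additional comparer can be swapped in without touching the rest. Record 4 is never compared with record 1 because they fall in different blocks, which is the usual trade-off of blocking: fewer comparisons at the risk of missed matches.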

Installation

mismo is available on PyPI. I try to publish semver'ed releases after most changes.

If I forget to do this, there are also prereleases on PyPI. These are published every week by a GitHub Action using the HEAD commit of this repo.

You can also install directly from a branch or a specific commit on GitHub:

uv pip install "mismo[viz] @ git+https://github.com/NickCrews/mismo@<SOME-SHA-OR-BRANCH>"
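Since releases, weekly prereleases, and Git installs can all end up in the same environment, it can help to check which one is actually installed. The following is a small sketch using only the standard library. The helper names `installed_version` and `is_dev_release` are hypothetical, and the `.dev` check assumes prereleases carry a `.devN` suffix (as in `0.3.1.dev6`), which may not hold for every install method.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def is_dev_release(version_string: str) -> bool:
    """Assumption: development prereleases are marked with a '.dev' segment."""
    return ".dev" in version_string


found = installed_version("mismo")
if found is None:
    print("mismo is not installed in this environment")
else:
    kind = "development prerelease" if is_dev_release(found) else "regular release"
    print(f"mismo {found} ({kind})")
```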

Examples

See the example notebook.

Documentation

See the documentation.

Contributing

See the contributing guide.

License

mismo is distributed under the terms of the LGPL-3.0-or-later license.
