Mismo
The SQL/Ibis powered sklearn of record linkage.
Mismo is still in alpha. Breaking changes will happen frequently and without warning. Once things have stabilized I will come up with a stability policy. Any suggestions about how you would like the API to look would be greatly appreciated. I do use this in my work, so at least I do a decent job of ensuring correctness.
Goals
Mismo tries to be the sklearn of record linkage, backed by the scalability and power of SQL and Ibis. It is made of many small data structures and functions, each with a well-defined and standard API that allows them to be composed together and extended easily. None of the other record linkage packages I have seen, such as Splink, Dedupe, or Record Linkage Toolkit, had all of these properties, so I decided to make my own.
See Goals and Alternatives for a more detailed discussion of the goals of Mismo and how it compares to other record linkage packages.
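To make the "small, composable pieces" idea concrete, here is a minimal sketch in plain Python. It does not use Mismo's API: the names `block_on`, `exact_match`, and `link` are hypothetical, chosen only to show how a blocker and several comparers that share a simple interface can be plugged together.

```python
from itertools import combinations
from typing import Callable, Iterable

# Hypothetical types for illustration only; these are not Mismo's classes.
Record = dict[str, str]
Pair = tuple[Record, Record]
Blocker = Callable[[list[Record]], Iterable[Pair]]
Comparer = Callable[[Record, Record], float]


def block_on(key: str) -> Blocker:
    """Build a blocker that yields pairs of records sharing the same `key` value."""
    def blocker(records: list[Record]) -> Iterable[Pair]:
        for a, b in combinations(records, 2):
            if a[key] == b[key]:
                yield a, b
    return blocker


def exact_match(key: str) -> Comparer:
    """Build a comparer that scores 1.0 on a case-insensitive match of `key`, else 0.0."""
    def comparer(a: Record, b: Record) -> float:
        return 1.0 if a[key].lower() == b[key].lower() else 0.0
    return comparer


def link(records: list[Record], blocker: Blocker,
         comparers: list[Comparer], threshold: float) -> list[tuple[str, str, float]]:
    """Run the blocker, average the comparer scores, and keep pairs at or above threshold."""
    linked = []
    for a, b in blocker(records):
        score = sum(c(a, b) for c in comparers) / len(comparers)
        if score >= threshold:
            linked.append((a["id"], b["id"], score))
    return linked


# Made-up records for the example.
records = [
    {"id": "1", "name": "Alice Smith", "zip": "98101", "email": "a@example.com"},
    {"id": "2", "name": "alice smith", "zip": "98101", "email": "a@example.com"},
    {"id": "3", "name": "Bob Jones", "zip": "98101", "email": "bob@example.com"},
]
result = link(records, block_on("zip"), [exact_match("name"), exact_match("email")], 0.5)
print(result)
```

Because every blocker and comparer has the same shape, swapping in a different blocking key or adding another comparer does not require touching `link`.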
Features
- Supports larger-than-memory datasets, executed on powerful SQL engines. Use DuckDB for prototyping and for jobs up to roughly 10M records, or Spark or other distributed backends for larger tasks, without needing to change your code!
- Use the clean, strongly typed, Pythonic DataFrame APIs of Ibis.
- Small, modular functions and data structures that are easy to plug together and extend.
- Layered API: Use top-level APIs if your task is common enough that it is supported out of the box.
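As a rough illustration of what "executed on SQL engines" means for record linkage, the sketch below pushes blocking and comparison into a single SQL self-join. It is not Mismo's API: it uses the standard library's `sqlite3` as a stand-in for DuckDB, and the `people` table, its columns, and its rows are invented for the example.

```python
import sqlite3

# Toy records; the column names and values are made up for illustration.
records = [
    (1, "Alice Smith", "98101"),
    (2, "alice smith", "98101"),
    (3, "Bob Jones", "98101"),
    (4, "Alice Smith", "10001"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, zip TEXT)")
con.executemany("INSERT INTO people VALUES (?, ?, ?)", records)

# Blocking: only pair up records that share a zip code; l.id < r.id keeps
# each unordered pair once and excludes self-pairs.
# Comparison: flag pairs whose names are equal ignoring case.
pairs = con.execute(
    """
    SELECT l.id, r.id, lower(l.name) = lower(r.name) AS name_match
    FROM people AS l
    JOIN people AS r
      ON l.zip = r.zip AND l.id < r.id
    ORDER BY l.id, r.id
    """
).fetchall()

# Keep only the candidate pairs that the comparison flagged as matching.
matches = [(left, right) for left, right, name_match in pairs if name_match]
print(pairs)
print(matches)
```

The database does the pairing and comparing, so Python only ever sees the candidate pairs, not the full cross product of records.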
Installation
mismo is available on PyPI.
I try to publish semver'ed releases after most changes.
If I forget to do this, there are also prereleases on PyPI. These are published every week by a GitHub Action using the HEAD commit of this repo.
You can also install directly from a branch or a specific commit on GitHub:
uv pip install "mismo[viz] @ git+https://github.com/NickCrews/mismo@<SOME-SHA-OR-BRANCH>"
Examples
See the example notebook.
Documentation
See the documentation.
Contributing
See the contributing guide.
License
mismo is distributed under the terms of the LGPL-3.0-or-later license.
File details
Details for the file mismo-0.3.0.tar.gz.
File metadata
- Download URL: mismo-0.3.0.tar.gz
- Upload date:
- Size: 1.0 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 92684551ee98d1b79cec5eefe08f5fb9c406df1dda233b3dee7db1118330189b |
| MD5 | 3ab483e6bcc666ff7383c4f55e8dd8c6 |
| BLAKE2b-256 | 66b529d3d8d4c25cae66660f78280dca209c17dafe56a2d2cbcca8b0d6aae7de |
Provenance
The following attestation bundles were made for mismo-0.3.0.tar.gz:
Publisher: release.yml on NickCrews/mismo
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mismo-0.3.0.tar.gz
- Subject digest: 92684551ee98d1b79cec5eefe08f5fb9c406df1dda233b3dee7db1118330189b
- Sigstore transparency entry: 829278918
- Sigstore integration time:
- Permalink: NickCrews/mismo@34cb815d17132c1072d43e45b70f949a4cd5d77b
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/NickCrews
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@34cb815d17132c1072d43e45b70f949a4cd5d77b
- Trigger Event: workflow_dispatch
File details
Details for the file mismo-0.3.0-py3-none-any.whl.
File metadata
- Download URL: mismo-0.3.0-py3-none-any.whl
- Upload date:
- Size: 1.1 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a0e0c9ad25135970af77720dd24df2a39c0e130fd261d7c5661228015a74d4e2 |
| MD5 | 8b1d6baae8424be915025e03d906b07c |
| BLAKE2b-256 | c7bebea00abd6ef0d86a88fb217100e6c206c6f8c659d6829f3d950067f73a21 |
Provenance
The following attestation bundles were made for mismo-0.3.0-py3-none-any.whl:
Publisher: release.yml on NickCrews/mismo
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mismo-0.3.0-py3-none-any.whl
- Subject digest: a0e0c9ad25135970af77720dd24df2a39c0e130fd261d7c5661228015a74d4e2
- Sigstore transparency entry: 829278920
- Sigstore integration time:
- Permalink: NickCrews/mismo@34cb815d17132c1072d43e45b70f949a4cd5d77b
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/NickCrews
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@34cb815d17132c1072d43e45b70f949a4cd5d77b
- Trigger Event: workflow_dispatch