fuzzyjoin
Join two tables by a fuzzy comparison of text columns.
Features
Command line utility to quickly join CSV files.
Ngram blocking to reduce the total number of comparisons.
Pure Python Levenshtein edit distance using [pylev](https://github.com/toastdriven/pylev).
License: [MIT](https://opensource.org/licenses/MIT)
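To illustrate the n-gram blocking idea from the feature list, here is a minimal sketch. It is not fuzzyjoin's actual implementation; the function names `ngrams` and `candidate_pairs`, the trigram size, and the sample values are all assumptions made for illustration. The idea is that only pairs of values sharing at least one character n-gram are kept as candidates for the more expensive edit-distance comparison.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of a lowercased string."""
    text = text.lower()
    if len(text) < n:
        # Strings shorter than n are kept whole so they can still match.
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def candidate_pairs(
    left: List[str], right: List[str], n: int = 3
) -> Set[Tuple[int, int]]:
    """Return (left index, right index) pairs that share at least one n-gram."""
    # Inverted index: n-gram -> positions in `right` that contain it.
    index: Dict[str, List[int]] = defaultdict(list)
    for j, value in enumerate(right):
        for gram in ngrams(value, n):
            index[gram].append(j)

    pairs: Set[Tuple[int, int]] = set()
    for i, value in enumerate(left):
        for gram in ngrams(value, n):
            for j in index.get(gram, []):
                pairs.add((i, j))
    return pairs


# Hypothetical sample data.
left = ["Acme Corp", "Globex", "Initech"]
right = ["ACME Corporation", "Globex Inc", "Umbrella"]

pairs = candidate_pairs(left, right)
print(sorted(pairs))                              # candidate pairs to compare
print(len(pairs), "of", len(left) * len(right))   # comparisons kept vs. total
```

With these sample values, only 2 of the 9 possible pairs survive blocking, so the edit distance would be computed twice instead of nine times. The trade-off is that a true match sharing no n-gram at all (for example, very short or heavily misspelled names) is never compared.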
Description
The goal of this package is to provide a quick and convenient way to join two tables on a pair of text columns, which often contain variations of names for the same entity. fuzzyjoin satisfies the simple and common case of joining by a single column from each table for a small to medium-sized dataset.
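As a rough illustration of what a single-column fuzzy join involves, the sketch below pairs rows from two small tables whenever the edit distance between their text columns is within a threshold. It does not use or reproduce fuzzyjoin's API: `levenshtein`, `fuzzy_inner_join`, the `max_distance` parameter, and the sample rows are hypothetical, and the distance function is a plain dynamic-programming version written here so the example stays self-contained.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute (or keep if equal)
                )
            )
        previous = current
    return previous[-1]


def fuzzy_inner_join(
    left: List[Row],
    right: List[Row],
    left_key: str,
    right_key: str,
    max_distance: int = 2,
) -> List[Tuple[Row, Row, int]]:
    """Return (left row, right row, distance) for every pair within max_distance."""
    matches = []
    for l_row in left:
        for r_row in right:
            distance = levenshtein(l_row[left_key].lower(), r_row[right_key].lower())
            if distance <= max_distance:
                matches.append((l_row, r_row, distance))
    return matches


# Hypothetical sample tables, shaped like rows read with csv.DictReader.
customers = [
    {"id": "1", "name": "Jon Smith"},
    {"id": "2", "name": "Maria Garcia"},
]
orders = [
    {"order": "A", "customer": "John Smith"},
    {"order": "B", "customer": "Maria Garcia"},
    {"order": "C", "customer": "Wei Chen"},
]

matches = fuzzy_inner_join(customers, orders, "name", "customer")
for l_row, r_row, distance in matches:
    print(l_row["id"], r_row["order"], distance)
```

This sketch compares every left row with every right row, which is only practical for small inputs. Rows without a close match are dropped, which corresponds to inner-join behavior.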
For more sophisticated and comprehensive treatments of the topic that will allow you to join records using multiple fields, see the packages below:
[dedupe](https://github.com/dedupeio/dedupe)
[recordlinkage](https://recordlinkage.readthedocs.io/en/latest/about.html)
TODO
[ ] Test transformation and exclude functions.
[ ] Implement left join and full join.
[ ] Optionally use python-Levenshtein for speed.
[ ] Check that the ID is actually unique.
[ ] Add documentation.
[ ] Option to rename headers and disambiguate duplicate header names.
History
0.2.1 (2019-04-10)
Additional docs and tests.
0.2.0 (2019-04-09)
Write multiple matches to a separate file.
Added types and docstrings.
0.1.2 (2019-04-09)
Duplicate release of 0.1.1
0.1.1 (2019-04-09)
First release on PyPI.