fuzzyjoin
Join two tables by a fuzzy comparison of text columns.
Features
Command line utility to quickly join CSV files.
Ngram blocking to reduce the total number of comparisons.
Pure Python Levenshtein edit distance using [pylev](https://github.com/toastdriven/pylev).
Fast Levenshtein edit distance using [editdistance](https://github.com/aflc/editdistance).
License: [MIT](https://opensource.org/licenses/MIT)
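As a rough illustration of the n-gram blocking idea listed above, the sketch below only pairs values that share at least one character trigram, so most unrelated pairs are never compared. It is not fuzzyjoin's own implementation: the function names `ngrams` and `candidate_pairs`, the trigram size, and the lowercasing step are all assumptions made for this example.

```python
from collections import defaultdict


def ngrams(text, n=3):
    """Return the set of character n-grams in text (the whole text if it is short)."""
    text = text.lower()
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def candidate_pairs(left, right, n=3):
    """Yield (left_value, right_value) pairs that share at least one n-gram."""
    # Index the right-hand values by every n-gram they contain.
    index = defaultdict(set)
    for value in right:
        for gram in ngrams(value, n):
            index[gram].add(value)
    # Look up each left-hand value's n-grams to collect its candidates.
    for value in left:
        matches = set()
        for gram in ngrams(value, n):
            matches |= index.get(gram, set())
        for match in sorted(matches):
            yield value, match


left = ["Acme Corp", "Globex"]
right = ["ACME Corporation", "Globex Inc", "Initech"]
pairs = list(candidate_pairs(left, right))
print(pairs)
# [('Acme Corp', 'ACME Corporation'), ('Globex', 'Globex Inc')]
```

Here only 2 of the 6 possible pairs survive blocking, and only those would go on to the more expensive edit distance comparison.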
Installation
Pure Python: `pip install fuzzyjoin`
Optimized: `pip install fuzzyjoin[fast]`
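The two install options correspond to two edit distance backends. The sketch below shows one way such a fallback could be wired up. It is an assumption about the general pattern, not a copy of fuzzyjoin's code: the `distance` wrapper is a hypothetical name, and the final stdlib-only branch exists solely so the example runs when neither package is installed.

```python
try:
    import editdistance  # optimized backend (assumed to come with the [fast] extra)

    def distance(a, b):
        return editdistance.eval(a, b)
except ImportError:
    try:
        import pylev  # pure Python backend

        def distance(a, b):
            return pylev.levenshtein(a, b)
    except ImportError:
        def distance(a, b):
            """Stdlib-only Levenshtein distance, used if neither package is present."""
            previous = list(range(len(b) + 1))
            for i, char_a in enumerate(a, start=1):
                current = [i]
                for j, char_b in enumerate(b, start=1):
                    current.append(min(
                        previous[j] + 1,                       # deletion
                        current[j - 1] + 1,                    # insertion
                        previous[j - 1] + (char_a != char_b),  # substitution
                    ))
                previous = current
            return previous[-1]

print(distance("kitten", "sitting"))  # 3
```

Whichever backend is picked, callers see the same `distance(a, b)` function, so the rest of the code does not need to know which package is installed.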
Description
The goal of this package is to provide a quick and convenient way to join two tables on a pair of text columns that often contain variant spellings of the same entity's name. fuzzyjoin covers the simple and common case of joining on a single column from each table, for datasets with thousands of records.
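To make the single-column case concrete, here is a minimal stdlib-only sketch of a fuzzy inner join between two small CSV tables. It does not use fuzzyjoin's API. The function names, the column names, the sample data, the normalized similarity score, and the 0.8 threshold are all hypothetical choices for illustration.

```python
import csv
import io


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Edit distance scaled to 0..1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def fuzzy_inner_join(left_rows, right_rows, left_key, right_key, threshold=0.8):
    """Pair each left row with its best-scoring right row at or above threshold."""
    joined = []
    for left in left_rows:
        scored = [(similarity(left[left_key].lower(), right[right_key].lower()), right)
                  for right in right_rows]
        score, best = max(scored, key=lambda item: item[0], default=(0.0, None))
        if best is not None and score >= threshold:
            joined.append({**left, **best, "score": round(score, 2)})
    return joined


left_csv = "id,name\n1,Jon Smith\n2,Maria Garcia\n"
right_csv = "customer_id,full_name\nA,John Smith\nB,Maria Garcia\nC,Wei Chen\n"
left_rows = list(csv.DictReader(io.StringIO(left_csv)))
right_rows = list(csv.DictReader(io.StringIO(right_csv)))

result = fuzzy_inner_join(left_rows, right_rows, "name", "full_name")
for row in result:
    print(row["id"], row["customer_id"], row["score"])
# 1 A 0.9
# 2 B 1.0
```

This version compares every left row against every right row, which is workable for small tables but grows quadratically with table size.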
For a more sophisticated and comprehensive treatment of the topic that will allow you to join records using multiple fields, see the packages below:
[dedupe](https://github.com/dedupeio/dedupe)
[recordlinkage](https://recordlinkage.readthedocs.io/en/latest/about.html)
TODO
Test transformation and exclude functions.
Implement left join and full join.
Check that the ID is actually unique.
Add documentation.
Option to rename headers and disambiguate duplicate header names.
History
0.3.4 (2019-04-11)
Fix function defaults.
Minor optimizations.
Additional CLI parameters.
0.3.3 (2019-04-10)
Cleanup checks.
0.3.2 (2019-04-10)
Include basic installation instructions.
0.3.1 (2019-04-10)
Minor README updates.
0.3.0 (2019-04-10)
Use editdistance if available, otherwise fall back to pylev.
Report progress by default.
Number comparison options.
Renamed get_multiples to filter_multiples.
0.2.1 (2019-04-10)
Additional docs and tests.
0.2.0 (2019-04-09)
Write multiple matches to a separate file.
Added types and docstrings.
0.1.2 (2019-04-09)
Duplicate release of 0.1.1.
0.1.1 (2019-04-09)
First release on PyPI.