Skip to main content
Join the official 2019 Python Developers SurveyStart the survey!

Join two tables by a fuzzy comparison of text columns.

Project description

fuzzyjoin
=========

Join two tables by a fuzzy comparison of text columns.

Features
--------
* Command line utility to quickly join CSV files.
* Ngram blocking to reduce the total number of comparisons.
* Pure python levenshtein edit distance using [pylev](https://github.com/toastdriven/pylev).
* Fast levenshtein edit distance using [editdistance](https://github.com/aflc/editdistance).
* License: [MIT](https://opensource.org/licenses/MIT)


Installation
------------
* Pure python: `pip install fuzzyjoin`
* Optimized: `pip install fuzzyjoin[fast]`


Description
-----------
The goal of this package is to provide a quick and convenient way to
join two tables on a pair of text columns, which often contain variations
of names for the same entity. `fuzzyjoin` satisfies the simple and common case
of joining by a single column from each table for datasets in the thousands of records.

For a more sophisticated and comprehensive treatment of the topic that will allow
you to join records using multiple fields, see the packages below:

* [dedupe](https://github.com/dedupeio/dedupe)
* [recordlinkage](https://recordlinkage.readthedocs.io/en/latest/about.html)


CLI Help
--------
```bash
\> fuzzyjoin --help

Usage: fuzzyjoin_cli.py [OPTIONS] LEFT_CSV RIGHT_CSV

Inner join <left_csv> and <right_csv> by a fuzzy comparison of
<left_field> and <right_field>.

Options:
-f, --fields TEXT... <left_field> <right_field> [required]
-t, --threshold FLOAT Only return matches above this score. [default: 0.7]
-o, --output TEXT File to write the matches to.
--multiples TEXT File for left IDs with multiple matches.
--exclude TEXT Function used to exclude records. See:
<fuzzyjoin.compare.default_exclude>
--collate TEXT Function used to collate <fields>. See:
<fuzzyjoin.collate.default_collate>
--compare TEXT Function used to compare records. See:
<fuzzyjoin.compare.default_compare>
--numbers-exact Numbers and order must match exactly.
--numbers-permutation Numbers must match but may be out of order.
--numbers-subset Numbers must be a subset.
--ngram-size INTEGER The ngram size to create blocks with. [default: 3]
--no-progress Do not show comparison progress.
--debug Exit to PDB on exception.
--yes Yes to all prompts.
--help Show this message and exit.
```


CLI Usage
---------

```bash
# Use field `name` from left.csv and field `full_name` from right.csv
\> fuzzyjoin --fields name full_name left.csv right.csv
# Export rows with multiple matches from left.csv to a separate file.
\> fuzzyjoin --multiples multiples.csv --fields name full_name left.csv right.csv
# Increase the ngram size, reducing execution time but removing tokens small than `ngram_size`
# as possible matches.
\> fuzzyjoin --ngram-size 5 --fields name full_name left.csv right.csv
# Ensure any numbers that appear are in both fields and in the same order.
\> fuzzyjoin --numbers-exact --fields name full_name left.csv right.csv
# Ensure any numbers that appear are in both fields but may be in a different order.
\> fuzzyjoin --numbers-permutation --fields name full_name left.csv right.csv
# Ensure numbers that appear in one field are at least a subset of the other.
\> fuzzyjoin --numbers-subset --fields name full_name left.csv right.csv
# Use importable function `package.func` from PATH as the comparison function
# instead of `fuzzyjoin.compare.default_compare`.
\> fuzzyjoin --compare package.func --fields name full_name left.csv right.csv
```

API Usage
---------
```python
from fuzzyjoin import io

# Specify which field to use from the left and right CSV files.
options = Options(
field_1='name',
field_2='full_name'
)
matches = io.inner_join_csv_files('left.csv', 'right.csv', options)
io.write_matches(matches, output_file='matches.csv')
```

TODO
----
- Test transformation and exclude functions.
- Implement left join and full join.
- Check that the ID is actually unique.
- Add documentation.
- Option to rename headers and disambiguate duplicate header names.


History
=======
0.5.1 (2019-04-15)
------------------
* Update usage docs.
* Rename key_* params to field_* for consistency.

0.5.0 (2019-04-15)
------------------
* Removed ID field requirement.

0.4.1 (2019-04-12)
------------------
* Options parameter.

0.3.4 (2019-04-11)
------------------
* Fix function defaults.
* Minor optimizations.
* Additional CLI parameters.

0.3.3 (2019-04-10)
------------------
* Cleanup checks.

0.3.2 (2019-04-10)
------------------
* Include basic installation instructions.

0.3.1 (2019-04-10)
------------------
* Minor README updates.


0.3.0 (2019-04-10)
------------------
* Use editdistance if available, otherwise fallback to pylev.
* Report progress by default.
* Number comparison options.
* Renamed get_multiples to filter_multiples.


0.2.1 (2019-04-10)
------------------
* Additional docs and tests.

0.2.0 (2019-04-09)
------------------
* Write multiples matches to a separate file.
* Added types and docstrings.

0.1.2 (2019-04-09)
------------------
* Duplicate release of 0.1.1

0.1.1 (2019-04-09)
------------------
* First release on PyPI.


Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for fuzzyjoin, version 0.5.2
Filename, size File type Python version Upload date Hashes
Filename, size fuzzyjoin-0.5.2-py2.py3-none-any.whl (11.5 kB) File type Wheel Python version py2.py3 Upload date Hashes View hashes
Filename, size fuzzyjoin-0.5.2.tar.gz (11.5 kB) File type Source Python version None Upload date Hashes View hashes

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN SignalFx SignalFx Supporter DigiCert DigiCert EV certificate StatusPage StatusPage Status page