Skip to main content

Join two tables by a fuzzy comparison of text columns.

Project description

fuzzyjoin
=========

Join two tables by a fuzzy comparison of text columns.

Features
--------
* Command line utility to quickly join CSV files.
* Ngram blocking to reduce the total number of comparisons.
* Pure python levenshtein edit distance using [pylev](https://github.com/toastdriven/pylev).
* Fast levenshtein edit distance using [editdistance](https://github.com/aflc/editdistance).
* License: [MIT](https://opensource.org/licenses/MIT)


Installation
------------
* Pure python: `pip install fuzzyjoin`
* Optimized: `pip install fuzzyjoin[fast]`


Description
-----------
The goal of this package is to provide a quick and convenient way to
join two tables on a pair of text columns, which often contain variations
of names for the same entity. `fuzzyjoin` satisfies the simple and common case
of joining by a single column from each table for datasets in the thousands of records.

For a more sophisticated and comprehensive treatment of the topic that will allow
you to join records using multiple fields, see the packages below:

* [dedupe](https://github.com/dedupeio/dedupe)
* [recordlinkage](https://recordlinkage.readthedocs.io/en/latest/about.html)


CLI Help
--------
```bash
\> fuzzyjoin --help

Usage: fuzzyjoin_cli.py [OPTIONS] LEFT_CSV RIGHT_CSV

Inner join <left_csv> and <right_csv> by a fuzzy comparison of
<left_field> and <right_field>.

Options:
-f, --fields TEXT... <left_field> <right_field> [required]
-t, --threshold FLOAT Only return matches above this score. [default: 0.7]
-o, --output TEXT File to write the matches to.
--multiples TEXT File for left IDs with multiple matches.
--exclude TEXT Function used to exclude records. See:
<fuzzyjoin.compare.default_exclude>
--collate TEXT Function used to collate <fields>. See:
<fuzzyjoin.collate.default_collate>
--compare TEXT Function used to compare records. See:
<fuzzyjoin.compare.default_compare>
--numbers-exact Numbers and order must match exactly.
--numbers-permutation Numbers must match but may be out of order.
--numbers-subset Numbers must be a subset.
--ngram-size INTEGER The ngram size to create blocks with. [default: 3]
--no-progress Do not show comparison progress.
--debug Exit to PDB on exception.
--yes Yes to all prompts.
--help Show this message and exit.
```


CLI Usage
---------

```bash
# Use field `name` from left.csv and field `full_name` from right.csv
\> fuzzyjoin --fields name full_name left.csv right.csv
# Export rows with multiple matches from left.csv to a separate file.
\> fuzzyjoin --multiples multiples.csv --fields name full_name left.csv right.csv
# Increase the ngram size, reducing execution time but removing tokens small than `ngram_size`
# as possible matches.
\> fuzzyjoin --ngram-size 5 --fields name full_name left.csv right.csv
# Ensure any numbers that appear are in both fields and in the same order.
\> fuzzyjoin --numbers-exact --fields name full_name left.csv right.csv
# Ensure any numbers that appear are in both fields but may be in a different order.
\> fuzzyjoin --numbers-permutation --fields name full_name left.csv right.csv
# Ensure numbers that appear in one field are at least a subset of the other.
\> fuzzyjoin --numbers-subset --fields name full_name left.csv right.csv
# Use importable function `package.func` from PATH as the comparison function
# instead of `fuzzyjoin.compare.default_compare`.
\> fuzzyjoin --compare package.func --fields name full_name left.csv right.csv
```

API Usage
---------
```python
from fuzzyjoin import io

# Specify which field to use from the left and right CSV files.
options = Options(
field_1='name',
field_2='full_name'
)
matches = io.inner_join_csv_files('left.csv', 'right.csv', options)
io.write_matches(matches, output_file='matches.csv')
```

TODO
----
- Test transformation and exclude functions.
- Implement left join and full join.
- Check that the ID is actually unique.
- Add documentation.
- Option to rename headers and disambiguate duplicate header names.


History
=======
0.5.1 (2019-04-15)
------------------
* Update usage docs.
* Rename key_* params to field_* for consistency.

0.5.0 (2019-04-15)
------------------
* Removed ID field requirement.

0.4.1 (2019-04-12)
------------------
* Options parameter.

0.3.4 (2019-04-11)
------------------
* Fix function defaults.
* Minor optimizations.
* Additional CLI parameters.

0.3.3 (2019-04-10)
------------------
* Cleanup checks.

0.3.2 (2019-04-10)
------------------
* Include basic installation instructions.

0.3.1 (2019-04-10)
------------------
* Minor README updates.


0.3.0 (2019-04-10)
------------------
* Use editdistance if available, otherwise fallback to pylev.
* Report progress by default.
* Number comparison options.
* Renamed get_multiples to filter_multiples.


0.2.1 (2019-04-10)
------------------
* Additional docs and tests.

0.2.0 (2019-04-09)
------------------
* Write multiples matches to a separate file.
* Added types and docstrings.

0.1.2 (2019-04-09)
------------------
* Duplicate release of 0.1.1

0.1.1 (2019-04-09)
------------------
* First release on PyPI.


Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fuzzyjoin-0.5.2.tar.gz (11.5 kB view details)

Uploaded Source

Built Distribution

fuzzyjoin-0.5.2-py2.py3-none-any.whl (11.5 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file fuzzyjoin-0.5.2.tar.gz.

File metadata

  • Download URL: fuzzyjoin-0.5.2.tar.gz
  • Upload date:
  • Size: 11.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/39.0.1 requests-toolbelt/0.8.0 tqdm/4.31.1 CPython/3.6.6

File hashes

Hashes for fuzzyjoin-0.5.2.tar.gz
Algorithm Hash digest
SHA256 f5d594f37d806ed553ed3431980c89bfcf2240fb5cec0f43a0fb76baf118ebc2
MD5 d72c4a9f6d4597495abc54f24e273119
BLAKE2b-256 646a21430e07d12afba18367490d563a335aea3cdc4d984f75487b3ed3b4bef5

See more details on using hashes here.

File details

Details for the file fuzzyjoin-0.5.2-py2.py3-none-any.whl.

File metadata

  • Download URL: fuzzyjoin-0.5.2-py2.py3-none-any.whl
  • Upload date:
  • Size: 11.5 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/39.0.1 requests-toolbelt/0.8.0 tqdm/4.31.1 CPython/3.6.6

File hashes

Hashes for fuzzyjoin-0.5.2-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 847fe6fa1c735101b91034d318fb90a405292e7f3431522b3e3586b0adddf241
MD5 16702f549b8292d56ea69d3d73cf62ce
BLAKE2b-256 8849acb8228b11ae8f3d627ad78b2a5235094d27b8a9bc78920d918194d08106

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page