fuzzyjoin

Join two tables by a fuzzy comparison of text columns.

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English
Programming Language

Project description

fuzzyjoin
=========

Join two tables by a fuzzy comparison of text columns.

Features
--------
* Command line utility to quickly join CSV files.
* Ngram blocking to reduce the total number of comparisons.
* Pure python levenshtein edit distance using [pylev](https://github.com/toastdriven/pylev).
* Fast levenshtein edit distance using [editdistance](https://github.com/aflc/editdistance).
* License: [MIT](https://opensource.org/licenses/MIT)

Installation
------------
* Pure python: `pip install fuzzyjoin`
* Optimized: `pip install fuzzyjoin[fast]`

Description
-----------
The goal of this package is to provide a quick and convenient way to
join two tables on a pair of text columns, which often contain variations
of names for the same entity. `fuzzyjoin` satisfies the simple and common case
of joining by a single column from each table for datasets in the thousands of records.

For a more sophisticated and comprehensive treatment of the topic that will allow
you to join records using multiple fields, see the packages below:

* [dedupe](https://github.com/dedupeio/dedupe)
* [recordlinkage](https://recordlinkage.readthedocs.io/en/latest/about.html)

CLI Help
--------
```bash
\> fuzzyjoin --help

Usage: fuzzyjoin_cli.py [OPTIONS] LEFT_CSV RIGHT_CSV

Inner join <left_csv> and <right_csv> by a fuzzy comparison of
<left_field> and <right_field>.

Options:
-f, --fields TEXT... <left_field> <right_field> [required]
-t, --threshold FLOAT Only return matches above this score. [default: 0.7]
-o, --output TEXT File to write the matches to.
--multiples TEXT File for left IDs with multiple matches.
--exclude TEXT Function used to exclude records. See:
<fuzzyjoin.compare.default_exclude>
--collate TEXT Function used to collate <fields>. See:
<fuzzyjoin.collate.default_collate>
--compare TEXT Function used to compare records. See:
<fuzzyjoin.compare.default_compare>
--numbers-exact Numbers and order must match exactly.
--numbers-permutation Numbers must match but may be out of order.
--numbers-subset Numbers must be a subset.
--ngram-size INTEGER The ngram size to create blocks with. [default: 3]
--no-progress Do not show comparison progress.
--debug Exit to PDB on exception.
--yes Yes to all prompts.
--help Show this message and exit.
```

CLI Usage
---------

```bash
# Use field `name` from left.csv and field `full_name` from right.csv
\> fuzzyjoin --fields name full_name left.csv right.csv
# Export rows with multiple matches from left.csv to a separate file.
\> fuzzyjoin --multiples multiples.csv --fields name full_name left.csv right.csv
# Increase the ngram size, reducing execution time but removing tokens small than `ngram_size`
# as possible matches.
\> fuzzyjoin --ngram-size 5 --fields name full_name left.csv right.csv
# Ensure any numbers that appear are in both fields and in the same order.
\> fuzzyjoin --numbers-exact --fields name full_name left.csv right.csv
# Ensure any numbers that appear are in both fields but may be in a different order.
\> fuzzyjoin --numbers-permutation --fields name full_name left.csv right.csv
# Ensure numbers that appear in one field are at least a subset of the other.
\> fuzzyjoin --numbers-subset --fields name full_name left.csv right.csv
# Use importable function `package.func` from PATH as the comparison function
# instead of `fuzzyjoin.compare.default_compare`.
\> fuzzyjoin --compare package.func --fields name full_name left.csv right.csv
```

API Usage
---------
```python
from fuzzyjoin import io

# Specify which field to use from the left and right CSV files.
options = Options(
field_1='name',
field_2='full_name'
)
matches = io.inner_join_csv_files('left.csv', 'right.csv', options)
io.write_matches(matches, output_file='matches.csv')
```

TODO
----
- Test transformation and exclude functions.
- Implement left join and full join.
- Check that the ID is actually unique.
- Add documentation.
- Option to rename headers and disambiguate duplicate header names.

History
=======
0.5.1 (2019-04-15)
------------------
* Update usage docs.
* Rename key_* params to field_* for consistency.

0.5.0 (2019-04-15)
------------------
* Removed ID field requirement.

0.4.1 (2019-04-12)
------------------
* Options parameter.

0.3.4 (2019-04-11)
------------------
* Fix function defaults.
* Minor optimizations.
* Additional CLI parameters.

0.3.3 (2019-04-10)
------------------
* Cleanup checks.

0.3.2 (2019-04-10)
------------------
* Include basic installation instructions.

0.3.1 (2019-04-10)
------------------
* Minor README updates.

0.3.0 (2019-04-10)
------------------
* Use editdistance if available, otherwise fallback to pylev.
* Report progress by default.
* Number comparison options.
* Renamed get_multiples to filter_multiples.

0.2.1 (2019-04-10)
------------------
* Additional docs and tests.

0.2.0 (2019-04-09)
------------------
* Write multiples matches to a separate file.
* Added types and docstrings.

0.1.2 (2019-04-09)
------------------
* Duplicate release of 0.1.1

0.1.1 (2019-04-09)
------------------
* First release on PyPI.

Project details

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English
Programming Language

Release history Release notifications | RSS feed

This version

0.5.2

Apr 15, 2019

0.5.1

Apr 15, 2019

0.5.0

Apr 15, 2019

0.4.1

Apr 12, 2019

0.3.4

Apr 12, 2019

0.3.3

Apr 10, 2019

0.3.2

Apr 10, 2019

0.3.1

Apr 10, 2019

0.3.0

Apr 10, 2019

0.2.1

Apr 10, 2019

0.2.0

Apr 10, 2019

0.1.2

Apr 10, 2019

0.1.1

Apr 10, 2019

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fuzzyjoin-0.5.2.tar.gz (11.5 kB view details)

Uploaded Apr 15, 2019 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

fuzzyjoin-0.5.2-py2.py3-none-any.whl (11.5 kB view details)

Uploaded Apr 15, 2019 Python 2Python 3

File details

Details for the file fuzzyjoin-0.5.2.tar.gz.

File metadata

Download URL: fuzzyjoin-0.5.2.tar.gz
Upload date: Apr 15, 2019
Size: 11.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/39.0.1 requests-toolbelt/0.8.0 tqdm/4.31.1 CPython/3.6.6

File hashes

Hashes for fuzzyjoin-0.5.2.tar.gz
Algorithm	Hash digest
SHA256	`f5d594f37d806ed553ed3431980c89bfcf2240fb5cec0f43a0fb76baf118ebc2`
MD5	`d72c4a9f6d4597495abc54f24e273119`
BLAKE2b-256	`646a21430e07d12afba18367490d563a335aea3cdc4d984f75487b3ed3b4bef5`

See more details on using hashes here.

File details

Details for the file fuzzyjoin-0.5.2-py2.py3-none-any.whl.

File metadata

Download URL: fuzzyjoin-0.5.2-py2.py3-none-any.whl
Upload date: Apr 15, 2019
Size: 11.5 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/39.0.1 requests-toolbelt/0.8.0 tqdm/4.31.1 CPython/3.6.6

File hashes

Hashes for fuzzyjoin-0.5.2-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`847fe6fa1c735101b91034d318fb90a405292e7f3431522b3e3586b0adddf241`
MD5	`16702f549b8292d56ea69d3d73cf62ce`
BLAKE2b-256	`8849acb8228b11ae8f3d627ad78b2a5235094d27b8a9bc78920d918194d08106`

See more details on using hashes here.

fuzzyjoin 0.5.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes