Join two tables by a fuzzy comparison of text columns.
Project description
fuzzyjoin
=========
Join two tables by a fuzzy comparison of text columns.
Features
--------
* Command line utility to quickly join CSV files.
* Ngram blocking to reduce the total number of comparisons.
* Pure python levenshtein edit distance using [pylev](https://github.com/toastdriven/pylev).
* Fast levenshtein edit distance using [editdistance](https://github.com/aflc/editdistance).
* License: [MIT](https://opensource.org/licenses/MIT)
Installation
------------
* Pure python: `pip install fuzzyjoin`
* Optimized: `pip install fuzzyjoin[fast]`
Description
-----------
The goal of this package is to provide a quick and convenient way to
join two tables on a pair of text columns, which often contain variations
of names for the same entity. `fuzzyjoin` satisfies the simple and common case
of joining by a single column from each table for datasets in the thousands of records.
For a more sophisticated and comprehensive treatment of the topic that will allow
you to join records using multiple fields, see the packages below:
* [dedupe](https://github.com/dedupeio/dedupe)
* [recordlinkage](https://recordlinkage.readthedocs.io/en/latest/about.html)
CLI Help
--------
```bash
\> fuzzyjoin --help
Usage: fuzzyjoin_cli.py [OPTIONS] LEFT_CSV RIGHT_CSV
Inner join <left_csv> and <right_csv> by a fuzzy comparison of
<left_field> and <right_field>.
Options:
-f, --fields TEXT... <left_field> <right_field> [required]
-t, --threshold FLOAT Only return matches above this score. [default: 0.7]
-o, --output TEXT File to write the matches to.
--multiples TEXT File for left IDs with multiple matches.
--exclude TEXT Function used to exclude records. See:
<fuzzyjoin.compare.default_exclude>
--collate TEXT Function used to collate <fields>. See:
<fuzzyjoin.collate.default_collate>
--compare TEXT Function used to compare records. See:
<fuzzyjoin.compare.default_compare>
--numbers-exact Numbers and order must match exactly.
--numbers-permutation Numbers must match but may be out of order.
--numbers-subset Numbers must be a subset.
--ngram-size INTEGER The ngram size to create blocks with. [default: 3]
--no-progress Do not show comparison progress.
--debug Exit to PDB on exception.
--yes Yes to all prompts.
--help Show this message and exit.
```
CLI Usage
---------
```bash
# Use field `name` from left.csv and field `full_name` from right.csv
\> fuzzyjoin --fields name full_name left.csv right.csv
# Export rows with multiple matches from left.csv to a separate file.
\> fuzzyjoin --multiples multiples.csv --fields name full_name left.csv right.csv
# Increase the ngram size, reducing execution time but removing tokens small than `ngram_size`
# as possible matches.
\> fuzzyjoin --ngram-size 5 --fields name full_name left.csv right.csv
# Ensure any numbers that appear are in both fields and in the same order.
\> fuzzyjoin --numbers-exact --fields name full_name left.csv right.csv
# Ensure any numbers that appear are in both fields but may be in a different order.
\> fuzzyjoin --numbers-permutation --fields name full_name left.csv right.csv
# Ensure numbers that appear in one field are at least a subset of the other.
\> fuzzyjoin --numbers-subset --fields name full_name left.csv right.csv
# Use importable function `package.func` from PATH as the comparison function
# instead of `fuzzyjoin.compare.default_compare`.
\> fuzzyjoin --compare package.func --fields name full_name left.csv right.csv
```
API Usage
---------
```python
from fuzzyjoin import io
# Specify which field to use from the left and right CSV files.
options = Options(
field_1='name',
field_2='full_name'
)
matches = io.inner_join_csv_files('left.csv', 'right.csv', options)
io.write_matches(matches, output_file='matches.csv')
```
TODO
----
- Test transformation and exclude functions.
- Implement left join and full join.
- Check that the ID is actually unique.
- Add documentation.
- Option to rename headers and disambiguate duplicate header names.
History
=======
0.5.1 (2019-04-15)
------------------
* Update usage docs.
* Rename key_* params to field_* for consistency.
0.5.0 (2019-04-15)
------------------
* Removed ID field requirement.
0.4.1 (2019-04-12)
------------------
* Options parameter.
0.3.4 (2019-04-11)
------------------
* Fix function defaults.
* Minor optimizations.
* Additional CLI parameters.
0.3.3 (2019-04-10)
------------------
* Cleanup checks.
0.3.2 (2019-04-10)
------------------
* Include basic installation instructions.
0.3.1 (2019-04-10)
------------------
* Minor README updates.
0.3.0 (2019-04-10)
------------------
* Use editdistance if available, otherwise fallback to pylev.
* Report progress by default.
* Number comparison options.
* Renamed get_multiples to filter_multiples.
0.2.1 (2019-04-10)
------------------
* Additional docs and tests.
0.2.0 (2019-04-09)
------------------
* Write multiples matches to a separate file.
* Added types and docstrings.
0.1.2 (2019-04-09)
------------------
* Duplicate release of 0.1.1
0.1.1 (2019-04-09)
------------------
* First release on PyPI.
=========
Join two tables by a fuzzy comparison of text columns.
Features
--------
* Command line utility to quickly join CSV files.
* Ngram blocking to reduce the total number of comparisons.
* Pure python levenshtein edit distance using [pylev](https://github.com/toastdriven/pylev).
* Fast levenshtein edit distance using [editdistance](https://github.com/aflc/editdistance).
* License: [MIT](https://opensource.org/licenses/MIT)
Installation
------------
* Pure python: `pip install fuzzyjoin`
* Optimized: `pip install fuzzyjoin[fast]`
Description
-----------
The goal of this package is to provide a quick and convenient way to
join two tables on a pair of text columns, which often contain variations
of names for the same entity. `fuzzyjoin` satisfies the simple and common case
of joining by a single column from each table for datasets in the thousands of records.
For a more sophisticated and comprehensive treatment of the topic that will allow
you to join records using multiple fields, see the packages below:
* [dedupe](https://github.com/dedupeio/dedupe)
* [recordlinkage](https://recordlinkage.readthedocs.io/en/latest/about.html)
CLI Help
--------
```bash
\> fuzzyjoin --help
Usage: fuzzyjoin_cli.py [OPTIONS] LEFT_CSV RIGHT_CSV
Inner join <left_csv> and <right_csv> by a fuzzy comparison of
<left_field> and <right_field>.
Options:
-f, --fields TEXT... <left_field> <right_field> [required]
-t, --threshold FLOAT Only return matches above this score. [default: 0.7]
-o, --output TEXT File to write the matches to.
--multiples TEXT File for left IDs with multiple matches.
--exclude TEXT Function used to exclude records. See:
<fuzzyjoin.compare.default_exclude>
--collate TEXT Function used to collate <fields>. See:
<fuzzyjoin.collate.default_collate>
--compare TEXT Function used to compare records. See:
<fuzzyjoin.compare.default_compare>
--numbers-exact Numbers and order must match exactly.
--numbers-permutation Numbers must match but may be out of order.
--numbers-subset Numbers must be a subset.
--ngram-size INTEGER The ngram size to create blocks with. [default: 3]
--no-progress Do not show comparison progress.
--debug Exit to PDB on exception.
--yes Yes to all prompts.
--help Show this message and exit.
```
CLI Usage
---------
```bash
# Use field `name` from left.csv and field `full_name` from right.csv
\> fuzzyjoin --fields name full_name left.csv right.csv
# Export rows with multiple matches from left.csv to a separate file.
\> fuzzyjoin --multiples multiples.csv --fields name full_name left.csv right.csv
# Increase the ngram size, reducing execution time but removing tokens small than `ngram_size`
# as possible matches.
\> fuzzyjoin --ngram-size 5 --fields name full_name left.csv right.csv
# Ensure any numbers that appear are in both fields and in the same order.
\> fuzzyjoin --numbers-exact --fields name full_name left.csv right.csv
# Ensure any numbers that appear are in both fields but may be in a different order.
\> fuzzyjoin --numbers-permutation --fields name full_name left.csv right.csv
# Ensure numbers that appear in one field are at least a subset of the other.
\> fuzzyjoin --numbers-subset --fields name full_name left.csv right.csv
# Use importable function `package.func` from PATH as the comparison function
# instead of `fuzzyjoin.compare.default_compare`.
\> fuzzyjoin --compare package.func --fields name full_name left.csv right.csv
```
API Usage
---------
```python
from fuzzyjoin import io
# Specify which field to use from the left and right CSV files.
options = Options(
field_1='name',
field_2='full_name'
)
matches = io.inner_join_csv_files('left.csv', 'right.csv', options)
io.write_matches(matches, output_file='matches.csv')
```
TODO
----
- Test transformation and exclude functions.
- Implement left join and full join.
- Check that the ID is actually unique.
- Add documentation.
- Option to rename headers and disambiguate duplicate header names.
History
=======
0.5.1 (2019-04-15)
------------------
* Update usage docs.
* Rename key_* params to field_* for consistency.
0.5.0 (2019-04-15)
------------------
* Removed ID field requirement.
0.4.1 (2019-04-12)
------------------
* Options parameter.
0.3.4 (2019-04-11)
------------------
* Fix function defaults.
* Minor optimizations.
* Additional CLI parameters.
0.3.3 (2019-04-10)
------------------
* Cleanup checks.
0.3.2 (2019-04-10)
------------------
* Include basic installation instructions.
0.3.1 (2019-04-10)
------------------
* Minor README updates.
0.3.0 (2019-04-10)
------------------
* Use editdistance if available, otherwise fallback to pylev.
* Report progress by default.
* Number comparison options.
* Renamed get_multiples to filter_multiples.
0.2.1 (2019-04-10)
------------------
* Additional docs and tests.
0.2.0 (2019-04-09)
------------------
* Write multiples matches to a separate file.
* Added types and docstrings.
0.1.2 (2019-04-09)
------------------
* Duplicate release of 0.1.1
0.1.1 (2019-04-09)
------------------
* First release on PyPI.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
fuzzyjoin-0.5.2.tar.gz
(11.5 kB
view details)
Built Distribution
File details
Details for the file fuzzyjoin-0.5.2.tar.gz
.
File metadata
- Download URL: fuzzyjoin-0.5.2.tar.gz
- Upload date:
- Size: 11.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/39.0.1 requests-toolbelt/0.8.0 tqdm/4.31.1 CPython/3.6.6
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | f5d594f37d806ed553ed3431980c89bfcf2240fb5cec0f43a0fb76baf118ebc2 |
|
MD5 | d72c4a9f6d4597495abc54f24e273119 |
|
BLAKE2b-256 | 646a21430e07d12afba18367490d563a335aea3cdc4d984f75487b3ed3b4bef5 |
File details
Details for the file fuzzyjoin-0.5.2-py2.py3-none-any.whl
.
File metadata
- Download URL: fuzzyjoin-0.5.2-py2.py3-none-any.whl
- Upload date:
- Size: 11.5 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/39.0.1 requests-toolbelt/0.8.0 tqdm/4.31.1 CPython/3.6.6
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 847fe6fa1c735101b91034d318fb90a405292e7f3431522b3e3586b0adddf241 |
|
MD5 | 16702f549b8292d56ea69d3d73cf62ce |
|
BLAKE2b-256 | 8849acb8228b11ae8f3d627ad78b2a5235094d27b8a9bc78920d918194d08106 |