Identify and merge duplicates in bibliographic records
bib-dedupe
Overview
Bib-Dedupe is an open-source Python library for deduplication of bibliographic records, tailored for literature reviews. Unlike traditional deduplication methods, Bib-Dedupe focuses on entity resolution, linking duplicate records instead of simply deleting them. This approach enables validation, undo operations, and a more nuanced understanding of record relationships.
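To make the difference between linking and deleting concrete, here is a minimal sketch in plain Python. It is not Bib-Dedupe's implementation: the `DuplicateLinks` class and its methods are hypothetical names chosen for illustration. Every record is kept, duplicate decisions are stored as separate links, and removing a link restores the records as distinct entities.

```python
from collections import defaultdict


class DuplicateLinks:
    """Hypothetical helper: keeps all records and stores duplicate links separately."""

    def __init__(self, record_ids):
        self.record_ids = list(record_ids)
        self.links = []  # ordered list of (id_a, id_b) duplicate decisions

    def link(self, id_a, id_b):
        """Record that two IDs refer to the same publication."""
        self.links.append((id_a, id_b))

    def undo_last(self):
        """Drop the most recent link; no record was deleted, so nothing is lost."""
        return self.links.pop()

    def entities(self):
        """Group record IDs into entities using union-find over the current links."""
        parent = {rid: rid for rid in self.record_ids}

        def find(rid):
            while parent[rid] != rid:
                parent[rid] = parent[parent[rid]]  # path compression
                rid = parent[rid]
            return rid

        for id_a, id_b in self.links:
            parent[find(id_b)] = find(id_a)

        groups = defaultdict(list)
        for rid in self.record_ids:
            groups[find(rid)].append(rid)
        return sorted(sorted(group) for group in groups.values())


links = DuplicateLinks(["r1", "r2", "r3", "r4"])
links.link("r1", "r2")
links.link("r2", "r3")
linked = links.entities()      # [['r1', 'r2', 'r3'], ['r4']]
links.undo_last()
after_undo = links.entities()  # [['r1', 'r2'], ['r3'], ['r4']]
```

Because the links live apart from the records, each decision can be reviewed or reverted without re-importing any data.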
Features
- Automated Duplicate Linking with Zero False Positives: Bib-Dedupe automates the duplicate linking process with a focus on eliminating false positives.
- Preprocessing Approach: Bib-Dedupe uses a preprocessing approach that reflects the unique error generation process in academic databases, such as author re-formatting, journal abbreviations, or translations.
- Entity Resolution: Bib-Dedupe does not simply delete duplicates; it links them to resolve the entity and integrates the data. This allows for validation and undo operations.
- Programmatic Access: Bib-Dedupe is designed for seamless integration into existing research workflows, providing programmatic access for easy incorporation into scripts and applications.
- Transparent and Reproducible Rules: Bib-Dedupe's blocking and matching rules are transparent, so deduplication results can be inspected and reproduced.
- Continuous Benchmarking: Continuous integration tests running on GitHub Actions ensure ongoing benchmarking, maintaining the library's reliability and performance across datasets.
- Efficient and Parallel Computation: Bib-Dedupe implements computations efficiently and in parallel, using appropriate data structures and functions for optimal performance.
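The preprocessing idea from the feature list can be sketched as follows. This is a simplified illustration using only the standard library, not the rules Bib-Dedupe applies: `normalize_text` and `normalize_author` are hypothetical helpers, and reducing authors to "last name plus first initial" is an assumption made for this example.

```python
import re
import unicodedata


def normalize_text(value: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", ascii_only.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def normalize_author(name: str) -> str:
    """Reduce 'Last, First' or 'First Last' to 'last f' (illustrative rule only)."""
    if not name.strip():
        return ""
    if "," in name:
        last, _, given = name.partition(",")
    else:
        parts = name.split()
        last, given = parts[-1], " ".join(parts[:-1])
    initial = normalize_text(given)[:1]
    return f"{normalize_text(last)} {initial}".strip()


# Three formattings of the same (made-up) author reduce to one key
variants = ["Müller, Anna", "Anna Müller", "A. Müller"]
keys = {normalize_author(variant) for variant in variants}
```

After normalization, differently formatted versions of the same name compare as equal, which is what lets later steps treat database-specific re-formatting as noise rather than as a real difference.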
Installation
To install Bib-Dedupe, use the following pip command:
pip install bib-dedupe
Getting Started
import pandas as pd
from bib_dedupe.bib_dedupe import merge
# Load your bibliographic dataset into a pandas DataFrame
records_df = pd.read_csv("records.csv")
# Get the merged_df
merged_df = merge(records_df)
For more detailed usage instructions and customization options, refer to the documentation.
For advanced use cases, it is also possible to run and customize each step individually:
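from bib_dedupe.bib_dedupe import prep, block, match, merge, export_maybe, import_maybe
# Preprocess records (assumption: prep is imported above but was never called in the original snippet)
records_df = prep(records_df)
# Block records
blocked_df = block(records_df)
# Identify matches
matched_df = match(blocked_df)
# Check maybe cases
# Note: `matches` is not defined earlier in this snippet; see the documentation for the expected arguments
export_maybe(matched_df, records_df, matches)
matches = import_maybe(matches)
# Merge
merged_df = merge(records_df, matches=matches)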
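To illustrate what the block, match, and "maybe" steps do conceptually, the sketch below reimplements them in a deliberately simplified form with the standard library. It does not use or reproduce Bib-Dedupe's rules: blocking by year, title similarity via `difflib`, the thresholds, and the sample records are all assumptions made for this example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Made-up records for illustration
records = [
    {"ID": "1", "title": "Deduplication of bibliographic records", "year": "2021"},
    {"ID": "2", "title": "Deduplication of bibliographic records.", "year": "2021"},
    {"ID": "3", "title": "A survey of entity resolution", "year": "2021"},
    {"ID": "4", "title": "Deduplication of bibliographic records", "year": "2015"},
]


def block_by_year(rows):
    """Blocking: only records sharing a year become candidate pairs."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row["year"]].append(row)
    for members in blocks.values():
        yield from combinations(members, 2)


def classify(rec_a, rec_b, duplicate_at=0.95, maybe_at=0.80):
    """Matching: label a candidate pair by title similarity (illustrative thresholds)."""
    score = SequenceMatcher(None, rec_a["title"].lower(), rec_b["title"].lower()).ratio()
    if score >= duplicate_at:
        return "duplicate"
    if score >= maybe_at:
        return "maybe"  # left for manual review
    return "no_duplicate"


labels = {(a["ID"], b["ID"]): classify(a, b) for a, b in block_by_year(records)}
```

Blocking keeps the number of comparisons small (record 4 is never compared because no other record shares its year), while the "maybe" label separates borderline pairs for manual checking instead of forcing an automatic decision.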
Fields used by Bib-Dedupe
Name | Definition |
---|---|
ID | A unique ID |
author | The author(s) of the publication |
title | The title of the publication |
year | The year of publication |
journal | The name of the journal in which the publication appeared |
volume | The volume number of the publication |
number | The issue number of the publication |
pages | The page numbers of the publication |
doi | The Digital Object Identifier (DOI) |
abstract | The abstract |
search_set | Distinct sets of papers (e.g., old_search), can be empty. |
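
As a minimal sketch of how an input file with these fields could be prepared, the example below builds CSV text with the column names from the table above. The record values are made up, `records_to_csv` is a hypothetical helper, and whether Bib-Dedupe requires every column to be present is not stated here; missing fields are simply written as empty strings.

```python
import csv
import io

# Column names taken from the field table
FIELDS = ["ID", "author", "title", "year", "journal", "volume",
          "number", "pages", "doi", "abstract", "search_set"]


def records_to_csv(rows):
    """Serialize dicts to CSV text, filling missing fields with empty strings."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up record for illustration only
example_rows = [
    {
        "ID": "0001",
        "author": "Doe, Jane",
        "title": "An example title",
        "year": "2020",
        "journal": "Journal of Examples",
        "doi": "10.1234/example",
    }
]

csv_text = records_to_csv(example_rows)
```

The resulting text can be saved as `records.csv` and loaded with `pd.read_csv` as in the Getting Started example.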
Continuous evaluation
Bib-Dedupe is continuously evaluated against other Python libraries for duplicate removal in bibliographic datasets (currently the asreview datatools). Complementary data from Hair et al. (2021) is added to the overview. The evaluation notebooks are available, and the datasets can be found in the data section. A summary of the evaluation is available in README.md, aggregated summaries are exported to current_results.md, and detailed results are exported to a CSV file.
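For readers who want to run a comparable check on their own data, the sketch below computes standard pair-level metrics from predicted and known duplicate pairs. The metrics reported by the actual evaluation are not listed here, so precision, recall, and F1 are assumptions, and `evaluate` is a hypothetical helper rather than part of Bib-Dedupe.

```python
def normalize_pairs(pairs):
    """Treat (a, b) and (b, a) as the same duplicate pair."""
    return {tuple(sorted(pair)) for pair in pairs}


def evaluate(predicted_pairs, true_pairs):
    """Compare predicted duplicate pairs against a labeled ground truth."""
    predicted = normalize_pairs(predicted_pairs)
    actual = normalize_pairs(true_pairs)
    tp = len(predicted & actual)  # correctly linked duplicates
    fp = len(predicted - actual)  # records linked by mistake
    fn = len(actual - predicted)  # duplicates that were missed
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"TP": tp, "FP": fp, "FN": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Made-up pairs for illustration
result = evaluate(
    predicted_pairs=[("1", "2"), ("3", "4")],
    true_pairs=[("2", "1"), ("5", "6")],
)
```

False positives are the costly error in a literature review, since a wrongly merged record removes a distinct paper from screening, which is why precision is worth tracking alongside recall.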
Documentation
Explore the official documentation for comprehensive information on installation, usage, and customization of Bib-Dedupe.
Citation
If you use Bib-Dedupe in your research, please cite it as follows:
TODO
Contribution Guidelines
We welcome contributions from the community to enhance and expand Bib-Dedupe. If you would like to contribute, please follow our contribution guidelines.
License
Bib-Dedupe is released under the MIT License, allowing free and open use and modification.
Contact
For any questions, issues, or feedback, please open an issue on our GitHub repository.
Happy deduplicating with Bib-Dedupe!