PyDI - Python Data Integration Framework

The PyDI framework provides methods for end-to-end data integration. It covers all steps of the integration process, including schema matching, data translation, entity matching, and data fusion, and offers both traditional string-based methods and modern LLM- and embedding-based techniques for these tasks. PyDI is designed as a set of independent, composable modules that operate on pandas DataFrames as the underlying data structure, ensuring interoperability with third-party packages that rely on pandas.

This page provides an overview of the functionality of the PyDI framework. To familiarize yourself with the framework, you can also read the PyDI Tutorial or look at the code examples in our Wiki.

Installing PyDI

You can install PyDI via pip:

pip install uma-pydi

Functionality

The PyDI framework covers all steps of the data integration process, including data loading, schema matching, data translation, entity matching, and data fusion. This section gives an overview of the functionality and the alternative algorithms that are provided for each of these steps.

Schema Matching: Schema matching identifies attributes in multiple schemata that have the same meaning. PyDI provides schema matching methods that rely on attribute labels, on data values, or on an existing mapping of records (duplicate-based schema matching) to find attribute correspondences, as well as an LLM-based method. PyDI's schema matching module offers:

  • Label-based schema matching
  • Instance-based schema matching
  • Duplicate-based schema matching
  • LLM-based schema matching
  • Evaluation of schema matching results
  • Debug reports about the matching process
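To make the idea of label-based schema matching concrete, the following is a minimal sketch that uses only pandas and the Python standard library. It does not use PyDI's API: the function names `label_similarity` and `match_labels`, the threshold of 0.6, and the sample tables are assumptions made for illustration.

```python
from difflib import SequenceMatcher

import pandas as pd


def label_similarity(a: str, b: str) -> float:
    """Similarity of two attribute labels in [0, 1], ignoring case and underscores."""
    norm_a = a.lower().replace("_", " ").strip()
    norm_b = b.lower().replace("_", " ").strip()
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def match_labels(source: pd.DataFrame, target: pd.DataFrame,
                 threshold: float = 0.6) -> pd.DataFrame:
    """For each source column, keep the best-scoring target column above the threshold."""
    rows = []
    for s_col in source.columns:
        best_col, best_score = None, 0.0
        for t_col in target.columns:
            score = label_similarity(s_col, t_col)
            if score > best_score:
                best_col, best_score = t_col, score
        if best_col is not None and best_score >= threshold:
            rows.append({"source": s_col, "target": best_col, "score": best_score})
    return pd.DataFrame(rows, columns=["source", "target", "score"])


# Hypothetical sample tables with differently labelled attributes.
movies_a = pd.DataFrame({"title": ["Heat"], "release_year": [1995],
                         "director": ["Michael Mann"]})
movies_b = pd.DataFrame({"Title": ["Heat"], "Release Year": [1995],
                         "Directed_By": ["Michael Mann"]})

correspondences = match_labels(movies_a, movies_b)
print(correspondences)
```

Instance-based matching follows the same pattern, but compares the values in each column (for example by set overlap) instead of the labels.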

Data Translation: Data translation converts data from a source schema into a target schema. The translation process may include value normalization and information extraction. PyDI provides the following methods for value normalization and information extraction:

  • Value normalization
    • Data type detection
    • Value & header normalization
    • Unit of measurement conversion
    • Data validation
  • Information extraction via
    • Regex
    • Python functions
    • Large language models
  • Evaluation of information extraction results
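As an illustration of regex-based information extraction combined with unit conversion and a simple validation flag, here is a sketch in plain pandas. It is not PyDI's API: the column names, the `TO_KG` conversion table, and the sample data are assumptions.

```python
import pandas as pd

# Hypothetical raw data with free-text weights in mixed units.
raw = pd.DataFrame({
    "product": ["Laptop A", "Laptop B", "Laptop C"],
    "weight": ["1.2 kg", "980 g", "unknown"],
})

# Information extraction via regex: pull out the number and the unit.
parts = raw["weight"].str.extract(r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>kg|g)\b")
parts["value"] = pd.to_numeric(parts["value"])

# Unit conversion: normalize every weight to kilograms.
TO_KG = {"kg": 1.0, "g": 0.001}
raw["weight_kg"] = parts["value"] * parts["unit"].map(TO_KG)

# Data validation: flag rows where no weight could be extracted.
raw["weight_valid"] = raw["weight_kg"].notna()

print(raw)
```

Rows that the pattern cannot parse keep a missing value instead of a guessed one, so they can be reviewed or handed to another extraction method.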

Entity Matching: Entity matching methods identify records in different datasets that describe the same real-world entity. PyDI offers a range of entity matching methods, ranging from simple attribute similarity-based rules and machine-learned rules to Pre-trained Language Models (PLMs) and Large Language Models (LLMs). Entity matching methods rely on blocking to reduce the number of record comparisons. PyDI provides the following blocking and entity matching methods:

  • Blocking Methods
    • Key-based blocking
    • Sorted-neighbourhood blocking
    • Token-based blocking
    • Embedding-based blocking
  • Entity Matching
    • Rule-based entity matching (manual or machine learning-based)
    • PLM-based entity matching
    • LLM-based entity matching
    • 6 post-clustering methods
  • Evaluation of entity matching and blocking results
  • Debug reports about the matching process
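The interplay of blocking and matching can be sketched in a few lines of pandas. The example below uses key-based blocking on a shared attribute followed by a simple manual similarity rule. It is an illustration only, not PyDI's API: the blocking key (`city`), the similarity function, the threshold of 0.7, and the sample records are assumptions.

```python
from difflib import SequenceMatcher

import pandas as pd

left = pd.DataFrame({
    "id": ["a1", "a2", "a3"],
    "name": ["Acme Corp", "Globex Inc", "Initech"],
    "city": ["Berlin", "Paris", "Austin"],
})
right = pd.DataFrame({
    "id": ["b1", "b2", "b3"],
    "name": ["ACME Corporation", "Globex", "Umbrella"],
    "city": ["Berlin", "Paris", "Berlin"],
})

# Key-based blocking: only records that share the blocking key are compared.
candidates = left.merge(right, on="city", suffixes=("_left", "_right"))


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Rule-based matching: a pair is a match if the name similarity reaches the threshold.
candidates["sim"] = [
    name_similarity(a, b)
    for a, b in zip(candidates["name_left"], candidates["name_right"])
]
matches = candidates[candidates["sim"] >= 0.7][["id_left", "id_right", "sim"]]

print(f"{len(candidates)} candidate pairs instead of {len(left) * len(right)}")
print(matches)
```

Blocking trades recall for speed: a true match whose records disagree on the blocking key is never compared, which is why evaluating blocking separately from matching is useful.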

Data Fusion: Data fusion combines data from multiple sources into a single, consolidated dataset. Different sources may provide conflicting data values. PyDI allows you to resolve such data conflicts (decide which value to include in the final dataset) by applying different conflict resolution functions. PyDI's fusion module offers the following:

  • 13 value-based conflict resolution functions for strings, numbers, and sets
  • 4 metadata-based conflict resolution functions
  • Evaluation of data fusion results against ground truth
  • Debug reports about the fusion process
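The following sketch shows what conflict resolution functions can look like when applied per attribute to groups of matched records. It is written in plain pandas and does not reflect PyDI's API: the function names, the `TRUST` scores, and the sample records are assumptions. Three functions are value-based (longest string, median, union of sets), and one is metadata-based (prefer the value from the most trusted source).

```python
import pandas as pd

# Hypothetical matched records: rows with the same entity_id describe the same entity.
records = pd.DataFrame({
    "entity_id": ["e1", "e1", "e1", "e2", "e2"],
    "source": ["source_a", "source_b", "source_c", "source_a", "source_b"],
    "title": ["Heat", "Heat (1995)", "Heat", "Alien", "Alien"],
    "runtime": [170, 172, 171, 117, 116],
    "genres": [{"crime"}, {"crime", "drama"}, {"thriller"},
               {"horror"}, {"sci-fi", "horror"}],
})

# Assumed trust scores per source, used by the metadata-based function.
TRUST = {"source_a": 0.5, "source_b": 0.9, "source_c": 0.7}


def longest_string(values: pd.Series) -> str:
    """Value-based: keep the longest string."""
    return max(values, key=len)


def median_value(values: pd.Series) -> float:
    """Value-based: take the median of the numeric values."""
    return float(values.median())


def union_of_sets(values: pd.Series) -> set:
    """Value-based: merge all sets into one."""
    return set().union(*values)


def most_trusted(group: pd.DataFrame, column: str):
    """Metadata-based: take the value from the source with the highest trust score."""
    best_row = group["source"].map(TRUST).idxmax()
    return group.loc[best_row, column]


rows = []
for entity_id, group in records.groupby("entity_id"):
    rows.append({
        "entity_id": entity_id,
        "title": longest_string(group["title"]),
        "runtime_median": median_value(group["runtime"]),
        "runtime_trusted": int(most_trusted(group, "runtime")),
        "genres": union_of_sets(group["genres"]),
    })
fused = pd.DataFrame(rows)

print(fused)
```

Which function fits depends on the attribute: a median is robust against outliers in numeric values, while a trust-based choice is useful when one source is known to be more reliable than the others.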

IO: PyDI provides methods for reading standard data formats such as JSON, XML, and CSV into pandas DataFrames. All read methods can optionally add unique identifiers and provenance metadata to the DataFrames.
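As a rough sketch of what adding identifiers and provenance metadata at load time means, the example below wraps `pandas.read_csv`. The helper `load_csv_with_provenance` and the column names `_id` and `_source` are hypothetical and are not PyDI's actual read methods.

```python
import io

import pandas as pd


def load_csv_with_provenance(buffer, dataset_name: str) -> pd.DataFrame:
    """Read a CSV and add a unique record identifier and a provenance column."""
    df = pd.read_csv(buffer)
    # Unique identifier: dataset name plus row position.
    df.insert(0, "_id", [f"{dataset_name}-{i}" for i in range(len(df))])
    # Provenance: remember which dataset each record came from.
    df["_source"] = dataset_name
    return df


# An in-memory CSV stands in for a file on disk.
csv_text = "name,city\nAcme Corp,Berlin\nGlobex Inc,Paris\n"
companies = load_csv_with_provenance(io.StringIO(csv_text), "companies")

print(companies)
```

Carrying identifiers and source names through the pipeline makes it possible to trace each matched or fused value back to the record it came from.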

Contact

If you have questions or need help, please first consult the PyDI Tutorial, the Wiki, and the project documentation. For issues, feature requests, or contributions, please open a GitHub Issue or submit a Pull Request. For further information, please email the maintainers of the framework.

Acknowledgements

PyDI is developed by the Web-based Systems Group at the University of Mannheim.
