PyDI - Python Data Integration Framework

The PyDI framework provides methods for end-to-end data integration. It covers all steps of the integration process, including schema matching, data translation, entity matching, and data fusion, and offers both traditional string-based methods and modern LLM- and embedding-based techniques for these tasks. PyDI is designed as a set of independent, composable modules that operate on pandas DataFrames as the underlying data structure, ensuring interoperability with third-party packages that rely on pandas.

This page provides an overview of the PyDI framework. Further details about the functionality of the framework can be found in the Wiki. To learn how to use the framework, read the Tutorials or look at the Use Cases, which illustrate how PyDI is used for end-to-end data integration.

Installing PyDI

You can install PyDI via pip:

pip install uma-pydi
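
To confirm that the installation succeeded, the installed release can be read from the package metadata. The sketch below uses only the standard library and the distribution name `uma-pydi` from the command above; the helper name `installed_version` is an illustrative assumption and not part of PyDI.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="uma-pydi"):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the version string if the package is installed in this environment.
    print(installed_version() or "uma-pydi is not installed")
```

Looking up the distribution name avoids guessing the import name of the package, which this page does not state.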

Functionality

The PyDI framework covers all steps of the data integration process, including data loading, schema matching, data translation, entity matching, and data fusion. This section gives an overview of the functionality and the alternative methods that are provided for each of these steps.

Schema Matching: Schema matching identifies attributes in multiple schemata that have the same meaning. PyDI provides four schema matching methods, which compare attribute labels, compare data values, exploit an existing mapping of records to find attribute correspondences (duplicate-based schema matching), or use a large language model. PyDI's schema matching module offers:

Label-based schema matching
Instance-based schema matching
Duplicate-based schema matching
LLM-based schema matching
Evaluation of schema matching results
Debug reports about the matching process
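
To make the idea of label-based and instance-based matching concrete, here is a minimal sketch written with the standard library only. It is not PyDI's API: the function names, the normalization, and the threshold of 0.6 are assumptions chosen for illustration, and plain lists stand in for DataFrame columns.

```python
from difflib import SequenceMatcher


def _normalize(label):
    """Lower-case a label and treat underscores as spaces."""
    return label.lower().replace("_", " ").strip()


def label_similarity(a, b):
    """Label-based signal: string similarity of two attribute names."""
    return SequenceMatcher(None, _normalize(a), _normalize(b)).ratio()


def match_labels(source_cols, target_cols, threshold=0.6):
    """For each source column, keep its best target column if it is similar enough."""
    correspondences = []
    for source in source_cols:
        best = max(target_cols, key=lambda target: label_similarity(source, target))
        score = label_similarity(source, best)
        if score >= threshold:
            correspondences.append((source, best, round(score, 2)))
    return correspondences


def value_overlap(values_a, values_b):
    """Instance-based signal: Jaccard overlap of the value sets of two columns."""
    a, b = set(values_a), set(values_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

A stricter threshold trades recall for precision; with the default, `match_labels(["Title", "release_year"], ["title", "year", "director"])` keeps only the `Title` correspondence, while a lower threshold also pairs `release_year` with `year`.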

Data Translation: Data translation converts data from a source schema into a target schema. The translation process may include value normalization and information extraction. PyDI provides the following data translation methods:

Value normalization
- Data profiling with automatic type and pattern detection
- Unit of measurement conversion (length, weight, temperature, etc.)
- Scale modifier expansion (MEO, MEUR, million, billion)
- Country, currency, and language code normalization
- Number validation (phone, IBAN, VAT, ISBN)
- JSON Schema support for defining normalization specs
- Data quality validation (ranges, patterns, completeness, uniqueness)
Information extraction via
- Regex
- Python functions
- Large language models
Evaluation of information extraction results
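
The following sketch illustrates three of the operations listed above: unit conversion, scale modifier expansion, and regex-based information extraction. It is a simplified illustration using the standard library, not PyDI's implementation; the lookup tables, patterns, and function names are hypothetical and cover only a handful of cases.

```python
import re

# Hypothetical lookup tables; a real normalizer would cover many more entries.
LENGTH_TO_METRES = {"mm": 0.001, "cm": 0.01, "m": 1.0, "km": 1000.0}
SCALE_MODIFIERS = {"thousand": 1e3, "million": 1e6, "billion": 1e9}

# A number followed by a word, e.g. "12 km" or "3.5 million".
_QUANTITY = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")
# A four-digit year starting with 19 or 20.
_YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def _scale(text, factors):
    """Parse '<number> <word>' and multiply by the factor registered for the word."""
    match = _QUANTITY.match(text)
    if match is None:
        return None
    word = match.group(2).lower()
    if word not in factors:
        return None
    return float(match.group(1)) * factors[word]


def to_metres(text):
    """Unit conversion: '12 km' becomes 12000.0; unknown units give None."""
    return _scale(text, LENGTH_TO_METRES)


def expand_scale(text):
    """Scale modifier expansion: '3.5 million' becomes 3500000.0."""
    return _scale(text, SCALE_MODIFIERS)


def extract_year(text):
    """Regex-based extraction: return the first four-digit year, or None."""
    match = _YEAR.search(text)
    return int(match.group(0)) if match else None
```

Returning `None` for values that cannot be parsed keeps the decision about how to handle them (drop, flag, or keep the raw value) with the caller.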

Entity Matching: Entity matching methods identify records in different datasets that describe the same real-world entity. PyDI offers a range of entity matching methods, from simple attribute similarity-based rules and machine-learned rules to Pre-trained Language Models (PLMs) and Large Language Models (LLMs). Entity matching methods rely on blocking to reduce the number of record comparisons. PyDI provides the following blocking and entity matching methods:

Blocking Methods
- Key-based blocking
- Sorted-neighbourhood blocking
- Token-based blocking
- Embedding-based blocking
Entity Matching
- Rule-based entity matching (manual or machine learning-based)
- PLM-based entity matching
- LLM-based entity matching
- 5 correspondence filtering and clustering methods
Evaluation of entity matching and blocking results
Debug reports about the matching process
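
The interplay between blocking and matching can be sketched as follows: key-based blocking restricts comparisons to records that share a blocking key, and a manual similarity rule then decides which candidate pairs match. This is an illustrative sketch in plain Python, not PyDI's API; the record layout (`id`, `name`, `year`), the function names, and the threshold are assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def key_blocks(records, key_func):
    """Group records by their blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key_func(record)].append(record)
    return blocks


def candidate_pairs(left, right, key_func):
    """Yield only those (left, right) pairs whose blocking keys are equal."""
    right_blocks = key_blocks(right, key_func)
    for l_rec in left:
        for r_rec in right_blocks.get(key_func(l_rec), []):
            yield l_rec, r_rec


def is_match(a, b, threshold=0.8):
    """Manual rule: names must be similar and the years must be equal."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return name_sim >= threshold and a["year"] == b["year"]


def match_records(left, right, key_func):
    """Return the id pairs of all candidate pairs accepted by the rule."""
    return [(a["id"], b["id"]) for a, b in candidate_pairs(left, right, key_func) if is_match(a, b)]


left = [
    {"id": "a1", "name": "The Matrix", "year": 1999},
    {"id": "a2", "name": "Heat", "year": 1995},
]
right = [
    {"id": "b1", "name": "The Matrx", "year": 1999},
    {"id": "b2", "name": "Heat", "year": 1995},
    {"id": "b3", "name": "Toy Story", "year": 1995},
]
matches = match_records(left, right, key_func=lambda r: r["year"])
```

With the year as blocking key, only 3 of the 6 possible pairs are compared. The price is that true matches with differing keys are never seen, which is why the choice of blocking key matters.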

Data Fusion: Data fusion combines data from multiple sources into a single, consolidated dataset. Different sources may provide conflicting data values. PyDI allows you to resolve such data conflicts (decide which value to include in the final dataset) by applying different conflict resolution functions. PyDI's fusion module offers the following:

15 conflict resolution functions
- Strings: longest_string, shortest_string, most_complete
- Numbers: average, median, maximum, minimum, sum_values
- Dates: most_recent, earliest
- Lists: union, intersection, intersection_k_sources
- Metadata-based: voting, weighted_voting, favour_sources, prefer_higher_trust
Evaluation of data fusion results against ground truth
Provenance tracking for fused values
Debug reports about the fusion process
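
As an illustration of how conflict resolution functions are applied per attribute, the sketch below fuses a group of matched records into one record. The resolver names mirror three entries from the list above, but these are simplified re-implementations for illustration, not PyDI's own code; the `fuse` helper and the record layout are assumptions.

```python
from collections import Counter
from statistics import median


def longest_string(values):
    """Prefer the longest string, assuming it carries the most information."""
    return max(values, key=len)


def voting(values):
    """Prefer the value provided by the largest number of sources."""
    return Counter(values).most_common(1)[0][0]


def fuse(records, resolvers):
    """Resolve each attribute with its own conflict resolution function."""
    fused = {}
    for attribute, resolve in resolvers.items():
        # Missing values do not take part in conflict resolution.
        values = [r[attribute] for r in records if r.get(attribute) is not None]
        fused[attribute] = resolve(values) if values else None
    return fused


records = [
    {"title": "Heat", "runtime": 170, "genre": "Crime"},
    {"title": "Heat (1995)", "runtime": 171, "genre": "Crime"},
    {"title": None, "runtime": 172, "genre": "Drama"},
]
fused = fuse(records, {"title": longest_string, "runtime": median, "genre": voting})
```

Choosing the resolver per attribute reflects that different data types call for different strategies: a median is robust against numeric outliers, while voting suits categorical values.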

IO: PyDI provides methods for reading standard data formats into pandas DataFrames with provenance tracking:

Supported formats: CSV, JSON, XML, Excel, Parquet, Feather, HTML tables, fixed-width files
Automatic provenance metadata (source path, timestamps, checksums)
Optional unique identifier generation for downstream matching and fusion
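
The provenance metadata listed above (source path, timestamp, checksum) and the generated identifiers could be produced along the following lines. This is a standard-library sketch for CSV input only; the function name, the `_id` column, and the metadata keys are hypothetical, and PyDI itself returns pandas DataFrames rather than lists of dictionaries.

```python
import csv
import hashlib
import io
from datetime import datetime, timezone
from pathlib import Path


def load_csv_with_provenance(path, id_prefix="rec"):
    """Read a CSV file and return (rows, provenance metadata)."""
    path = Path(path)
    raw = path.read_bytes()
    provenance = {
        "source_path": str(path),
        "loaded_at": datetime.now(timezone.utc).isoformat(),
        # The checksum makes it possible to detect later changes to the source file.
        "sha256": hashlib.sha256(raw).hexdigest(),
    }
    reader = csv.DictReader(io.StringIO(raw.decode("utf-8"), newline=""))
    rows = list(reader)
    for index, row in enumerate(rows):
        # Generated identifier for use in downstream matching and fusion.
        row["_id"] = f"{id_prefix}-{index}"
    return rows, provenance
```

Hashing the same bytes that are parsed guarantees that the checksum describes exactly the content that was loaded.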

Tutorials

Tutorial	Description
Data Integration Tutorial	End-to-end pipeline: loading, blocking, matching, fusion
Value Normalization Tutorial	Profiling, specs, unit conversion, data cleaning
Schema Matching Tutorial	LLM-based schema matching with JSON Schema

Contact

For issues, feature requests, or contributions, please open a GitHub Issue or submit a Pull Request. For further information about PyDI, please email the maintainers of the framework.

Acknowledgements

PyDI is developed by the Web-based Systems Group at the University of Mannheim. The framework is used for projects and exercises in the course Web Data Integration at the University of Mannheim.

Release history

Version	Release date
0.2.1 (this version)	Feb 24, 2026
0.2.0	Dec 17, 2025
0.1.5	Nov 18, 2025
0.1.4	Nov 12, 2025
0.1.3	Oct 23, 2025
0.1.1	Oct 23, 2025
0.1.0	Sep 22, 2025

Download files

Source Distribution

uma_pydi-0.2.1.tar.gz (258.6 kB view details)

Uploaded Feb 24, 2026 Source

Built Distribution

uma_pydi-0.2.1-py3-none-any.whl (301.5 kB view details)

Uploaded Feb 24, 2026 Python 3

File details

Details for the file uma_pydi-0.2.1.tar.gz.

File metadata

Download URL: uma_pydi-0.2.1.tar.gz
Upload date: Feb 24, 2026
Size: 258.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for uma_pydi-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`a60c05fdd7b878792b6730793e882817654d4342cf09649204a9a685a457539f`
MD5	`956d3a0b7101293f213982a7f2c5c933`
BLAKE2b-256	`953c6a052533d3c02fda24edcccd8dae68ef8b09ef288ee16b7d3fa263a89291`

File details

Details for the file uma_pydi-0.2.1-py3-none-any.whl.

File metadata

Download URL: uma_pydi-0.2.1-py3-none-any.whl
Upload date: Feb 24, 2026
Size: 301.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for uma_pydi-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`69cb08a308e1568ae9419d5a0f035709f8d7d20eb6d40f38de81847bb35cfbbf`
MD5	`2404f3ca0fa291e5abd155227ad88d57`
BLAKE2b-256	`94a0902ffbb28e4eaf2df792d3842155581657e848eafe9f09ca97679b373da0`
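
A downloaded file can be checked against the SHA256 digests listed above before installing it. The sketch below uses only `hashlib` from the standard library; the helper names are illustrative assumptions, and the file is expected to be in the current directory under the name shown in the file metadata.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """Return True if the file's SHA256 digest equals the published one."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example call with the wheel digest from the table above:
# verify_download(
#     "uma_pydi-0.2.1-py3-none-any.whl",
#     "69cb08a308e1568ae9419d5a0f035709f8d7d20eb6d40f38de81847bb35cfbbf",
# )
```

Reading in chunks keeps memory use constant regardless of file size.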
