Mercury's DataSchema package allows the automatic recognition and validation of feature types.

License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

mercury-dataschema

mercury-dataschema is a submodule of the Mercury library. Given a pandas DataFrame, its DataSchema class automatically infers the feature types and calculates statistics that depend on each inferred type.

This type inference is based not only on data types but also on the information the variables contain. For example, if a feature is encoded as a float but its cardinality is 2, we can be sure it is a binary feature.
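The idea can be illustrated with a minimal sketch in plain Python. It is not the library's inference logic: the function name `infer_kind`, the labels it returns, and the `max_categories` threshold are assumptions made purely for illustration.

```python
from numbers import Number


def infer_kind(values, max_categories=10):
    """Guess a feature kind from the values themselves, not from their dtype.

    Hypothetical helper: the labels and the max_categories threshold are
    illustrative assumptions, not mercury-dataschema's actual rules.
    """
    present = [v for v in values if v is not None]  # None stands in for missing data
    cardinality = len(set(present))
    if cardinality == 2:
        return "binary"       # two distinct values, whatever the dtype
    if cardinality <= max_categories:
        return "categorical"  # only a few distinct values
    if all(isinstance(v, Number) for v in present):
        return "numeric"      # many distinct numeric values
    return "other"


# A float-encoded column with only two distinct values is reported as binary.
print(infer_kind([0.0, 1.0, 1.0, 0.0, 1.0]))
print(infer_kind([float(i) for i in range(100)]))
```

The point of the sketch is only that the decision looks at the column's contents (here, its cardinality) before its storage type.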

This package is used by other Mercury submodules, but you can also use it separately from the rest of the library.

One idea among many: it is particularly useful when preprocessing datasets, since you no longer have to specify the typical categorical_cols and continuous_cols by hand.

Mercury project at BBVA

Mercury is a collaborative library that was developed by the Advanced Analytics community at BBVA. Originally, it was created as an InnerSource project but after some time, we decided to release certain parts of the project as Open Source. That's the case with the mercury-dataschema package.

If you're interested in learning more about the Mercury project, we recommend reading this blog post from www.bbvaaifactory.com

User installation

The easiest way to install mercury-dataschema is using pip:

pip install -U mercury-dataschema

Example

from mercury.dataschema.schemagen import DataSchema
from mercury.dataschema.feature import FeatType

# UCIDataset is a placeholder loader that is not defined in this snippet; any pandas DataFrame works.
dataset = UCIDataset().load()

schma = (DataSchema()         # Create a lazy schema object
    .generate(dataset)        # Manually trigger its construction (mostly data type inference)
    .calculate_statistics())  # Manually trigger the extra statistics calculation for each feature

Then, we can inspect all the features with:

schma.feats

{'ID': Discrete Feature (NAME=None, dtype=DataType.INTEGER),
 'LIMIT_BAL': Discrete Feature (NAME=None, dtype=DataType.FLOAT),
 'SEX': Binary Feature (NAME=None, dtype=DataType.INTEGER),
 'EDUCATION': Categorical Feature (NAME=None, dtype=DataType.INTEGER),
 'MARRIAGE': Categorical Feature (NAME=None, dtype=DataType.INTEGER),
 'AGE': Discrete Feature (NAME=None, dtype=DataType.INTEGER),
 'PAY_0': Categorical Feature (NAME=None, dtype=DataType.INTEGER),
 'PAY_2': Categorical Feature (NAME=None, dtype=DataType.INTEGER),
 'PAY_3': Categorical Feature (NAME=None, dtype=DataType.INTEGER),
 'PAY_4': Categorical Feature (NAME=None, dtype=DataType.INTEGER),
 'PAY_5': Categorical Feature (NAME=None, dtype=DataType.INTEGER),
 'PAY_6': Categorical Feature (NAME=None, dtype=DataType.INTEGER),
 'BILL_AMT1': Discrete Feature (NAME=None, dtype=DataType.FLOAT),
 'BILL_AMT2': Discrete Feature (NAME=None, dtype=DataType.FLOAT),
 'BILL_AMT3': Discrete Feature (NAME=None, dtype=DataType.FLOAT),
 'BILL_AMT4': Discrete Feature (NAME=None, dtype=DataType.FLOAT),
 'BILL_AMT5': Discrete Feature (NAME=None, dtype=DataType.FLOAT),
 'BILL_AMT6': Discrete Feature (NAME=None, dtype=DataType.FLOAT),
 'PAY_AMT1': Discrete Feature (NAME=None, dtype=DataType.FLOAT),
 'PAY_AMT2': Discrete Feature (NAME=None, dtype=DataType.FLOAT),
 'PAY_AMT3': Discrete Feature (NAME=None, dtype=DataType.FLOAT),
 'PAY_AMT4': Discrete Feature (NAME=None, dtype=DataType.FLOAT),
 'PAY_AMT5': Discrete Feature (NAME=None, dtype=DataType.FLOAT),
 'PAY_AMT6': Discrete Feature (NAME=None, dtype=DataType.FLOAT),
 'default.payment.next.month': Binary Feature (NAME=None, dtype=DataType.INTEGER)}
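A mapping like the one above can stand in for hand-written column lists. The sketch below uses plain strings as stand-ins for a few of the inferred types printed above; the `inferred` dictionary and the `group_by_kind` helper are hypothetical and do not use the library's Feature objects.

```python
from collections import defaultdict

# Hypothetical plain-text labels copied from a few entries of the output above.
inferred = {
    "SEX": "binary",
    "EDUCATION": "categorical",
    "MARRIAGE": "categorical",
    "AGE": "discrete",
    "BILL_AMT1": "discrete",
}


def group_by_kind(labels):
    """Return {kind: [column names]}, preserving the input order."""
    groups = defaultdict(list)
    for column, kind in labels.items():
        groups[kind].append(column)
    return dict(groups)


columns = group_by_kind(inferred)
# Treat categorical and binary columns alike for encoding purposes.
categorical_cols = columns.get("categorical", []) + columns.get("binary", [])
print(categorical_cols)  # ['EDUCATION', 'MARRIAGE', 'SEX']
```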

And we can get extra feature statistics by inspecting the .stats attribute of the Feature objects.

schma.feats['BILL_AMT4'].stats

{'num_nan': 0,
 'percent_nan': 0.0,
 'samples': 30000,
 'percent_unique': 0.7182666666666667,
 'cardinality': 21548,
 'min': -170000.0,
 'max': 891586.0,
 'distribution': [3.3333333333333335e-05,
  0.0,
  3.3333333333333335e-05,
  0.0,
  0.0,
  3.3333333333333335e-05,
  0.0,
  3.3333333333333335e-05,
  3.3333333333333335e-05,
  0.0,
  3.3333333333333335e-05,
  6.666666666666667e-05,
  6.666666666666667e-05,
  0.00016666666666666666,
  ...,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  3.3333333333333335e-05],
 'distribution_bins': [-170000.0,
  -163898.93103448275,
  -157797.8620689655,
  -151696.7931034483,
  ...,
  867181.724137931,
  873282.7931034482,
  879383.8620689653,
  885484.9310344828,
  891586.0]}

Note how the computed statistics vary with the feature type:

schma.feats['default.payment.next.month'].stats

{'num_nan': 0,
 'percent_nan': 0.0,
 'samples': 30000,
 'percent_unique': 6.666666666666667e-05,
 'cardinality': 2,
 'distribution': [0.7788, 0.2212],
 'distribution_bins': [0, 1],
 'domain': [1, 0]}
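In the outputs above, percent_unique equals cardinality divided by samples (21548 / 30000 and 2 / 30000), and the binary feature's distribution lists the relative frequency of each value in distribution_bins. The sketch below recomputes those fields for a small list. It is an illustration of these relationships, not the library's code: `basic_stats` is a hypothetical helper, None stands in for NaN, and counting missing values in samples is an assumption.

```python
from collections import Counter


def basic_stats(values):
    """Recompute a few of the printed fields for a non-empty list of values.

    Illustrative only: None stands in for NaN, and including missing
    values in 'samples' is an assumption, not confirmed library behavior.
    """
    samples = len(values)
    present = [v for v in values if v is not None]
    counts = Counter(present)
    bins = sorted(counts)  # distinct values in ascending order
    return {
        "num_nan": samples - len(present),
        "samples": samples,
        "cardinality": len(counts),
        "percent_unique": len(counts) / samples,
        "distribution": [counts[b] / len(present) for b in bins],
        "distribution_bins": bins,
    }


stats = basic_stats([0, 0, 0, 1])
print(stats["percent_unique"])  # 0.5
print(stats["distribution"])    # [0.75, 0.25]
```

For high-cardinality features such as BILL_AMT4, the printed distribution_bins are histogram edges rather than distinct values, which this sketch does not attempt to reproduce.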

Example notebooks

from mercury.dataschema import create_tutorials

create_tutorials('.')  # Creates a folder with example notebooks in the current path.

Saving and loading schemas

You can serialize and reload DataSchemas so you can reuse them in the future.

PATH = 'schma.json'
# Save the schema
schma.save(PATH)

# Load it back!
recovered = DataSchema.load(PATH)

Help and support

This library is currently maintained by a dedicated team of data scientists and machine learning engineers from BBVA.

Documentation

website: https://bbva.github.io/mercury-dataschema/site/

Email

mercury.group@bbva.com

Release history

- 1.2.0: Mar 11, 2026
- 1.1.2: Feb 18, 2025 (this version)
- 0.0.1: Mar 29, 2023

Download files

Source Distribution

mercury_dataschema-1.1.2.tar.gz (22.7 kB)

Uploaded Feb 18, 2025 Source

Built Distribution

mercury_dataschema-1.1.2-py3-none-any.whl (23.0 kB)

Uploaded Feb 18, 2025 Python 3

File details

Details for the file mercury_dataschema-1.1.2.tar.gz.

File metadata

Download URL: mercury_dataschema-1.1.2.tar.gz
Upload date: Feb 18, 2025
Size: 22.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for mercury_dataschema-1.1.2.tar.gz
Algorithm	Hash digest
SHA256	`a8b77de912280be806dd83e1ea36c2845a050decc3435871452f75ce5778221c`
MD5	`2493723f011476beb635950978d76745`
BLAKE2b-256	`c2040c1f21c4a54e4a11b4998a69311b95ebb5a2d96aa31d9e241db1bbb2e26f`

File details

Details for the file mercury_dataschema-1.1.2-py3-none-any.whl.

File metadata

Download URL: mercury_dataschema-1.1.2-py3-none-any.whl
Upload date: Feb 18, 2025
Size: 23.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for mercury_dataschema-1.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2fe980a0fb165c2a8c51419c99a4424bea448ef450edfe3571bfa44d7eda1687`
MD5	`f033497a43d4b81abe653546922d31ec`
BLAKE2b-256	`308d83176b654f0480c4072c0df8858be695856ed57e25bd166ad383a4b4a308`
