
HuggingFace community-driven open-source library for dataset disaggregation



Hugging Face Disaggregators


⚠️ Please note: This library is in early development, and the included disaggregation modules are proofs of concept that are not production-ready. Additionally, all APIs are subject to breaking changes at any time before a 1.0.0 release. Rigorously tested versions of the included modules will be released in the future, so stay tuned. We'd love your feedback in the meantime!

The disaggregators library allows you to easily add new features to your datasets to enable disaggregated data exploration and disaggregated model evaluation. disaggregators is preloaded with disaggregation modules for text data, with image modules coming soon!

This library is intended to be used with 🤗 Datasets, but should work with any other "mappable" interface to a dataset.
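As a rough illustration of what "mappable" means here, the sketch below applies a callable that takes one row (a dict) and returns a dict of new columns to a plain list of rows. Both `has_question_mark` and `map_rows` are hypothetical stand-ins written for this example and are not part of disaggregators; they only mirror the row-in, columns-out shape that a Disaggregator is used with in the Usage section.

```python
# Hypothetical stand-in for a disaggregator: takes one row, returns new columns.
def has_question_mark(row):
    return {"question_mark": "?" in row["text"]}


def map_rows(rows, fn):
    """Mimic a dataset `map`: merge the columns returned by `fn` into each row."""
    return [{**row, **fn(row)} for row in rows]


rows = [{"text": "Is this a good movie?"}, {"text": "I liked it."}]
augmented = map_rows(rows, has_question_mark)

for row in augmented:
    print(row)
```

Any container that can call a function once per row and keep the returned columns should be able to play the role of `map_rows`.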

Requirements and Installation

disaggregators has been tested on Python 3.8, 3.9, and 3.10.

pip install disaggregators will fetch the latest release from PyPI.

Note that some disaggregation modules require extra dependencies, such as spaCy models, which may need to be installed manually. If these dependencies aren't installed, disaggregators will tell you how to install them.

To install directly from this GitHub repo, use the following command:

pip install git+https://github.com/huggingface/disaggregators.git

Usage

You will likely want to use 🤗 Datasets with disaggregators.

pip install datasets

The snippet below loads the IMDB dataset from the Hugging Face Hub, and initializes a disaggregator for "pronoun" that will run on the IMDB dataset's "text" column. If you would like to run multiple disaggregations, you can pass a list to the Disaggregator constructor (e.g. Disaggregator(["pronoun", "sentiment"], column="text")). We then use the 🤗 Datasets map method to apply the disaggregation to the dataset.

from disaggregators import Disaggregator
from datasets import load_dataset

dataset = load_dataset("imdb", split="train")
disaggregator = Disaggregator("pronoun", column="text")

ds = dataset.map(disaggregator)  # New boolean columns are added for she/her, he/him, and they/them

The resulting dataset can now be used for data exploration and disaggregated model evaluation.
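To make "disaggregated model evaluation" concrete, here is a minimal sketch that computes accuracy separately for each subgroup flagged by a boolean column. It uses plain Python on toy rows so that it runs without the library. The column names `she_her` and `he_him`, the `label` and `prediction` values, and the helper `accuracy_by_group` are all assumptions made for illustration; the actual column names produced by the "pronoun" module may differ.

```python
from collections import defaultdict

# Toy rows: `label` and `prediction` are made-up values; the boolean columns
# stand in for the ones a disaggregator would add (names are hypothetical).
rows = [
    {"label": 1, "prediction": 1, "she_her": True, "he_him": False},
    {"label": 0, "prediction": 1, "she_her": True, "he_him": False},
    {"label": 1, "prediction": 1, "she_her": False, "he_him": True},
    {"label": 0, "prediction": 0, "she_her": False, "he_him": True},
    {"label": 1, "prediction": 0, "she_her": True, "he_him": True},
]


def accuracy_by_group(rows, group_columns):
    """Return accuracy over the rows where each boolean group column is True."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        for col in group_columns:
            if row[col]:
                total[col] += 1
                correct[col] += int(row["label"] == row["prediction"])
    # Skip groups with no rows to avoid dividing by zero.
    return {col: correct[col] / total[col] for col in group_columns if total[col]}


scores = accuracy_by_group(rows, ["she_her", "he_him"])
print(scores)
```

Because a row can belong to more than one group (the last toy row does), the groups are scored independently rather than as a partition of the dataset.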

You can also run disaggregations on Pandas DataFrames with .apply and .merge:

from disaggregators import Disaggregator
import pandas as pd
df = pd.DataFrame({"text": ["They went to the park."]})

disaggregator = Disaggregator("pronoun", column="text")

new_cols = df.apply(disaggregator, axis=1)
df = pd.merge(df, pd.json_normalize(new_cols), left_index=True, right_index=True)

Available Disaggregation Modules

The following modules are currently available:

  • "age"
  • "gender"
  • "pronoun"
  • "religion"
  • "continent"

Note that disaggregators is in active development, and that these (and future) modules are subject to changing interfaces and implementations at any time before a 1.0.0 release. Each module provides its own method for overriding the default configuration, with the general interface documented below.

Module Configurations

Modules may make certain variables and functionality configurable. If you'd like to configure a module, import the module, its labels, and its config class. Then, override the labels and set the configuration as needed while instantiating the module. Once instantiated, you can pass the module to the Disaggregator. The example below shows this with the Age module.

from disaggregators import Disaggregator
from disaggregators.disaggregation_modules.age import Age, AgeLabels, AgeConfig

class MeSHAgeLabels(AgeLabels):
    INFANT = "infant"
    CHILD_PRESCHOOL = "child_preschool"
    CHILD = "child"
    ADOLESCENT = "adolescent"
    ADULT = "adult"
    MIDDLE_AGED = "middle_aged"
    AGED = "aged"
    AGED_80_OVER = "aged_80_over"

age = Age(
    config=AgeConfig(
        labels=MeSHAgeLabels,
        ages=[list(MeSHAgeLabels)],
        breakpoints=[0, 2, 5, 12, 18, 44, 64, 79]
    ),
    column="question"
)

disaggregator = Disaggregator([age, "gender"], column="question")
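The configuration above pairs eight labels with eight breakpoints. One way to picture that pairing is as a lookup from a numeric age to a label, sketched below with the standard library only. This is not the Age module's implementation: the `Enum` class is a stand-in for `MeSHAgeLabels`, `bucket_for` is a hypothetical helper, and treating each breakpoint as the inclusive lower bound of its bucket is an assumption, since the boundary convention is not stated here.

```python
from bisect import bisect_right
from enum import Enum


# Stand-in for the MeSHAgeLabels class above, so the sketch runs without the library.
class MeSHAgeLabels(Enum):
    INFANT = "infant"
    CHILD_PRESCHOOL = "child_preschool"
    CHILD = "child"
    ADOLESCENT = "adolescent"
    ADULT = "adult"
    MIDDLE_AGED = "middle_aged"
    AGED = "aged"
    AGED_80_OVER = "aged_80_over"


BREAKPOINTS = [0, 2, 5, 12, 18, 44, 64, 79]
LABELS = list(MeSHAgeLabels)

# One breakpoint per label, as in the configuration above.
assert len(BREAKPOINTS) == len(LABELS)


def bucket_for(age):
    """Assumption: each breakpoint is the inclusive lower bound of its bucket."""
    if age < BREAKPOINTS[0]:
        raise ValueError("age is below the first breakpoint")
    # bisect_right counts the breakpoints <= age; the last of those owns the age.
    return LABELS[bisect_right(BREAKPOINTS, age) - 1]


for age in (1, 30, 90):
    print(age, bucket_for(age).value)
```

If you override the labels, a quick length check like the one above can catch a mismatch between labels and breakpoints before the module is instantiated.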

Custom Modules

Custom modules can be created by extending the CustomDisaggregator. All custom modules must have labels and a module_id, and must implement a __call__ method.

from disaggregators import Disaggregator, DisaggregationModuleLabels, CustomDisaggregator

class TabsSpacesLabels(DisaggregationModuleLabels):
    TABS = "tabs"
    SPACES = "spaces"

class TabsSpaces(CustomDisaggregator):
    module_id = "tabs_spaces"
    labels = TabsSpacesLabels

    def __call__(self, row, *args, **kwargs):
        if "\t" in row[self.column]:
            return {self.labels.TABS: True, self.labels.SPACES: False}
        else:
            return {self.labels.TABS: False, self.labels.SPACES: True}

disaggregator = Disaggregator(TabsSpaces, column="text")
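Before wiring a custom module into a Disaggregator, it can help to exercise its row logic on a few hand-written rows. The sketch below does that with stand-ins so it runs without the library: `TabsSpacesLogic` is a hypothetical class that copies only the `__call__` logic from the example above, and the labels are modelled as a plain `Enum`, which is an assumption about how `DisaggregationModuleLabels` behaves.

```python
from enum import Enum


class TabsSpacesLabels(Enum):  # stand-in for a DisaggregationModuleLabels subclass
    TABS = "tabs"
    SPACES = "spaces"


class TabsSpacesLogic:
    """Stand-in that keeps only the row logic from TabsSpaces.__call__."""

    labels = TabsSpacesLabels

    def __init__(self, column):
        self.column = column

    def __call__(self, row):
        has_tab = "\t" in row[self.column]
        return {self.labels.TABS: has_tab, self.labels.SPACES: not has_tab}


check = TabsSpacesLogic(column="text")
print(check({"text": "\tindented with a tab"}))
print(check({"text": "    indented with spaces"}))
print(check({"text": "no indentation"}))  # no tab, so it is labelled SPACES too
```

The last row shows a consequence of the example's logic: any row without a tab character is labelled as spaces, whether or not it contains leading spaces.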

Development

Development requirements can be installed with pip install .[dev]. See the Makefile for useful targets, such as code quality and test running.

To run tests locally across multiple Python versions (3.8, 3.9, and 3.10), ensure that you have all the Python versions available and then run nox -r. Note that this is quite slow, so it's only worth doing to double-check your code before you open a Pull Request.

Contact

Nima Boscarino – nima <at> huggingface <dot> co
