HuggingFace community-driven open-source library for dataset disaggregation
⚠️ Please note: This library is in early development, and the disaggregation modules that are included are proofs of concept that are not production-ready. Additionally, all APIs are subject to breaking changes any time before a 1.0.0 release. Rigorously tested versions of the included modules will be released in the future, so stay tuned. We'd love your feedback in the meantime!
The disaggregators library allows you to easily add new features to your datasets to enable disaggregated data exploration and disaggregated model evaluation. disaggregators is preloaded with disaggregation modules for text data, with image modules coming soon!
This library is intended to be used with 🤗 Datasets, but should work with any other "mappable" interface to a dataset.
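As a rough illustration of what a "mappable" interface means here, the sketch below applies a row-level function to a plain list of dictionaries. The has_they function is a hypothetical stand-in and is not part of disaggregators; it only mimics the shape of a disaggregation call, where a row goes in and a dictionary of new columns comes out.

```python
# Hypothetical stand-in for a disaggregator: takes a row (dict) and
# returns a dict of new columns. Not part of the disaggregators API.
def has_they(row):
    words = row["text"].lower().replace(".", " ").split()
    return {"mentions_they": "they" in words}

rows = [
    {"text": "They went to the park."},
    {"text": "The park was empty."},
]

# A "mappable" dataset only needs a way to apply a function to each row
# and merge the returned columns back into that row.
augmented = [{**row, **has_they(row)} for row in rows]

for row in augmented:
    print(row)
```

Any container that can apply such a function row by row and keep the returned columns should fit this pattern.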
Requirements and Installation
disaggregators has been tested on Python 3.8, 3.9, and 3.10.
Running pip install disaggregators will fetch the latest release from PyPI.
Note that some disaggregation modules require extra dependencies such as spaCy modules, which may need to be installed manually. If these dependencies aren't installed, disaggregators will inform you about how to install them.
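If you would rather check for optional dependencies up front, a minimal sketch using only the standard library is shown below. The find_missing helper is hypothetical, and "spacy" is used purely as an example name; the exact packages each module needs are not listed here.

```python
import importlib.util

def find_missing(module_names):
    """Return the names from module_names that cannot be imported here."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# "spacy" is only an example of an optional dependency (an assumption);
# adjust the list to whatever the modules you use actually require.
missing = find_missing(["spacy"])
if missing:
    print("Missing optional dependencies:", ", ".join(missing))
else:
    print("All checked dependencies are importable.")
```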
To install directly from this GitHub repo, use the following command:
pip install git+https://github.com/huggingface/disaggregators.git
Usage
You will likely want to use 🤗 Datasets with disaggregators.
pip install datasets
The snippet below loads the IMDB dataset from the Hugging Face Hub, and initializes a disaggregator for "pronoun" that will run on the IMDB dataset's "text" column. If you would like to run multiple disaggregations, you can pass a list to the Disaggregator constructor (e.g. Disaggregator(["pronoun", "sentiment"], column="text")). We then use the 🤗 Datasets map method to apply the disaggregation to the dataset.
from disaggregators import Disaggregator
from datasets import load_dataset
dataset = load_dataset("imdb", split="train")
disaggregator = Disaggregator("pronoun", column="text")
ds = dataset.map(disaggregator) # New boolean columns are added for she/her, he/him, and they/them
The resulting dataset can now be used for data exploration and disaggregated model evaluation.
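To make "disaggregated model evaluation" concrete, here is a minimal sketch that computes accuracy separately for each group. It uses plain dictionaries rather than a real dataset, and the column names ("she_her", "he_him", "they_them") as well as the labels and predictions are assumptions for illustration; check the columns your disaggregated dataset actually contains.

```python
from collections import defaultdict

# Hypothetical rows after disaggregation, with model predictions added.
# Column names and values are made up for this example.
rows = [
    {"label": 1, "prediction": 1, "she_her": True, "he_him": False, "they_them": False},
    {"label": 0, "prediction": 1, "she_her": True, "he_him": False, "they_them": False},
    {"label": 1, "prediction": 1, "she_her": False, "he_him": True, "they_them": False},
    {"label": 0, "prediction": 0, "she_her": False, "he_him": True, "they_them": True},
]

def accuracy_by_group(rows, group_columns):
    """Compute accuracy over the rows where each boolean group column is True."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        for column in group_columns:
            if row[column]:
                total[column] += 1
                correct[column] += int(row["label"] == row["prediction"])
    # Skip groups with no matching rows to avoid dividing by zero.
    return {c: correct[c] / total[c] for c in group_columns if total[c]}

scores = accuracy_by_group(rows, ["she_her", "he_him", "they_them"])
print(scores)
```

Because a row can belong to several groups at once, the groups are evaluated independently rather than as a partition of the data.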
You can also run disaggregations on pandas DataFrames with .apply and .merge:
from disaggregators import Disaggregator
import pandas as pd
df = pd.DataFrame({"text": ["They went to the park."]})
disaggregator = Disaggregator("pronoun", column="text")
new_cols = df.apply(disaggregator, axis=1)
df = pd.merge(df, pd.json_normalize(new_cols), left_index=True, right_index=True)
Available Disaggregation Modules
The following modules are currently available:
"age"
"gender"
"pronoun"
"religion"
"continent"
Note that disaggregators is in active development, and that these (and future) modules are subject to changing interfaces and implementations at any time before a 1.0.0 release. Each module provides its own method for overriding the default configuration, with the general interface documented below.
Module Configurations
Modules may make certain variables and functionality configurable. If you'd like to configure a module, import the module, its labels, and its config class. Then, override the labels and set the configuration as needed while instantiating the module. Once instantiated, you can pass the module to the Disaggregator. The example below shows this with the Age module.
from disaggregators import Disaggregator
from disaggregators.disaggregation_modules.age import Age, AgeLabels, AgeConfig

class MeSHAgeLabels(AgeLabels):
    INFANT = "infant"
    CHILD_PRESCHOOL = "child_preschool"
    CHILD = "child"
    ADOLESCENT = "adolescent"
    ADULT = "adult"
    MIDDLE_AGED = "middle_aged"
    AGED = "aged"
    AGED_80_OVER = "aged_80_over"

age = Age(
    config=AgeConfig(
        labels=MeSHAgeLabels,
        ages=[list(MeSHAgeLabels)],
        breakpoints=[0, 2, 5, 12, 18, 44, 64, 79]
    ),
    column="question"
)
disaggregator = Disaggregator([age, "gender"], column="question")
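One plausible reading of the breakpoints list is that each value marks where the corresponding label's range begins. The sketch below illustrates that reading with the standard bisect module. It is an assumption, not the library's documented behavior: here an age belongs to the last label whose breakpoint is strictly below it, and the Age module may treat boundaries differently. The label_for_age helper is hypothetical.

```python
from bisect import bisect_left

# Same labels and breakpoints as in the configuration example.
LABELS = [
    "infant", "child_preschool", "child", "adolescent",
    "adult", "middle_aged", "aged", "aged_80_over",
]
BREAKPOINTS = [0, 2, 5, 12, 18, 44, 64, 79]

def label_for_age(age):
    """Map an age in years to a label (assumed boundary handling, see above)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # Count the breakpoints strictly below the age; the last one wins.
    index = bisect_left(BREAKPOINTS, age) - 1
    # An age equal to the first breakpoint falls into the first label.
    return LABELS[max(index, 0)]

for sample in (1, 30, 79, 80):
    print(sample, label_for_age(sample))
```

Under this reading there must be exactly one breakpoint per label, which matches the eight labels and eight breakpoints in the example.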
Custom Modules
Custom modules can be created by extending the CustomDisaggregator. All custom modules must have labels and a module_id, and must implement a __call__ method.
from disaggregators import Disaggregator, DisaggregationModuleLabels, CustomDisaggregator

class TabsSpacesLabels(DisaggregationModuleLabels):
    TABS = "tabs"
    SPACES = "spaces"

class TabsSpaces(CustomDisaggregator):
    module_id = "tabs_spaces"
    labels = TabsSpacesLabels

    def __call__(self, row, *args, **kwargs):
        if "\t" in row[self.column]:
            return {self.labels.TABS: True, self.labels.SPACES: False}
        else:
            return {self.labels.TABS: False, self.labels.SPACES: True}

disaggregator = Disaggregator(TabsSpaces, column="text")
Development
Development requirements can be installed with pip install .[dev]. See the Makefile for useful targets, such as code quality and test running.
To run tests locally across multiple Python versions (3.8, 3.9, and 3.10), ensure that you have all the Python versions available and then run nox -r. Note that this is quite slow, so it's only worth doing to double-check your code before you open a Pull Request.
Contact
Nima Boscarino – nima <at> huggingface <dot> co