Datalabs
Project description
DataLab is a unified platform that lets NLP researchers perform a range of data-related tasks efficiently and easily. In particular, DataLab supports the following functionalities:
- Data Diagnostics: DataLab allows for analysis and understanding of data to uncover undesirable traits such as hate speech, gender bias, or label imbalance.
- Operation Standardization: DataLab provides and standardizes a large number of data processing operations, including aggregating, preprocessing, featurizing, editing and prompting operations.
- Data Search: DataLab provides a semantic dataset search tool to help identify appropriate datasets given a textual description of an idea.
- Global Analysis: DataLab provides tools to perform global analyses over a variety of datasets.
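As a rough illustration of what a data diagnostic such as label-imbalance detection involves, here is a minimal sketch in plain Python. It does not use DataLab's own API: the function `label_distribution`, the `label_key` parameter, and the toy samples are all hypothetical and chosen only for this example.

```python
from collections import Counter


def label_distribution(samples, label_key="label"):
    """Return (counts, imbalance_ratio) for the labels found in `samples`.

    Hypothetical helper for illustration only, not part of DataLab.
    The ratio is the size of the largest class divided by the smallest.
    """
    counts = Counter(sample[label_key] for sample in samples)
    if not counts:
        return counts, None
    return counts, max(counts.values()) / min(counts.values())


# Toy samples loosely shaped like a news classification dataset (made up).
toy_samples = [
    {"text": "Stocks rally on earnings", "label": "Business"},
    {"text": "Team wins the final", "label": "Sports"},
    {"text": "Markets slide again", "label": "Business"},
    {"text": "New chip announced", "label": "Sci/Tech"},
    {"text": "Bank raises rates", "label": "Business"},
]

counts, ratio = label_distribution(toy_samples)
print(dict(counts))
print(ratio)
```

A ratio well above 1 suggests that some classes are under-represented and may deserve a closer look before training.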
Installation
DataLab can be installed from PyPI:
pip install --upgrade pip
pip install datalabs
or from source:
# This is suitable for SDK developers
pip install --upgrade pip
git clone git@github.com:ExpressAI/DataLab.git
cd DataLab
pip install .
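After either installation route, it can be useful to confirm that the package is visible to the current Python environment. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8+). The helper name `installed_version` is hypothetical, and the check assumes the distribution is registered under the name `datalabs`, as in the `pip install datalabs` command above.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("datalabs")
if version is None:
    print("datalabs is not installed in this environment")
else:
    print(f"datalabs {version} is installed")
```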
Getting started (Documentation)
Here we give several examples to showcase the usage of DataLab. For more information, please refer to the corresponding sections in our documentation.
# pip install datalabs
from datalabs import load_dataset
dataset = load_dataset("ag_news")
# Preprocessing operation
from preprocess import *
res = dataset["test"].apply(lower)
print(next(res))
# Featurizing operation
from featurize import *
res = dataset["test"].apply(get_text_length) # get length
print(next(res))
res = dataset["test"].apply(get_entities_spacy) # get entity
print(next(res))
# Editing/Transformation operation
from edit import *
res = dataset["test"].apply(change_person_name) # change person name
print(next(res))
# Prompting operation
from prompt import *
res = dataset["test"].apply(template_tc1)
print(next(res))
# Aggregating operation
from aggregate.text_classification import *
res = dataset["test"].apply(get_statistics)
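The examples above call `next(res)` on the result of `apply`, which suggests that `apply` returns an iterator that evaluates the operation lazily, one sample at a time. The toy sketch below mimics that pattern in plain Python so it can be tried without downloading a dataset. It is not DataLab's implementation: the class `ToySplit`, the two operations, and the output field names `text_lower` and `text_length` are assumptions made for illustration.

```python
class ToySplit:
    """A stand-in for a dataset split; hypothetical, not DataLab's class."""

    def __init__(self, samples):
        self.samples = samples

    def apply(self, operation):
        # A generator: the operation runs only when a result is requested.
        for sample in self.samples:
            yield operation(sample)


def lower(sample):
    """Toy preprocessing operation: lowercase the text field."""
    return {"text_lower": sample["text"].lower()}


def get_text_length(sample):
    """Toy featurizing operation: count whitespace-separated tokens."""
    return {"text_length": len(sample["text"].split())}


split = ToySplit([
    {"text": "Wall St. Bears Claw Back", "label": 2},
    {"text": "Oil prices soar", "label": 2},
])

res = split.apply(lower)
first = next(res)          # only the first sample has been processed so far
print(first)

lengths = list(split.apply(get_text_length))  # consume the whole iterator
print(lengths)
```

Because results are produced on demand, `next(res)` is a cheap way to inspect one output, while `list(...)` or a `for` loop processes every sample.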
Acknowledgment
DataLab originated from a fork of the awesome Hugging Face Datasets and TensorFlow Datasets. We are very grateful to the Hugging Face Datasets and TensorFlow Datasets teams for building these amazing libraries. More details on the differences between DataLab and these libraries can be found in the section. We thank Antonis Anastasopoulos for sharing the mapping data between countries and languages, and Alissa Ostapenko, Yulia Tsvetkov, Jie Fu, Ziyun Xu, Hiroaki Hayashi, and Zhengfu He for useful discussions and suggestions on the first version.
Hashes for datalabs-0.3.3-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 0bf713b69a3be0e76b43b3cff4a26956b41eff3f7c05b6560538c6d27b756677
MD5 | 9312b09006c774895ed515f37e2f2f29
BLAKE2b-256 | a6ce0093b3ce667af69942d751c8a5dd08908be1de93145df03324a53363b666