Datalabs
DataLab is a unified platform that allows NLP researchers to perform a number of data-related tasks in an efficient, easy-to-use manner. In particular, DataLab supports the following functionalities:
- Data Diagnostics: DataLab supports analysis and understanding of data, uncovering undesirable traits such as hate speech, gender bias, or label imbalance.
- Operation Standardization: DataLab provides and standardizes a large number of data processing operations, including aggregating, preprocessing, featurizing, editing and prompting operations.
- Data Search: DataLab provides a semantic dataset search tool to help identify appropriate datasets given a textual description of an idea.
- Global Analysis: DataLab provides tools to perform global analyses over a variety of datasets.
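The dataset-search idea above can be sketched with a simple bag-of-words cosine similarity over dataset descriptions. This is only an illustration of the concept; the function names, the catalog, and the scoring scheme here are hypothetical and are not DataLab's actual implementation, which uses semantic (embedding-based) matching:

```python
# Illustrative sketch of dataset search over textual descriptions.
# Not DataLab's API: `search`, `cosine`, and `catalog` are hypothetical.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query: str, catalog: dict) -> list:
    """Rank datasets by similarity of their description to the query."""
    q = Counter(query.lower().split())
    scored = [(name, cosine(q, Counter(desc.lower().split())))
              for name, desc in catalog.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)

catalog = {
    "ag_news": "news topic classification dataset with four classes",
    "sst2": "movie review sentiment classification dataset",
}
print(search("sentiment of movie reviews", catalog)[0][0])  # -> sst2
```

A real semantic search would replace the word-count vectors with sentence embeddings, but the ranking loop is the same.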
Table of Contents
- Installation
- Getting Started
- Supported Datasets
  - Datasets in SDK
  - Datasets in Web Platform
- Documentation for Web Users
- Documentation for SDK Users
Installation
DataLab can be installed from PyPI:

```bash
pip install --upgrade pip
pip install datalabs
```

or from source:

```bash
# This is suitable for SDK developers
pip install --upgrade pip
git clone git@github.com:ExpressAI/DataLab.git
cd DataLab
pip install .
```
Getting started
Here we give several examples to showcase the usage of DataLab. For more information, please refer to the corresponding sections in our documentation.

```python
# pip install datalabs
from datalabs import load_dataset

dataset = load_dataset("ag_news")

# Preprocessing operation
from preprocess import *
res = dataset["test"].apply(lower)
print(next(res))

# Featurizing operation
from featurize import *
res = dataset["test"].apply(get_text_length)  # get text length
print(next(res))
res = dataset["test"].apply(get_entities_spacy)  # get entities
print(next(res))

# Editing/Transformation operation
from edit import *
res = dataset["test"].apply(change_person_name)  # change person names
print(next(res))

# Prompting operation
from prompt import *
res = dataset["test"].apply(template_tc1)
print(next(res))

# Aggregating operation
from aggregate.text_classification import *
res = dataset["test"].apply(get_statistics)
```
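The `apply` pattern above maps an operation lazily over a dataset split and returns an iterator, which is why results are read with `print(next(res))`. The pattern can be sketched in plain Python without DataLab installed; `Split`, `lower_text`, and `text_length` below are illustrative stand-ins, not DataLab's API:

```python
# Minimal sketch of the lazy "apply" pattern, under the assumption
# that operations are per-example functions mapped over a split.
# `Split`, `lower_text`, `text_length` are hypothetical names.

def lower_text(example):
    """Preprocessing op: lowercase the text field of one example."""
    return {**example, "text": example["text"].lower()}

def text_length(example):
    """Featurizing op: add a word count for the text field."""
    return {**example, "length": len(example["text"].split())}

class Split:
    """A dataset split: a list of dict examples with a lazy apply()."""
    def __init__(self, examples):
        self.examples = examples

    def apply(self, op):
        # Returns a generator, so results are produced on demand,
        # mirroring the print(next(res)) usage above.
        return (op(ex) for ex in self.examples)

test_split = Split([
    {"text": "Wall St. Bears Claw Back Into the Black"},
    {"text": "Oil and Economy Cloud Stocks' Outlook"},
])

res = test_split.apply(lower_text)
print(next(res))  # {'text': 'wall st. bears claw back into the black'}

res = test_split.apply(text_length)
print(next(res))
```

Because `apply` yields a generator, an expensive operation (e.g. spaCy entity extraction) runs only on the examples actually consumed.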
Acknowledgment
DataLab originated as a fork of the awesome Huggingface Datasets and TensorFlow Datasets. We are grateful to the Huggingface and TensorFlow Datasets teams for building these amazing libraries. More details on the differences between DataLab and these libraries can be found in the documentation.