Datalabs
DataLab is a unified platform that lets NLP researchers perform a range of data-related tasks efficiently and with little effort. In particular, DataLab supports the following functionalities:
- Data Diagnostics: DataLab allows for analysis and understanding of data to uncover undesirable traits such as hate speech, gender bias, or label imbalance.
- Operation Standardization: DataLab provides and standardizes a large number of data processing operations, including aggregating, preprocessing, featurizing, editing and prompting operations.
- Data Search: DataLab provides a semantic dataset search tool to help identify appropriate datasets given a textual description of an idea.
- Global Analysis: DataLab provides tools to perform global analyses over a variety of datasets.
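As a rough illustration of the kind of check the Data Diagnostics item describes, label imbalance can be measured from label counts alone. The sketch below is not DataLab's implementation: the helper names `label_distribution` and `imbalance_ratio` and the toy labels are assumptions made up for this example, and only the Python standard library is used.

```python
from collections import Counter


def label_distribution(labels):
    """Return each label's share of the dataset as a fraction in [0, 1]."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Return the count of the most frequent label divided by the least frequent."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Toy labels, invented for illustration only.
toy_labels = ["sports", "sports", "sports", "world", "business", "sports"]
print(label_distribution(toy_labels))
print(imbalance_ratio(toy_labels))
```

A ratio close to 1 suggests a balanced label set; larger values indicate that one class dominates.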
Table of Contents
- Installation SDK
- Supported Datasets
  - Datasets in SDK
  - Datasets in Web Platform
- Documentation for Web Users
- Documentation for SDK Users
Installation
DataLab can be installed from PyPI:
pip install --upgrade pip
pip install datalabs
or from source:
# This is suitable for SDK developers
pip install --upgrade pip
git clone git@github.com:ExpressAI/DataLab.git
cd DataLab
pip install .
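After either installation route, it can be useful to confirm that the package is visible to the current Python environment. The following is a minimal sketch using only the standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_version` is an assumption for illustration and is not part of DataLab.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the installed datalabs version, or None if it is not installed.
print(installed_version("datalabs"))
```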
Getting started
Here we give several examples to showcase the usage of DataLab. For more information, please refer to the corresponding sections in our documentation.
# pip install datalabs
from datalabs import load_dataset
dataset = load_dataset("ag_news")
# Preprocessing operation
from preprocess import *
res = dataset["test"].apply(lower)
print(next(res))
# Featurizing operation
from featurize import *
res = dataset["test"].apply(get_text_length) # get length
print(next(res))
res = dataset["test"].apply(get_entities_spacy) # get entity
print(next(res))
# Editing/Transformation operation
from edit import *
res = dataset["test"].apply(change_person_name) # change person name
print(next(res))
# Prompting operation
from prompt import *
res = dataset["test"].apply(template_tc1)
print(next(res))
# Aggregating operation
from aggregate.text_classification import *
res = dataset["test"].apply(get_statistics)
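The examples above call `next(res)` on the result of `apply`, which suggests that per-sample operations are evaluated lazily through an iterator. The sketch below illustrates that general pattern with a Python generator. It is an assumption-laden toy, not DataLab's code: `ToySplit` is a hypothetical stand-in for a dataset split, and the `lower` and `get_text_length` functions here are simplified reimplementations whose output format may differ from the real operations.

```python
class ToySplit:
    """Hypothetical stand-in for a dataset split; not DataLab's class."""

    def __init__(self, samples):
        self.samples = list(samples)

    def apply(self, operation):
        """Lazily yield operation(sample) for each sample, one at a time."""
        for sample in self.samples:
            yield operation(sample)


def lower(sample):
    """Preprocessing-style operation: lowercase the text field."""
    return {"text": sample["text"].lower()}


def get_text_length(sample):
    """Featurizing-style operation: count whitespace-separated tokens."""
    return {"text_length": len(sample["text"].split())}


# Toy samples, invented for illustration only.
split = ToySplit(
    [{"text": "Stocks Rally On Tech Earnings"}, {"text": "Local Team Wins"}]
)

res = split.apply(lower)
print(next(res))  # only the first sample has been processed at this point

res = split.apply(get_text_length)
print(next(res))
```

Because `apply` is a generator here, no sample is processed until it is requested, which keeps memory use flat on large splits.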
Acknowledgment
DataLab originated as a fork of the Hugging Face Datasets and TensorFlow Datasets libraries. We are very grateful to the Hugging Face and TensorFlow Datasets teams for building them. More details on the differences between DataLab and these libraries can be found in the section
Hashes for datalabs-0.2.9-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 0961f9aadfb52804e4c647f11d8cfdd88b3fd26eb5dfa9719a79095e8b9f5d51
MD5 | 41342615122a74a176f7bf070b2a1218
BLAKE2b-256 | f2b54618a7cbdb07892167d68f5633419877460d73e5f763d3c90a4e982a167b