Datalabs

Project description

DataLab is a unified platform that lets NLP researchers perform a range of data-related tasks efficiently and easily. In particular, DataLab supports the following functionalities:

Data Diagnostics: DataLab allows for analysis and understanding of data to uncover undesirable traits such as hate speech, gender bias, or label imbalance.
Operation Standardization: DataLab provides and standardizes a large number of data processing operations, including aggregating, preprocessing, featurizing, editing and prompting operations.
Data Search: DataLab provides a semantic dataset search tool to help identify appropriate datasets given a textual description of an idea.
Global Analysis: DataLab provides tools to perform global analyses over a variety of datasets.
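To make the label-imbalance diagnostic mentioned above concrete, here is a minimal sketch in plain Python. It does not use the DataLab API; the helper names label_distribution and imbalance_ratio and the toy labels are assumptions made up for illustration.

```python
from collections import Counter


def label_distribution(labels):
    """Return each label's share of the data as a fraction in [0, 1]."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Return the most frequent label count divided by the least frequent one."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    return max(counts.values()) / min(counts.values())


# Toy labels, invented for illustration only (not taken from any real dataset).
toy_labels = ["sports", "sports", "sports", "business", "world", "world"]
distribution = label_distribution(toy_labels)
ratio = imbalance_ratio(toy_labels)
print(distribution)
print(ratio)
```

A ratio close to 1 suggests balanced classes, while a large ratio flags a dataset where a classifier could score well by favoring the majority label.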

Installation

DataLab can be installed from PyPI:

pip install --upgrade pip
pip install datalabs

or from source:

# This is suitable for SDK developers
pip install --upgrade pip
git clone git@github.com:ExpressAI/DataLab.git
cd DataLab
pip install .

Getting started

Here we give several examples to showcase the usage of DataLab. For more information, please refer to the corresponding sections in our documentation.

# pip install datalabs
from datalabs import operations, load_dataset
from featurize import *


dataset = load_dataset("ag_news")

# print the task schema
print(dataset['test']._info.task_templates)

# data operators
res = dataset["test"].apply(get_text_length)
print(next(res))


# get entity
res = dataset["test"].apply(get_entities_spacy)
print(next(res))

# get postag
res = dataset["test"].apply(get_postag_spacy)
print(next(res))

from edit import *
# add typos
res = dataset["test"].apply(add_typo)
print(next(res))

#  change person name
res = dataset["test"].apply(change_person_name)
print(next(res))
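In the snippet above, each apply call returns an object that is consumed with next, which suggests results are produced one sample at a time. The following plain-Python sketch mimics that pattern under stated assumptions: apply_lazily, text_length, and the toy samples are hypothetical stand-ins, not DataLab's implementation, and the output format of the real operations may differ.

```python
from typing import Callable, Dict, Iterable, Iterator

Sample = Dict[str, object]


def text_length(sample: Sample) -> Dict[str, int]:
    """Hypothetical featurizer: count whitespace-separated tokens in the text field."""
    return {"text_length": len(str(sample["text"]).split())}


def apply_lazily(samples: Iterable[Sample], func: Callable[[Sample], dict]) -> Iterator[dict]:
    """Yield func(sample) for one sample at a time instead of building a full list."""
    for sample in samples:
        yield func(sample)


# Toy samples, invented for illustration only.
toy_split = [
    {"text": "Stocks rallied on Tuesday", "label": 2},
    {"text": "The home team won again today", "label": 1},
]

res = apply_lazily(toy_split, text_length)
first = next(res)   # only the first sample has been processed at this point
second = next(res)  # the second sample is processed on demand
print(first, second)
```

Producing results lazily keeps memory use flat on large splits, at the cost of having to re-run the operation if the results are needed a second time.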

Task Schema

text-classification
- text:str
- label:ClassLabel
text-matching
- text1:str
- text2:str
- label:ClassLabel
summarization
- text:str
- summary:str
sequence-labeling
- tokens:List[str]
- tags:List[ClassLabel]
question-answering-extractive
- context:str
- question:str
- answers:List[{"text":"","answer_start":""}]

Use dataset[SPLIT]._info.task_templates to get more task-dependent information, where SPLIT is train, validation, or test.
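As a rough illustration of how the schema listing could be used, the sketch below transcribes the field names per task into a dictionary and reports which fields a sample is missing. TASK_FIELDS and missing_fields are hypothetical helpers written for this example; they are not part of the datalabs package, and field types are deliberately left out.

```python
from typing import Dict, List

# Field names per task, transcribed from the schema listing above (types omitted).
TASK_FIELDS: Dict[str, List[str]] = {
    "text-classification": ["text", "label"],
    "text-matching": ["text1", "text2", "label"],
    "summarization": ["text", "summary"],
    "sequence-labeling": ["tokens", "tags"],
    "question-answering-extractive": ["context", "question", "answers"],
}


def missing_fields(task: str, sample: dict) -> List[str]:
    """Return the schema fields that are absent from the sample, in schema order."""
    if task not in TASK_FIELDS:
        raise KeyError(f"unknown task: {task}")
    return [field for field in TASK_FIELDS[task] if field not in sample]


# Toy samples, invented for illustration only.
complete = {"text": "A short news item", "label": 0}
incomplete = {"text1": "first sentence", "label": 1}

print(missing_fields("text-classification", complete))
print(missing_fields("text-matching", incomplete))
```

A check like this can catch a mismatched task and dataset before any operation runs on the data.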

Supported Datasets

here

Acknowledgment

DataLab originated from a fork of the awesome Huggingface Datasets and TensorFlow Datasets. We sincerely thank the Huggingface Datasets and TensorFlow Datasets teams for building these amazing libraries. More details on the differences between DataLab and them can be found in the section

Release history

0.4.15

Dec 22, 2022

0.4.14

Oct 2, 2022

0.4.13

Sep 20, 2022

0.4.12

Sep 12, 2022

0.4.11

Sep 11, 2022

0.4.10

Sep 5, 2022

0.4.9

Aug 24, 2022

0.4.8

Aug 2, 2022

0.4.7

Jul 27, 2022

0.4.6

Jul 25, 2022

0.4.5

Jul 17, 2022

0.4.4

Jul 14, 2022

0.4.3

Jun 2, 2022

0.4.2

May 29, 2022

0.4.1

May 14, 2022

0.4.0

May 11, 2022

0.3.13

May 9, 2022

0.3.12

May 3, 2022

0.3.11

Apr 22, 2022

0.3.10

Apr 13, 2022

0.3.9

Apr 13, 2022

0.3.8

Apr 10, 2022

0.3.7

Mar 23, 2022

0.3.6

Mar 20, 2022

0.3.5

Mar 20, 2022

0.3.4

Mar 12, 2022

0.3.3

Mar 11, 2022

0.3.2

Mar 11, 2022

0.3.1

Mar 10, 2022

0.3.0

Mar 7, 2022

0.2.11

Mar 6, 2022

0.2.10

Mar 6, 2022

0.2.9

Feb 26, 2022

0.2.8

Feb 24, 2022

0.2.7

Feb 16, 2022

0.2.6

Feb 16, 2022

0.2.5.dev0 pre-release

Feb 16, 2022

0.2.4.dev0 pre-release

Feb 16, 2022

0.2.2.dev0 pre-release

Feb 13, 2022

This version

0.2.1.dev0 pre-release

Feb 11, 2022

0.2.0.dev0 pre-release

Feb 11, 2022

0.1.8.dev0 pre-release

Feb 9, 2022

0.1.7.dev0 pre-release

Feb 9, 2022

0.1.6.dev0 pre-release

Feb 4, 2022

0.1.5.dev0 pre-release

Feb 3, 2022

0.1.4.dev0 pre-release

Feb 3, 2022

0.1.3.dev0 pre-release

Feb 3, 2022

0.1.1.dev0 pre-release

Jan 18, 2022

0.1.0.dev0 pre-release

Jan 18, 2022

0.0.5.dev0 pre-release

Jan 15, 2022

0.0.4.dev0 pre-release

Jan 14, 2022

0.0.3.dev0 pre-release

Jan 13, 2022

0.0.2.dev0 pre-release

Jan 12, 2022

0.0.1.dev0 pre-release

Jan 12, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datalabs-0.2.1.dev0.tar.gz (298.2 kB)

Uploaded Feb 11, 2022 Source

Built Distribution

datalabs-0.2.1.dev0-py2.py3-none-any.whl (2.2 MB)

Uploaded Feb 11, 2022 Python 2 Python 3

Hashes for datalabs-0.2.1.dev0.tar.gz
Algorithm	Hash digest
SHA256	`0f6382b8a425f99679cddde6452ccabed82383467b3703dd3a3fc04d2cb71126`
MD5	`e171c0efe10f26e3474349166dca845f`
BLAKE2b-256	`e67d0b132f666c1b0ed107f22c675231a4d1a429cfcf6ddca252749be9848811`

Hashes for datalabs-0.2.1.dev0-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`abf869a0e33994896b6f0307d2e6710bf66a84688708ebd98b678db45d84a174`
MD5	`83929b650d82247df66ea700b815a2f4`
BLAKE2b-256	`5314761780c0e0ab194ea8c9c7f65ff6f1786cae400f6e76a6a59cb890bfe225`
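The digests listed above can be checked against a downloaded file. The sketch below uses only the Python standard library; the helper names file_digest and matches_expected are made up for this example, and the local file path is an assumption about where the source distribution was saved.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large archives are never loaded into memory at once."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex, algorithm="sha256"):
    """Compare the file's digest with a published hex digest, ignoring case and spaces."""
    return file_digest(path, algorithm).lower() == expected_hex.strip().lower()


# SHA256 published above for datalabs-0.2.1.dev0.tar.gz.
EXPECTED_SDIST_SHA256 = "0f6382b8a425f99679cddde6452ccabed82383467b3703dd3a3fc04d2cb71126"

if __name__ == "__main__":
    # Assumed location: the archive sits in the current working directory.
    sdist = Path("datalabs-0.2.1.dev0.tar.gz")
    if sdist.exists():
        print("SHA256 matches:", matches_expected(sdist, EXPECTED_SDIST_SHA256))
    else:
        print("archive not found; download it first")
```

Passing algorithm="md5" checks the MD5 value in the same way, although SHA256 is the stronger of the two for integrity checks.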