
Datalabs


DataLab is a unified platform that allows NLP researchers to perform a number of data-related tasks efficiently and easily. In particular, DataLab supports the following functionalities:

  • Data Diagnostics: DataLab allows for analysis and understanding of data to uncover undesirable traits such as hate speech, gender bias, or label imbalance.
  • Operation Standardization: DataLab provides and standardizes a large number of data processing operations, including aggregating, preprocessing, featurizing, editing and prompting operations.
  • Data Search: DataLab provides a semantic dataset search tool to help identify appropriate datasets given a textual description of an idea.
  • Global Analysis: DataLab provides tools to perform global analyses over a variety of datasets.

Installation

DataLab can be installed from PyPI:

pip install --upgrade pip
pip install datalabs
python -m nltk.downloader omw-1.4 # to support more feature calculation

or from source:

# This is suitable for SDK developers
pip install --upgrade pip
git clone git@github.com:ExpressAI/DataLab.git
cd DataLab
pip install -e ".[dev]"
python -m nltk.downloader omw-1.4 # to support more feature calculation

Adding [dev] installs some extra libraries, such as pre-commit.
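To confirm that the installation succeeded, a quick check along the following lines may help. This is a minimal sketch that uses only the Python standard library; the helper name installed_version is hypothetical and is not part of DataLab.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "datalabs" is the PyPI distribution name used in the commands above.
    found = installed_version("datalabs")
    print(f"datalabs version: {found}" if found else "datalabs is not installed")
```

Looking the package up through its metadata avoids importing it, so the check stays fast even if the package itself is slow to load.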

Code Quality Check

If you would like to contribute to DataLab, checking the code style and quality before opening your pull request is highly recommended. This project expects three types of checks: (1) black, (2) flake8, and (3) isort.

You can run them in two ways:

Manually (suitable for developers using GitHub Desktop)
pre-commit install
git init .
pre-commit run --all-files

where pre-commit run --all-files is equivalent to

pre-commit run black --all-files   # (this is also equivalent to python -m black .)
pre-commit run isort --all-files   # (this is also equivalent to isort .)
pre-commit run flake8 --all-files  # (this is also equivalent to flake8)

Notably, black and isort fix code style automatically, while flake8 only reports issues, so anything it raises has to be fixed by hand.
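If you prefer to see which of the three checks would fail without letting black or isort rewrite anything, a small wrapper like the one below is one option. It is a sketch, not part of DataLab: the names CHECKS and run_checks are hypothetical, and it assumes black, isort, and flake8 are installed in the current environment (for example through the [dev] extra). The --check and --check-only flags put black and isort in report-only mode.

```python
import subprocess
import sys
from typing import Dict, List

# Check-only invocations: none of these commands modifies files.
CHECKS: Dict[str, List[str]] = {
    "black": [sys.executable, "-m", "black", "--check", "."],
    "isort": [sys.executable, "-m", "isort", "--check-only", "."],
    "flake8": [sys.executable, "-m", "flake8", "."],
}


def run_checks(checks: Dict[str, List[str]]) -> Dict[str, bool]:
    """Run each command and map its name to True if it exited with code 0."""
    results = {}
    for name, command in checks.items():
        completed = subprocess.run(command, capture_output=True, text=True)
        results[name] = completed.returncode == 0
    return results
```

Calling run_checks(CHECKS) from the repository root returns a dictionary such as {"black": True, ...}, where False marks a check that still needs attention.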

Automatically (suitable for developers using Git CLI)
pre-commit install
git init .
git commit -m "your update message"

Once the hook is installed, git commit automatically triggers pre-commit, which runs the checks on the staged files.

Using DataLab

Below are several simple examples that showcase the usage of DataLab. You can also view the documentation.

# pip install datalabs
from datalabs import load_dataset
dataset = load_dataset("ag_news")


# Preprocessing operation
from preprocess import *
res = dataset["test"].apply(lower)
print(next(res))

# Featurizing operation
from featurize import *
res = dataset["test"].apply(get_text_length) # get length
print(next(res))

res = dataset["test"].apply(get_entities_spacy) # get entity
print(next(res))

# Editing/Transformation operation
from edit import *
res = dataset["test"].apply(change_person_name) #  change person name
print(next(res))

# Prompting operation
from prompt import *
res = dataset["test"].apply(template_tc1)
print(next(res))

# Aggregating operation
from aggregate.text_classification import *
res = dataset["test"].apply(get_statistics)
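The examples above call next(res), which suggests that apply returns an iterator yielding one result per sample. Under that assumption, itertools.islice is a convenient way to look at the first few results without consuming the whole split. The sketch below uses a stand-in generator, fake_apply, in place of a real dataset so that it runs without downloading anything; both fake_apply and peek are hypothetical helpers, not DataLab functions.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List


def fake_apply(samples: Iterable[Dict], func: Callable[[str], str]) -> Iterator[Dict]:
    """Hypothetical stand-in for dataset["test"].apply: lazily map func over the text field."""
    for sample in samples:
        yield {**sample, "text": func(sample["text"])}


def peek(results: Iterator, n: int = 3) -> List:
    """Collect up to n items from an iterator, leaving the rest unconsumed."""
    return list(islice(results, n))


samples = [{"text": "Wall St. Bears"}, {"text": "Oil Prices"}, {"text": "Stocks End Up"}]
res = fake_apply(samples, str.lower)
first_two = peek(res, 2)
print(first_two)
```

If apply does return an iterator, peek(dataset["test"].apply(lower), 5) would work the same way; items that have been peeked at are consumed, so a later next(res) continues from the following sample.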

Acknowledgment

DataLab originated from a fork of the awesome Huggingface Datasets and TensorFlow Datasets. We are very grateful to the Huggingface and TensorFlow Datasets teams for building these amazing libraries. More details on the differences between DataLab and them can be found in the section. We thank Antonis Anastasopoulos for sharing the mapping data between countries and languages, and Alissa Ostapenko, Yulia Tsvetkov, Jie Fu, Ziyun Xu, Hiroaki Hayashi, and Zhengfu He for useful discussions and suggestions on the first version.

