
A library for squeakily cleaning and filtering language datasets.


squeakily

This repository is heavily inspired by BigScience’s ROOTS project and EleutherAI’s The Pile.

The overall pipeline is as follows:

```mermaid
flowchart LR
  A(Defining <br/>Datasources) --> B(Defining Filters <br/>per Datasource)
  B --> C(Defining Cleaners <br/>per Datasource)
```

In this library, we define filtering as data instances being removed from the dataset based on some criteria and cleaning as data instances being modified in some way.
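The distinction can be illustrated in plain Python. The two functions below are hypothetical stand-ins for this example, not squeakily's own implementations:

```python
# Hypothetical stand-ins illustrating the filter/cleaner distinction.

def not_too_repetitive(text: str, max_ratio: float = 0.3) -> bool:
    """Filter: decide whether to KEEP a document (True = keep)."""
    if not text:
        return False
    most_common = max(text.count(ch) for ch in set(text))
    return most_common / len(text) <= max_ratio

def squash_whitespace(text: str) -> str:
    """Cleaner: return a MODIFIED copy of the document."""
    return " ".join(text.split())

docs = ["aaaaaaaaaa", "hello   world"]
kept = [squash_whitespace(d) for d in docs if not_too_repetitive(d)]
# "aaaaaaaaaa" is removed (filtered); "hello   world" becomes "hello world" (cleaned)
```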

Install

pip install squeakily

How to use

Using the API

First, we need to define a datasource. squeakily accepts any Dataset object from the HuggingFace Datasets library. For example, we can use the wikitext dataset:

```python
from datasets import load_dataset

ds = load_dataset("wikitext", "wikitext-103-v1", split="train[:10%]")
```

Each datasource is simply a dictionary wrapping the Dataset object together with the columns to process and the filters and cleaners to apply. For example:

from squeakily.filter import check_char_repetition, check_flagged_words
from squeakily.clean import remove_empty_lines, normalize_whitespace

```python
datasources = [
    {
        "dataset": ds,
        "columns": ["text"],
        "filters": [check_char_repetition, check_flagged_words],
        "cleaners": [remove_empty_lines, normalize_whitespace],
    },
    # ...
]
```

Warning

The order of filters and cleaners matters: both are applied in the order they are defined.
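To see why order matters, consider two toy cleaners (hypothetical, not squeakily's): one that turns tabs into spaces and one that collapses runs of spaces. Swapping their order changes the result:

```python
import re

def tabs_to_spaces(text: str) -> str:
    # Replace each tab with a single space.
    return text.replace("\t", " ")

def collapse_spaces(text: str) -> str:
    # Collapse runs of two or more spaces into one (tabs untouched).
    return re.sub(r" {2,}", " ", text)

doc = "a\t\tb    c"

# Tabs first, then collapse: everything normalizes cleanly.
out1 = collapse_spaces(tabs_to_spaces(doc))   # "a b c"

# Collapse first, then tabs: the run of tabs survives as a run of spaces.
out2 = tabs_to_spaces(collapse_spaces(doc))   # "a  b c"
```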

Important

Currently, only the first of the given column names is used. This is because squeakily is designed for language datasets, which usually have a single column of text. Future versions will support multiple columns.

Finally, we can apply the filters and cleaners to the datasources using a Pipeline object:

```python
from squeakily.core import Pipeline

pipeline = Pipeline(datasources)
pipeline.run()
```

Note

To run cleaners before filters, pass cleaning_first=True to run:

```python
pipeline.run(cleaning_first=True)
```
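Conceptually, the flag controls which phase runs first. The sketch below operates on a plain list of strings to show the effect; it is a simplified illustration, not squeakily's actual implementation:

```python
def run(docs, filters, cleaners, cleaning_first=False):
    # Filters drop documents; cleaners rewrite them. Each phase applies
    # its functions in the order they are listed.
    def filter_phase(ds):
        for f in filters:
            ds = [d for d in ds if f(d)]
        return ds

    def clean_phase(ds):
        for c in cleaners:
            ds = [c(d) for d in ds]
        return ds

    if cleaning_first:
        return filter_phase(clean_phase(docs))
    return clean_phase(filter_phase(docs))

# Cleaning first can change which documents survive filtering:
docs = ["   ", "ok"]
non_empty = lambda d: len(d) > 0
strip = lambda d: d.strip()

run(docs, [non_empty], [strip])                      # ["", "ok"]: "   " passes the filter, then is stripped
run(docs, [non_empty], [strip], cleaning_first=True) # ["ok"]: "   " is stripped to "" and dropped
```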
