squeakily

A library for squeakily cleaning and filtering language datasets.

This repository is heavily inspired by BigScience’s ROOTS project and EleutherAI’s The Pile.

The overall pipeline consists of two kinds of operations. In this library, we define filtering as removing data instances from the dataset based on some criterion, and cleaning as modifying data instances in some way.
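To make the distinction concrete, here is a minimal sketch of the two kinds of functions (the names and signatures below are illustrative only, not part of squeakily’s API):

# Illustrative sketch only -- not squeakily's actual API.
def keep_reasonably_sized(text):
    # A filter: decides whether a document stays in the dataset.
    return 10 < len(text) < 100_000

def strip_outer_whitespace(text):
    # A cleaner: returns a modified version of the document.
    return text.strip()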

Install

pip install squeakily

How to use

Using the API

First, we need to define a datasource. squeakily accepts any Dataset object from the HuggingFace Datasets library. For example, we can use the wikitext dataset:

from datasets import load_dataset

ds = load_dataset("wikitext", "wikitext-103-v1", split="train[:1%]")

Next, we wrap each Dataset object in a dictionary that specifies the dataset itself, the columns to process, and the filters and cleaners to apply. For example:

from squeakily.filter import check_char_repetition, check_flagged_words
from squeakily.clean import remove_empty_lines, normalize_whitespace

datasources = [
    {
        "dataset": ds,
        "columns": ["text"],
        "filters": [check_char_repetition, check_flagged_words],
        "cleaners": [remove_empty_lines, normalize_whitespace],
    },
    # ...
]
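
Because filters and cleaners are ordinary functions applied to a document’s text, you can mix your own into the same lists. Below is a hypothetical minimum-length filter (check_min_length is our own helper, not a squeakily built-in, and the exact signature squeakily expects for custom filters may differ):

def check_min_length(text, min_length=50):
    # Keep only documents with at least min_length characters.
    return len(text) >= min_length

datasources = [
    {
        "dataset": ds,
        "columns": ["text"],
        "filters": [check_min_length, check_char_repetition],
        "cleaners": [remove_empty_lines, normalize_whitespace],
    },
]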

Warning

The order of the filters and cleaning functions matters: filters and cleaners are applied in the order they are defined.

Important

As of now, only the first of the given column names is used. This is because squeakily is designed to work with language datasets, which usually have a single column of text. Future versions will support multiple columns.
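
In the meantime, if a dataset spreads its text across several columns, one workaround is to merge them into a single column up front with HuggingFace Datasets’ map (the "title" and "body" column names below are hypothetical):

# Merge multiple text columns into one before handing the dataset to squeakily.
# "title" and "body" are hypothetical column names for illustration.
ds = ds.map(lambda row: {"text": row["title"] + "\n\n" + row["body"]})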

Finally, we can apply the filters and cleaners to the datasources using a Pipeline object:

from squeakily.core import Pipeline

pipeline = Pipeline(datasources)
pipeline.run()
[11/16/22 04:32:57] INFO     Running datasource: wikitext                                                core.py:41
                    INFO     Running filter: check_char_repetition on text                               core.py:54
                    INFO     Running filter: check_flagged_words on text                                 core.py:54
                    INFO     Running cleaner: remove_empty_lines on text                                 core.py:57
[11/16/22 04:32:59] INFO     Running cleaner: normalize_whitespace on text                               core.py:57

Note

If you want to run the cleaners first, pass cleaning_first=True to the run function.

pipeline.run(cleaning_first=True)

If you need to run a filter or cleaner at the dataset level rather than the example level, you can pass global_filters or global_cleaners to the Pipeline.run function. For example:

from squeakily.filter import minhash_dedup

pipeline.run(global_filters=[minhash_dedup])

Note

If you use global filters or cleaners, all datasets must share a common column name so that they can be properly concatenated.
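
If your datasets use different text column names, you can rename them to a shared name before building the datasources, for example with HuggingFace Datasets’ rename_column (here "content" is a hypothetical original column name):

# Align column names across datasets so global filters can concatenate them.
other_ds = other_ds.rename_column("content", "text")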

Note

You can also specify that a particular dataset should be skipped by setting the skip_global parameter to True when defining the datasource.

datasources = [
    {
        "dataset": ds,
        "columns": ["text"],
        "filters": [check_char_repetition, check_flagged_words],
        "cleaners": [remove_empty_lines, normalize_whitespace],
        "skip_global": True,
    },
    # ...
]

Additionally, you can run the pipeline in dry-run mode by passing dry_run=True to the run function. This makes no modifications to the datasets’ documents, but adds extra columns holding the results of the filters and cleaners. For example, if you ran the pipeline with the check_char_repetition filter, you would get a new column called check_char_repetition containing a float between 0 and 1: the fraction of characters in the document that are repeated.

pipeline = Pipeline(datasources)
pipeline.run(dry_run=True)
pipeline.datasources[0]["dataset"].features
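
Because a dry run only annotates the documents, you can inspect the new columns and choose thresholds yourself before a destructive run. For example, here is a sketch that keeps documents whose character-repetition score is below 0.2 (the threshold and the direction of the comparison are illustrative assumptions, not squeakily defaults):

# Keep documents below an illustrative repetition threshold.
annotated = pipeline.datasources[0]["dataset"]
kept = annotated.filter(lambda row: row["check_char_repetition"] < 0.2)
print(f"kept {len(kept)} of {len(annotated)} documents")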
