A library for squeakily cleaning and filtering language datasets.
squeakily
This repository is heavily inspired by BigScience’s ROOTS project and EleutherAI’s The Pile.
The overall pipeline is as follows:
flowchart LR
A(Defining <br/>Datasources) --> B(Defining Filters <br/>per Datasource)
B --> C(Defining Cleaners <br/>per Datasource)
In this library, filtering means removing data instances from the dataset based on some criteria, and cleaning means modifying data instances in some way.
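To make the distinction concrete, here is a minimal sketch in plain Python that does not use squeakily at all. The helpers `is_long_enough` and `collapse_spaces` are hypothetical names chosen for illustration, not library functions. The sketch assumes a filter can be modeled as a predicate that decides whether an instance stays, and a cleaner as a function that returns a modified instance.

```python
# Toy illustration only: these helpers are hypothetical, not squeakily APIs.

def is_long_enough(text: str, min_words: int = 3) -> bool:
    """Filter: return True to keep the instance, False to drop it."""
    return len(text.split()) >= min_words


def collapse_spaces(text: str) -> str:
    """Cleaner: return a modified copy of the instance."""
    return " ".join(text.split())


records = ["too short", "this   line has    extra   spaces", ""]

# Filtering changes how many instances there are.
kept = [r for r in records if is_long_enough(r)]

# Cleaning changes the contents of the instances that remain.
cleaned = [collapse_spaces(r) for r in kept]

print(cleaned)
```

Filtering leaves a single record here, and cleaning then rewrites that record without changing the count.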
Install
pip install squeakily
How to use
Using the API
First, we need to define a datasource. squeakily accepts any Dataset object from the HuggingFace Datasets library. For example, we can use the wikitext dataset:
from datasets import load_dataset
ds = load_dataset("wikitext", "wikitext-103-v1", split="train[:10%]")
We then describe each datasource with a dictionary that holds the Dataset object under the "dataset" key, the names of the columns to process under "columns", and the filters and cleaners to apply under "filters" and "cleaners". For example:
from squeakily.filter import check_char_repetition, check_flagged_words
from squeakily.clean import remove_empty_lines, normalize_whitespace
datasources = [
    {
        "dataset": ds,
        "columns": ["text"],
        "filters": [check_char_repetition, check_flagged_words],
        "cleaners": [remove_empty_lines, normalize_whitespace],
    },
    # ...
]
Warning
Note: The order of the filters and cleaning functions matters. Filters and cleaners are applied in the order they are defined.
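A small sketch of why the order matters, using two stand-in cleaners written in plain Python. `strip_each_line`, `drop_empty_lines`, and `apply_in_order` are hypothetical helpers for this example only; they are not squeakily's cleaners and may behave differently from them.

```python
# Toy illustration only: these helpers are hypothetical, not squeakily APIs.

def strip_each_line(text: str) -> str:
    """Remove leading and trailing whitespace from every line."""
    return "\n".join(line.strip() for line in text.split("\n"))


def drop_empty_lines(text: str) -> str:
    """Remove lines that are exactly empty (whitespace-only lines survive)."""
    return "\n".join(line for line in text.split("\n") if line != "")


def apply_in_order(text: str, cleaners) -> str:
    """Apply each cleaner in the order given, feeding each result to the next."""
    for cleaner in cleaners:
        text = cleaner(text)
    return text


sample = "a\n   \nb"  # the middle line contains only spaces

strip_then_drop = apply_in_order(sample, [strip_each_line, drop_empty_lines])
drop_then_strip = apply_in_order(sample, [drop_empty_lines, strip_each_line])

print(repr(strip_then_drop))
print(repr(drop_then_strip))
```

Stripping first turns the whitespace-only line into an empty one that the second cleaner can remove. In the opposite order the line is not yet empty when `drop_empty_lines` runs, so a blank line is left behind.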
Important
Note: As of now, we only use the first column of the given column names. This is because the squeakily library is designed to work with language datasets, which usually have a single column of text. Future versions will support multiple columns.
Finally, we can apply the filters and cleaners to the datasources using a Pipeline object:
from squeakily.core import Pipeline
pipeline = Pipeline(datasources)
pipeline.run()
Note
Note: If you want to run cleaners first, you can pass cleaning_first=True to the run function:
pipeline.run(cleaning_first=True)
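The effect of running cleaners before filters can be sketched with a toy pipeline over a plain list of strings. This is not how squeakily's Pipeline is implemented; `run_toy_pipeline`, `has_min_chars`, and `collapse_whitespace` are hypothetical names, and the sketch assumes a filter returns True for instances that should be kept.

```python
# Toy model of the two orderings; not squeakily's actual implementation.

def run_toy_pipeline(records, filters, cleaners, cleaning_first=False):
    """Apply filters and cleaners to a list of strings in the chosen order."""

    def apply_filters(items):
        # Keep an item only if every filter accepts it.
        return [r for r in items if all(f(r) for f in filters)]

    def apply_cleaners(items):
        # Run every cleaner, in order, on every item.
        out = []
        for r in items:
            for c in cleaners:
                r = c(r)
            out.append(r)
        return out

    if cleaning_first:
        steps = [apply_cleaners, apply_filters]
    else:
        steps = [apply_filters, apply_cleaners]

    for step in steps:
        records = step(records)
    return records


def has_min_chars(text: str) -> bool:
    """Hypothetical filter: keep texts with at least 5 characters."""
    return len(text) >= 5


def collapse_whitespace(text: str) -> str:
    """Hypothetical cleaner: squeeze runs of whitespace into one space."""
    return " ".join(text.split())


records = ["a      b", "hello   world"]

default_order = run_toy_pipeline(records, [has_min_chars], [collapse_whitespace])
cleaned_first = run_toy_pipeline(
    records, [has_min_chars], [collapse_whitespace], cleaning_first=True
)

print(default_order)
print(cleaned_first)
```

In the default order the first record passes the length filter because of its padding and is then cleaned. With cleaning first, the padding is removed before the filter runs, so the record falls below the threshold and is dropped.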