Curate, annotate, and clean massive unstructured text datasets for machine learning and AI systems.


Galactic



Galactic provides cleaning and curation tools for massive unstructured text datasets. It's designed to help you curate fine-tuning datasets, create document collections for retrieval-augmented generation (RAG), and even perform deduplication of web-scale datasets for LLM pre-training. This README provides a non-exhaustive overview of what Galactic can do; for more details, check out the API reference.

Galactic is made available under the Apache 2.0 license.

Getting Started

To get started, install the package (pip install galactic-ai) and import it:

from galactic import GalacticDataset

Galactic supports the generic data-processing methods you'll recognize if you've ever used HuggingFace datasets, like map, filter, and select. Galactic also provides a higher-level API for common workflows in text data curation, such as computing embeddings, detecting languages, scrubbing PII, labeling columns with AI, and performing large-scale deduplication. Let's dive in!
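
For a quick feel for the generic methods, here's a minimal sketch, assuming they accept per-example functions the way their HuggingFace datasets counterparts do (given some already-loaded dataset with a "content" field):

# assuming map/filter/select mirror HuggingFace datasets semantics
dataset = dataset.filter(lambda x: len(x["content"]) > 0)        # drop empty docs
dataset = dataset.map(lambda x: {"n_chars": len(x["content"])})  # add a derived column
subset = dataset.select(range(100))                              # take the first 100 examples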

✨ NEW! (v0.2.4) AI data labeling & classifier distillation ✨

Galactic now supports using AI to augment and enrich your datasets. Classify a text column into fixed categories with ai_classifier, do open-ended processing like cleaning or summarization with ai_column, or tag multiple attributes with ai_tagger. When you hit your budget for sending API calls to OpenAI and still want to go bigger, train a fasttext_classifier or embeddings_classifier to distill your AI- or human-written labels into a fast classifier without leaving your notebook.
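
Here's a rough sketch of that flow. The method names come from the description above, but every parameter name below is an assumption; check the API reference for the real signatures.

# hypothetical parameters throughout -- see the API reference
dataset.ai_classifier(
    name="topic",                      # hypothetical: column to write labels into
    field="content",
    classes=["news", "code", "chat"],  # hypothetical label set
)
# distill the AI labels into a fast local classifier
dataset.train_fasttext_classifier(     # hypothetical name and signature
    input_field="content",
    label_field="topic",
)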

See the OpenHermes example notebook for a full demonstration, where we use OpenAI to label a few thousand examples, distill a speedy classifier, and then use that classifier to label 100k+ examples in 1 minute. All the AI labeling and distillation methods are documented in the API reference as well.

Loading Data

Galactic can load datasets from typical file formats (CSV, JSONL, Parquet), as well as from HuggingFace. If you're loading a massive dataset from HuggingFace, you can even filter data as it streams in, so you never load duplicates and only hang on to the data you want. For instance, here we'll load the Falcon RefinedWeb dataset, automatically deduplicate it, and keep only examples with fewer than 1024 characters, stopping once 5000 samples meet our requirements.

# keep only examples shorter than 1024 characters
filter_func = lambda x: len(x['content']) < 1024

dataset = GalacticDataset.from_hugging_face_stream(
    "tiiuae/falcon-refinedweb",
    split="train",
    filters=[filter_func],     # applied to each example as it streams in
    dedup_fields=["content"],  # skip exact duplicates on the fly
    max_samples=5000
)

Understanding the Data

Galactic is designed to help you understand unstructured text datasets. Let's start with some basics: we'll get the lengths of the texts using our tokenizer of choice, detect the language of the text, and scan for PII.

import matplotlib.pyplot as plt
dataset = dataset.count_tokens(fields=["content"], tokenizer="gpt2")
plt.hist(dataset["__token_count__content"], bins=50);

[Histogram of token counts for the content field]

from collections import Counter
dataset.detect_language(field="content")
Counter(dataset["__language"])
INFO: Detected language in field content, added language metadata to '__language'.

Counter({'en': 4975,
         'es': 7,
         'fr': 7,
         'de': 3,
         'da': 2,
         'ru': 1,
         'nl': 1,
         'pt': 1,
         'sh': 1,
         'eo': 1,
         'ceb': 1})
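
Metadata like __language lives in ordinary columns, so you can combine it with the generic methods. For example, a plausible way to keep only the English examples (assuming filter takes a per-example predicate, as sketched earlier):

dataset = dataset.filter(lambda x: x["__language"] == "en")
len(dataset)  # 4975, per the Counter above
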
dataset.detect_pii(
    fields=["content"]
)
print("Email:", sum(dataset["__pii__email"]))
print("Phone:", sum(dataset["__pii__phone"]))
print("Username/Password:", sum(dataset["__pii__credential"]))
INFO: Detected PII in fields: ['content']; added __pii__email, __pii__phone, __pii__credential, and __pii__any metadata.


Email: 285
Phone: 242
Username/Password: 9

Custom Tagging & Filtering

The built-in functions are just to get you started: Galactic lets you tag and filter your data however you want. For instance, here we'll do the following:

  • Filter out all examples that have "blogspot" in the URL.
  • Tag all examples that mention fruit or vegetables.
dataset = dataset.filter_string(
    fields=["url"],
    values=["blogspot"]
)
len(dataset)
INFO: Filtered dataset in-place with exact string matching on fields: ['url']

4902
dataset = dataset.tag_regex(
    fields=["content"],
    regex="fruit|Fruit|vegetable|Vegetable|veggie|Veggie",
    tag="fruit_veggie"
)
f'{sum(dataset["__tag__fruit_veggie"])} records tagged with __tag__fruit_veggie'
INFO: Tagged dataset in-place with tag '__tag__fruit_veggie', using regex matching on fields: ['content']

'38 records tagged with __tag__fruit_veggie'

Embeddings & Clustering

Text embeddings are a great way to explore and understand unstructured data. Galactic can compute embeddings right on your CPU with the gte-small model, and then use them to cluster and deduplicate your data. (You also have the option to use OpenAI API embeddings as the backend; they're faster, but each embedding is larger, and you have to provide an API key.) On my Intel MacBook (no fancy M1 or M2), I can compute 1000 embeddings in a couple of minutes. Longer texts take longer, since they have to be chunked into 512-token segments.

# to use openai, set dataset.openai_api_key = [...], and use backend="openai"
dataset.get_embeddings(field="content", backend="cpu")
INFO: Created embeddings on field 'content'

Once we've computed embeddings for a dataset, we can cluster it with k-means. Clusters can help discover domains in the data, or subsets that we might want to remove. They can also be used downstream for intra-cluster semantic deduplication (i.e. removing examples that share a cluster and are very close by in the embedding space).

dataset.cluster(n_clusters=10)
dataset.get_cluster_info(field="content")
Cluster 0 (550 items)
Cluster 1 (549 items)
Cluster 2 (673 items)
Cluster 3 (468 items)
Cluster 4 (403 items)
Cluster 5 (592 items)
Cluster 6 (290 items)
Cluster 7 (616 items)
Cluster 8 (461 items)
Cluster 9 (300 items)

Semantic deduplication within clusters is carried out with semdedup (inspired by this paper). You can provide a target retention rate specifying roughly what fraction of the data to keep (the similarity threshold will be tuned to achieve that rate), or you can provide a cosine similarity threshold directly; pairs within a cluster whose similarity exceeds the threshold are considered duplicates.

dataset.semdedup(target_retention=0.75)
INFO: Tuning threshold on 3 clusters...
INFO: Threshold: 0.92
INFO: Cluster 0 has 106 duplicates (19.3%).
INFO: Cluster 1 has 133 duplicates (24.2%).
INFO: Cluster 2 has 489 duplicates (72.7%).
INFO: Cluster 3 has 125 duplicates (26.7%).
INFO: Cluster 4 has 96 duplicates (23.8%).
INFO: Cluster 5 has 354 duplicates (59.8%).
INFO: Cluster 6 has 80 duplicates (27.6%).
INFO: Cluster 7 has 96 duplicates (15.6%).
INFO: Cluster 8 has 174 duplicates (37.7%).
INFO: Cluster 9 has 25 duplicates (8.3%).
INFO: Removed 1678 / 4902 items flagged as semantic near-duplicates (34.23%).
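
If you'd rather pin the similarity cutoff yourself, the threshold mode described above might look like the following; the parameter name is an assumption, so check the API reference.

dataset.semdedup(threshold=0.92)  # hypothetical parameter name; pairs above 0.92 cosine similarity count as duplicates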

Saving the Result

Finally, let's save this data, either for more Galactic goodness later on (who wants to compute those embeddings again?) or for downstream use in retrieval or fine-tuning.

dataset.save("my_dataset.jsonl")
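
To pick the dataset back up later, a round-trip loader in the spirit of the file-format support described earlier might look like this; the constructor name is an assumption, so check the API reference.

dataset = GalacticDataset.from_jsonl("my_dataset.jsonl")  # hypothetical loader name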

What's Next?

This first release is just a taste of what we have planned for Galactic. Here's what you have to look forward to:

  • AI Data Labeling: Use API language models like OpenAI's, or small local models, to automatically tag or filter your data for you (a first version shipped in v0.2.4, described above). We'll also provide more fast classifiers (as with language identification) to do things like flag SEO spam, or detect whether a sample is text or source code.
  • More Powerful Deduplication: We will add MinHash LSH to remove near-duplicates before you even have to compute embeddings (the sketch after this list shows the core idea). We will also add support for D4, which follows semantic deduplication with a "diversification" step, keeping data that's further from cluster centroids.
  • Scaling: The features we've built so far can handle thousands or even hundreds of thousands of examples without breaking a sweat, but for true web-scale data processing, local embeddings start to feel slow, and memory becomes precious. We're working on features (like smaller, faster embeddings) to allow Galactic to scale gracefully to millions of examples.
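
To preview the MinHash LSH idea from the deduplication bullet, here's a self-contained sketch using the third-party datasketch library (not part of Galactic): each document's token set is hashed into a compact signature, and locality-sensitive hashing surfaces candidate near-duplicates without computing any embeddings.

from datasketch import MinHash, MinHashLSH

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the very lazy dog",  # near-duplicate of "a"
    "c": "an entirely unrelated sentence about embeddings",
}

# index a MinHash signature of each document's token set
lsh = MinHashLSH(threshold=0.8, num_perm=128)
signatures = {}
for key, text in docs.items():
    m = MinHash(num_perm=128)
    for token in set(text.split()):
        m.update(token.encode("utf8"))
    signatures[key] = m
    lsh.insert(key, m)

# query candidate near-duplicates of "a"; LSH is probabilistic, but this
# will almost surely return ["a", "b"] (order may vary)
print(lsh.query(signatures["a"]))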

If you like what we're doing, throw us a star on GitHub (or even better, contribute!), and stay tuned for more.
