Curate, annotate, and clean massive unstructured text datasets for machine learning and AI systems.

Galactic

Cleaning and curation tools for massive unstructured text datasets.

To get started, install the package (pip install galactic-ai) and import it:

from galactic import GalacticDataset

Loading and Saving Data

from_csv

@classmethod
def from_csv(cls, path: str) -> 'GalacticDataset':

Parameters:

  • path (str): The path to the CSV file.

Returns:

  • GalacticDataset: A dataset instance initialized from the CSV file.

Example:

ds = GalacticDataset.from_csv("data.csv")

from_jsonl

@classmethod
def from_jsonl(cls, path: str, **kwargs) -> 'GalacticDataset':

Parameters:

  • path (str): The path to the JSONL file.
  • **kwargs: Additional parameters passed to datasets.load_dataset.

Returns:

  • GalacticDataset: A dataset instance initialized from the JSONL file.

Example:

ds = GalacticDataset.from_jsonl("data.jsonl")

from_parquet

@classmethod
def from_parquet(cls, path: str) -> 'GalacticDataset':

Parameters:

  • path (str): The path to the Parquet file.

Returns:

  • GalacticDataset: A dataset instance initialized from the Parquet file.

Example:

ds = GalacticDataset.from_parquet("data.parquet")

from_pandas

@classmethod
def from_pandas(cls, df, **kwargs) -> 'GalacticDataset':

Parameters:

  • df: A Pandas DataFrame.
  • **kwargs: Additional parameters passed to datasets.Dataset.from_pandas.

Returns:

  • GalacticDataset: A dataset instance initialized from the DataFrame.

Example:

import pandas as pd
df = pd.read_csv("data.csv")
ds = GalacticDataset.from_pandas(df)

from_hugging_face

@classmethod
def from_hugging_face(
    cls,
    path: str,
    split: str,
    config_name: Optional[str] = None,
    **kwargs
) -> 'GalacticDataset':

Parameters:

  • path (str): The identifier of the Hugging Face dataset.
  • split (str): The desired split ('train', 'validation', 'test').
  • config_name (str, optional): Specific dataset configuration name. (For C4, for example, a config name such as en or realnewslike is required.)
  • **kwargs: Additional parameters passed to datasets.load_dataset.

Returns:

  • GalacticDataset: A dataset instance initialized from the Hugging Face dataset.

Example:

ds = GalacticDataset.from_hugging_face("squad", split="train")
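
For datasets with multiple configurations (such as C4, noted above), pass config_name as well; the identifier and config below follow that note:

ds = GalacticDataset.from_hugging_face("c4", split="train", config_name="en")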

from_hugging_face_stream

@classmethod
def from_hugging_face_stream(
    cls,
    path: str,
    split: str,
    config_name: Optional[str] = None,
    filters: list[Callable[[dict], bool]] = [],
    dedup_fields: Optional[list[str]] = None,
    max_samples: Optional[int] = 200000,
    **kwargs
) -> 'GalacticDataset':

Parameters:

  • path (str): The identifier of the Hugging Face dataset.
  • split (str): The desired split ('train', 'validation', 'test').
  • config_name (str, optional): Specific dataset configuration name. (For C4, for example, a config name such as en or realnewslike is required.)
  • filters (list[Callable], optional): List of filter functions to apply.
  • dedup_fields (list[str], optional): Fields to check for duplicates.
  • max_samples (int, optional): Maximum number of samples to load, after filtering.
  • **kwargs: Additional parameters passed to HuggingFace datasets.load_dataset.

Returns:

  • GalacticDataset: A dataset instance initialized from the Hugging Face dataset.

Example:

filters = [lambda x: len(x['context']) > 1000]
ds = GalacticDataset.from_hugging_face_stream("squad", split="train", filters=filters)
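
The streaming options compose. As a sketch, here is a streamed load combining a config name, deduplication, and a sample cap (the C4 identifier and its text field follow the config note above; the cap is illustrative):

ds = GalacticDataset.from_hugging_face_stream(
    "c4",
    split="train",
    config_name="en",
    dedup_fields=["text"],
    max_samples=100000,
)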

save

def save(self, path: str, overwrite: bool = False) -> None:

Parameters:

  • path (str): The path to save the dataset to.
  • overwrite (bool, optional): Whether to overwrite the file if it already exists.

Returns:

  • None

Example:

ds.save("data.parquet")
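
By default an existing file is not overwritten; pass overwrite=True to replace it:

ds.save("data.parquet", overwrite=True)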

Filtering Data

apply_bloom_filter

A Bloom filter is a memory-efficient, probabilistic data structure for exact deduplication. It lets you deduplicate in a single pass over the dataset rather than comparing all possible pairs. It guarantees no false negatives (all actual duplicates will be removed), but there is a small probability of false positives, meaning a small number of non-duplicates may also be removed.

def apply_bloom_filter(self, fields: Sequence[str], inplace: bool = True) -> 'GalacticDataset':

Parameters:

  • fields (Sequence[str]): List of fields to apply the Bloom filter on. Two records will be considered duplicates if they match all of these fields.
  • inplace (bool, default=True): Whether to modify the dataset in-place.

Returns:

  • GalacticDataset: Modified dataset with filtered records.

Example:

ds.apply_bloom_filter(['text_field'])
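
For intuition, here is a minimal single-pass sketch of the same idea using an exact hash set. A real Bloom filter replaces the set with a fixed-size bit array probed by several hash functions, which is what makes it memory-efficient and introduces the small false-positive rate; this sketch is illustrative only, not the library's implementation:

import hashlib

seen = set()

def is_duplicate(record: dict, fields: list[str]) -> bool:
    # Hash the concatenation of the chosen fields; identical field
    # values always produce the same key, so no true duplicate is missed.
    key = hashlib.sha256(
        "\x00".join(str(record[f]) for f in fields).encode()
    ).hexdigest()
    if key in seen:
        return True
    seen.add(key)
    return False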

filter_string

def filter_string(self, fields: Sequence[str], values: Sequence[str], inplace: bool = True) -> 'GalacticDataset':

Parameters:

  • fields (Sequence[str]): List of fields to apply the filter on.
  • values (Sequence[str]): List of string values to filter out.
  • inplace (bool, default=True): Whether to modify the dataset in-place.

Returns:

  • GalacticDataset: Modified dataset with filtered records.

Example:

ds.filter_string(['text_field'], ['exclude_this', 'and_this'])

filter_regex

def filter_regex(self, fields: Sequence[str], regex: str, inplace: bool = True) -> 'GalacticDataset':

Parameters:

  • fields (Sequence[str]): List of fields to apply the regex-based filter on.
  • regex (str): The regex pattern to filter out.
  • inplace (bool, default=True): Whether to modify the dataset in-place.

Returns:

  • GalacticDataset: Modified dataset with filtered records.

Example:

ds.filter_regex(['text_field'], r'\d+')

Processing Data

trim_whitespace

def trim_whitespace(self, fields: Sequence[str], inplace: bool = True) -> 'GalacticDataset':

Parameters:

  • fields (Sequence[str]): List of fields to trim whitespace for.
  • inplace (bool, default=True): Whether to modify the dataset in-place.

Returns:

  • GalacticDataset: Modified dataset with trimmed fields.

Example:

ds.trim_whitespace(['text_field'])

tag_string

def tag_string(self, fields: Sequence[str], values: Sequence[str], tag: str) -> 'GalacticDataset':

Parameters:

  • fields (Sequence[str]): List of fields to apply the tag.
  • values (Sequence[str]): List of values to tag.
  • tag (str): The tag to be applied.

Returns:

  • GalacticDataset: Modified dataset with new tags.

Example:

ds.tag_string(['text_field'], ['value1', 'value2'], 'my_tag')

tag_regex

def tag_regex(self, fields: Sequence[str], regex: str, tag: str) -> 'GalacticDataset':

Parameters:

  • fields (Sequence[str]): List of fields to apply the regex-based tag.
  • regex (str): The regex pattern.
  • tag (str): The tag to be applied.

Returns:

  • GalacticDataset: Modified dataset with new tags.

Example:

ds.tag_regex(['text_field'], r'\d+', 'contains_number')

detect_language

def detect_language(self, field: str) -> 'GalacticDataset':

Parameters:

  • field (str): Field to detect the language for.

Returns:

  • GalacticDataset: Modified dataset with detected languages.

Example:

ds.detect_language('text_field')

calc_perplexity

def calc_perplexity(self, field: str) -> 'GalacticDataset':

Parameters:

  • field (str): Field to calculate the perplexity for.

Returns:

  • GalacticDataset: Modified dataset with calculated perplexities.

Example:

ds.calc_perplexity('text_field')

detect_pii

def detect_pii(self, fields: Sequence[str]) -> 'GalacticDataset':

Parameters:

  • fields (Sequence[str]): List of fields to detect PII in.

Returns:

  • GalacticDataset: Modified dataset with detected PII.

Example:

ds.detect_pii(['email_field', 'phone_field'])

count_tokens

Counts tokens for each of the specified fields using the provided tokenizer, given as a string identifier or path to a Hugging Face tokenizer. If no tokenizer is provided, counts bytes instead.

def count_tokens(self, fields: Sequence[str], tokenizer: Optional[str] = None) -> 'GalacticDataset':

Parameters:

  • fields (Sequence[str]): List of fields to count tokens for.
  • tokenizer (str, optional): Tokenizer to use for token counting.

Returns:

  • GalacticDataset: Modified dataset with token (or byte) counts.

Example:

ds.count_tokens(['text_field'], tokenizer="gpt2")
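
Omitting the tokenizer falls back to byte counts:

ds.count_tokens(['text_field'])  # no tokenizer given, so bytes are counted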

Embedding and Clustering

get_embeddings

def get_embeddings(self, field: str, backend: str = "auto") -> 'GalacticDataset':

Parameters:

  • field (str): The field to create embeddings for.
  • backend (str, default='auto'): The backend to use for generating embeddings. Currently, options are limited to "cpu" and "openai". If "auto", will use "cpu". If using "openai", you need to first set the openai_api_key attribute on the dataset.

Returns:

  • GalacticDataset: Modified dataset with added embeddings.

Example:

ds.get_embeddings('text_field')
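
To use the OpenAI backend, set the openai_api_key attribute first, as noted above (the key value here is a placeholder):

ds.openai_api_key = "sk-..."  # placeholder; substitute your own key
ds.get_embeddings('text_field', backend="openai")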

get_nearest_neighbors

def get_nearest_neighbors(self, query: Union[str, np.ndarray], k: int = 5) -> pd.DataFrame:

Parameters:

  • query (str or np.ndarray): The query to find the nearest neighbors for.
  • k (int, default=5): Number of nearest neighbors to return.

Returns:

  • pd.DataFrame: DataFrame containing the top-k nearest neighbors.

Example:

ds.get_nearest_neighbors('sample query')
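
The query can also be a precomputed embedding vector rather than a string (the dimensionality below is illustrative and must match the dataset's embeddings):

import numpy as np
query_vec = np.random.rand(384)  # illustrative dimensionality
ds.get_nearest_neighbors(query_vec, k=10)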

cluster

def cluster(self, n_clusters: int, method: str = "kmeans", batch_size: int = 1024, n_epochs: int = 5) -> None:

Parameters:

  • n_clusters (int): Number of clusters to form.
  • method (str, default='kmeans'): Clustering method to use. Options are 'kmeans' or 'minibatch_kmeans'.
  • batch_size (int, default=1024): Batch size for 'minibatch_kmeans'.
  • n_epochs (int, default=5): Number of epochs for 'minibatch_kmeans'.

Example:

ds.cluster(10)
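
For datasets too large to cluster comfortably in one pass, the minibatch variant trades a little accuracy for speed and memory:

ds.cluster(10, method="minibatch_kmeans", batch_size=2048, n_epochs=3)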

get_cluster_info

def get_cluster_info(self) -> None:

Description:

  • Provides information about the clusters, such as their sizes and prototypical examples.

Example:

ds.get_cluster_info()

remove_cluster

def remove_cluster(self, cluster: int) -> None:

Parameters:

  • cluster (int): The cluster ID to remove.

Example:

ds.remove_cluster(1)

semdedup

def semdedup(
    self,
    target_retention: Optional[float] = 0.8,
    threshold: Optional[float] = None,
    inplace: bool = True
) -> 'GalacticDataset':

Parameters:

  • target_retention (float, optional): The fraction of data points to retain after deduplication. If specified, the method will automatically tune the similarity threshold on a few clusters, targeting this level of retention. Default is 0.8.

  • threshold (float, optional): The similarity threshold for marking duplicates (cosine similarity). Ignored if target_retention is specified.

  • inplace (bool, default=True): Whether to modify the dataset in-place or return a new one.

Returns:

  • GalacticDataset: The dataset with semantic duplicates removed. Returns self if inplace=True.

Raises:

  • ValueError: If neither target_retention nor threshold are specified.

Example:

ds.semdedup(target_retention=0.8)
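
Alternatively, set an explicit cosine-similarity threshold. Since target_retention defaults to 0.8 and takes precedence, it must be unset for the threshold to apply (the threshold value below is illustrative):

ds.semdedup(target_retention=None, threshold=0.95)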
