Curate, annotate, and clean massive unstructured text datasets for machine learning and AI systems.
Galactic
Cleaning and curation tools for massive unstructured text datasets.
To get started, install the package (pip install galactic-ai) and import it:
from galactic import GalacticDataset
Loading and Saving Data
from_csv
@classmethod
def from_csv(cls, path: str) -> 'GalacticDataset':
Parameters:
path (str)
: The path to the CSV file.
Returns:
GalacticDataset
: A dataset instance initialized from the CSV file.
Example:
ds = GalacticDataset.from_csv("data.csv")
from_jsonl
@classmethod
def from_jsonl(cls, path: str, **kwargs) -> 'GalacticDataset':
Parameters:
path (str)
: The path to the JSONL file.
**kwargs
: Additional parameters passed to datasets.load_dataset.
Returns:
GalacticDataset
: A dataset instance initialized from the JSONL file.
Example:
ds = GalacticDataset.from_jsonl("data.jsonl")
from_parquet
@classmethod
def from_parquet(cls, path: str) -> 'GalacticDataset':
Parameters:
path (str)
: The path to the Parquet file.
Returns:
GalacticDataset
: A dataset instance initialized from the Parquet file.
Example:
ds = GalacticDataset.from_parquet("data.parquet")
from_pandas
@classmethod
def from_pandas(cls, df, **kwargs) -> 'GalacticDataset':
Parameters:
df
: A Pandas DataFrame.
**kwargs
: Additional parameters passed to datasets.Dataset.from_pandas.
Returns:
GalacticDataset
: A dataset instance initialized from the DataFrame.
Example:
import pandas as pd
df = pd.read_csv("data.csv")
ds = GalacticDataset.from_pandas(df)
from_hugging_face
@classmethod
def from_hugging_face(
cls,
path: str,
split: str,
config_name: Optional[str] = None,
**kwargs
) -> 'GalacticDataset':
Parameters:
path (str)
: The identifier of the Hugging Face dataset.
split (str)
: The desired split ('train', 'validation', 'test').
config_name (str, optional)
: Specific dataset configuration name. (For example, C4 requires a config name like en or realnewslike.)
**kwargs
: Additional parameters passed to datasets.load_dataset.
Returns:
GalacticDataset
: A dataset instance initialized from the Hugging Face dataset.
Example:
ds = GalacticDataset.from_hugging_face("squad", split="train")
from_hugging_face_stream
@classmethod
def from_hugging_face_stream(
cls,
path: str,
split: str,
config_name: Optional[str] = None,
filters: list[Callable[[dict], bool]] = [],
dedup_fields: Optional[list[str]] = None,
max_samples: Optional[int] = 200000,
**kwargs
) -> 'GalacticDataset':
Parameters:
path (str)
: The identifier of the Hugging Face dataset.
split (str)
: The desired split ('train', 'validation', 'test').
config_name (str, optional)
: Specific dataset configuration name. (For example, C4 requires a config name like en or realnewslike.)
filters (list[Callable], optional)
: List of filter functions to apply.
dedup_fields (list[str], optional)
: Fields to check for duplicates.
max_samples (int, optional)
: Maximum number of samples to load, after filtering.
**kwargs
: Additional parameters passed to Hugging Face datasets.load_dataset.
Returns:
GalacticDataset
: A dataset instance initialized from the Hugging Face dataset.
Example:
filters = [lambda x: x['field'] > 1]
ds = GalacticDataset.from_hugging_face_stream("squad", split="train", filters=filters)
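The streaming loader's behavior (filter, deduplicate, cap) can be sketched with a plain Python generator. This is a conceptual illustration only, not Galactic's internals; the function name and record layout are invented here:

```python
# Conceptual sketch: apply filter functions, field-based exact dedup,
# and a sample cap in a single streaming pass.
def stream_with_filters(records, filters=(), dedup_fields=None, max_samples=None):
    seen = set()
    kept = 0
    for rec in records:
        # Drop the record if any filter rejects it.
        if not all(f(rec) for f in filters):
            continue
        # Exact dedup on the chosen fields.
        if dedup_fields is not None:
            key = tuple(rec[f] for f in dedup_fields)
            if key in seen:
                continue
            seen.add(key)
        yield rec
        kept += 1
        # max_samples caps the count *after* filtering, as in Galactic.
        if max_samples is not None and kept >= max_samples:
            break

records = [{"score": 2, "text": "a"}, {"score": 0, "text": "b"},
           {"score": 3, "text": "a"}, {"score": 5, "text": "c"}]
out = list(stream_with_filters(records,
                               filters=[lambda x: x["score"] > 1],
                               dedup_fields=["text"],
                               max_samples=2))
# Keeps the first "a" (score 2) and "c": "b" fails the filter,
# and the second "a" is a duplicate on the "text" field.
print([r["text"] for r in out])  # ['a', 'c']
```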
save
def save(self, path: str, overwrite: bool = False) -> None:
Parameters:
path (str)
: The path to save the dataset to.
overwrite (bool, optional)
: Whether to overwrite the file if it already exists.
Returns:
None
Example:
ds.save("data.parquet")
Filtering Data
apply_bloom_filter
A Bloom filter is a memory-efficient, probabilistic data structure for exact deduplication, which lets you deduplicate in a single pass over the dataset rather than comparing all possible pairs. It guarantees no false negatives: all actual duplicates will be removed. However, there is a small probability of false positives, so a small number of non-duplicates may also be removed.
def apply_bloom_filter(self, fields: Sequence[str], inplace: bool = True) -> 'GalacticDataset':
Parameters:
fields (Sequence[str])
: List of fields to apply the Bloom filter on. Two records are considered duplicates if they match on all of these fields.
inplace (bool, default=True)
: Whether to modify the dataset in-place.
Returns:
GalacticDataset
: Modified dataset with filtered records.
Example:
ds.apply_bloom_filter(['text_field'])
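The single-pass dedup idea can be sketched with a toy Bloom filter in plain Python. This is an illustration of the data structure, not Galactic's implementation; the class and function names are invented here:

```python
import hashlib

class TinyBloomFilter:
    """Toy Bloom filter: k hash functions over an m-bit array.
    Membership tests may return false positives, never false negatives."""
    def __init__(self, m_bits=1 << 16, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: str):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def dedup(records, fields):
    """Single pass: keep a record only if its field tuple was not seen before."""
    bf = TinyBloomFilter()
    for rec in records:
        key = "\x1f".join(str(rec[f]) for f in fields)
        if key not in bf:
            bf.add(key)
            yield rec

rows = [{"text": "hello"}, {"text": "world"}, {"text": "hello"}]
kept = [r["text"] for r in dedup(rows, ["text"])]
print(kept)  # ['hello', 'world'] (false positives are vanishingly rare at this size)
```

Note that memory use is fixed by the bit-array size regardless of dataset size, which is what makes this practical for massive datasets.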
filter_string
def filter_string(self, fields: Sequence[str], values: Sequence[str], inplace: bool = True) -> 'GalacticDataset':
Parameters:
fields (Sequence[str])
: List of fields to apply the filter on.values (Sequence[str])
: List of string values to filter out.inplace (bool, default=True)
: Whether to modify the dataset in-place.
Returns:
GalacticDataset
: Modified dataset with filtered records.
Example:
ds.filter_string(['text_field'], ['exclude_this', 'and_this'])
filter_regex
def filter_regex(self, fields: Sequence[str], regex: str, inplace: bool = True) -> 'GalacticDataset':
Parameters:
fields (Sequence[str])
: List of fields to apply the regex-based filter on.
regex (str)
: The regex pattern to filter out.
inplace (bool, default=True)
: Whether to modify the dataset in-place.
Returns:
GalacticDataset
: Modified dataset with filtered records.
Example:
ds.filter_regex(['text_field'], r'\d+')
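The filter-out semantics can be sketched in plain Python (a conceptual illustration, assuming a record is dropped when the pattern matches any of the given fields; the function name is invented here):

```python
import re

def filter_regex(records, fields, pattern):
    """Drop any record where the pattern matches one of the given fields."""
    rx = re.compile(pattern)
    return [r for r in records
            if not any(rx.search(str(r[f])) for f in fields)]

rows = [{"text": "room 101"}, {"text": "no digits here"}]
kept = filter_regex(rows, ["text"], r"\d+")
print(kept)  # [{'text': 'no digits here'}]
```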
Processing Data
trim_whitespace
def trim_whitespace(self, fields: Sequence[str], inplace: bool = True) -> 'GalacticDataset':
Parameters:
fields (Sequence[str])
: List of fields to trim whitespace for.
inplace (bool, default=True)
: Whether to modify the dataset in-place.
Returns:
GalacticDataset
: Modified dataset with trimmed fields.
Example:
ds.trim_whitespace(['text_field'])
tag_string
def tag_string(self, fields: Sequence[str], values: Sequence[str], tag: str) -> 'GalacticDataset':
Parameters:
fields (Sequence[str])
: List of fields to apply the tag to.
values (Sequence[str])
: List of values to tag.
tag (str)
: The tag to be applied.
Returns:
GalacticDataset
: Modified dataset with new tags.
Example:
ds.tag_string(['text_field'], ['value1', 'value2'], 'my_tag')
tag_regex
def tag_regex(self, fields: Sequence[str], regex: str, tag: str) -> 'GalacticDataset':
Parameters:
fields (Sequence[str])
: List of fields to apply the regex-based tag to.
regex (str)
: The regex pattern.
tag (str)
: The tag to be applied.
Returns:
GalacticDataset
: Modified dataset with new tags.
Example:
ds.tag_regex(['text_field'], r'\d+', 'contains_number')
detect_language
def detect_language(self, field: str) -> 'GalacticDataset':
Parameters:
field (str)
: Field to detect the language for.
Returns:
GalacticDataset
: Modified dataset with detected languages.
Example:
ds.detect_language('text_field')
calc_perplexity
def calc_perplexity(self, field: str) -> 'GalacticDataset':
Parameters:
field (str)
: Field to calculate the perplexity for.
Returns:
GalacticDataset
: Modified dataset with calculated perplexities.
Example:
ds.calc_perplexity('text_field')
detect_pii
def detect_pii(self, fields: Sequence[str]) -> 'GalacticDataset':
Parameters:
fields (Sequence[str])
: List of fields to detect PII in.
Returns:
GalacticDataset
: Modified dataset with detected PII.
Example:
ds.detect_pii(['email_field', 'phone_field'])
count_tokens
Counts tokens for each of the specified fields using the provided tokenizer (a string identifier of a Hugging Face tokenizer). If no tokenizer is provided, bytes are counted instead.
def count_tokens(self, fields: Sequence[str], tokenizer: Optional[str] = None) -> 'GalacticDataset':
Parameters:
fields (Sequence[str])
: List of fields to count tokens for.
tokenizer (str, optional)
: Tokenizer to use for token counting.
Returns:
GalacticDataset
: Modified dataset with token (or byte) counts.
Example:
ds.count_tokens(['text_field'], tokenizer="some_tokenizer")
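The byte-counting fallback can be sketched in plain Python (a conceptual illustration; the function name and the output column name are invented here, not Galactic's actual schema):

```python
def count_bytes(records, fields):
    """Fallback when no tokenizer is given: count UTF-8 bytes per field."""
    for rec in records:
        for f in fields:
            rec[f"__byte_count__{f}"] = len(str(rec[f]).encode("utf-8"))
    return records

rows = count_bytes([{"text": "héllo"}], ["text"])
print(rows[0]["__byte_count__text"])  # 6: "é" is two bytes in UTF-8
```

Byte counts are a cheap proxy for token counts, though the ratio varies by language and tokenizer.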
Embedding and Clustering
get_embeddings
def get_embeddings(self, field: str, backend: str = "auto") -> 'GalacticDataset':
Parameters:
field (str)
: The field to create embeddings for.
backend (str, default='auto')
: The backend to use for generating embeddings. Options are currently limited to "cpu" and "openai". If "auto", "cpu" is used. If using "openai", you must first set the openai_api_key attribute on the dataset.
Returns:
GalacticDataset
: Modified dataset with added embeddings.
Example:
ds.get_embeddings('text_field')
get_nearest_neighbors
def get_nearest_neighbors(self, query: Union[str, np.ndarray], k: int = 5) -> pd.DataFrame:
Parameters:
query (str or np.ndarray)
: The query to find the nearest neighbors for.
k (int, default=5)
: Number of nearest neighbors to return.
Returns:
pd.DataFrame
: DataFrame containing the top-k nearest neighbors.
Example:
ds.get_nearest_neighbors('sample query')
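Nearest-neighbor lookup over embeddings boils down to a similarity-ranked top-k. A minimal sketch using cosine similarity (a common choice; this is an illustration, not Galactic's index, and the function names are invented here):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query, embeddings, k=2):
    """Return indices of the k embeddings most similar to the query."""
    ranked = sorted(range(len(embeddings)),
                    key=lambda i: cosine(query, embeddings[i]),
                    reverse=True)
    return ranked[:k]

vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
result = top_k([1.0, 0.1], vecs, k=2)
print(result)  # [0, 2]
```

Real systems use an approximate index rather than the exhaustive scan shown here, but the ranking criterion is the same.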
cluster
def cluster(self, n_clusters: int, method: str = "kmeans", batch_size: int = 1024, n_epochs: int = 5) -> None:
Parameters:
n_clusters (int)
: Number of clusters to form.
method (str, default='kmeans')
: Clustering method to use. Options are 'kmeans' or 'minibatch_kmeans'.
batch_size (int, default=1024)
: Batch size for 'minibatch_kmeans'.
n_epochs (int, default=5)
: Number of epochs for 'minibatch_kmeans'.
Example:
ds.cluster(10)
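For intuition, here is plain Lloyd's k-means on scalars (a toy sketch, not Galactic's implementation; 'minibatch_kmeans' would instead update centroids from small random batches each epoch, which is why batch_size and n_epochs exist):

```python
import random

def kmeans_1d(points, k, n_iters=10, seed=0):
    """Lloyd's k-means on 1-D points: assign, then re-average, repeatedly."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(n_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster empties out.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
centroids = kmeans_1d(points, k=2)
print(centroids)  # ≈ [1.0, 9.5], the means of the two blobs
```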
get_cluster_info
def get_cluster_info(self) -> None:
Description:
- Provides information about the clusters, such as their sizes and prototypical examples.
Example:
ds.get_cluster_info()
remove_cluster
def remove_cluster(self, cluster: int) -> None:
Parameters:
cluster (int)
: The cluster ID to remove.
Example:
ds.remove_cluster(1)
semdedup
def semdedup(
self,
target_retention: Optional[float] = 0.8,
threshold: Optional[float] = None,
inplace: bool = True
) -> 'GalacticDataset':
Parameters:
target_retention (float, optional)
: The fraction of data points to retain after deduplication. If specified, the method automatically tunes the similarity threshold on a few clusters to target this retention level. Default is 0.8.
threshold (float, optional)
: The similarity threshold (cosine similarity) for marking duplicates. Ignored if target_retention is specified.
inplace (bool, default=True)
: Whether to modify the dataset in-place or return a new one.
Returns:
GalacticDataset
: The dataset with semantic duplicates removed. Returnsself
ifinplace=True
.
Raises:
ValueError
: If neither target_retention nor threshold is specified.
Example:
ds.semdedup(target_retention=0.8)
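The core idea of semantic deduplication can be sketched as a greedy cosine-threshold pass over embeddings. This is a conceptual illustration only (Galactic works within clusters and can tune the threshold from a target retention rate); the function name is invented here:

```python
import math

def semdedup_sketch(embeddings, threshold=0.95):
    """Greedy pass: drop any item whose cosine similarity to an
    already-kept item exceeds the threshold; return kept indices."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    kept = []
    for i, e in enumerate(embeddings):
        if all(cos(e, embeddings[j]) <= threshold for j in kept):
            kept.append(i)
    return kept

vecs = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
kept = semdedup_sketch(vecs, threshold=0.95)
print(kept)  # [0, 2]: item 1 is a near-duplicate of item 0
```

Raising the threshold retains more data; lowering it removes more near-duplicates, which is the trade-off target_retention tunes automatically.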