Galactic
Cleaning and curation tools for massive unstructured text datasets
Getting Started
- Clone the repo:
git clone https://github.com/taylorai/docent.git
- Create a new Jupyter notebook
- Install the dependencies:
!pip install -r requirements.txt
- Import Docent:
from dataset import Docent
- Load your dataset:
ds = Docent.from_disk("c4-with_embs")
Utilities
Preprocessing
- Trim whitespace
ds.trim_whitespace(fields=["field1", "field2"])
- Tag text on string
ds.tag_string(fields=["field1"], values=["value1", "value2"], tag="desired_tag")
- Tag text with RegEx
ds.tag_regex(fields=["field1"], regex="some_regex", tag="desired_tag")
- Filter on string
ds.filter_string(fields=["field1"], values=["value1", "value2"])
- Filter with RegEx
ds.filter_regex(fields=["field1"], regex="some_regex")
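Conceptually, the string and regex utilities above keep (or tag) a record when any of the listed fields matches. A minimal plain-Python sketch of that behavior — this is not Galactic's implementation, and the record layout and regex are illustrative:

```python
import re

# Hypothetical records; Galactic applies the same idea to dataset fields.
records = [
    {"field1": "contact me at alice@example.com"},
    {"field1": "no email here"},
]

def filter_regex(records, fields, regex):
    """Keep only records where the regex matches at least one of the fields."""
    pattern = re.compile(regex)
    return [r for r in records if any(pattern.search(r[f]) for f in fields)]

kept = filter_regex(records, fields=["field1"], regex=r"\S+@\S+\.\S+")
print(len(kept))  # only the record containing an email-like string survives
```

Tagging works the same way, except matching records are labeled rather than dropped, so you can inspect them before deciding to filter.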
Exploration
- Count tokens
ds.count_tokens(fields=["text_field"])
- Detect PII
ds.detect_pii(fields=["name", "description"])
- Detect the language
ds.detect_language(field="text_field")
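To make the token-counting step concrete, here is a rough sketch using whitespace splitting. This is only an approximation for illustration — Galactic presumably uses a real tokenizer, so its counts will differ:

```python
# Approximate token counting over a text field by whitespace splitting.
records = [{"text_field": "hello world"}, {"text_field": "one two three"}]

def count_tokens(records, fields):
    """Sum a naive whitespace token count across the given fields."""
    return sum(len(r[f].split()) for r in records for f in fields)

total = count_tokens(records, fields=["text_field"])
print(total)  # 2 + 3 = 5
```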
Manipulation
- Generate embeddings
ds.get_embeddings(field="text_field")
- Retrieve the nearest neighbors
results = ds.get_nearest_neighbors(query="sample text", k=5)
- Create clusters
ds.cluster(n_clusters=5, method="kmeans")
- Remove a cluster
ds.remove_cluster(cluster=3)
- Semantically deduplicate
ds.semdedup(threshold=0.95)
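Semantic deduplication is commonly done by comparing embedding cosine similarity against a threshold and greedily keeping the first item of each near-duplicate group. The sketch below shows that general technique in pure Python; Galactic's actual algorithm may differ, and the 2-D embeddings are toy values:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semdedup(embeddings, threshold=0.95):
    """Greedily keep items whose embedding is not a near-duplicate of a kept one."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

embs = [[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]]
print(semdedup(embs))  # [0, 2]: the second vector is nearly parallel to the first
```

Higher thresholds are more conservative (only near-identical texts are dropped); lowering the threshold removes looser paraphrases.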
Example
See example.ipynb for a complete walkthrough.
Download files
Download the file for your platform.
Source Distribution
galactic-ai-0.1.0.tar.gz (8.8 kB)
Built Distribution
galactic_ai-0.1.0-py3-none-any.whl

Hashes for galactic_ai-0.1.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 4ad202d116fda85da64a9f13534a9ee6a8d67fb85000ab2de079cf3662df66f3
MD5 | 4600d28ed6623b91c33d6fb6fe084edd
BLAKE2b-256 | bbf55c39618bb5cd35069813a399d662ee095725f374779acf460c6a0de7c0ca