Organize unstructured data

Lilac

Better data, better AI

Lilac is a tool for exploration, curation and quality control of datasets for training, fine-tuning and monitoring LLMs.

Lilac is used by companies like Cohere and Databricks to visualize, quantify and improve the quality of pre-training and fine-tuning data.

Lilac runs on-device using open-source LLMs with a UI and Python API.

🆒 New

Lilac Garden is our hosted platform for blazing fast dataset-level computations. Sign up to join the pilot.
Cluster & title millions of documents with the power of LLMs. Explore and search over 36,000 clusters of 4.3M documents in OpenOrca

Why use Lilac?

Explore your data interactively with LLM-powered search, filter, clustering and annotation.
Curate AI data, applying best practices like removing duplicates, PII and obscene content to reduce dataset size and lower training cost and time.
Inspect and collaborate with your team on a single, centralized dataset to improve data quality.
Understand how data changes over time.

Lilac can offload expensive computations to Lilac Garden, our hosted platform for blazing fast dataset-level computations.

See our 3-minute walkthrough video

🔥 Getting started

💻 Install

pip install lilac[all]

If you prefer not to install locally, you can duplicate our Spaces demo by following the documentation here.

For more detailed instructions, see our installation guide.

🌐 Start a webserver

Start a Lilac webserver with our lilac CLI:

lilac start ~/my_project

Or start the Lilac webserver from Python:

import lilac as ll

ll.start_server(project_dir='~/my_project')

This will start a webserver at http://localhost:5432/ where you can load datasets and explore them.

Lilac Garden

Lilac Garden is our hosted platform for running dataset-level computations. We utilize powerful GPUs to accelerate expensive signals like Clustering, Embedding, and PII. Sign up to join the pilot.

Cluster and title a million data points in 20 mins
Embed your dataset at half a billion tokens per min
Run your own signal

📊 Load data

Datasets can be loaded directly from HuggingFace, Parquet, CSV, JSON, LangSmith from LangChain, SQLite, LlamaHub, Pandas, and more. More documentation here.

import lilac as ll

ll.set_project_dir('~/my_project')
dataset = ll.from_huggingface('imdb')

If you prefer, you can load datasets directly from the UI without writing any Python.

🔎 Explore

[!NOTE] 🔗 Explore OpenOrca and its clusters before installing!

Once we've loaded a dataset, we can explore it from the UI and get a sense for what's in the data. More documentation here.

✨ Clustering

Cluster any text column to get automated dataset insights:

dataset = ll.get_dataset('local', 'imdb')
dataset.cluster('text') # add `use_garden=True` to offload to Lilac Garden

[!TIP] Clustering on device can be slow or impractical, especially on machines without a powerful GPU or large memory. Offloading the compute to Lilac Garden, our hosted data processing platform, can speed up clustering by more than 100x.

⚡ Annotate with Signals (PII, Text Statistics, Language Detection, Neardup, etc.)

Annotating data with signals will produce another column in your data.

dataset = ll.get_dataset('local', 'imdb')
dataset.compute_signal(ll.LangDetectionSignal(), 'text') # Detect language of each doc.

# [PII] Find emails, phone numbers, ip addresses, and secrets.
dataset.compute_signal(ll.PIISignal(), 'text')

# [Text Statistics] Compute readability scores, number of chars, TTR, non-ascii chars, etc.
dataset.compute_signal(ll.TextStatisticsSignal(), 'text')

# [Near Duplicates] Computes clusters based on minhash LSH.
dataset.compute_signal(ll.NearDuplicateSignal(), 'text')

# Print the resulting manifest, with the new field added.
print(dataset.manifest())
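To give a feel for what a near-duplicate signal based on MinHash computes, here is a minimal standard-library sketch. It is not Lilac's implementation: the function names (`shingles`, `minhash_signature`, `estimated_jaccard`), the 3-word shingle size and the 64 hash functions are all assumptions chosen for illustration, and the LSH banding step that groups similar signatures into clusters is omitted.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_perm=64):
    """Keep the minimum hash value per seeded hash function.

    Each salt stands in for one random permutation of the shingle universe.
    """
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little")
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, which estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "This movie was a wonderful surprise from start to finish"
doc_b = "This movie was a wonderful surprise from start to end"
sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
print(estimated_jaccard(sig_a, sig_b))
```

Documents whose estimated similarity exceeds a chosen threshold would be treated as near duplicates; the threshold itself is a tuning decision.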

We can also compute signals from the UI.

🔎 Search

Semantic and conceptual search require computing an embedding first:

dataset.compute_embedding('gte-small', path='text')

Semantic search

In the UI, we can search by semantic similarity or by classic keyword search to find chunks of documents similar to a query.

We can run the same search in Python:

rows = dataset.select_rows(
  columns=['text', 'label'],
  searches=[
    ll.SemanticSearch(
      path='text',
      query='a heartwarming family movie',  # Example query text; replace with your own.
      embedding='gte-small')
  ],
  limit=1)

print(list(rows))
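Conceptually, semantic search ranks documents by how close their embeddings are to the embedding of the query. The sketch below illustrates that idea with cosine similarity over hand-written toy vectors. The vectors, the document ids and the helper names (`cosine_similarity`, `top_k`) are hypothetical, and this is not a description of how Lilac stores or searches embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=1):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Toy 3-dimensional "embeddings"; a real model such as gte-small
# produces vectors with hundreds of dimensions.
doc_vecs = {
    'review_a': [0.9, 0.1, 0.0],
    'review_b': [0.1, 0.8, 0.3],
    'review_c': [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.1]

print(top_k(query_vec, doc_vecs, k=2))
```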

Conceptual search

Conceptual search is a much more controllable and powerful version of semantic search, where "concepts" can be taught to Lilac by providing positive and negative examples of that concept.

Lilac provides a set of built-in concepts, but you can also create your own.

We can create a concept in Python with a few examples, and search by it:

db = ll.DiskConceptDB()
db.create(namespace='local', name='spam')
# Add examples of spam and not-spam.
db.edit('local', 'spam', ll.concepts.ConceptUpdate(
  insert=[
    ll.concepts.ExampleIn(label=False, text='This is normal text.'),
    ll.concepts.ExampleIn(label=True, text='asdgasdgkasd;lkgajsdl'),
    ll.concepts.ExampleIn(label=True, text='11757578jfdjja')
  ]
))

# Search by the spam concept.
rows = dataset.select_rows(
  columns=['text', 'label'],
  searches=[
    ll.ConceptSearch(
      path='text',
      concept_namespace='local',  # Must match the namespace passed to db.create().
      concept_name='spam',
      embedding='gte-small')
  ],
  limit=1)

print(list(rows))
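One way to picture a concept is as a small classifier trained on the embeddings of the positive and negative examples, which then scores any other embedding. The sketch below trains a tiny logistic regression in pure Python on toy 2-dimensional vectors. Everything in it is an assumption made for illustration (the vectors, `train_concept`, `concept_score`, the learning rate and epoch count), and it should not be read as Lilac's actual concept model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_concept(examples, epochs=500, lr=0.5):
    """Fit logistic regression with plain stochastic gradient descent.

    examples: list of (vector, label) pairs, where label is True or False.
    """
    dim = len(examples[0][0])
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        for vec, label in examples:
            pred = sigmoid(sum(w * x for w, x in zip(weights, vec)) + bias)
            error = pred - (1.0 if label else 0.0)
            weights = [w - lr * error * x for w, x in zip(weights, vec)]
            bias -= lr * error
    return weights, bias


def concept_score(model, vec):
    """Score in (0, 1): higher means the vector looks more like the concept."""
    weights, bias = model
    return sigmoid(sum(w * x for w, x in zip(weights, vec)) + bias)


# Toy embeddings: pretend the second dimension captures "spamminess".
examples = [
    ([0.9, 0.1], False),
    ([0.8, 0.2], False),
    ([0.1, 0.9], True),
    ([0.2, 0.8], True),
]
model = train_concept(examples)

print(concept_score(model, [0.15, 0.85]))
print(concept_score(model, [0.85, 0.15]))
```

Adding more positive and negative examples is what makes this kind of search controllable: each new example moves the decision boundary.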

🏷️ Labeling

Lilac allows you to label individual points or slices of data.

We can also label all data matching a filter. Here, we add the label "short" to every text with a small number of characters. The character count comes from the automatic text_statistics signal.

We can do the same in Python:

dataset.add_labels(
  'short',
  filters=[
    (('text', 'text_statistics', 'num_characters'), 'less', 1000)
  ]
)
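The filter above can be read as "for every row whose character count is less than 1000, attach the label". A plain-Python sketch of that logic, independent of Lilac, might look like the following. The helpers `text_statistics` and `add_label` and the in-memory row format are hypothetical stand-ins, not Lilac APIs.

```python
def text_statistics(text):
    """Tiny stand-in for a text-statistics signal: only counts characters."""
    return {'num_characters': len(text)}


def add_label(rows, label, predicate):
    """Attach `label` to every row matching `predicate`; return matching indices."""
    matched = []
    for i, row in enumerate(rows):
        if predicate(row):
            row.setdefault('labels', set()).add(label)
            matched.append(i)
    return matched


rows = [
    {'text': 'Great film.'},
    {'text': 'x' * 1500},
]

# Enrich each row with its statistics, mirroring a computed signal column.
for row in rows:
    row['text_statistics'] = text_statistics(row['text'])

short_ids = add_label(
    rows, 'short',
    lambda row: row['text_statistics']['num_characters'] < 1000)

print(short_ids)
```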

Labels can be exported for downstream tasks. Detailed documentation here.

💬 Contact

For bugs and feature requests, please file an issue on GitHub.

For general questions, please visit our Discord.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lilac-0.3.5.tar.gz (2.3 MB)

Uploaded Feb 14, 2024 Source

Built Distribution

lilac-0.3.5-py3-none-any.whl (2.5 MB)

Uploaded Feb 14, 2024 Python 3

File details

Details for the file lilac-0.3.5.tar.gz.

File metadata

Download URL: lilac-0.3.5.tar.gz
Upload date: Feb 14, 2024
Size: 2.3 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.7.1 CPython/3.9.17 Darwin/22.5.0

File hashes

Hashes for lilac-0.3.5.tar.gz
Algorithm	Hash digest
SHA256	`2c78bda372e462725ed21ffd5b90d860683055beb20b4dca1bbc12950882b550`
MD5	`050575cf0ec4993225384e49532d95ab`
BLAKE2b-256	`db2c996c62e881c4ae431a1f011ad8ce23f2d858adc70607429bfa6068495bd9`

See more details on using hashes here.

File details

Details for the file lilac-0.3.5-py3-none-any.whl.

File metadata

Download URL: lilac-0.3.5-py3-none-any.whl
Upload date: Feb 14, 2024
Size: 2.5 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.7.1 CPython/3.9.17 Darwin/22.5.0

File hashes

Hashes for lilac-0.3.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f5966b61fca6198fe63f75295432073e693899fe965cc7f51b505ee4b61d8089`
MD5	`bc39e10cdb7a3d2db12bcf950667b5d5`
BLAKE2b-256	`0f979d204ea8ed7d3f06bc5e1871d37ce4f78ac0f57727651215715d5e4d25ef`