
🌸 Lilac

Curate better data for LLMs

🔗 Try the Lilac web demo!


Lilac helps you curate data for LLMs, from RAG applications to fine-tuning datasets.

Lilac runs on-device using open-source LLMs with a UI and Python API for:

  • Exploring datasets with natural language (documents)
  • Annotating & structuring data (e.g. PII detection, profanity, text statistics)
  • Semantic search to find similar results to a query
  • Conceptual search to find and tag results that match a fuzzy concept (e.g. a low command of the English language)
  • Clustering data semantically for understanding & deduplication
  • Labeling and Bulk Labeling to curate data

Demo video: https://github.com/lilacai/lilac/assets/2294279/cb1378f8-92c1-4f2a-9524-ce5ddd8e0c53

🔥 Getting started

💻 Install

pip install lilac[all]
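Note: on shells such as zsh, quote the extras so the brackets aren't expanded by the shell: pip install 'lilac[all]'.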

If you prefer no local installation, you can fork the HuggingFace Spaces demo. Documentation here.

🌐 Start a webserver

Start a Lilac webserver from the CLI:

lilac start ~/my_project

Or start the Lilac webserver from Python:

import lilac as ll

ll.start_server(project_dir='~/my_project')

This will start a webserver at http://localhost:5432/, where you can load datasets and explore them.

Run via Docker

We publish images for linux/amd64 and linux/arm64 on Docker Hub under lilacai.

The container serves on port 8000; the command below maps it to port 5432 on the host machine.

If you have an existing lilac project, mount it and set the LILAC_PROJECT_DIR environment variable:

# Remove the --gpus flag if you don't have a GPU, or on macOS.
docker run -it \
  -p 5432:8000 \
  --volume /host/path/to/data:/data \
  -e LILAC_PROJECT_DIR="/data" \
  --gpus all \
  lilacai/lilac

To build your own custom image, run the following command; otherwise, skip to the next step.

docker build -t lilac .

📊 Load data

Datasets can be loaded directly from HuggingFace, CSV, JSON, LangSmith from LangChain, SQLite, LlamaHub, Pandas, Parquet, and more. More documentation here.

import lilac as ll

ll.set_project_dir('~/my_project')

config = ll.DatasetConfig(
  namespace='local',
  name='imdb',
  source=ll.HuggingFaceSource(dataset_name='imdb'))

dataset = ll.create_dataset(config)
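Other sources follow the same pattern. As a minimal sketch, here is the same flow for a Pandas DataFrame; ll.PandasSource is listed among the sources above, though the exact call shape here is an assumption:

import pandas as pd
import lilac as ll

ll.set_project_dir('~/my_project')

# A tiny in-memory DataFrame, just to illustrate the flow.
df = pd.DataFrame({'text': ['hello world', 'foo bar']})

config = ll.DatasetConfig(
  namespace='local',
  name='my_pandas_data',
  # Assumption: PandasSource takes the DataFrame as its argument.
  source=ll.PandasSource(df))

dataset = ll.create_dataset(config)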

If you prefer, you can load datasets directly from the UI without writing any Python.


🔎 Explore

🔗 Try OpenOrca-100K before installing!

Once we've loaded a dataset, we can explore it from the UI and get a sense for what's in the data. More documentation here.


⚡ Annotate with Signals (PII, Text Statistics, Language Detection, Neardup, etc)

Annotating data with signals will produce another column in your data.

import lilac as ll

ll.set_project_dir('~/my_project')

dataset = ll.get_dataset('local', 'imdb')

# [Language detection] Detect the language of each document.
dataset.compute_signal(ll.LangDetectionSignal(), 'text')

# [PII] Find emails, phone numbers, ip addresses, and secrets.
dataset.compute_signal(ll.PIISignal(), 'text')

# [Text Statistics] Compute readability scores, number of chars, TTR, non-ascii chars, etc.
dataset.compute_signal(ll.TextStatisticsSignal(), 'text')

# [Near Duplicates] Computes clusters based on minhash LSH.
dataset.compute_signal(ll.NearDuplicateSignal(), 'text')

# Print the resulting manifest, with the new field added.
print(dataset.manifest())
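To sanity-check the enrichment, we can read a few rows back with select_rows. A sketch: the nested field name 'lang_detection' below is an assumption; take the real output names from the manifest printed above.

# Select the text along with one of the computed signal fields.
rows = dataset.select_rows(
  columns=['text', ('text', 'lang_detection')],
  limit=5)

print(list(rows))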

We can also compute signals from the UI.


🔎 Search

Semantic and conceptual search requires computing an embedding first:

dataset.compute_embedding('gte-small', path='text')
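Here 'gte-small' is one of several registered embeddings; other embeddings (local or API-based, e.g. 'sbert') can be selected by name, depending on which extras you installed.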

Semantic search

In the UI, we can search by semantic similarity or by classic keyword search to find chunks of documents similar to a query.


We can run the same search in Python:

rows = dataset.select_rows(
  columns=['text', 'label'],
  searches=[
    ll.SemanticSearch(
      path='text',
      embedding='gte-small')
  ],
  limit=1)

print(list(rows))
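Keyword search has a Python counterpart as well. A sketch, assuming ll.KeywordSearch mirrors the SemanticSearch API above:

rows = dataset.select_rows(
  columns=['text', 'label'],
  searches=[
    ll.KeywordSearch(
      path='text',
      query='great movie')  # Hypothetical query string.
  ],
  limit=1)

print(list(rows))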

Conceptual search

Conceptual search is a much more controllable and powerful version of semantic search, where "concepts" can be taught to Lilac by providing positive and negative examples of that concept.

Lilac provides a set of built-in concepts, but you can create your own for very specific use cases.


We can create a concept in Python with a few examples, and search by it:

db = ll.DiskConceptDB()
db.create(namespace='local', name='spam')
# Add examples of spam and not-spam.
db.edit('local', 'spam', ll.concepts.ConceptUpdate(
  insert=[
    ll.concepts.ExampleIn(label=False, text='This is normal text.'),
    ll.concepts.ExampleIn(label=True, text='asdgasdgkasd;lkgajsdl'),
    ll.concepts.ExampleIn(label=True, text='11757578jfdjja')
  ]
))

# Search by the spam concept.
rows = dataset.select_rows(
  columns=['text', 'label'],
  searches=[
    ll.ConceptSearch(
      path='text',
      concept_namespace='local',
      concept_name='spam',
      embedding='gte-small')
  ],
  limit=1)

print(list(rows))
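Note that concepts live in namespaces: a concept you create lands under the namespace passed to db.create (here 'local'), while Lilac's built-in concepts ship under the 'lilac' namespace. The ConceptSearch must reference the namespace the concept actually lives in.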

🏷️ Labeling

Lilac allows you to label individual points or slices of data.

We can also label all data matching a filter. In this case, we add the label "short" to all text with a small number of characters. This field was produced by the automatic text_statistics signal.


We can do the same in Python:

dataset.add_labels(
  'short',
  filters=[
    (('text', 'text_statistics', 'num_characters'), 'less', 1000)
  ]
)

Labels can be exported for downstream tasks. Detailed documentation here.
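As a minimal export sketch, assuming dataset.to_pandas() is available (per the export docs) and that the label surfaces as a column named after the label; check dataset.manifest() for the actual layout:

# Export to a DataFrame and keep only the rows labeled 'short'.
df = dataset.to_pandas()
short_df = df[df['short'].notna()]  # Assumption: the label lands in a 'short' column.
print(len(short_df))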

💬 Contact

For bugs and feature requests, please file an issue on GitHub.

For general questions, please visit our Discord.
