

Installation

pip install elemelek 

What does elemelek do?

Elemelek is designed to sample subsets of instructions gathered or generated from various sources (and therefore of varying quality and diversity) for LLM fine-tuning tasks. Under the hood, elemelek:

  • creates an SQLite database to store instructions and their features
  • computes embeddings of the instructions
  • indexes the embeddings in an HNSW index via usearch
  • clusters the embeddings
  • computes features for each instruction in the dataset (basic text statistics + a reranker score)
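The embed-index-search part of this pipeline can be illustrated with a toy, exact-search stand-in. This is a sketch in plain NumPy with fake embeddings; the real index is an approximate HNSW index built via usearch, and all names below are illustrative, not elemelek's internals:

```python
import numpy as np

# Toy stand-in for the index-and-search step: exact cosine search
# over randomly generated "instruction embeddings".
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 384)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit-normalize for cosine

def search(query: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the k embeddings most cosine-similar to `query`."""
    query = query / np.linalg.norm(query)
    scores = embeddings @ query  # cosine similarity, since rows are unit vectors
    return np.argsort(-scores)[:k]

neighbors = search(embeddings[0])
assert neighbors[0] == 0  # a vector is always its own nearest neighbor
```

An HNSW index answers the same kind of query approximately, in roughly logarithmic rather than linear time, which is what makes it practical for large instruction sets.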

Once the dataset is built, elemelek provides a simple interface for sampling filtered data.

How to use it:

First, you need to "build" your dataset.

Create YAML config file:

dataset_jsonl_path: /path/to/your/file.jsonl
db:
  database_insert_batch_size: 1000 # chunk size used when writing the dataset to the database
  remove_duplicates: true # drop duplicated entries from the database
semantic_index:
  embeddings_model_name: sentence-transformers/all-MiniLM-L6-v2 # sentence-transformers model used to compute instruction embeddings
  embeddings_computation_batch_size: 32 # batch size used for embedding computation
  metric: cos # metric used for the HNSW index
  connectivity: 128 # HNSW connectivity parameter
  dtype: 'f32' # index dtype (`f16` or `i8` can be used for performance reasons)
  expansion_add: 128 # expansion factor used for index construction when adding vectors
  expansion_search: 256 # expansion factor used during search operations
  n_clusters: 10000 # number of clusters to compute once the index is created
features:
  basic: true # whether to compute basic text-statistics features
  reranker:
    model_name: cross-encoder/ms-marco-MiniLM-L-2-v2 # reranker used to score relevance of (instruction, input) => output pairs
    batch_size: 16 # batch size used for reranker computation
    strategy: truncate # how to treat long texts (currently only truncate)
  language_tool: # set to false if you don't need it (a bit experimental, and it takes a while)
    lang: pl-PL # language of your instructions
    n_threads: 4 # number of threads language_tool will use to check your texts
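Since the config is plain YAML, it can be sanity-checked before kicking off a potentially long build. A minimal sketch, assuming PyYAML is installed (elemelek itself loads the file for you via `Egg.from_yaml`, so this step is optional):

```python
import yaml

# Parse a config snippet (mirroring part of the config above)
# and check a few fields before starting an expensive build.
config_text = """
dataset_jsonl_path: /path/to/your/file.jsonl
db:
  database_insert_batch_size: 1000
  remove_duplicates: true
semantic_index:
  embeddings_model_name: sentence-transformers/all-MiniLM-L6-v2
  n_clusters: 10000
"""
config = yaml.safe_load(config_text)
assert config["db"]["remove_duplicates"] is True
assert config["semantic_index"]["n_clusters"] == 10000
```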

Run

from elemelek.nest import Elemelek, Egg
# read the config file
egg = Egg.from_yaml("config.yaml")
# build your elemelek dataset - this will take a while; a strong GPU is required
# for embedding and reranker relevance score computation
elemelek = Elemelek(egg)

Once your dataset is built, you can start sampling:

from elemelek.features import RERANKER_RELEVANCE_SCORE, IS_QUESTION
from elemelek.model import SubsetChoiceMethod

# start sampling 
sample = elemelek.start_sampling(shuffle=True)

# filter questions with relevance score > 0.9
substantive_questions = sample.filter(
    lambda x: (
        x.get_feature(IS_QUESTION).value and
        x.get_feature(RERANKER_RELEVANCE_SCORE).value > 0.9
    )
)

# get subset of 25k diverse substantive questions  
diverse_questions = substantive_questions.sample_diverse(
    k=25000,
    method=SubsetChoiceMethod.VARIABILITY_FACTOR,  
    within_cluster_diversity_factor=1.0
    # within_cluster_diversity_factor=0.0 => the least diverse subset
    # within_cluster_diversity_factor=1.0 => the most diverse subset 
)

# get non-questions
non_questions = sample.filter(
    lambda x: not x.get_feature(IS_QUESTION).value
)

# get 25k diverse non questions 
diverse_non_questions = non_questions.sample_diverse(
    k=25000,
    method=SubsetChoiceMethod.VARIABILITY_FACTOR,  
    within_cluster_diversity_factor=1.0
)

# compose final sample 
final_sample = diverse_non_questions + diverse_questions

# get DF and play with it 
df = final_sample.to_pandas()

# dump your data to JSONL (and hopefully train your great fine-tuned LLM);
# features will be included in the output JSONL, along with an __instruction_text field
# representing the formatted instruction (built with apply_chat_template from the tokenizer of your choice)
final_sample.to_jsonl(
    "my-awesome-sample.jsonl", 
    include_features=True, 
    include_instruction_using_chat_template_from="mistralai/Mistral-7B-Instruct-v0.2"
)

Your my-awesome-sample.jsonl entries will look like this:

{
  "id": 476137,
  "instruction": "Jakie są produkty z mleka?",
  "input": "",
  "output": "Do nabiału należą również produkty mleczne...",
  "feature_source_name": "almost_like_an_alpaca",
  "feature_category": "ALMOST_LIKE_AN_ALPACA",
  "feature_median_word_length": 6,
  "feature_quantile_0.9_word_length": 9.6,
  "feature_quantile_0.1_word_length": 1.8,
  "feature_total_length": 292,
  "feature_is_question": true,
  "feature_has_input": true,
  "feature_numeric_chars_ratio": 0,
  "feature_non_alpha_numeric_chars_ratio": 0.18835616438356165,
  "feature_reranker-relevance-score": 0.9645681381225586,
  "__instruction_text": "<s>[INST] Jakie są produkty z mleka? \n  [/INST]Do nabiału należą również produkty mleczne...</s> "
}
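The dumped file is plain JSON Lines, so it can be consumed with the standard library alone. A minimal sketch, with made-up entries mirroring the field names above:

```python
import json

def load_jsonl(text: str) -> list[dict]:
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Two made-up entries using the same field names as the example output
sample_text = "\n".join([
    '{"id": 1, "feature_is_question": true, "feature_reranker-relevance-score": 0.96}',
    '{"id": 2, "feature_is_question": false, "feature_reranker-relevance-score": 0.42}',
])
entries = load_jsonl(sample_text)
questions = [e for e in entries if e["feature_is_question"]]
assert len(questions) == 1 and questions[0]["id"] == 1
```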

Additionally, you can:

# search through your instructions semantically
matched_instructions = elemelek.search("How much wood would the woodchuck chuck?", k=10)

# examine clustering requested in your config 
clustering = elemelek.clustering
centroid_instruction_id = clustering[0].centroid_id
example_centroid_instruction = elemelek.db[centroid_instruction_id] # access your instruction like this 

# list all precomputed feature names  
feature_names = elemelek.feature_names

Once you are done, you can resume your work later:

from elemelek.nest import Elemelek
datasets = Elemelek.list_datasets()
# >> {'7ff7a3107f44d545c9ac6703c3893e0b': Egg(...)}
elemelek = Elemelek.from_dataset_id('7ff7a3107f44d545c9ac6703c3893e0b')

Have fun!

Download files

Download the file for your platform.

Source Distribution

elemelek-0.1.0.tar.gz (16.9 kB)

Uploaded Source

Built Distribution

elemelek-0.1.0-py3-none-any.whl (18.4 kB)

Uploaded Python 3

File details

Details for the file elemelek-0.1.0.tar.gz.

File metadata

  • Download URL: elemelek-0.1.0.tar.gz
  • Size: 16.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.10.11 Linux/5.15.0-117-generic

File hashes

Hashes for elemelek-0.1.0.tar.gz

  • SHA256: b6983b677e0ee104f40a3ad0ff97df9ee8c43eaaddf07de385af00ae1e643da4
  • MD5: e7ea014fc17dce7ae9460bcec1c29a47
  • BLAKE2b-256: 1ab4284571d0bf571462ba43586643530c33584ca13bab03c88bb028b4ba09de


File details

Details for the file elemelek-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: elemelek-0.1.0-py3-none-any.whl
  • Size: 18.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.10.11 Linux/5.15.0-117-generic

File hashes

Hashes for elemelek-0.1.0-py3-none-any.whl

  • SHA256: 85506b79cf2171643ea385dadf705e343161371ca340a25dca6b7a237b311f80
  • MD5: a4ad4ab094c3acb222804aa57bb2f593
  • BLAKE2b-256: 04a18a870be030abaf742c7b0240292faec83cc5551a9364c7ccbc7d3c9d236d

