
Project description

Installation

pip install elemelek 

What does elemelek do?

Elemelek is designed to sample subsets of instructions gathered or generated from various sources (and therefore of varying quality and diversity) for LLM fine-tuning tasks. Under the hood, elemelek:

  • creates an SQLite database to keep instructions and their features in
  • computes embeddings of the instructions
  • indexes the embeddings in an HNSW index via usearch
  • clusters the embeddings
  • computes features of each instruction in the dataset (basic text statistics + a rerank score)

Once the dataset is built, elemelek provides a simple interface for sampling filtered data.
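For intuition, the basic text statistics resemble the following. This is a hypothetical re-implementation, not elemelek's actual code; the feature names mirror the `feature_*` fields shown in the output JSONL example later in this README:

```python
import statistics


def basic_text_features(text: str) -> dict:
    """Sketch of per-instruction basic text statistics (hypothetical)."""
    word_lengths = [len(w) for w in text.split()] or [0]
    n_numeric = sum(c.isdigit() for c in text)
    n_non_alnum = sum(not c.isalnum() for c in text)
    return {
        "median_word_length": statistics.median(word_lengths),
        "total_length": len(text),
        "is_question": text.rstrip().endswith("?"),
        "numeric_chars_ratio": n_numeric / max(len(text), 1),
        "non_alpha_numeric_chars_ratio": n_non_alnum / max(len(text), 1),
    }


features = basic_text_features("Jakie są produkty z mleka?")
```

Cheap scalar features like these make it possible to filter instructions without touching the GPU-heavy embedding and rerank steps.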

How to use it:

First you need to "build" your dataset.
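The file pointed to by dataset_jsonl_path is expected to hold one instruction record per line; judging from the output example later in this README, each record carries instruction, input, and output fields (the field names are inferred from that example, and the record contents below are made up). A minimal sketch of preparing such a file:

```python
import json
import os
import tempfile

# Hypothetical input records; field names inferred from the output
# JSONL example shown later in this README.
records = [
    {
        "instruction": "What are dairy products?",
        "input": "",
        "output": "Dairy includes milk, cheese, yogurt...",
    },
    {
        "instruction": "Summarize the text.",
        "input": "A long article...",
        "output": "A short summary.",
    },
]

# Write one JSON object per line (the JSONL format elemelek reads).
path = os.path.join(tempfile.gettempdir(), "instructions.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```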

Create YAML config file:

dataset_jsonl_path: /path/to/your/file.jsonl
db:
  database_insert_batch_size: 1000 # chunk size the dataset will be written to the db in 
  remove_duplicates: true # do not keep duplicated entries in database 
semantic_index:
  embeddings_model_name: sentence-transformers/all-MiniLM-L6-v2 # sentence-transformer model used to compute embeddings of instructions 
  embeddings_computation_batch_size: 32 # batch-size used for embeddings computation 
  metric: cos # metric used for HNSW index 
  connectivity: 128 # HNSW connectivity parameter  
  dtype: 'f32' # index dtype (`f16` / `i8` can be used for performance reasons)  
  expansion_add: 128  # expansion factor used for index construction when adding vectors
  expansion_search: 256 # expansion factor used during search operations 
  n_clusters: 10000 # number of clusters to compute once index is created 
features:
  basic: true # whether or not to compute basic features 
  reranker:
    model_name: cross-encoder/ms-marco-MiniLM-L-2-v2 # reranker to score relevance of (instruction, input) => output pairs 
    batch_size: 16 # batch size used for reranking computation 
    strategy: truncate # strategy of how to treat long text (currently truncate only) 
  language_tool: # set it to false if you don't need this one  [a bit experimental + it takes a while] 
    lang: pl-PL # language of your instructions
    n_threads: 4 # number of threads that will check your texts with language_tool 

Run

from elemelek.nest import Elemelek, Egg
# read config file  
egg = Egg.from_yaml("config.yaml")
# create your elemelek - this will take a moment; a strong GPU is needed for embeddings and rerank relevance score computation 
elemelek = Elemelek(egg)

Once your dataset is built, you can start sampling:

from elemelek.features import RERANKER_RELEVANCE_SCORE, IS_QUESTION
from elemelek.model import SubsetChoiceMethod

# start sampling 
sample = elemelek.start_sampling(shuffle=True)

# filter questions with relevance score > 0.9 
substantive_questions = sample.filter(
    lambda x: (
        x.get_feature(IS_QUESTION).value and
        x.get_feature(RERANKER_RELEVANCE_SCORE).value > 0.9
    )
)

# get subset of 25k diverse substantive questions  
diverse_questions = substantive_questions.sample_diverse(
    k=25000,
    method=SubsetChoiceMethod.VARIABILITY_FACTOR,  
    within_cluster_diversity_factor=1.0
    # within_cluster_diversity_factor=0.0 => the least diverse subset
    # within_cluster_diversity_factor=1.0 => the most diverse subset 
)

# get non-questions 
non_questions = sample.filter(
    lambda x: not x.get_feature(IS_QUESTION).value,
)

# get 25k diverse non questions 
diverse_non_questions = non_questions.sample_diverse(
    k=25000,
    method=SubsetChoiceMethod.VARIABILITY_FACTOR,  
    within_cluster_diversity_factor=1.0
)

# compose final sample 
final_sample = diverse_non_questions + diverse_questions

# get DF and play with it 
df = final_sample.to_pandas()
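With the DataFrame in hand, quick quality checks become one-liners. A sketch assuming the columns follow the `feature_*` naming from the output JSONL example (the values here are made up, not real elemelek output):

```python
import pandas as pd

# Toy frame mimicking a few columns of final_sample.to_pandas();
# column names taken from the output JSONL example, values made up.
df = pd.DataFrame({
    "feature_is_question": [True, False, True],
    "feature_reranker-relevance-score": [0.96, 0.42, 0.88],
    "feature_total_length": [292, 120, 431],
})

# Mean relevance score for questions vs non-questions
summary = df.groupby("feature_is_question")["feature_reranker-relevance-score"].mean()
```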

# dump your data to JSONL (and hopefully train your great fine-tuned LLM)
# features will be included in the output jsonl + you will find a __instruction_text field 
# representing the formatted instruction (using apply_chat_template from the tokenizer of your choice) 
final_sample.to_jsonl(
    "my-awesome-sample.jsonl", 
    include_features=True, 
    include_instruction_using_chat_template_from="mistralai/Mistral-7B-Instruct-v0.2"
)

Your my-awesome-sample.jsonl entries will look like this:

{
  "id": 476137,
  "instruction": "Jakie są produkty z mleka?",
  "input": "",
  "output": "Do nabiału należą również produkty mleczne...",
  "feature_source_name": "almost_like_an_alpaca",
  "feature_category": "ALMOST_LIKE_AN_ALPACA",
  "feature_median_word_length": 6,
  "feature_quantile_0.9_word_length": 9.6,
  "feature_quantile_0.1_word_length": 1.8,
  "feature_total_length": 292,
  "feature_is_question": true,
  "feature_has_input": true,
  "feature_numeric_chars_ratio": 0,
  "feature_non_alpha_numeric_chars_ratio": 0.18835616438356165,
  "feature_reranker-relevance-score": 0.9645681381225586,
  "__instruction_text": "<s>[INST] Jakie są produkty z mleka? \n  [/INST]Do nabiału należą również produkty mleczne...</s> "
}
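Each line of the dumped file is plain JSON, so turning it into training inputs needs nothing beyond the json module. A minimal sketch using a shortened copy of the record above:

```python
import json

# One line of the dumped JSONL (copied from the example above, shortened).
line = json.dumps({
    "id": 476137,
    "instruction": "Jakie są produkty z mleka?",
    "feature_is_question": True,
    "__instruction_text": "<s>[INST] Jakie są produkty z mleka? [/INST]...</s>",
})

record = json.loads(line)
# __instruction_text already carries the chat-template formatting,
# so it is the string you would feed to fine-tuning.
training_text = record["__instruction_text"]
```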

Additionally, you can

# search through your instructions semantically 
matched_instructions = elemelek.search("How much wood would the woodchuck chuck?", k=10)

# examine clustering requested in your config 
clustering = elemelek.clustering
centroid_instruction_id = clustering[0].centroid_id
example_centroid_instruction = elemelek.db[centroid_instruction_id] # access your instruction like this 

# list all precomputed feature names  
feature_names = elemelek.feature_names

Once you are done, you can resume your work later:

from elemelek.nest import Elemelek
datasets = Elemelek.list_datasets()
# >> {'7ff7a3107f44d545c9ac6703c3893e0b': Egg(...)}
elemelek = Elemelek.from_dataset_id('7ff7a3107f44d545c9ac6703c3893e0b')

Have fun!
