A framework for creating and curating high-quality code datasets tailored for large language models
CodableLLM
CodableLLM is a Python framework for creating and curating high-quality code datasets tailored for training and evaluating large language models (LLMs). It supports source code and decompiled code extraction, with a flexible architecture for handling multiple languages and integration with custom LLM prompts.
Installation
PyPI
Install CodableLLM directly from PyPI:
pip install codablellm
Docker Compose (Recommended)
CodableLLM uses Prefect for orchestration and parallel processing. Because Prefect relies on a backend database, we recommend using the provided Docker Compose setup, which includes a configured PostgreSQL database.
Run an example extraction using Docker Compose:
docker compose run --rm app \
codablellm \
--url https://github.com/dmanuel64/codablellm/raw/refs/heads/main/examples/demo-c-repo.zip \
/tmp/demo-c-repo \
./demo-c-repo.csv \
/tmp/demo-c-repo \
--strip \
--transform my_transform.transform \
--generation-mode temp-append \
--build make
This command does the following:
- Downloads and extracts a compressed C project archive from the given --url to /tmp/demo-c-repo.
- Uses /tmp/demo-c-repo as both the source of extracted code and the location of compiled binaries.
- Outputs a dataset to ./demo-c-repo.csv (relative to your host machine).
- Runs the build command (make) inside the extracted repo directory to generate binaries.
- Applies transformations using the function defined in my_transform.py (i.e., my_transform.transform).
- Uses --generation-mode temp-append, which appends transformed outputs to the original dataset, preserving both.
This uses the app service defined in docker-compose.yml, giving you access to the full environment including Prefect and PostgreSQL, which are required for managing flows and task state.
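To make the append behavior of temp-append concrete, here is a minimal, self-contained sketch. It is not CodableLLM's implementation and does not use its API: the row layout, the `transformed` column, and the `append_transformed` helper are assumptions made for illustration, and the `transform` function here works on plain strings, which may differ from the signature CodableLLM expects in my_transform.py.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def transform(definition: str) -> str:
    """Hypothetical transform: prefix a function definition with a comment."""
    return "/* note: reviewed */\n" + definition


def append_transformed(rows: List[Row], fn: Callable[[str], str]) -> List[Row]:
    """Return the original rows followed by a transformed copy of each row.

    The input rows are left untouched, so both versions are preserved.
    """
    originals = [{**row, "transformed": "no"} for row in rows]
    transformed = [
        {**row, "definition": fn(row["definition"]), "transformed": "yes"}
        for row in rows
    ]
    return originals + transformed


# Illustrative data only; a real dataset would come from the extraction step.
rows = [{"name": "add", "definition": "int add(int a, int b) { return a + b; }"}]
dataset = append_transformed(rows, transform)

for row in dataset:
    print(row["name"], row["transformed"])
```

The point of the sketch is only that the output contains each function twice, once as extracted and once after the transform, rather than overwriting the original.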
Features
- Extracts functions and methods from source code repositories using tree-sitter.
- Easy integration with LLMs to refine or augment extracted code (e.g., rename variables, insert comments).
- Language-agnostic design with support for plugin-based extractor and decompiler extensions.
- Extendable API for building your own workflows and datasets.
- Fast and scalable, using Prefect to orchestrate and parallelize code extraction, transformation, and dataset generation across multiple processes and tasks.
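As a rough picture of what function and method extraction produces, the sketch below pulls definitions out of a Python source string. It deliberately swaps in Python's standard-library `ast` module instead of tree-sitter so that it runs without extra dependencies; the `extract_functions` helper and the record fields (`name`, `definition`) are assumptions for illustration, not CodableLLM's API or output schema.

```python
import ast
from typing import Dict, List

SOURCE = '''
def add(a, b):
    return a + b


class Greeter:
    def greet(self, name):
        return f"hello {name}"
'''


def extract_functions(source: str) -> List[Dict[str, str]]:
    """Collect every function and method definition found in the source."""
    tree = ast.parse(source)
    records = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            records.append(
                {
                    "name": node.name,
                    # Exact source text of the definition, as written in the file.
                    "definition": ast.get_source_segment(source, node) or "",
                }
            )
    return records


records = extract_functions(SOURCE)
for record in records:
    print(record["name"])
```

A tree-sitter based extractor follows the same idea (parse, walk the syntax tree, keep function-like nodes) but works across many languages through per-language grammars.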
Documentation
Complete documentation is available on Read the Docs.
Contributing
We welcome contributions from the community! See CONTRIBUTING.md for guidelines, development setup, and how to get started.
Download files
Source Distribution
Built Distribution
File details
Details for the file codablellm-1.1.0.tar.gz.
File metadata
- Download URL: codablellm-1.1.0.tar.gz
- Upload date:
- Size: 43.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | df83c4f7e85d49239c167b834d9d2cb8a233e5a42cbcaf116e76fd2d9f9a81bd |
| MD5 | 80ccd4c16c3a74852726fd27d184e24d |
| BLAKE2b-256 | 14b7548c487bb6a8abce8956c21279825c3761450e8b9b498cba501b644dcf77 |
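If you download the source distribution manually, one way to check it against the SHA256 digest listed above is sketched here using only Python's standard library. The helper names (`sha256_of`, `matches_expected`) and the assumption that the file sits in the current directory are illustrative.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed for codablellm-1.1.0.tar.gz.
EXPECTED_SHA256 = "df83c4f7e85d49239c167b834d9d2cb8a233e5a42cbcaf116e76fd2d9f9a81bd"


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's digest equals the expected digest."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("codablellm-1.1.0.tar.gz")  # assumed download location
    if archive.exists():
        print("digest OK" if matches_expected(archive) else "digest MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```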
File details
Details for the file codablellm-1.1.0-py3-none-any.whl.
File metadata
- Download URL: codablellm-1.1.0-py3-none-any.whl
- Upload date:
- Size: 44.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2a92335ee7a901fc42fbe0097589f12f1c468888572b509c7366782904bf5c08 |
| MD5 | b17539ea1772f43394cb96052aa0708c |
| BLAKE2b-256 | 866c83474d4ee39d9a978c7b659521f0d58fff974d7a7e7f6ff59be31b4be3a9 |