A framework for creating and curating high-quality code datasets tailored for large language models

CodableLLM

CodableLLM is a Python framework for creating and curating high-quality code datasets tailored for training and evaluating large language models (LLMs). It supports source code and decompiled code extraction, with a flexible architecture for handling multiple languages and integration with custom LLM prompts.

Installation

PyPI

Install CodableLLM directly from PyPI:

pip install codablellm
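
To confirm that the install succeeded, a quick check along these lines may help. This is a minimal sketch that relies only on the standard library's importlib.metadata; the helper name installed_version is made up for illustration, and the distribution name codablellm is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("codablellm")
    # Report the version if present; otherwise say the package is missing.
    print(found if found else "codablellm is not installed in this environment")
```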

Docker Compose (Recommended)

CodableLLM uses Prefect for orchestration and parallel processing. Because Prefect relies on a backend database, we recommend using the provided Docker Compose setup, which includes a configured PostgreSQL database.

Run an example extraction using Docker Compose:

docker compose run --rm app \
  codablellm \
  --url https://github.com/dmanuel64/codablellm/raw/refs/heads/main/examples/demo-c-repo.zip \
  /tmp/demo-c-repo \
  ./demo-c-repo.csv \
  /tmp/demo-c-repo \
  --strip \
  --transform my_transform.transform \
  --generation-mode temp-append \
  --build make

This command does the following:

  • Downloads and extracts a compressed C project archive from the given --url to /tmp/demo-c-repo.
  • Uses /tmp/demo-c-repo as both the source of extracted code and the location of compiled binaries.
  • Outputs a dataset to ./demo-c-repo.csv (relative to your host machine).
  • Runs the build command (make) inside the extracted repo directory to generate binaries.
  • Applies transformations using the function defined in my_transform.py (i.e., my_transform.transform).
  • Uses --generation-mode temp-append, which appends transformed outputs to the original dataset, preserving both.

The command runs inside the app service defined in docker-compose.yml, which provides the full environment, including Prefect and PostgreSQL, both of which are required for managing flows and task state.
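
As a rough illustration of what a module such as my_transform.py could contain, the sketch below defines a transform function that inserts a header comment above a C function. The FunctionRecord dataclass and its fields are hypothetical stand-ins so the example runs on its own; the object CodableLLM actually passes to a transform, and its attributes, may differ, so check the documentation before adapting this. In a real workflow the header would likely come from an LLM prompt rather than a fixed template.

```python
from dataclasses import dataclass, replace


# Hypothetical stand-in for whatever object CodableLLM hands to a transform.
@dataclass(frozen=True)
class FunctionRecord:
    name: str
    definition: str


def transform(record: FunctionRecord) -> FunctionRecord:
    """Return a copy of the record with a header comment above its definition."""
    header = f"/* Function: {record.name} */"
    if record.definition.lstrip().startswith(header):
        return record  # already annotated, so the transform stays idempotent
    return replace(record, definition=f"{header}\n{record.definition}")


if __name__ == "__main__":
    original = FunctionRecord(
        name="add",
        definition="int add(int a, int b) {\n    return a + b;\n}",
    )
    transformed = transform(original)
    # Assumed illustration of append-style output: both versions are kept.
    dataset = [original, transformed]
    for row in dataset:
        print(row.definition)
        print("---")
```

Returning a new record instead of mutating the input keeps the original available, which mirrors the idea of a dataset that preserves both the original and transformed entries.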

Features

  • Extracts functions and methods from source code repositories using tree-sitter.
  • Integrates easily with LLMs to refine or augment extracted code (e.g., renaming variables or inserting comments).
  • Language-agnostic design with support for plugin-based extractor and decompiler extensions.
  • Extensible API for building your own workflows and datasets.
  • Fast and scalable, using Prefect to orchestrate and parallelize code extraction, transformation, and dataset generation across multiple processes and tasks.
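
To make the extraction step more concrete, here is a small sketch of pulling functions and methods out of source text. CodableLLM itself uses tree-sitter; this example swaps in Python's built-in ast module purely to stay dependency-free, so it only handles Python source and is not how the framework is implemented. The extract_functions helper and the ExtractedFunction type are illustrative names, not part of the CodableLLM API.

```python
import ast
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ExtractedFunction:
    qualified_name: str  # e.g. "Greeter.greet" for a method
    definition: str      # source text of the function as written


def extract_functions(source: str) -> List[ExtractedFunction]:
    """Collect every function and method definition found in Python source."""
    tree = ast.parse(source)
    found: List[ExtractedFunction] = []

    def visit(node: ast.AST, prefix: str) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                name = f"{prefix}{child.name}"
                text = ast.get_source_segment(source, child) or ""
                found.append(ExtractedFunction(name, text))
                visit(child, f"{name}.")  # pick up nested functions too
            elif isinstance(child, ast.ClassDef):
                visit(child, f"{prefix}{child.name}.")
            else:
                visit(child, prefix)

    visit(tree, "")
    return found


SAMPLE = '''\
def add(a, b):
    return a + b


class Greeter:
    def greet(self, name):
        return f"Hello, {name}"
'''

if __name__ == "__main__":
    for fn in extract_functions(SAMPLE):
        print(fn.qualified_name)  # prints "add", then "Greeter.greet"
```

A tree-sitter based extractor follows the same shape (parse, walk the tree, record each function node with its source text), with a language grammar taking the place of the Python-only parser used here.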

Documentation

Complete documentation is available on Read the Docs.

Citation

If you use this tool in your research, please cite the paper associated with it:

@misc{manuel2025codablellmautomatingdecompiledsource,
      title={CodableLLM: Automating Decompiled and Source Code Mapping for LLM Dataset Generation}, 
      author={Dylan Manuel and Paul Rad},
      year={2025},
      eprint={2507.22066},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2507.22066}, 
}

Contributing

We welcome contributions from the community! See CONTRIBUTING.md for guidelines, development setup, and how to get started.
