CodeFinetuner

Create your own local code autocomplete model, fine-tuned on your custom code repository, for use in editors like VS Code or Vim/Neovim.

Fine-tuning is achieved by training a Low-Rank Adapter (LoRA) to perform Fill-In-the-Middle (FIM) completion.
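
As a rough illustration of why a low-rank adapter is cheap to train: LoRA keeps a weight matrix of shape d x k frozen and learns two small matrices, B (d x r) and A (r x k), so the trainable parameter count drops from d*k to r*(d+k). The sketch below only does this arithmetic. The function names and the 1536 x 1536 layer shape are hypothetical; r = 32 mirrors the lora_r value shown in the configuration template.

```python
def full_param_count(d: int, k: int) -> int:
    """Parameters in a dense d x k weight matrix."""
    return d * k


def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters in a rank-r LoRA update B (d x r) @ A (r x k)."""
    return r * (d + k)


if __name__ == "__main__":
    d, k, r = 1536, 1536, 32  # hypothetical layer shape; r as in lora_r
    full = full_param_count(d, k)
    lora = lora_param_count(d, k, r)
    print(f"full: {full}, lora: {lora}, ratio: {lora / full:.2%}")
```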

Architecture

Raw Code Files
     |
     v
[Preprocess]  -- tree-sitter parsing -> FIM examples -> tokenized JSONL
     |
     v
[Finetune]    -- LoRA adapter training -> merged model
     |
     v
[Evaluate]    -- CodeBLEU, SentenceBLEU, exact match, line match, perplexity
     |
     v
[Convert]     -- GGUF conversion -> quantized model for deployment
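
To make the simpler metrics of the Evaluate stage concrete, here is a minimal sketch of exact match, line match, and perplexity. These are common definitions and only an assumption about how the project computes them; in particular, the position-wise definition of line match and all function names are hypothetical. Perplexity is the exponential of the negative mean log-probability of the target tokens.

```python
import math


def exact_match(prediction: str, reference: str) -> bool:
    """True when prediction equals reference after trimming outer whitespace."""
    return prediction.strip() == reference.strip()


def line_match(prediction: str, reference: str) -> float:
    """Fraction of reference lines reproduced at the same position (assumed definition)."""
    ref_lines = [line.strip() for line in reference.strip().splitlines()]
    pred_lines = [line.strip() for line in prediction.strip().splitlines()]
    if not ref_lines:
        return 1.0 if not pred_lines else 0.0
    hits = sum(1 for p, r in zip(pred_lines, ref_lines) if p == r)
    return hits / len(ref_lines)


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the negative mean natural-log probability over the target tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

A model that assigns probability 0.5 to every target token has a perplexity of 2, which is a quick sanity check for the last function.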

Project Structure

.
├── src/
│   └── codefinetuner/           # Core packages
│       ├── preprocess/
│       ├── finetune/
│       ├── evaluate/
│       └── convert/             
├── config/                      # User configuration
│   └── codefinetuner_config.yaml
├── data/                        # Default data directory (Workspace root)
├── outputs/                     # Pipeline artifacts (Workspace root)
├── scripts/                     # Utility scripts
├── tests/                       # Unit tests 
├── third_party/                 # External submodules (e.g., custom parsers)
└── docs/                        # Documentation and assets

How Training Examples Are Created

To generate high-quality FIM examples, high-level structural code blocks are extracted (e.g., functions, classes). From these blocks, logical sub-blocks (e.g., statements, expressions) are masked to serve as the "middle" section for the model to predict.

Here is an example illustrating how a single FIM example is created:

[Figures: Source Code File; Code Block; One Subblock]
<|fim_prefix|>uint32_t count_bits(uint32_t value){\n  uint32_t count = 0;\n  while(value){\n    
<|fim_suffix|>    }\n    return count;
<|fim_middle|>count = count + (value & 1);\n    value = (value >> 1);

Using this technique, rather than randomly splitting code into unrelated text chunks, helps the model learn the logical patterns and structure of your specific codebase.
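
The masking step can be sketched in a few lines of Python. This is a simplified illustration, not the project's implementation: the span to mask is passed as character offsets chosen by hand, whereas the pipeline derives sub-blocks from tree-sitter nodes. The function name make_fim_example is hypothetical, and whether an end-of-sequence token is appended is left out here.

```python
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def make_fim_example(block: str, start: int, end: int) -> str:
    """Mask block[start:end] and emit the parts in prefix, suffix, middle order."""
    if not 0 <= start <= end <= len(block):
        raise ValueError("span must lie inside the block")
    prefix, middle, suffix = block[:start], block[start:end], block[end:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


if __name__ == "__main__":
    block = (
        "uint32_t count_bits(uint32_t value){\n"
        "  uint32_t count = 0;\n"
        "  while(value){\n"
        "    count = count + (value & 1);\n"
        "    value = (value >> 1);\n"
        "  }\n"
        "  return count;\n"
        "}\n"
    )
    # Mask the loop body, i.e. the sub-block the model should learn to predict.
    body = "count = count + (value & 1);\n    value = (value >> 1);"
    start = block.index(body)
    print(make_fim_example(block, start, start + len(body)))
```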

Installation

From PyPI

uv add codefinetuner
# or
pip install codefinetuner

From Source (Development)

git clone --recurse-submodules https://github.com/cuolm/codefinetuner
cd codefinetuner

# Using uv (Recommended)
uv sync

# Using pip
pip install -r requirements.txt
pip install -e .

Quick Start

Create a configuration file according to the Configuration section.

import codefinetuner

# Run the complete pipeline
codefinetuner.run_pipeline("codefinetuner_config.yaml")

Configuration

The pipeline is driven by a single YAML configuration file that acts as the single source of truth. A YAML anchor (&globals) and merge keys (<<: *globals) share the core parameters across the preprocess, finetune, and evaluate stages, which keeps them consistent and avoids repetition.

Configuration Structure

Create a codefinetuner_config.yaml using the template below. For a full list of all available parameters and their effects, see the Configuration Reference Guide.

# globals contain all the mandatory parameters.
globals: &globals
  workspace_path: null  # null: defaults to current working directory (CWD)
  model_name: "Qwen/Qwen2.5-Coder-1.5B" 
  fim_prefix_token: "<|fim_prefix|>"
  fim_middle_token: "<|fim_middle|>"
  fim_suffix_token: "<|fim_suffix|>"
  fim_pad_token: "<|fim_pad|>"
  eos_token: "<|endoftext|>"
  label_pad_token_id: -100
  data_language: "c"
  data_extensions: [".c", ".h"]

preprocess:
  <<: *globals                   # Inherits all global parameters
  split_mode: "manual"
  max_token_sequence_length: 1024
  # ... (preprocess specific settings)

finetune:
  <<: *globals
  lora_r: 32
  trainer_num_train_epochs: 1
  # ... (finetune specific settings)

evaluate:
  <<: *globals
  benchmark_sample_size: 4
  # ... (evaluate specific settings)

Note: For a complete, production-ready example, see config/codefinetuner_config.yaml.
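
For readers unfamiliar with YAML merge keys: `<<: *globals` copies every key of the anchored mapping into the stage mapping, and keys written in the stage itself take precedence. The sketch below imitates that behavior with plain Python dicts rather than parsing YAML; merge_stage is a hypothetical helper, and GLOBALS holds only a subset of the template's keys.

```python
# Stand-in for the `globals: &globals` mapping (a subset of the template's keys).
GLOBALS = {
    "workspace_path": None,
    "model_name": "Qwen/Qwen2.5-Coder-1.5B",
    "data_language": "c",
    "data_extensions": [".c", ".h"],
}


def merge_stage(shared: dict, stage_specific: dict) -> dict:
    """Mimic `<<: *globals`: start from the shared keys, let stage keys override."""
    return {**shared, **stage_specific}


preprocess = merge_stage(GLOBALS, {"split_mode": "manual", "max_token_sequence_length": 1024})
finetune = merge_stage(GLOBALS, {"lora_r": 32, "trainer_num_train_epochs": 1})

if __name__ == "__main__":
    # Both stages see the same model name, but only their own stage settings.
    print(preprocess["model_name"], preprocess["split_mode"])
    print(finetune["model_name"], finetune["lora_r"])
```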

Data Preparation

Place source files in your raw_data_path (default: workspace_path/data).

  • Auto Split: Place files directly in the directory.
  • Manual Split: Create train, eval, and test subfolders inside raw_data_path and assign files according to your manual split preferences.
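
For the manual split, a small helper script can create the three subfolders and report how many source files ended up in each. This is a convenience sketch and not part of codefinetuner; the function names are hypothetical, and counting files recursively is an assumption about how you organize your data.

```python
from pathlib import Path

SPLITS = ("train", "eval", "test")  # subfolder names from the Manual Split description


def create_manual_split_dirs(raw_data_path: Path) -> list[Path]:
    """Create the train/eval/test subfolders if they do not exist yet."""
    created = []
    for name in SPLITS:
        split_dir = raw_data_path / name
        split_dir.mkdir(parents=True, exist_ok=True)
        created.append(split_dir)
    return created


def count_files_per_split(raw_data_path: Path, extensions: tuple[str, ...]) -> dict[str, int]:
    """Count files with the given extensions under each split folder."""
    counts = {}
    for name in SPLITS:
        split_dir = raw_data_path / name
        if not split_dir.is_dir():
            counts[name] = 0
            continue
        counts[name] = sum(
            1 for p in split_dir.rglob("*") if p.is_file() and p.suffix in extensions
        )
    return counts


if __name__ == "__main__":
    root = Path("data")
    create_manual_split_dirs(root)
    print(count_files_per_split(root, (".c", ".h")))
```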

Usage

CLI Usage

Run the pipeline using the unified CLI:

uv run codefinetuner --config="config/codefinetuner_config.yaml"

Pipeline Flags:

  • --config: Specify path to a different config file.
  • --skip-preprocess, --skip-finetune, --skip-evaluate, --skip-convert: Skip specific stages.
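
The flags map naturally onto the keyword arguments of run_pipeline. The following argparse sketch shows one way such a mapping could look; it is not the project's actual CLI code. The default config path and the skip_finetune and skip_evaluate keyword names are assumptions made by analogy with the skip_preprocess and skip_convert arguments documented for the Python API.

```python
import argparse

STAGES = ("preprocess", "finetune", "evaluate", "convert")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with --config and one --skip-<stage> switch per stage."""
    parser = argparse.ArgumentParser(prog="codefinetuner-sketch")
    # Assumed default; the real CLI's default path is not documented here.
    parser.add_argument("--config", default="config/codefinetuner_config.yaml")
    for stage in STAGES:
        parser.add_argument(f"--skip-{stage}", action="store_true")
    return parser


def to_pipeline_kwargs(argv: list[str]) -> tuple[str, dict[str, bool]]:
    """Return (config_path, skip flags) in keyword-argument form."""
    args = build_parser().parse_args(argv)
    skips = {f"skip_{stage}": getattr(args, f"skip_{stage}") for stage in STAGES}
    return args.config, skips


if __name__ == "__main__":
    config_path, skips = to_pipeline_kwargs(["--skip-preprocess", "--skip-convert"])
    print(config_path, skips)
```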

Python Module Usage

import codefinetuner

# Full pipeline
codefinetuner.run_pipeline("path/to/codefinetuner_config.yaml")

# Skip stages
codefinetuner.run_pipeline(
    "path/to/codefinetuner_config.yaml",
    skip_preprocess=True,
    skip_convert=True
)

# Individual stages
from codefinetuner import preprocess, finetune
preprocess.run("config.yaml")

Deployment: Using the Model

The convert stage converts the model to GGUF format and writes the final file to outputs/convert/results/lora_model.gguf.
For a detailed guide on using the GGUF model with the VS Code extension llama.vscode, see the inference-vscode guide.
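
Before pointing an editor extension at the file, it can be worth confirming that the convert stage actually produced a GGUF file. GGUF files start with the four ASCII bytes "GGUF", so a quick check is possible without loading the model. The helper name looks_like_gguf is hypothetical, and the check says nothing about whether the tensors inside are valid.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def looks_like_gguf(path: Path) -> bool:
    """Return True if the file exists and starts with the GGUF magic bytes."""
    if not path.is_file():
        return False
    with path.open("rb") as handle:
        return handle.read(4) == GGUF_MAGIC


if __name__ == "__main__":
    model_path = Path("outputs/convert/results/lora_model.gguf")
    status = "ok" if looks_like_gguf(model_path) else "missing or not GGUF"
    print(model_path, "->", status)
```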

Create And Run Docker Image

1. Build the Docker Image

Build the image from the Dockerfile, tagging it as codefinetuner-image.

docker build -t codefinetuner-image .

2. Prepare Data and Run the Container

To allow the container to access your data for fine-tuning, use a bind mount to link your host machine's data directory to the container.

  • On your host machine (where you run Docker), create a folder named data if it doesn't already exist.
  • Put all files you want to use for fine-tuning inside the data directory. For manual mode, include train, eval, and test subdirectories containing your manually split files.
  • Start the container with the bind mount and open a Bash shell, using the command that matches your host hardware:

NVIDIA GPU (Recommended)

Use this command to enable CUDA support for torch and bitsandbytes. It requires the NVIDIA Container Toolkit to be installed on the host machine.

# Quote the mount path so it still works if the current directory contains spaces
docker run --gpus all -it --rm \
  -v "$(pwd)/data:/app/data" \
  codefinetuner-image /bin/bash

CPU Only

Use this command if no compatible GPU is available. Note that fine-tuning will be significantly slower.

# Quote the mount path so it still works if the current directory contains spaces
docker run -it --rm \
  -v "$(pwd)/data:/app/data" \
  codefinetuner-image /bin/bash

Tree-sitter Customization

Tree-sitter parses code into structural blocks, which are then used to generate FIM training examples. You can customize the parsing for new languages or build parsers that are missing.

Tests

pytest

License

Licensed under the Apache License 2.0.
