CodeFinetuner
Create your own local code autocomplete model, fine-tuned on your custom code repository, for use in editors like VS Code or Vim/Neovim.
Fine-tuning is achieved by training a Low-Rank Adapter (LoRA) to perform Fill-In-the-Middle (FIM) completion.
Table of Contents
- Architecture
- Project Structure
- How Training Examples Are Created
- Installation
- Quick Start
- Configuration
- Usage
- Deployment
- Docker
- Tree-sitter Setup
- Tests
- Resources
- License
Architecture
Raw Code Files
|
v
[Preprocess] -- tree-sitter parsing -> FIM examples -> tokenized JSONL
|
v
[Finetune] -- LoRA adapter training -> merged model
|
v
[Evaluate] -- CodeBLEU, SentenceBLEU, exact match, line match, perplexity
|
v
[Convert] -- GGUF conversion -> quantized model for deployment
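As a rough illustration of three of the metrics named in the Evaluate stage, the sketch below computes exact match, line match, and perplexity in plain Python. The function names and the precise definitions (whitespace-stripped comparison, position-wise line matching) are assumptions made for this example; the project's own evaluate stage may define them differently.

```python
import math


def exact_match(prediction: str, reference: str) -> bool:
    """True when both strings are identical after stripping outer whitespace."""
    return prediction.strip() == reference.strip()


def line_match(prediction: str, reference: str) -> float:
    """Fraction of reference lines reproduced at the same position (assumed definition)."""
    pred_lines = [line.strip() for line in prediction.strip().splitlines()]
    ref_lines = [line.strip() for line in reference.strip().splitlines()]
    if not ref_lines:
        return 0.0
    hits = sum(1 for p, r in zip(pred_lines, ref_lines) if p == r)
    return hits / len(ref_lines)


def perplexity(token_log_probs: list[float]) -> float:
    """exp of the mean negative log-likelihood over the target tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

Lower perplexity and higher match scores indicate that the fine-tuned model reproduces the masked code more faithfully.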
Project Structure
.
├── src/
│ └── codefinetuner/ # Core packages
│ ├── preprocess/
│ ├── finetune/
│ ├── evaluate/
│ └── convert/
├── config/ # User configuration
│ └── codefinetuner_config.yaml
├── data/ # Default data directory (Workspace root)
├── outputs/ # Pipeline artifacts (Workspace root)
├── scripts/ # Utility scripts
├── tests/ # Unit tests
├── third_party/ # External submodules (e.g., custom parsers)
└── docs/ # Documentation and assets
How Training Examples Are Created
To generate high-quality FIM examples, high-level structural code blocks are extracted (e.g., functions, classes). From these blocks, logical sub-blocks (e.g., statements, expressions) are masked to serve as the "middle" section for the model to predict.
Here is an example illustrating how a single FIM example is created:
[Figure: Source Code File | Code Block | One Subblock]
<|fim_prefix|>uint32_t count_bits(uint32_t value){\n uint32_t count = 0;\n while(value){\n
<|fim_suffix|> }\n return count;
<|fim_middle|>count = count + (value & 1);\n value = (value >> 1);
Using this technique, rather than randomly splitting code into unrelated text chunks, helps the model learn the logical patterns and structure of your specific codebase.
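The assembly of one such example can be sketched in a few lines of Python. This is a minimal illustration, assuming the masked sub-block is located by character offsets; `build_fim_example` and its parameters are hypothetical names, and the actual preprocess stage derives the boundaries from tree-sitter nodes and tokenizes the result.

```python
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_example(block: str, start: int, end: int) -> str:
    """Mask block[start:end] and emit the prefix, suffix, middle ordering shown above."""
    if not 0 <= start < end <= len(block):
        raise ValueError("sub-block offsets must lie inside the code block")
    prefix, middle, suffix = block[:start], block[start:end], block[end:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


# The code block from the example above, with the loop body as the masked sub-block.
block = (
    "uint32_t count_bits(uint32_t value){\n"
    "    uint32_t count = 0;\n"
    "    while(value){\n"
    "        count = count + (value & 1);\n"
    "        value = (value >> 1);\n"
    "    }\n"
    "    return count;\n"
    "}\n"
)
start = block.index("count = count")  # first statement of the loop body
end = block.index("    }\n")          # closing brace of the while loop
example = build_fim_example(block, start, end)
```

Because the middle section comes last, the model learns to generate the masked code after seeing both the text before and the text after it.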
Installation
From PyPI
uv add codefinetuner
# or
pip install codefinetuner
From Source (Development)
git clone --recurse-submodules https://github.com/cuolm/codefinetuner
cd codefinetuner
# Using uv (Recommended)
uv sync
# Using pip
pip install -r requirements.txt
pip install -e .
Quick Start
Create a configuration file according to the Configuration section.
import codefinetuner
# Run the complete pipeline
codefinetuner.run_pipeline("codefinetuner_config.yaml")
Configuration
The pipeline is driven by a single YAML configuration file that acts as the single source of truth. YAML anchors (&globals) share core parameters across the preprocess, finetune, and evaluate stages, which keeps the stages consistent and avoids repetition.
Configuration Structure
Create a codefinetuner_config.yaml using the template below. For a full list of all available parameters and their effects, see the Configuration Reference Guide.
# globals contain all the mandatory parameters.
globals: &globals
  workspace_path: null # null: defaults to current working directory (CWD)
  model_name: "Qwen/Qwen2.5-Coder-1.5B"
  fim_prefix_token: "<|fim_prefix|>"
  fim_middle_token: "<|fim_middle|>"
  fim_suffix_token: "<|fim_suffix|>"
  fim_pad_token: "<|fim_pad|>"
  eos_token: "<|endoftext|>"
  label_pad_token_id: -100
  data_language: "c"
  data_extensions: [".c", ".h"]
preprocess:
  <<: *globals # Inherits all global parameters
  split_mode: "manual"
  max_token_sequence_length: 1024
  # ... (preprocess specific settings)
finetune:
  <<: *globals
  lora_r: 32
  trainer_num_train_epochs: 1
  # ... (finetune specific settings)
evaluate:
  <<: *globals
  benchmark_sample_size: 4
  # ... (evaluate specific settings)
Note: For a complete, production-ready example, see config/codefinetuner_config.yaml.
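The effect of the `<<: *globals` merge key can be illustrated with plain Python dictionaries: each stage starts from the shared keys, and any key set in the stage itself takes precedence. This is only a sketch of the YAML merge semantics, not the project's config loader; `resolve_stage` is a hypothetical helper and only a subset of the template's keys is shown.

```python
# Shared values copied from the template above (subset only).
GLOBALS = {
    "workspace_path": None,
    "model_name": "Qwen/Qwen2.5-Coder-1.5B",
    "data_language": "c",
    "data_extensions": [".c", ".h"],
}


def resolve_stage(stage_overrides: dict) -> dict:
    """Mimic `<<: *globals`: start from the shared keys, then apply the stage keys on top."""
    return {**GLOBALS, **stage_overrides}


preprocess = resolve_stage({"split_mode": "manual", "max_token_sequence_length": 1024})
finetune = resolve_stage({"lora_r": 32, "trainer_num_train_epochs": 1})
```

Both resolved stages end up with the same `model_name`, which is the consistency the anchor is meant to provide.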
Data Preparation
Place source files in your raw_data_path (default: workspace_path/data).
- Auto Split: Place files directly in the directory.
- Manual Split: Create train, eval, and test subfolders inside raw_data_path and assign files to them according to your preferred split.
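Before starting a run in manual mode, it can help to confirm that the expected layout is in place. The check below is a small sketch using only the standard library; `missing_split_dirs` is a hypothetical helper that is not part of codefinetuner, and the demo runs against a temporary directory rather than a real raw_data_path.

```python
import tempfile
from pathlib import Path

SPLITS = ("train", "eval", "test")


def missing_split_dirs(raw_data_path: Path) -> list[str]:
    """Return the names of the split subfolders that do not exist yet."""
    return [name for name in SPLITS if not (raw_data_path / name).is_dir()]


# Demo: start with only `train`, then create whatever is missing.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp)
    (raw / "train").mkdir()
    before = missing_split_dirs(raw)
    for name in before:
        (raw / name).mkdir()
    after = missing_split_dirs(raw)
```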
Usage
CLI Usage
Run the pipeline using the unified CLI:
uv run codefinetuner --config="config/codefinetuner_config.yaml"
Pipeline Flags:
- --config: Specify path to a different config file.
- --skip-preprocess, --skip-finetune, --skip-evaluate, --skip-convert: Skip specific stages.
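To make the flag semantics concrete, the sketch below models them with `argparse` and reports which stages would still run. It is not the project's actual CLI implementation: `build_parser` and `stages_to_run` are hypothetical names, and the default value for `--config` is an assumption taken from the example command above.

```python
import argparse

STAGES = ("preprocess", "finetune", "evaluate", "convert")


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="codefinetuner")
    # Assumed default; the real CLI may use a different one.
    parser.add_argument("--config", default="config/codefinetuner_config.yaml")
    for stage in STAGES:
        parser.add_argument(f"--skip-{stage}", action="store_true")
    return parser


def stages_to_run(argv: list[str]) -> list[str]:
    """Return the stages left after applying the --skip-* flags, in pipeline order."""
    args = build_parser().parse_args(argv)
    return [stage for stage in STAGES if not getattr(args, f"skip_{stage}")]
```

For example, `stages_to_run(["--skip-preprocess", "--skip-convert"])` leaves only the finetune and evaluate stages.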
Python Module Usage
import codefinetuner
# Full pipeline
codefinetuner.run_pipeline("path/to/codefinetuner_config.yaml")
# Skip stages
codefinetuner.run_pipeline(
"path/to/codefinetuner_config.yaml",
skip_preprocess=True,
skip_convert=True
)
# Individual stages
from codefinetuner import preprocess, finetune
preprocess.run("config.yaml")
Deployment: Using the Model
The convert stage converts the model to GGUF format. The final GGUF file is located under outputs/convert/results/lora_model.gguf.
For a detailed guide on using the GGUF model with the VS Code extension llama.vscode, see the inference-vscode guide.
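A quick sanity check on the converted file is to read its header: GGUF files begin with the four magic bytes `GGUF`, followed by a 32-bit format version. The sketch below assumes a little-endian file and uses only the standard library; `read_gguf_version` is a hypothetical helper, and the demo writes a fake header to a temporary file instead of reading a real model.

```python
import struct
import tempfile
from pathlib import Path

GGUF_MAGIC = b"GGUF"


def read_gguf_version(path: Path) -> int:
    """Return the GGUF format version, or raise ValueError if the magic bytes are wrong."""
    with path.open("rb") as handle:
        header = handle.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    (version,) = struct.unpack("<I", header[4:8])  # assumes little-endian layout
    return version


# Demo with a fabricated 8-byte header (magic + version 3).
with tempfile.TemporaryDirectory() as tmp:
    fake = Path(tmp) / "fake.gguf"
    fake.write_bytes(GGUF_MAGIC + struct.pack("<I", 3))
    version = read_gguf_version(fake)
```

In a real workspace the same function could be pointed at outputs/convert/results/lora_model.gguf after the convert stage has finished.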
Create And Run Docker Image
1. Build the Docker Image
Build the image from the Dockerfile, tagging it as codefinetuner-image.
docker build -t codefinetuner-image .
2. Prepare Data and Run the Container
To allow the container to access your data for fine-tuning, use a bind mount to link your host machine's data directory to the container.
- On your host machine (where you run Docker), create a folder named data if it doesn't already exist.
- Put all files you want to use for fine-tuning inside the data directory. For manual mode, include train, eval, and test subdirectories containing your manually split files.
- Start the container with the bind mount and open a Bash shell, using the command that matches your host machine's hardware:
NVIDIA GPU (Recommended)
Use this command to enable CUDA support for torch and bitsandbytes. Requires the NVIDIA Container Toolkit installed on the host machine.
docker run --gpus all -it --rm \
-v $(pwd)/data:/app/data \
codefinetuner-image /bin/bash
CPU Only
Use this command if no compatible GPU is available. Note that fine-tuning will be significantly slower.
docker run -it --rm \
-v $(pwd)/data:/app/data \
codefinetuner-image /bin/bash
Tree-sitter Customization
Tree-sitter parses code into structural blocks for generating FIM training examples. Customize for new languages or build missing parsers.
- Add Language Definitions - Define block_types/subblock_types in JSON.
- Build Custom Parser - Compile from source (e.g., Mojo).
Tests
pytest
Useful Resources
- Qwen2.5-Coder Technical Report
- Structure-Aware Fill-in-the-Middle Pretraining for Code
- LoRA: Low-Rank Adaptation of Large Language Models
- Efficient Training of Language Models to Fill in the Middle
- From Output to Evaluation: Does Raw Instruction-Tuned Code LLMs Output Suffice for Fill-in-the-Middle Code Generation?
- CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
- HF LLM Course
- llama.vscode
License
Licensed under the Apache License 2.0.