Code AutoComplete
CodeAssist: Advanced Code Completion Tool
Introduction
CodeAssist is an advanced code completion tool that provides high-quality code completions for Python, Java, C++, and other programming languages.
Features
- GPT-based code completion
- Code completion for Python, Java, C++, JavaScript, and so on
- Line and block code completion
- Train (fine-tune) a model and run prediction with your own data
Release Models
Arch | BaseModel | Model | Model Size |
---|---|---|---|
GPT | gpt2 | shibing624/code-autocomplete-gpt2-base | 487MB |
GPT | distilgpt2 | shibing624/code-autocomplete-distilgpt2-python | 319MB |
GPT | bigcode/starcoder | WizardLM/WizardCoder-15B-V1.0 | 29GB |
Demo
HuggingFace Demo: https://huggingface.co/spaces/shibing624/code-autocomplete
backend model: shibing624/code-autocomplete-gpt2-base
Install
pip install torch # conda install pytorch
pip install -U codeassist
or
git clone https://github.com/shibing624/codeassist.git
cd codeassist
python setup.py install
Usage
WizardCoder model
WizardCoder-15B is bigcode/starcoder fine-tuned on Alpaca-style code data. You can use the following code to generate code:
example: examples/wizardcoder_demo.py
import sys
sys.path.append('..')
from codeassist import WizardCoder
m = WizardCoder("WizardLM/WizardCoder-15B-V1.0")
print(m.generate('def load_csv_file(file_path):')[0])
output:
import csv
def load_csv_file(file_path):
    """
    Load data from a CSV file and return a list of dictionaries.
    """
    # Open the file in read mode
    with open(file_path, 'r') as file:
        # Create a CSV reader object
        csv_reader = csv.DictReader(file)
        # Initialize an empty list to store the data
        data = []
        # Iterate over each row of data
        for row in csv_reader:
            # Append the row of data to the list
            data.append(row)
    # Return the list of data
    return data
The model produces good-quality output. It currently supports English and Chinese input, and you can enter either instructions or code prefixes as needed.
distilgpt2 model
This is a code autocomplete model fine-tuned from distilgpt2. You can use it as follows:
example: examples/distilgpt2_demo.py
import sys
sys.path.append('..')
from codeassist import GPT2Coder
m = GPT2Coder("shibing624/code-autocomplete-distilgpt2-python")
print(m.generate('import torch.nn as')[0])
output:
import torch.nn as nn
import torch.nn.functional as F
Use with huggingface/transformers:
example: examples/use_transformers_gpt2.py
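For readers who prefer to call the model without the codeassist wrapper, the following is a minimal sketch using the standard transformers auto classes. It is not the content of examples/use_transformers_gpt2.py; the function name `complete_code` and the `max_new_tokens` default are assumptions chosen for illustration.

```python
def complete_code(
    prefix,
    model_name="shibing624/code-autocomplete-distilgpt2-python",
    max_new_tokens=32,  # assumption: an illustrative default, tune as needed
):
    """Return the prefix plus a greedy completion from a causal language model."""
    # Imported lazily so that defining this function does not require
    # transformers or torch to be installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Encode the code prefix as PyTorch tensors.
    inputs = tokenizer(prefix, return_tensors="pt")

    # Greedy decoding; GPT-2 style models have no pad token, so reuse EOS.
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `complete_code('import torch.nn as')` downloads the model on first use and returns the prefix followed by the generated continuation.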
Train Model
Train WizardCoder model
example: examples/training_wizardcoder_mydata.py
cd examples
CUDA_VISIBLE_DEVICES=0,1 python training_wizardcoder_mydata.py --do_train --do_predict --num_epochs 1 --output_dir outputs-wizard --model_name WizardLM/WizardCoder-15B-V1.0
- GPU memory: 31GB
- Fine-tuning needs 2 × V100 (32GB)
- Inference needs 1 × V100 (32GB)
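The memory figure above is roughly what a back-of-envelope estimate of the weights alone gives. The sketch below assumes about 15 billion parameters (taken from the "15B" in the model name) stored in half precision. It ignores activations, optimizer state, and framework overhead, and it says nothing about how the training script actually allocates memory.

```python
def weights_size_gb(num_params, bytes_per_param):
    """Size of the raw model weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: roughly 15 billion parameters, as the model name suggests.
NUM_PARAMS = 15e9

fp16_gb = weights_size_gb(NUM_PARAMS, 2)  # half precision: 2 bytes per parameter
fp32_gb = weights_size_gb(NUM_PARAMS, 4)  # single precision: 4 bytes per parameter

print(f"fp16 weights: {fp16_gb:.0f} GB")
print(f"fp32 weights: {fp32_gb:.0f} GB")
```

Under these assumptions the half-precision weights come to about 30 GB, which is in line with the 29GB model size in the table and explains why a single 32GB V100 is just enough for inference.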
Train distilgpt2 model
example: examples/training_gpt2_mydata.py
cd examples
python training_gpt2_mydata.py --do_train --do_predict --num_epochs 15 --output_dir outputs-gpt2 --model_name gpt2
Note: the resulting fine-tuned model is GPT2-python: shibing624/code-autocomplete-gpt2-base. Fine-tuning took about 24 hours on a V100.
Server
Start the FastAPI server:
example: examples/server.py
cd examples
python server.py
Then open http://0.0.0.0:8001/docs in a browser (use http://localhost:8001/docs if your browser does not accept 0.0.0.0).
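Once the server is running, it can also be queried from a script. The sketch below uses only the standard library. The endpoint path `/autocomplete` and the query parameter `q` are hypothetical and are not confirmed by this README; check the interactive docs page for the actual route and parameters before using it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8001"
ENDPOINT = "/autocomplete"  # hypothetical route, verify it on the /docs page


def build_url(prefix, base_url=BASE_URL, endpoint=ENDPOINT):
    """Build a GET URL that carries the code prefix as the (assumed) `q` parameter."""
    return f"{base_url}{endpoint}?{urlencode({'q': prefix})}"


def request_completion(prefix, timeout=30):
    """Send the prefix to the server and return the decoded JSON response."""
    with urlopen(build_url(prefix), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

`build_url` is kept separate from the network call so the request can be inspected or logged before it is sent.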
Dataset
You can build a custom dataset. Below is an example of the building process.
Let's use Python code from Awesome-pytorch-list:
- We want the model to help auto-complete code at a general level. The code of The Algorithms suits this need.
- The code in this project is well written (high quality).
Dataset tree:
examples/download/python
├── train.txt
├── valid.txt
└── test.txt
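If you already have your own list of source lines, a layout like the one above can be produced with a few lines of standard-library Python. This is a minimal sketch, not the project's own build script; the function name `write_splits` and its parameters are hypothetical. It keeps the lines in their original order so that consecutive lines of code stay together.

```python
from pathlib import Path


def write_splits(lines, out_dir, valid_size, test_size):
    """Write train.txt, valid.txt and test.txt into out_dir and return the line counts."""
    lines = list(lines)
    if valid_size + test_size > len(lines):
        raise ValueError("valid_size + test_size exceeds the number of lines")

    # Contiguous split: the head is training data, the tail is validation then test.
    train_end = len(lines) - valid_size - test_size
    splits = {
        "train": lines[:train_end],
        "valid": lines[train_end:train_end + valid_size],
        "test": lines[train_end + valid_size:],
    }

    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    for name, split in splits.items():
        # One source line per text line, with a trailing newline.
        (out_path / f"{name}.txt").write_text("\n".join(split) + "\n", encoding="utf-8")
    return {name: len(split) for name, split in splits.items()}
```

For example, `write_splits(my_lines, "examples/download/python", valid_size=10000, test_size=10000)` would mirror the split sizes shown for the hosted dataset, where `my_lines` is your own list of strings.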
There are three ways to build the dataset:
- Use the huggingface/datasets library to load the dataset from https://huggingface.co/datasets/shibing624/source_code
from datasets import load_dataset
dataset = load_dataset("shibing624/source_code", "python") # python or java or cpp
print(dataset)
print(dataset['test'][0:10])
output:
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 5215412
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['text'],
        num_rows: 10000
    })
})
{'text': [
" {'max_epochs': [1, 2]},\n",
' refit=False,\n', ' cv=3,\n',
" scoring='roc_auc',\n", ' )\n',
' search.fit(*data)\n',
'',
' def test_module_output_not_1d(self, net_cls, data):\n',
' from skorch.toy import make_classifier\n',
' module = make_classifier(\n'
]}
- Download dataset from Cloud
Name | Source | Download | Size |
---|---|---|---|
Python+Java+CPP source code | Awesome-pytorch-list(5.22 Million lines) | github_source_code.zip | 105M |
Download the dataset, unzip it, and put it in examples/.
- Get source code from scratch and build dataset
cd examples
python prepare_code_data.py --num_repos 260
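To illustrate the general idea of building a corpus from scratch, the sketch below walks a directory of already-cloned repositories and gathers the lines of every Python file. It does not reproduce what prepare_code_data.py does; the function name `collect_source_lines` and the choice to skip unreadable files are assumptions made for this example.

```python
from pathlib import Path


def collect_source_lines(root, suffix=".py"):
    """Return all lines from files under root whose names end with suffix."""
    lines = []
    # Sort the paths so the output order is deterministic across runs.
    for path in sorted(Path(root).rglob(f"*{suffix}")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            # Skip files that are not valid UTF-8 or cannot be read.
            continue
        lines.extend(text.splitlines())
    return lines
```

The returned list can then be written out as train, validation, and test text files in the layout shown in the dataset tree.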
Contact
- Issues (suggestions):
- Email: xuming624@qq.com
- WeChat: add me as xuming624 with the note "name-company-NLP" to join the NLP discussion group.
Citation
If you use codeassist in your research, please cite it in the following format:
APA:
Xu, M. codeassist: Code AutoComplete with GPT model (Version 1.0.0) [Computer software]. https://github.com/shibing624/codeassist
BibTeX:
@software{Xu_codeassist,
author = {Ming Xu},
title = {CodeAssist: Code AutoComplete with Generation model},
url = {https://github.com/shibing624/codeassist},
version = {1.0.0}
}
License
This repository is licensed under the Apache License 2.0.
The WizardCoder model is covered by the Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license; please follow its terms when using that model.
Contribute
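The project code is still rough. If you have improved it, you are welcome to contribute your changes back to this project. Before submitting, please note the following two points:
- Add corresponding unit tests in tests
- Run all unit tests with python setup.py test and make sure they all pass
You can then submit a PR.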