An embedding tool for all state-of-the-art code language models
Project description
CodeClarity - Code Embeddings Made Easy
About CodeClarity
This repository contains CodeClarity, a lightweight app for creating contextual embeddings of source code in a format optimized and designed with code search and understanding in mind. This repository is part of a larger application providing free exploration of the Documatic code search tool's capabilities.
Installation
We recommend Python 3.7 or higher, PyTorch 1.6.0 or higher, and transformers v4.6.0 or higher. The code does not work with Python 2. Install with pip (not currently live, coming soon).
Install codeclarity with pip:
pip install -U codeclarity
Install from source
Alternatively, you can clone the latest version from the repository and install it directly from the source code:
pip install -e .
PyTorch with CUDA
If you want to use a GPU / CUDA, you must install PyTorch with a matching CUDA version. Follow PyTorch - Get Started for further details on how to install PyTorch.
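As a quick sanity check (a minimal sketch, not part of the package), you can verify that PyTorch sees your GPU before loading a model:

import torch

# True only if PyTorch was installed with a CUDA build and a GPU is visible
print(torch.cuda.is_available())
# CUDA version PyTorch was compiled against (None for CPU-only builds)
print(torch.version.cuda)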
Getting Started
First, download a pretrained code model:
from encoder import CodeEmbedder

model = CodeEmbedder(base_model="microsoft/unixcoder-base")
Then provide some code snippets to the model. These can be full functions that could be parsed by an Abstract Syntax Tree, or small snippets.
code_snippets = [
    'def read_csvs(dir): return [pd.read_csv(fp) for fp in os.listdir(dir)]',
    "def set_pytorch_device(): return torch.device('cuda') if torch.cuda.is_available() else 'cpu'",
    'read file from disk into pandas dataframe',
]
code_embeddings = model.encode(code_snippets)
And that's it! We now have a list of embeddings, returned as NumPy arrays by default.
for code, embedding in zip(code_snippets, code_embeddings):
    print("Snippet:", code)
    print("Embedding:", embedding)
    print("")
API Drop-In
This project additionally implements a Docker container that serves a Python REST API with the package running in it to serve a given model. To automatically build the container with any of the supported models for code search, run the following:
git clone https://github.com/DocumaticAI/code-embeddings-api.git
cd code-embeddings-api/api && bash ./setup.sh
Equally, to run the API outside the Docker container, clone the repository, navigate to the API folder, and run the API file directly:
git clone https://github.com/DocumaticAI/code-embeddings-api.git
cd code-embeddings-api
pip install -r requirements-dev.txt
cd api
python predictor.py
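Once the server is up, it can be queried like any REST service. The sketch below uses the requests library; the port, endpoint path, and payload shape are assumptions for illustration, so check the repository for the actual route:

import requests

# Hypothetical endpoint and payload shape; consult the API repo for the real route
response = requests.post(
    "http://localhost:8080/predict",
    json={"code": ["def add(a, b): return a + b"]},
)
print(response.json())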
Pre-Trained Models
We provide implementations of a range of code embedding models that are currently the state of the art in various tasks, including code semantic search, code clustering, code program detection, synthesis, and more. Some models are general purpose, while others produce embeddings for specific use cases. Pre-trained models can be loaded by just passing the model name: CodeEmbedder('model_name').
Currently supported models
- CodeBERT (base model): A Pre-Trained Model for Programming and Natural Languages
- CodeBERT (Python, fine-tuned on the CodeSearchNet corpus): A Pre-Trained Model for Programming and Natural Languages
- UniXcoder (base model): Unified Cross-Modal Pre-training for Code Representation
- UniXcoder (9 language variant): Unified Cross-Modal Pre-training for Code Representation
- UniXcoder (unimodal variant): Unified Cross-Modal Pre-training for Code Representation
- InCoder 1B parameter model: A Generative Model for Code Infilling and Synthesis
- InCoder 6B parameter model: A Generative Model for Code Infilling and Synthesis
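For example, the base checkpoints can be loaded by their public Hugging Face Hub identifiers. This is a sketch: the identifiers below are the standard Hub names for these models, and passing them to CodeEmbedder follows the pattern shown in Getting Started:

from encoder import CodeEmbedder

# Public Hugging Face Hub identifiers for some of the models listed above
codebert = CodeEmbedder(base_model="microsoft/codebert-base")
unixcoder_nine = CodeEmbedder(base_model="microsoft/unixcoder-base-nine")
incoder_1b = CodeEmbedder(base_model="facebook/incoder-1B")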
Internals of the Docker API
CodeClarity is designed to be a simple, modular, dockerized Python application that can be used to obtain dense vector representations of natural language queries and source code jointly, to empower semantic search of codebases.
It comprises a lightweight, async FastAPI application running on a Gunicorn web server. On startup, any of the supported models is injected into the container, converted to an optimized serving format, and served over a REST API.
CodeClarity automatically handles checking for supported languages for code models, dynamic batching of both code and natural language snippets, and conversion of code models to TorchScript, all asynchronously!
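As a rough illustration of the TorchScript step (a minimal sketch with a stand-in Hugging Face encoder, not the container's actual conversion code):

import torch
from transformers import AutoModel, AutoTokenizer

# Trace a Hugging Face encoder into TorchScript for optimized serving
tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
hf_model = AutoModel.from_pretrained("microsoft/unixcoder-base", torchscript=True)
hf_model.eval()

inputs = tokenizer("def add(a, b): return a + b", return_tensors="pt")
traced = torch.jit.trace(hf_model, (inputs["input_ids"], inputs["attention_mask"]))
torch.jit.save(traced, "unixcoder_traced.pt")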
Publications
The following papers are implemented or used heavily in this repo; this project would not be possible without their work:
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages
- UniXcoder: Unified Cross-Modal Pre-training for Code Representation (ACL 2022)
- InCoder: A Generative Model for Code Infilling and Synthesis
- A Conversational Paradigm for Program Synthesis
- CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation (EMNLP 2021)
About Documatic
Documatic is the company that delivers a more efficient codebase in 5 minutes. While you focus on coding, Documatic handles, creates, and deploys the documentation so it's always up to date.
Getting help
If you have any questions about, feedback for, or a problem with CodeClarity:
- Read the Documatic website.
- Sign up for the Documatic waitlist.
- File an issue or request a feature.
Project details
Download files
Download the file for your platform.
Source Distribution
File details
Details for the file codeclarity-0.1.tar.gz.
File metadata
- Download URL: codeclarity-0.1.tar.gz
- Upload date:
- Size: 3.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.9.12
File hashes
Algorithm | Hash digest
---|---
SHA256 | 96ed569369ba2d5f9036dc90bd253553739060d3e938a6b8ff8498392f3e5ac6
MD5 | b693b6d8438154fa8f98af4293e35e3d
BLAKE2b-256 | 5b4fd4118c2ca7fbdc807520cafb96f606d36c8452b743d5eaa8524ce775720b