
An embedding tool for state-of-the-art code language models

CodeClarity: Code Embeddings Made Easy

About CodeClarity

This repository contains CodeClarity, a lightweight app for creating contextual embeddings of source code in a format optimized for code search and code understanding. It is part of a larger application that offers a free way to explore the capabilities of Documatic's code search tools.

Installation

We recommend Python 3.7 or higher, PyTorch 1.6.0 or higher, and transformers v4.6.0 or higher. The code does not work with Python 2. Installation with pip is not live yet; it is coming soon.

Install CodeClarity with pip:

pip install -U codeclarity

Install from sources

Alternatively, clone the latest version from the repository and install it directly from source:

pip install -e .

PyTorch with CUDA

If you want to use a GPU with CUDA, you must install a PyTorch build that matches your CUDA version. See PyTorch's Get Started guide for details on installing PyTorch.
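Before loading a model, it can help to confirm which device PyTorch will actually use. The following is a minimal sketch and not part of CodeClarity: `pick_device` is a hypothetical helper name, and it falls back to the CPU when PyTorch is not installed or no CUDA device is visible.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" if PyTorch is installed and sees a GPU, else "cpu"."""
    # find_spec avoids an ImportError when PyTorch is not installed yet.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


print("Using device:", pick_device())
```

If this reports "cpu" on a machine with a GPU, the installed PyTorch build most likely does not match the local CUDA version.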

Getting Started

First download a pretrained code model.

from encoder import CodeEmbedder
model = CodeEmbedder(base_model = "microsoft/unixcoder-base")

Then provide some code snippets to the model. These can be full functions that could be parsed into an abstract syntax tree, or small fragments.

code_snippits = ['def read_csvs(dir) : return [pd.read_csv(fp) for fp in os.listdir(dir)]',
    "def set_pytorch_device(): return torch.device('cuda') if torch.cuda.is_available() else 'cpu'",
    'read file from disk into pandas dataframe']
code_embeddings = model.encode(code_snippits)

And that's it! The model returns one embedding per input, as NumPy arrays by default.

for code, embedding in zip(code_snippits, code_embeddings):
    print("Snippet:", code)
    print("Embedding:", embedding)
    print("")
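Because the last input above is a natural-language query, a common next step is to rank the code snippets against it by cosine similarity. The sketch below is an illustration, not part of the CodeClarity API: `cosine_similarity` and `rank_by_similarity` are hypothetical helpers, and the small hand-written vectors stand in for real model output so the example runs without downloading a model. It works on any sequences of floats, including NumPy arrays.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, candidate_vecs):
    """Return candidate indices sorted from most to least similar."""
    scores = [cosine_similarity(query_vec, vec) for vec in candidate_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Placeholder vectors; with a real model, the candidates would be
# code_embeddings[:2] and the query would be code_embeddings[2].
snippet_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
query_vec = [0.9, 0.1, 0.0]

ranking = rank_by_similarity(query_vec, snippet_vecs)
print("Best match index:", ranking[0])
```

Cosine similarity ignores vector length, so snippets are compared by direction in the embedding space only.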

API Drop in

This project also implements a Docker container that serves a Python REST API, with the package running inside it, for a given model. To build the container automatically with any of the supported code search models, run the following:

git clone https://github.com/DocumaticAI/code-embeddings-api.git
cd code-embeddings-api/api && bash ./setup.sh

Equally, to run the API outside the Docker container, clone the repository, navigate to the API folder, and run the API file directly:

git clone https://github.com/DocumaticAI/code-embeddings-api.git
cd code-embeddings-api
pip install -r requirements-dev.txt
cd api
python predictor.py

Pre-Trained Models

We provide implementations of a range of code embedding models that are currently state of the art in various tasks, including code semantic search, code clustering, code program detection, synthesis, and more. Some models are general purpose, while others produce embeddings for specific use cases. Pre-trained models can be loaded by passing the model name: CodeEmbedder('model_name').

Currently supported models

Internals of the Docker API

CodeClarity is designed to be a simple, modular, dockerized Python application that can be used to obtain dense vector representations of natural language code queries and source code jointly, to power semantic search of codebases.

It consists of a lightweight, async FastAPI application running on a Gunicorn web server. On startup, any of the supported models is injected into the container, converted to an optimized serving format, and served over a REST API.

CodeClarity automatically handles checking for supported languages for code models, dynamic batching of both code and natural language snippets, and conversion of code models to TorchScript, all in an asynchronous manner!
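To illustrate what dynamic batching means here, the sketch below groups snippets of similar length so that each padded batch stays under a size budget, and keeps each snippet's original index so results can be restored to input order. It is an illustration under stated assumptions, not CodeClarity's actual implementation: `dynamic_batches` and `max_chars` are hypothetical names, and character count stands in for token count.

```python
from typing import List, Tuple


def dynamic_batches(
    snippets: List[str], max_chars: int = 200
) -> List[List[Tuple[int, str]]]:
    """Group (original_index, snippet) pairs into batches by padded size."""
    # Sort by length so each batch holds similarly sized snippets.
    indexed = sorted(enumerate(snippets), key=lambda pair: len(pair[1]))
    batches: List[List[Tuple[int, str]]] = []
    current: List[Tuple[int, str]] = []
    for idx, text in indexed:
        # Padded cost: the newest item is the longest (input is sorted),
        # so every item in the batch would be padded to its length.
        padded_cost = len(text) * (len(current) + 1)
        if current and padded_cost > max_chars:
            batches.append(current)
            current = []
        current.append((idx, text))
    if current:
        batches.append(current)
    return batches


snippets = ["x = 1", "def f(a, b): return a + b", "read csv file", "y" * 150]
for batch in dynamic_batches(snippets):
    print([idx for idx, _ in batch])
```

Sorting by length first keeps padding waste low, and a snippet that exceeds the budget on its own is still emitted as a single-item batch rather than dropped.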

Publications

The following papers are implemented or used heavily in this repo, and this project would not be possible without their work:

About Documatic

Documatic is the company that delivers a more efficient codebase in 5 minutes. While you focus on coding, Documatic handles, creates, and deploys the documentation so it's always up to date.

Getting help

If you have any questions about, feedback for, or a problem with CodeClarity:
