Inference and training for multiple languages of code2seq

pycode2seq

Pure Python library for code2seq embeddings.

Supports extending existing pretrained code2seq embeddings to multilingual models. We provide an example that extends the Java model with Kotlin. The pretrained model and a usage example are provided below.

Installation

pip install pycode2seq

Inference

File embeddings example

from pycode2seq import Code2Seq

model = Code2Seq.load("kt_java")
method_embeddings = model.methods_embeddings("File.kt")

The pretrained common Java and Kotlin model is downloaded automatically.

Full functionality

import sys
from pycode2seq import Code2Seq

def main(argv):
    model = Code2Seq.load("kt_java")

    # Dictionary mapping method names to their embeddings
    method_embeddings = model.methods_embeddings("File.kt", "kt")

    # code2seq predictions for the file passed on the command line
    predictions = model.run_on_file(argv[1], "kt")

    # Predicted method names
    names = [model.prediction_to_text(prediction) for prediction in predictions]

if __name__ == "__main__":
    main(sys.argv)
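
When a project mixes Java and Kotlin sources, the language argument can be chosen from the file extension. The following is a minimal sketch and not part of pycode2seq: `EXTENSION_TO_LANGUAGE`, `language_for`, and `group_by_language` are hypothetical names, and passing "java" as the language argument for Java files is an assumption (only "kt" appears in the example above).

```python
from pathlib import Path

# Hypothetical mapping from file extension to the language argument.
# "kt" is taken from the example above; "java" is an assumption.
EXTENSION_TO_LANGUAGE = {".kt": "kt", ".java": "java"}


def language_for(path):
    """Return the language argument for a source file, or None if unsupported."""
    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())


def group_by_language(paths):
    """Group file paths by language, skipping files with unknown extensions."""
    groups = {}
    for path in paths:
        language = language_for(path)
        if language is not None:
            groups.setdefault(language, []).append(str(path))
    return groups
```

Each path in a group could then be passed, together with its language, to model.methods_embeddings or model.run_on_file.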

Available models

  • Java (java)
  • Kotlin (kt or kotlin)
  • Java & Kotlin (kt_java)

The kt_java model is compatible with the java model and should produce the same embeddings. The kotlin model is part of the kt_java model, so the two are compatible as well.

You can therefore use the common kt_java model to get embeddings in one vector space for both languages.
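
Because both languages share one vector space, embeddings of Java and Kotlin methods can be compared directly, for example with cosine similarity. The sketch below is illustrative only: `cosine_similarity` and `closest_method` are hypothetical helpers, and it assumes each embedding can be iterated as a flat sequence of numbers (the library's actual embedding type may need converting to a list first).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors given as sequences of numbers."""
    a = [float(x) for x in a]
    b = [float(x) for x in b]
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; treat its similarity as 0.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_method(query, candidates):
    """Return the name in `candidates` (name -> vector) most similar to `query`."""
    return max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))
```

Here `query` could be the embedding of a Kotlin method and `candidates` the dictionary returned by methods_embeddings for a Java file, both produced by the kt_java model.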

Training

Download astminer and run:

./gradlew shadowJar

Mine projects for paths:

python training/mine_projects.py <data folder> <output folder> <path to astminer's cli.sh>

Combine mined paths:

python training/astminer_to_code2seq.py <data folder/holdout> <output folder> <holdout>

Build the vocabulary with build_vocabulary.py from the code2seq module.

Combine vocabularies:

python training/combine_vocabularies.py

Expand weights:

python training/expand_weights.py
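
The training steps above can be collected into one ordered list of commands. This is a rough sketch under stated assumptions: `training_commands` and its parameter names are hypothetical, the vocabulary-building step is left out because no command line is given for it, and the set of holdout names is supplied by the caller.

```python
def training_commands(data_folder, mined_folder, astminer_cli,
                      combine_data_folder, combined_folder, holdouts):
    """Build the documented training commands, in order, as argument lists.

    The parameters mirror the placeholders in the commands above; how the
    folders relate to each other is left to the caller.
    """
    # Mine projects for paths.
    commands = [
        ["python", "training/mine_projects.py", data_folder, mined_folder, astminer_cli],
    ]
    # Combine mined paths, once per holdout.
    for holdout in holdouts:
        commands.append([
            "python", "training/astminer_to_code2seq.py",
            f"{combine_data_folder}/{holdout}", combined_folder, holdout,
        ])
    # build_vocabulary.py from the code2seq module would run here (not modeled).
    commands.append(["python", "training/combine_vocabularies.py"])
    commands.append(["python", "training/expand_weights.py"])
    return commands
```

Each returned list can be passed to subprocess.run(command, check=True) so that the pipeline stops at the first failing step.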

Using speedy-antlr-tool

You can use speedy-antlr to speed up file parsing.

Clone and install the modified example.

Replace the parser call with:

stream = antlr4.FileStream(input_file)
tree = sa_kotlin.parse(stream, "kotlinFile", sa_kotlin.SA_ErrorListener())

You still need the lexer to recover token values, though.

Note that to build a Java parser, you will need to follow the speedy-antlr tutorial and create another package.

Using astminer to parse files

Clone the astminer fork with Kotlin support and run:

./gradlew shadowJar

Extract methods with cli.sh; its arguments and usage can be found in training/mine_projects.py.

Pass the path to the folder containing the CSV files to run_model_on_astminer_csv().
