Inference and training for multiple languages of code2seq

pycode2seq

Pure Python library for code2seq embeddings.

Supports extending existing pretrained code2seq embeddings to multilingual models. We provide an example that extends the Java model with Kotlin. The pretrained model and a usage example are given below.

Installation

pip install pycode2seq

Inference

File embeddings example

from pycode2seq import Code2Seq

model = Code2Seq.load("kt_java")
method_embeddings = model.methods_embeddings("File.kt")

The pretrained common Java and Kotlin model is downloaded automatically.

Full functionality

import sys
from pycode2seq import Code2Seq

def main(argv):
    model = Code2Seq.load("kt_java")

    # Dictionary of method names with their embeddings
    method_embeddings = model.methods_embeddings("File.kt", "kt")

    # code2seq predictions
    predictions = model.run_on_file(argv[1], "kt")

    # Predicted method names
    names = [model.prediction_to_text(prediction) for prediction in predictions]

if __name__ == "__main__":
    main(sys.argv)
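
The example above passes the language identifier ("kt") by hand. When working with a mixed Java and Kotlin project, it may be convenient to derive it from the file extension. The following is a minimal sketch: the helper names language_for and supported_files are hypothetical, and the "java" identifier for Java sources is an assumption (only "kt" appears in the example above).

```python
from pathlib import Path

# Assumed mapping from file extension to the language identifier passed as the
# second argument of methods_embeddings() / run_on_file(). Only "kt" is shown
# in the example above; "java" is a guess.
LANGUAGE_BY_SUFFIX = {".kt": "kt", ".java": "java"}


def language_for(path):
    """Return the language identifier for a source file, or None if unsupported."""
    return LANGUAGE_BY_SUFFIX.get(Path(path).suffix.lower())


def supported_files(paths):
    """Pair each supported file with its language identifier, skipping the rest."""
    pairs = []
    for path in paths:
        language = language_for(path)
        if language is not None:
            pairs.append((str(path), language))
    return pairs
```

Each (path, language) pair could then be passed to model.methods_embeddings(path, language).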

Available models

  • Java (java)
  • Kotlin (kt or kotlin)
  • Java & Kotlin (kt_java)

The kt_java model is compatible with the java model and should produce the same embeddings. The kotlin model is part of the kt_java model, so they are compatible too.

This means you can use the common kt_java model to get embeddings for both languages in a single vector space.
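
A shared vector space makes it possible to compare methods across the two languages directly. The sketch below finds the Java method closest to a Kotlin method by cosine similarity. It assumes each embedding can be converted to a flat list of floats; the vectors and method names shown are stand-in data, and cosine_similarity and most_similar are illustrative helpers, not part of pycode2seq.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)


def most_similar(query_vector, candidates):
    """Return (name, similarity) for the candidate closest to query_vector."""
    scored = (
        (name, cosine_similarity(query_vector, vector))
        for name, vector in candidates.items()
    )
    return max(scored, key=lambda pair: pair[1])


# Stand-in data: in practice these dictionaries would come from
# model.methods_embeddings(...) called on a Java file and on a Kotlin file.
java_methods = {"getName": [1.0, 0.0, 0.0], "parseInput": [0.0, 1.0, 0.0]}
kotlin_methods = {"name": [0.9, 0.1, 0.0]}

best_name, best_score = most_similar(kotlin_methods["name"], java_methods)
```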

Training

Download astminer and run:

./gradlew shadowJar

Mine projects for paths:

python training/mine_projects.py <data folder> <output folder> <path to astminer's cli.sh>

Combine mined paths:

python training/astminer_to_code2seq.py <data folder/holdout> <output folder> <holdout>

Build the vocabulary with build_vocabulary.py from the code2seq module.

Combine vocabularies:

python training/combine_vocabularies.py

Expand weights:

python training/expand_weights.py
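
The scripted steps above can be collected into one ordered list of commands. This is only a sketch under stated assumptions: the function name training_commands and its parameters are hypothetical, the mined paths are assumed to sit in one subfolder per holdout, and the holdout names are supplied by the caller. The build_vocabulary.py step is omitted because its arguments are not listed here.

```python
def training_commands(data_dir, mined_dir, dataset_dir, astminer_cli, holdouts):
    """Build the command lines for the scripted steps, in order, without running them."""
    commands = [
        # Mine projects for paths.
        ["python", "training/mine_projects.py", data_dir, mined_dir, astminer_cli],
    ]
    for holdout in holdouts:
        # Combine mined paths for each holdout. Assumption: the mined paths
        # live in <mined_dir>/<holdout>.
        commands.append(
            [
                "python",
                "training/astminer_to_code2seq.py",
                f"{mined_dir}/{holdout}",
                dataset_dir,
                holdout,
            ]
        )
    # The vocabulary would be built at this point with build_vocabulary.py
    # from the code2seq module; it is left out of this sketch.
    commands.append(["python", "training/combine_vocabularies.py"])
    commands.append(["python", "training/expand_weights.py"])
    return commands
```

Each entry can be passed to subprocess.run(command, check=True) so that the sequence stops at the first failing step.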

Using speedy-antlr-tool

You can use speedy-antlr-tool to speed up file parsing.

Clone and install the modified example.

Replace parser call with:

stream = antlr4.FileStream(input_file)
tree = sa_kotlin.parse(stream, "kotlinFile", sa_kotlin.SA_ErrorListener())

You still need the lexer to recover token values, though.

Note that to build a Java parser you will need to follow the speedy-antlr tutorial and make another package.

Using astminer to parse files

Clone the astminer fork with Kotlin support and run:

./gradlew shadowJar

Extract methods with cli.sh; its arguments and usage can be found in training/mine_projects.py.

Pass the path to the folder with the CSV files to run_model_on_astminer_csv().

Download files

  • Source distribution: pycode2seq-0.0.6.tar.gz (164.3 kB)
  • Built distribution: pycode2seq-0.0.6-py3-none-any.whl (177.6 kB, Python 3)
