Inference and training for multiple languages of code2seq
Project description
pycode2seq
Pure Python library for code2seq
embeddings.
Support extension of existing pretrained code2seq embeddings to multilingual models. We provided an example of the Java model extension with Kotlin. Pretrained model and its usage example provided below.
Installation
pip install pycode2seq
Inference
File embeddings example
from pycode2seq import Code2Seq
model = Code2Seq.load("kt_java")
method_embeddings = model.methods_embeddings("File.kt")
Pretrained Java and Kotlin common model will be downloaded automatically.
Full functionality
import sys
from pycode2seq import Code2Seq
def main(argv):
model = Code2Seq.load("kt_java")
# Dictionary of method names with their embeddings
method_embeddings = model.methods_embeddings("File.kt", "kt")
#Code2seq predictions
predictions = model.run_on_file(argv[1], "kt")
#Predicted method names
names = [model.prediction_to_text(prediction) for prediction in predictions]
if __name__ == "__main__":
main(sys.argv)
Available models
- Java (
java
) - Kotlin (
kt
orkotlin
) - Java & Kotlin (
kt_java
)
kt_java
is compatible with java
model and should have the same embeddings.
kotlin
model is a part of kt_java
model, so they are compatible too.
So you can use the common kt_java
model and get embeddings in one vector space for both languages.
Training
Download astminer and run:
./gradelw shadowJar
Mine projects for paths:
python training/mine_projects.py <data folder> <output folder> <path to astminer's cli.sh>
Combine mined paths:
python training/astminer_to_code2seq.py <data folder/holdout> <output folder> <holdout>
Build vocabulary with build_vocabulary.py from code2seq module
Combine vocabularies:
python training/combine_vocabularies.py
Expand weights:
python training/expand_weights.py
Using speedy-antlr-tool
You can use speedy-antlr to speed up file parsing speed.
Clone and install modified example.
Replace parser call with:
stream = antlr4.FileStream(input_file)
tree = sa_kotlin.parse(stream, "kotlinFile", sa_kotlin.SA_ErrorListener())
You still need lexer to recover token values, though.
Note, that to make Java parser you will need to follow speedy-antlr tutorial and make another package.
Using astminer to parse files
Clone astminer fork with kotlin support and run
./gradlew shadowJar
Extract methods with cli.sh
arguments and usage can be found in training/mine_projects.py
.
Pass path to folder with csvs to run_model_on_astminer_csv()
.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for pycode2seq-0.0.6-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | dca1d7ee15555e2fbaf08427905ad654296a6780968126b27bd6ac890d336d8f |
|
MD5 | c8db2af9b225de346d20df757f1843bb |
|
BLAKE2b-256 | 7a364b8a8d726bbdc75ed0bd069e02cfb57c5da6e801a1a6610e3c4b051c8973 |