Inference and training for multiple languages of code2seq
pycode2seq
Pure Python library for code2seq embeddings.
Supports extending existing pretrained code2seq embeddings to multilingual models. As an example, we extended the Java model with Kotlin. The pretrained model and a usage example are provided below.
Installation
pip install pycode2seq
Inference
File embeddings example
from pycode2seq import Code2Seq
model = Code2Seq.load("kt_java")
method_embeddings = model.methods_embeddings("File.kt")
The pretrained common Java and Kotlin model is downloaded automatically.
Full functionality
import sys
from pycode2seq import Code2Seq
def main(argv):
    model = Code2Seq.load("kt_java")
    # Dictionary of method names with their embeddings
    method_embeddings = model.methods_embeddings("File.kt", "kt")
    # code2seq predictions
    predictions = model.run_on_file(argv[1], "kt")
    # Predicted method names
    names = [model.prediction_to_text(prediction) for prediction in predictions]

if __name__ == "__main__":
    main(sys.argv)
Available models
- Java (java)
- Kotlin (kt or kotlin)
- Java & Kotlin (kt_java)
The kt_java model is compatible with the java model and should have the same embeddings.
The kotlin model is a part of the kt_java model, so the two are compatible as well.
You can therefore use the common kt_java model to get embeddings in one vector space for both languages.
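Because both languages share one vector space, embeddings of Kotlin and Java methods can be compared directly. The following is a minimal sketch of such a comparison using cosine similarity. It assumes the embeddings are available as a dictionary mapping method names to sequences of floats; the toy vectors, the method names, and the helper functions `cosine_similarity` and `nearest_methods` are hypothetical and not part of pycode2seq.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_methods(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank candidate methods by cosine similarity to the query vector."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy stand-ins for embeddings of a Kotlin file and a Java file (assumed shape).
kotlin_embeddings = {"parseConfig": [0.9, 0.1, 0.0]}
java_embeddings = {
    "readSettings": [0.8, 0.2, 0.1],
    "sendRequest": [0.0, 0.1, 0.9],
}

ranked = nearest_methods(kotlin_embeddings["parseConfig"], java_embeddings)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

With real data, the two dictionaries would presumably come from the model's method embeddings for a Kotlin file and a Java file, both obtained from the same kt_java model.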
Training
Download astminer and run:
./gradlew shadowJar
Mine projects for paths:
python training/mine_projects.py <data folder> <output folder> <path to astminer's cli.sh>
Combine mined paths:
python training/astminer_to_code2seq.py <data folder/holdout> <output folder> <holdout>
Build the vocabulary with build_vocabulary.py from the code2seq module.
Combine vocabularies:
python training/combine_vocabularies.py
Expand weights:
python training/expand_weights.py
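The last two steps can be pictured as follows. This is only an illustrative sketch of the general idea, not the code in training/combine_vocabularies.py or training/expand_weights.py: it assumes a vocabulary is a token-to-index dictionary with contiguous indices starting at 0, and that an embedding table is a list of rows. The function names, the toy vocabularies, and the random initialisation of new rows are assumptions made for the example.

```python
import random
from typing import Dict, List


def combine_vocabularies(base: Dict[str, int], extra: Dict[str, int]) -> Dict[str, int]:
    """Merge two token-to-index maps, keeping every index of the base map unchanged."""
    combined = dict(base)
    next_index = max(base.values(), default=-1) + 1
    # Visit new tokens in their original index order so the result is deterministic.
    for token in sorted(extra, key=lambda t: extra[t]):
        if token not in combined:
            combined[token] = next_index
            next_index += 1
    return combined


def expand_weights(
    weights: List[List[float]],
    new_size: int,
    scale: float = 0.1,
    seed: int = 0,
) -> List[List[float]]:
    """Copy the existing rows and append randomly initialised rows up to new_size."""
    if not weights:
        raise ValueError("weights must contain at least one row")
    if new_size < len(weights):
        raise ValueError("new_size must not be smaller than the current table")
    dim = len(weights[0])
    rng = random.Random(seed)
    expanded = [list(row) for row in weights]
    for _ in range(new_size - len(weights)):
        expanded.append([rng.uniform(-scale, scale) for _ in range(dim)])
    return expanded


# Hypothetical toy vocabularies and a 2-dimensional embedding table.
java_vocab = {"<PAD>": 0, "<UNK>": 1, "get": 2}
kotlin_vocab = {"<PAD>": 0, "<UNK>": 1, "get": 2, "fun": 3, "val": 4}
java_weights = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]

combined_vocab = combine_vocabularies(java_vocab, kotlin_vocab)
expanded_weights = expand_weights(java_weights, len(combined_vocab))
print(combined_vocab)
print(len(expanded_weights), "rows")
```

Keeping the original indices and rows untouched is what would let an extended model stay compatible with the base one; only the appended rows would need to be learned during further training.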
Using speedy-antlr-tool
You can use speedy-antlr to speed up file parsing.
Clone and install the modified example.
Replace the parser call with:
stream = antlr4.FileStream(input_file)
tree = sa_kotlin.parse(stream, "kotlinFile", sa_kotlin.SA_ErrorListener())
You still need the lexer to recover token values, though.
Note that to build a Java parser you will need to follow the speedy-antlr tutorial and create another package.
Using astminer to parse files
Clone the astminer fork with Kotlin support and run:
./gradlew shadowJar
Extract methods with cli.sh; its arguments and usage can be found in training/mine_projects.py.
Pass the path to the folder with the CSV files to run_model_on_astminer_csv().