Multilingual programming language parsers for extracting multiple levels of code-text pair data from raw source code
Project description
The Code-Text data toolkit contains multilingual programming language parsers that extract multiple levels of code-text pair data (e.g., function-level, class-level, inline-level) from raw source code.
Installation
Set up the environment and install dependencies using install_env.sh:
bash -i ./install_env.sh
then activate the conda environment named "code-text-env":
conda activate code-text-env
Setup for using the parser
pip install codetext
Getting started
Build your language
Automatically build the tree-sitter grammar into <language>.so, stored in /tree-sitter/:
from codetext.utils import build_language
language = 'rust'
build_language(language)
# INFO:utils:Not found tree-sitter-rust, attempt clone from github
# Cloning into 'tree-sitter-rust'...
# remote: Enumerating objects: 2835, done. ...
# INFO:utils:Attempt to build Tree-sitter Language for rust and store in .../tree-sitter/rust.so
Language Parser
We support 10 programming languages, namely Python, Java, JavaScript, Golang, Ruby, PHP, C#, C++, C and Rust.
Setup
from codetext.utils import parse_code
raw_code = """
/**
* Sum of 2 number
* @param a int number
* @param b int number
*/
double sum2num(int a, int b) {
return a + b;
}
"""
root = parse_code(raw_code, 'cpp')
root_node = root.root_node
To get all function nodes inside a specific node, use:
from codetext.utils.parser import CppParser
function_list = CppParser.get_function_list(root_node)
print(function_list)
# [<Node type=function_definition, start_point=(6, 0), end_point=(8, 1)>]
Get function metadata (e.g. the function's name, parameters and (optional) return type):
function = function_list[0]
metadata = CppParser.get_function_metadata(function, raw_code)
# {'identifier': 'sum2num', 'parameters': {'a': 'int', 'b': 'int'}, 'type': 'double'}
Get the docstring (documentation) of a function:
docstring = CppParser.get_docstring(function, raw_code)
# ['Sum of 2 number \n@param a int number \n@param b int number']
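The pieces extracted above are what make up a function-level code-text pair. A minimal sketch of assembling one, using plain dictionaries and the sample values shown in the outputs above (the pair layout here is our own illustration, not a codetext API):

```python
# Sample values as returned by CppParser.get_function_metadata and
# CppParser.get_docstring in the snippets above.
metadata = {'identifier': 'sum2num',
            'parameters': {'a': 'int', 'b': 'int'},
            'type': 'double'}
docstring = ['Sum of 2 number \n@param a int number \n@param b int number']

# A function-level (code, text) pair; the field names are illustrative.
pair = {
    'name': metadata['identifier'],
    'code': 'double sum2num(int a, int b) {\n    return a + b;\n}',
    'docstring': ' '.join(docstring),  # the "text" side of the pair
}
```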
We also provide two methods for extracting class objects:
class_list = CppParser.get_class_list(root_node)
# and
metadata = CppParser.get_metadata_list(root_node)
Data collection and Preprocessing
The dataset we extracted from was collected by codeparrot. They host the raw dataset at codeparrot/github-code.
You can create your own dataset using Google BigQuery and the query here
Getting started
Process custom dataset
To start preprocessing data, define a .yaml file that declares the raw data format. (More detail: /data/format/README.md)
python -m codetext.processing
<DATASET_PATH>
--save_path <SAVE_PATH> # path to save dir
--load_from_file # load from file instead of from the dataset cache
--language Python # or Java, JavaScript, ...
--data_format './data/format/codeparot-format.yaml' # load raw data format
--n_split 20 # split original dataset into N subset
--n_core -1 # number of processes (default 1; -1 uses all cores)
NOTES: the <DATASET_PATH> dir must contain raw data stored with the .jsonl extension if you pass the --load_from_file argument, or contain a huggingface dataset cache otherwise
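When using --load_from_file, each line of a .jsonl file is one JSON record. A minimal sketch of writing such a file with the standard library (the field names here follow the codeparrot/github-code layout and are an assumption; they must match whatever your data_format .yaml declares):

```python
import json

# Hypothetical raw records; adjust the field names to your .yaml format file.
samples = [
    {'repo_name': 'octocat/hello', 'path': 'hello.py',
     'language': 'Python', 'code': 'def hello():\n    return "world"'},
]

# Write one JSON object per line (.jsonl).
with open('raw_data.jsonl', 'w') as f:
    for sample in samples:
        f.write(json.dumps(sample) + '\n')

# Each line round-trips back to the original record.
with open('raw_data.jsonl') as f:
    records = [json.loads(line) for line in f]
```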
Analyse and split dataset
The processing step saves cleaned samples in batches; you can merge them using postprocess.py. We also provide an analysis tool for getting the total number of samples, blank lines(*), comments(*) and code lines(*). You can also split your dataset into train, valid and test sets.
python -m codetext.postprocessing
<DATASET_PATH> # path to dir contains /extracted, /filered, /raw
--save_path <SAVE_PATH> # path to save final output
--n_core 10 # number of cores for the multiprocessing analyzer
--analyze # Analyze trigger
--split # Split train/test/valid trigger
--ratio 0.05 # Test and valid ratio (default to equal)
--max_sample 20000 # Max size of test set and valid set
NOTES: (*) We run cloc under the hood to count blank lines, comments and code. See more at github.com/AlDanial/cloc
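The --split, --ratio and --max_sample options above give the test and valid sets ratio * N samples each, capped at max_sample. A rough stdlib sketch of that behavior (an illustration of the splitting logic, not the postprocessing code itself):

```python
import random

def split_dataset(samples, ratio=0.05, max_sample=20000, seed=0):
    """Split into train/valid/test; valid and test each receive
    min(ratio * len(samples), max_sample) samples."""
    samples = samples[:]
    random.Random(seed).shuffle(samples)
    n_eval = min(int(len(samples) * ratio), max_sample)
    test = samples[:n_eval]
    valid = samples[n_eval:2 * n_eval]
    train = samples[2 * n_eval:]
    return train, valid, test

train, valid, test = split_dataset(list(range(1000)), ratio=0.05)
print(len(train), len(valid), len(test))  # 900 50 50
```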
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file codetext-0.0.4.tar.gz.
File metadata
- Download URL: codetext-0.0.4.tar.gz
- Upload date:
- Size: 36.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.10.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 136759a785988ca48ac1ee98fac02d9ab6a282f189d43b4465a5f436e7fcdfe3 |
| MD5 | f4a9d03b24aa08df71afd2f30552334c |
| BLAKE2b-256 | 2446670d4732cad5bd060890682872b924475217d3533550eed435f3b4917e5e |
File details
Details for the file codetext-0.0.4-py3-none-any.whl.
File metadata
- Download URL: codetext-0.0.4-py3-none-any.whl
- Upload date:
- Size: 47.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.10.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7e8124e5f8c4bfa250be30e3188f7fbc4383003ee6744f423a57c53c3f94fcc9 |
| MD5 | 8c92805d56251ae5d8138046c08adb80 |
| BLAKE2b-256 | 9680eff9b4eaaa74173c0afda4c19259bdd816d80be2b129316dc41293382e06 |