Multilingual programming language parsers that extract multiple levels of code-text pair data from raw source code

Code-text data toolkit

This repo contains multilingual programming language parsers that extract code-text pair data from raw source code at multiple levels (e.g., function level, class level, inline level).

Installation

Install the dependencies and set up the environment with install_env.sh:

bash -i ./install_env.sh

Then activate the conda environment named "code-text-env":

conda activate code-text-env

Getting started

Build your language

Automatically build the tree-sitter grammar for a language into <language>.so, located in /tree-sitter/:

from src.utils import build_language

language = 'rust'
build_language(language)


# INFO:utils:Not found tree-sitter-rust, attempt clone from github
# Cloning into 'tree-sitter-rust'...
# remote: Enumerating objects: 2835, done. ...
# INFO:utils:Attempt to build Tree-sitter Language for rust and store in .../tree-sitter/rust.so

Language Parser

We support 9 programming languages: Python, Java, JavaScript, Golang, Ruby, PHP, C#, C++ and C.

Setup

from tree_sitter import Parser, Language

raw_code = """
/**
* Sum of 2 number
* @param a int number
* @param b int number
*/
double sum2num(int a, int b) {
    return a + b;
}
"""

parser = Parser()
language = Language("/tree-sitter/cpp.so", 'cpp')
parser.set_language(language)
tree = parser.parse(bytes(raw_code, 'utf8'))
root_node = tree.root_node  # parse() returns a Tree; the parser helpers take its root node

To get all function nodes inside a specific node, use:

from src.utils.parser import CppParser

function_list = CppParser.get_function_list(root_node)
print(function_list)

# [<Node type=function_definition, start_point=(6, 0), end_point=(8, 1)>]
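For intuition, collecting function nodes amounts to a depth-first walk over the syntax tree. The sketch below is not the CppParser implementation: `collect_nodes_by_type` and the `StubNode` stand-in are hypothetical names, and the only assumption about real nodes is that they expose `type` and `children`, as tree-sitter nodes do.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StubNode:
    """Hypothetical stand-in for a tree-sitter node (only `type` and `children`)."""
    type: str
    children: List["StubNode"] = field(default_factory=list)


def collect_nodes_by_type(node, wanted_type):
    """Return every node of `wanted_type` at or below `node`, in document order."""
    found = []
    stack = [node]
    while stack:
        current = stack.pop()
        if current.type == wanted_type:
            found.append(current)
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(current.children))
    return found


# Tiny hand-built tree: a comment, a top-level function, and a function
# nested inside a namespace.
tree = StubNode("translation_unit", [
    StubNode("comment"),
    StubNode("function_definition"),
    StubNode("namespace_definition", [StubNode("function_definition")]),
])
functions = collect_nodes_by_type(tree, "function_definition")
print(len(functions))
```

Because the helper only reads `type` and `children`, the same walk should work on a real `root_node` with a node type such as "function_definition".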

Get a function's metadata (its name, its parameters and, optionally, its return type):

function = function_list[0]

metadata = CppParser.get_function_metadata(function, raw_code)

# {'identifier': 'sum2num', 'parameters': {'a': 'int', 'b': 'int'}, 'type': 'double'}
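To illustrate the shape of that dictionary, the sketch below rebuilds a C-style signature from it. `format_signature` is a hypothetical helper, not part of the toolkit, and it assumes the keys shown in the output above ('identifier', 'parameters', and an optional 'type').

```python
def format_signature(metadata):
    """Render a C-style signature from a metadata dict shaped like the one above."""
    params = ", ".join(
        f"{param_type} {name}"
        for name, param_type in metadata["parameters"].items()
    )
    return_type = metadata.get("type")  # the return type is optional
    prefix = f"{return_type} " if return_type else ""
    return f"{prefix}{metadata['identifier']}({params})"


metadata = {
    "identifier": "sum2num",
    "parameters": {"a": "int", "b": "int"},
    "type": "double",
}
signature = format_signature(metadata)
print(signature)  # double sum2num(int a, int b)
```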

Get docstring (documentation) of a function

docstring = CppParser.get_docstring(function, raw_code)

# ['Sum of 2 number \n@param a int number \n@param b int number']
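A docstring in this form can be split further into a summary and per-parameter descriptions, which is one way to reach finer-grained code-text pairs. The sketch below is an illustration only: `split_docstring` is a hypothetical helper, it assumes Javadoc-style `@param` tags as in the sample, and any other tag is simply kept in the summary.

```python
import re


def split_docstring(docstring):
    """Split a Javadoc-style docstring into (summary, {param: description})."""
    summary_lines = []
    params = {}
    for line in docstring.splitlines():
        line = line.strip()
        match = re.match(r"@param\s+(\w+)\s*(.*)", line)
        if match:
            # group(1) is the parameter name, group(2) its description.
            params[match.group(1)] = match.group(2)
        elif line:
            summary_lines.append(line)
    return " ".join(summary_lines), params


docstring = "Sum of 2 number \n@param a int number \n@param b int number"
summary, params = split_docstring(docstring)
print(summary)  # Sum of 2 number
print(params)   # {'a': 'int number', 'b': 'int number'}
```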

We also provide two methods for extracting class objects:

class_list = CppParser.get_class_list(root_node)
# and
metadata = CppParser.get_metadata_list(root_node)

Data collection and Preprocessing

The raw dataset we extract from was collected by codeparrot, who host it at codeparrot/github-code.

You can also create your own dataset using Google BigQuery and the query here.

Getting started

To start preprocessing data, define a .yaml file that declares the raw data format (more detail: /data/format/README.md), then run:

python -m src.processing <DATASET_PATH>
--save_path <SAVE_PATH>  # directory to save the output

--load_from_file  # load from file instead of from the dataset cache
--language Python  # or Java, JavaScript, ...
--data_format './data/format/codeparot-format.yaml'  # raw data format declaration

--n_split 20  # split the original dataset into N subsets
--n_core -1  # number of CPU cores to use (default -1: use all cores)

NOTE: if you pass --load_from_file, the <DATASET_PATH> directory must contain the raw data stored as .jsonl files; otherwise it must contain a Hugging Face dataset.
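The flags above can be pictured with a small sketch: reading a .jsonl file record by record, splitting the records into N subsets, and resolving -1 to "all cores". This is not the toolkit's code; `read_jsonl`, `split_evenly` and `resolve_n_core` are hypothetical helpers, and the contiguous, near-equal split is an assumption about what --n_split does.

```python
import json
import os


def read_jsonl(path):
    """Yield one dict per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def split_evenly(items, n_split):
    """Split `items` into `n_split` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_split)
    chunks, start = [], 0
    for index in range(n_split):
        # The first `extra` chunks take one additional item.
        size = base + (1 if index < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def resolve_n_core(n_core):
    """Treat -1 as 'use every available core', mirroring the flag description."""
    return (os.cpu_count() or 1) if n_core == -1 else n_core


chunks = split_evenly(list(range(10)), 3)
print([len(chunk) for chunk in chunks])  # [4, 3, 3]
```

Each chunk could then be handed to a separate worker process, with `resolve_n_core(-1)` giving the pool size.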

Release history

0.0.8.1: Oct 11, 2023
0.0.8: Aug 17, 2023
0.0.7: Jul 5, 2023
0.0.6: Feb 9, 2023
0.0.5: Jan 9, 2023
0.0.4: Dec 2, 2022
0.0.3: Dec 2, 2022
0.0.2: Nov 25, 2022
0.0.1 (this version): Nov 9, 2022

Download files

Source Distribution

codetext-0.0.1.tar.gz (28.4 kB)

Uploaded Nov 9, 2022 Source

Built Distribution

codetext-0.0.1-py3-none-any.whl (41.3 kB)

Uploaded Nov 9, 2022 Python 3

Hashes for codetext-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`02d0c472010c89b4178fec497ca657ba5f2592e6b5124335b506a641ac67ed51`
MD5	`b83a5bafb07c92037ff415002822c4fe`
BLAKE2b-256	`7ba5892b117b4b50f638bb66aa2bbfa4c1c16edfb0584d454b7861f9def5d521`

Hashes for codetext-0.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1b0d2303659a99a4fc033ae77fc5c4ba37e03f78b9a425080e63464b475c0f62`
MD5	`681bd257c3e93ebdc90e6c7c6d3cdc53`
BLAKE2b-256	`566b0fefb704250ac093e8466c39fbeb18f98322a0e46c029861bf9c0519cd87`