Skip to main content

Multilingual programming language parsers for the extract from raw source code into multiple levels of pair data

Project description

logo

CodeText-parser


Branch Build Unittest Linting Release License
main Unittest release pyversion license

Code-Text data toolkit contains multilingual programming language parsers for the extract from raw source code into multiple levels of pair data (code-text) (e.g., function-level, class-level, inline-level).

Installation

Setup environment and install dependencies and setup by using install_env.sh

bash -i ./install_env.sh

then activate conda environment named "code-text-env"

conda activate code-text-env

Setup for using parser

pip install codetext

Getting started

Build your language

Auto build tree-sitter into <language>.so located in /tree-sitter/

from codetext.utils import build_language

language = 'rust'
build_language(language)


# INFO:utils:Not found tree-sitter-rust, attempt clone from github
# Cloning into 'tree-sitter-rust'...
# remote: Enumerating objects: 2835, done. ...
# INFO:utils:Attempt to build Tree-sitter Language for rust and store in .../tree-sitter/rust.so

Language Parser

We supported 10 programming languages, namely Python, Java, JavaScript, Golang, Ruby, PHP, C#, C++, C and Rust.

Setup

from codetext.utils import parse_code

raw_code = """
/**
* Sum of 2 number
* @param a int number
* @param b int number
*/
double sum2num(int a, int b) {
    return a + b;
}
"""

root = parse_code(raw_code, 'cpp')
root_node = root.root_node

Get all function nodes inside a specific node, use:

from codetext.utils.parser import CppParser

function_list = CppParser.get_function_list(root_node)
print(function_list)

# [<Node type=function_definition, start_point=(6, 0), end_point=(8, 1)>]

Get function metadata (e.g. function's name, parameters, (optional) return type)

function = function_list[0]

metadata = CppParser.get_function_metadata(function, raw_code)

# {'identifier': 'sum2num', 'parameters': {'a': 'int', 'b': 'int'}, 'type': 'double'}

Get docstring (documentation) of a function

docstring = CppParser.get_docstring(function, code_sample)

# ['Sum of 2 number \n@param a int number \n@param b int number']

We also provide 2 command for extract class object

class_list = CppParser.get_class_list(root_node)
# and
metadata = CppParser.get_metadata_list(root_node)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

codetext-0.0.6-1.tar.gz (22.7 kB view hashes)

Uploaded Source

Built Distribution

codetext-0.0.6-1-py3-none-any.whl (30.1 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page