
Multilingual programming language parsers for extracting multiple levels of paired (code-text) data from raw source code

Project description


Code-Text data toolkit



The Code-Text data toolkit contains multilingual programming language parsers that extract paired (code-text) data from raw source code at multiple levels of granularity (e.g., function-level, class-level, inline-level).

Installation

Set up the environment and install dependencies using install_env.sh:

bash -i ./install_env.sh

Then activate the conda environment named "code-text-env":

conda activate code-text-env

To install just the parser:

pip install codetext

Getting started

Build your language

Automatically build a tree-sitter grammar into <language>.so, stored under /tree-sitter/:

from codetext.utils import build_language

language = 'rust'
build_language(language)


# INFO:utils:Not found tree-sitter-rust, attempt clone from github
# Cloning into 'tree-sitter-rust'...
# remote: Enumerating objects: 2835, done. ...
# INFO:utils:Attempt to build Tree-sitter Language for rust and store in .../tree-sitter/rust.so

Language Parser

We support 10 programming languages, namely Python, Java, JavaScript, Golang, Ruby, PHP, C#, C++, C and Rust.

Setup

from codetext.utils import parse_code

raw_code = """
/**
* Sum of 2 number
* @param a int number
* @param b int number
*/
double sum2num(int a, int b) {
    return a + b;
}
"""

root = parse_code(raw_code, 'cpp')
root_node = root.root_node
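Once you have the root node, you can walk the syntax tree yourself. Below is a minimal, hypothetical sketch of a depth-first traversal; it uses a tiny stand-in node class so it runs standalone, but real tree-sitter nodes expose the same `type` and `children` attributes, so `walk` works on them unchanged.

```python
from dataclasses import dataclass, field

@dataclass
class FakeNode:
    # Stand-in for a tree-sitter Node: real nodes also expose .type and .children
    type: str
    children: list = field(default_factory=list)

def walk(node, depth=0):
    """Yield (depth, node_type) pairs in depth-first order."""
    yield depth, node.type
    for child in node.children:
        yield from walk(child, depth + 1)

tree = FakeNode("translation_unit", [
    FakeNode("function_definition", [FakeNode("compound_statement")]),
])
for depth, node_type in walk(tree):
    print("  " * depth + node_type)
# translation_unit
#   function_definition
#     compound_statement
```

Passing `root_node` from the snippet above to `walk` would print the full C++ syntax tree the same way.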

To get all function nodes inside a specific node, use:

from codetext.utils.parser import CppParser

function_list = CppParser.get_function_list(root_node)
print(function_list)

# [<Node type=function_definition, start_point=(6, 0), end_point=(8, 1)>]

Get function metadata (e.g. the function's name, parameters and (optional) return type):

function = function_list[0]

metadata = CppParser.get_function_metadata(function, raw_code)

# {'identifier': 'sum2num', 'parameters': {'a': 'int', 'b': 'int'}, 'type': 'double'}
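The metadata dict can be turned back into a readable signature. A small helper (hypothetical, not part of codetext) applied to the dict shown above:

```python
def format_signature(metadata: dict) -> str:
    """Render a C-style signature from a codetext-style metadata dict."""
    params = ", ".join(f"{t} {name}" for name, t in metadata["parameters"].items())
    return f"{metadata.get('type', 'void')} {metadata['identifier']}({params})"

metadata = {"identifier": "sum2num", "parameters": {"a": "int", "b": "int"}, "type": "double"}
print(format_signature(metadata))
# double sum2num(int a, int b)
```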

Get the docstring (documentation) of a function:

docstring = CppParser.get_docstring(function, raw_code)

# ['Sum of 2 number \n@param a int number \n@param b int number']
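Combining the metadata and docstring yields one function-level code-text pair, the unit this toolkit is built around. A hypothetical sketch of assembling such a record (the field names here are illustrative, not codetext's output schema):

```python
# Values as extracted in the examples above
metadata = {"identifier": "sum2num", "parameters": {"a": "int", "b": "int"}, "type": "double"}
docstring = ["Sum of 2 number \n@param a int number \n@param b int number"]

pair = {
    "identifier": metadata["identifier"],
    "parameters": metadata["parameters"],
    "return_type": metadata.get("type"),
    # Join the extracted docstring fragments into one text field
    "docstring": " ".join(docstring),
}
print(pair["identifier"])
# sum2num
```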

We also provide two methods for extracting class objects:

class_list = CppParser.get_class_list(root_node)
# and
metadata = CppParser.get_metadata_list(root_node)

Data collection and Preprocessing

The dataset we extracted from was collected by codeparrot, who host the raw dataset at codeparrot/github-code.

You can also create your own dataset using Google BigQuery and the query here.

Getting started

Process custom dataset

To start preprocessing data, define a .yaml file that declares the raw data format. (More detail: /data/format/README.md)

python -m codetext.processing 
<DATASET_PATH>
--save_path <SAVE_PATH>  # path to save dir

--load_from_file  # load from file instead load from dataset cache
--language Python  # or Java, JavaScript, ...
--data_format './data/format/codeparot-format.yaml'  # load raw data format

--n_split 20  # split original dataset into N subset
--n_core -1  # number of multiple processor (default to 1) (-1 == using all core)

NOTES: the <DATASET_PATH> directory must contain raw data stored with a .jsonl extension if you pass --load_from_file, or contain a Hugging Face dataset cache otherwise.
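When using --load_from_file, each line of a .jsonl file is one JSON object per raw sample. A minimal illustration of that layout (the field names `repo_name`, `language` and `code` are assumptions for this sketch; the actual keys are whatever your .yaml format file declares):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical raw samples; real field names come from your .yaml format file
samples = [
    {"repo_name": "octo/demo", "language": "Python", "code": "def f():\n    return 1\n"},
    {"repo_name": "octo/demo", "language": "Python", "code": "def g():\n    return 2\n"},
]

path = Path(tempfile.mkdtemp()) / "raw_data.jsonl"
# JSON Lines: one serialized object per line
path.write_text("\n".join(json.dumps(s) for s in samples) + "\n")

loaded = [json.loads(line) for line in path.read_text().splitlines()]
print(len(loaded))
# 2
```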

Analyse and split dataset

The processing step saves cleaned samples in batches; you can merge them using postprocess.py. We also provide an analysis tool that reports the total number of samples, blank lines(*), comments(*) and code lines(*). You can also split your dataset into train, valid and test sets.

python -m codetext.postprocessing 
<DATASET_PATH>  # path to dir containing /extracted, /filtered, /raw
--save_path <SAVE_PATH>  # path to save final output

--n_core 10  # number of core for multiprocessing analyzer
--analyze  # Analyze trigger
--split  # Split train/test/valid trigger
--ratio 0.05  # Test and valid ratio (default to equal)
--max_sample 20000  # Max size of test set and valid set

NOTES: (*) We run cloc under the hood to count blank, comment and code lines. See more at github.com/AlDanial/cloc
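The --ratio and --max_sample flags interact: each held-out split gets roughly ratio * n samples, capped at max_sample. A hypothetical sketch of that arithmetic (the exact rounding codetext uses may differ):

```python
def split_sizes(n_total: int, ratio: float = 0.05, max_sample: int = 20_000):
    """Compute (train, valid, test) sizes: ratio per held-out split, capped at max_sample."""
    held_out = min(int(n_total * ratio), max_sample)
    return n_total - 2 * held_out, held_out, held_out

print(split_sizes(100_000))    # small dataset: the ratio determines split size
# (90000, 5000, 5000)
print(split_sizes(1_000_000))  # large dataset: the max_sample cap wins
# (960000, 20000, 20000)
```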

Project details


Download files

Download the file for your platform.

Source Distribution

codetext-0.0.4.tar.gz (36.9 kB)

Uploaded Source

Built Distribution


codetext-0.0.4-py3-none-any.whl (47.9 kB)

Uploaded Python 3

File details

Details for the file codetext-0.0.4.tar.gz.

File metadata

  • Download URL: codetext-0.0.4.tar.gz
  • Upload date:
  • Size: 36.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.10.6

File hashes

Hashes for codetext-0.0.4.tar.gz
Algorithm Hash digest
SHA256 136759a785988ca48ac1ee98fac02d9ab6a282f189d43b4465a5f436e7fcdfe3
MD5 f4a9d03b24aa08df71afd2f30552334c
BLAKE2b-256 2446670d4732cad5bd060890682872b924475217d3533550eed435f3b4917e5e


File details

Details for the file codetext-0.0.4-py3-none-any.whl.

File metadata

  • Download URL: codetext-0.0.4-py3-none-any.whl
  • Upload date:
  • Size: 47.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.10.6

File hashes

Hashes for codetext-0.0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 7e8124e5f8c4bfa250be30e3188f7fbc4383003ee6744f423a57c53c3f94fcc9
MD5 8c92805d56251ae5d8138046c08adb80
BLAKE2b-256 9680eff9b4eaaa74173c0afda4c19259bdd816d80be2b129316dc41293382e06

