Multilingual programming language parsers that extract multiple levels of code-text pair data from raw source code
Project description
Code-Text parser is a custom tree-sitter grammar parser that extracts class-level and function-level data from raw source code. We support 10 common programming languages:
- Python
- Java
- JavaScript
- PHP
- Ruby
- Rust
- C
- C++
- C#
- Go
Installation
The codetext package requires Python 3.7 or above and tree-sitter. Set up the environment and install the dependencies manually from source:
git clone https://github.com/FSoft-AI4Code/CodeText-parser.git; cd CodeText-parser
pip install -r requirement.txt
pip install -e .
Or install the package from PyPI:
pip install codetext
Getting started
codetext CLI Usage
codetext [options] [PATH or FILE] ...
For example, to extract every Python file in the src/ folder:
codetext src/ --language Python
To store the extracted classes and functions, use the --json flag and give the path to a destination file:
codetext src/ --language Python --output_file ./python_report.json --json
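When the CLI is driven from a script, the same invocation can be assembled programmatically. The snippet below is a minimal sketch and not part of codetext: `build_codetext_command` is a hypothetical helper that only uses the flags documented in this README, and it assumes the `codetext` executable is available on PATH.

```python
from typing import List, Optional


def build_codetext_command(
    paths: List[str],
    language: str,
    output_file: Optional[str] = None,
    verbose: bool = False,
) -> List[str]:
    """Assemble an argument list for the codetext CLI (hypothetical helper)."""
    command = ["codetext", *paths, "--language", language]
    if output_file is not None:
        # --output_file names the destination; --json requests JSON output.
        command += ["--output_file", output_file, "--json"]
    if verbose:
        command.append("--verbose")
    return command


cmd = build_codetext_command(["src/"], "Python", "./python_report.json")
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once codetext is installed; keeping the arguments as a list avoids shell quoting issues with paths that contain spaces.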
Options
positional arguments:
  paths                 list of the filename/paths.
optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -l LANGUAGE, --language LANGUAGE
                        Target the programming languages you want to analyze.
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Output file (e.g report.json).
  --json                Generate json output as a transform of the default
                        output
  --verbose             Print progress bar
Example
File circle_linkedlist.py analyzed:
==================================================
Number of class : 1
Number of function : 2
--------------------------------------------------
Class summary:
+-----+---------+-------------+
| # | Class | Arguments |
+=====+=========+=============+
| 0 | Node | |
+-----+---------+-------------+
Class analyse: Node
+-----+---------------+-------------+--------+---------------+
| # | Method name | Paramters | Type | Return type |
+=====+===============+=============+========+===============+
| 0 | __init__ | self | | |
| | | data | | |
+-----+---------------+-------------+--------+---------------+
Function analyse:
+-----+-----------------+-------------+--------+---------------+
| # | Function name | Paramters | Type | Return type |
+=====+=================+=============+========+===============+
| 0 | push | head_ref | | Node |
| | | data | Any | Node |
| 1 | countNodes | head | Node | |
+-----+-----------------+-------------+--------+---------------+
Using codetext as Python module
Build your language
codetext needs a tree-sitter language file (i.e. a .so file) to work properly. You can compile the language manually (see more) or build it automatically with our pre-defined function (the <language>.so file will be saved in a folder named /tree-sitter/):
from codetext.utils import build_language
language = 'rust'
build_language(language)
# INFO:utils:Not found tree-sitter-rust, attempt clone from github
# Cloning into 'tree-sitter-rust'...
# remote: Enumerating objects: 2835, done. ...
# INFO:utils:Attempt to build Tree-sitter Language for rust and store in .../tree-sitter/rust.so
Using Language Parser
Each supported programming language corresponds to a custom language_parser (e.g. Python is handled by PythonParser()). A language_parser takes raw source code as input and uses breadth-first search to traverse all syntax nodes. Classes, methods, and stand-alone functions are then collected:
from codetext.utils import parse_code
raw_code = """
/**
* Sum of 2 number
* @param a int number
* @param b int number
*/
double sum2num(int a, int b) {
return a + b;
}
"""
# Auto parse code into tree-sitter.Tree
root = parse_code(raw_code, 'cpp')
root_node = root.root_node
Get all function nodes inside a specific node:
from codetext.utils.parser import CppParser
function_list = CppParser.get_function_list(root_node)
print(function_list)
# [<Node type=function_definition, start_point=(6, 0), end_point=(8, 1)>]
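The breadth-first collection described above can be illustrated without tree-sitter installed. This is a minimal sketch of the general technique, not codetext's actual implementation: `FakeNode` and `collect_nodes_bfs` are hypothetical names, and the stand-in node mimics only the `type` and `children` attributes that tree-sitter nodes expose.

```python
from collections import deque
from typing import Iterable, List, Optional


class FakeNode:
    """Stand-in for a tree-sitter node; only `type` and `children` are used."""

    def __init__(self, type: str, children: Optional[List["FakeNode"]] = None):
        self.type = type
        self.children = children or []


def collect_nodes_bfs(root, wanted_types: Iterable[str]) -> list:
    """Return every node whose type is in `wanted_types`, in breadth-first order."""
    wanted = set(wanted_types)
    found = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node.type in wanted:
            found.append(node)
        # Children are visited after all nodes already waiting in the queue.
        queue.extend(node.children)
    return found


# A tiny hand-built tree: one top-level function and one nested in a class.
tree = FakeNode("translation_unit", [
    FakeNode("comment"),
    FakeNode("function_definition", [FakeNode("compound_statement")]),
    FakeNode("class_specifier", [FakeNode("function_definition")]),
])
functions = collect_nodes_bfs(tree, {"function_definition"})
print(len(functions))  # 2
```

Because the queue is processed level by level, the top-level function is reported before the one nested inside the class.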
Get function metadata (e.g. function's name, parameters, (optional) return type)
function = function_list[0]
metadata = CppParser.get_function_metadata(function, raw_code)
# {'identifier': 'sum2num', 'parameters': {'a': 'int', 'b': 'int'}, 'type': 'double'}
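As an illustration of how the metadata dictionary might be consumed, the sketch below rebuilds a C-style signature from it. `format_signature` is a hypothetical helper, and it assumes the dictionary has the `identifier`, `parameters`, and `type` keys shown in the output above.

```python
def format_signature(metadata: dict) -> str:
    """Render a C-style signature from a metadata dict (hypothetical helper)."""
    params = ", ".join(
        # A missing parameter type is rendered as the bare name.
        f"{ptype or ''} {name}".strip()
        for name, ptype in metadata["parameters"].items()
    )
    return_type = metadata.get("type") or ""
    return f"{return_type} {metadata['identifier']}({params})".strip()


metadata = {'identifier': 'sum2num', 'parameters': {'a': 'int', 'b': 'int'}, 'type': 'double'}
print(format_signature(metadata))  # double sum2num(int a, int b)
```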
Get docstring (documentation) of a function
docstring = CppParser.get_docstring(function, raw_code)
# ['Sum of 2 number \n@param a int number \n@param b int number']
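To give an idea of what turning a block comment into a docstring involves, here is a minimal sketch that strips the `/** ... */` delimiters and the leading `*` from each line. It is not codetext's implementation, `strip_block_comment` is a hypothetical name, and it does not try to reproduce the exact whitespace of the output shown above.

```python
import re


def strip_block_comment(comment: str) -> str:
    """Remove /** ... */ delimiters and the leading '*' from each line."""
    body = comment.strip()
    body = re.sub(r"^/\*+", "", body)  # opening /* or /**
    body = re.sub(r"\*+/$", "", body)  # closing */
    lines = []
    for line in body.splitlines():
        # Drop indentation, one optional '*', and one following space.
        cleaned = re.sub(r"^\s*\*?\s?", "", line).rstrip()
        if cleaned:
            lines.append(cleaned)
    return "\n".join(lines)


comment = """/**
* Sum of 2 number
* @param a int number
* @param b int number
*/"""
print(strip_block_comment(comment))
```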
We also provide two methods for extracting class objects:
class_list = CppParser.get_class_list(root_node)
# and
metadata = CppParser.get_metadata_list(root_node)
Limitations
codetext heavily depends on tree-sitter syntax:
- Since we use the tree-sitter grammar to extract the desired nodes, such as functions, classes, function names (identifiers), or class argument lists, codetext can easily be broken by tree-sitter updates or future syntax changes.
- While we try our best to capture all possibilities, there are still plenty of cases we do not cover. We welcome community contributions to this project.