Multilingual programming language parsers for the extract from raw source code into multiple levels of pair data
Project description
______________________________________________________________________
Branch | Build | Unittest | Release | License |
---|---|---|---|---|
main |
Code-Text parser is a custom tree-sitter's grammar parser for extract raw source code into class and function level. We support 10 common programming languages:
- Python
- Java
- JavaScript
- PHP
- Ruby
- Rust
- C
- C++
- C#
- Go
Installation
codetext package require python 3.7 or above and tree-sitter. Setup environment and install dependencies manually from source:
git https://github.com/FSoft-AI4Code/CodeText-parser.git; cd CodeText-parser
pip install -r requirement.txt
pip install -e .
Or install via pypi
package:
pip install codetext
Getting started
codetext
CLI Usage
codetext [options] [PATH or FILE] ...
For example extract any python file in src/
folder:
codetext src/ --language Python
If you want to store extracted class and function, use flag --json
and give a path to destination file:
codetext src/ --language Python --output_file ./python_report.json --json
Options
positional arguments:
paths list of the filename/paths.
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-l LANGUAGE, --language LANGUAGE
Target the programming languages you want to analyze.
-o OUTPUT_FILE, --output_file OUTPUT_FILE
Output file (e.g report.json).
--json Generate json output as a transform of the default
output
--verbose Print progress bar
Example
File circle_linkedlist.py analyzed:
==================================================
Number of class : 1
Number of function : 2
--------------------------------------------------
Class summary:
+-----+---------+-------------+
| # | Class | Arguments |
+=====+=========+=============+
| 0 | Node | |
+-----+---------+-------------+
Class analyse: Node
+-----+---------------+-------------+--------+---------------+
| # | Method name | Paramters | Type | Return type |
+=====+===============+=============+========+===============+
| 0 | __init__ | self | | |
| | | data | | |
+-----+---------------+-------------+--------+---------------+
Function analyse:
+-----+-----------------+-------------+--------+---------------+
| # | Function name | Paramters | Type | Return type |
+=====+=================+=============+========+===============+
| 0 | push | head_ref | | Node |
| | | data | Any | Node |
| 1 | countNodes | head | Node | |
+-----+-----------------+-------------+--------+---------------+
Using codetext
as Python module
Build your language
codetext
need tree-sitter language file (i.e .so
file) to work properly. You can manually compile language (see more) or automatically build use our pre-defined function (the <language>.so
will saved in a folder name /tree-sitter/
):
from codetext.utils import build_language
language = 'rust'
build_language(language)
# INFO:utils:Not found tree-sitter-rust, attempt clone from github
# Cloning into 'tree-sitter-rust'...
# remote: Enumerating objects: 2835, done. ...
# INFO:utils:Attempt to build Tree-sitter Language for rust and store in .../tree-sitter/rust.so
Using Language Parser
Each programming language we supported are correspond to a custome language_parser
. (e.g Python is PythonParser()
). language_parser
take input as raw source code and use breadth-first search to traveser through all syntax node. The class, method or stand-alone function will then be collected:
from codetext.utils import parse_code
raw_code = """
/**
* Sum of 2 number
* @param a int number
* @param b int number
*/
double sum2num(int a, int b) {
return a + b;
}
"""
# Auto parse code into tree-sitter.Tree
root = parse_code(raw_code, 'cpp')
root_node = root.root_node
Get all function nodes inside a specific node:
from codetext.utils.parser import CppParser
function_list = CppParser.get_function_list(root_node)
print(function_list)
# [<Node type=function_definition, start_point=(6, 0), end_point=(8, 1)>]
Get function metadata (e.g. function's name, parameters, (optional) return type)
function = function_list[0]
metadata = CppParser.get_function_metadata(function, raw_code)
# {'identifier': 'sum2num', 'parameters': {'a': 'int', 'b': 'int'}, 'type': 'double'}
Get docstring (documentation) of a function
docstring = CppParser.get_docstring(function, code_sample)
# ['Sum of 2 number \n@param a int number \n@param b int number']
We also provide 2 command for extract class object
class_list = CppParser.get_class_list(root_node)
# and
metadata = CppParser.get_metadata_list(root_node)
Limitations
codetext
heavly depends on tree-sitter syntax:
-
Since we use tree-sitter grammar to extract desire node like function, class, function's name (identifier) or class's argument list, etc.
codetext
is easily vulnerable by tree-sitter update patch or syntax change in future. -
While we try our best to capture all possiblity, there are still plenty out there. We open for community to contribute into this project.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file codetext-0.0.9.tar.gz
.
File metadata
- Download URL: codetext-0.0.9.tar.gz
- Upload date:
- Size: 28.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.14
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | f91f1ffc485aa24d97cf2ded723c6aaa93a2524242abadc0f3a8c2e52d09adc2 |
|
MD5 | 456bf22287116b6c0a4c239a0488397d |
|
BLAKE2b-256 | cf35374d854737e482c897b98371270871f48e1b7f879178526fe1fb457750ea |
File details
Details for the file codetext-0.0.9-py3-none-any.whl
.
File metadata
- Download URL: codetext-0.0.9-py3-none-any.whl
- Upload date:
- Size: 35.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.14
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | e1d0a739f780a2dce2d8f083208b15f57f068d1c47bccdde906d0ec57e8ed4b5 |
|
MD5 | a1036b1c2b03b7616a25954b0dfa8eab |
|
BLAKE2b-256 | 35d36218d294ce89cc12038145b10b233713b074ef7502632de329d925811d67 |