Skip to main content

Multilingual programming language parsers for the extract from raw source code into multiple levels of pair data

Project description

logo

______________________________________________________________________
Branch Build Unittest Release License
main Unittest release pyversion license

Code-Text parser is a custom tree-sitter's grammar parser for extract raw source code into class and function level. We support 10 common programming languages:

  • Python
  • Java
  • JavaScript
  • PHP
  • Ruby
  • Rust
  • C
  • C++
  • C#
  • Go

Installation

codetext package require python 3.7 or above and tree-sitter. Setup environment and install dependencies manually from source:

git https://github.com/FSoft-AI4Code/CodeText-parser.git; cd CodeText-parser
pip install -r requirement.txt
pip install -e .

Or install via pypi package:

pip install codetext

Getting started

codetext CLI Usage

codetext [options] [PATH or FILE] ...

For example extract any python file in src/ folder:

codetext src/ --language Python

If you want to store extracted class and function, use flag --json and give a path to destination file:

codetext src/ --language Python --output_file ./python_report.json --json

Options

positional arguments:
  paths                 list of the filename/paths.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -l LANGUAGE, --language LANGUAGE
                        Target the programming languages you want to analyze.
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Output file (e.g report.json).
  --json                Generate json output as a transform of the default
                        output
  --verbose             Print progress bar

Example

File circle_linkedlist.py analyzed:
==================================================
Number of class    : 1
Number of function : 2
--------------------------------------------------

Class summary:
+-----+---------+-------------+
|   # | Class   | Arguments   |
+=====+=========+=============+
|   0 | Node    |             |
+-----+---------+-------------+

Class analyse: Node
+-----+---------------+-------------+--------+---------------+
| #   | Method name   | Paramters   | Type   | Return type   |
+=====+===============+=============+========+===============+
| 0   | __init__      | self        |        |               |
|     |               | data        |        |               |
+-----+---------------+-------------+--------+---------------+

Function analyse:
+-----+-----------------+-------------+--------+---------------+
| #   | Function name   | Paramters   | Type   | Return type   |
+=====+=================+=============+========+===============+
| 0   | push            | head_ref    |        | Node          |
|     |                 | data        | Any    | Node          |
| 1   | countNodes      | head        | Node   |               |
+-----+-----------------+-------------+--------+---------------+

Using codetext as Python module

Build your language

codetext need tree-sitter language file (i.e .so file) to work properly. You can manually compile language (see more) or automatically build use our pre-defined function (the <language>.so will saved in a folder name /tree-sitter/):

from codetext.utils import build_language

language = 'rust'
build_language(language)

# INFO:utils:Not found tree-sitter-rust, attempt clone from github
# Cloning into 'tree-sitter-rust'...
# remote: Enumerating objects: 2835, done. ...
# INFO:utils:Attempt to build Tree-sitter Language for rust and store in .../tree-sitter/rust.so

Using Language Parser

Each programming language we supported are correspond to a custome language_parser. (e.g Python is PythonParser()). language_parser take input as raw source code and use breadth-first search to traveser through all syntax node. The class, method or stand-alone function will then be collected:

from codetext.utils import parse_code

raw_code = """
    /**
    * Sum of 2 number
    * @param a int number
    * @param b int number
    */
    double sum2num(int a, int b) {
        return a + b;
    } 
"""

# Auto parse code into tree-sitter.Tree
root = parse_code(raw_code, 'cpp')
root_node = root.root_node

Get all function nodes inside a specific node:

from codetext.utils.parser import CppParser

function_list = CppParser.get_function_list(root_node)
print(function_list)

# [<Node type=function_definition, start_point=(6, 0), end_point=(8, 1)>]

Get function metadata (e.g. function's name, parameters, (optional) return type)

function = function_list[0]

metadata = CppParser.get_function_metadata(function, raw_code)

# {'identifier': 'sum2num', 'parameters': {'a': 'int', 'b': 'int'}, 'type': 'double'}

Get docstring (documentation) of a function

docstring = CppParser.get_docstring(function, code_sample)

# ['Sum of 2 number \n@param a int number \n@param b int number']

We also provide 2 command for extract class object

class_list = CppParser.get_class_list(root_node)
# and
metadata = CppParser.get_metadata_list(root_node)

Limitations

codetext heavly depends on tree-sitter syntax:

  • Since we use tree-sitter grammar to extract desire node like function, class, function's name (identifier) or class's argument list, etc. codetext is easily vulnerable by tree-sitter update patch or syntax change in future.

  • While we try our best to capture all possiblity, there are still plenty out there. We open for community to contribute into this project.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

codetext-0.0.8.1.tar.gz (28.3 kB view hashes)

Uploaded Source

Built Distribution

codetext-0.0.8.1-py3-none-any.whl (35.7 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page