Parsers and tools for extracting method/class-level features from source code
Source Parser
This package contains tools for parsing source code into an annotated JSON data structure. It extracts
import statements, global assignments, top-level methods, classes, class methods, and attributes, and
separates each method and class into signature, docstring, body, and other language-specific attributes
for downstream modeling purposes. The package also provides a high-performance multiprocessing tool, repo_parse,
which clones repositories and parses all of their files at a rate of thousands
of repositories per minute. See the following sections for installation, usage, and, at the end, a general
description of the annotated schema.
Currently supported languages
- Python
- Java
- JavaScript/TypeScript
- C#
- C++
- Ruby
Installation
NOTE: this tool is only supported on *NIX-style operating systems (Linux, macOS, FreeBSD, etc.)
PyPI installation
To install from PyPI, simply invoke
python -m pip install source-parser
Usage
Scripting
Simply load the source file contents and hand to a parser, e.g.
from source_parser.parsers import PythonParser

with open('source_parser/crawler.py') as f:
    pp = PythonParser(f.read())
print(pp.schema)
will print the schema extracted from source_parser/crawler.py.
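From there you can drill into the parsed structure. A minimal sketch, assuming the 'methods', 'name', and 'signature' keys described in the Data Schema section below:

from source_parser.parsers import PythonParser

# Parse a file and list its top-level methods; each entry in
# schema['methods'] is a dictionary annotating one method.
with open('source_parser/crawler.py') as f:
    schema = PythonParser(f.read()).schema

for method in schema['methods']:
    print(method['name'], '->', method['signature'])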
Parsing at scale
The real intention of this tool is to run massively at scale over hundreds of thousands of git repositories. Two CLI tools are added upon installation:
- repo_parse -h: semantically parses code using source_parser
- repo_scrape -h: just grabs all files matching some patterns
for example:
repo_parse <language> <repo_list.json> <outdir> [--tmpdir <temporary_directory>]
where <language> is one of the supported languages indicated in the help message,
<repo_list.json> is a path to a .json file containing a list of dictionaries with at
least a 'url' key for a git repository and optionally a 'license' key,
<outdir> is the directory in which to place the saved results as an lz4-compressed
jsonlines file, and --tmpdir is an optional place to save temporary data like cloned
repositories.
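For concreteness, a minimal repo_list.json might look like this (the URLs and license label are placeholders):

[
    {"url": "https://github.com/owner/repo-one", "license": "mit"},
    {"url": "https://github.com/owner/repo-two"}
]

With that file in place, a hypothetical invocation would be:

repo_parse python repo_list.json parsed_output --tmpdir /tmp/repo_parse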
Protip: mount a RAMdisk and hand it to --tmpdir to remove the IO bottleneck
and double parsing speeds! Further, you can set <outdir> to be in the RAMdisk as well,
so no disk is necessary (if you have enough memory).

sudo mount -t tmpfs -o size=<size in Gigabytes>G <name-ramdisk> /path/to/ramdisk
Reading the data
The default compression algorithm used is lz4 for its high speed and reasonable
compression ratio. Because the JSON dictionaries are highly compressible, you can
generally expect the uncompressed data to be 2-3x as large, so do read the data in
streaming fashion rather than holding it all in memory uncompressed at once.
To this end there is a nice tool in source_parser/__init__.py, importable
via from source_parser import load_zip_json, which returns an iterator object
that decompresses and yields only one file-level schema dictionary at a time.
To use:
from source_parser import load_zip_json
for example in load_zip_json('file_saved_from_repocontext.lz4'):
    process_file_example(example)
If you'd like to load it all into memory at once:
from source_parser import load_zip_json
all_data = list(load_zip_json('file_saved_from_repocontext.lz4'))
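As a middle ground, here is a small sketch of streaming aggregation; the counting logic is illustrative, not part of the package:

from source_parser import load_zip_json

# Stream one file-level schema dictionary at a time, keeping only
# running counts in memory instead of the whole decompressed archive.
n_files, n_methods = 0, 0
for example in load_zip_json('file_saved_from_repocontext.lz4'):
    n_files += 1
    n_methods += len(example.get('methods', []))

print(f'{n_files} files, {n_methods} top-level methods')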
Data Schema
This is a description of the JSON schema into which source_parser
transforms source code files, for use in method- and class-level code and natural
language modeling. The data will consist of JSON lines, that is, valid JSON
separated by newline characters. Each line of JSON holds the features
extracted from a single source code file. The JSON schema for each
individual file is as follows:
NOTE: see the individual language parsers in source_parser/parsers
for the language-specific method and class attributes.
{
    'file_name': 'name_of_file.extension',
    'file_hash': 'hash of file for literal deduplication',
    'relative_path': 'repo_top_level/path/to/file/name_of_file.extension',
    'repo_name': 'owner/repo-name',
    'commit-hash': 'hash of the commit being analyzed',
    'license': {
        'label': 'label provided by github API or in json list',
        'files': [
            {
                'relative_path': 'path/to/license/file',
                'file_contents': 'license file contents',
            },
        ],
    },
    'original_string': 'original string of file',
    'file_docstring': 'string containing first docstring for all of file',
    'contexts': [
        'import statement 1',
        'import statement 2',
        'global variable expression 1',
        ...
    ],
    'language_version_details': [
        'e.g. python2 syntax detected', "another language's idiosyncrasies"
    ],
    'methods': [  # list of dictionaries annotating each method
        {
            'original_string': 'verbatim code of whole method',
            'byte_span': (start_byte, end_byte),
            'start_point': (start_line_number, start_column),
            'end_point': (end_line_number, end_column),
            'signature': 'string corresponding to definition, name, arguments of method',
            'name': 'name of method',
            'docstring': 'verbatim docstring corresponding to this method',
            'body': 'verbatim code body',
            'original_string_normed': 'code of whole method with string-literal, numeral normalization',
            'signature_normed': 'signature with string-literals/numerals normalized',
            'body_normed': 'code of body with string-literals/numerals normalized',
            'default_arguments': {'arg1': 'default value 1', ...},
            'syntax_pass': 'True/False whether the method is syntactically correct',
            'attributes': {  # language-specific keys and values, e.g.
                'language_specific_keys': 'language_specific_values',
                'decorators': ['@wrap', '@abstractmethod'],
                ...
            },
            ...
        },
        ...
    ],
    'classes': [
        {
            'original_string': 'verbatim code of class',
            'byte_span': (start_byte, end_byte),
            'start_point': (start_line_number, start_column),
            'end_point': (end_line_number, end_column),
            'name': 'class name',
            'definition': 'class definition statement',
            'class_docstring': 'docstring corresponding to top-level class definition',
            'attributes': {  # language-specific keys and values, e.g.
                'expression_statements': [
                    {
                        'expression': 'attribute = 1',
                        'comment': 'comment associated'
                    },
                    ...
                ],
                'classes': [  # classes defined within classes
                    {
                        # same structure as top-level classes
                    },
                    ...
                ],
                ...
            },
            'methods': [
                # list of class methods of the same form as top-level methods
                ...
            ],
        },
        ...
    ],
}
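Given this schema, here is a short sketch of walking a parsed archive and visiting every method, top-level and class-level alike; the helper function and archive name are illustrative, not part of the package:

from source_parser import load_zip_json

def iter_methods(file_schema):
    # Yield every method dictionary in one file-level schema:
    # first the top-level methods, then each class's methods.
    yield from file_schema.get('methods', [])
    for cls in file_schema.get('classes', []):
        yield from cls.get('methods', [])

for example in load_zip_json('file_saved_from_repocontext.lz4'):
    for method in iter_methods(example):
        print(method['name'], method['byte_span'])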
Contributing
We welcome contributions; please follow the project's contribution guidelines.
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.