Linguado is a tool which compares the AST of two or more files

Linguado

Linguado is a tool which compares the abstract syntax trees (AST) of two or more scripts to measure their similarity. Its main goal is to detect variants of the same malware.

Background

This tool was developed by Guzmán Cernadas Pérez (@DonCaralludo) while working for BE:SEC (@BESEC_byEmetel). It was presented by Marcos Carro Fernández and Guzmán Cernadas Pérez at VICON in April 2023.

Installation

From pypi:

pip3 install linguado

From repo:

git clone https://github.com/caralludo/linguado.git
cd linguado
pip3 install .

Usage

To run this tool, you need two or more source files, and you need to know which language they are written in.

The help is as follows:

usage: main.py [-h] [-p PAR] [-o OUTPUT] Files [Files ...] Language

Linguado is a tool which compares the AST of two or more files. Created by Guzmán Cernadas Pérez (@DonCaralludo)
working for BE:SEC (@BESEC_byEmetel)

positional arguments:
  Files                 Files to analyze
  Language              Language of the files. Options: c, javascript, nasm, php, python2, python3, vba

options:
  -h, --help            show this help message and exit
  -p PAR, --par PAR     Changes the number of iterations of the Weisfeiler-Lehman algorithm (default: 3)
  -o OUTPUT, --output OUTPUT
                        Changes the base name of the output files (default: result.csv)

Examples

Compare two source files written in Python 3:

linguado source1.py source2.py python3

Compare two or more files written in Python 3:

linguado source* python3

Compare two or more files written in Python 3 and change the number of iterations of the Weisfeiler-Lehman algorithm:

linguado source* python3 -p 10

Compare two source files written in Python 3 and change the output name:

linguado source1.py source2.py python3 -o output.csv

Available languages

The tool can currently compare sources written in the following languages:

  • C
  • JavaScript
  • NASM
  • PHP
  • Python2
  • Python3
  • VBA

Adding new languages

To add a new language, follow these steps:

  1. Install ANTLR.
  2. Create or obtain a grammar in ANTLR4 format.
  3. Generate the files with the following command:
antlr4 -Dlanguage=Python3 *.g4
  4. Save the files in a new folder in the path ./linguado/[language name]
  5. Import the Lexer and Parser in the file linguado/main.py
from mygrammar.MyGrammarLexer import MyGrammarLexer
from mygrammar.MyGrammarParser import MyGrammarParser
from linguado.c.CLexer import CLexer
from linguado.c.CParser import CParser
from javascript.JavaScriptLexer import JavaScriptLexer
from javascript.JavaScriptParser import JavaScriptParser
from php.PhpLexer import PhpLexer
from php.PhpParser import PhpParser
from python2.Python2Lexer import Python2Lexer
from python2.Python2Parser import Python2Parser
from python3.Python3Lexer import Python3Lexer
from python3.Python3Parser import Python3Parser
from vba.vbaLexer import vbaLexer
from vba.vbaParser import vbaParser
  6. Add an entry to the dictionary in the file linguado/main.py containing the lexer, the parser and the name of the first rule of the grammar
    language_functions = {
        "c": [CLexer, CParser, "translationUnit"],
        "javascript": [JavaScriptLexer, JavaScriptParser, "program"],
        "mygrammar": [MyGrammarLexer, MyGrammarParser, "first_rule"],
        "php": [PhpLexer, PhpParser, "htmlDocument"],
        "python2": [Python2Lexer, Python2Parser, "file_input"],
        "python3": [Python3Lexer, Python3Parser, "file_input"],
        "vba": [vbaLexer, vbaParser, "startRule"]
    }
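
The sketch below illustrates one way an entry of this dictionary could be consumed. It is an assumption about how main.py might use it, not code taken from the project: StubLexer and StubParser are hypothetical stand-ins for the ANTLR-generated classes, and build_tree is a hypothetical helper. The point it shows is that the third element is a rule name, which can be resolved to a parser method with getattr.

```python
class StubLexer:
    """Hypothetical stand-in for an ANTLR-generated lexer."""

    def __init__(self, text):
        self.text = text


class StubParser:
    """Hypothetical stand-in for an ANTLR-generated parser."""

    def __init__(self, lexer):
        self.lexer = lexer

    def first_rule(self):
        # Generated parsers expose one method per grammar rule;
        # this stub just returns the rule name and the split input.
        return ("first_rule", self.lexer.text.split())


# Same shape as the dictionary above: [lexer class, parser class, start rule].
language_functions = {
    "mygrammar": [StubLexer, StubParser, "first_rule"],
}


def build_tree(text, language):
    """Look up the classes for `language` and call the start rule by name."""
    lexer_class, parser_class, start_rule = language_functions[language]
    parser = parser_class(lexer_class(text))
    return getattr(parser, start_rule)()


print(build_tree("a b c", "mygrammar"))
```

With real ANTLR-generated classes, the lexer would normally be fed an antlr4 InputStream and the parser a CommonTokenStream wrapping the lexer; the stubs skip that plumbing to keep the example self-contained.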

Output

An example of the output:

Generating AST's
100%|██████████| 4/4 [00:03<00:00,  1.16it/s]
Calculating Weisfeiler-Lehman matrix
100%|██████████| 3/3 [00:00<00:00, 28.09it/s]
Checking isomorphism (igraph)
100%|██████████| 4/4 [00:02<00:00,  1.98it/s]
Weisfeiler-Lehman:
[[58162880. 58162880. 58162880. 58162880.]
 [58162880. 58162880. 58162880. 58162880.]
 [58162880. 58162880. 58162880. 58162880.]
 [58162880. 58162880. 58162880. 58162880.]]
Weisfeiler-Lehman (%):
[[100. 100. 100. 100.]
 [100. 100. 100. 100.]
 [100. 100. 100. 100.]
 [100. 100. 100. 100.]]
Mean: 58162880.0 , Standard deviation: +- 0.0 ,  0.0
Isomorphism test (igraph):
[[ True  True  True  True]
 [ True  True  True  True]
 [ True  True  True  True]
 [ True  True  True  True]]

In each matrix, both the rows and the columns represent the source files, ordered by name. Each intersection therefore holds the comparison between the two corresponding files.

             source1.py  source2.py  source3.py  source4.py
source1.py [[ 58162880.   58162880.   58162880.   58162880.]
source2.py  [ 58162880.   58162880.   58162880.   58162880.]
source3.py  [ 58162880.   58162880.   58162880.   58162880.]
source4.py  [ 58162880.   58162880.   58162880.   58162880.]]

             source1.py  source2.py  source3.py  source4.py
source1.py [[ True        True        True        True]
source2.py  [ True        True        True        True]
source3.py  [ True        True        True        True]
source4.py  [ True        True        True        True]]
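
The row and column convention above can be sketched as a lookup table keyed by file pairs. This is only an illustration: pair_scores is a hypothetical helper, and it assumes that "ordered by name" means a plain string sort of the file names, which the tool may or may not use.

```python
def pair_scores(file_names, matrix):
    """Map (row file, column file) to the value at that intersection."""
    ordered = sorted(file_names)  # assumption: plain lexicographic order
    return {
        (row_name, col_name): matrix[i][j]
        for i, row_name in enumerate(ordered)
        for j, col_name in enumerate(ordered)
    }


# Made-up values, used only to show the indexing.
names = ["source2.py", "source1.py"]
matrix = [[10.0, 4.0],
          [4.0, 8.0]]

scores = pair_scores(names, matrix)
print(scores[("source1.py", "source2.py")])
```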

The tool also creates two CSV files containing the same information that is printed in the terminal.

Measuring similarity

Two sources have the same abstract syntax tree if:

  • The isomorphism test matrix has a True at the intersection of the two sources.

Two sources do not have the same abstract syntax tree if:

  • The Weisfeiler-Lehman matrix has different values.

If the sources do not have the same abstract syntax tree, the standard deviation can be used to estimate how similar they are:

  • If the standard deviation is close to zero (less than 5%), the sources are very similar.
  • If the standard deviation is around 20%, there is a chance that the sources share some code.
  • If the standard deviation is more than 50%, the sources are not the same.
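
The thresholds above can be written as a small helper. This is a sketch under stated assumptions, not the tool's own code: it assumes the percentage is the population standard deviation of the Weisfeiler-Lehman values divided by their mean, and it labels everything between 5% and 50% as inconclusive because the text only gives guidance for values around 20%. Both function names are hypothetical.

```python
import statistics


def relative_deviation(values):
    """Standard deviation as a percentage of the mean (assumed definition)."""
    mean = statistics.fmean(values)
    return 100.0 * statistics.pstdev(values) / mean


def interpret(percent):
    """Translate a relative standard deviation into the README's wording."""
    if percent < 5:
        return "very similar"
    if percent > 50:
        return "not the same"
    # The README only says that around 20% suggests shared code.
    return "inconclusive, possibly sharing some code"


# Values taken from the example output: every entry is identical.
wl_values = [58162880.0] * 16
percent = relative_deviation(wl_values)
print(percent, interpret(percent))
```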

Behavior

  1. Generates the abstract syntax tree with ANTLR4.
  2. Builds a graph from the abstract syntax tree.
  3. Calculates the Weisfeiler-Lehman matrix.
  4. Performs the isomorphism test (igraph).
  5. Prints the results of the Weisfeiler-Lehman algorithm and the isomorphism test on the screen and writes them to CSV files.
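
Steps 1 to 3 can be sketched with the standard library alone. The example below is not Linguado's implementation: it swaps ANTLR4 for Python's built-in ast module, labels each node with its type name only, and uses a textbook Weisfeiler-Lehman subtree kernel. The cosine-style percentage is an assumption too, since how the tool derives its percentage matrix is not described here. All function names are hypothetical.

```python
import ast
import math
from collections import Counter


def tree_to_graph(source):
    """Parse Python source and return (adjacency, labels) for its syntax tree."""
    adjacency, labels = {}, {}
    stack = [(ast.parse(source), None)]
    while stack:
        node, parent = stack.pop()
        index = len(labels)                  # fresh integer id per visited node
        labels[index] = type(node).__name__  # node type only, identifiers ignored
        adjacency[index] = []
        if parent is not None:
            adjacency[index].append(parent)
            adjacency[parent].append(index)
        for child in ast.iter_child_nodes(node):
            stack.append((child, index))
    return adjacency, labels


def wl_features(adjacency, labels, iterations=3):
    """Count node labels over `iterations` rounds of WL relabelling."""
    current = dict(labels)
    counts = Counter(current.values())
    for _ in range(iterations):
        # New label = old label plus the sorted labels of the neighbours.
        current = {
            node: (current[node], tuple(sorted(current[n] for n in neighbours)))
            for node, neighbours in adjacency.items()
        }
        counts.update(current.values())
    return counts


def wl_kernel(features_a, features_b):
    """Dot product of two label-count vectors."""
    return sum(count * features_b[label] for label, count in features_a.items())


def similarity_percent(features_a, features_b):
    """Kernel value normalised to 0..100 (assumed normalisation)."""
    norm = math.sqrt(wl_kernel(features_a, features_a) * wl_kernel(features_b, features_b))
    return 100.0 * wl_kernel(features_a, features_b) / norm


original = wl_features(*tree_to_graph("x = 1 + 2\nprint(x)\n"))
renamed = wl_features(*tree_to_graph("total = 1 + 2\nprint(total)\n"))
different = wl_features(*tree_to_graph("for i in range(3):\n    pass\n"))

print(similarity_percent(original, renamed))
print(similarity_percent(original, different))
```

Because this sketch ignores identifiers and literal values, renaming a variable leaves the features unchanged, while a structurally different script scores lower. The default of three iterations mirrors the default of the -p option.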

Other uses

This tool can be used to look for plagiarism in academic environments.

Links of interest

ANTLR

Grammars for ANTLR

VX-Underground Malware Repository

JavaScript Malware Repository

SPTH Repository
