
Detects possibly suspicious stuff in your source files

Project description

Suspicious

Sniffs possibly suspicious stuff in your source code. 100% local, no data leaves your computer.




WTF is this?

This is a CLI application that analyzes a source code file using an AI model. It then shows you parts that look suspicious to it.

It does not use rules or static analysis the way a linter tool would. Instead, the model generates its own code suggestions based on the surrounding context. Check out how it works.

NB: all processing is done on your hardware; no data is transmitted to the Internet.

Example output:

[screenshot: example results]

Demo

Here's the output of running the application on its own source files (so meta).

Have I seen this before?

There was a post, "AI found a bug in my code", on Hacker News, which was pretty cool. I wanted to try the idea on my own code, so I went ahead and built my own implementation.

Installation

You can install sus via pip or from source.

Pip (macOS, Linux, Windows)

pip3 install suspicious

From source

git clone git@github.com:sturdy-dev/suspicious.git
cd suspicious
python -m pip install .

Usage

You can run the program like this:

sus /path/to/file.py

This will generate and open an .html file with the results.

  • grey means the prediction matches the original
  • light grey means the model predicted something different, but with very low confidence
  • light red means things are looking a little sus
  • red means the prediction was different and the model's confidence was higher
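The legend above can be sketched as a small mapping function. This is a hypothetical illustration; the actual thresholds and class names used by sus are not documented here, so the values below are made up:

```python
# Hypothetical sketch of mapping a token's prediction to a color class
# for the HTML report. Thresholds are illustrative, not sus's real values.

def color_class(original: str, predicted: str, probability: float) -> str:
    """Pick a display class for a token based on the model's prediction."""
    if predicted == original:
        return "grey"        # prediction matches the source
    if probability < 0.3:
        return "light-grey"  # different prediction, but very low confidence
    if probability < 0.7:
        return "light-red"   # a little sus
    return "red"             # confident disagreement

print(color_class("foo", "foo", 0.99))  # grey
print(color_class("foo", "bar", 0.9))   # red
```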

Practical usage

Unclear. You run sus on a file and skim the red parts; maybe it spots something you missed. Ping me on Twitter if you catch something cool with it.

How does it work?

In a nutshell, it feeds a tokenized representation of your source text into a Transformer model and asks the model to predict one token at a time using Masked Language Modelling.

For a general overview of Transformer models, check out The Illustrated Transformer by Jay Alammar, which helped me understand the core ideas.

sus uses a model called UniXcoder, which has been trained on the CodeSearchNet dataset. To do masked language modelling (MLM), an lm_head layer is added on top.

When sus processes your code, it first tokenizes the text; a token can be a special character, a programming-language keyword, an English word, or part of a word.

Before feeding the sequence of token ids to the model, one or more tokens are replaced with a special <mask> token. After feeding the input through the network, we extract just the value at the masked location. This masking is done in a loop for each token to generate individual predictions.
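The loop described above can be sketched like this. `predict_masked` is a stand-in for the real model forward pass (which sus performs with UniXcoder); here it is just a callable you supply:

```python
# Sketch of the per-token masking loop. `predict_masked` stands in for
# the actual model call and returns (predicted_token, probability).
from typing import Callable

MASK = "<mask>"

def predictions_per_token(tokens: list, predict_masked: Callable) -> list:
    results = []
    for i in range(len(tokens)):
        # Replace exactly one position with the mask token.
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        predicted, probability = predict_masked(masked, i)
        results.append({
            "idx": i,
            "original": tokens[i],
            "predicted": predicted,
            "probability": probability,
        })
    return results

# Toy "model" that always predicts "x" with 50% confidence:
demo = predictions_per_token(["a", "b"], lambda seq, i: ("x", 0.5))
print(demo[0]["original"], demo[0]["predicted"])  # a x
```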

Since masking one token at a time is impractically slow, sus instead masks 10% of the tokens per pass, making sure that the masked locations are spread out (so that there is sufficient context around each prediction site).
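One simple way to get this "10%, spread out" behaviour is a strided schedule: mask every 10th position, shifting the offset on each pass so every token is masked exactly once across 10 passes. This is a sketch of the idea, not necessarily sus's exact scheduling:

```python
# Strided mask schedule: each pass masks ~1/stride of the positions,
# and the masked positions are spread `stride` tokens apart.

def mask_schedule(n_tokens: int, stride: int = 10):
    """Yield one list of masked indices per pass."""
    for offset in range(stride):
        yield list(range(offset, n_tokens, stride))

passes = list(mask_schedule(25))
print(passes[0])  # [0, 10, 20]

# Every index is covered exactly once across all passes:
covered = sorted(i for p in passes for i in p)
print(covered == list(range(25)))  # True
```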

The output of this entire process is a list of structs that contain the original and predicted values for each token. Example:

{
    "idx": 0, // position in sequence
    "original": "foo", // as originally written in the source file
    "predicted": "bar", // what the model predicted
    "cosine_similarity": 0.23, // how different the prediction is from the original in the vector space
    "probability": 0.92, // how confident the model is in it's prediction
}

This is then fed into an html template to be rendered for the user. Easy-peasy.
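The cosine_similarity field in the struct above compares the embedding of the predicted token with that of the original. A minimal stdlib version, with toy 3-dimensional vectors standing in for real token embeddings:

```python
# Cosine similarity between two embedding vectors: 1.0 means identical
# direction, 0.0 means orthogonal (unrelated). Toy vectors for illustration.
import math

def cosine_similarity(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```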

Model

sus uses the decoder of UniXcoder, specifically the unixcoder-base-nine checkpoint. What's cool is that it's only 500 MB and ~120M parameters, which means it's quick to download and fast enough to run locally.

Larger models produce higher quality outputs, but you need to run the inference on a server.

Supported languages

You can try sus on any source file, but you can expect best results with the following languages:

  • java
  • ruby
  • python
  • php
  • javascript
  • go
  • c
  • c++
  • c#

Bugs and limitations

  • Accuracy — sus is meant to be executed locally (aka not sending code to a server), which puts some constraints on the AI model size. Larger models will produce higher quality results, but they can be tens of GB in size and without a beefy GPU could take a long time to generate the output. Because of this, sus uses a modestly sized model.
  • Large files — The model also puts constraints on the input size (the size of the analyzed file). sus works around this by batching the input, but as a result, batches are not aware of the context / code in other batches. Files are split into batches of 2,500 characters, which is crude and is meant to roughly correspond to ~1024 tokens.
  • Masking is done on a per-token basis. It could be interesting to first generate a syntax tree from the code and then mask entire nodes instead.
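The crude character-based batching from the second bullet can be sketched as a fixed-size split. Each chunk is analyzed independently, so no context crosses chunk boundaries:

```python
# Split a source file into fixed-size chunks of 2,500 characters,
# as described above. Chunk boundaries ignore lines and syntax.

BATCH_CHARS = 2500

def batch_source(text: str, size: int = BATCH_CHARS) -> list:
    return [text[i : i + size] for i in range(0, len(text), size)]

chunks = batch_source("x" * 6000)
print([len(c) for c in chunks])  # [2500, 2500, 1000]
```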

License

sus is distributed under the AGPL-3.0-only license. For Apache-2.0 exceptions, contact kiril@codeball.ai.

