malwi - AI Python Malware Scanner

malwi detects Python malware using AI.

It specializes in detecting zero-day malware and classifies code as malicious or benign without requiring internet access.

Key Features

  • 🔍 Detects unknown malware patterns through AI analysis
  • 🔒 Runs completely offline - no data leaves your machine
  • ⚡ Fast scanning of entire codebases
  • 🚫 No external dependencies or cloud services required
  • 📖 Open-source project built on research and open data 🇪🇺

1) Install

pip install --user malwi

2) Run

malwi scan examples/malicious

3) Evaluate: a recent zero-day detected with high confidence

                  __          __
  .--------.---.-|  .--.--.--|__|
  |        |  _  |  |  |  |  |  |
  |__|__|__|___._|__|________|__|
     AI Python Malware Scanner


- target: examples/malicious
- seconds: 0.42
- files: 13
  ├── scanned: 3
  ├── skipped: 10
  └── suspicious:
      ├── examples/malicious/discordpydebug-0.0.4/setup.py
      │   └── <module>
      │       ├── archive compression
      │       └── package installation execution
      └── examples/malicious/discordpydebug-0.0.4/src/discordpydebug/__init__.py
          ├── <module>
          │   ├── process management
          │   ├── system interaction
          │   ├── deserialization
          │   └── user io
          ├── run
          │   └── fs linking
          ├── debug
          │   ├── fs linking
          │   └── archive compression
          └── runcommand
              └── process management

=> 👹 malicious 0.98

PyPI Package Scanning

malwi can scan PyPI packages directly without executing any malicious logic, which is typically placed in setup.py or __init__.py files:

malwi pypi requests

                  __          __
  .--------.---.-|  .--.--.--|__|
  |        |  _  |  |  |  |  |  |
  |__|__|__|___._|__|________|__|
     AI Python Malware Scanner


- target: downloads/requests-2.32.4.tar
- seconds: 3.10
- files: 84
  ├── scanned: 34
  └── skipped: 50

=> 🟢 good
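When scripting around the CLI (for example, to fail a CI job), the final verdict line of a report can be picked out with a small parser. This is only a sketch under assumptions: it relies on the two report endings shown above (a line starting with "=>", a label, and an optional score), which may change between malwi versions. parse_verdict and Verdict are hypothetical helpers, not part of malwi.

```python
import re
from typing import NamedTuple, Optional


class Verdict(NamedTuple):
    label: str
    score: Optional[float]


# "=>", then any non-letter characters (such as an emoji), a one-word label,
# and an optional score between 0 and 1. Format assumed from the sample reports.
_VERDICT_RE = re.compile(r"^=>\s*[^A-Za-z]*([A-Za-z]+)(?:\s+([01](?:\.\d+)?))?\s*$")


def parse_verdict(report: str) -> Optional[Verdict]:
    """Return the last verdict line found in a report, or None if there is none."""
    for line in reversed(report.splitlines()):
        match = _VERDICT_RE.match(line.strip())
        if match:
            label, score = match.groups()
            return Verdict(label, float(score) if score else None)
    return None


sample_report = "- target: examples/malicious\n- files: 13\n\n=> 👹 malicious 0.98"
print(parse_verdict(sample_report))
```

The report text itself would come from running the scan command shown earlier and capturing its standard output.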

Why malwi?

The number of malicious open-source packages is growing. This is not just a threat to your business but also to the open-source community.

Typical malware behaviors include:

  • Exfiltration of data: Stealing credentials, API keys, or sensitive user data.
  • Backdoors: Allowing remote attackers to gain unauthorized access to your system.
  • Destructive actions: Deleting files, corrupting databases, or sabotaging applications.

How does it work?

malwi applies DistilBERT, following the design of "Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application" (2025). The model is trained on the malwi-samples dataset.

1. Compile Python files to bytecode

def runcommand(value):
    output = subprocess.run(value, shell=True, capture_output=True)
    return [output.stdout, output.stderr]

  0           RESUME                   0

  1           LOAD_CONST               0 (<code object runcommand at 0x5b4f60ae7540, file "example.py", line 1>)
              MAKE_FUNCTION
              STORE_NAME               0 (runcommand)
              RETURN_CONST             1 (None)
  ...
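The compilation step can be reproduced with the standard library alone. The sketch below is not malwi's own compiler: it assumes CPython's built-in compile() and the dis module, and the helper names iter_code_objects and disassemble_source are made up for illustration. Opcode names vary between Python versions, so the listing you get may differ from the one above.

```python
import dis
import types

SOURCE = '''
def runcommand(value):
    output = subprocess.run(value, shell=True, capture_output=True)
    return [output.stdout, output.stderr]
'''


def iter_code_objects(code):
    """Yield a code object and, recursively, every code object nested in its constants."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from iter_code_objects(const)


def disassemble_source(source, filename="example.py"):
    """Compile source text without executing it and map code object names to opcode names."""
    module_code = compile(source, filename, "exec")
    return {
        code.co_name: [ins.opname for ins in dis.get_instructions(code)]
        for code in iter_code_objects(module_code)
    }


listing = disassemble_source(SOURCE)
for name, opnames in listing.items():
    print(name, opnames)
```

Because the source is only compiled and never run, the missing subprocess import in the sample does not matter, and nothing from the scanned file is executed.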

2. Map bytecode to tokens

TARGETED_FILE resume load_global subprocess load_attr run load_fast value load_const INTEGER load_const INTEGER kw_names capture_output shell call store_fast output load_fast output load_attr stdout load_fast output load_attr stderr build_list return_value
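One way to picture the mapping is the sketch below. It is an illustration, not malwi's actual tokenizer: the function names and the rules (lower-case the opcode name, keep identifier arguments, replace constants with a type placeholder such as INTEGER or STRING) are assumptions inferred from the sample token string above, and the output depends on the Python version.

```python
import dis

# Opcodes whose argument is an identifier (global, attribute, local, or free variable).
NAME_OPS = set(dis.hasname) | set(dis.haslocal) | set(dis.hasfree)
# Opcodes whose argument is a constant from co_consts.
CONST_OPS = set(dis.hasconst)

SOURCE = '''
def runcommand(value):
    output = subprocess.run(value, shell=True, capture_output=True)
    return [output.stdout, output.stderr]
'''


def placeholder(value):
    """Replace a constant with a coarse type token (hypothetical rule)."""
    if isinstance(value, int):  # bool is a subclass of int, so True/False land here too
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    if isinstance(value, str):
        return "STRING"
    return type(value).__name__.upper()


def bytecode_to_tokens(code):
    """Flatten one code object into a list of string tokens."""
    tokens = []
    for ins in dis.get_instructions(code):
        tokens.append(ins.opname.lower())
        if ins.opcode in NAME_OPS:
            # Some newer opcodes carry several names at once, exposed as a tuple.
            names = ins.argval if isinstance(ins.argval, tuple) else (ins.argval,)
            tokens.extend(str(name) for name in names)
        elif ins.opcode in CONST_OPS:
            tokens.append(placeholder(ins.argval))
    return tokens


module_code = compile(SOURCE, "example.py", "exec")
func_code = next(c for c in module_code.co_consts if hasattr(c, "co_code"))
tokens = bytecode_to_tokens(func_code)
print(" ".join(tokens))
```

Replacing literal values with placeholders keeps the vocabulary small, so the model sees the shape of a call rather than the specific strings or numbers passed to it.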

3. Feed tokens into the pre-trained DistilBERT model

=> Maliciousness: 0.92

This yields a list of malicious code objects. However, malicious code might be split into chunks and spread across a package, which is why a further decision step is needed.

4. Make the final decision

The DistilBERT model makes the final maliciousness decision based on the token patterns.

=> Maliciousness: 0.92

Benchmarks?

DistilBERT

Metric         Value
F1 Score       0.944
Recall         0.906
Precision      0.984
Training time  ~1 hour
Hardware       NVIDIA RTX 4090
Epochs         3
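
As a quick consistency check, F1 is the harmonic mean of precision and recall. The snippet below recomputes it from the rounded values in the table. The function name f1_score is ours, and a small difference from the reported 0.944 is expected because the inputs are already rounded to three decimals.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(precision=0.984, recall=0.906)
print(f"{f1:.3f}")  # 0.943, within rounding of the reported 0.944
```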

Limitations

The malicious dataset includes some boilerplate functions, such as init functions, which can also appear in benign code. These cause false positives during scans. The goal is to triage and reduce such false positives to improve malwi's accuracy.

What's next?

The first iteration focuses on detecting malicious Python source code.

Future iterations will cover malware scanning for more languages (JavaScript, Rust, Go) and more formats (binaries, logs).

Contributing & Support

🐛 Report Issues

Found a bug or have a feature request? Open an issue.

📊 Share Malware Samples

Have access to malicious packages in Rust, Go, or other languages? Your contributions can help expand malwi's detection capabilities.

💬 Community

  • Discussions: Share ideas and ask questions in GitHub Discussions
  • Security: Report security vulnerabilities privately via the GitHub Security tab

Development

🛠️ Prerequisites

  1. Package Manager: Install uv for fast Python dependency management
  2. Training Data: Clone malwi-samples into the parent directory:
    cd ..
    git clone https://github.com/schirrmacher/malwi-samples.git
    cd malwi
    

🚀 Quick Start

# Install dependencies
uv sync

# Run tests
uv run pytest tests

# Train a model from scratch (full pipeline)
./cmds/preprocess_and_train_distilbert.sh

📚 Training Pipeline

The training pipeline consists of three stages that can be run together or independently:

Complete Pipeline (Recommended)

# Data preprocessing → Tokenizer training → Model training
./cmds/preprocess_and_train_distilbert.sh

Individual Stages

# 1. Data Preprocessing (parallel by default, ~5-7 min on 8 cores)
./cmds/preprocess_data.sh

# 2. Tokenizer Training (~2 min)
./cmds/train_tokenizer.sh

# 3. Model Training (~5 hours on NVIDIA RTX 4090)
./cmds/train_distilbert.sh

⚙️ Configuration

# Customize parallel processing (preprocessing)
NUM_PROCESSES=16 ./cmds/preprocess_data.sh

# Train smaller/faster model
HIDDEN_SIZE=256 ./cmds/train_distilbert.sh

# Train larger/more accurate model
HIDDEN_SIZE=512 EPOCHS=5 ./cmds/train_distilbert.sh

🧪 Testing & Quality

# Run tests
uv run pytest tests

# Code formatting
uv run ruff format .

# Linting
uv run ruff check .

# Regenerate test data (after compiler changes)
uv run python util/regenerate_test_data.py
