malwi - AI Python Malware Scanner

Maintainers

canvascomputing schirrmacher

Operating System
- OS Independent

malwi detects Python malware using AI.

It specializes in detecting zero-day malware and classifies code as malicious or benign without requiring internet access.

Key Features

🔍 Detects unknown malware patterns through AI analysis
🔒 Runs completely offline - no data leaves your machine
⚡ Fast scanning of entire codebases
🚫 No external dependencies or cloud services required
📖 Open-source project built on research and open data 🇪🇺

1) Install

pip install --user malwi

2) Run

malwi scan examples/malicious

3) Evaluate: a recent zero-day detected with high confidence

                  __          __
  .--------.---.-|  .--.--.--|__|
  |        |  _  |  |  |  |  |  |
  |__|__|__|___._|__|________|__|
     AI Python Malware Scanner


- target: examples/malicious
- seconds: 0.42
- files: 13
  ├── scanned: 3
  ├── skipped: 10
  └── suspicious:
      ├── examples/malicious/discordpydebug-0.0.4/setup.py
      │   └── <module>
      │       ├── archive compression
      │       └── package installation execution
      └── examples/malicious/discordpydebug-0.0.4/src/discordpydebug/__init__.py
          ├── <module>
          │   ├── process management
          │   ├── system interaction
          │   ├── deserialization
          │   └── user io
          ├── run
          │   └── fs linking
          ├── debug
          │   ├── fs linking
          │   └── archive compression
          └── runcommand
              └── process management

=> 👹 malicious 0.98
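
For use in CI, the final verdict line of a report like the one above can be parsed into a label and a score. The following is a minimal sketch that assumes the output format stays exactly as shown here; `parse_verdict` and `VERDICT_RE` are hypothetical helper names, not part of malwi.

```python
import re
from typing import Optional, Tuple

# Matches a verdict line such as "=> 👹 malicious 0.98" or "=> 🟢 good".
# Group 1 is the label, group 2 the optional confidence score.
VERDICT_RE = re.compile(
    r"^=>\s*\S+\s+(\w+)(?:\s+([0-9]*\.?[0-9]+))?\s*$", re.MULTILINE
)


def parse_verdict(report: str) -> Optional[Tuple[str, Optional[float]]]:
    """Return (label, score) from a report, or None if no verdict line exists."""
    match = VERDICT_RE.search(report)
    if match is None:
        return None
    label, score = match.group(1), match.group(2)
    return label, (float(score) if score is not None else None)


sample_report = "- target: examples/malicious\n- files: 13\n\n=> 👹 malicious 0.98\n"
print(parse_verdict(sample_report))
```

The report text could come from `subprocess.run(["malwi", "scan", path], capture_output=True, text=True).stdout`; whether malwi also offers a machine-readable output mode is not covered here.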

PyPI Package Scanning

malwi can scan PyPI packages directly, without executing their code. Malicious logic is typically placed in setup.py or __init__.py files:

malwi pypi requests

                  __          __
  .--------.---.-|  .--.--.--|__|
  |        |  _  |  |  |  |  |  |
  |__|__|__|___._|__|________|__|
     AI Python Malware Scanner


- target: downloads/requests-2.32.4.tar
- seconds: 3.10
- files: 84
  ├── scanned: 34
  └── skipped: 50

=> 🟢 good
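
The idea of inspecting a package without executing it can be illustrated with the standard library: sources are read from the archive as plain text, so setup.py never runs. This is a sketch of the general technique, not malwi's implementation; `list_python_sources`, `read_member`, and `build_demo_archive` are hypothetical helpers, and the demo archive is built in memory.

```python
import io
import tarfile
from typing import List


def list_python_sources(archive_bytes: bytes) -> List[str]:
    """List .py members of a tar archive without extracting or importing them."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        return sorted(
            m.name for m in tar.getmembers() if m.isfile() and m.name.endswith(".py")
        )


def read_member(archive_bytes: bytes, name: str) -> str:
    """Return one member's content as text; the code is read, never run."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        handle = tar.extractfile(name)
        if handle is None:
            raise ValueError(f"{name} is not a regular file")
        return handle.read().decode("utf-8", errors="replace")


def build_demo_archive() -> bytes:
    """Build a tiny in-memory sdist-like archive for demonstration."""
    files = {"demo-0.1/setup.py": "print('hi')\n", "demo-0.1/README.md": "# demo\n"}
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


archive = build_demo_archive()
print(list_python_sources(archive))
```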

Why malwi?

The number of malicious open-source packages is growing. This represents a threat to the open-source community.

Typical malware behaviors include:

Exfiltration of data: Stealing credentials, API keys, or sensitive user data.
Backdoors: Allowing remote attackers to gain unauthorized access to your system.
Destructive actions: Deleting files, corrupting databases, or sabotaging applications.

How does it work?

malwi uses a DistilBERT model, following the design of Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application (2025). The model is trained on the malwi-samples dataset.

1. Compile Python files to bytecode

def runcommand(value):
    output = subprocess.run(value, shell=True, capture_output=True)
    return [output.stdout, output.stderr]

  0           RESUME                   0

  1           LOAD_CONST               0 (<code object runcommand at 0x5b4f60ae7540, file "example.py", line 1>)
              MAKE_FUNCTION
              STORE_NAME               0 (runcommand)
              RETURN_CONST             1 (None)
  ...
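
Step 1 can be reproduced with Python's built-in `compile` and the `dis` module. The sketch below only compiles the example, it never executes it; `iter_code_objects` is a hypothetical helper, and the exact opcodes printed depend on the Python version.

```python
import dis
import types

SOURCE = '''
def runcommand(value):
    output = subprocess.run(value, shell=True, capture_output=True)
    return [output.stdout, output.stderr]
'''


def iter_code_objects(code: types.CodeType):
    """Yield a code object and, recursively, all code objects in its constants."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from iter_code_objects(const)


# compile() produces bytecode only; nothing from SOURCE is run.
module_code = compile(SOURCE, "example.py", "exec")
code_objects = {c.co_name: c for c in iter_code_objects(module_code)}

for instruction in dis.get_instructions(code_objects["runcommand"]):
    print(instruction.opname, instruction.argrepr)
```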

2. Map bytecode to tokens

TARGETED_FILE resume load_global subprocess load_attr run load_fast value load_const INTEGER load_const INTEGER kw_names capture_output shell call store_fast output load_fast output load_attr stdout load_fast output load_attr stderr build_list return_value
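
A token stream of this shape can be approximated by walking the instructions of a code object. The mapping below is an assumption made for illustration (lowercased opcode names, identifier arguments kept, constants replaced by type placeholders); malwi's real mapping may differ, and the opcodes vary between Python versions.

```python
import dis
import types
from typing import List

SOURCE = '''
def runcommand(value):
    output = subprocess.run(value, shell=True, capture_output=True)
    return [output.stdout, output.stderr]
'''


def tokenize(code: types.CodeType) -> List[str]:
    """Map each instruction to a lowercased opcode token plus an argument token."""
    tokens: List[str] = []
    for ins in dis.get_instructions(code):
        tokens.append(ins.opname.lower())
        arg = ins.argval
        if ins.opname == "LOAD_CONST":
            if isinstance(arg, int):  # bool is a subclass of int, so True lands here
                tokens.append("INTEGER")
            elif isinstance(arg, str):
                tokens.append("STRING")
        elif isinstance(arg, str):  # names of globals, attributes, locals
            tokens.append(arg)
        elif isinstance(arg, tuple):  # e.g. keyword-argument names
            tokens.extend(str(a) for a in arg)
    return tokens


module_code = compile(SOURCE, "example.py", "exec")
function_code = next(c for c in module_code.co_consts if isinstance(c, types.CodeType))
tokens = tokenize(function_code)
print(" ".join(tokens))
```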

3. Feed tokens into pre-trained DistilBERT

=> Maliciousness: 0.92

This produces a list of code objects flagged as malicious. However, malicious code might be split into chunks and spread across a package, which is why the next layers are needed.
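
A score such as the one in step 3 is commonly obtained by applying a softmax to the classifier's output logits. The sketch below assumes a two-class head ordered as [benign, malicious] and uses made-up logit values; neither is confirmed by the project.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-1.2, 1.25]  # hypothetical values: [benign, malicious]
benign, malicious = softmax(logits)
print(f"Maliciousness: {malicious:.2f}")
```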

4. Take final decision

The DistilBERT model makes the final maliciousness decision based on the token patterns.

=> Maliciousness: 0.92

Benchmarks?

DistilBERT

Metric	Value
F1 Score	0.944
Recall	0.906
Precision	0.984
Training time	~1 hour
Hardware	NVIDIA RTX 4090
Epochs	3
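
As a quick consistency check, F1 is the harmonic mean of precision and recall. Computed from the rounded values in the table it comes out at about 0.943, which agrees with the reported 0.944 once the rounding of the inputs is taken into account.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(precision=0.984, recall=0.906)
print(round(f1, 3))
```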

Limitations

The malicious dataset includes some boilerplate functions, such as init functions, which can also appear in benign code. These cause false positives during scans. The goal is to triage and reduce such false positives to improve malwi's accuracy.

What's next?

The first iteration focuses on maliciousness of Python source code.

Future iterations will cover malware scanning for more languages (JavaScript, Rust, Go) and more formats (binaries, logs).

Contributing & Support

Report Issues

Found a bug or have a feature request? Open an issue

Share Malware Samples

Have access to malicious packages in Rust, Go, or other languages? Your contributions can help expand malwi's detection capabilities:

Email: Contact via GitHub profile
Submit samples: Follow responsible disclosure practices

Development

Prerequisites

Package Manager: Install uv for fast Python dependency management

Training Data: Clone malwi-samples in the parent directory:

cd ..
git clone https://github.com/schirrmacher/malwi-samples.git
cd malwi

Quick Start

# Install dependencies
uv sync

# Run tests
uv run pytest tests

# Train a model from scratch (full pipeline)
./cmds/preprocess_and_train_distilbert.sh

Training Pipeline

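The training pipeline consists of three stages (data preprocessing, tokenizer training, and model training), optionally preceded by a data download step. The stages can be run together or independently: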

Complete Pipeline (With Data Download)

# Downloads benign samples from popular repos → Data preprocessing → Training
./cmds/download_and_preprocess_distilbert.sh  # Downloads training data first
./cmds/train_tokenizer.sh                      # Train tokenizer
./cmds/train_distilbert.sh                     # Train model

Complete Pipeline (Without Download)

# Data preprocessing → Tokenizer training → Model training
./cmds/preprocess_and_train_distilbert.sh

Individual Stages

# 1. Download benign samples from popular GitHub repositories
uv run python -m src.research.download_data

# 2. Data Preprocessing (parallel by default, ~5-7 min on 8 cores)
./cmds/preprocess_data.sh

# 3. Tokenizer Training (~2 min)
./cmds/train_tokenizer.sh

# 4. Model Training (~5 hours on NVIDIA RTX 4090)
./cmds/train_distilbert.sh

Training Data Sources

The preprocessing script (preprocess_data.sh) combines multiple data sources for robust model training:

Benign Samples

.repo_cache/benign_repos/ - Clean Python repositories (populated by download_data from popular GitHub repos)
../malwi-samples/python/benign/ - Benign samples collected from false positives

Malicious Samples

../malwi-samples/python/malicious/ - Confirmed malware samples
../malwi-samples/python/suspicious/ - Suspicious code patterns, not necessarily malicious (used for future multi-category classification)
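
The directory layout above lends itself to a simple labelled file walk. The sketch below is illustrative only and is not the logic of preprocess_data.sh; `iter_labelled_files` and the label strings are hypothetical, and the suspicious directory is left out because it is reserved for future use.

```python
from pathlib import Path
from typing import Dict, Iterator, Tuple

# Directory-to-label mapping taken from the lists above (labels are illustrative).
SOURCES: Dict[str, str] = {
    ".repo_cache/benign_repos": "benign",
    "../malwi-samples/python/benign": "benign",
    "../malwi-samples/python/malicious": "malicious",
}


def iter_labelled_files(
    base: Path, sources: Dict[str, str] = SOURCES
) -> Iterator[Tuple[Path, str]]:
    """Yield (path, label) for every .py file under each existing source directory."""
    for relative_dir, label in sources.items():
        root = base / relative_dir
        if not root.is_dir():
            continue  # a source that has not been downloaded yet is skipped
        for path in sorted(root.rglob("*.py")):
            yield path, label


# Run from the malwi checkout; prints nothing if the directories are absent.
for path, label in iter_labelled_files(Path(".")):
    print(label, path)
```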

Testing & Quality

# Run tests
uv run pytest tests

# Code formatting
uv run ruff format .

# Linting
uv run ruff check .

# Regenerate test data (after compiler changes)
uv run python util/regenerate_test_data.py

Project details

Release history

Version	Release date
0.0.35	Apr 2, 2026
0.0.34	Apr 2, 2026
0.0.33	Apr 1, 2026
0.0.32	Apr 1, 2026
0.0.31	Apr 1, 2026
0.0.30	Mar 16, 2026
0.0.29	Mar 15, 2026
0.0.28	Mar 14, 2026
0.0.27	Mar 12, 2026
0.0.26	Mar 10, 2026
0.0.25	Mar 9, 2026
0.0.24	Mar 4, 2026
0.0.23	Aug 19, 2025
0.0.22 (this version)	Aug 15, 2025
0.0.21	Aug 14, 2025
0.0.20	Aug 14, 2025
0.0.19	Aug 12, 2025
0.0.18	Jul 2, 2025
0.0.17	Jul 2, 2025
0.0.15	Jun 20, 2025
0.0.14	Jun 16, 2025
0.0.13	May 30, 2025
0.0.12	May 28, 2025
0.0.11	May 26, 2025
0.0.10	May 26, 2025
0.0.9	May 26, 2025
0.0.8	May 26, 2025
0.0.7	May 15, 2025
0.0.6	May 12, 2025
0.0.5	May 12, 2025
0.0.4	May 11, 2025
0.0.3	May 11, 2025
0.0.2	May 11, 2025
0.0.1	May 11, 2025

Download files

Source Distribution

malwi-0.0.22.tar.gz (92.2 kB)

Uploaded Aug 15, 2025 Source

Built Distribution

malwi-0.0.22-py3-none-any.whl (95.4 kB)

Uploaded Aug 15, 2025 Python 3

File details

Details for the file malwi-0.0.22.tar.gz.

File metadata

Download URL: malwi-0.0.22.tar.gz
Upload date: Aug 15, 2025
Size: 92.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.8.11

File hashes

Hashes for malwi-0.0.22.tar.gz
Algorithm	Hash digest
SHA256	`b971ca314d019ebaca1af21064c20ba633a0a5730c832511b61fe362ae74260f`
MD5	`49db7cc756ca40657ced87b2d6a72cc4`
BLAKE2b-256	`5eed1f8c8604aff251a03c429345f89c014b488add13f691d7e4ea5947bf8ff1`

File details

Details for the file malwi-0.0.22-py3-none-any.whl.

File metadata

Download URL: malwi-0.0.22-py3-none-any.whl
Upload date: Aug 15, 2025
Size: 95.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.8.11

File hashes

Hashes for malwi-0.0.22-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5b1449da5d5d9f2432a596c6cc70227ed02d2dcf694493842f42a70d80fea739`
MD5	`2b22b62ef756b1fb61866a4ed4e4ef24`
BLAKE2b-256	`7f08c8ef726be5631b854aed372b1e982d97463a29fa4ecea9f018e1b9f0c139`
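
The published digests can be used to verify a downloaded file before installing it. A minimal sketch using `hashlib` follows; `sha256_of` and `matches_published_hash` are hypothetical helper names, and the file is assumed to have been downloaded already.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# SHA256 of malwi-0.0.22-py3-none-any.whl, copied from the table above.
WHEEL_SHA256 = "5b1449da5d5d9f2432a596c6cc70227ed02d2dcf694493842f42a70d80fea739"
wheel = Path("malwi-0.0.22-py3-none-any.whl")
if wheel.exists():
    print("hash ok" if matches_published_hash(wheel, WHEEL_SHA256) else "hash mismatch")
```

pip can enforce the same check at install time through its hash-checking mode (`--require-hashes` with a requirements file).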
