malwi - AI Python Malware Scanner

Detect Python malware fast - no internet, no expensive hardware, no fees.

malwi specializes in detecting zero-day vulnerabilities by classifying code as safe or harmful.

Open-source software made in Europe. Based on open research, open code, open data. 🇪🇺🤘🕊️

  1. Install
pip install --user malwi
  2. Run
malwi examples/malicious
  3. Evaluate: a recent zero-day detected with high confidence
  .--------.---.-|  .--.--.--|__|
  |        |  _  |  |  |  |  |  |
  |__|__|__|___._|__|________|__|
     AI Python Malware Scanner


- target: examples/malicious
- files: 13
  ├── scanned: 3
  ├── skipped: 10
  └── suspicious:
      └── examples/malicious/discordpydebug-0.0.4/src/discordpydebug/__init__.py
          └── <module>
              ├── deserialization
              ├── user io
              ├── system interaction
              └── process management

=> 👹 malicious 1.00

Why malwi?

The number of malicious open-source packages is growing. This is not just a threat to your business but also to the open-source community.

Typical malware behaviors include:

  • Exfiltration of data: Stealing credentials, API keys, or sensitive user data.
  • Backdoors: Allowing remote attackers to gain unauthorized access to your system.
  • Destructive actions: Deleting files, corrupting databases, or sabotaging applications.

How does it work?

malwi applies DistilBERT, following the design of "Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application" (2025). The model is trained on the malwi-samples dataset.

1. Compile Python files to bytecode

def runcommand(value):
    output = subprocess.run(value, shell=True, capture_output=True)
    return [output.stdout, output.stderr]

  0           RESUME                   0

  1           LOAD_CONST               0 (<code object runcommand at 0x5b4f60ae7540, file "example.py", line 1>)
              MAKE_FUNCTION
              STORE_NAME               0 (runcommand)
              RETURN_CONST             1 (None)
  ...
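
The compilation step can be reproduced with the standard library alone. The sketch below is a minimal illustration that assumes only Python's built-in `compile` and the `dis` module; the helper name `disassemble` is hypothetical and is not part of malwi. Compiling does not execute the source, so `subprocess` does not need to be imported, and the exact opcodes printed vary between Python versions.

```python
import dis
import io

SOURCE = '''def runcommand(value):
    output = subprocess.run(value, shell=True, capture_output=True)
    return [output.stdout, output.stderr]
'''


def disassemble(source: str, filename: str = "example.py") -> str:
    """Compile source text to a code object and return its disassembly."""
    # compile() only builds bytecode; the source is never executed.
    code = compile(source, filename, "exec")
    buffer = io.StringIO()
    # dis.dis() also descends into nested code objects such as functions.
    dis.dis(code, file=buffer)
    return buffer.getvalue()


listing = disassemble(SOURCE)
print(listing)
```

Working on bytecode rather than raw text means formatting, comments, and whitespace in the source do not affect what is analyzed.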

2. Map bytecode to tokens

TARGETED_FILE resume load_global subprocess load_attr run load_fast value load_const INTEGER load_const INTEGER kw_names capture_output shell call store_fast output load_fast output load_attr stdout load_fast output load_attr stderr build_list return_value
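
One way such a mapping could look is sketched below. This is an illustration, not malwi's actual tokenizer: the function names `arg_tokens` and `tokenize` and the placeholder tokens `FLOAT`, `STRING`, `CODE_OBJECT`, and `OBJECT` are assumptions, and the resulting token sequence depends on the Python version. Each opcode name is lowercased, identifiers are kept, and literal values are replaced by a type placeholder.

```python
import dis
from types import CodeType

# Opcodes whose argument refers to a constant, a name, or a local variable.
ARG_OPCODES = set(dis.hasconst) | set(dis.hasname) | set(dis.haslocal)

SOURCE = '''def runcommand(value):
    output = subprocess.run(value, shell=True, capture_output=True)
    return [output.stdout, output.stderr]
'''


def arg_tokens(value) -> list[str]:
    """Map an instruction argument to zero or more abstract tokens (hypothetical scheme)."""
    if value is None:
        return []
    if isinstance(value, int):  # bool is a subclass of int, so True/False land here too
        return ["INTEGER"]
    if isinstance(value, float):
        return ["FLOAT"]
    if isinstance(value, str):
        # Keep identifiers such as "subprocess"; hide arbitrary string literals.
        return [value] if value.isidentifier() else ["STRING"]
    if isinstance(value, tuple):
        return [token for item in value for token in arg_tokens(item)]
    if isinstance(value, CodeType):
        return ["CODE_OBJECT"]
    return ["OBJECT"]


def tokenize(code: CodeType) -> list[str]:
    """Turn one code object into a flat list of lowercase opcode and argument tokens."""
    tokens = []
    for instruction in dis.get_instructions(code):
        tokens.append(instruction.opname.lower())
        if instruction.opcode in ARG_OPCODES:
            tokens.extend(arg_tokens(instruction.argval))
    return tokens


module = compile(SOURCE, "example.py", "exec")
# The function body is stored as a nested code object in the module's constants.
function = next(c for c in module.co_consts if isinstance(c, CodeType))
tokens = tokenize(function)
print(" ".join(tokens))
```

Replacing literals with placeholders keeps the vocabulary small while preserving names like `subprocess` and `run`, which carry most of the behavioral signal.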

3. Feed tokens into pre-trained DistilBERT

=> Maliciousness: 0.92

This produces a list of malicious code objects. However, malicious code might be split into chunks and spread across a package, which is why the next layers are needed.
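
A score like the one above is commonly obtained from a classifier by applying softmax to its raw outputs (logits). The sketch below assumes a two-class head whose outputs are ordered (benign, malicious); both the ordering and the example logits are hypothetical and are not taken from malwi.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw classifier outputs into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs for one code object, ordered (benign, malicious).
logits = [-1.2, 1.25]
benign, malicious = softmax(logits)
print(f"=> Maliciousness: {malicious:.2f}")
```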

4. Make the final decision

The DistilBERT model makes the final maliciousness decision based on the token patterns.

=> Maliciousness: 0.92

Benchmarks?

DistilBERT

Metric         Value
F1 Score       0.941
Recall         0.900
Precision      0.987
Training time  ~5 hours
Hardware       NVIDIA RTX 4090
Epochs         3
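
The reported F1 score can be checked against the precision and recall in the table, since F1 is the harmonic mean of the two. The short sketch below uses only the values listed above; the function name `f1_score` is chosen for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(precision=0.987, recall=0.900)
print(f"F1 = {f1:.3f}")
```

The result rounds to 0.941, which agrees with the table. The high precision relative to recall indicates that flagged code is rarely benign, while roughly one in ten malicious samples is missed.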

Limitations

The malicious dataset includes some boilerplate functions, such as init functions, which can also appear in benign code. These cause false positives during scans. The goal is to triage and reduce such false positives to improve malwi's accuracy.

What's next?

The first iteration focuses on detecting malicious Python source code.

Future iterations will cover malware scanning for more languages (JavaScript, Rust, Go) and more formats (binaries, logs).

Support

Do you have access to malicious packages written in Rust, Go, or other languages? Contact me.

Develop

Prerequisites:

# Download and process data
cmds/download_and_preprocess_distilbert.sh

# Complete pipelines
cmds/preprocess_and_train_distilbert.sh  # Data → Tokenizer → DistilBERT

# Individual data preprocessing
cmds/preprocess_data.sh                  # Process data for ML training

# Individual model training
cmds/train_tokenizer.sh                  # Train custom tokenizer
cmds/train_distilbert.sh                 # Train DistilBERT model

Triage

malwi uses a pipeline that can be enhanced by triaging its results (see src/research/triage.py). For automated triaging, you can leverage open-source models in combination with Ollama.

Start LLM

ollama run gemma3

Start Triaging

uv run python -m src.research.triage --triage-ollama --path <FOLDER_WITH_MALWI_YAML_RESULTS>

Download files

Source Distribution

malwi-0.0.19.tar.gz (100.8 kB)

Uploaded Source

Built Distribution

malwi-0.0.19-py3-none-any.whl (80.5 kB)

Uploaded Python 3

File details

Details for the file malwi-0.0.19.tar.gz.

File metadata

  • Download URL: malwi-0.0.19.tar.gz
  • Upload date:
  • Size: 100.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.8.9

File hashes

Hashes for malwi-0.0.19.tar.gz

Algorithm    Hash digest
SHA256       9e9f386166c52618666cd96f9033eaa9e298109d254065a9bd41163ce908e52a
MD5          61b8f0aafa8c71a185b939fc6d719617
BLAKE2b-256  1658fac8df7af03d832f392e0025dab3a5cde0e48295d1c222027fc7c8772c06
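
A downloaded archive can be compared against the SHA256 digest listed above with the standard library. This is a minimal sketch: the helper names `sha256_of` and `matches` are illustrative, and the local path is an assumption about where the file was saved.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex string, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Assumed local path; adjust to wherever the archive was downloaded.
ARCHIVE = Path("malwi-0.0.19.tar.gz")
EXPECTED = "9e9f386166c52618666cd96f9033eaa9e298109d254065a9bd41163ce908e52a"

if ARCHIVE.exists():
    print("hash ok" if matches(ARCHIVE, EXPECTED) else "hash MISMATCH")
```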

File details

Details for the file malwi-0.0.19-py3-none-any.whl.

File metadata

  • Download URL: malwi-0.0.19-py3-none-any.whl
  • Upload date:
  • Size: 80.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.8.9

File hashes

Hashes for malwi-0.0.19-py3-none-any.whl

Algorithm    Hash digest
SHA256       53c3f8433b68a121372fde18f604159e2823a32dd12cf0754ef643dca0488685
MD5          63d9d0b076a581f587c63a8551f648bd
BLAKE2b-256  0e8b1191ed288dae186f3f2580ccd533f30d115794ca17adbbf6b6b6a48dd337
