Antivirus for the AI Supply Chain. Scans models, datasets, notebooks, and RAG documents for threats.

๐Ÿ›ก๏ธ Veritensor: AI Data & Artifact Security

Veritensor is an antivirus for the AI Life Cycle. It secures the AI Supply Chain end to end by scanning the artifacts that traditional tools miss: Models, Datasets, RAG Documents, and Notebooks.

Unlike standard SAST tools (which focus on code), Veritensor understands the binary and serialized formats used in Machine Learning:

  1. Models: Deep AST analysis of Pickle, PyTorch, Keras, Safetensors to block RCE and backdoors.
  2. Data & RAG: Streaming scan of Parquet, CSV, Excel, PDF to detect Data Poisoning, Prompt Injections, and PII.
  3. Notebooks: Hardening of Jupyter (.ipynb) files by detecting leaked secrets (using Entropy analysis), malicious magics, and XSS.
  4. Supply Chain: Audits dependencies (requirements.txt, poetry.lock) for Typosquatting and known CVEs (via OSV.dev).
  5. Governance: Generates cryptographic Data Manifests (Provenance) and signs containers via Sigstore.
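To make the idea of static Pickle analysis concrete, here is a minimal sketch using only the Python standard library. It is not Veritensor's implementation: the denylist `SUSPICIOUS_MODULES` and the function `find_suspicious_imports` are hypothetical names chosen for illustration. The point it shows is that a pickle stream can be disassembled opcode by opcode and checked for dangerous imports without ever being loaded.

```python
import pickle
import pickletools

# Hypothetical denylist for illustration; a real scanner would use a larger rule set.
SUSPICIOUS_MODULES = {"os", "posix", "nt", "subprocess", "builtins", "sys"}


def find_suspicious_imports(data: bytes) -> list:
    """Return 'module.name' strings for denylisted imports in a pickle stream.

    The stream is only disassembled with pickletools.genops; it is never unpickled.
    """
    findings = []
    recent_strings = []  # string operands seen so far, used to resolve STACK_GLOBAL
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name == "GLOBAL":
            # Older protocols carry the import as a single "module name" operand.
            module, _, name = arg.partition(" ")
            if module.split(".")[0] in SUSPICIOUS_MODULES:
                findings.append(f"{module}.{name}")
        elif opcode.name == "STACK_GLOBAL" and len(recent_strings) >= 2:
            # Protocol 4+ pushes module and name as two separate strings first.
            # Simplification: assumes they are the two most recent string operands.
            module, name = recent_strings[-2], recent_strings[-1]
            if module.split(".")[0] in SUSPICIOUS_MODULES:
                findings.append(f"{module}.{name}")
        elif isinstance(arg, str):
            recent_strings.append(arg)
    return findings
```

For a payload whose `__reduce__` returns `os.system`, this reports the import under the platform module name (for example `posix.system` on Linux). The sketch ignores memo lookups and the legacy `INST` opcode, which a production scanner would also need to handle.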

🚀 Features

  • Parallel Scanning: Utilizes all CPU cores to scan thousands of files in seconds. Includes robust SQLite Caching to skip unchanged files.
  • Stealth Detection: Finds attacks hidden from humans but visible to LLMs. Detects CSS Hiding (white text, zero font), Base64 Obfuscation, and Unicode Spoofing.
  • Dataset Security: Streams massive datasets (100GB+) to find "Poisoning" patterns (e.g., "Ignore previous instructions") and malicious URLs in Parquet, CSV, JSONL, and Excel.
  • Archive Inspection: Safely scans inside .zip, .tar.gz, .whl files without extracting them to disk (Zip Bomb protected).
  • Dependency Audit: Checks pyproject.toml, poetry.lock, and Pipfile.lock for malicious packages (Typosquatting) and vulnerabilities.
  • Data Provenance: The veritensor manifest . command creates a signed JSON snapshot of your data artifacts for compliance (EU AI Act).
  • Identity Verification: Automatically verifies model hashes against the official Hugging Face registry to detect Man-in-the-Middle attacks.
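The streaming dataset scan described above can be sketched in a few lines of standard-library Python. This is an illustrative approximation, not Veritensor's code: the patterns, the function name `scan_csv`, and the "first N rows" sampling strategy are all assumptions. It shows how a file can be read row by row, so memory use stays flat regardless of file size, while each cell is checked for injection phrases and URLs.

```python
import csv
import io
import re
from itertools import islice
from typing import Iterator, Optional, TextIO, Tuple

# Illustrative patterns only; not Veritensor's actual signature set.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.IGNORECASE),
    re.compile(r"disregard\s+(the\s+)?system\s+prompt", re.IGNORECASE),
]
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def scan_csv(
    stream: TextIO, sample_limit: Optional[int] = 10_000
) -> Iterator[Tuple[int, str, str]]:
    """Yield (row_number, kind, matched_text) while reading the CSV row by row.

    sample_limit=None scans every row; otherwise only the first N rows are read
    (an assumed sampling strategy, chosen here for simplicity).
    """
    reader = csv.reader(stream)
    rows = reader if sample_limit is None else islice(reader, sample_limit)
    for row_number, row in enumerate(rows, start=1):
        for cell in row:
            for pattern in INJECTION_PATTERNS:
                match = pattern.search(cell)
                if match:
                    yield row_number, "prompt_injection", match.group(0)
            for match in URL_PATTERN.finditer(cell):
                yield row_number, "url", match.group(0)
```

Because `scan_csv` is a generator over a file-like object, it can be fed an open file handle (`open(path, newline="")`) and consumed lazily. Row numbers count the header line as row 1.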

📦 Installation

Veritensor is modular. Install only what you need to keep your environment lightweight (~50MB core).

| Option | Command | Use Case |
|--------|---------|----------|
| Core | pip install veritensor | Base scanner (Models, Notebooks, Dependencies) |
| Data | pip install "veritensor[data]" | Datasets (Parquet, Excel, CSV) |
| RAG | pip install "veritensor[rag]" | Documents (PDF, DOCX, PPTX) |
| PII | pip install "veritensor[pii]" | ML-based PII detection (Presidio) |
| AWS | pip install "veritensor[aws]" | Direct scanning from S3 buckets |
| All | pip install "veritensor[all]" | Full suite for enterprise security |

Via Docker (Recommended for CI/CD)

docker pull arseniibrazhnyk/veritensor:latest

⚡ Quick Start

1. Scan a local project (Parallel)

Recursively scan a directory for all supported threats using 4 CPU cores:

veritensor scan ./my-rag-project --recursive --jobs 4

2. Scan RAG Documents & Excel

Check for Prompt Injections and Formula Injections in business data:

veritensor scan ./finance_data.xlsx
veritensor scan ./docs/contract.pdf

3. Generate Data Manifest

Create a compliance snapshot of your dataset folder:

veritensor manifest ./data --output provenance.json
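As a rough picture of what a data manifest contains, the sketch below walks a folder and records a SHA-256 digest per file. The JSON layout (`root`, `files`, `path`, `size`, `sha256`) and the function names are assumptions made for illustration, not Veritensor's manifest schema, and the signing step is left out entirely.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Return a JSON-serializable snapshot of every file under root."""
    files = [
        {
            "path": p.relative_to(root).as_posix(),
            "size": p.stat().st_size,
            "sha256": sha256_file(p),
        }
        for p in sorted(root.rglob("*"))
        if p.is_file()
    ]
    return {"root": root.name, "files": files}
```

A snapshot built this way could be written out with `json.dumps(build_manifest(Path("./data")), indent=2)`. Sorting the paths keeps the output stable between runs, which makes two snapshots easy to diff.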

4. Verify Model Integrity

Ensure the file on your disk matches the official version from Hugging Face (detects tampering):

veritensor scan ./pytorch_model.bin --repo meta-llama/Llama-2-7b

5. Scan from Amazon S3

Scan remote assets without manual downloading:

veritensor scan s3://my-ml-bucket/models/llama-3.pkl

6. License Compliance Check

Veritensor automatically reads metadata from safetensors and GGUF files. If a model has a Non-Commercial license (e.g., cc-by-nc-4.0), the scan raises a HIGH severity alert.

To override this (Break-glass mode), use:

veritensor scan ./model.safetensors --force

7. Scan AI Datasets

Veritensor uses streaming to handle huge files. For speed, it samples 10k rows by default. To scan the full file instead, pass --full-scan:

veritensor scan ./data/train.parquet --full-scan

8. Scan Jupyter Notebooks

Check code cells, markdown, and saved outputs for threats:

veritensor scan ./research/experiment.ipynb

Example Output:

╭────────────────────────────────╮
│ 🛡️  Veritensor Security Scanner │
╰────────────────────────────────╯
                                    Scan Results
┏━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ File         ┃ Status ┃ Threats / Details                    ┃ SHA256 (Short) ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ model.pt     │  FAIL  │ CRITICAL: os.system (RCE Detected)   │ a1b2c3d4...    │
└──────────────┴────────┴──────────────────────────────────────┴────────────────┘
❌ BLOCKING DEPLOYMENT

📊 Reporting & Compliance

Veritensor supports industry-standard formats for integration with security dashboards and audit tools.

1. GitHub Security (SARIF)

Generate a report compatible with GitHub Code Scanning:

veritensor scan ./models --sarif > veritensor-report.sarif
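For readers unfamiliar with SARIF, the sketch below builds a minimal SARIF 2.1.0 document from a list of findings. The input shape (`rule_id`, `severity`, `message`, `file`), the severity-to-level mapping, and the function name `to_sarif` are assumptions for illustration; the exact report Veritensor emits may contain more fields.

```python
import json

# Assumed mapping from scanner severities to SARIF result levels.
SEVERITY_TO_LEVEL = {
    "CRITICAL": "error",
    "HIGH": "error",
    "MEDIUM": "warning",
    "LOW": "note",
}


def to_sarif(findings: list) -> dict:
    """Wrap findings in the minimal structure a SARIF 2.1.0 consumer expects."""
    results = [
        {
            "ruleId": finding["rule_id"],
            "level": SEVERITY_TO_LEVEL.get(finding["severity"], "warning"),
            "message": {"text": finding["message"]},
            "locations": [
                {"physicalLocation": {"artifactLocation": {"uri": finding["file"]}}}
            ],
        }
        for finding in findings
    ]
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{"tool": {"driver": {"name": "Veritensor"}}, "results": results}],
    }
```

Each entry in `results` points at one file through `artifactLocation.uri`, which is what lets a code scanning dashboard attach the alert to a path in the repository.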

2. Software Bill of Materials (SBOM)

Generate a CycloneDX v1.5 SBOM to inventory your AI assets:

veritensor scan ./models --sbom > sbom.json

3. Raw JSON

For custom parsers and SOAR automation:

veritensor scan ./models --json

๐Ÿ” Supply Chain Security (Container Signing)

Veritensor integrates with Sigstore Cosign to cryptographically sign your Docker images only if they pass the security scan.

1. Generate Keys

Generate a key pair for signing:

veritensor keygen
# Output: veritensor.key (Private) and veritensor.pub (Public)

2. Scan & Sign

Pass the --image flag and the path to your private key (via env var).

# Set path to your private key
export VERITENSOR_PRIVATE_KEY_PATH=veritensor.key

# If scan passes -> Sign the image
veritensor scan ./models/my_model.pkl --image my-org/my-app:v1.0.0

3. Verify (In Kubernetes / Production)

Before deploying, verify the signature to ensure the model was scanned:

cosign verify --key veritensor.pub my-org/my-app:v1.0.0

๐Ÿ› ๏ธ Integrations

GitHub Actions

Add this to your .github/workflows/security.yml to block malicious models in Pull Requests:

name: AI Security Scan
on: [pull_request]
jobs:
  veritensor-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Veritensor Scan
        uses: ArseniiBrazhnyk/Veritensor@v1.5.1
        with:
          path: '.'
          args: '--jobs 4'

Pre-commit Hook

Prevent committing malicious models to your repository. Add this to .pre-commit-config.yaml:

repos:
  - repo: https://github.com/arsbr/Veritensor
    rev: v1.5.1
    hooks:
      - id: veritensor-scan

📂 Supported Formats

| Format | Extension | Analysis Method |
|--------|-----------|-----------------|
| Models | .pt, .pth, .bin, .pkl, .joblib, .h5, .keras, .safetensors, .gguf, .whl | AST Analysis, Pickle VM Emulation, Metadata Validation |
| Datasets | .parquet, .csv, .tsv, .jsonl, .ndjson, .ldjson | Streaming Regex Scan (URLs, Injections, PII) |
| Notebooks | .ipynb | JSON Structure Analysis + Code AST + Markdown Phishing |
| Documents | .pdf, .docx, .pptx, .txt, .md, .html | DOM Extraction, Stealth/CSS Detection, PII |
| Archives | .zip, .tar, .gz, .tgz, .whl | Recursive In-Memory Inspection |
| Dependencies | requirements.txt, poetry.lock, Pipfile.lock | Typosquatting, OSV.dev CVE Lookup |
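The typosquatting check listed for requirements.txt and lock files can be approximated with an edit-distance comparison against well-known package names. The sketch below is one common way to do this, not necessarily how Veritensor does it: the `POPULAR_PACKAGES` list, the distance limit of 1, and the function names are hypothetical.

```python
import re
from typing import Optional

# Tiny illustrative list; a real check would use a much larger set of popular names.
POPULAR_PACKAGES = {
    "numpy", "pandas", "requests", "torch",
    "tensorflow", "scikit-learn", "transformers",
}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def normalize(name: str) -> str:
    """Normalize a package name the way PEP 503 does (case, '-', '_', '.')."""
    return re.sub(r"[-_.]+", "-", name).lower()


def possible_typosquat(name: str, max_distance: int = 1) -> Optional[str]:
    """Return the popular package this name imitates, or None if it looks fine."""
    candidate = normalize(name)
    if candidate in POPULAR_PACKAGES:
        return None
    for known in sorted(POPULAR_PACKAGES):
        if edit_distance(candidate, known) <= max_distance:
            return known
    return None
```

Plain Levenshtein distance counts a swapped pair of letters as two edits, so catching transpositions such as "reqeusts" would need either `max_distance=2` or a Damerau-Levenshtein variant.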

โš™๏ธ Configuration

You can customize security policies by creating a veritensor.yaml file in your project root. Pro Tip: Prefix an entry with regex: for flexible matching.

# veritensor.yaml

# 1. Security Threshold
# Fail the build if threats of this severity (or higher) are found.
# Options: CRITICAL, HIGH, MEDIUM, LOW.
fail_on_severity: CRITICAL

# 2. Dataset Scanning
# Sampling limit for quick scans (default: 10000)
dataset_sampling_limit: 10000

# 3. License Firewall Policy
# If true, blocks models that have no license metadata.
fail_on_missing_license: false

# List of license keywords to block (case-insensitive).
custom_restricted_licenses:
  - "cc-by-nc"       # Non-Commercial
  - "agpl"           # Viral licenses
  - "research-only"

# 4. Static Analysis Exceptions (Pickle)
# Allow specific Python modules that are usually blocked by the strict scanner.
allowed_modules:
  - "my_company.internal_layer"
  - "sklearn.tree"

# 5. Model Whitelist (License Bypass)
# List of Repo IDs that are trusted. Veritensor will SKIP license checks for these.
# Supports Regex!
allowed_models:
  - "meta-llama/Meta-Llama-3-70B-Instruct"  # Exact match
  - "regex:^google-bert/.*"                 # Allow all BERT models from Google
  - "internal/my-private-model"

To generate a default configuration file, run: veritensor init
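To illustrate how the allowed_models entries and the fail_on_severity threshold in veritensor.yaml might be evaluated, here is a small sketch. The matching semantics are an assumption (exact string comparison for plain entries, `re.search` for entries with the `regex:` prefix), and `matches_allowlist` and `should_fail` are hypothetical helper names rather than part of Veritensor's API.

```python
import re

# Severities ordered from least to most serious, as listed in the config comments.
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]
REGEX_PREFIX = "regex:"


def matches_allowlist(repo_id: str, allowed: list) -> bool:
    """True if repo_id equals a plain entry or matches a 'regex:' entry."""
    for entry in allowed:
        if entry.startswith(REGEX_PREFIX):
            if re.search(entry[len(REGEX_PREFIX):], repo_id):
                return True
        elif entry == repo_id:
            return True
    return False


def should_fail(found_severities: list, threshold: str) -> bool:
    """True if any finding is at or above the configured threshold."""
    limit = SEVERITY_ORDER.index(threshold.upper())
    return any(
        SEVERITY_ORDER.index(severity.upper()) >= limit
        for severity in found_severities
    )
```

With the example config above, `google-bert/bert-base-uncased` would pass the allowlist through the regex entry, while a HIGH finding alone would not fail a build whose threshold is CRITICAL. Anchoring patterns with `^` matters here, since an unanchored regex would also match the text anywhere inside a longer repo ID.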


🧠 Threat Intelligence (Signatures)

Veritensor uses a decoupled signature database (signatures.yaml) to detect malicious patterns. This ensures that detection logic is separated from the core engine.

  • Automatic Updates: To get the latest threat definitions, simply upgrade the package:
    pip install --upgrade veritensor
    
  • Transparent Rules: You can inspect the default signatures in src/veritensor/engines/static/signatures.yaml.
  • Custom Policies: If the default rules are too strict for your use case (false positives), use veritensor.yaml to whitelist specific modules or models.

📜 License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
