
Security-focused linter for machine learning training code


lintML

The security linter for environments that shouldn't need linting.

Linters (and, let's be honest with ourselves, any measures of code quality) have long been reserved for production environments. But we've increasingly seen that the most impactful machine learning attacks happen at training time. Traditional linters often rely on CI/CD pipelines or git commit hooks and tend to be opinionated about things like code formatting. However, many research projects never touch git until they are far down the path of productionization, and researchers write some of the sloppiest code known to humankind (in the name of science). So how can we arm researchers with quick sanity checks for their research code? lintML.

Philosophy

lintML is a simple Python script (backed by Dockerized security tools) that gives researchers and security teams quick insight into potential risk in machine learning research projects. It checks for valid, plaintext credentials and uses static analysis to identify risky code patterns.

Things we check for:

(today)

  1. Plaintext credentials.
  2. Unsafe deserialization.
  3. Serialization to unsafe formats.
  4. Using untrustworthy assets.

(WIP)

  1. Training without augmentation.
  2. Evidence of insecure services.
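
To make the deserialization items concrete, here is a minimal sketch of why loading a pickle is treated as risky. It is an illustration only, not one of lintML's actual rules; the class name Sneaky and the harmless eval payload are assumptions chosen for the demo.

```python
import json
import pickle


class Sneaky:
    """Hypothetical object that controls what runs when it is unpickled."""

    def __reduce__(self):
        # pickle will call eval("6 * 7") at load time. A real attacker
        # would substitute something far less friendly than arithmetic.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Sneaky())

# Loading the bytes executes the embedded call; no Sneaky object comes back.
result = pickle.loads(blob)
print(result)  # 42

# A data-only format such as JSON parses values and never calls code.
config = json.loads('{"lr": 0.01}')
print(config)
```

Anyone who can hand you a .pkl file (a shared checkpoint, a downloaded dataset) gets the same capability, which is why both loading from and saving to such formats appear in the list above.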

Things we don't check for:

  • Formatting

Many linters measure quality by the breadth of rules, leading to complicated CI/CD configurations where we end up ignoring their flashing lights. With a linter for research and machine learning training code, we want to be high signal/low noise. Every rule represents a real exploitable vulnerability that you should seriously consider engineering around to preserve the integrity of your research. lintML shouldn't distract you from getting stuff done. Ideally, most times when you run lintML, you'll have no alerts. 👍

Compatibility

Currently lintML is focused on .py and .ipynb files (based solely on the author's personal preferences). TruffleHog supports both of these natively, but lintML uses nbconvert under the hood to support Semgrep on .ipynb.
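
For a feel of what that conversion step involves: a notebook is a JSON document whose code lives in "code" cells. The sketch below pulls that code out with only the standard library. It is not how lintML or nbconvert does it, and the function name notebook_code is hypothetical; it just shows the kind of plain Python source a static analyzer ends up scanning.

```python
import json


def notebook_code(notebook_json: str) -> str:
    """Return the code cells of a notebook as one Python source string."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # The notebook format stores source as a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(source)
    return "\n\n".join(chunks)


# A tiny in-memory notebook, made up for this example.
example = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Notes\n"]},
        {"cell_type": "code", "source": ["import pickle\n", "m = pickle.load(f)"]},
    ],
    "nbformat": 4,
    "nbformat_minor": 5,
})

print(notebook_code(example))
```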

Foundations

The checks in lintML are powered by TruffleHog and Semgrep. Since lintML wraps these tools in their Docker containers, the first execution may take longer while those container images are pulled.

lintML uses Apache Avro for data serialization to support fast operations and evolving schemas.

Getting Started

  1. pip install -r requirements.txt
  2. python lintML.py <your directory> (if you omit the directory, lintML defaults to the current working directory)

When run from the CLI, lintML will return a summary report.

For a more detailed report, pass the --full-report argument (python lintML.py <your directory> --full-report). Results are also persisted to an .avro file for later analysis and manipulation in your favorite data analysis tools.

Requirements

Requirements are listed in requirements.txt, but the most notable requirement is the ability to build and run Docker containers.

Contributing

To immediately contribute security outcomes, consider contributing new rules to TruffleHog and/or Semgrep (and letting us know so we can import them).

Please also report any false positives or negatives to help us fine-tune rules or create new ones.

To add a new security tool to lintML, simply write an async function that returns Observations. PRs welcome.
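
As a rough idea of the shape such a function might take, here is a sketch under stated assumptions. The Observation dataclass below is a hypothetical stand-in (lintML's real type may carry different fields), and the function name run_pickle_grep and its regex check are invented for illustration, so check the project source before modeling a PR on it.

```python
import asyncio
import re
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Observation:
    """Hypothetical stand-in for lintML's Observation type."""
    tool: str
    path: str
    line: int
    message: str


# Toy check: flag direct calls to pickle.load / pickle.loads.
PATTERN = re.compile(r"\bpickle\.loads?\(")


def _scan_file(path: Path) -> list[Observation]:
    found = []
    text = path.read_text(encoding="utf-8", errors="ignore")
    for number, line in enumerate(text.splitlines(), start=1):
        if PATTERN.search(line):
            found.append(
                Observation("pickle-grep", str(path), number, "pickle load call")
            )
    return found


async def run_pickle_grep(directory: str) -> list[Observation]:
    """Scan every .py file under directory and return the findings."""
    files = sorted(Path(directory).rglob("*.py"))
    # to_thread keeps the blocking file reads off the event loop.
    batches = await asyncio.gather(
        *(asyncio.to_thread(_scan_file, f) for f in files)
    )
    return [obs for batch in batches for obs in batch]
```

Calling asyncio.run(run_pickle_grep(".")) returns a flat list of findings. Making the function async lets several tools run concurrently, which matters when each one is waiting on a container.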

