
`dslinter` is a pylint plugin for linting data science and machine learning code. We plan to support the following Python libraries: TensorFlow, PyTorch, Scikit-Learn, Pandas, NumPy and SciPy.


dslinter


Hi! We are currently researching code smells in machine learning projects in an industry context and are looking for feedback on dslinter. It would be a massive help if you could run dslinter on a machine learning project in an industry setting and send the generated txt file to dslinter2022@gmail.com. The steps and commands can be found here, and the process should take no more than 10 minutes. Feel free to send me an email if you want to go through the process together. The process is anonymous, and we will remove any sensitive information before the results are published. Many thanks!

dslinter is a pylint plugin for linting data science and machine learning code. It aims to help developers ensure the quality of their machine learning code and supports the following Python libraries: TensorFlow, PyTorch, Scikit-Learn, Pandas, NumPy and SciPy.

dslinter implements detection rules for the smells identified in our previous work. The smells were collected from papers, grey literature, GitHub commits, and Stack Overflow posts. They are also described in more detail on a website :)

https://user-images.githubusercontent.com/26082974/166685300-71551c45-6d45-4646-bc34-084239b168bb.mov

The example project in the demo video can be found here.

Installation

To install from the Python Package Index:

pip install dslinter

Usage

To only use the checkers implemented in this plugin, run:

[For Linux/Mac OS Users]:

pylint \
--load-plugins=dslinter \
--disable=all \
--enable=import,unnecessary-iteration-pandas,unnecessary-iteration-tensorflow,\
nan-numpy,chain-indexing-pandas,datatype-pandas,\
column-selection-pandas,merge-parameter-pandas,inplace-pandas,\
dataframe-conversion-pandas,scaler-missing-scikitlearn,hyperparameters-scikitlearn,\
hyperparameters-tensorflow,hyperparameters-pytorch,memory-release-tensorflow,\
deterministic-pytorch,randomness-control-numpy,randomness-control-scikitlearn,\
randomness-control-tensorflow,randomness-control-pytorch,randomness-control-dataloader-pytorch,\
missing-mask-tensorflow,missing-mask-pytorch,tensor-array-tensorflow,\
forward-pytorch,gradient-clear-pytorch,pipeline-not-used-scikitlearn,\
dependent-threshold-scikitlearn,dependent-threshold-tensorflow,dependent-threshold-pytorch \
--output-format=text:report.txt,colorized \
--reports=y \
<path_to_sources>

[For Windows Users]:

pylint --load-plugins=dslinter --disable=all --enable=import,unnecessary-iteration-pandas,unnecessary-iteration-tensorflow,nan-numpy,chain-indexing-pandas,datatype-pandas,column-selection-pandas,merge-parameter-pandas,inplace-pandas,dataframe-conversion-pandas,scaler-missing-scikitlearn,hyperparameters-scikitlearn,hyperparameters-tensorflow,hyperparameters-pytorch,memory-release-tensorflow,deterministic-pytorch,randomness-control-numpy,randomness-control-scikitlearn,randomness-control-tensorflow,randomness-control-pytorch,randomness-control-dataloader-pytorch,missing-mask-tensorflow,missing-mask-pytorch,tensor-array-tensorflow,forward-pytorch,gradient-clear-pytorch,pipeline-not-used-scikitlearn,dependent-threshold-scikitlearn,dependent-threshold-tensorflow,dependent-threshold-pytorch --output-format=text:report.txt,colorized --reports=y <path_to_sources>

Alternatively, place a .pylintrc configuration file containing the above settings in the folder from which you run the command, and run:

pylint <path_to_sources>
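
As a minimal sketch of such a configuration file, the script below writes a .pylintrc that mirrors the command-line options above. The section names (`[MASTER]`, `[MESSAGES CONTROL]`, `[REPORTS]`) follow the layout commonly used by pylint 2.x and are an assumption here rather than something taken from the dslinter documentation; newer pylint releases may prefer `[MAIN]`. The helper names `build_pylintrc` and `write_pylintrc` are hypothetical, and the checker list is copied from the command above.

```python
import configparser
from pathlib import Path

# Checker names copied from the --enable list in the commands above.
DSLINTER_CHECKERS = [
    "import", "unnecessary-iteration-pandas", "unnecessary-iteration-tensorflow",
    "nan-numpy", "chain-indexing-pandas", "datatype-pandas",
    "column-selection-pandas", "merge-parameter-pandas", "inplace-pandas",
    "dataframe-conversion-pandas", "scaler-missing-scikitlearn",
    "hyperparameters-scikitlearn", "hyperparameters-tensorflow",
    "hyperparameters-pytorch", "memory-release-tensorflow",
    "deterministic-pytorch", "randomness-control-numpy",
    "randomness-control-scikitlearn", "randomness-control-tensorflow",
    "randomness-control-pytorch", "randomness-control-dataloader-pytorch",
    "missing-mask-tensorflow", "missing-mask-pytorch", "tensor-array-tensorflow",
    "forward-pytorch", "gradient-clear-pytorch", "pipeline-not-used-scikitlearn",
    "dependent-threshold-scikitlearn", "dependent-threshold-tensorflow",
    "dependent-threshold-pytorch",
]


def build_pylintrc():
    """Build an in-memory configuration equivalent to the command-line flags."""
    config = configparser.ConfigParser()
    # Assumed section names (pylint 2.x style); adjust for your pylint version.
    config["MASTER"] = {"load-plugins": "dslinter"}
    config["MESSAGES CONTROL"] = {
        "disable": "all",
        "enable": ",".join(DSLINTER_CHECKERS),
    }
    config["REPORTS"] = {
        "output-format": "text:report.txt,colorized",
        "reports": "y",
    }
    return config


def write_pylintrc(directory="."):
    """Write the configuration to <directory>/.pylintrc and return its path."""
    path = Path(directory) / ".pylintrc"
    with path.open("w", encoding="utf-8") as handle:
        build_pylintrc().write(handle)
    return path
```

Calling `write_pylintrc()` in the folder where pylint is run would create the file; it is worth opening the result and checking it against your installed pylint version before relying on it.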

To extend an existing pylint configuration with the checkers from this plugin, run:

pylint --load-plugins=dslinter <other_options> <path_to_sources>
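
The same invocation can be scripted, for example from a CI job. The sketch below is only an illustration: `build_dslinter_command` and `run_dslinter` are hypothetical helper names, and it assumes that pylint and dslinter are installed in the interpreter that runs the script.

```python
import subprocess
import sys


def build_dslinter_command(source_path, other_options=()):
    """Return the argument list for running pylint with the dslinter plugin loaded."""
    return [
        sys.executable, "-m", "pylint",
        "--load-plugins=dslinter",
        *other_options,
        source_path,
    ]


def run_dslinter(source_path, other_options=()):
    """Run pylint and return its exit code."""
    # check=False because pylint exits with a non-zero status whenever it
    # reports messages, which is not an execution failure.
    completed = subprocess.run(
        build_dslinter_command(source_path, other_options), check=False
    )
    return completed.returncode
```

Building the command as a list rather than a single string avoids the shell-specific line continuation differences between the Linux/Mac OS and Windows commands shown earlier.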

How to contribute

Contributions are welcome! If you want to contribute, please see the following steps:

  1. Fork the repository and clone your fork.
git clone https://github.com/your-github-account/dslinter.git
git submodule update --init --recursive
  2. dslinter uses poetry to manage dependencies, so you will need to install poetry first and then install the dependencies.
pip install poetry
poetry install
  • To install dslinter from source for development purposes, install it with:
poetry build
pip install ./dist/dslinter-version.tar.gz
  3. Assign yourself to the issue you want to solve. If you identify a new issue that needs to be solved, feel free to open a new issue.
  4. Make changes to the repository and run the tests. To run the tests using pytest:
poetry run pytest .
  5. Make a pull request. The pull request is expected to pass the tests. :)

Implemented Checkers:

  • C5501 - C5506 | import | Import Checker: Check whether data science modules are imported using the correct naming conventions.

  • R5501 | unnecessary-iteration-pandas | Unnecessary Iteration Checker(Pandas): Vectorized solutions are preferred over iterators for DataFrames. If iteration is used where a vectorized API could be used instead, the rule is violated.

  • W5501 | dataframe-iteration-modification-pandas | Unnecessary Iteration Checker(Pandas): A DataFrame that is being iterated over should not be modified. If the DataFrame is modified during iteration, the rule is violated.

  • R5502 | unnecessary-iteration-tensorflow | Unnecessary Iteration Checker(TensorFlow): If there is any augmented assignment operation in a loop, the rule is violated. Augmented assignment in a loop can be replaced with a vectorized solution using TensorFlow APIs.

  • E5501 | nan-numpy | Nan Equality Checker(NumPy): Values cannot be compared with np.nan, as np.nan != np.nan.

  • W5502 | chain-indexing-pandas | Chain Indexing Checker(Pandas): Chain indexing is considered bad practice in pandas code and should be avoided. If chain indexing is used on a pandas dataframe, the rule is violated.

  • R5503 | datatype-pandas | Datatype Checker(Pandas): Datatype should be set when a dataframe is imported from data to ensure the data formats are imported as expected. If the datatype is not set when importing, the rule is violated.

  • R5504 | column-selection-pandas | Column Selection Checker(Pandas): Columns should be selected explicitly after a DataFrame is imported, to make clear what the downstream code expects.

  • R5505 | merge-parameter-pandas | Merge Parameter Checker(Pandas): Parameters 'how', 'on' and 'validate' should be set for merge operations to ensure the correct usage of merging.

  • W5503 | inplace-pandas | InPlace Checker(Pandas): Operations on DataFrames return new DataFrames, and they should be assigned to a variable. Otherwise the result will be lost, and the rule is violated. Operations from the whitelist and operations with the inplace parameter set are excluded.

  • W5504 | dataframe-conversion-pandas | Dataframe Conversion Checker(Pandas): For dataframe conversion in pandas code, use .to_numpy() instead of .values. If .values is used in pandas code, the rule is violated.

  • W5505 | scaler-missing-scikitlearn | Scaler Missing Checker(ScikitLearn): Check whether a scaler is used before every scaling-sensitive operation in scikit-learn code. Scaling-sensitive operations include Principal Component Analysis (PCA), Support Vector Machine (SVM), Stochastic Gradient Descent (SGD), Multi-layer Perceptron classifier, and L1 and L2 regularization.

  • R5506 | hyperparameters-scikitlearn | Hyperparameter Checker(ScikitLearn): For scikit-learn learning algorithms, some important hyperparameters should be set.

  • R5507 | hyperparameter-tensorflow | Hyperparameter Checker(TensorFlow): For neural network learning algorithms, some important hyperparameters should be set, such as learning rate, batch size, momentum and weight decay.

  • R5508 | hyperparameter-pytorch | Hyperparameter Checker(PyTorch): For neural network learning algorithms, some important hyperparameters should be set, such as learning rate, batch size, momentum and weight decay.

  • W5506 | memory-release-tensorflow | Memory Release Checker(TensorFlow): If a neural network is created in the loop, and no memory clear operation is used, the rule is violated.

  • W5507 | deterministic-pytorch | Deterministic Algorithm Usage Checker(PyTorch): If torch.use_deterministic_algorithms() is not used in a PyTorch program, the rule is violated.

  • W5508 | randomness-control-numpy | Randomness Control Checker(NumPy): np.random.seed() should be used to preserve reproducibility in a machine learning program.

  • W5509 | randomness-control-scikitlearn | Randomness Control Checker(ScikitLearn): For reproducible results across executions, remove any use of random_state=None in scikit-learn estimators.

  • W5510 | randomness-control-tensorflow | Randomness Control Checker(TensorFlow): tf.random.set_seed() should be used to preserve reproducibility in a TensorFlow program.

  • W5511 | randomness-control-pytorch | Randomness Control Checker(PyTorch): torch.manual_seed() should be used to preserve reproducibility in a PyTorch program.

  • W5512 | randomness-control-dataloader-pytorch | Randomness Control Checker(PyTorch-Dataloader): The worker_init_fn() and generator should be set in dataloader to preserve reproducibility. If they're not set, the rule is violated.

  • W5513 | missing-mask-tensorflow | Mask Missing Checker(TensorFlow): If a log function is used in the code, check whether the argument value is valid.

  • W5514 | missing-mask-pytorch | Mask Missing Checker(PyTorch): If a log function is used in the code, check whether the argument value is valid.

  • W5515 | tensor-array-tensorflow | Tensor Array Checker(TensorFlow): Use tf.TensorArray() to grow an array inside a loop.

  • W5516 | forward-pytorch | Net Forward Checker(PyTorch): It is recommended to use self.net() rather than self.net.forward() in PyTorch code. If self.net.forward() is used in the code, the rule is violated.

  • W5517 | gradient-clear-pytorch | Gradient Clear Checker(PyTorch): The loss_fn.backward() and optimizer.step() should be used together with optimizer.zero_grad(). If the .zero_grad() is missing in the code, the rule is violated.

  • W5518 | pipeline-not-used-scikitlearn | Pipeline Checker(ScikitLearn): All scikit-learn estimators should be used inside Pipelines, to prevent data leakage between training and test data.

  • W5519 | dependent-threshold-scikitlearn | Dependent Threshold Checker(ScikitLearn): If a threshold-dependent evaluation metric (e.g., f-score) is used in the code, check whether a threshold-independent evaluation metric (e.g., auc) is also used in the code.

  • W5520 | dependent-threshold-tensorflow | Dependent Threshold Checker(TensorFlow): If a threshold-dependent evaluation metric (e.g., f-score) is used in the code, check whether a threshold-independent evaluation metric (e.g., auc) is also used in the code.

  • W5521 | dependent-threshold-pytorch | Dependent Threshold Checker(PyTorch): If a threshold-dependent evaluation metric (e.g., f-score) is used in the code, check whether a threshold-independent evaluation metric (e.g., auc) is also used in the code.
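
To make one of these rules concrete, the sketch below illustrates the pitfall behind the nan-numpy checker (E5501). It uses only the standard library so that it runs anywhere: np.nan is an ordinary IEEE 754 NaN exposed as a Python float, so `float("nan")` behaves the same way under `==`. The function names and sample data are hypothetical and are not taken from dslinter; in NumPy code the usual fix is `np.isnan`.

```python
import math

nan = float("nan")  # stands in for np.nan, which is the same kind of float value


def count_missing_wrong(values):
    # Smell: equality with NaN is always False, so nothing is ever counted.
    return sum(1 for v in values if v == nan)


def count_missing(values):
    # Fix: test for NaN explicitly (np.isnan is the NumPy counterpart).
    return sum(1 for v in values if math.isnan(v))


readings = [1.0, nan, 2.5, nan]
print(count_missing_wrong(readings))  # 0
print(count_missing(readings))        # 2
```

The first function fails silently instead of raising an error, which is why a static check for this pattern is useful.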

Team

dslinter was developed by Mark Haakman and Haiyin Zhang during their master's theses at the Software Engineering Research Group (SERG) at TU Delft and ING's AI for FinTech Research Lab, supervised by Luís Cruz and Arie van Deursen.

Maintainer: Haiyin Zhang [hynn0633@gmail.com].

