dslinter

`dslinter` is a Pylint plugin for linting data science and machine learning code. We plan to support the following Python libraries: TensorFlow, PyTorch, Scikit-Learn, Pandas, NumPy and SciPy.
Installation
To install from the Python Package Index:
pip install dslinter
Usage
To only use the checkers implemented in this plugin, run:
pylint --load-plugins=dslinter --disable=all --enable=import,data-leakage <other_options> <path_to_sources>
To expand a current pylint configuration with the checkers from this plugin, run:
pylint --load-plugins=dslinter <other_options> <path_to_sources>
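Equivalently, the plugin can be loaded from a Pylint configuration file so it runs on every invocation. A minimal `.pylintrc` sketch (the enabled message names simply mirror the example above):

```ini
[MASTER]
load-plugins=dslinter

[MESSAGES CONTROL]
disable=all
enable=import,data-leakage
```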
How to contribute
Contributions are welcome! If you want to contribute, please follow these steps:
- Fork the repository and clone your fork, including its submodules:
git clone https://github.com/your-github-account/dslinter.git
git submodule update --init --recursive
- dslinter uses poetry to manage dependencies, so you will need to install poetry first and then install the dependencies:
pip install poetry
poetry install
- To install dslinter from source for development purposes, build and install it with:
poetry build
pip install ./dist/dslinter-version.tar.gz
- Assign yourself to the issue you want to solve. If you identify a new issue that needs to be solved, feel free to open a new issue.
- Make changes to the repository and run the tests. To run the tests using pytest:
poetry run pytest .
- Make a pull request. The pull request is expected to pass the tests. :)
Implemented Checkers:
- C5501 - C5506 | import | Import Checker: Check whether data science modules are imported using the correct naming conventions.
- R5501 | unnecessary-iteration-pandas | Unnecessary Iteration Checker (Pandas): Vectorized solutions are preferred over iterators for DataFrames. If an iteration is used where a vectorized API could be used instead, the rule is violated.
- W5501 | dataframe-iteration-modification-pandas | Unnecessary Iteration Checker (Pandas): A DataFrame that is iterated over should not be modified. If the DataFrame is modified during iteration, the rule is violated.
- R5502 | unnecessary-iteration-tensorflow | Unnecessary Iteration Checker (TensorFlow): If there is an augmented assignment in a loop, the rule is violated. Augmented assignments in loops can be replaced with vectorized TensorFlow APIs.
- E5501 | nan-numpy | NaN Equality Checker (NumPy): Values cannot be compared with np.nan, as np.nan != np.nan.
- W5502 | chain-indexing-pandas | Chain Indexing Checker (Pandas): Chained indexing is considered bad practice in pandas code and should be avoided. If chained indexing is used on a pandas DataFrame, the rule is violated.
- R5503 | datatype-pandas | Datatype Checker (Pandas): The datatype should be set when a DataFrame is imported from data, to ensure the data is imported in the expected format. If the datatype is not set on import, the rule is violated.
- R5504 | column-selection-pandas | Column Selection Checker (Pandas): Columns should be selected right after the DataFrame is imported, for better readability: developers then know what data to expect downstream.
- R5505 | merge-parameter-pandas | Merge Parameter Checker (Pandas): The parameters 'how', 'on' and 'validate' should be set for merge operations to ensure correct usage of merging.
- W5503 | inplace-pandas | In-Place Checker (Pandas): Operations on DataFrames return new DataFrames, which should be assigned to a variable; otherwise the result is lost and the rule is violated. Operations from the whitelist and operations with the inplace parameter set are excluded.
- W5504 | dataframe-conversion-pandas | DataFrame Conversion Checker (Pandas): For DataFrame conversion in pandas code, use .to_numpy() instead of .values. If .values is used in pandas code, the rule is violated.
- W5505 | scaler-missing-scikitlearn | Scaler Missing Checker (ScikitLearn): Check whether a scaler is used before every scaling-sensitive operation in scikit-learn code. Scaling-sensitive operations include Principal Component Analysis (PCA), Support Vector Machines (SVM), Stochastic Gradient Descent (SGD), Multi-layer Perceptron classifiers, and L1 and L2 regularization.
- R5506 | hyperparameters-scikitlearn | Hyperparameter Checker (ScikitLearn): For scikit-learn learning algorithms, important hyperparameters should be set.
- R5507 | hyperparameter-tensorflow | Hyperparameter Checker (TensorFlow): For neural network learning algorithms, important hyperparameters should be set, such as the learning rate, batch size, momentum and weight decay.
- R5508 | hyperparameter-pytorch | Hyperparameter Checker (PyTorch): For neural network learning algorithms, important hyperparameters should be set, such as the learning rate, batch size, momentum and weight decay.
- W5506 | memory-release-tensorflow | Memory Release Checker (TensorFlow): If a neural network is created in a loop and no memory-clearing operation is used, the rule is violated.
- W5507 | deterministic-pytorch | Deterministic Algorithm Usage Checker (PyTorch): If torch.use_deterministic_algorithms() is not used in a PyTorch program, the rule is violated.
- W5508 | randomness-control-numpy | Randomness Control Checker (NumPy): np.random.seed() should be used to preserve reproducibility in a machine learning program.
- W5509 | randomness-control-scikitlearn | Randomness Control Checker (ScikitLearn): For reproducible results across executions, remove any use of random_state=None in scikit-learn estimators.
- W5510 | randomness-control-tensorflow | Randomness Control Checker (TensorFlow): tf.random.set_seed() should be used to preserve reproducibility in a TensorFlow program.
- W5511 | randomness-control-pytorch | Randomness Control Checker (PyTorch): torch.manual_seed() should be used to preserve reproducibility in a PyTorch program.
- W5512 | randomness-control-dataloader-pytorch | Randomness Control Checker (PyTorch-Dataloader): worker_init_fn() and generator should be set in a DataLoader to preserve reproducibility. If they are not set, the rule is violated.
- W5513 | missing-mask-tensorflow | Mask Missing Checker (TensorFlow): If a log function is used in the code, check whether the argument value is valid.
- W5514 | missing-mask-pytorch | Mask Missing Checker (PyTorch): If a log function is used in the code, check whether the argument value is valid.
- W5516 | tensor-array-tensorflow | Tensor Array Checker (TensorFlow): Use tf.TensorArray() for growing an array in a loop.
- W5517 | forward-pytorch | Net Forward Checker (PyTorch): It is recommended to use self.net() rather than self.net.forward() in PyTorch code. If self.net.forward() is used in the code, the rule is violated.
- W5518 | gradient-clear-pytorch | Gradient Clear Checker (PyTorch): loss_fn.backward() and optimizer.step() should be used together with optimizer.zero_grad(). If optimizer.zero_grad() is missing from the code, the rule is violated.
- W5519 | data-leakage-scikitlearn | Data Leakage Checker (ScikitLearn): All scikit-learn estimators should be used inside Pipelines, to prevent data leakage between training and test data.
- W5520 | dependent-threshold-scikitlearn | Dependent Threshold Checker (ScikitLearn): If a threshold-dependent evaluation metric (e.g., F-score) is used in the code, check whether a threshold-independent metric (e.g., AUC) is also used.
- W5521 | dependent-threshold-tensorflow | Dependent Threshold Checker (TensorFlow): If a threshold-dependent evaluation metric (e.g., F-score) is used in the code, check whether a threshold-independent metric (e.g., AUC) is also used.
- W5522 | dependent-threshold-pytorch | Dependent Threshold Checker (PyTorch): If a threshold-dependent evaluation metric (e.g., F-score) is used in the code, check whether a threshold-independent metric (e.g., AUC) is also used.
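To illustrate the unnecessary-iteration-pandas rule (R5501), here is a sketch of the kind of row-by-row loop the checker targets next to its vectorized equivalent; the column name and numbers are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0]})

# Flagged: iterating row by row where a vectorized API exists.
total = 0.0
for _, row in df.iterrows():
    total += row["price"] * 1.1

# Preferred: the vectorized equivalent, computed in a single expression.
total_vec = (df["price"] * 1.1).sum()
```

Beyond readability, the vectorized form dispatches the arithmetic to optimized C loops, which matters on large DataFrames.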
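The nan-numpy rule (E5501) exists because NaN compares unequal to everything, including itself, so equality checks silently fail. A minimal sketch:

```python
import numpy as np

x = np.nan

# Misleading: comparing against np.nan is always False, even when x is NaN,
# because np.nan != np.nan.
is_nan_wrong = (x == np.nan)   # False

# Correct: use np.isnan() to test for NaN values.
is_nan_right = np.isnan(x)     # True
```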
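The two pandas rules W5502 and W5504 can be sketched together; the column names and values here are invented for the example:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# W5502: chained indexing -- two separate [] lookups; assignment through
# this pattern may operate on a copy, so dslinter flags it.
value = df["a"][0]

# Preferred: a single .loc lookup.
value = df.loc[0, "a"]

# W5504: use .to_numpy() instead of the older .values attribute.
arr = df["b"].to_numpy()
```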
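The randomness-control rules (W5508 - W5512) share one idea across libraries: set a global seed once so runs are reproducible. Shown here only for NumPy (W5508), since the TensorFlow and PyTorch variants are the analogous one-line seed calls:

```python
import numpy as np

np.random.seed(42)           # W5508: seed NumPy's global RNG for reproducibility
first = np.random.rand(3)

np.random.seed(42)           # re-seeding replays the same random sequence
second = np.random.rand(3)

same = np.array_equal(first, second)  # True: identical draws after the same seed
```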