dslinter

`dslinter` is a Pylint plugin for linting data science and machine learning code. We plan to support the following Python libraries: TensorFlow, PyTorch, Scikit-Learn, Pandas, NumPy and SciPy.
Installation
To install from the Python Package Index:
pip install dslinter
Usage
To only use the checkers implemented in this plugin, run:
pylint --load-plugins=dslinter --disable=all --enable=import,data-leakage <other_options> <path_to_sources>
To expand a current pylint configuration with the checkers from this plugin, run:
pylint --load-plugins=dslinter <other_options> <path_to_sources>
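Equivalently, the plugin can be loaded from a Pylint configuration file so it runs on every invocation. A minimal `.pylintrc` sketch (the enabled message names simply mirror the example above):

```ini
[MASTER]
load-plugins=dslinter

[MESSAGES CONTROL]
disable=all
enable=import,data-leakage
```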
How to contribute
Contributions are welcome! If you want to contribute, please follow these steps:
- Fork the repository and clone your fork, including its submodules:
git clone https://github.com/your-github-account/dslinter.git
git submodule update --init --recursive
- dslinter uses poetry to manage dependencies, so you will need to install poetry first and then install the dependencies:
pip install poetry
poetry install
- To install dslinter from source for development purposes, build and install it with:
poetry build
pip install ./dist/dslinter-version.tar.gz
- Assign yourself to the issue you want to solve. If you identify a new issue that needs to be solved, feel free to open a new issue.
- Make changes to the repository and run the tests. To run the tests using pytest:
poetry run pytest .
- Make a pull request. The pull request is expected to pass the tests. :)
Implemented Checkers:
- C5501 - C5506 | import | Import Checker: Check whether data science modules are imported using the correct naming conventions.
- R5501 | unnecessary-iteration-pandas | Unnecessary Iteration Checker (Pandas): Vectorized solutions are preferred over iterators for DataFrames. If an iteration is used where a vectorized API could be used instead, the rule is violated.
- W5501 | dataframe-iteration-modification-pandas | Unnecessary Iteration Checker (Pandas): A DataFrame that is iterated over should not be modified. If the DataFrame is modified during iteration, the rule is violated.
- R5502 | unnecessary-iteration-tensorflow | Unnecessary Iteration Checker (TensorFlow): If there is an augmented assignment in a loop, the rule is violated. Augmented assignments in loops can be replaced with vectorized TensorFlow APIs.
- E5501 | nan-numpy | NaN Equality Checker (NumPy): Values cannot be compared with np.nan, as np.nan != np.nan.
- W5502 | chain-indexing-pandas | Chain Indexing Checker (Pandas): Chained indexing is considered bad practice in pandas code and should be avoided. If chained indexing is used on a pandas DataFrame, the rule is violated.
- R5503 | datatype-pandas | Datatype Checker (Pandas): The datatype should be set when a DataFrame is imported from data, to ensure the data is imported in the expected format. If the datatype is not set on import, the rule is violated.
- R5504 | column-selection-pandas | Column Selection Checker (Pandas): Columns should be selected right after the DataFrame is imported, for better readability: developers then know what data to expect downstream.
- R5505 | merge-parameter-pandas | Merge Parameter Checker (Pandas): The parameters 'how', 'on' and 'validate' should be set for merge operations to ensure correct usage of merging.
- W5503 | inplace-pandas | In-Place Checker (Pandas): Operations on DataFrames return new DataFrames, which should be assigned to a variable; otherwise the result is lost and the rule is violated. Operations from the whitelist and operations with the inplace parameter set are excluded.
- W5504 | dataframe-conversion-pandas | DataFrame Conversion Checker (Pandas): For DataFrame conversion in pandas code, use .to_numpy() instead of .values. If .values is used in pandas code, the rule is violated.
- W5505 | scaler-missing-scikitlearn | Scaler Missing Checker (ScikitLearn): Check whether a scaler is used before every scaling-sensitive operation in scikit-learn code. Scaling-sensitive operations include Principal Component Analysis (PCA), Support Vector Machines (SVM), Stochastic Gradient Descent (SGD), Multi-layer Perceptron classifiers, and L1 and L2 regularization.
- R5506 | hyperparameters-scikitlearn | Hyperparameter Checker (ScikitLearn): For scikit-learn learning algorithms, important hyperparameters should be set.
- R5507 | hyperparameter-tensorflow | Hyperparameter Checker (TensorFlow): For neural network learning algorithms, important hyperparameters should be set, such as the learning rate, batch size, momentum and weight decay.
- R5508 | hyperparameter-pytorch | Hyperparameter Checker (PyTorch): For neural network learning algorithms, important hyperparameters should be set, such as the learning rate, batch size, momentum and weight decay.
- W5506 | memory-release-tensorflow | Memory Release Checker (TensorFlow): If a neural network is created in a loop and no memory-clearing operation is used, the rule is violated.
- W5507 | deterministic-pytorch | Deterministic Algorithm Usage Checker (PyTorch): If torch.use_deterministic_algorithms() is not used in a PyTorch program, the rule is violated.
- W5508 | randomness-control-numpy | Randomness Control Checker (NumPy): np.random.seed() should be used to preserve reproducibility in a machine learning program.
- W5509 | randomness-control-scikitlearn | Randomness Control Checker (ScikitLearn): For reproducible results across executions, remove any use of random_state=None in scikit-learn estimators.
- W5510 | randomness-control-tensorflow | Randomness Control Checker (TensorFlow): tf.random.set_seed() should be used to preserve reproducibility in a TensorFlow program.
- W5511 | randomness-control-pytorch | Randomness Control Checker (PyTorch): torch.manual_seed() should be used to preserve reproducibility in a PyTorch program.
- W5512 | randomness-control-dataloader-pytorch | Randomness Control Checker (PyTorch-Dataloader): worker_init_fn() and generator should be set in a DataLoader to preserve reproducibility. If they are not set, the rule is violated.
- W5513 | missing-mask-tensorflow | Mask Missing Checker (TensorFlow): If a log function is used in the code, check whether the argument value is valid.
- W5514 | missing-mask-pytorch | Mask Missing Checker (PyTorch): If a log function is used in the code, check whether the argument value is valid.
- W5516 | tensor-array-tensorflow | Tensor Array Checker (TensorFlow): Use tf.TensorArray() for growing an array in a loop.
- W5517 | forward-pytorch | Net Forward Checker (PyTorch): It is recommended to use self.net() rather than self.net.forward() in PyTorch code. If self.net.forward() is used in the code, the rule is violated.
- W5518 | gradient-clear-pytorch | Gradient Clear Checker (PyTorch): loss_fn.backward() and optimizer.step() should be used together with optimizer.zero_grad(). If optimizer.zero_grad() is missing from the code, the rule is violated.
- W5519 | data-leakage-scikitlearn | Data Leakage Checker (ScikitLearn): All scikit-learn estimators should be used inside Pipelines, to prevent data leakage between training and test data.
- W5520 | dependent-threshold-scikitlearn | Dependent Threshold Checker (ScikitLearn): If a threshold-dependent evaluation metric (e.g., F-score) is used in the code, check whether a threshold-independent metric (e.g., AUC) is also used.
- W5521 | dependent-threshold-tensorflow | Dependent Threshold Checker (TensorFlow): If a threshold-dependent evaluation metric (e.g., F-score) is used in the code, check whether a threshold-independent metric (e.g., AUC) is also used.
- W5522 | dependent-threshold-pytorch | Dependent Threshold Checker (PyTorch): If a threshold-dependent evaluation metric (e.g., F-score) is used in the code, check whether a threshold-independent metric (e.g., AUC) is also used.
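To illustrate the unnecessary-iteration-pandas rule (R5501), here is a sketch of the kind of row-by-row loop the checker targets next to its vectorized equivalent; the column name and numbers are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0]})

# Flagged: iterating row by row where a vectorized API exists.
total = 0.0
for _, row in df.iterrows():
    total += row["price"] * 1.1

# Preferred: the vectorized equivalent, computed in a single expression.
total_vec = (df["price"] * 1.1).sum()
```

Beyond readability, the vectorized form dispatches the arithmetic to optimized C loops, which matters on large DataFrames.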
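The nan-numpy rule (E5501) exists because NaN compares unequal to everything, including itself, so equality checks silently fail. A minimal sketch:

```python
import numpy as np

x = np.nan

# Misleading: comparing against np.nan is always False, even when x is NaN,
# because np.nan != np.nan.
is_nan_wrong = (x == np.nan)   # False

# Correct: use np.isnan() to test for NaN values.
is_nan_right = np.isnan(x)     # True
```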
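The two pandas rules W5502 and W5504 can be sketched together; the column names and values here are invented for the example:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# W5502: chained indexing -- two separate [] lookups; assignment through
# this pattern may operate on a copy, so dslinter flags it.
value = df["a"][0]

# Preferred: a single .loc lookup.
value = df.loc[0, "a"]

# W5504: use .to_numpy() instead of the older .values attribute.
arr = df["b"].to_numpy()
```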
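The randomness-control rules (W5508 - W5512) share one idea across libraries: set a global seed once so runs are reproducible. Shown here only for NumPy (W5508), since the TensorFlow and PyTorch variants are the analogous one-line seed calls:

```python
import numpy as np

np.random.seed(42)           # W5508: seed NumPy's global RNG for reproducibility
first = np.random.rand(3)

np.random.seed(42)           # re-seeding replays the same random sequence
second = np.random.rand(3)

same = np.array_equal(first, second)  # True: identical draws after the same seed
```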