An implementation of the MixSAD algorithm for anomaly detection in mixed-feature data.

Project description

MixSAD: High-Performance Fraud Detection

This project implements a high-performance, supervised learning pipeline for fraud detection. Originally based on the unsupervised MixSAD algorithm, the model has been significantly enhanced to use a direct supervised approach, enabling it to achieve high accuracy and recall on complex fraud detection tasks.

The current implementation is optimized to run on the Kaggle Credit Card Fraud Detection dataset.

Core Approach: Supervised Prediction

The key to the model's high performance is its shift from unsupervised anomaly detection to a direct supervised classification strategy.

Supervised Feature Engineering: The pipeline trains a LogisticRegression model on the labeled data. This model's primary purpose is to generate a powerful, predictive feature: a fraud_score for each transaction, which represents the probability of that transaction being fraudulent.

Threshold-Based Prediction: Instead of using a complex secondary model, predictions are made by applying a simple probability threshold to the fraud_score. Any transaction with a score greater than or equal to the threshold is classified as fraud.

This direct approach is highly effective and transparent, allowing for precise control over the model's sensitivity to fraud.
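The two steps above can be sketched in plain Python. This is a minimal illustration, not the package's actual code: the function names fraud_score and predict_fraud, the weights, the bias, and the sample transactions are all hypothetical, and the real pipeline obtains its scores from a trained LogisticRegression model rather than from hand-written weights.

```python
import math


def fraud_score(features, weights, bias):
    """Logistic-regression style probability: sigmoid of a linear score."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict_fraud(scores, threshold=0.5):
    """Label a transaction as fraud (1) when its score is >= threshold."""
    return [1 if s >= threshold else 0 for s in scores]


# Hypothetical weights and transactions, for illustration only.
weights = [2.0, -1.0]
bias = -1.0
transactions = [[0.0, 0.0], [2.0, 0.5], [0.5, 0.0]]

scores = [fraud_score(t, weights, bias) for t in transactions]
predictions = predict_fraud(scores, threshold=0.5)

for score, label in zip(scores, predictions):
    print(f"fraud_score={score:.3f} -> predicted fraud: {bool(label)}")
```

The third transaction has a score of exactly 0.5 and is still flagged, because the rule is "greater than or equal to the threshold".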

Project Structure

mixsad/: The main package source code, including the pipeline, preprocessor, feature_engineer, and prediction_builder.

examples/: Contains the run_on_kaggle_data.py script demonstrating how to use the package.

pyproject.toml: The package configuration file.

README.md: This file.

Setup and Installation

Local Setup

Clone the repository and navigate into it.

Create a virtual environment with python -m venv venv, then activate it.

Install requirements: pip install -r requirements.txt

Install the package in editable mode: pip install -e .

Usage

Download the Dataset:

Download the "Credit Card Fraud Detection Dataset" from Kaggle.

Rename the file to credit_card_fraud.csv and place it in the project's root directory.

Run the Example: Execute the example script to see the model in action:

python examples/run_on_kaggle_data.py

Fine-Tuning for High Performance 🎯

For fraud detection, missing a real case of fraud (low recall) is usually much worse than flagging a legitimate transaction for review (low precision). The primary way to fine-tune this model is by adjusting the probability threshold.

Adjusting the Prediction Threshold

The run method of the pipeline accepts a threshold parameter.

A higher threshold (e.g., 0.7) makes the model more conservative. It will only flag transactions it is very confident are fraudulent. This leads to high precision but lower recall.

A lower threshold (e.g., 0.3) makes the model more sensitive. It will also flag transactions whose estimated probability of fraud is only moderate. This leads to higher recall but lower precision.

The examples/run_on_kaggle_data.py script demonstrates this principle by running the pipeline with two different thresholds to show how it directly impacts the precision-recall trade-off.
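The trade-off can be seen on a tiny example. The sketch below is independent of the package: the helper precision_recall, the labels, and the scores are made up for illustration and are not output from the Kaggle dataset or from examples/run_on_kaggle_data.py.

```python
def precision_recall(y_true, scores, threshold):
    """Compute (precision, recall) for the rule: fraud if score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels (1 = fraud) and fraud scores, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.90, 0.60, 0.35, 0.40, 0.20, 0.10, 0.05, 0.32]

high = precision_recall(y_true, scores, threshold=0.7)
low = precision_recall(y_true, scores, threshold=0.3)

print(f"threshold=0.7: precision={high[0]:.2f}, recall={high[1]:.2f}")
print(f"threshold=0.3: precision={low[0]:.2f}, recall={low[1]:.2f}")
```

On this toy data the higher threshold catches only one of the three fraud cases but raises no false alarms, while the lower threshold catches all three at the cost of two false alarms.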

# The example script shows how to adjust the threshold
# to meet the goal of >90% recall for fraud.
pipeline.run(df_features, true_labels, threshold=0.30)

By adjusting this single parameter, you can configure the model to meet the specific business requirements of your fraud detection system.
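One way to turn a business requirement such as a recall target into a threshold is to scan candidate thresholds and keep the highest one that still meets the target. The sketch below assumes you already have true labels and fraud scores as plain lists; the function highest_threshold_for_recall and the sample data are hypothetical and not part of the mixsad package.

```python
def highest_threshold_for_recall(y_true, scores, target_recall):
    """Return the largest observed score that, used as a threshold,
    gives recall >= target_recall. Returns None if no threshold qualifies."""
    positives = sum(y_true)
    if positives == 0:
        return None
    # Try candidate thresholds from strictest to most permissive.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
        if tp / positives >= target_recall:
            return threshold
    return None


# Made-up labels (1 = fraud) and fraud scores, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.90, 0.60, 0.35, 0.40, 0.20, 0.10, 0.05, 0.32]

chosen = highest_threshold_for_recall(y_true, scores, target_recall=0.90)
print(f"Highest threshold with recall >= 0.90: {chosen}")
```

A threshold chosen this way should be selected on a held-out validation set rather than on the data used for the final evaluation, otherwise the reported recall will be optimistic.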

Download files

Source Distribution

mixsad_anomaly_detection-0.1.0.tar.gz (5.7 kB)

Uploaded Source

Built Distribution

mixsad_anomaly_detection-0.1.0-py3-none-any.whl (7.1 kB)

Uploaded Python 3

File details

Details for the file mixsad_anomaly_detection-0.1.0.tar.gz.

File hashes

Hashes for mixsad_anomaly_detection-0.1.0.tar.gz
Algorithm Hash digest
SHA256 c68e176ea786f5daf4db839996f5923f872b0bdc27f9c9c274d32d0a1e7a376e
MD5 c5b082895387638e813a5e86c9c12de4
BLAKE2b-256 b18deedd3db5b9eec373931f7bd62f16c090084043b30939508992253a7caca8

File details

Details for the file mixsad_anomaly_detection-0.1.0-py3-none-any.whl.

File hashes

Hashes for mixsad_anomaly_detection-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 6fc6f00177219fc000a40358905ed08df199d70e3083e14a639aa02085720c87
MD5 5d282f77016efacbd7117815b51a1ad5
BLAKE2b-256 4ebe2e80c9dc8c5abded627d0a2fde08bca1ca9c8d4b386877ef57ad46ff7ecc
