Scalable Fraud Transaction Identifier using Clustering, Anomaly Detection and Classification ML Algorithms

Project description

The general objective of this project is to identify clusters in the data and then find the anomalies (outliers) within each cluster, which labels every data point as either an anomaly or genuine. With these labels, we can train a classification model that separates, say, fraudulent transactions from genuine ones. This approach can be applied to many use cases, such as:

  • Fraudulent Medical Claim detection

  • Fraudulent Credit Card Transactions

  • Early detection of insider trading

  • System Security

Technologies used

Because the package needs to scale to big data involving hundreds of millions of records, I have chosen to use:

  • Apache Spark

  • H2O

My Approach

Below is the approach taken and algorithms used to solve the problem at hand:

  1. K-Means Clustering from Apache Spark MLlib to identify clusters

  2. Isolation Forest from H2O to detect the anomalies

  3. PCA to visualize the data in 3D by reducing the number of dimensions

  4. Gradient Boosted Classification Trees from Spark MLlib to create the classification model

  5. Model optimization using Apache Spark MLlib Cross Validator
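
The idea behind steps 1 and 2 above can be illustrated on a toy scale. The sketch below is not this package's code and uses neither Spark nor H2O: it is plain Python, with a small hand-written k-means and, in place of Isolation Forest, a simple z-score on each point's distance to its own cluster centroid. The data, the function names (kmeans, flag_anomalies) and the threshold of 2.0 are all assumptions chosen for illustration.

```python
import math
from statistics import mean, pstdev


def kmeans(points, k, iterations=10):
    """Lloyd's k-means with a deterministic start (evenly spaced input rows)."""
    centroids = [points[i * len(points) // k] for i in range(k)]
    assignments = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(mean(axis) for axis in zip(*members))
    return centroids, assignments


def flag_anomalies(points, centroids, assignments, z_threshold=2.0):
    """Flag points that sit unusually far from their own cluster centroid.

    Assumption: a z-score on within-cluster distances stands in for
    Isolation Forest here; it is a much cruder detector.
    """
    distances = [math.dist(p, centroids[a]) for p, a in zip(points, assignments)]
    flags = [False] * len(points)
    for c in range(len(centroids)):
        idx = [i for i, a in enumerate(assignments) if a == c]
        if len(idx) < 2:
            continue  # too few points to estimate a spread
        cluster_d = [distances[i] for i in idx]
        mu, sigma = mean(cluster_d), pstdev(cluster_d)
        if sigma == 0:
            continue  # all points equidistant, nothing stands out
        for i in idx:
            flags[i] = (distances[i] - mu) / sigma > z_threshold
    return flags


# Hypothetical two-feature transactions: two tight groups, one stray point each.
transactions = [
    (10, 10), (11, 10), (10, 11), (9, 10), (10, 9), (11, 11), (9, 9), (10, 25),
    (100, 100), (101, 100), (100, 101), (99, 100), (100, 99), (101, 101),
    (99, 99), (130, 100),
]

centroids, assignments = kmeans(transactions, k=2)
flags = flag_anomalies(transactions, centroids, assignments)

# Each row now carries a cluster id and an anomaly/genuine label.
labelled = list(zip(transactions, assignments, flags))
for point, cluster, is_anomaly in labelled:
    print(point, cluster, "anomaly" if is_anomaly else "genuine")
```

The resulting anomaly/genuine labels are the kind of target column that the classifier in step 4 would then be trained on.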

How to import and use the package?

Download files

Source Distribution

FraudTransactionDetector-0.1.1.dev0.tar.gz (2.0 kB)

Uploaded Source
