Scalable Fraud Transaction Identifier using Clustering, Anomaly Detection and Classification ML Algorithms
Project description
The general objective of this project is to identify clusters in the data and then find the anomalies (outliers) within each cluster, which labels every data point as either anomalous or genuine. With these labels, we can train a classification model that separates, for example, fraudulent transactions from genuine ones. This approach can be applied to many use cases, such as:
Fraudulent Medical Claim detection
Fraudulent Credit Card Transactions
Early detection of insider trading
System Security
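The flow described above (cluster the data, flag outliers inside each cluster, then use the flags as labels) can be illustrated with a small standard-library sketch. This is not the package's implementation: a toy 1-D k-means and a median-distance rule stand in for Spark K-Means and H2O Isolation Forest, and the function names, the starting centers and the threshold are all hypothetical.

```python
import statistics


def kmeans_1d(values, centers, iterations=10):
    """Toy 1-D k-means: return a cluster index for every value."""
    centers = list(centers)
    assignment = [0] * len(values)
    for _ in range(iterations):
        # Assign each value to its nearest center.
        assignment = [
            min(range(len(centers)), key=lambda c: abs(v - centers[c]))
            for v in values
        ]
        # Move each center to the mean of its members (unchanged if empty).
        for c in range(len(centers)):
            members = [v for v, a in zip(values, assignment) if a == c]
            if members:
                centers[c] = statistics.mean(members)
    return assignment


def label_anomalies(values, assignment, threshold=3.0):
    """Flag values far from their own cluster's median (1 = anomaly, 0 = genuine)."""
    labels = [0] * len(values)
    for c in set(assignment):
        idx = [i for i, a in enumerate(assignment) if a == c]
        members = [values[i] for i in idx]
        median = statistics.median(members)
        # Median absolute deviation; fall back to 1.0 to avoid dividing by zero.
        mad = statistics.median(abs(v - median) for v in members) or 1.0
        for i in idx:
            if abs(values[i] - median) / mad > threshold:
                labels[i] = 1
    return labels


# Hypothetical transaction amounts: two groups, each with one unusual value.
amounts = [10, 12, 11, 13, 40, 100, 102, 101, 99, 160]
clusters = kmeans_1d(amounts, centers=[0, 200])
labels = label_anomalies(amounts, clusters)
```

The resulting `labels` list plays the role of the target column that a classifier would later be trained on. Scoring each point against its own cluster matters here: 40 is unremarkable across the whole data set but unusual within the low-value cluster.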
Technologies used
Because the package needs to scale to Big Data workloads involving hundreds of millions of records, I have chosen to use:
Apache Spark
H2O
My Approach
Below is the approach taken and algorithms used to solve the problem at hand:
K-Means Clustering from Apache Spark MLlib to identify clusters
Isolation Forest from H2O to detect the anomalies
PCA to visualize the data in 3D by reducing the number of dimensions
Gradient Boosted Classification Trees from Spark MLlib to create the classification model
Model optimization using the Apache Spark MLlib CrossValidator
How to import and use the package?
Source Distribution
Hashes for FraudTransactionDetector-0.1.0.dev0.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 2f713fdeffee3031f75c6523a0d567e3b7ac6e39a7be4cd561a81b82284b2776
MD5 | 7654298feca58c19eaf0b90a7fbe7829
BLAKE2b-256 | cb4fce404e4c767a37f5d4e260179adf480db204ffc7dd9c840df98ddc9be0dd