
Building Models for eCommerce Fraud Detection: Project Requirement and Solution Approach

Overview

In this blog, we’ll explore a new project focused on Building Models for eCommerce Fraud Detection. This project leverages PySpark to create real-time fraud detection models that can efficiently identify and prevent fraudulent transactions, particularly Card-Not-Present (CNP) fraud, within eCommerce platforms.


We’ll walk through the project’s structure, which includes developing machine learning models using PySpark's powerful DataFrames and MLlib. You'll see how we preprocess the data, employ classification techniques, and use K-means clustering to gain insights into fraudster behavior.


In the solution approach section, we’ll explore how we transformed the provided datasets, trained models to predict fraud, and implemented a scalable system for both historical and real-time fraud detection, all while focusing on accuracy and performance optimization.


Introduction:

Fraud detection has become a critical component in safeguarding the integrity of eCommerce platforms. With the exponential growth in digital transactions, fraudsters have developed increasingly sophisticated methods to exploit vulnerabilities. In this blog, we’ll explore how machine learning models, combined with the power of PySpark, can be used to detect and prevent fraudulent activities in real time. We'll dive into a project focused on developing a fraud detection system for Monash Fashion Corporation (MFC), a fictional eCommerce retailer, and examine both the project requirements and the solution approach.


Project Requirement: Building a Fraud Detection System


Objective: The primary objective of this project is to implement a machine learning solution capable of detecting CNP fraud using customer information and browsing behavior. Additionally, the project will involve unsupervised learning through K-means clustering to identify common characteristics among fraudsters.


Datasets:

  • Customer dataset: Customer information

  • Category dataset: Product category details

  • Product dataset: Information about products

  • Transaction dataset: Sales transaction records

  • Browsing behavior dataset: Customer browsing data

  • Customer session dataset: Links browsing sessions with customer information

  • Fraud transaction dataset: List of fraudulent transactions


Use Cases:

  1. Fraud Classification: Using customer and browsing behavior data, classify whether a transaction is fraudulent.

  2. Fraudster Segmentation: Apply K-means clustering to identify patterns in fraudulent behavior.


Project Tasks:

  • Data Preprocessing: Clean and transform the datasets into usable features for machine learning models.

    • Extracting useful features from customer, transaction, and browsing behavior data.

    • Creating new columns such as event levels (L1, L2, L3) that group each session's browsing events by how likely they are to lead to a purchase.

    • Handling null values and rebalancing the dataset to correct the class imbalance (fraud vs. non-fraud).


  • Feature Engineering:

    • Create transaction-level features such as counts and ratios of high-likelihood (L1), medium-likelihood (L2), and low-likelihood (L3) actions.

    • Add time-of-day, geolocation, and customer segmentation features.

    • Merge fraud labels into the feature set.


  • Model Building:

    • Implement classification models using PySpark MLlib, such as Random Forest and Gradient-Boosted Trees, to predict fraudulent transactions.

    • Evaluate models using metrics such as AUC, precision, recall, and accuracy, and inspect their ROC plots.


  • Unsupervised Learning:

    • Perform K-means clustering to identify patterns in customer behavior, particularly focusing on identifying fraud-prone customer segments.
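To make the class-balancing task listed under Data Preprocessing more concrete, here is a minimal sketch of random undersampling in plain Python. It is only one possible way to rebalance the data and not necessarily the approach used in the project. The field names `transaction_id` and `is_fraud`, the helper `undersample_majority`, and the 2:1 ratio are all assumptions made for illustration.

```python
import random
from collections import Counter


def undersample_majority(rows, label_key="is_fraud", ratio=1.0, seed=42):
    """Keep every fraud row and a random subset of the non-fraud rows.

    ratio is the desired number of non-fraud rows per fraud row.
    """
    fraud = [r for r in rows if r[label_key] == 1]
    legit = [r for r in rows if r[label_key] == 0]
    keep = min(len(legit), int(len(fraud) * ratio))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return fraud + rng.sample(legit, keep)


# Hypothetical toy data: 5 fraudulent rows among 100 transactions.
transactions = [
    {"transaction_id": i, "is_fraud": 1 if i < 5 else 0} for i in range(100)
]

balanced = undersample_majority(transactions, ratio=2.0)
label_counts = Counter(r["is_fraud"] for r in balanced)
print(label_counts)
```

On a Spark DataFrame, a comparable stratified sample could be drawn with `DataFrame.sampleBy`, passing a sampling fraction per label value.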



Architecture: The project requires a scalable architecture where data is processed using PySpark DataFrames, and machine learning models are built using Spark MLlib. The solution will focus on:

  1. Data Loading and Transformation: Ingesting and transforming data into a format suitable for machine learning.

  2. Model Development: Building Random Forest (RF) and Gradient-Boosted Tree (GBT) models for fraud classification.

  3. Clustering: Using K-means clustering to segment customers based on behavior and detect anomalies.


Solution Approach: From Data to Fraud Detection

1. Data Loading and Transformation: The first step is to load the datasets into PySpark DataFrames. The data must be pre-processed to create meaningful features for machine learning. This involves several tasks:


  • Event Categorization: Browsing behavior events are classified into three levels based on their likelihood to lead to a purchase:

    • L1 (high likelihood): Add Promotion, Add to Cart, CheckOut

    • L2 (moderate likelihood): Viewing Category, Viewing Item, Search

    • L3 (low likelihood): Mouse Scrolling, Clicks, View HomePage


  • Feature Engineering: For each transaction, we calculate:

    • The count of L1, L2, and L3 actions

    • The ratios of L1 and L2 actions to the total number of actions

    • Time-of-day grouping based on browsing session timestamps (morning, afternoon, evening, night)

    • Customer demographics like gender, age, and location

    • The total number of purchases made by the customer

    • Fraud labels based on transaction history
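The event categorization and the behavioral features above can be sketched in plain Python as follows. This is an illustration of the logic only, not the project's actual PySpark code. The event names are taken from the list above, but how they are spelled in the real dataset, the feature names (`l1_count`, `l1_ratio`, and so on), the time-of-day boundaries, and the choice of the first event's timestamp to represent a session are all assumptions.

```python
from collections import Counter
from datetime import datetime

# Event-to-level mapping from the categorization described above.
EVENT_LEVEL = {
    "Add Promotion": "L1", "Add to Cart": "L1", "CheckOut": "L1",
    "Viewing Category": "L2", "Viewing Item": "L2", "Search": "L2",
    "Mouse Scrolling": "L3", "Clicks": "L3", "View HomePage": "L3",
}


def time_of_day(ts):
    """Bucket a timestamp into four groups (the boundaries are assumed)."""
    hour = ts.hour
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 24:
        return "evening"
    return "night"


def session_features(events):
    """Build features from a list of (event_name, timestamp) tuples."""
    if not events:
        raise ValueError("a session needs at least one event")
    counts = Counter(EVENT_LEVEL[name] for name, _ in events)
    total = len(events)
    first_ts = min(ts for _, ts in events)
    return {
        "l1_count": counts["L1"],
        "l2_count": counts["L2"],
        "l3_count": counts["L3"],
        "l1_ratio": counts["L1"] / total,
        "l2_ratio": counts["L2"] / total,
        "time_of_day": time_of_day(first_ts),
    }


# Hypothetical browsing session linked to one transaction.
session = [
    ("View HomePage", datetime(2024, 1, 5, 21, 3)),
    ("Search", datetime(2024, 1, 5, 21, 4)),
    ("Viewing Item", datetime(2024, 1, 5, 21, 6)),
    ("Add to Cart", datetime(2024, 1, 5, 21, 9)),
    ("CheckOut", datetime(2024, 1, 5, 21, 10)),
]

features = session_features(session)
print(features)
```

In PySpark, the same mapping could be expressed as a chain of `when(...).otherwise(...)` column expressions, followed by a `groupBy` on the session identifier to produce the counts.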


2. Model Development: Using the prepared dataset, machine learning models are built to classify transactions as fraudulent or legitimate. The following steps outline the process:


  • Feature Selection: After exploring the dataset, the most relevant features are selected for model training. This includes customer demographics, browsing behavior, and transaction details.

  • Model Training: Two models are trained: Random Forest (RF) and Gradient-Boosted Trees (GBT). Both are tree-based ensemble methods available in Spark MLlib that scale to large datasets and can capture non-linear relationships between features.

  • Model Evaluation: The models are evaluated using several metrics, including accuracy, precision, recall, and AUC. The better-performing model is selected for deployment.
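To show what the evaluation metrics measure, here is a small plain-Python sketch that computes accuracy, precision, recall, and AUC from labels and predicted fraud scores. The labels, scores, and the 0.5 decision threshold are invented for illustration and are not results from the project. In Spark MLlib, AUC would typically come from `BinaryClassificationEvaluator` rather than hand-written code.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    fraud case is scored higher than a random legitimate one (ties count half).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores for eight transactions.
y_true = [1, 1, 0, 0, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.6, 0.1, 0.8, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc_value = auc(y_true, scores)
print(metrics, auc_value)
```

Because fraud is the rare class, a model can reach high accuracy by predicting "legitimate" almost everywhere, which is why precision, recall, and AUC are examined alongside it.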


3. Clustering for Fraud Detection: To further understand fraudulent behavior, K-means clustering is applied to segment customers based on their browsing patterns and transaction history. Fraudsters typically display different behaviors than legitimate customers:

  • Fraudsters may spend less time browsing and may attempt multiple failed transactions using stolen cards.

  • By clustering customers, MFC can identify patterns among fraudulent users and use this information to enhance its detection system.
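The idea behind K-means can be sketched in a few lines of plain Python. This is a toy version of the standard algorithm (assign each point to its nearest centroid, then recompute the centroids), not the project's implementation, which would use `KMeans` from Spark MLlib. The two features (minutes spent browsing and number of failed payment attempts) and all data values are hypothetical, chosen to mirror the behaviors described above.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster tuples of numbers with the basic K-means procedure."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points

    def nearest(p):
        return min(range(k), key=lambda i: math.dist(p, centroids[i]))

    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p)].append(p)
        # New centroid = mean of each cluster; keep the old one if empty.
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments no longer change
        centroids = new_centroids

    labels = [nearest(p) for p in points]
    return centroids, labels


# Hypothetical customers: (minutes browsing, failed payment attempts).
customers = [
    (30, 0), (25, 1), (35, 0), (28, 0),  # long sessions, few failures
    (3, 4), (2, 5), (4, 6),              # short sessions, many failures
]

centroids, labels = kmeans(customers, k=2)
print(centroids, labels)
```

In this toy data the two groups are far apart, so the algorithm separates them cleanly. With real features on different scales, standardizing the inputs before clustering is usually advisable, since K-means relies on Euclidean distance.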


If you require any assistance with the project discussed in this blog, or if you find yourself in need of similar support for other projects, please don't hesitate to reach out to us. Our team can be contacted at any time via email at contact@codersarts.com.

