Machine Learning: Credit Card Fraud Detection (Modeling Severely Imbalanced Data)


    A machine learning case study on customer credit risk and prediction

    This post walks through a bank project on credit card fraud detection (modeling severely imbalanced data). It covers the following approaches to severe class imbalance:
    Model tuning in R
    Choosing the classification cutoff
    Sampling methods
    Cost-sensitive training

    Data Description

    | Field name | Type | Description |
    |------------|------|-------------|
    | time | continuous | Seconds elapsed between this record and the first record in the dataset |
    | V1 to V28 | continuous | Principal components |
    | amount | continuous | Transaction amount |
    | class | categorical | Fraud indicator: 1 = fraud, 0 = not fraud |

    For compliance reasons the raw data cannot be shown, so the data were transformed with PCA beforehand and 28 principal components were kept for modeling. The time and transaction-amount variables were left untransformed.
    The full dataset has 284,807 rows and 30 predictors. Since this post only aims to demonstrate techniques for severe class imbalance, and since R users working with data at this scale often have few cores available for parallel computation and face long training times, only 10,492 records were used for modeling.

    Visualizing the response variable
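
    The original post shows a bar chart of the class distribution at this point; the image is not reproduced here. A minimal sketch of an equivalent plot with ggplot2, assuming Class has already been converted to a factor as in the code below:

    library(ggplot2)
    # Bar chart of the class counts; with 10,000 nonfraud rows against
    # 492 fraud rows, the imbalance is obvious at a glance.
    ggplot(df, aes(x = Class, fill = Class)) +
      geom_bar() +
      theme_bw()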

    Model Tuning in R

    Here we use logistic regression, glmnet, and random forest models for prediction and show the tuning process for each of the three.
    glmnet tuning grid: α = 0, 0.2, 0.4;
    λ = 0.01, 0.031, 0.052, 0.073, 0.094, 0.115, 0.136, 0.157, 0.178, 0.2 (ten values evenly spaced over [0.01, 0.2]).
    Random forest: 1000 trees; the number of predictors sampled at each split (mtry) takes the values 2, 9, 16, 23, 30.

    > library(caret)
    > library(class)
    > library(data.table)
    > library(Hmisc)
    > library(pROC)
    > library(RColorBrewer)
    > library(DMwR)
    > library(kernlab)
    > library(doParallel)
    > df <- fread('creditcard.csv')
    > training1 <- sample(which(df$Class == 0),10000,replace = F)
    > df <- df[c(training1,which(df$Class == 1)),]
    > table(df$Class)
    
        0     1 
    10000   492 
    > df[,Class := factor(as.character(Class),levels = c('0','1'),
    +                     labels = c('nonfraud','fraud'))] # convert to factor
    > 
    > training <- createDataPartition(df$Class,p = .75)[[1]]
    > df.train <- df[ training,]
    > df.test <- df[ !training,]
    > 
    > dim(df.train)
    [1] 7869   31
    > dim(df.test)
    [1] 2623   31
    > table(df.train$Class)
    
    nonfraud    fraud 
        7500      369 
    > table(df.test$Class)
    
    nonfraud    fraud 
        2500      123 
    > fiveStats <- function(...) c(twoClassSummary(...),defaultSummary(...)) # ROC/Sens/Spec plus Accuracy/Kappa
    > unregister_dopar <- function() { # clear any stale doParallel registration
    +   env <- foreach:::.foreachGlobals
    +   rm(list=ls(name=env), pos=env)
    + }
    > 
    > df.train[,Class:=factor(Class)]
    > df.test[,Class:=factor(Class)]
    > unregister_dopar()
    > ctrl <- trainControl(method = 'cv',classProbs = TRUE,
    +                      summaryFunction = fiveStats,
    +                      verboseIter = F) # 10-fold CV, keep class probabilities
    > 
    > set.seed(1024)
    > lr.fit <- train(Class ~.,data = df.train,
    +                 method = 'glm',trControl = ctrl,metric = 'ROC') # logistic regression
    > cl <- makePSOCKcluster(7)
    > registerDoParallel(cl)
    > 
    > Grid <- expand.grid(.alpha = c(0,.2,.4),
    +                     .lambda = seq(.01,.2,length = 10))
    > 
    > set.seed(1024)
    > glmnet.fit <- train(x = df.train[,1:30],
    +                 y = df.train$Class,
    +                 method = 'glmnet',
    +                 tuneGrid = Grid,
    +                 metric = 'ROC',
    +                 trControl = ctrl) # glmnet model
    > 
    > View(glmnet.fit$results)
    > stopCluster(cl) # stop the parallel cluster
    > unregister_dopar()
    > cl <- makePSOCKcluster(7)
    > registerDoParallel(cl)
    > 
    > set.seed(1024)
    > rf.fit <- train(x = df.train[,1:30],
    +                 y = df.train$Class,
    +                 method = 'rf',
    +                 ntree = 1000,
    +                 tuneLength = 5,
    +                 trControl = ctrl,
    +                 metric = 'ROC') # random forest model
    > stopCluster(cl) # stop the parallel cluster
    > unregister_dopar()
    
    

    Cross-validated tuning results:
    glmnet (only part of the results shown):

    | alpha | lambda | ROC | Sens | Spec | Accuracy | Kappa |
    |-------|--------|-----------|-----------|-----------|-----------|-----------|
    | 0 | 0.010 | 0.9808633 | 0.9996000 | 0.788 | 0.989 | 0.8707 |
    | 0 | 0.031 | 0.9809492 | 0.9997333 | 0.7453453 | 0.9878005 | 0.8434635 |
    | 0 | 0.052 | 0.9805336 | 0.9997333 | 0.6884384 | 0.9851318 | 0.8018546 |
    | 0 | 0.073 | 0.9800202 | 0.9997333 | 0.6368619 | 0.9827171 | 0.7628736 |

    Random forest (only part of the results shown):

    | mtry | ROC | Sens | Spec | Accuracy | Kappa |
    |------|-------|--------|--------|----------|-------|
    | 2 | 0.975 | 0.9997 | 0.84 | 0.99 | 0.906 |
    | 9 | 0.973 | 0.9996 | 0.8481 | 0.992 | 0.909 |
    | 16 | 0.972 | 0.999 | 0.848 | 0.991 | 0.903 |
    | 23 | 0.970 | 0.9988 | 0.850 | 0.9918 | 0.902 |
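
    The winning parameter combination can also be pulled out of each fit programmatically; bestTune is a standard field on caret train objects:

    glmnet.fit$bestTune  # alpha/lambda pair with the best cross-validated ROC
    rf.fit$bestTune      # mtry value with the best cross-validated ROC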

    Model performance on the test set

    > Results <- df.test[,.(Class)]
    > 
    > Results[,lr.pred := predict(lr.fit,newdata = df.test[,1:30],type = 'raw')]
    > Results[,glmnet.pred := predict(glmnet.fit,newdata = df.test[,1:30],
    +              type = 'raw')]
    > Results[,rf.pred := predict(rf.fit,newdata = df.test[,1:30],type = 'raw')]
    > Results[,lr.prob := predict(lr.fit,newdata = df.test[,1:30],type = 'prob')[,2]]
    > Results[,glmet.prob := predict(glmnet.fit,newdata = df.test[,1:30],type = 'prob')[,2]]
    > Results[,rf.prob := predict(rf.fit,newdata = df.test[,1:30],type = 'prob')[,2]]
    > 
    > roc1 <- roc(Results$Class,Results$lr.prob,levels = c('fraud','nonfraud'))
    Setting direction: controls > cases
    > 
    > plot(roc1, print.auc=TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),
    +      grid.col=c("green", "red"), max.auc.polygon=TRUE,
    +      auc.polygon.col="lightblue", print.thres=TRUE)
    > 
    > roc2 <- roc(Results$Class,Results$glmet.prob,levels = c('fraud','nonfraud'))
    Setting direction: controls > cases
    > 
    > plot(roc2, print.auc=TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),
    +      grid.col=c("green", "red"), max.auc.polygon=TRUE,
    +      auc.polygon.col="lightblue", print.thres=TRUE)
    > 
    > roc3 <- roc(Results$Class,Results$rf.prob,levels = c('fraud','nonfraud'))
    Setting direction: controls > cases
    > 
    > plot(roc3, print.auc=TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),
    +      grid.col=c("green", "red"), max.auc.polygon=TRUE,
    +      auc.polygon.col="lightblue", print.thres=TRUE)
    > 
    
    

    Visualizing the ROC curves

    The best classification cutoff for each model is read off its ROC curve; print.thres = TRUE marks it on each plot.

    Figure 1: ROC curve for the logistic regression model (roc1)
    Figure 2: ROC curve for the glmnet model (roc2)
    Figure 3: ROC curve for the random forest model (roc3)
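
    The same cutoffs can also be extracted numerically instead of read off the plots. A minimal sketch using pROC's coords(); the 'best' target defaults to Youden's J statistic, which matches what print.thres = TRUE marks on the curves:

    coords(roc1, x = 'best', ret = c('threshold', 'sensitivity', 'specificity'))
    coords(roc2, x = 'best', ret = c('threshold', 'sensitivity', 'specificity'))
    coords(roc3, x = 'best', ret = c('threshold', 'sensitivity', 'specificity'))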

    Choosing the Classification Cutoff

    In credit card fraud detection we mostly care about the fraudulent transactions and want to identify them accurately. The FI() helper below scans cutoffs from 0.1 to 0.9 and returns the smallest cutoff that minimizes the false-negative rate on the fraud class; we then visualize the confusion matrices.

    > FI <- function(data,fit ,D ){
    +       z = predict(fit,data,type = 'prob')[,2]   # P(fraud) for every row
    +       ee = NULL
    +       for(threshold in seq(.1,.9,.01)){
    +         # predict 'fraud' when the probability clears the cutoff
    +         u = rep(levels(data[,eval(as.symbol(D))])[2],nrow(data))
    +         u[!(z >= threshold)] = levels(data[,eval(as.symbol(D))])[1]
    +         # false-negative rate: actual frauds labelled 'nonfraud'
    +         e = sum(u == 'nonfraud' & data[,eval(as.symbol(D))] == 'fraud') / 
    +              sum(data[,eval(as.symbol(D))] == 'fraud')
    +         ee = rbind(ee,c(threshold,e))
    +                                       }
    +         I = which(ee[,2] == min(ee[,2]))
    +         return(ee[min(I),])}   # smallest cutoff that minimizes the miss rate
    > 
    > temp1 = FI(df.test,lr.fit,'Class')
    > temp2 = FI(df.test,glmnet.fit,'Class')
    > temp3 = FI(df.test,rf.fit,'Class')
    > 
    > 
    > Results[,lr.pred := factor(ifelse(lr.prob < 0.1,'nonfraud','fraud'),
    +              levels = c('nonfraud','fraud'))]
    > 
    > 
    > confusionMatrix(Results$lr.pred,Results$Class)
    > 
    > Results[,glmnet.pred := factor(ifelse(glmet.prob < 0.1,'nonfraud','fraud'),
    +              levels = c('nonfraud','fraud'))]
    > 
    > 
    > confusionMatrix(Results$glmnet.pred,Results$Class)
    > 
    > Results[,rf.pred := factor(ifelse(rf.prob < 0.1,'nonfraud','fraud'),
    +              levels = c('nonfraud','fraud'))]
    > 
    > 
    > confusionMatrix(Results$rf.pred,Results$Class)
    
    

    Visualizing the confusion matrices

    > a1 = data.frame(Reference = rep(c('nonfraud','fraud'),each = 2),
    +                 Prediction = rep(c('nonfraud','fraud'),2))
    > p1 = confusionMatrix(Results$lr.pred,Results$Class)$table
    > p2 = confusionMatrix(Results$glmnet.pred,Results$Class)$table
    > p3 = confusionMatrix(Results$rf.pred,Results$Class)$table
    > 
    > for(i in 1:4){
    +   a1$value1[i] = p1[a1[i,1],a1[i,2]]
    +   a1$value2[i] = p2[a1[i,1],a1[i,2]]
    +   a1$value3[i] = p3[a1[i,1],a1[i,2]]}
    > a1$Reference = as.factor(a1$Reference)
    > a1$Prediction = as.factor(a1$Prediction)
    > 
    > m1 = ggplot2::ggplot(a1, aes(x=Reference, y=Prediction, fill=value1)) +
    +   geom_tile() + theme_bw() + coord_equal() +
    +   geom_text(aes(x=Reference, y=Prediction, label=value1))+
    +   scale_fill_distiller(palette="Greens", direction=1)
    > m2 = ggplot2::ggplot(a1, aes(x=Reference, y=Prediction, fill=value2)) +
    +   geom_tile() + theme_bw() + coord_equal() +
    +   geom_text(aes(x=Reference, y=Prediction, label=value2))+
    +   scale_fill_distiller(palette="Greens", direction=1)
    > m3 = ggplot2::ggplot(a1, aes(x=Reference, y=Prediction, fill=value3)) +
    +   geom_tile() + theme_bw() + coord_equal() +
    +   geom_text(aes(x=Reference, y=Prediction, label=value3))+
    +   scale_fill_distiller(palette="Greens", direction=1)
    
    > Rmisc::multiplot(plotlist = list(m1,m2,m3)[1:3], cols = 2) 
    

    Sampling Methods

    If the class imbalance is known a priori, the most direct way to reduce its impact on the model is to collect a training set in which the classes appear in roughly equal proportions. When that is not possible, post-hoc sampling of the training set can mitigate the effects of class imbalance: down-sampling and up-sampling. Up-sampling is any technique that improves class balance by simulating or adding extra data; down-sampling is any technique that improves class balance by removing samples. A hand-rolled sketch follows before the library calls.
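
    To make down-sampling concrete, here is a minimal hand-rolled sketch; caret's downSample() used below does the equivalent, and the variable names here are illustrative only:

    set.seed(1237)
    n.fraud <- sum(df.train$Class == 'fraud')
    # keep a random majority-class subset the same size as the minority class
    keep <- sample(which(df.train$Class == 'nonfraud'), n.fraud)
    balanced <- df.train[c(keep, which(df.train$Class == 'fraud')), ]
    table(balanced$Class)  # both classes now have n.fraud rows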

    > temp = as.data.frame(df.train)
    > class(temp)
    [1] "data.frame"
    > set.seed(1237)
    > downsampled <- downSample(temp[, -ncol(temp)], temp$Class,
    +                            yname = 'Class')
    > 
    > set.seed(1237)
    > upsampled <- upSample(temp[, -ncol(temp)], temp$Class)
    > 
    > library(DMwR)
    > set.seed(1237)
    > smoted <- SMOTE(Class ~ ., data = df.train)
    > 
    > cl <- makePSOCKcluster(7)
    > registerDoParallel(cl)
    > 
    > set.seed(1410)
    > rfDown <- train(Class ~ ., data = downsampled,
    +                 "rf",
    +                 trControl = ctrl,
    +                 ntree = 1500,
    +                 tuneLength = 5,
    +                 metric = "ROC") # 向下抽样
    > stopCluster(cl) 
    > unregister_dopar()
    > 
    > set.seed(1410)
    > rfUp <- train(Class ~ ., data = upsampled,
    +               "rf",
    +               trControl = ctrl,
    +               ntree = 1500,
    +               tuneLength = 5,
    +               metric = "ROC")
    > 
    > set.seed(1410)
    > rfSmote <- train(Class ~ ., data = smoted,
    +                  "rf",
    +                  trControl = ctrl,
    +                  ntree = 1500,
    +                  tuneLength = 5,
    +                  metric = "ROC")
    > # test-set class predictions for the three resampled models
    > # (this step is implied by the confusionMatrix calls below)
    > Results[,rf.down.pred := predict(rfDown,newdata = df.test[,1:30],type = 'raw')]
    > Results[,rfUp.pred := predict(rfUp,newdata = df.test[,1:30],type = 'raw')]
    > Results[,rfSmote.pred := predict(rfSmote,newdata = df.test[,1:30],type = 'raw')]
    > 
    > a1 = data.frame(Reference = rep(c('nonfraud','fraud'),each = 2),
    +                 Prediction = rep(c('nonfraud','fraud'),2))
    > p1 = confusionMatrix(Results$rf.down.pred,Results$Class)$table
    > p2 = confusionMatrix(Results$rfUp.pred,Results$Class)$table
    > p3 = confusionMatrix(Results$rfSmote.pred,Results$Class)$table
    > 
    > 
    > for(i in 1:4){
    +   a1$value1[i] = p1[a1[i,1],a1[i,2]]
    +   a1$value2[i] = p2[a1[i,1],a1[i,2]]
    +   a1$value3[i] = p3[a1[i,1],a1[i,2]]}
    > 
    > Rmisc::multiplot(plotlist = list(ggplot2::ggplot(a1, aes(x=Reference, y=Prediction, fill=value1)) +
    +   geom_tile() + theme_bw() + coord_equal() +
    +   geom_text(aes(x=Reference, y=Prediction, label=value1))+
    +   scale_fill_distiller(palette="Greens", direction=1),
    +   ggplot2::ggplot(a1, aes(x=Reference, y=Prediction, fill=value2)) +
    +   geom_tile() + theme_bw() + coord_equal() +
    +   geom_text(aes(x=Reference, y=Prediction, label=value2))+
    +   scale_fill_distiller(palette="Greens", direction=1),
    +   ggplot2::ggplot(a1, aes(x=Reference, y=Prediction, fill=value3)) +
    +   geom_tile() + theme_bw() + coord_equal() +
    +   geom_text(aes(x=Reference, y=Prediction, label=value3))+
    +   scale_fill_distiller(palette="Greens", direction=1)
    +                           )[1:3], cols = 2) 
    
    
    

    The confusion matrices show that random forests trained on the down-sampled, up-sampled, and SMOTE data all predict the test set well; the up-sampled model separates the positive and negative classes completely.
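
    The three resampled models can also be compared on the test set by AUC. A minimal sketch; note that probabilities from models trained on duplicated or synthetic minority cases can be optimistic, so the perfect separation above should be read with care:

    roc.down  <- roc(Results$Class, predict(rfDown,  df.test[,1:30], type = 'prob')[,2])
    roc.up    <- roc(Results$Class, predict(rfUp,    df.test[,1:30], type = 'prob')[,2])
    roc.smote <- roc(Results$Class, predict(rfSmote, df.test[,1:30], type = 'prob')[,2])
    c(down = auc(roc.down), up = auc(roc.up), smote = auc(roc.smote))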

    Cost-Sensitive Training

    In real classification problems, the goal is usually tied to specific business metrics or a cost function, rather than to generic performance measures of the model (such as accuracy) alone. As a first step, the SVM below is tuned on Kappa instead of ROC; a class-weighted sketch follows the snippet.

    > sigma <- sigest(Class ~ ., data = df.train) # estimate sigma for the radial kernel
    > # (the sigest() call is missing from the original transcript; it is the usual source of 'sigma')
    > svmGrid1 <- data.frame(sigma = sigma[2],
    +                        C = 2^c(2:10))
    > 
    > set.seed(1401)
    > svmFit <- train(Class ~ .,
    +                 data = df.train,
    +                 method = "svmRadial",
    +                 tuneGrid = svmGrid1,
    +                 preProc = c("center", "scale"),
    +                 metric = "Kappa",
    +                 trControl = ctrl)
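
    The snippet above only tunes a standard radial SVM on Kappa; the cost-sensitive step itself is not shown in the original transcript. One common approach is caret's svmRadialWeights method, which adds a Weight tuning parameter that penalizes errors on the positive class more heavily. A minimal sketch under that assumption; the grid values are illustrative:

    # Hypothetical continuation: up-weight errors on the minority (fraud) class.
    svmGrid2 <- expand.grid(sigma = sigma[2],
                            C = 2^c(2:10),
                            Weight = c(1, 2, 5, 10))  # cost multiplier for the positive class

    set.seed(1401)
    svmWtFit <- train(Class ~ .,
                      data = df.train,
                      method = 'svmRadialWeights',
                      tuneGrid = svmGrid2,
                      preProc = c('center', 'scale'),
                      metric = 'Kappa',
                      trControl = ctrl)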
    

    Summary

    That wraps up today's machine learning case study on credit card fraud detection (modeling severely imbalanced data). Given the limits of my own knowledge, criticism and corrections from readers are very welcome.

    Original article: https://blog.csdn.net/weixin_43217641/article/details/126102820