本次分享的项目为银行中信用卡欺诈识别(严重类失衡数据建模),您将在此篇博客中了解到解决严重类失衡数据的方法:
R语言模型调优
分类截点的选取
抽样方法
成本敏感度训练
字段名称 | 类型 | 描述 |
---|---|---|
time | 连续型变量 | 数据集中第一条记录与本条记录的时间差值(seconds elapsed),秒为单位 |
V1到V28 | 连续型变量 | 不同的主成分 |
amount | 连续型变量 | 交易记录的金额 |
class | 分类型变量 | 类别是否为欺诈:1 是,0 否 |
介于合规问题,没有办法展示原始的数据,我们提前对数据进行了
PCA
处理,一共选取了28个主成分来建立模型。对于时间变量与交易金额变量没有做处理。
数据中一共有284807
行数据,预测变量有30
个,本篇博客只是为了展示严重类失衡问题的处理方法
,考虑到用R
的小伙伴在处理大规模数据时,并行计算的核数较低,在模型方面花费时间较长,所以只选取了10492
条数据进行建模!
响应变量的可视化
这里我们使用了
逻辑回归、glmnet、随机森林
模型进行预测,展示三种模型的调优过程。
glmnet
:调整参数: α = 0 , 0.2 , 0.4 \alpha=0,0.2,0.4 α=0,0.2,0.4;
λ = 0.01 , 0.031 , 0.052 , 0.073 , 0.094 , 0.115 , 0.136 , 0.157 , 0.178 , 0.2 \lambda=0.01, 0.031, 0.052, 0.073, 0.094, 0.115,0.136, 0.157, 0.178,0.2 λ=0.01,0.031,0.052,0.073,0.094,0.115,0.136,0.157,0.178,0.2
随机森林
:调整参数:树的数量为1000
,每棵树随机选取的成分数量为:2,9,16,23,30
.
> library(caret)
> library(class)
> library(data.table)
> library(Hmisc)
> library(pROC)
> library(RColorBrewer)
> library(DMwR)
> library(kernlab)
> library(doParallel)
> df <- fread('creditcard.csv')
|--------------------------------------------------|
|==================================================|
|--------------------------------------------------|
|==================================================|
> training1 <- sample(which(df$Class == 0),10000,replace = F)
> df <- df[c(training1,which(df$Class == 1)),]
> table(df$Class)
0 1
10000 492
> df[,Class := factor(as.character(Class),levels = c('0','1'),
+ labels = c('nonfraud','fraud'))] # 因子化
>
> training <- createDataPartition(df$Class,p = .75)[[1]]
> df.train <- df[ training,]
> df.test <- df[ !training,]
>
> dim(df.train)
[1] 7869 31
> dim(df.test)
[1] 2623 31
> table(df.train$Class)
nonfraud fraud
7500 369
> table(df.test$Class)
nonfraud fraud
2500 123
> fiveStats <- function(...) c(twoClassSummary(...),defaultSummary(...))
> unregister_dopar <- function() {
+ env <- foreach:::.foreachGlobals
+ rm(list=ls(name=env), pos=env)
+ }
>
> df.train[,Class:=factor(Class)]
> df.test[,Class:=factor(Class)]
> unregister_dopar()
> ctrl <- trainControl(method = 'cv',classProbs = TRUE,
+ summaryFunction = fiveStats,
+ verboseIter = F)
>
> set.seed(1024)
> lr.fit <- train(Class ~.,data = df.train,
+ method = 'glm',trControl = ctrl,metric = 'ROC')
> cl <- makePSOCKcluster(7)
> registerDoParallel(cl)
>
> Grid <- expand.grid(.alpha = c(0,.2,.4),
+ .lambda = seq(.01,.2,length = 10))
>
> set.seed(1024)
> glmnet.fit <- train(x = df.train[,1:30],
+ y = df.train$Class,
+ method = 'glmnet',
+ tuneGrid = Grid,
+ metric = 'ROC',
+ trControl = ctrl) # glmnet模型
>
> View(glmnet.fit$results)
> stopCluster(cl) # 并行计算
> unregister_dopar()
> cl <- makePSOCKcluster(7)
> registerDoParallel(cl)
>
> set.seed(1024)
> rf.fit <- train(x = df.train[,1:30],
+ y = df.train$Class,
+ method = 'rf',
+ ntree = 1000,
+ tuneLength = 5,
+ trControl = ctrl,
+ metric = 'ROC') # 随机森林模型
>
模型调优表现:
glmnet(只截取了部分数据)
alpha | lambda | ROC | Sens | Spec | Accuracy | Kappa |
---|---|---|---|---|---|---|
0 | 0.01 | 0.9808633 | 0.9996000 | 0.788 | 0.989 | 0.8707 |
0 | 0.031 | 0.9809492 | 0.9997333 | 0.7453453 | 0.9878005 | 0.8434635 |
0 | 0.052 | 0.9805336 | 0.9997333 | 0.6884384 | 0.9851318 | 0.8018546 |
0 | 0.073 | 0.9800202 | 0.9997333 | 0.6368619 | 0.9827171 | 0.7628736 |
随机森林(只截取了部分数据)
mtry | ROC | Sens | Spec | Accuracy | Kappa |
---|---|---|---|---|---|
2 | 0.975 | 0.9997 | 0.84 | 0.99 | 0.906 |
9 | 0.973 | 0.9996 | 0.8481 | 0.992 | 0.909 |
16 | 0.972 | 0.999 | 0.848 | 0.991 | 0.903 |
23 | 0.970 | 0.9988 | 0.850 | 0.9918 | 0.902 |
模型在测试集上的表现
> Results <- df.test[,.(Class)]
>
> Results[,lr.pred := predict(lr.fit,newdata = df.test[,1:30],type = 'raw')]
> Results[,glmnet.pred := predict(glmnet.fit,newdata = df.test[,1:30],
+ type = 'raw')]
> Results[,rf.pred := predict(rf.fit,newdata = df.test[,1:30],type = 'raw')]
> Results[,lr.prob := predict(lr.fit,newdata = df.test[,1:30],type = 'prob')[,2]]
> Results[,glmet.prob := predict(glmnet.fit,newdata = df.test[,1:30],type = 'prob')[,2]]
> Results[,rf.prob := predict(rf.fit,newdata = df.test[,1:30],type = 'prob')[,2]]
>
> roc1 <- roc(Results$Class,Results$lr.prob,levels = c('fraud','nonfraud'))
Setting direction: controls > cases
>
> plot(roc1, print.auc=TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),
+ grid.col=c("green", "red"), max.auc.polygon=TRUE,
+ auc.polygon.col="lightblue", print.thres=TRUE)
>
> roc2 <- roc(Results$Class,Results$glmet.prob,levels = c('fraud','nonfraud'))
Setting direction: controls > cases
>
> plot(roc2, print.auc=TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),
+ grid.col=c("green", "red"), max.auc.polygon=TRUE,
+ auc.polygon.col="lightblue", print.thres=TRUE)
>
> roc3 <- roc(Results$Class,Results$rf.prob,levels = c('fraud','nonfraud'))
Setting direction: controls > cases
>
> plot(roc3, print.auc=TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),
+ grid.col=c("green", "red"), max.auc.polygon=TRUE,
+ auc.polygon.col="lightblue", print.thres=TRUE)
>
可视化ROC曲线
通过三个模型的ROC曲线得到最优的分类截点。
|
|
|
在信用卡欺诈识别中我们可能更为关注的是具有欺诈行为的交易记录,希望可以准确识别出这些交易记录。接下来可视化混淆矩阵。
> FI <- function(data,fit ,D ){
+ z = predict(fit,data,type = 'prob')[,2]
+ ee = NULL
+ for(threshold in seq(.1,.9,.01)){
+ u = rep(levels(data[,eval(as.symbol(D))])[2],nrow(data))
+ u[!(z >= threshold)] = levels(data[,eval(as.symbol(D))])[1]
+ e = sum(u == 'nonfraud' & data[,eval(as.symbol(D))] == 'fraud') /
+ sum(data[,eval(as.symbol(D))] == 'fraud')
+ ee = rbind(ee,c(threshold,e))
+ }
+ I = which(ee[,2] == min(ee[,2]))
+ return(ee[min(I),])}
>
> temp1 = FI(df.test,lr.fit,'Class')
> temp2 = FI(df.test,glmnet.fit,'Class')
> temp3 = FI(df.test,rf.fit,'Class')
>
>
> Results[,lr.pred := factor(ifelse(lr.prob < 0.1,'nonfraud','fraud'),
+ levels = c('nonfraud','fraud'))]
>
>
> confusionMatrix(Results$lr.pred,Results$Class)
>
> Results[,glmnet.pred := factor(ifelse(glmet.prob < 0.1,'nonfraud','fraud'),
+ levels = c('nonfraud','fraud'))]
>
>
> confusionMatrix(Results$glmnet.pred,Results$Class)
>
> Results[,rf.pred := factor(ifelse(rf.prob < 0.1,'nonfraud','fraud'),
+ levels = c('nonfraud','fraud'))]
>
>
> confusionMatrix(Results$rf.pred,Results$Class)
可视化混淆矩阵
> a1 = data.frame(Reference = rep(c('nonfraud','fraud'),each = 2),
+ Prediction = rep(c('nonfraud','fraud'),2))
> p1 = confusionMatrix(Results$lr.pred,Results$Class)$table
> p2 = confusionMatrix(Results$glmnet.pred,Results$Class)$table
> p3 = confusionMatrix(Results$rf.pred,Results$Class)$table
>
> for(i in 1:4){
+ a1$value1[i] = p1[a1[i,1],a1[i,2]]
+ a1$value2[i] = p2[a1[i,1],a1[i,2]]
+ a1$value3[i] = p3[a1[i,1],a1[i,2]]}
> a1$Reference = as.factor(a1$Reference)
> a1$Prediction = as.factor(a1$Prediction)
>
> m1 = ggplot2::ggplot(a1, aes(x=Reference, y=Prediction, fill=value1)) +
+ geom_tile() + theme_bw() + coord_equal() +
+ geom_text(aes(x=Reference, y=Prediction, label=value1))+
+ scale_fill_distiller(palette="Greens", direction=1)
> m2 = ggplot2::ggplot(a1, aes(x=Reference, y=Prediction, fill=value2)) +
+ geom_tile() + theme_bw() + coord_equal() +
+ geom_text(aes(x=Reference, y=Prediction, label=value2))+
+ scale_fill_distiller(palette="Greens", direction=1)
> m3 = ggplot2::ggplot(a1, aes(x=Reference, y=Prediction, fill=value3)) +
+ geom_tile() + theme_bw() + coord_equal() +
+ geom_text(aes(x=Reference, y=Prediction, label=value1))+
+ scale_fill_distiller(palette="Greens", direction=1)
> Rmisc::multiplot(plotlist = list(m1,m2,m3)[1:3], cols = 2)
如果事先拥有类失衡的先验信息,直接减少模型影响的方法是选择一个训练集,使得该训练集样本中的各类样本比例大致相同。如果无法事先抽取样本,采用事后抽取样本的方法缓解类失衡问题带来的影响:向下抽样与向上抽样。向上抽样是任何通过随机模拟或添加额外的数据提高类平衡的技术。向下抽样是任何通过减少样本来提高类平衡的技术。
> temp = as.data.frame(df.train)
> class(temp)
[1] "data.frame"
> set.seed(1237)
> downsampled <- downSample(temp[, -ncol(temp)], temp$Class,
+ yname = 'Class')
>
> set.seed(1237)
> upsampled <- upSample(temp[, -ncol(temp)], temp$Class)
>
> library(DMwR)
> set.seed(1237)
> smoted <- SMOTE(Class ~ ., data = df.train)
>
> cl <- makePSOCKcluster(7)
> registerDoParallel(cl)
>
> set.seed(1410)
> rfDown <- train(Class ~ ., data = downsampled,
+ "rf",
+ trControl = ctrl,
+ ntree = 1500,
+ tuneLength = 5,
+ metric = "ROC") # 向下抽样
> stopCluster(cl)
> unregister_dopar()
>
> set.seed(1410)
> rfUp <- train(Class ~ ., data = upsampled,
+ "rf",
+ trControl = ctrl,
+ ntree = 1500,
+ tuneLength = 5,
+ metric = "ROC")
>
> set.seed(1410)
> rfSmote <- train(Class ~ ., data = smoted,
+ "rf",
+ trControl = ctrl,
+ ntree = 1500,
+ tuneLength = 5,
+ metric = "ROC")
>
> a1 = data.frame(Reference = rep(c('nonfraud','fraud'),each = 2),
+ Prediction = rep(c('nonfraud','fraud'),2))
> p1 = confusionMatrix(Results$rf.down.pred,Results$Class)$table
> p2 = confusionMatrix(Results$rfUp.pred,Results$Class)$table
> p3 = confusionMatrix(Results$rfSmote.pred,Results$Class)$table
>
>
> for(i in 1:4){
+ a1$value1[i] = p1[a1[i,1],a1[i,2]]
+ a1$value2[i] = p2[a1[i,1],a1[i,2]]
+ a1$value3[i] = p3[a1[i,1],a1[i,2]]}
>
> Rmisc::multiplot(plotlist = list(ggplot2::ggplot(a1, aes(x=Reference, y=Prediction, fill=value1)) +
+ geom_tile() + theme_bw() + coord_equal() +
+ geom_text(aes(x=Reference, y=Prediction, label=value1))+
+ scale_fill_distiller(palette="Greens", direction=1),
+ ggplot2::ggplot(a1, aes(x=Reference, y=Prediction, fill=value2)) +
+ geom_tile() + theme_bw() + coord_equal() +
+ geom_text(aes(x=Reference, y=Prediction, label=value2))+
+ scale_fill_distiller(palette="Greens", direction=1),
+ ggplot2::ggplot(a1, aes(x=Reference, y=Prediction, fill=value3)) +
+ geom_tile() + theme_bw() + coord_equal() +
+ geom_text(aes(x=Reference, y=Prediction, label=value3))+
+ scale_fill_distiller(palette="Greens", direction=1)
+ )[1:3], cols = 2)
可以通过混淆矩阵看到使用
向下抽样、向上抽样和SMOTE
结合随机森林
训练出来的模型对测试集起到了良好的预测效果。向上抽样完全的将正类样本与负类样本完全分开。
在分类的实际问题中更多的是与业务问题的直某些指标或者成本函数联系在一起,不再单独地去关注模型的表现度量(比如精确度等等)
> svmGrid1 <- data.frame(sigma = sigma[2],
+ C = 2^c(2:10))
>
> set.seed(1401)
> svmFit <- train(Class ~ .,
+ data = df.train,
+ method = "svmRadial",
+ tuneGrid = svmGrid1,
+ preProc = c("center", "scale"),
+ metric = "Kappa",
+ trControl = ctrl)
今天的机器学习案例之信用卡欺诈识别(严重类失衡数据建模)就分享到这里,介于本人水平有限,欢迎各位老师批评指正。