[A Xu Machine Learning in Practice] [24] Predicting Credit Card Customer Churn

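The [A Xu Machine Learning in Practice] series introduces machine learning algorithms and models together with hands-on case studies. Likes and follows are welcome; let's learn together.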

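This article works with a real, anonymized credit card dataset from a foreign bank and builds a model to determine whether a customer has churned. It covers feature processing and the building and evaluation of a classification model.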

Problem description

Given a real, anonymized dataset from a foreign bank, build a model that determines whether a customer has churned.

1. Read the data and separate features from labels

import pandas as pd
import numpy as np

# Read the training and test data
train_data = pd.read_csv('./Churn-Modelling.csv')
test_data = pd.read_csv('./Churn-Modelling-Test-Data.csv')

x_train = train_data.iloc[:,:-1]
y_train = train_data.iloc[:,-1].astype(int)
x_test = test_data.iloc[:,:-1]
y_test = test_data.iloc[:,-1].astype(int)

x_train.head()

	RowNumber	CustomerId	Surname	CreditScore	Geography	Gender	Age	Tenure	Balance	NumOfProducts	HasCrCard	IsActiveMember	EstimatedSalary
0	1	15634602	Hargrave	619	France	Female	42	2	0.00	1	1	1	101348.88
1	2	15647311	Hill	608	Spain	Female	41	1	83807.86	1	0	1	112542.58
2	3	15619304	Onio	502	France	Female	42	8	159660.80	3	1	0	113931.57
3	4	15701354	Boni	699	France	Female	39	1	0.00	2	0	0	93826.63
4	5	15737888	Mitchell	850	Spain	Female	43	2	125510.82	1	1	1	79084.10

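Column descriptions:
RowNumber: row number
CustomerId: customer ID
Surname: customer surname
CreditScore: credit score
Geography: customer's country or region
Gender: customer's gender
Age: age
Tenure: number of years as a customer of this bank
Balance: account balance
NumOfProducts: number of bank products the customer uses
HasCrCard: whether the customer holds this bank's credit card
IsActiveMember: whether the customer is an active member
EstimatedSalary: estimated salary
Exited: whether the customer has churned (1 = churned, 0 = stayed); this is the label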
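Before modeling, it can help to check how the label is distributed, since a skewed label makes plain accuracy hard to interpret. The sketch below is only an illustration: `toy` is a hypothetical five-row frame built from the first rows shown above, not the full dataset, and it selects the label by name rather than by position.

```python
import pandas as pd

# Hypothetical miniature stand-in for the training data (first five rows only).
toy = pd.DataFrame({
    "CreditScore": [619, 608, 502, 699, 850],
    "Exited": [1, 0, 1, 0, 0],
})

# Select the label by column name instead of relying on it being the last column.
features = toy.drop(columns=["Exited"])
labels = toy["Exited"].astype(int)

# Share of each class; a strong imbalance makes raw accuracy misleading.
class_share = labels.value_counts(normalize=True).sort_index()
print(class_share)
```

On the real files, the same `value_counts(normalize=True)` call on `y_train` and `y_test` would show the churn share in each split.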

2. Feature engineering

2.1 Remove uninformative features

# Drop the first three columns (RowNumber, CustomerId, Surname), which carry no predictive information
x_train = x_train.drop(labels=x_train.columns[[0,1,2]], axis=1)
x_test = x_test.drop(labels=x_test.columns[[0,1,2]], axis=1)

x_train.head()

	CreditScore	Geography	Gender	Age	Tenure	Balance	NumOfProducts	HasCrCard	IsActiveMember	EstimatedSalary
0	619	France	Female	42	2	0.00	1	1	1	101348.88
1	608	Spain	Female	41	1	83807.86	1	0	1	112542.58
2	502	France	Female	42	8	159660.80	3	1	0	113931.57
3	699	France	Female	39	1	0.00	2	0	0	93826.63
4	850	Spain	Female	43	2	125510.82	1	1	1	79084.10

y_train[:5]

0    1
1    0
2    1
3    0
4    0
Name: Exited, dtype: int32
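Dropping columns by position works for this file, but it silently removes the wrong columns if the layout ever changes. As a possible alternative, the sketch below drops the identifier columns by name. The two-row `toy` frame and the `id_columns` list are illustrative assumptions built from the rows shown earlier.

```python
import pandas as pd

# Hypothetical two-row sample with the same leading columns as the dataset.
toy = pd.DataFrame({
    "RowNumber": [1, 2],
    "CustomerId": [15634602, 15647311],
    "Surname": ["Hargrave", "Hill"],
    "CreditScore": [619, 608],
    "Geography": ["France", "Spain"],
})

# Name the identifier columns explicitly so the intent is visible in the code.
id_columns = ["RowNumber", "CustomerId", "Surname"]
cleaned = toy.drop(columns=id_columns)
print(cleaned.columns.tolist())
```

If one of the named columns is missing, `drop` raises a `KeyError`, which surfaces a schema change immediately instead of hiding it.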

2.2 Encode the string features

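# Geography and Gender are non-numeric columns; encode them as integers with LabelEncoder.
# Each encoder is fitted on the training set only and then applied to the test set.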
from sklearn.preprocessing import LabelEncoder
Lb1 = LabelEncoder()
x_train.iloc[:,1] = Lb1.fit_transform(x_train.iloc[:,1])
x_test.iloc[:,1] = Lb1.transform(x_test.iloc[:,1])
Lb2 = LabelEncoder()
x_train.iloc[:,2] = Lb2.fit_transform(x_train.iloc[:,2])
x_test.iloc[:,2] = Lb2.transform(x_test.iloc[:,2])

x_train[:5]

	CreditScore	Geography	Gender	Age	Tenure	Balance	NumOfProducts	HasCrCard	IsActiveMember	EstimatedSalary
0	619	0	0	42	2	0.00	1	1	1	101348.88
1	608	2	0	41	1	83807.86	1	0	1	112542.58
2	502	0	0	42	8	159660.80	3	1	0	113931.57
3	699	0	0	39	1	0.00	2	0	0	93826.63
4	850	2	0	43	2	125510.82	1	1	1	79084.10

x_train.info()


RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
CreditScore        10000 non-null int64
Geography          10000 non-null int64
Gender             10000 non-null int64
Age                10000 non-null int64
Tenure             10000 non-null int64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
dtypes: float64(2), int64(8)
memory usage: 781.3 KB
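One caveat with the integer codes from section 2.2: `LabelEncoder` maps Geography to 0, 1, 2, and a linear model such as logistic regression then treats those codes as an ordered quantity. A common alternative is one-hot encoding. The sketch below is a minimal illustration on made-up rows; the category "Germany" is an assumption (the code 2 for Spain suggests a third country sorts between France and Spain), and the frames `train_geo` and `test_geo` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical samples; "Germany" is an assumed third category.
train_geo = pd.DataFrame({"Geography": ["France", "Spain", "Germany", "France"]})
test_geo = pd.DataFrame({"Geography": ["Germany", "Spain"]})

# Fit on the training data only, then reuse the same mapping for the test data.
# handle_unknown="ignore" encodes unseen categories as an all-zero row.
encoder = OneHotEncoder(handle_unknown="ignore")
train_onehot = encoder.fit_transform(train_geo).toarray()
test_onehot = encoder.transform(test_geo).toarray()

print(encoder.categories_[0])  # categories in sorted order
print(train_onehot)
```

Each country becomes its own 0/1 column, so the model no longer sees an artificial ordering between countries. For the two-valued Gender column, a single 0/1 code already carries all the information.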

2.3 Standardize the feature data

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

x_train[:5]

array([[-0.32622142, -0.90188624, -1.09598752,  0.29351742, -1.04175968,
        -1.22584767, -0.91158349,  0.64609167,  0.97024255,  0.02188649],
       [-0.44003595,  1.51506738, -1.09598752,  0.19816383, -1.38753759,
         0.11735002, -0.91158349, -1.54776799,  0.97024255,  0.21653375],
       [-1.53679418, -0.90188624, -1.09598752,  0.29351742,  1.03290776,
         1.33305335,  2.52705662,  0.64609167, -1.03067011,  0.2406869 ],
       [ 0.50152063, -0.90188624, -1.09598752,  0.00745665, -1.38753759,
        -1.22584767,  0.80773656, -1.54776799, -1.03067011, -0.10891792],
       [ 2.06388377,  1.51506738, -1.09598752,  0.38887101, -1.04175968,
         0.7857279 , -0.91158349,  0.64609167,  0.97024255, -0.36527578]])
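To make the transformation in section 2.3 concrete: `StandardScaler` subtracts each column's training mean and divides by its training standard deviation, and the test set is transformed with those same training statistics. The sketch below checks this by hand on a single column. The five training ages are taken from the rows shown earlier; the test age of 50 is a made-up value.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Ages from the first five training rows; the test value is hypothetical.
train_age = np.array([[42.0], [41.0], [42.0], [39.0], [43.0]])
test_age = np.array([[50.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_age)  # learns mean and std from train
test_scaled = scaler.transform(test_age)        # reuses the training statistics

# Manual z-score: (x - mean) / std, using the population std (ddof=0).
manual = (train_age - train_age.mean(axis=0)) / train_age.std(axis=0)
print(np.allclose(train_scaled, manual))
```

Fitting the scaler on the training set only, as the article does, keeps information about the test set from leaking into preprocessing.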

3. Modeling, prediction, and evaluation

# Build the model with logistic regression
from sklearn.linear_model import LogisticRegression

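from sklearn.linear_model import SGDClassifier  # required by the sgd line below

lr = LogisticRegression()
sgd = SGDClassifier()  # instantiated here but not trained or evaluated in this article
lr.fit(x_train, y_train)
lr_y_predict = lr.predict(x_test)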

# Use the logistic regression model's built-in score function to get accuracy on the test and training sets
print('LogisticRegression test accuracy:', lr.score(x_test, y_test))
print('LogisticRegression training accuracy:', lr.score(x_train, y_train))

LogisticRegression test accuracy: 0.761
LogisticRegression training accuracy: 0.809

from sklearn.metrics import classification_report
# Use classification_report to get precision, recall, and F1 for LogisticRegression.
# target_names follow the sorted label order: the first name is label 0 (stayed), the second is label 1 (churned).
print(classification_report(y_test, lr_y_predict, target_names=['NotExited', 'Exited']))

             precision    recall  f1-score   support

  NotExited       0.77      0.97      0.86       740
     Exited       0.68      0.15      0.25       260

avg / total       0.74      0.76      0.70      1000
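The report can be cross-checked against the accuracy printed earlier, and compared with a trivial baseline, using only the support and recall values it lists. The short calculation below is a sketch of that arithmetic; the dictionary keys `majority` and `minority` are just illustrative names for the 740-row and 260-row classes.

```python
# Support and recall per class, copied from the classification report.
support = {"majority": 740, "minority": 260}
recall = {"majority": 0.97, "minority": 0.15}

total = sum(support.values())

# Accuracy is the support-weighted average of the per-class recalls.
correct = sum(recall[name] * support[name] for name in support)
accuracy_from_report = correct / total

# Baseline: always predict the majority class.
majority_baseline = max(support.values()) / total

print(round(accuracy_from_report, 2), round(majority_baseline, 2))  # 0.76 0.74
```

The reconstructed accuracy matches the reported 0.761 after rounding, and it is only about two percentage points above what a model that always predicts the 740-row class would score.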

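The results show that the model reaches only about 76% accuracy on the test set, and recall on the 260-sample class is just 0.15, so there is clear room for improvement.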
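One possible direction, not something the article itself evaluates, is to lower the decision threshold so that more customers are flagged as likely to churn. The sketch below illustrates the idea on synthetic data from `make_classification`. The data, the 0.3 threshold, and all variable names are assumptions for illustration, so the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 25% positives (an assumed imbalance).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.75, 0.25], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Predicted probability of the positive (churn-like) class.
churn_proba = model.predict_proba(X_te)[:, 1]

# Recall at the usual 0.5 cutoff versus a lower, hypothetical 0.3 cutoff.
recall_default = recall_score(y_te, (churn_proba >= 0.5).astype(int))
recall_lowered = recall_score(y_te, (churn_proba >= 0.3).astype(int))
print(f"recall at 0.5: {recall_default:.2f}, recall at 0.3: {recall_lowered:.2f}")
```

Lowering the threshold can only keep or raise recall for the positive class, usually at the cost of precision, so the cutoff should be chosen on validation data. Passing `class_weight='balanced'` to `LogisticRegression` is another option worth trying.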



Original article: https://blog.csdn.net/qq_42589613/article/details/127768719