• [A Xu's Machine Learning in Practice] [24] Credit Card Customer Churn Prediction in Practice


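    The [A Xu's Machine Learning in Practice] series introduces machine learning algorithms and models together with hands-on case studies. Likes and follows are welcome, so we can learn and exchange ideas together.

    This article works with an anonymized, real-world credit card dataset from abroad and builds a model to decide whether a customer has churned. It covers feature processing and the building and evaluation of a classification model.

    Problem description

    Given an anonymized real-world dataset from abroad, build a model that predicts whether a customer has churned.

    1. Read the data and separate features from labels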

    import pandas as pd
    import numpy as np
    
    # Read the data
    train_data = pd.read_csv('./Churn-Modelling.csv')
    test_data = pd.read_csv('./Churn-Modelling-Test-Data.csv')
    
    x_train = train_data.iloc[:,:-1]
    y_train = train_data.iloc[:,-1].astype(int)
    x_test = test_data.iloc[:,:-1]
    y_test = test_data.iloc[:,-1].astype(int)
    
    x_train.head()
    
       RowNumber  CustomerId   Surname  CreditScore  Geography  Gender  Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary
    0          1    15634602  Hargrave          619     France  Female   42       2       0.00              1          1               1        101348.88
    1          2    15647311      Hill          608      Spain  Female   41       1   83807.86              1          0               1        112542.58
    2          3    15619304      Onio          502     France  Female   42       8  159660.80              3          1               0        113931.57
    3          4    15701354      Boni          699     France  Female   39       1       0.00              2          0               0         93826.63
    4          5    15737888  Mitchell          850      Spain  Female   43       2  125510.82              1          1               1         79084.10

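    Data description:
    RowNumber: row number
    CustomerId: customer ID
    Surname: customer's surname
    CreditScore: credit score
    Geography: customer's country/region
    Gender: customer's gender
    Age: age
    Tenure: number of years as a customer of this bank
    Balance: account balance (deposits/loans)
    NumOfProducts: number of products used
    HasCrCard: whether the customer holds this bank's credit card
    IsActiveMember: whether the customer is an active member
    EstimatedSalary: estimated salary
    Exited: whether the customer has churned; this is used as the label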
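
    The positional slicing in step 1 relies on Exited being the last column of the CSV. As a minimal sketch, the label can also be selected by name, and the share of each class inspected before any modeling. The rows below are made up for illustration, and `demo`, `label_column` and `class_share` are hypothetical names, not part of the original code.

```python
import pandas as pd

# Tiny hypothetical frame that mimics a few columns of the dataset
# (the values are invented for illustration only).
demo = pd.DataFrame({
    "CreditScore": [619, 608, 502, 699],
    "Geography": ["France", "Spain", "France", "Germany"],
    "Gender": ["Female", "Female", "Male", "Male"],
    "Exited": [1, 0, 1, 0],
})

label_column = "Exited"

# Select by name rather than by position, so a reordered CSV
# cannot silently move the label into the features.
features = demo.drop(columns=[label_column])
labels = demo[label_column].astype(int)

# Fraction of samples per class, sorted by class label (0, then 1).
class_share = labels.value_counts(normalize=True).sort_index()

print(list(features.columns))
print(class_share.to_dict())
```

    Knowing the class shares up front makes it easier to judge later whether an accuracy figure is actually better than always predicting the majority class.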

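    2. Feature engineering

    2.1 Drop useless features

    # Drop the first three columns (RowNumber, CustomerId, Surname), which carry no predictive information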
    x_train = x_train.drop(labels=x_train.columns[[0,1,2]], axis=1)
    x_test = x_test.drop(labels=x_test.columns[[0,1,2]], axis=1)
    
    x_train.head()
    
       CreditScore  Geography  Gender  Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary
    0          619     France  Female   42       2       0.00              1          1               1        101348.88
    1          608      Spain  Female   41       1   83807.86              1          0               1        112542.58
    2          502     France  Female   42       8  159660.80              3          1               0        113931.57
    3          699     France  Female   39       1       0.00              2          0               0         93826.63
    4          850      Spain  Female   43       2  125510.82              1          1               1         79084.10

    y_train[:5]
    
    0    1
    1    0
    2    1
    3    0
    4    0
    Name: Exited, dtype: int32
    

    2.2 Encode string features

    # Geography and Gender are non-numeric columns; use LabelEncoder to convert them to integer codes
    from sklearn.preprocessing import LabelEncoder
    # Fit each encoder on the training set only, then reuse it on the test set
    Lb1 = LabelEncoder()
    x_train['Geography'] = Lb1.fit_transform(x_train['Geography'])
    x_test['Geography'] = Lb1.transform(x_test['Geography'])
    Lb2 = LabelEncoder()
    x_train['Gender'] = Lb2.fit_transform(x_train['Gender'])
    x_test['Gender'] = Lb2.transform(x_test['Gender'])
    
    x_train[:5]
    
       CreditScore  Geography  Gender  Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary
    0          619          0       0   42       2       0.00              1          1               1        101348.88
    1          608          2       0   41       1   83807.86              1          0               1        112542.58
    2          502          0       0   42       8  159660.80              3          1               0        113931.57
    3          699          0       0   39       1       0.00              2          0               0         93826.63
    4          850          2       0   43       2  125510.82              1          1               1         79084.10

    x_train.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 10000 entries, 0 to 9999
    Data columns (total 10 columns):
    CreditScore        10000 non-null int64
    Geography          10000 non-null int64
    Gender             10000 non-null int64
    Age                10000 non-null int64
    Tenure             10000 non-null int64
    Balance            10000 non-null float64
    NumOfProducts      10000 non-null int64
    HasCrCard          10000 non-null int64
    IsActiveMember     10000 non-null int64
    EstimatedSalary    10000 non-null float64
    dtypes: float64(2), int64(8)
    memory usage: 781.3 KB
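
    The integer codes produced in section 2.2 (for example 0 for France and 2 for Spain) imply an ordering between countries that a linear model will take literally. One common alternative, shown here only as a hedged sketch and not as the original author's method, is one-hot encoding. The rows below are invented, and `train_demo`, `test_demo` and `encoder` are hypothetical names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows using three of the dataset's column names.
train_demo = pd.DataFrame({
    "CreditScore": [619, 608, 502, 699],
    "Geography": ["France", "Spain", "France", "Germany"],
    "Gender": ["Female", "Female", "Male", "Male"],
})
test_demo = pd.DataFrame({
    "CreditScore": [850],
    "Geography": ["Spain"],
    "Gender": ["Male"],
})

# One-hot encode the two categorical columns and pass the rest through.
# handle_unknown="ignore" maps categories unseen during fit to all zeros.
# sparse_threshold=0 forces a dense NumPy array as output.
encoder = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Geography", "Gender"]),
    ],
    remainder="passthrough",
    sparse_threshold=0,
)

# Fit on the training rows only, then apply the same mapping to the test rows.
train_encoded = encoder.fit_transform(train_demo)
test_encoded = encoder.transform(test_demo)

print(train_encoded.shape)
print(test_encoded)
```

    The encoded columns come first (categories in sorted order per column), followed by the passed-through numeric column.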
    

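    2.3 Standardize the feature data

    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    # Fit the scaler on the training set only, then apply the same transform to the test set
    x_train = sc.fit_transform(x_train)
    x_test = sc.transform(x_test)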
    
    x_train[:5]
    
    array([[-0.32622142, -0.90188624, -1.09598752,  0.29351742, -1.04175968,
            -1.22584767, -0.91158349,  0.64609167,  0.97024255,  0.02188649],
           [-0.44003595,  1.51506738, -1.09598752,  0.19816383, -1.38753759,
             0.11735002, -0.91158349, -1.54776799,  0.97024255,  0.21653375],
           [-1.53679418, -0.90188624, -1.09598752,  0.29351742,  1.03290776,
             1.33305335,  2.52705662,  0.64609167, -1.03067011,  0.2406869 ],
           [ 0.50152063, -0.90188624, -1.09598752,  0.00745665, -1.38753759,
            -1.22584767,  0.80773656, -1.54776799, -1.03067011, -0.10891792],
           [ 2.06388377,  1.51506738, -1.09598752,  0.38887101, -1.04175968,
             0.7857279 , -0.91158349,  0.64609167,  0.97024255, -0.36527578]])
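
    To make the scaling step concrete, the following sketch reproduces by hand what StandardScaler does: each column has the training mean subtracted and is divided by the training standard deviation, and the test set reuses those training statistics. The small arrays and the names `train_demo`, `test_demo` and `manual_test` are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numeric data: three training rows, one test row, two features.
train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
test_demo = np.array([[4.0, 30.0]])

scaler = StandardScaler().fit(train_demo)

# Manual z-score using the training statistics.
# np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
train_mean = train_demo.mean(axis=0)
train_std = train_demo.std(axis=0)
manual_test = (test_demo - train_mean) / train_std

scaled_train = scaler.transform(train_demo)
scaled_test = scaler.transform(test_demo)

print(np.allclose(scaled_test, manual_test))
print(scaled_train.mean(axis=0), scaled_train.std(axis=0))
```

    After the transform, each training column has mean 0 and standard deviation 1. The test columns generally do not, because they are scaled with the training statistics.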
    

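    3. Modeling, prediction, and evaluation

    # Build the model with logistic regression
    from sklearn.linear_model import LogisticRegression

    lr = LogisticRegression()
    # Train on the standardized training features
    lr.fit(x_train, y_train)
    # Predict churn labels for the test set
    lr_y_predict = lr.predict(x_test)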
    
    # Use the logistic regression model's built-in score function to get its accuracy on the test set
    print('LogisticRegression test accuracy:', lr.score(x_test, y_test))
    print('LogisticRegression training accuracy:', lr.score(x_train, y_train))

    LogisticRegression test accuracy: 0.761
    LogisticRegression training accuracy: 0.809
    
    from sklearn.metrics import classification_report
    # Use classification_report to get the other three metrics (precision, recall, F1) for LogisticRegression
    # target_names follow the sorted label order: 0 = not churned (UnExited), 1 = churned (Exited)
    print(classification_report(y_test, lr_y_predict, target_names=['UnExited', 'Exited']))

                 precision    recall  f1-score   support

       UnExited       0.77      0.97      0.86       740
         Exited       0.68      0.15      0.25       260

    avg / total       0.74      0.76      0.70      1000
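
    As a quick consistency check on the report, F1 is the harmonic mean of precision and recall, and the support-weighted average recall equals overall accuracy. The sketch below recomputes these from the rounded figures printed above for the 740-sample and 260-sample classes. The helper `f1_from` and the variable names are hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values copied from the report above.
support_majority, support_minority = 740, 260
f1_majority = f1_from(0.77, 0.97)
f1_minority = f1_from(0.68, 0.15)

total = support_majority + support_minority

# Support-weighted recall: sum of (recall * support) / total,
# which is the same quantity as overall accuracy.
weighted_recall = (0.97 * support_majority + 0.15 * support_minority) / total

# Accuracy of a trivial model that always predicts the larger class.
majority_baseline = support_majority / total

print(round(f1_majority, 2), round(f1_minority, 2))
print(round(weighted_recall, 2), majority_baseline)
```

    The recomputed F1 values and weighted recall agree with the report to two decimals, and the majority-class baseline gives a reference point for reading the accuracy figure.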
    

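    The results show that the model's test accuracy is only 76%. With 740 of the 1000 test samples in one class, always predicting that class would already score 74%, and recall on the 260-sample minority class is just 0.15, so there is still considerable room for improvement.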
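
    One direction that is often tried when the minority class is poorly recalled is class reweighting. The sketch below is an assumption-laden illustration, not a result on this dataset: it uses synthetic data from `make_classification` with roughly an 80/20 class split and compares a default LogisticRegression with one using `class_weight="balanced"`. All variable names are hypothetical, and no particular improvement is claimed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data (about 80% class 0, 20% class 1).
# This stands in for the real dataset purely for illustration.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Default model: every sample carries the same weight.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Reweighted model: errors on the rarer class are penalized more heavily.
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# Recall on the minority class (label 1) for both models.
recall_plain = recall_score(y_te, plain.predict(X_te))
recall_balanced = recall_score(y_te, balanced.predict(X_te))

print("minority recall, default weights: ", recall_plain)
print("minority recall, balanced weights:", recall_balanced)
```

    Reweighting usually trades some precision and overall accuracy for higher minority recall, so the comparison should be made on the metric that matters for the churn use case.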

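    If you found this helpful, a like and a follow are much appreciated!

    You are welcome to follow my WeChat official account, 阿旭算法与机器学习, to learn and exchange ideas together.
    More practical content is on the way…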

  • Original article: https://blog.csdn.net/qq_42589613/article/details/127768719