• [A Xu's Machine Learning in Practice] [11] Text Classification in Practice: Email Classification with a Naive Bayes Model


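    The [A Xu's Machine Learning in Practice] series covers a range of machine learning algorithms and models together with hands-on case studies. Likes and follows are welcome, so we can learn and exchange ideas together.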

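    This article shows how to use a naive Bayes model for email classification. For the theory behind naive Bayes and its variants, see my previous article, "[A Xu's Machine Learning in Practice] [10] Naive Bayes: How It Works and a Comparison of Three Variants: Gaussian, Multinomial, and Bernoulli Naive Bayes".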

    Text classification in practice

    Reading the text data

    import pandas as pd

    # sep is the field separator used in the file (a tab here); header=None means the file has no header row
    sms = pd.read_csv("../data/SMSSpamCollection", sep="\t", header=None)

    sms
    
          0     1
    0     ham   Go until jurong point, crazy.. Available only ...
    1     ham   Ok lar... Joking wif u oni...
    2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
    3     ham   U dun say so early hor... U c already then say...
    4     ham   Nah I don't think he goes to usf, he lives aro...
    5     spam  FreeMsg Hey there darling it's been 3 week's n...
    6     ham   Even my brother is not like to speak with me. ...
    7     ham   As per your request 'Melle Melle (Oru Minnamin...
    8     spam  WINNER!! As a valued network customer you have...
    9     spam  Had your mobile 11 months or more? U R entitle...
    10    ham   I'm gonna be home soon and i don't want to tal...
    11    spam  SIX chances to win CASH! From 100 to 20,000 po...
    12    spam  URGENT! You have won a 1 week FREE membership ...
    13    ham   I've been searching for the right words to tha...
    14    ham   I HAVE A DATE ON SUNDAY WITH WILL!!
    15    spam  XXXMobileMovieClub: To use your credit, click ...
    16    ham   Oh k...i'm watching here:)
    17    ham   Eh u remember how 2 spell his name... Yes i di...
    18    ham   Fine if that’s the way u feel. That’s the way ...
    19    spam  England v Macedonia - dont miss the goals/team...
    20    ham   Is that seriously how you spell his name?
    21    ham   I‘m going to try for 2 months ha ha only joking
    22    ham   So ü pay first lar... Then when is da stock co...
    23    ham   Aft i finish my lunch then i go str down lor. ...
    24    ham   Ffffffffff. Alright no way I can meet up with ...
    25    ham   Just forced myself to eat a slice. I'm really ...
    26    ham   Lol your always so convincing.
    27    ham   Did you catch the bus ? Are you frying an egg ...
    28    ham   I'm back & we're packing the car now, I'll...
    29    ham   Ahhh. Work. I vaguely remember that! What does...
    ...   ...   ...
    5542  ham   Armand says get your ass over to epsilon
    5543  ham   U still havent got urself a jacket ah?
    5544  ham   I'm taking derek & taylor to walmart, if I...
    5545  ham   Hi its in durban are you still on this number
    5546  ham   Ic. There are a lotta childporn cars then.
    5547  spam  Had your contract mobile 11 Mnths? Latest Moto...
    5548  ham   No, I was trying it all weekend ;V
    5549  ham   You know, wot people wear. T shirts, jumpers, ...
    5550  ham   Cool, what time you think you can get here?
    5551  ham   Wen did you get so spiritual and deep. That's ...
    5552  ham   Have a safe trip to Nigeria. Wish you happines...
    5553  ham   Hahaha..use your brain dear
    5554  ham   Well keep in mind I've only got enough gas for...
    5555  ham   Yeh. Indians was nice. Tho it did kane me off ...
    5556  ham   Yes i have. So that's why u texted. Pshew...mi...
    5557  ham   No. I meant the calculation is the same. That ...
    5558  ham   Sorry, I'll call later
    5559  ham   if you aren't here in the next <#> hou...
    5560  ham   Anything lor. Juz both of us lor.
    5561  ham   Get me out of this dump heap. My mom decided t...
    5562  ham   Ok lor... Sony ericsson salesman... I ask shuh...
    5563  ham   Ard 6 like dat lor.
    5564  ham   Why don't you wait 'til at least wednesday to ...
    5565  ham   Huh y lei...
    5566  spam  REMINDER FROM O2: To get 2.50 pounds free call...
    5567  spam  This is the 2nd time we have tried 2 contact u...
    5568  ham   Will ü b going to esplanade fr home?
    5569  ham   Pity, * was in mood for that. So...any other s...
    5570  ham   The guy did some bitching but I acted like i'd...
    5571  ham   Rofl. Its true to its name

    5572 rows × 2 columns
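
Because the file has no header row, the columns end up named 0 and 1. Here is a minimal sketch of the same kind of `read_csv` call with explicit column names, run on a three-line in-memory sample rather than the real file. The column names `label` and `text` are my own choice for illustration and are not part of the dataset.

```python
import io

import pandas as pd

# Three tab-separated rows standing in for ../data/SMSSpamCollection
# (shortened sample text, for illustration only).
raw = (
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tWINNER!! As a valued network customer you have...\n"
    "ham\tI HAVE A DATE ON SUNDAY WITH WILL!!\n"
)

# names=... labels the columns; "label" and "text" are hypothetical names.
sample = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=["label", "text"])

# Count how many messages fall into each class.
label_counts = sample["label"].value_counts()
print(sample.shape)
print(label_counts)
```

Checking the class counts early is useful here, since spam datasets tend to be imbalanced and that affects how an accuracy figure should be read.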

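    Extracting features and labels

    # Column 1 holds the message text (features); column 0 holds the ham/spam label
    data = sms[[1]]
    # A 1-D Series for the labels avoids scikit-learn's column-vector warning when fitting
    target = sms[0]

    data.shape

    (5572, 1)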

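    Converting the text to a sparse matrix

    For text data, the words in each string are usually converted to floating-point values and stored as a sparse matrix.

    from sklearn.feature_extraction.text import TfidfVectorizer
    # TfidfVectorizer turns a collection of strings into a sparse matrix of TF-IDF weights

    tf = TfidfVectorizer()
    # Fit: learn the vocabulary, i.e. tell tf which words occur in the messages
    tf.fit(data[1])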
    
    TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
            dtype=, encoding='utf-8', input='content',
            lowercase=True, max_df=1.0, max_features=None, min_df=1,
            ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
            stop_words=None, strip_accents=None, sublinear_tf=False,
            token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
            vocabulary=None)
    
    # Transform: turn the 5572 messages into a 5572 × N sparse matrix
    data = tf.transform(data[1])
    data
    # The result is a 5572 × 8713 sparse matrix: the 5572 messages contain 8713 distinct words
    
    <5572x8713 sparse matrix of type '<class 'numpy.float64'>'
    	with 74169 stored elements in Compressed Sparse Row format>
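
To make the conversion more concrete, here is a small sketch on a made-up three-sentence corpus (the sentences are mine, not taken from the SMS data). It shows that each row is a document, each column is a word from the learned vocabulary, and the values are TF-IDF weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A toy corpus (made up for illustration).
corpus = [
    "win cash now",
    "call me now",
    "win a free prize",
]

vec = TfidfVectorizer()
X = vec.fit_transform(corpus)  # fit and transform in one step

# The default token pattern keeps only tokens of two or more characters,
# so the one-letter word "a" is not part of the vocabulary.
vocab = sorted(vec.vocabulary_)
print(vocab)
print(X.shape)  # (number of documents, number of distinct words)
print(X.toarray().round(2))
```

With the default `norm='l2'`, every row is scaled to unit length, which is why the weights in a short message are larger than those in a long one.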

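    Training the model

    from sklearn.naive_bayes import BernoulliNB

    # b_NB is taken to be Bernoulli naive Bayes (inferred from its name; the listing does not show its construction)
    b_NB = BernoulliNB()
    b_NB.fit(data,target)

    # Three messages to classify
    message = ["Confidence doesn't need any specific reason. If you're alive , you should feel 100 percent confident.",
               "Avis is only NO.2 in rent a cars.SO why go with us?We try harder.",
               "SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info"
              ]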
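
One detail worth knowing, assuming `b_NB` is a `BernoulliNB`: that class binarizes its input (its `binarize` parameter defaults to 0.0), so it only sees whether a word occurs, not how large its TF-IDF weight is. The sketch below, on made-up messages, compares TF-IDF features with plain presence/absence features and checks that the predicted probabilities agree.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Made-up messages and labels, for illustration only.
texts = ["win cash now", "free prize win now", "see you at lunch", "call me after lunch"]
labels = ["spam", "spam", "ham", "ham"]

# Same tokenization and vocabulary order; only the stored values differ.
X_tfidf = TfidfVectorizer().fit_transform(texts)
X_binary = CountVectorizer(binary=True).fit_transform(texts)

p_tfidf = BernoulliNB().fit(X_tfidf, labels).predict_proba(X_tfidf)
p_binary = BernoulliNB().fit(X_binary, labels).predict_proba(X_binary)

# Both models see the same 0/1 pattern after binarization.
same = np.allclose(p_tfidf, p_binary)
print(same)
```

So for the Bernoulli variant, TF-IDF weighting mainly serves as a convenient way to build the vocabulary; the weights themselves matter for the multinomial and Gaussian variants.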

    Prediction

    # Convert message to a sparse matrix with the already fitted vectorizer
    x_test = tf.transform(message)

    b_NB.predict(x_test)

    array(['ham', 'ham', 'spam'],
          dtype='<U4')

    # Accuracy on the training data itself
    b_NB.score(data,target)

    0.98815506101938266
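
The score above is computed on the same messages the model was trained on, so it is likely optimistic. A minimal sketch of scoring on held-out data instead, using a handful of made-up messages in place of the SMS file; the split ratio, the random seed, and the use of `BernoulliNB` are my assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Made-up messages standing in for the SMS data.
texts = [
    "win cash now", "free prize claim now", "urgent you have won", "free entry win cash",
    "see you at lunch", "call me later", "are you home yet", "ok talk soon",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Hold out 25% of the messages; stratify keeps both classes in each split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# The pipeline fits the vectorizer on the training messages only,
# so no vocabulary information leaks in from the held-out ones.
model = make_pipeline(TfidfVectorizer(), BernoulliNB())
model.fit(X_train, y_train)

held_out_accuracy = model.score(X_test, y_test)
print(held_out_accuracy)
```

On eight toy messages the number itself means little; the point is the workflow, which carries over directly to the 5572-message dataset.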

    Using multinomial naive Bayes

    from sklearn.naive_bayes import MultinomialNB

    m_NB = MultinomialNB()

    m_NB.fit(data,target)

    m_NB.score(data,target)

    0.97613065326633164

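    Using Gaussian naive Bayes

    from sklearn.naive_bayes import GaussianNB

    g_NB = GaussianNB()

    # GaussianNB requires a dense array, so convert the sparse matrix with toarray()
    g_NB.fit(data.toarray(),target)

    g_NB.score(data.toarray(),target)

    0.94149318018664752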
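
`GaussianNB` does not accept sparse input, which is why `toarray()` is needed. Here is a back-of-the-envelope sketch of what that conversion costs, using the shape and stored-element count reported for the sparse matrix earlier, and assuming 8-byte float64 values and 4-byte int32 indices in the CSR layout.

```python
n_rows, n_cols, n_stored = 5572, 8713, 74169  # figures from the sparse matrix repr

# Dense: one float64 (8 bytes) per cell.
dense_bytes = n_rows * n_cols * 8

# CSR: values (8 bytes each) + column indices (4 bytes each, assumed int32)
# + one row pointer per row plus one (4 bytes each).
sparse_bytes = n_stored * 8 + n_stored * 4 + (n_rows + 1) * 4

# Fraction of cells that actually hold a nonzero value.
density = n_stored / (n_rows * n_cols)

print(f"dense:  {dense_bytes / 1e6:.1f} MB")
print(f"sparse: {sparse_bytes / 1e6:.2f} MB")
print(f"density: {density:.4%}")
```

The dense copy is several hundred times larger than the sparse one, so this approach is fine for a dataset of this size but would not scale to a much larger corpus.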

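    If you found this helpful, a like and a follow would be much appreciated!

    Follow my WeChat official account, 阿旭算法与机器学习, to learn and exchange ideas together.
    More practical content is on the way…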

  • Original article: https://blog.csdn.net/qq_42589613/article/details/127648820