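• Data mining exercises: finding all strong association rules in R from a rule template and an information table, inductive classification with a decision tree based on information gain, and code for computing information entropy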


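    I. (30 points) Let the minimum support threshold be 0.2500 and the minimum confidence threshold be 0.6500. Given the rule template and information table below, find all strong association rules in R:

    S ∈ R, P(S, x) ∧ Q(S, y) ==> Gpa(S, w) [s, c]
    where P, Q ∈ {Major, Status, Age}.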

    Major         Status         Age    Gpa        Count
    Arts          Graduate       Old    Good       50
    Arts          Graduate       Old    Excellent  150
    Arts          Undergraduate  Young  Good       150
    Appl_science  Undergraduate  Young  Excellent  50
    Science       Undergraduate  Young  Good       100

    Solution:
    The table contains 500 samples in total, so the minimum support count is 500 * 0.25 = 125.
    Each value of Gpa is considered separately.
    (1) Gpa = Good

    Major    Status         Age    Count
    Arts     Graduate       Old    50
    Arts     Undergraduate  Young  150
    Science  Undergraduate  Young  100

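    Frequent 1-itemsets: L1 = {Major = Arts: 200; Status = Undergraduate: 250; Age = Young: 250} (10 points)
    Candidate 2-itemsets: C2 = {Major = Arts, Status = Undergraduate: 150; Major = Arts, Age = Young: 150; Status = Undergraduate, Age = Young: 250}
    Every candidate meets the minimum support count of 125, so the frequent 2-itemsets are L2 = C2.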

    (2) Gpa = Excellent

    Major         Status         Age    Count
    Arts          Graduate       Old    150
    Appl_science  Undergraduate  Young  50

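    Frequent 1-itemsets: L1 = {Major = Arts: 150; Status = Graduate: 150; Age = Old: 150}
    Candidate 2-itemsets: C2 = {Major = Arts, Status = Graduate: 150; Major = Arts, Age = Old: 150; Status = Graduate, Age = Old: 150}
    Every candidate meets the minimum support count of 125, so the frequent 2-itemsets are L2 = C2.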
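    The support counts for both Gpa values can be double-checked with a short script. This is only a sketch: it counts every attribute combination directly instead of generating candidates the Apriori way, and the names `ROWS`, `ATTRS`, `MIN_COUNT` and `frequent_itemsets` are illustrative. The count of 50 for the Appl_science row is taken from the Gpa = Excellent sub-table.

```python
from collections import Counter
from itertools import combinations

ATTRS = ("Major", "Status", "Age")
ROWS = [
    # (Major, Status, Age, Gpa, Count)
    ("Arts", "Graduate", "Old", "Good", 50),
    ("Arts", "Graduate", "Old", "Excellent", 150),
    ("Arts", "Undergraduate", "Young", "Good", 150),
    ("Appl_science", "Undergraduate", "Young", "Excellent", 50),
    ("Science", "Undergraduate", "Young", "Good", 100),
]
MIN_COUNT = 125  # 500 samples * minimum support 0.25


def frequent_itemsets(gpa, size):
    """Return the itemsets of `size` attributes with support count >= MIN_COUNT
    among the rows whose Gpa equals `gpa`."""
    counts = Counter()
    for *values, row_gpa, count in ROWS:
        if row_gpa != gpa:
            continue
        items = list(zip(ATTRS, values))  # e.g. [("Major", "Arts"), ...]
        for combo in combinations(items, size):
            counts[combo] += count
    return {combo: c for combo, c in counts.items() if c >= MIN_COUNT}


for gpa in ("Good", "Excellent"):
    for size in (1, 2):
        print(gpa, size, frequent_itemsets(gpa, size))
```

    On a table this small, brute-force counting finds the same frequent itemsets as the Apriori join and prune steps would.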

    Checking confidence (c is the count of samples matching the whole rule divided by the count matching its left-hand side):
    Major(S, Arts) ∧ Status(S, Undergraduate) => Gpa(S, Good) [s = 150/500 = 0.3000, c = 150/150 = 1.0000]
    Major(S, Arts) ∧ Age(S, Young) => Gpa(S, Good) [s = 150/500 = 0.3000, c = 150/150 = 1.0000]
    Status(S, Undergraduate) ∧ Age(S, Young) => Gpa(S, Good) [s = 250/500 = 0.5000, c = 250/300 = 0.8333]
    Major(S, Arts) ∧ Status(S, Graduate) => Gpa(S, Excellent) [s = 150/500 = 0.3000, c = 150/200 = 0.7500]
    Major(S, Arts) ∧ Age(S, Old) => Gpa(S, Excellent) [s = 150/500 = 0.3000, c = 150/200 = 0.7500]
    Status(S, Graduate) ∧ Age(S, Old) => Gpa(S, Excellent) [s = 150/500 = 0.3000, c = 150/200 = 0.7500]

    All six rules meet both the minimum support (0.25) and the minimum confidence (0.65), so the strong association rules are:
    Major(S, Arts) ∧ Status(S, Undergraduate) => Gpa(S, Good) [s = 150/500 = 0.3000, c = 150/150 = 1.0000]
    Major(S, Arts) ∧ Age(S, Young) => Gpa(S, Good) [s = 150/500 = 0.3000, c = 150/150 = 1.0000]
    Status(S, Undergraduate) ∧ Age(S, Young) => Gpa(S, Good) [s = 250/500 = 0.5000, c = 250/300 = 0.8333]
    Major(S, Arts) ∧ Status(S, Graduate) => Gpa(S, Excellent) [s = 150/500 = 0.3000, c = 150/200 = 0.7500]
    Major(S, Arts) ∧ Age(S, Old) => Gpa(S, Excellent) [s = 150/500 = 0.3000, c = 150/200 = 0.7500]
    Status(S, Graduate) ∧ Age(S, Old) => Gpa(S, Excellent) [s = 150/500 = 0.3000, c = 150/200 = 0.7500]
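    The rule list can be reproduced by enumerating every instance of the template and filtering on both thresholds. The sketch below assumes the table rows shown in the question (with a count of 50 for the Appl_science row); `ROWS`, `ATTRS` and `strong_rules` are illustrative names, not part of the original solution.

```python
from itertools import combinations

ATTRS = ("Major", "Status", "Age")
ROWS = [
    # (Major, Status, Age, Gpa, Count)
    ("Arts", "Graduate", "Old", "Good", 50),
    ("Arts", "Graduate", "Old", "Excellent", 150),
    ("Arts", "Undergraduate", "Young", "Good", 150),
    ("Appl_science", "Undergraduate", "Young", "Excellent", 50),
    ("Science", "Undergraduate", "Young", "Good", 100),
]
TOTAL = sum(row[-1] for row in ROWS)  # 500 samples
MIN_SUP, MIN_CONF = 0.25, 0.65


def strong_rules():
    """Enumerate P(S,x) ∧ Q(S,y) => Gpa(S,w) and keep the strong rules."""
    rules = []
    for i, j in combinations(range(len(ATTRS)), 2):
        # Every (x, y) value pair that occurs for attributes P and Q.
        for x, y in sorted({(r[i], r[j]) for r in ROWS}):
            body = sum(r[-1] for r in ROWS if (r[i], r[j]) == (x, y))
            for gpa in ("Good", "Excellent"):
                both = sum(r[-1] for r in ROWS
                           if (r[i], r[j], r[3]) == (x, y, gpa))
                s, c = both / TOTAL, both / body
                if s >= MIN_SUP and c >= MIN_CONF:
                    rules.append((ATTRS[i], x, ATTRS[j], y, gpa,
                                  round(s, 4), round(c, 4)))
    return rules


for rule in strong_rules():
    print(rule)
```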

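    II. (30 points) Let the class label attribute Gpa take two distinct values, {Good, Excellent}. Use information gain to build a decision tree for inductive classification.

    Solution:
    Define P: Gpa = Good
           N: Gpa = Excellent
    Before any split, the entropy of the sample set is: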

    p    n    I(p,n)
    300  200  0.97095

    I(p,n) = -0.6*log2(0.6) - 0.4*log2(0.4)
           = 0.97095
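    For reference, the value above is an instance of the standard two-class entropy formula, written out here with p = 300 Good samples and n = 200 Excellent samples:

```latex
I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n},
\qquad
I(300, 200) = -0.6\log_2 0.6 - 0.4\log_2 0.4 \approx 0.97095
```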

    Entropy of the samples after splitting on Major:

    Major         pi   ni   I(pi,ni)
    Arts          200  150  0.98523
    Appl_science  0    50   0
    Science       100  0    0

    E(Major) = 350/500*0.98523 = 0.68966

    where, for Major = Arts, I(200,150) = -(4/7)*log2(4/7) - (3/7)*log2(3/7) = 0.98523

    Entropy of the samples after splitting on Status:

    Status         pi   ni   I(pi,ni)
    Graduate       50   150  0.81128
    Undergraduate  250  50   0.65002

    E(Status) = 200/500*0.81128 + 300/500*0.65002 = 0.71452

    Entropy of the samples after splitting on Age:

    Age    pi   ni   I(pi,ni)
    Old    50   150  0.81128
    Young  250  50   0.65002

    E(Age) = E(Status) = 0.71452

    The information gain of each attribute is:
    Gain(Major) = 0.97095 - 0.68966 = 0.28129
    Gain(Status) = Gain(Age) = 0.97095 - 0.71452 = 0.25643
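    These gains can be recomputed from the table. The sketch below assumes the five rows from question I (with a count of 50 for the Appl_science row); the helper names `info`, `class_counts`, `expected_info` and `gain` are illustrative.

```python
import math

ATTRS = ("Major", "Status", "Age")
ROWS = [
    # (Major, Status, Age, Gpa, Count)
    ("Arts", "Graduate", "Old", "Good", 50),
    ("Arts", "Graduate", "Old", "Excellent", 150),
    ("Arts", "Undergraduate", "Young", "Good", 150),
    ("Appl_science", "Undergraduate", "Young", "Excellent", 50),
    ("Science", "Undergraduate", "Young", "Good", 100),
]


def info(p, n):
    """Two-class entropy I(p, n) in bits; an empty class contributes 0."""
    total = p + n
    return sum(-c / total * math.log2(c / total) for c in (p, n) if c > 0)


def class_counts(rows):
    """Return (p, n): the Good and Excellent sample counts in `rows`."""
    p = sum(r[-1] for r in rows if r[3] == "Good")
    n = sum(r[-1] for r in rows if r[3] == "Excellent")
    return p, n


def expected_info(rows, attr_index):
    """E(A): entropy of each partition, weighted by partition size."""
    total = sum(r[-1] for r in rows)
    result = 0.0
    for value in sorted({r[attr_index] for r in rows}):
        p, n = class_counts([r for r in rows if r[attr_index] == value])
        result += (p + n) / total * info(p, n)
    return result


def gain(rows, attr_index):
    """Gain(A) = I(p, n) - E(A)."""
    return info(*class_counts(rows)) - expected_info(rows, attr_index)


for i, name in enumerate(ATTRS):
    print(f"Gain({name}) = {gain(ROWS, i):.5f}")
```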

    Gain(Major) is the largest, so by the maximum information gain criterion the first split is on Major.
    Splitting on the values of Major gives the following three tables:

    (1) Major = Arts

    Status         Age    Gpa        Count
    Graduate       Old    Good       50
    Graduate       Old    Excellent  150
    Undergraduate  Young  Good       150

    Entropy of the samples after splitting on Status:

    Status         pi   ni   I(pi,ni)
    Graduate       50   150  0.81128
    Undergraduate  150  0    0

    E(Status) = 200/350*0.81128 = 0.46359

    Entropy of the samples after splitting on Age:

    Age    pi   ni   I(pi,ni)
    Old    50   150  0.81128
    Young  150  0    0

    E(Age) = E(Status) = 0.46359
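    The tie between the two attributes inside the Major = Arts subset can be confirmed numerically. This is a minimal sketch that works directly on the (pi, ni) pairs from the two tables above; `info` and `expected_info` are illustrative helper names.

```python
import math


def info(p, n):
    """Two-class entropy I(p, n) in bits; an empty class contributes 0."""
    total = p + n
    return sum(-c / total * math.log2(c / total) for c in (p, n) if c > 0)


def expected_info(partitions):
    """Weighted entropy of a split, given one (p_i, n_i) pair per value."""
    total = sum(p + n for p, n in partitions)
    return sum((p + n) / total * info(p, n) for p, n in partitions)


# Major = Arts subset: (Good, Excellent) counts for each attribute value.
status_split = [(50, 150), (150, 0)]  # Graduate, Undergraduate
age_split = [(50, 150), (150, 0)]     # Old, Young

e_status = expected_info(status_split)
e_age = expected_info(age_split)
print(f"E(Status) = {e_status:.5f}, E(Age) = {e_age:.5f}")
print(f"Gain(Status) = Gain(Age) = {info(200, 150) - e_status:.5f}")
```

    Because the two splits partition the subset identically, neither attribute is preferred by information gain.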

    Because E(Age) = E(Status), either attribute can be used, and the second split is made on Status. Splitting on the values of Status gives the following two tables:

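    (1.1) Status = Graduate

    Age  Gpa        Count
    Old  Good       50
    Old  Excellent  150

    Age takes a single value in this table, so no further split is possible. By majority vote, this branch is labeled Gpa = Excellent.
    (1.2) Status = Undergraduate

    Status         Age    Gpa   Count
    Undergraduate  Young  Good  150

    All samples here have the same Gpa value, so splitting stops.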
    (2) Major = Appl_science

    Status         Age    Gpa        Count
    Undergraduate  Young  Excellent  50

    All samples here have the same Gpa value, so splitting stops.
    (3) Major = Science

    Status         Age    Gpa   Count
    Undergraduate  Young  Good  100

    All samples here have the same Gpa value, so splitting stops.
    Combining the analysis above gives the following decision tree:
    Major
      |-- Arts ---------- Status
      |                     |-- Graduate ------- Excellent
      |                     |-- Undergraduate -- Good
      |-- Appl_science -- Excellent
      |-- Science ------- Good
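    The whole procedure can be sketched as a small recursive tree builder in the style of ID3. It is an illustration under stated assumptions, not a reference implementation: ties between equal gains go to the first attribute in `ATTRS` order, a node becomes a leaf when its samples are pure or when no remaining attribute varies (majority vote), and the names `build` and `classify` are hypothetical.

```python
import math

ATTRS = {"Major": 0, "Status": 1, "Age": 2}  # attribute name -> column index
ROWS = [
    # (Major, Status, Age, Gpa, Count)
    ("Arts", "Graduate", "Old", "Good", 50),
    ("Arts", "Graduate", "Old", "Excellent", 150),
    ("Arts", "Undergraduate", "Young", "Good", 150),
    ("Appl_science", "Undergraduate", "Young", "Excellent", 50),
    ("Science", "Undergraduate", "Young", "Good", 100),
]


def class_counts(rows):
    """Map each Gpa value to its total sample count in `rows`."""
    counts = {}
    for r in rows:
        counts[r[3]] = counts.get(r[3], 0) + r[4]
    return counts


def entropy(rows):
    counts = class_counts(rows)
    total = sum(counts.values())
    return sum(-c / total * math.log2(c / total) for c in counts.values())


def split(rows, attr):
    """Group rows by their value of `attr`."""
    groups = {}
    for r in rows:
        groups.setdefault(r[ATTRS[attr]], []).append(r)
    return groups


def gain(rows, attr):
    total = sum(r[4] for r in rows)
    remainder = sum(sum(r[4] for r in g) / total * entropy(g)
                    for g in split(rows, attr).values())
    return entropy(rows) - remainder


def build(rows):
    counts = class_counts(rows)
    # Only attributes that still vary in this subset can split it further.
    candidates = [a for a in ATTRS if len(split(rows, a)) > 1]
    if len(counts) == 1 or not candidates:
        return max(counts, key=counts.get)  # pure leaf or majority vote
    best = max(candidates, key=lambda a: gain(rows, a))  # first wins on ties
    return {best: {v: build(g) for v, g in split(rows, best).items()}}


def classify(tree, sample):
    """Walk the nested dict until a leaf label is reached."""
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches[sample[attr]]
    return tree


TREE = build(ROWS)
print(TREE)
print(classify(TREE, {"Major": "Arts", "Status": "Graduate", "Age": "Old"}))
```

    Built from the table in question I, the nested dictionary has the same shape as the tree drawn above.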

    A small trick

    Code for computing information entropy

    import math

    def entropy(counts):
        """Return the Shannon entropy (in bits) of a list of class counts."""
        total = sum(counts)
        probabilities = [c / total for c in counts]
        result = 0.0
        for p in probabilities:
            if p > 0:  # skip empty classes; 0 * log2(0) is taken as 0
                result -= p * math.log2(p)
        return result

    counts = [100, 100, 150]  # class counts whose entropy we want

    print("Entropy:", entropy(counts))
    
  • Original article: https://blog.csdn.net/m0_51738372/article/details/134209637