• Elasticsearch实战(二十)---ES相关度分数评分算法分析及相关度分数优化


    Elasticsearch实战—ES相关度分数评分算法分析

    ES相关度评分算法靠三个部分来依次实现,没有先后顺序,是一个逐层推进的逻辑

    1. Boolean模型 根据过滤条件true,false来过滤doc
    2. TFIDF模型
    3. VSM空间向量模型

    1.ES相关度分数评分算法

    1.1 Booolean

    boolean根据搜索条件过滤doc的国车过是不做相关度分数计算的,只是为了标记出来哪些doc是符合搜索条件要求的

    1.2 TFIDF模型

    了解文档分词处理的都听过TFIDF模型,TF词频,IDF逆文本频率,说白了就是单词term出现了再所有文档中出现了多少次,出现越多,说明这个单词越没有标识度,越不重要,和文档的相关度分数越低
    比如下面多个文章ABC中出现多次吃饭,一个文章C中出现一次原子弹,那肯定原子弹肯定对文章C很重要 很有标识度,原子弹这个单词对C来说 权重很高,这就是TFIDF模型

    文章DOC A :{ 吃饭, 喝酒, 喝茶}
    文章DOC B: {吃饭,原子弹}
    文章DOC C:{吃饭, 喝酒}
    
    • 1
    • 2
    • 3

    在这里插入图片描述

    1.3 VSM空间向量模型

    VSM这个就更为专业

    • 我们从document出发。document由若干个term组成,通过TF/IDF算法计算后,我们可以得知每一个term在document中的权重,而不同的term又会根据自己的权重影响当前document的相关度得分
    • 我们将当前document中出现的所有term的权重组合起来,形成一条向量 Document Vector, Document Vector可能会有多条
    • 有了向量可以根据 余弦函数cos计算两个向量的夹角,夹角越大,说明偏离越远,两个向量越不相似,进而得出文章不相似

    Document = {term1, term2, …… ,termN}
    Document Vector = {weight1, weight2, …… ,weightN}
    在这里插入图片描述

    2.ES相关度分数优化

    2.1 准备数据

    先构造 index:testquery, 然后构造mapping结构, 插入测试数据

    #构建 库index testquer
    put /testquery
    
    #构建mapping结构
    put /testquery/_mapping
    {
        "properties" : {
          "address" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              },
              "copy_to" : [
                "info"
              ]
            },
          "age" : {
              "type" : "long"
            },
          "area" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
          "city" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
          "content" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
          "deptName" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              },
              "fielddata" : true
            },
          "empId" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
          "info" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
          "mobile" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              },
              "copy_to" : [
                "info"
              ]
            },
          "name" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              },
              "copy_to" : [
                "info"
              ]
            },
          "provice" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              },
              "fielddata" : true
            },
          "salary" : {
              "type" : "long"
            },
          "sex" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
     		 "addtime" : {
              "type":"date",
              //时间格式 epoch_millis表示毫秒
              "format":"yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
            }
        }
    }
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49
    • 50
    • 51
    • 52
    • 53
    • 54
    • 55
    • 56
    • 57
    • 58
    • 59
    • 60
    • 61
    • 62
    • 63
    • 64
    • 65
    • 66
    • 67
    • 68
    • 69
    • 70
    • 71
    • 72
    • 73
    • 74
    • 75
    • 76
    • 77
    • 78
    • 79
    • 80
    • 81
    • 82
    • 83
    • 84
    • 85
    • 86
    • 87
    • 88
    • 89
    • 90
    • 91
    • 92
    • 93
    • 94
    • 95
    • 96
    • 97
    • 98
    • 99
    • 100
    • 101
    • 102
    • 103
    • 104
    • 105
    • 106
    • 107
    • 108
    • 109
    • 110
    • 111
    • 112
    • 113
    • 114
    • 115
    • 116
    • 117
    • 118
    • 119
    • 120
    • 121
    • 122
    • 123
    • 124
    • 125
    • 126
    • 127
    • 128
    • 129
    • 130

    插入测试数据

    put /testquery/_bulk
    {"index":{"_id": 1},"addtime":"1658041203000"}
    {"empId" : "111","name" : "员工1","age" : 20,"sex" : "男","mobile" : "19000001111","salary":1333,"deptName" : "技术部","provice" : "湖北省","city":"武汉","area":"光谷大道","address":"湖北省武汉市洪山区光谷大厦","content" : "i like to write best elasticsearch article", "addtime":"1658140003000"}
    {"index":{"_id": 2}}
    {"empId" : "222","name" : "员工2","age" : 25,"sex" : "男","mobile" : "19000002222","salary":15963,"deptName" : "销售部","provice" : "湖北省","city":"武汉","area":"江汉区","address" : "湖北省武汉市江汉路","content" : "i think java is the best programming language"}
    {"index":{"_id": 3},"addtime":"1658040045600"}
    { "empId" : "333","name" : "员工3","age" : 30,"sex" : "男","mobile" : "19000003333","salary":20000,"deptName" : "技术部","provice" : "湖北省","city":"武汉","area":"经济技术开发区","address" : "湖北省武汉市经济开发区","content" : "i am only an elasticsearch beginner"}
    {"index":{"_id": 4},"addtime":"1658040012000"}
    {"empId" : "444","name" : "员工4","age" : 20,"sex" : "女","mobile" : "19000004444","salary":5600,"deptName" : "销售部","provice" : "湖北省","city":"武汉","area":"沌口开发区","address" : "湖北省武汉市沌口开发区","content" : "elasticsearch and hadoop are all very good solution, i am a beginner"}
    {"index":{"_id": 5},"addtime":"1658040593000"}
    { "empId" : "555","name" : "员工5","age" : 20,"sex" : "男","mobile" : "19000005555","salary":9665,"deptName" : "测试部","provice" : "湖北省","city":"高新开发区","area":"武汉","address" : "湖北省武汉市东湖隧道","content" : "spark is best big data solution based on scala ,an programming language similar to java"}
    {"index":{"_id": 6},"addtime":"1658043403000"}
    {"empId" : "666","name" : "员工6","age" : 30,"sex" : "女","mobile" : "19000006666","salary":30000,"deptName" : "技术部","provice" : "武汉市","city":"湖北省","area":"江汉区","address" : "湖北省武汉市江汉路","content" : "i like java developer","addtime":"1658041003000"}
    {"index":{"_id": 7}}
    {"empId" : "777","name" : "员工7","age" : 60,"sex" : "女","mobile" : "19000007777","salary":52130,"deptName" : "测试部","provice" : "湖北省","city":"黄冈市","area":"边城区","address" : "湖北省黄冈市边城区","content" : "i like elasticsearch developer","addtime":"1658040008000"}
    {"index":{"_id": 8}}
    {"empId" : "888","name" : "员工8","age" : 19,"sex" : "女","mobile" : "19000008888","salary":60000,"deptName" : "技术部","provice" : "湖北省","city":"武汉","area":"汉阳区","address" : "湖北省武汉市江汉大学","content" : "i like spark language","addtime":"1656040003000"}
    {"index":{"_id": 9}}
    {"empId" : "999","name" : "员工9","age" : 40,"sex" : "男","mobile" : "19000009999","salary":23000,"deptName" : "销售部","provice" : "河南省","city":"郑州市","area":"二七区","address" : "河南省郑州市郑州大学","content" : "i like java developer","addtime":"1608040003000"}
    {"index":{"_id": 10}}
    {"empId" : "101010","name" : "张湖北","age" : 35,"sex" : "男","mobile" : "19000001010","salary":18000,"deptName" : "测试部","provice" : "湖北省","city":"武汉","area":"高新开发区","address" : "湖北省武汉市东湖高新","content" : "i like java developer i also like  elasticsearch","addtime":"1654040003000"}
    {"index":{"_id": 11}}
    {"empId" : "111111","name" : "王河南","age" : 61,"sex" : "男","mobile" : "19000001011","salary":10000,"deptName" : "销售部",,"provice" : "河南省","city":"开封市","area":"金明区","address" : "河南省开封市河南大学","content" : "i am not like  java ","addtime":"1658740003000"}
    {"index":{"_id": 12}}
    {"empId" : "121212","name" : "张大学","age" : 26,"sex" : "女","mobile" : "19000001012","salary":1321,"deptName" : "测试部",,"provice" : "河南省","city":"开封市","area":"金明区","address" : "河南省开封市河南大学","content" : "i am java developer  thing java is good","addtime":"165704003000"}
    {"index":{"_id": 13}}
    {"empId" : "131313","name" : "李江汉","age" : 36,"sex" : "男","mobile" : "19000001013","salary":1125,"deptName" : "销售部","provice" : "河南省","city":"郑州市","area":"二七区","address" : "河南省郑州市二七区","content" : "i like java and java is very best i like it do you like java ","addtime":"1658140003000"}
    {"index":{"_id": 14}}
    {"empId" : "141414","name" : "王技术","age" : 45,"sex" : "女","mobile" : "19000001014","salary":6222,"deptName" : "测试部",,"provice" : "河南省","city":"郑州市","area":"金水区","address" : "河南省郑州市金水区","content" : "i like c++","addtime":"1656040003000"}
    {"index":{"_id": 15}}
    {"empId" : "151515","name" : "张测试","age" : 18,"sex" : "男","mobile" : "19000001015","salary":20000,"deptName" : "技术部",,"provice" : "河南省","city":"郑州市","area":"高新开发区","address" : "河南省郑州高新开发区","content" : "i think spark is good","addtime":"1658040003000"}
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31

    2.2 Boost 增加搜索条件权重

    设置boost查询条件权重可以实现影响搜索结果评分的目的,比如 查询条件后面加上boost,实现当前条件关联度倍增的效果

    #不加boost条件查询
    get /testquery/_search
    {
      "query":{
       "bool": {
         "should": [
           {
             "match": {
               "provice.keyword": "湖北省"
             }
           },
           {
             "match": {
               "address": "开发区"
             }
           }
         ]
       }
      }
    }
    
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21

    不加条件查询结果
    在这里插入图片描述

    现在给 address地址 加权重boost,认为address包含开发的排名更优先
    然后员工4 中address包含开发,分数直接飙升到 12.44
    员工1中address并没有开发 两个字,所以address 的 boost对员工1的分数没有影响依旧是 0.344分
    在这里插入图片描述

    2.3 Negative boost 削弱搜索条件权重

    设置negative boost 削弱查询条件的权重 可以实现影响搜索结果评分的目的,削弱查询条件对分数的影响

    #设置 negative_boost 权重为 1 看下结果
    get /testquery/_search
    {
      "query":{
       "boosting": {
         "positive": {
           "match": {
             "provice.keyword": "湖北省"
           }
         },
         "negative": {
           "match": {
             "deptName.keyword": "销售部"
           }
         },
         "negative_boost": 1
       }
      }
    }
    
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20

    设置 negative_boost 权重为 1 看下结果
    员工1:0.344 湖北省技术部
    员工2:0.344 湖北省销售部
    在这里插入图片描述

    现在 negative_boost修改为 0.2 看下结果
    员工1:0.344 湖北技术部 不受影响,因为他的部门deptname不是销售部,所以削弱销售部的权重不影响他
    员工2:0.068 湖北销售部 受影响,分数明显降低,相关度降低
    在这里插入图片描述

    2.4 Function score 自定义相关分数算法

    场景:
    现在我想把 相关度分数和 文章的浏览量关联起来, 浏览量越大,分数越高,怎么实现

    分数算法有几个关键点

    1. query内部使用 function_score 表明我要使用自定义相关度分数
    2. function_score内部 使用 field_value_factor 表明参与到分数计算的字段 设置,及按照什么来计算等
    3. function_score 的 field表示 对哪个字段进行积分
    4. modifier表示 对哪个字段进行积分 比如 ln, log1p, log2p log 等等算式
    5. factor 表示 对 你要计算的字段 field 的值 与 factor 相乘 处理
    6. boost_mode表示 分数 旧分数和新分数 如何处理 累加/减/乘/除/max/min 等等
    7. max_boost表示 限制计算出来的分数不要超过max_boost指定的值 , 不是最终得分不超过多少
      我们下一篇文章 单独讲解一下 如何实现这种场景及 自定义相关度分数算法如何实现, 每个参数都是如何使用的详解

    至此 我们已经学习了 ES相关度分数评分算法分析, 也了解了 ES 实现相关度分析底层原理 使用 boolean模型,TFIDF,VSM空间向量模型计算相关度,也会使用 boost, negativeboost 来增加,削弱 查询条件权重 等等

    下一篇我们着重讲解下 如何实现自定义算法 function score ES相关度分数评分优化及FunctionScore 自定义相关度分数算法

  • 相关阅读:
    动态规划三:常见状态与常见递推关系式
    2023年【安全生产监管人员】考试报名及安全生产监管人员复审考试
    猿创征文 |【SpringBoot】SSM“加速器”SpringBoot初体验
    第六章:函数
    闲置电脑挂机赚钱-Traffmonetizer,支持windows,linux,Android,MacOS多平台
    【软件与系统安全】笔记与期末复习
    微博头条文章开放接口报错 auth by Null spi
    docker_python-django_uwsgi_nginx_浏览器_网络访问映过程
    【Vue面试题十七】、你知道vue中key的原理吗?说说你对它的理解
    Node.js 商标转让给 OpenJS 基金会
  • 原文地址:https://blog.csdn.net/u010134642/article/details/126319366