[Knowledge Graph] Practice: A Question Answering System Based on a Medical Knowledge Graph (Part 3): Rule-Based Question Classification

Previous articles:

Background

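Building on the previous chapters, we can assume that we now have a knowledge base capable of answering medical questions. In a pipeline-style question answering task, the first step after receiving a question is usually to classify it, so that it can be handled and answered in a more fine-grained way. This question classification is often called intent recognition. Intent recognition, or question classification, is essentially text classification, so it can be tackled with traditional machine learning or deep learning algorithms. When labeled corpora are lacking, however, rules may be the best approach, and that is what the original project does.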

Rule-Based Question Classification

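The module that loads knowledge graph data into the database provides an entity export function, and the exported data is a set of entity word lists. In addition, the source code provides a list of negation words, deny.txt, which I also placed in the dict folder. Together these are the feature words used for rule-based classification. The main purpose of classifying a question is to prepare for the subsequent category-specific question parsing and question search.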

Now let's design the question classification class, in KGQAMedicine\question_classify\rule_question_classify.py.

To quickly check whether a question contains words from the feature dictionaries, we bring in the ahocorasick package. Install it with: pip install pyahocorasick
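
To make it clearer what this kind of package does, here is a minimal pure-Python sketch of Aho-Corasick multi-pattern matching. It is not pyahocorasick's implementation, and the function names `build_automaton` and `find_all` are hypothetical; it only illustrates how a trie with failure links finds every dictionary word in a single pass over the question.

```python
from collections import deque


def build_automaton(words):
    """Build a trie with failure links (an Aho-Corasick automaton)."""
    goto = [{}]     # goto[state][char] -> next state
    fail = [0]      # fail[state] -> state of the longest proper suffix in the trie
    output = [[]]   # output[state] -> dictionary words ending at this state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)
    # Breadth-first pass: set failure links and inherit outputs from them.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_all(question, automaton):
    """Return (end_index, word) for every dictionary word found in question."""
    goto, fail, output = automaton
    state = 0
    matches = []
    for end, ch in enumerate(question):
        # Follow failure links until some state accepts this character.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in output[state]:
            matches.append((end, word))
    return matches


automaton = build_automaton(["失眠", "并发症", "乳腺癌", "癌"])
print(find_all("乳腺癌会引起失眠吗", automaton))
```

For this input the sketch reports both 乳腺癌 and its suffix 癌 at the same end position, followed by 失眠, which is why overlapping matches need to be filtered afterwards.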

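The first step of question classification is to check whether the question contains any entity from the graph database. If it does not, no relevant query can be made to answer it.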

Rule-based classification relies mainly on keyword matching. The following question types are supported:

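Question type	Meaning	Example question
disease_symptom	Symptoms of a disease	乳腺癌的症状有哪些？ (What are the symptoms of breast cancer?)
symptom_disease	Possible diseases for a known symptom	最近老流鼻涕怎么办？ (I keep getting a runny nose lately, what should I do?)
disease_cause	Cause of a disease	为什么有的人会失眠？ (Why do some people suffer from insomnia?)
disease_accompany	Complications of a disease	失眠有哪些并发症？ (What complications does insomnia have?)
disease_not_food	Foods to avoid with a disease	失眠的人不要吃啥？ (What should people with insomnia not eat?)
disease_do_food	Foods recommended for a disease	耳鸣了吃点啥？ (What should I eat for tinnitus?)
food_not_disease	Diseases for which a food is best avoided	哪些人最好不要吃蜂蜜？ (Who had better not eat honey?)
food_do_disease	Diseases a food is good for	鹅肉有什么好处？ (What are the benefits of goose meat?)
disease_drug	Drugs to take for a disease	肝病要吃啥药？ (What medicine should be taken for liver disease?)
drug_disease	Diseases a drug can treat	板蓝根颗粒能治啥病？ (What can Banlangen granules treat?)
disease_check	Examinations needed for a disease	脑膜炎怎么才能查出来？ (How can meningitis be detected?)
check_disease	Diseases an examination can detect	全血细胞计数能查出啥来？ (What can a complete blood count detect?)
disease_prevent	Preventive measures	怎样才能预防肾虚？ (How can kidney deficiency be prevented?)
disease_lasttime	Treatment duration	感冒要多久才能好？ (How long does a cold take to clear up?)
disease_cureway	Treatment methods	高血压要怎么治？ (How is high blood pressure treated?)
disease_cureprob	Probability of cure	白血病能治好吗？ (Can leukemia be cured?)
disease_easyget	Population susceptible to a disease	什么人容易得高血压？ (Who is prone to high blood pressure?)
disease_desc	Disease description	糖尿病 (Diabetes)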
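
As a rough sketch of how keyword matching maps onto these labels, each rule can be written as a label, a list of cue words, and the entity kind that must appear in the question. The `RULES` table and the `classify_by_rules` function are hypothetical names used only for illustration; the cue words are a small subset of the lists in the class shown later, and the set of entity kinds is assumed to come from the entity matching step.

```python
# Each rule: (question label, cue words, entity kind that must be present).
RULES = [
    ("disease_symptom", ["症状", "表现"], "disease"),
    ("disease_cause", ["原因", "为什么", "为啥"], "disease"),
    ("disease_drug", ["药", "胶囊"], "disease"),
    ("drug_disease", ["治啥", "主治什么", "有什么用"], "drug"),
    ("disease_prevent", ["预防", "防止"], "disease"),
]


def classify_by_rules(question, entity_kinds):
    """Return every label whose cue word and required entity kind both occur."""
    labels = []
    for label, cue_words, required_kind in RULES:
        if required_kind in entity_kinds and any(w in question for w in cue_words):
            labels.append(label)
    # Fall back to a plain description when only a disease was recognised.
    if not labels and "disease" in entity_kinds:
        labels.append("disease_desc")
    return labels


print(classify_by_rules("乳腺癌的症状有哪些？", {"disease"}))
print(classify_by_rules("板蓝根颗粒能治啥病？", {"drug"}))
print(classify_by_rules("糖尿病", {"disease"}))
```

Because every rule is checked independently, one question can receive several labels, and a question that names a disease but matches no cue word falls back to disease_desc.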

The implementation is as follows:

import os
import ahocorasick
import tqdm
from utils.config import SysConfig


class RuleQuestionClassifier(object):
    disease_feature_words = []
    department_feature_words = []
    check_feature_words = []
    drug_feature_words = []
    food_feature_words = []
    producer_feature_words = []
    symptom_feature_words = []
    region_feature_words = set()
    deny_feature_words = []

    # Cue words for each question type (kept in Chinese because they are matched against Chinese questions)
    symptom_qwds = ['症状', '表征', '现象', '症候', '表现']
    cause_qwds = ['原因', '成因', '为什么', '怎么会', '怎样才', '咋样才', '怎样会', '如何会', '为啥', '为何', '如何才会', '怎么才会', '会导致', '会造成']
    acompany_qwds = ['并发症', '并发', '一起发生', '一并发生', '一起出现', '一并出现', '一同发生', '一同出现', '伴随发生', '伴随', '共现']
    food_qwds = ['饮食', '饮用', '吃', '食', '伙食', '膳食', '喝', '菜', '忌口', '补品', '保健品', '食谱', '菜谱', '食用', '食物', '补品']
    drug_qwds = ['药', '药品', '用药', '胶囊', '口服液', '炎片']
    prevent_qwds = ['预防', '防范', '抵制', '抵御', '防止', '躲避', '逃避', '避开', '免得', '逃开', '避开', '避掉', '躲开', '躲掉', '绕开',
                    '怎样才能不', '怎么才能不', '咋样才能不', '咋才能不', '如何才能不', '怎样才不', '怎么才不', '咋样才不', '咋才不',
                    '如何才不', '怎样才可以不', '怎么才可以不', '咋样才可以不', '咋才可以不', '如何可以不', '怎样才可不', '怎么才可不',
                    '咋样才可不', '咋才可不', '如何可不']
    lasttime_qwds = ['周期', '多久', '多长时间', '多少时间', '几天', '几年', '多少天', '多少小时', '几个小时', '多少年']
    cureway_qwds = ['怎么治疗', '如何医治', '怎么医治', '怎么治', '怎么医', '如何治', '医治方式', '疗法', '咋治', '怎么办', '咋办', '咋治']
    cureprob_qwds = ['多大概率能治好', '多大几率能治好', '治好希望大么', '几率', '几成', '比例', '可能性', '能治', '可治', '可以治', '可以医']
    easyget_qwds = ['易感人群', '容易感染', '易发人群', '什么人', '哪些人', '感染', '染上', '得上']
    check_qwds = ['检查', '检查项目', '查出', '检查', '测出', '试出']
    belong_qwds = ['属于什么科', '属于', '什么科', '科室']
    cure_qwds = ['治疗什么', '治啥', '治疗啥', '医治啥', '治愈啥', '主治啥', '主治什么', '有什么用', '有何用', '用处', '用途',
                 '有什么好处', '有什么益处', '有何益处', '用来', '用来做啥', '用来作甚', '需要', '要']

    def __init__(self):
        self.region_actree = None
        self.word_kind_dict = None
        self._init()

    @staticmethod
    def _load_line_file(file_path):
        print(f"load file {file_path}")
        data_list = []
        with open(file_path, 'r', encoding='utf8') as reader:
            for line in reader:
                if not line.strip():
                    continue
                data_list.append(line.strip())
        return data_list

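    def _init(self):
        # Load every dictionary file. Map file name -> attribute prefix, because the symptom
        # dictionary is stored as symptoms.txt but is read back as symptom_feature_words.
        file_list = {"disease": "disease", "department": "department", "check": "check", "drug": "drug",
                     "food": "food", "producer": "producer", "symptoms": "symptom", "deny": "deny"}
        for file_name, kind in file_list.items():
            data_list = self._load_line_file(os.path.join(SysConfig.DATA_DICT_DIR, file_name + ".txt"))
            setattr(self, kind + "_feature_words", data_list)
            # Negation words are question features, not graph entities, so keep them out of the entity set.
            if kind != "deny":
                self.region_feature_words.update(data_list)
        # build actree
        self.region_actree = self._get_actree(list(self.region_feature_words))
        # build word kind dict
        self._build_word_kind_dict()
        print("object init over")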

    def _build_word_kind_dict(self):
        # Convert each word list to a set once so that every membership test is O(1).
        kind_words = [
            ("disease", set(self.disease_feature_words)),
            ("department", set(self.department_feature_words)),
            ("check", set(self.check_feature_words)),
            ("drug", set(self.drug_feature_words)),
            ("food", set(self.food_feature_words)),
            ("symptom", set(self.symptom_feature_words)),
            ("producer", set(self.producer_feature_words)),
        ]
        word_kind_dict = {}
        for word in tqdm.tqdm(self.region_feature_words, desc='building word kind dict'):
            # A word may belong to several kinds, e.g. both a disease and a symptom.
            word_kind_dict[word] = [kind for kind, words in kind_words if word in words]
        self.word_kind_dict = word_kind_dict

    @staticmethod
    def _get_actree(key_list):
        actree = ahocorasick.Automaton()
        for index, word in enumerate(key_list):
            actree.add_word(word, (index, word))
        actree.make_automaton()
        return actree

    def classify(self, question):
        classify_res = {}
        medical_dict = self.check_query(question)
        if not medical_dict:
            return {}
        classify_res['args'] = medical_dict
        region_word_kinds = []
        for kinds in medical_dict.values():
            region_word_kinds.extend(kinds)
        question_kinds = []
        # disease symptom
        self.sub_classify(self.symptom_qwds, question, 'disease', region_word_kinds, question_kinds, "disease_symptom")
        # symptom disease
        self.sub_classify(self.symptom_qwds, question, 'symptom', region_word_kinds, question_kinds, "symptom_disease")
        # disease cause
        self.sub_classify(self.cause_qwds, question, 'disease', region_word_kinds, question_kinds, "disease_cause")
        # disease accompany
        self.sub_classify(self.acompany_qwds, question, 'disease', region_word_kinds, question_kinds,
                          "disease_accompany")
        # disease food
        if self.check_words(self.food_qwds, question) and 'disease' in region_word_kinds:
            deny_status = self.check_words(self.deny_feature_words, question)
            if deny_status:
                question_kind = "disease_not_food"
            else:
                question_kind = "disease_do_food"
            question_kinds.append(question_kind)
        # food disease
        if self.check_words(self.food_qwds + self.cure_qwds, question) and 'food' in region_word_kinds:
            deny_status = self.check_words(self.deny_feature_words, question)
            if deny_status:
                question_kind = 'food_not_disease'
            else:
                question_kind = 'food_do_disease'
            question_kinds.append(question_kind)
        # disease_drug
        self.sub_classify(self.drug_qwds, question, 'disease', region_word_kinds, question_kinds, "disease_drug")
        # drug disease
        self.sub_classify(self.cure_qwds, question, 'drug', region_word_kinds, question_kinds, "drug_disease")
        # disease check
        self.sub_classify(self.check_qwds, question, 'disease', region_word_kinds, question_kinds, "disease_check")
        # check disease
        self.sub_classify(self.check_qwds + self.cure_qwds, question, 'check', region_word_kinds, question_kinds,
                          "check_disease")
        # disease prevent
        self.sub_classify(self.prevent_qwds, question, 'disease', region_word_kinds, question_kinds, "disease_prevent")
        # disease last time
        self.sub_classify(self.lasttime_qwds, question, 'disease', region_word_kinds, question_kinds,
                          "disease_lasttime")
        # disease cure way
        self.sub_classify(self.cureway_qwds, question, 'disease', region_word_kinds, question_kinds, "disease_cureway")
        # disease cure prob
        self.sub_classify(self.cureprob_qwds, question, 'disease', region_word_kinds, question_kinds,
                          "disease_cureprob")
        # disease easy get
        self.sub_classify(self.easyget_qwds, question, 'disease', region_word_kinds, question_kinds, "disease_easyget")
        # others deal
        if question_kinds == [] and 'disease' in region_word_kinds:
            question_kinds.append('disease_desc')
        if question_kinds == [] and 'symptom' in region_word_kinds:
            question_kinds.append('symptom_disease')

        classify_res['question_kinds'] = question_kinds
        return classify_res

    def sub_classify(self, kind_qkwds, question, key, region_word_kinds, question_kinds, kind_type):
        if self.check_words(kind_qkwds, question) and (key in region_word_kinds):
            question_kinds.append(kind_type)

    @staticmethod
    def check_words(kws, question):
        for kw in kws:
            if kw in question:
                return True
        return False

    def check_query(self, question):
        # Collect every dictionary word that occurs in the question.
        # iter() yields (end_index, value), where value is the (index, word) tuple stored by add_word.
        region_feature_words = []
        for _, (_, feature_word) in self.region_actree.iter(question):
            region_feature_words.append(feature_word)
        # Drop words that are substrings of a longer match. Compare every pair rather than only
        # later matches, since a shorter word is not guaranteed to be reported before the longer
        # word that contains it.
        inner_words = set()
        for wi in region_feature_words:
            for wj in region_feature_words:
                if wi != wj and wi in wj:
                    inner_words.add(wi)
        final_dict = {word: self.word_kind_dict.get(word) for word in region_feature_words
                      if word not in inner_words}
        return final_dict
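
The food-related branch of `classify` is the only place where negation words change the label, so it may help to look at it in isolation. The sketch below mirrors that branch under stated assumptions: `DENY_WORDS` is an assumed sample (the real list lives in deny.txt), the other word lists are small subsets of `food_qwds` and `cure_qwds`, and `contains_any` and `food_question_kinds` are hypothetical helper names.

```python
FOOD_WORDS = ["吃", "喝", "饮食", "食物"]          # subset of food_qwds
CURE_WORDS = ["有什么好处", "有什么用", "用途"]    # subset of cure_qwds
DENY_WORDS = ["不", "忌", "别", "少"]              # assumed sample of deny.txt


def contains_any(question, words):
    """True if at least one of the words occurs in the question."""
    return any(word in question for word in words)


def food_question_kinds(question, entity_kinds):
    """Pick the *_not_* or *_do_* label depending on whether a negation word occurs."""
    kinds = []
    has_deny = contains_any(question, DENY_WORDS)
    if "disease" in entity_kinds and contains_any(question, FOOD_WORDS):
        kinds.append("disease_not_food" if has_deny else "disease_do_food")
    if "food" in entity_kinds and contains_any(question, FOOD_WORDS + CURE_WORDS):
        kinds.append("food_not_disease" if has_deny else "food_do_disease")
    return kinds


print(food_question_kinds("失眠的人不要吃啥？", {"disease"}))
print(food_question_kinds("鹅肉有什么好处？", {"food"}))
```

Since the negation check is a plain substring test over the whole question, a negation word anywhere in the sentence flips the label, which is one of the limits of the rule-based approach.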

Testing the results:
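The results are basically in line with expectations. Of course, one could also use entity recognition to identify the target entities and a deep learning model to classify the questions, which would improve the generalization and recall of question classification. Optimizing with deep learning, however, means that a large amount of labeled data is required.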

Original article: https://blog.csdn.net/meiqi0538/article/details/125951800