An English Text Information Retrieval System Implemented in Python

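Contents
1. Implementing user interaction
3. Building the query table
3.1 Preprocessing
3.2 Building the inverted index
3.3 Compressing the inverted index
3.4 Building the permuterm index
4. Boolean queries
5. Computing TF-IDF values
6. Wildcard queries
7. Phrase queries
8. Changing the number of results
1. User interaction
2. Data acquisition
3. Building the query table
4. Viewing the VB encoding of a given word
5. Boolean queries
6. Wildcard queries
7. Phrase queries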
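【Experiment Name】: English Text Retrieval System
【Experiment Objectives】:
Develop an information retrieval system for English text that supports building an index table, Boolean queries, wildcard queries, and phrase queries. This article is reposted from http://www.biyezuopin.vip/onews.asp?id=16709. The development process is meant to achieve the following objectives:
(1) Review the information retrieval material covered this semester;
(2) Master the basic information retrieval methods and understand how a retrieval system is put together;
(3) Be able to implement, maintain, and optimize an information retrieval system.
The features implemented so far are:
(1) Automatically fetching text from an English novel website as the data source;
(2) Building the query table;
(3) Computing the TF-IDF value of a given word;
(4) Boolean queries;
(5) Wildcard queries;
(6) Phrase queries.
Every query command accepts the --hits parameter to limit the number of results printed.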
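The wildcard query in the feature list above relies on a permuterm index (the source calls `create_Permuterm_index` before `find_regex_words`). The following is only a minimal sketch of that idea, not the project's implementation: the names `rotations`, `build_permuterm`, and `wildcard_lookup` are hypothetical, the sketch handles exactly one `*` per pattern, and it scans a plain dictionary where a real index would use a sorted structure for prefix search.

```python
def rotations(term):
    """Return every rotation of term + '$' ('$' marks the end of the word)."""
    marked = term + '$'
    return [marked[i:] + marked[:i] for i in range(len(marked))]


def build_permuterm(vocabulary):
    """Map each rotation back to the term it came from."""
    index = {}
    for term in vocabulary:
        for rot in rotations(term):
            index[rot] = term
    return index


def wildcard_lookup(index, pattern):
    """Resolve a pattern containing exactly one '*' by prefix-matching rotations."""
    if pattern.count('*') != 1:
        raise ValueError('this sketch handles exactly one "*"')
    head, tail = pattern.split('*')
    # Rotate the query so the wildcard moves to the end: X*Y becomes Y$X*
    prefix = tail + '$' + head
    return sorted({term for rot, term in index.items() if rot.startswith(prefix)})


if __name__ == '__main__':
    idx = build_permuterm(['when', 'where', 'what', 'we', 'how'])
    print(wildcard_lookup(idx, 'wh*'))
    print(wildcard_lookup(idx, 'w*e'))
```

Rotating the pattern turns every single-wildcard query into a prefix search, which is why the end marker `$` has to be part of each stored rotation.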
【Experiment Environment】:
(1) Processor:
Intel® Core™ i5-9300H CPU @ 2.40GHz 2.40 GHz
(2) Operating system:
Windows 10 Home (Chinese edition) x64 21H1 19043.1052
(3) Programming language:
Python 3.8
(4) IDE and package manager:
JetBrains PyCharm 2020.1 x64, anaconda 3 for Windows (conda 4.9.0)
(5) Third-party libraries:
See the attached requirements.txt
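【References】:
[1] Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. Introduction to Information Retrieval (revised edition). Translated by Wang Bin and Li Peng. Posts & Telecom Press, July 2019.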
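Experiment Content
【Experiment Design】:
This part covers the following 8 modules and describes the principle and the implementation of each: user interaction, data acquisition, building the query table, Boolean queries, computing TF-IDF values, wildcard queries, phrase queries, and changing the number of results.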
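As a rough illustration of how the query table, Boolean queries, and TF-IDF ranking fit together, here is a minimal sketch. It is not the project's `utils.Inverted_Index_Table` code: the helper names (`tokenize`, `build_index`, `boolean_and`, `tfidf_rank`) are hypothetical, and the weighting `tf * log10(N / df)` is one common textbook choice that the project may or may not use.

```python
import math
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r'[a-z]+', text.lower())


def build_index(docs):
    """Build {term: {doc_id: term_frequency}} from {doc_id: text}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index


def boolean_and(index, term_a, term_b):
    """Return the sorted doc IDs that contain both terms."""
    return sorted(set(index.get(term_a, {})) & set(index.get(term_b, {})))


def tfidf_rank(index, n_docs, query):
    """Score documents by summing tf * log10(N / df) over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        idf = math.log10(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == '__main__':
    docs = {0: 'we are here', 1: 'we are we', 2: 'you are there'}
    index = build_index(docs)
    print(boolean_and(index, 'we', 'are'))
    print(tfidf_rank(index, len(docs), 'we'))
```

Storing term frequencies inside the postings lets the same structure serve both the set operations of a Boolean query and the scoring step.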

# -*- coding: utf-8 -*-
import cmd
import re
import sys
from utils.Inverted_Index_Table import process
from utils.IO_contral import show_summary
from utils.crawl import Spider
import time
import operator
import os


class IRcmder(cmd.Cmd):
    intro = "Welcome to the Information Retrieval System.\n".center(100, ' ')
    intro += "\n\nThis is a simple Information Retrieval System.\n"
    intro += "You can use some commands to do some work related to the information retrieval.\n"
    intro += "Only supports English.\n"
    intro += "Shell commands are defined internally.  \n\n"
    intro += "Type \'help\' or \'?\' to list all available commands.\n"
    intro += "Type \'help cmd\' to see more details about the command \'cmd\'.\n"
    intro += "Or type \'exit\' to exit this system.\n\n"

    def __init__(self):
        super(IRcmder, self).__init__()
        self.k = 10

    # Fetch data with the crawler
    def do_get_data(self, args):
        # Usage: get_data ./data --numbers=10 --wait=0.5
        k = 10
        tar_seconds = None
        unmatched = []
        for item in args.split():
            numbers = re.search(r'(?<=numbers=)\d+', item)
            wait = re.search(r'(?<=wait=)\d+(?:\.\d+)?', item)
            if numbers:
                k = int(numbers.group())
            elif wait:
                # The help text uses --wait=0.5, so parse the value as a float
                # (assumes Spider accepts a fractional number of seconds)
                tar_seconds = float(wait.group())
            else:
                unmatched.append(item)
        # Whatever is left over is treated as the target directory
        dirr = ' '.join(unmatched).strip()
        if not dirr:
            print('Please specify a directory, e.g. get_data ./data')
            return
        if not os.path.exists(dirr):
            os.makedirs(dirr)
        bug = Spider(limit=k, save_dir=dirr)
        bug.get_novels(wait=tar_seconds)
        bug.get_chapter(wait=tar_seconds)

    def change_k(self, args):
        # Pull an optional --hits=N argument out of args, store N in self.k,
        # and always return the remaining arguments as a string.
        unmatched = []
        for item in args.split(' '):
            res = re.search(r'(?<=hits=)\d+', item)
            if res:
                self.k = int(res.group())
            else:
                unmatched.append(item)
        return ' '.join(unmatched).strip()

    # Build the inverted index table
    def do_build_table(self, args):
        try:
            self.object = process(args)
            self.object.indextable.index_compression()
            self.object.indextable.create_Permuterm_index()

        except Exception as e:
            print(e)

    # Print the index
    def do_show_index(self, args):
        try:
            self.object.indextable.show_index(args)
        except Exception as e:
            print(e)

    # Build the permuterm index
    def do_create_Permuterm_index(self, args):
        try:
            self.object.indextable.create_Permuterm_index()
        except Exception as e:
            print(e)

    # Wildcard query
    def do_wildcard_query(self, args):
        args = self.change_k(args)
        print('\nWildcard query.')
        print('\n')
        try:
            t1 = time.time()
            if not self.object.indextable.permuterm_index_table:
                self.object.indextable.create_Permuterm_index()
            ret = self.object.indextable.find_regex_words(args)
            words = ret
            print('searched words: ', ret)
            ret = self.object.indextable.compute_TFIDF(' '.join(ret))
            t2 = time.time()
            print('Total docs: {0} (in {1:.5f} seconds)'.format(len(ret), t2 - t1))
            print('Top-{0} rankings:\n'.format(min(self.k, len(ret))))
            printed = {}
            for word in words:
                printed[word] = []
            cnt = 0
            for index, i in enumerate(ret):
                if cnt >= self.k:
                    break
                hit_info = 'doc ID: {0} '.format(i[0]).ljust(12, ' ')
                hit_info += 'TF-IDF value: {0:.5f} '.format(i[1]).ljust(22, ' ')
                hit_info += 'doc name: {0}'.format(self.object.doc_lists[i[0]])
                print(hit_info)
                for word in words:
                    if i[0] not in printed[word]:
                        if cnt >= self.k:
                            break
                        flag = show_summary(doc_list=self.object.doc_lists, index=i[0], word=word)
                        if flag:
                            # Do not print a document again for a word it was already printed for
                            print('\n')
                            printed[word].append(i[0])
                            cnt += 1
            self.k = 10
        except Exception as e:
            print(e)

    # Look up the given words directly via TF-IDF
    # def do_search_by_TFIDF(self, args):
    #     args = self.change_k(args)
    #     try:
    #         ret = self.object.indextable.compute_TFIDF(args)
    #         print('Top-%d rankings:' % self.k)
    #         for index, i in enumerate(ret):
    #             if index > self.k:
    #                 break
    #             print(i)
    #         # build Reuters
    #         # search_by_TFIDF approximately
    #     except Exception as e:
    #         print(e)

    # Boolean query; summary display is not supported yet
    def do_boolean_query(self, args):
        args = self.change_k(args)
        print('\nBoolean query. Summary display is not supported yet.')
        print('\n')
        try:
            t1 = time.time()
            expression = args.replace('(', ' ( ').replace(')', ' ) ').split()
            doc_list = sorted(self.object.documents.keys())
            ret = self.object.indextable.boolean_query(expression, doc_list)
            t2 = time.time()
            if ret == 'Invalid boolean expression.':
                # boolean_query reports a malformed expression with this string
                print(ret)
            elif len(ret) == 0:
                print('Not found. (in {0:.5f} seconds)'.format(t2 - t1))
                if len(expression) == 1:
                    self.object.indextable.correction(expression[0])
            else:
                print('Total docs: {0} (in {1:.5f} seconds)'.format(len(ret), t2 - t1))
                print('Top-{0} rankings:\n'.format(min(self.k, len(ret))))
                cnt = 0
                for ID in ret:
                    if cnt >= self.k:
                        break
                    result = 'doc ID: {0} '.format(ID).ljust(12, ' ')
                    result += 'doc name: {0}'.format(self.object.doc_lists[ID])
                    print(result)
                    cnt += 1
                print('\n')
        except Exception as e:
            print(e)

    # Phrase query
    def do_phrase_query(self, args):
        args = self.change_k(args)
        print('\nPhrase query.')
        print('\n')
        try:
            t1 = time.time()
            ret = self.object.indextable.phrase_query(args)
            scores = {}
            for i in ret:
                scores[i] = self.object.indextable.compute_TFIDF_with_docID(args, i)
            scores = sorted(scores.items(), key=operator.itemgetter(1), reverse=True)
            t2 = time.time()
            print('Total docs: {0} (in {1:.5f} seconds)'.format(len(scores), t2 - t1))
            print('Top-{0} rankings:\n'.format(min(self.k, len(scores))))
            for index, i in enumerate(scores):
                if index >= self.k:
                    break
                hit_info = 'doc ID: {0} '.format(i[0]).ljust(12, ' ')
                hit_info += 'TF-IDF value: {0:.5f} '.format(i[1]).ljust(22, ' ')
                hit_info += 'doc name: {0}'.format(self.object.doc_lists[i[0]])
                print(hit_info)
                flag = show_summary(doc_list=self.object.doc_lists, index=i[0], word=args)
                if flag:
                    print('\n')
                # print(i)
        except Exception as e:
            print(e)

    def do_exit(self, args):
        try:
            print('\nThank you for using. Goodbye.\n')
            sys.exit()
        except Exception as e:
            print(e)

    def emptyline(self):
        pass

    def default(self, line):
        print('Unrecognized command.\nNo such symbol : {0}'.format(line))

    def help_build_table(self):
        cmd_info = 'command: \tbuild_table'.center(80, ' ')
        cmd_info = cmd_info + '\nbuild_table [dir] --language'.center(30, " ") + '\n\n'
        cmd_info = cmd_info + 'Before the whole project starts,'
        cmd_info = cmd_info + 'you need to build an inverted index table via this command first.\n'
        cmd_info = cmd_info + 'The program will read the files under the path [dir],'
        cmd_info = cmd_info + ' and then create an inverted index table compressed with VB encoding.\n'
        cmd_info = cmd_info + 'You should use the parameter \'--language=\' ' \
                              'to explicitly specify the language of the text.\n\n'
        cmd_info = cmd_info + 'For example, assume that your documentation set is stored in the \'./data\' directory.\n'
        cmd_info = cmd_info + 'If the documentation set is in Chinese, please type: \n\tbuild_table ./data --language=zh\n'
        cmd_info = cmd_info + 'And if the language of the documentation set is English, '
        cmd_info = cmd_info + 'please type: \n\tbuild_table ./data --language=en\n\n'
        cmd_info = cmd_info + 'Later an inverted index table will be built.'
        print(cmd_info)

    def help_show_index(self):
        cmd_info = 'command: \tshow_index'.center(80, ' ')
        cmd_info = cmd_info + '\nshow_index [word]'.center(30, " ") + '\n\n'
        cmd_info = cmd_info + 'When building the index table, VB code is used for compression.'
        cmd_info = cmd_info + '\nAfter building the index table, '
        cmd_info = cmd_info + 'you can view the VB compression code of the word you want via this command.\n\n'
        cmd_info = cmd_info + 'For example, if you want to see the VB compression code of the word \'we\','
        cmd_info = cmd_info + ' please type: \n\nshow_index we\n\n'
        cmd_info = cmd_info + 'Later the screen will show the VB compression code of \'we\'\n'
        print(cmd_info)

    def help_get_data(self):
        cmd_info = 'command: \tget_data'.center(80, ' ')
        cmd_info = cmd_info + '\nget_data [dir] --wait --numbers'.center(30, " ") + '\n\n'
        cmd_info = cmd_info + 'If you don\'t have any English text, '
        cmd_info = cmd_info + 'then you may need to get some English text for the next work.'
        cmd_info = cmd_info + '\nYou can use your own data source, ' \
                              'or use this command to get some data automatically.\n\n'
        cmd_info = cmd_info + 'For example, if you want to get some data automatically,'
        cmd_info = cmd_info + ' please type: \n\nget_data ./data --numbers=10 --wait=0.5\n\n'
        cmd_info = cmd_info + 'Later you will get some English novels as a data source\n'
        cmd_info = cmd_info + 'Of course, in order to prevent crawlers from being banned by the website, '
        cmd_info = cmd_info + 'we use the --wait parameter to wait for a period of time after each link is obtained '
        cmd_info = cmd_info + 'to avoid putting too much pressure on the server.\n'
        print(cmd_info)

    def help_boolean_query(self):
        cmd_info = 'command: \tboolean_query'.center(80, ' ')
        cmd_info = cmd_info + '\nboolean_query [options] --hits '.center(30, " ") + '\n\n'
        cmd_info = cmd_info + 'After creating the index table, you can use this command for boolean query.\n'
        cmd_info = cmd_info + 'Available operations are AND, OR and NOT, '
        cmd_info = cmd_info + 'and you can use () to combine them arbitrarily.\n'
        cmd_info = cmd_info + 'For example, if you want to find articles '
        cmd_info = cmd_info + 'that contain \'we\' and \'are\' but not \'you\','
        cmd_info = cmd_info + ' please type: \n\nboolean_query we AND are NOT you --hits=7\n\n'
        cmd_info = cmd_info + 'Later the screen will show some articles which are found.\n'
        cmd_info = cmd_info + 'Only supports English.\n'
        print(cmd_info)

    def help_phrase_query(self):
        cmd_info = 'command: \tphrase_query'.center(80, ' ')
        cmd_info = cmd_info + '\nphrase_query [phrase] --hits '.center(30, " ") + '\n\n'
        cmd_info = cmd_info + 'After creating the index table, you can use this command for phrase query.\n'
        cmd_info = cmd_info + 'For example, if you want to find an article that contains \'how is the weather today\','
        cmd_info = cmd_info + ' please type: \n\nphrase_query how is the weather today --hits=7\n\n'
        cmd_info = cmd_info + 'Later the screen will show some articles ' \
                              'with some summary information which are found.\n'
        cmd_info = cmd_info + 'Only supports English.\n'
        print(cmd_info)

    def help_wildcard_query(self):
        cmd_info = 'command: \twildcard_query'.center(80, ' ')
        cmd_info = cmd_info + '\nwildcard_query [target] --hits '.center(30, " ") + '\n\n'
        cmd_info = cmd_info + 'After creating the index table, you can use this command for wildcard query.\n'
        cmd_info = cmd_info + 'For example, if you want to find some articles that contain words starting with \'wh\' '
        cmd_info = cmd_info + '(like \'when\' or \'where\' or \'what\' or other words),'
        cmd_info = cmd_info + ' please type: \n\nwildcard_query wh* --hits=7\n\n'
        cmd_info = cmd_info + 'Later the screen will show some articles ' \
                              'with some summary information which are found.\n'
        cmd_info = cmd_info + 'Only supports English.\n'
        print(cmd_info)


if __name__ == '__main__':
    info = "\n\nThis is a simple Information Retrieval System.\nCopyright 2021 " \
           "@CandyMonster37: https://github.com/CandyMonster37\n" + \
           "A course final project for Information Retrieval, and you can find the latest version of the code " \
           "here: \n   https://github.com/CandyMonster37/InformationRetrival.git \n\n\n"

    print(info)

    IRcmder.prompt = 'IR > '
    IRcmder().cmdloop()
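
The `build_table` command above compresses the index with VB (variable-byte) encoding, and `show_index` prints the result for one word. The project's `index_compression` is not shown here, so the following is only a sketch of the textbook scheme from reference [1], with hypothetical function names: doc IDs are stored as gaps, each gap is split into 7-bit groups, and the high bit is set on the last byte of each number.

```python
def vb_encode_number(n):
    """Encode one non-negative integer as variable-byte bytes."""
    if n < 0:
        raise ValueError('VB encoding needs a non-negative integer')
    chunks = []
    while True:
        chunks.insert(0, n % 128)  # take the lowest 7 bits
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128  # set the high bit on the last byte to end the number
    return bytes(chunks)


def vb_encode(doc_ids):
    """Encode a sorted list of doc IDs as VB-coded gaps."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        out += vb_encode_number(doc_id - previous)
        previous = doc_id
    return bytes(out)


def vb_decode(data):
    """Decode VB-coded gaps back into the original sorted doc IDs."""
    doc_ids = []
    n = 0
    previous = 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte  # more bytes of this number follow
        else:
            n = 128 * n + (byte - 128)  # last byte of this number
            previous += n
            doc_ids.append(previous)
            n = 0
    return doc_ids


if __name__ == '__main__':
    encoded = vb_encode([824, 829, 215406])
    print(encoded.hex())
    print(vb_decode(encoded))
```

Encoding gaps instead of raw IDs keeps most numbers small, so frequent terms with dense postings lists often need only one byte per entry.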


Original article: https://blog.csdn.net/sheziqiong/article/details/126794550