• 基于Boost的搜索引擎



    1、项目的的相关背景

    搜索引擎是指根据一定的策略、运用特定的计算机程序从互联网上采集信息,在对信息进行组织和处理后,为用户提供检索服务,将检索的相关信息展示给用户的系统。

    国内有许多做搜索的公司:百度、搜狗、360搜索等等。
    这些大型公司做的搜索引擎是全网搜索,背后也是有巨大的团队支撑,单单靠几个人是完全不可能的完成的。
    所以,我这次做的是一个小型的搜索引擎。

    1.1 什么样的搜索引擎

    学C++朋友都应该知道Boost库吧,需要用到现成的东西就需要去官网查找,但Boost官网中并没有站内搜索,所以这次的项目就是基于Boost官网的站内搜索引擎。

    在实现项目之前,需要了解搜索成功后,需要将哪些内容呈现给用户,可以对标一下其它的搜索引擎。

    百度:
    在这里插入图片描述

    搜狗:
    在这里插入图片描述

    360搜索:
    在这里插入图片描述


    根据上面的搜索结果可以发现,都包含了标题,正文,url链接。所以我们的搜索结果也以这样的方式呈现。

    项目效果

    后台数据:

    在这里插入图片描述

    前端页面:

    在这里插入图片描述

    2、搜索引擎的相关宏观原理图

    在这里插入图片描述

    3、搜索引擎技术栈和项目环境

    • 技术栈: C/C++,C++11,STL,准标准库Boost(用于对文件的处理),Jsoncpp(进行序列化和反序列化),cppjieba(对html中的内容进行分词处理),cpp-httplib(c++封装的http开源库),html5,css,js,jQuery,Ajax(进行前后端交互)
    • 项目环境: Centos 7云服务器,vim/gcc(g++)/Makefile,vscode

    4、 正排索引 vs 倒排索引——搜索引擎具体原理

    正排索引: 根据文档ID映射文档内容,只要文档ID正确,就一定能找到对应的文档

    文档ID文档内容
    1重庆今年特别热,不愧为中国的火炉呀
    2重庆是中国的四大火炉之首

    对目标文档进行分词,目的是为了方便建立倒排索引和查找

    • 重庆今年特别热:重庆/今年/特别/热/特别热/不愧/中国/火炉
    • 重庆是中国的四大火炉之首:重庆/中国/四大火炉/火炉/之首

    像"的",“呀"之类的词,几乎在所有的文章都有大量出现,不应该作为搜索的关键词,这类词称之为**“暂停词”**,类似的还有"了”、"吗"等等,对此,不给予考虑。

    倒排索引: 根据文档内容,分词,整理不重复的各个关键字,对应联系到文档ID的方案

    关键词(具有唯一性)对应的文档ID
    重庆文档1、文档2
    今年文档1
    特别文档1
    文档1
    中国文档1、文档2
    特别热文档1
    火炉文档1、文档2
    四大火炉文档2
    不愧文档1
    四大火炉文档2
    之首文档2

    模拟一次查找的过程: 输入关键词火炉,通过倒排索引提取对应的文档ID,在正排索引中,通过文档ID获取文档内容,对文档内容中的title、conent(desc)、url文档结果进行摘要,最后构成响应结果

    此时有两个细节:
    细节一:需要用到权值(weight)对搜索到的文档结果进行排序,而权值根据关键词在标题中是否出现以及在内容中出现的次数决定
    细节二:搜索的内容可能不是一个关键词,而是一条语句,语句经过分词处理,被处理成多个关键词,多个关键词可能对应着同一个文档,此时需要进行去重处理

    5、 编写数据去标签与数据清洗的模块Parser

    5.1 获取原始数据

    在这里插入图片描述
    到Boost官网下载boost_1_79_0.tar.gz,也可以下载其它版本

    链接:https://www.boost.org/users/history/version_1_79_0.html

    下载好后,通过rz -E命令将下载好的压缩文件上传到服务器上
    在这里插入图片描述

    通过命名tar xzf 进行解压,例如:tar xzf boost_1_78_0.tar.gz
    在这里插入图片描述

    然后进入boost_1_79_0这个文件夹,就可以看到文件的内容
    在这里插入图片描述

    因为Boost的大部分网页内容都在doc/html下,这也是搜索的数据源,将其拷贝到data/input
    在这里插入图片描述

    mkdir -p data/input 创建目录
    cp -rf boost_1_78_0/doc/html/* data/input/ 拷贝doc/html下的所有html文件

    5.2 为什么要进行数据清洗

    DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>Chapter?41.?Boost.Typeoftitle>
    <link rel="stylesheet" href="../../doc/src/boostbook.css" type="text/css">
    <meta name="generator" content="DocBook XSL Stylesheets V1.79.1">
    <link rel="home" href="index.html" title="The Boost C++ Libraries BoostBook Documentation Subset">
    <link rel="up" href="libraries.html" title="Part?I.?The Boost C++ Libraries (BoostBook Subset)">
    <link rel="prev" href="boost_typeindex/acknowledgements.html" title="Acknowledgements">
    <link rel="next" href="typeof/tuto.html" title="Tutorial">
    head>
    <body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
    <table cellpadding="2" width="100%"><tr>
    <td valign="top"><img alt="Boost C++ Libraries" width="277" height="86" src="../../boost.png">td>
    <td align="center"><a href="../../index.html">Homea>td>
    <td align="center"><a href="../../libs/libraries.htm">Librariesa>td>
    <td align="center"><a href="http://www.boost.org/users/people.html">Peoplea>td>
    <td align="center"><a href="http://www.boost.org/users/faq.html">FAQa>td>
    <td align="center"><a href="../../more/index.htm">Morea>td>
    tr>table>
    <hr>
    <div class="spirit-nav"><a accesskey="p" href="boost_typeindex/acknowledgements.html"><img src="../../doc/src/images/prev.png" alt="Prev">a>
    <a accesskey="u" href="libraries.html"><img src="../../doc/src/images/up.png" alt="Up">a>
    <a accesskey="h" href="index.html"><img src="../../doc/src/images/home.png" alt="Home">a>
    <a accesskey="n" href="typeof/tuto.html"><img src="../../doc/src/images/next.png" alt="Next">a>
    
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27

    上面是一个html页面的部分内容,除了需要的标题,内容之外,还有大量的双标签和单标签,这些是标签之中的内容跟搜索毫无关系,需要去除,然后保留其他有价值的内容

    数据清洗后的每个文档之间用’\3’作为分割符,然后将其放到data/raw_html/raw.txt中

    为什么要用’\3’作为文档之间的分隔符?
    因为它是不可显示的控制字符,不会显示,不会污染我们形成的新的文档。如果原意,也可以采用其他不可显示的字符。

    5.3 编写parser.cpp

    5.3.1 整体框架

    #include 
    #include 
    #include 
    #include 
    #include "util.hpp"
    
    //是一个目录,下面放的是所有的html网页
    const std::string src_path = "data/input";
    //用来存放解析的内容
    const std::string output = "data/raw_html/raw.txt";
    
    typedef struct DocInfo
    {
    	std::string title;	 //文档的标题
    	std::string content; //文档的内容
    	std::string url;	 //文档在官网中的url
    } DocInfo_t;
    
    // const&:输入
    //*:输出
    //&:输入输出
    
    bool EnumFile(const std::string &src_path, std::vector<std::string> *files_list);
    bool ParseHtml(const std::vector<std::string> &files_list, std::vector<DocInfo_t> *results);
    bool SaveHtml(const std::vector<DocInfo_t> &results, const std::string &output);
    
    int main()
    {
    	std::vector<std::string> files_list;
    	//第一步,递归式的把每个html文件名+路径,保存到file_list中,方便后期进行文件读取
    	//例如:data/input/.......
    	if (!EnumFile(src_path, &files_list))
    	{
    		std::cerr << "enum file name error" << std::endl;
    		return 1;
    	}
    
    	//第二步,读取file_list中的每个文件内容(文件名+路径确定唯一文件),并进行解析
    	std::vector<DocInfo_t> results;
    	if (!ParseHtml(files_list, &results))
    	{
    		std::cerr << "parse html error" << std::endl;
    		return 2;
    	}
    
    	//第三步,把解析完毕的各个文件内容,写入到output,以\3作为每个文件的分隔符
    	if (!SaveHtml(results, output))
    	{
    		std::cerr << "save html error" << std::endl;
    		return 3;
    	}
    	return 0;
    }
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49
    • 50
    • 51
    • 52
    • 53

    第一步: 递归式的把每个html文件名+路径,保存到file_list中,方便后期进行文件读取
    第二步: 读取file_list中的每个文件内容(文件名+路径确定唯一文件),并进行解析
    第三步: 把解析完毕的各个文件内容,写入到output,以\3作为每个文件的分隔符

    5.3.2 保存html的文件名

    因为这里有文件操作,所以需要用到Boost开发库中的一些文件操作函数。

    Boost库的安装命令: sudo yum install -y boost-devel
    Boost库中的文件操作具体细节可以在官网查看

    安装好后,就能进行相关代码的编写,具体如下:

    bool EnumFile(const std::string &src_path, std::vector<std::string> *files_list)
    {
    	// boost库下的文件操作函数名全在filesystem的命名空间下
    	namespace fs = boost::filesystem;
    	fs::path root_path(src_path);
    	//判断路径是否存在,不存在就没必要继续了
    	if (!fs::exists(root_path))
    	{
    		std::cerr << src_path << "not exists" << std::endl;
    		return false;
    	}
    
    	//定义一个空的迭代器,用来进行判断递归是否结束
    	fs::recursive_directory_iterator end;
    	for (fs::recursive_directory_iterator iter(root_path); iter != end; ++iter)
    	{
    		//判断当前路径下的当前文件是否是普通文件,html都是普通文件
    		// if (!fs::is_regular_file(*iter))
    		// {
    		// 	continue;
    		// }
    		// //判断当前文件l路径名的后缀是否是复合要求,即是否是html文件
    		// // path()获取当前路径,extension()获取后缀
    		// if (iter->path().extension() != ".html")
    		// {
    		// 	continue;
    		// }
    		//上述两步可以合并为一步
    		if (!fs::is_regular_file(*iter) || iter->path().extension() != ".html")
    		{
    			continue;
    		}
    
    		//当前的路径一定是一个合法的,以html结束的普通网页文件
    		// path()是一个路径对象,需要对象里面的路径转化为字符串
    		//将所有带路径的thml保存到files_list中,方便后续进行文本分析
    		files_list->push_back(iter->path().string());
    	}
    
    	return true;
    }
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41

    5.3.3 解析html文件

    解析html文件一共分为四步
    第一步:通过保存下来的文件名(路径),读取文件的内容
    第二步:提取文件标题(title)
    第三步:提取文件正文(content)
    第四步:构建该网页的url链接

    bool ParseHtml(const std::vector<std::string> &files_list, std::vector<DocInfo_t> *results)
    {
    	for (const std::string &file : files_list)
    	{		
    		// 1.读取文件, Read()
    		std::string result;
    		if (!ns_util::FileUtil::ReadFile(file, &result))
    		{
    			continue;
    		}
    
    		// 2.解析指定文件。提取title
    		DocInfo doc;
    		if (!ParseTitle(result, &doc.title))
    		{
    			continue;
    		}
    
    		// 3.解析指定文件,提取content
    		if (!ParseContent(result, &doc.content))
    		{
    			continue;
    		}
    
    		// 4.解析指定的文件路径,构建url
    		if (!ParseUrl(file, &doc.url))
    		{
    			continue;
    		}
    
    		// done,一定是完成了解析任务,当前文档的相关结果都保存在了doc里面
    		results->push_back(std::move(doc)); //移动拷贝,提高效率
    	}
    	return true;
    }
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35

    如何读取html文件内容?

    打开对应的文件,一行一行的读取数据,读取完关闭文件。如果中间出错,则返回false。不过这个函数放在了util.hpp下的namespace ns_util中

    class FileUtil
    {
    public:
        static bool ReadFile(const std::string &file_path, std::string *out)
        {
            std::ifstream in(file_path, std::ios::in);
            if (!in.is_open())
            {
                std::cerr << "open file" << file_path << "error" << std::endl;
                return false;
            }
            std::string line;
            while (std::getline(in, line)) // getline的返回值是一个&,对其进行判断时,重载了强制将类型转换
            {
                *out += line;
            }
    
            in.close();
            return true;
        }
    };
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21

    如何提取标题?

    先回顾之前的html文件,在第一次出现< title> 和< /title>之间的内容一定是标题。
    首先,需要找到在文件内容中的位置(begin),其次再找到在文件内容中的位置(end),最后begin+的长度就是标题的起始位置,end-起始位置就是标题的长度</strong></p> <p><img src="https://1000bd.com/contentImg/2022/08/19/103622365.png" alt="在这里插入图片描述"></p> <pre data-index="5" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token keyword">static</span> <span class="token keyword">bool</span> <span class="token function">ParseTitle</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>file<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">*</span>title<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> std<span class="token double-colon punctuation">::</span>size_t begin <span class="token operator">=</span> file<span class="token punctuation">.</span><span class="token function">find</span><span class="token punctuation">(</span><span class="token string">"<title>"</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>begin <span class="token operator">==</span> std<span class="token double-colon punctuation">::</span>string<span class="token double-colon punctuation">::</span>npos<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">return</span> <span class="token boolean">false</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> std<span class="token double-colon punctuation">::</span>size_t end <span class="token operator">=</span> file<span class="token punctuation">.</span><span class="token function">find</span><span class="token punctuation">(</span><span class="token string">""); if (end == std::string::npos) { return false; } begin += std::string(""</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>begin <span class="token operator">></span> end<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">return</span> <span class="token boolean">false</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">//begin就是标题开始的位置,end-begin就是标题的长度</span> <span class="token operator">*</span>title <span class="token operator">=</span> file<span class="token punctuation">.</span><span class="token function">substr</span><span class="token punctuation">(</span>begin<span class="token punctuation">,</span> end <span class="token operator">-</span> begin<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">return</span> <span class="token boolean">true</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <div class="hljs-button signin" data-title="登录后复制" data-report-click="{"spm":"1001.2101.3001.4334"}"></div></code><div class="hide-preCode-box"><span class="hide-preCode-bt" data-report-view="{"spm":"1001.2101.3001.7365"}"><img class="look-more-preCode contentImg-no-view" src="https://1000bd.com/contentImg/2022/06/27/191644837.png" alt="" title=""></span></div><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li><li style="color: rgb(153, 153, 153);">2</li><li style="color: rgb(153, 153, 153);">3</li><li style="color: rgb(153, 153, 153);">4</li><li style="color: rgb(153, 153, 153);">5</li><li style="color: rgb(153, 153, 153);">6</li><li style="color: rgb(153, 153, 153);">7</li><li style="color: rgb(153, 153, 153);">8</li><li style="color: rgb(153, 153, 153);">9</li><li style="color: rgb(153, 153, 153);">10</li><li style="color: rgb(153, 153, 153);">11</li><li style="color: rgb(153, 153, 153);">12</li><li style="color: rgb(153, 153, 153);">13</li><li style="color: rgb(153, 153, 153);">14</li><li style="color: rgb(153, 153, 153);">15</li><li style="color: rgb(153, 153, 153);">16</li><li style="color: rgb(153, 153, 153);">17</li><li style="color: rgb(153, 153, 153);">18</li><li style="color: rgb(153, 153, 153);">19</li><li style="color: rgb(153, 153, 153);">20</li><li style="color: rgb(153, 153, 153);">21</li><li style="color: rgb(153, 153, 153);">22</li></ul></pre> <p><strong>如何提取正文?</strong></p> <p>html文件中含有大量的标签,不仅有单标签,还有双标签。标签都是使用一堆尖括号<>组成,提取正文的本质也就是将这些标签去除。</p> <p>使用状态机去标签是一个很好的选择。<br> 状态机的原理:<br> <img src="https://1000bd.com/contentImg/2022/08/19/103622567.png" alt="在这里插入图片描述"></p> <pre data-index="6" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token keyword">static</span> <span class="token keyword">bool</span> <span class="token function">ParseContent</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>file<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">*</span>content<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token comment">//去标签,基于一个简单的状态机</span> <span class="token keyword">enum</span> <span class="token class-name">status</span> <span class="token punctuation">{<!-- --></span> LABLE<span class="token punctuation">,</span> CONTENT <span class="token punctuation">}</span><span class="token punctuation">;</span> <span class="token keyword">enum</span> <span class="token class-name">status</span> s <span class="token operator">=</span> LABLE<span class="token punctuation">;</span> <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">char</span> c <span class="token operator">:</span> file<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">switch</span> <span class="token punctuation">(</span>s<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">case</span> LABLE<span class="token operator">:</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>c <span class="token operator">==</span> <span class="token char">'>'</span><span class="token punctuation">)</span> s <span class="token operator">=</span> CONTENT<span class="token punctuation">;</span> <span class="token keyword">break</span><span class="token punctuation">;</span> <span class="token keyword">case</span> CONTENT<span class="token operator">:</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>c <span class="token operator">==</span> <span class="token char">'<'</span><span class="token punctuation">)</span> s <span class="token operator">=</span> LABLE<span class="token punctuation">;</span> <span class="token keyword">else</span> <span class="token punctuation">{<!-- --></span> <span class="token comment">//不保留原始文件中的\n,因为需要用\n作为html解析之后文本的分隔符</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>c <span class="token operator">==</span> <span class="token char">'\n'</span><span class="token punctuation">)</span> c <span class="token operator">=</span> <span class="token char">' '</span><span class="token punctuation">;</span> content<span class="token operator">-></span><span class="token function">push_back</span><span class="token punctuation">(</span>c<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token keyword">break</span><span class="token punctuation">;</span> <span class="token keyword">default</span><span class="token operator">:</span> <span class="token keyword">break</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span> <span class="token keyword">return</span> <span class="token boolean">true</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <div class="hljs-button signin" data-title="登录后复制" data-report-click="{"spm":"1001.2101.3001.4334"}"></div></code><div class="hide-preCode-box"><span class="hide-preCode-bt" data-report-view="{"spm":"1001.2101.3001.7365"}"><img class="look-more-preCode contentImg-no-view" src="https://1000bd.com/contentImg/2022/06/27/191644837.png" alt="" title=""></span></div><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li><li style="color: rgb(153, 153, 153);">2</li><li style="color: rgb(153, 153, 153);">3</li><li style="color: rgb(153, 153, 153);">4</li><li style="color: rgb(153, 153, 153);">5</li><li style="color: rgb(153, 153, 153);">6</li><li style="color: rgb(153, 153, 153);">7</li><li style="color: rgb(153, 153, 153);">8</li><li style="color: rgb(153, 153, 153);">9</li><li style="color: rgb(153, 153, 153);">10</li><li style="color: rgb(153, 153, 153);">11</li><li style="color: rgb(153, 153, 153);">12</li><li style="color: rgb(153, 153, 153);">13</li><li style="color: rgb(153, 153, 153);">14</li><li style="color: rgb(153, 153, 153);">15</li><li style="color: rgb(153, 153, 153);">16</li><li style="color: rgb(153, 153, 153);">17</li><li style="color: rgb(153, 153, 153);">18</li><li style="color: rgb(153, 153, 153);">19</li><li style="color: rgb(153, 153, 153);">20</li><li style="color: rgb(153, 153, 153);">21</li><li style="color: rgb(153, 153, 153);">22</li><li style="color: rgb(153, 153, 153);">23</li><li style="color: rgb(153, 153, 153);">24</li><li style="color: rgb(153, 153, 153);">25</li><li style="color: rgb(153, 153, 153);">26</li><li style="color: rgb(153, 153, 153);">27</li><li style="color: rgb(153, 153, 153);">28</li><li style="color: rgb(153, 153, 153);">29</li><li style="color: rgb(153, 153, 153);">30</li><li style="color: rgb(153, 153, 153);">31</li><li style="color: rgb(153, 153, 153);">32</li><li style="color: rgb(153, 153, 153);">33</li><li style="color: rgb(153, 153, 153);">34</li></ul></pre> <p><strong>如何构建该网页的url链接?</strong></p> <p>我们获取的网页最终都是在https://www.boost.org/doc/libs/1_79_0/doc/html链接下的,这就是url链接的头部。<br> 而我们已经获取了所有的html文件路径,因为所有的html文件在data/input目录下保存,所有所有的文件路径都是以/data/input为开头,只需要去掉/data/input,将剩下的部分作为尾部,然后将头部和尾部拼接在一起,就能获取完整的url链接。<br> 例如:</p> <blockquote> <p>https://www.boost.org/doc/libs/1_79_0/doc/html 头部<br> /accumulators.html 尾部<br> https://www.boost.org/doc/libs/1_79_0/doc/html/accumulators.html 头部+尾部=完整的url链接</p> </blockquote> <pre data-index="7" class="prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token keyword">static</span> <span class="token keyword">bool</span> <span class="token function">ParseUrl</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>file_path<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">*</span>url<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token comment">//链接头部</span> std<span class="token double-colon punctuation">::</span>string url_head <span class="token operator">=</span> <span class="token string">"https://www.boost.org/doc/libs/1_79_0/doc/html"</span><span class="token punctuation">;</span> std<span class="token double-colon punctuation">::</span>string url_tail <span class="token operator">=</span> file_path<span class="token punctuation">.</span><span class="token function">substr</span><span class="token punctuation">(</span>src_path<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token operator">*</span>url <span class="token operator">=</span> url_head <span class="token operator">+</span> url_tail<span class="token punctuation">;</span> <span class="token keyword">return</span> <span class="token boolean">true</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <div class="hljs-button signin" data-title="登录后复制" data-report-click="{"spm":"1001.2101.3001.4334"}"></div></code><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li><li style="color: rgb(153, 153, 153);">2</li><li style="color: rgb(153, 153, 153);">3</li><li style="color: rgb(153, 153, 153);">4</li><li style="color: rgb(153, 153, 153);">5</li><li style="color: rgb(153, 153, 153);">6</li><li style="color: rgb(153, 153, 153);">7</li><li style="color: rgb(153, 153, 153);">8</li></ul></pre> <p>至此,我们已经完成了解析html文件的所有功能。不过,需要注意的是,<strong>如果以上四步中,哪一步出错,都将舍弃这个html文件,然后再解析下一个html文件</strong></p> <h3><a name="t13"></a><a id="534__html_442"></a>5.3.4 保存已经解析的html文件</h3> <p>通过解析html文件,我们已经提取到了标题、正文、url链接。然后以<strong>标题\3正文\3url链接\n</strong>的格式进行存储,至于为什么用\3作为分割符,前面已经说过,这里不多赘述。<br> 将所有的html文件都存放在data/raw_html/raw.txt中,<strong>各个html网页之间用\n作为分割符</strong>,方便后续操作</p> <pre data-index="8" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token keyword">bool</span> <span class="token function">SaveHtml</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>DocInfo_t<span class="token operator">></span> <span class="token operator">&</span>results<span class="token punctuation">,</span> <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>output<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">define</span> <span class="token macro-name">SEP</span> <span class="token char">'\3'</span></span> <span class="token comment">//按二进制方式写入</span> std<span class="token double-colon punctuation">::</span>ofstream <span class="token function">out</span><span class="token punctuation">(</span>output<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>ios<span class="token double-colon punctuation">::</span>out <span class="token operator">|</span> std<span class="token double-colon punctuation">::</span>ios<span class="token double-colon punctuation">::</span>binary<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token operator">!</span>out<span class="token punctuation">.</span><span class="token function">is_open</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> std<span class="token double-colon punctuation">::</span>cerr<span class="token operator"><<</span><span class="token string">"open "</span><span class="token operator"><<</span>output<span class="token operator"><<</span><span class="token string">"failed!"</span><span class="token operator"><<</span>std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span> <span class="token keyword">return</span> <span class="token boolean">false</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">//可以进行文内容的写入了</span> <span class="token comment">//写入格式 title\3content\3url\n</span> <span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">auto</span><span class="token operator">&</span> item<span class="token operator">:</span>results<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> std<span class="token double-colon punctuation">::</span>string out_string<span class="token punctuation">;</span> out_string <span class="token operator">+=</span> item<span class="token punctuation">.</span>title<span class="token punctuation">;</span> out_string <span class="token operator">+=</span> SEP<span class="token punctuation">;</span> out_string <span class="token operator">+=</span> item<span class="token punctuation">.</span>content<span class="token punctuation">;</span> out_string <span class="token operator">+=</span> SEP<span class="token punctuation">;</span> out_string <span class="token operator">+=</span> item<span class="token punctuation">.</span>url<span class="token punctuation">;</span> out_string <span class="token operator">+=</span> <span class="token char">'\n'</span><span class="token punctuation">;</span> <span class="token comment">//写入内容</span> out<span class="token punctuation">.</span><span class="token function">write</span><span class="token punctuation">(</span>out_string<span class="token punctuation">.</span><span class="token function">c_str</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> out_string<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> out<span class="token punctuation">.</span><span class="token function">close</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">return</span> <span class="token boolean">true</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <div class="hljs-button signin" data-title="登录后复制" data-report-click="{"spm":"1001.2101.3001.4334"}"></div></code><div class="hide-preCode-box"><span class="hide-preCode-bt" data-report-view="{"spm":"1001.2101.3001.7365"}"><img class="look-more-preCode contentImg-no-view" src="https://1000bd.com/contentImg/2022/06/27/191644837.png" alt="" title=""></span></div><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li><li style="color: rgb(153, 153, 153);">2</li><li style="color: rgb(153, 153, 153);">3</li><li style="color: rgb(153, 153, 153);">4</li><li style="color: rgb(153, 153, 153);">5</li><li style="color: rgb(153, 153, 153);">6</li><li style="color: rgb(153, 153, 153);">7</li><li style="color: rgb(153, 153, 153);">8</li><li style="color: rgb(153, 153, 153);">9</li><li style="color: rgb(153, 153, 153);">10</li><li style="color: rgb(153, 153, 153);">11</li><li style="color: rgb(153, 153, 153);">12</li><li style="color: rgb(153, 153, 153);">13</li><li style="color: rgb(153, 153, 153);">14</li><li style="color: rgb(153, 153, 153);">15</li><li style="color: rgb(153, 153, 153);">16</li><li style="color: rgb(153, 153, 153);">17</li><li style="color: rgb(153, 153, 153);">18</li><li style="color: rgb(153, 153, 153);">19</li><li style="color: rgb(153, 153, 153);">20</li><li style="color: rgb(153, 153, 153);">21</li><li style="color: rgb(153, 153, 153);">22</li><li style="color: rgb(153, 153, 153);">23</li><li style="color: rgb(153, 153, 153);">24</li><li style="color: rgb(153, 153, 153);">25</li><li style="color: rgb(153, 153, 153);">26</li><li style="color: rgb(153, 153, 153);">27</li><li style="color: rgb(153, 153, 153);">28</li><li style="color: rgb(153, 153, 153);">29</li></ul></pre> <h1><a name="t14"></a><a id="6Index_477"></a>6、编写建立索引的模块Index</h1> <h2><a name="t15"></a><a id="61___478"></a>6.1 整体框架</h2> <p>正排索引是根据文档ID映射文档内容,文档内容又包括标题、正文和url链接,所以我们需要构建一个结构体来维护ID到文档内容的映射关系。</p> <pre data-index="9" class="prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token keyword">struct</span> <span class="token class-name">DocInfo</span> <span class="token punctuation">{<!-- --></span> std<span class="token double-colon punctuation">::</span>string title<span class="token punctuation">;</span> <span class="token comment">//文档标题</span> std<span class="token double-colon punctuation">::</span>string content<span class="token punctuation">;</span> <span class="token comment">//文档去标签之后所对应的内容</span> std<span class="token double-colon punctuation">::</span>string url<span class="token punctuation">;</span> <span class="token comment">//官网文档url</span> <span class="token keyword">uint64_t</span> doc_id<span class="token punctuation">;</span> <span class="token comment">//文档ID</span> <span class="token punctuation">}</span><span class="token punctuation">;</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>DocInfo<span class="token operator">></span> forward_index<span class="token punctuation">;</span> <div class="hljs-button signin" data-title="登录后复制" data-report-click="{"spm":"1001.2101.3001.4334"}"></div></code><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li><li style="color: rgb(153, 153, 153);">2</li><li style="color: rgb(153, 153, 153);">3</li><li style="color: rgb(153, 153, 153);">4</li><li style="color: rgb(153, 153, 153);">5</li><li style="color: rgb(153, 153, 153);">6</li><li style="color: rgb(153, 153, 153);">7</li><li style="color: rgb(153, 153, 153);">8</li></ul></pre> <p><strong>正排索引的数据结构用数组,因为数组的下标是天然的文档ID</strong></p> <p>倒排索引是由一个关键词映射多个文档。此时的文档并不是采用正排索引的文档(struct DocInfo)。考虑到数据冗余、根据权重排序的问题,需要定义一个包含文档ID、关键词和权重的结构。<br> 关键词映射的多个文档组合在一起,我们称之为倒排拉链(由vector封装而成)。因为关键词具有唯一性,所以采用unordered_map来维护关键词和倒排拉链映射关系。</p> <pre data-index="10" class="prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token keyword">struct</span> <span class="token class-name">InvertedElem</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">int</span> doc_id<span class="token punctuation">;</span> <span class="token comment">//文档ID</span> std<span class="token double-colon punctuation">::</span>string word<span class="token punctuation">;</span> <span class="token comment">//关键词</span> <span class="token keyword">int</span> weight<span class="token punctuation">;</span> <span class="token comment">//权重,用来排序</span> <span class="token punctuation">}</span><span class="token punctuation">;</span> <span class="token keyword">typedef</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>InvertedElem<span class="token operator">></span> InvertedList<span class="token punctuation">;</span> std<span class="token double-colon punctuation">::</span>unordered_map<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token punctuation">,</span> InvertedList<span class="token operator">></span> inverted_index<span class="token punctuation">;</span> <div class="hljs-button signin" data-title="登录后复制" data-report-click="{"spm":"1001.2101.3001.4334"}"></div></code><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li><li style="color: rgb(153, 153, 153);">2</li><li style="color: rgb(153, 153, 153);">3</li><li style="color: rgb(153, 153, 153);">4</li><li style="color: rgb(153, 153, 153);">5</li><li style="color: rgb(153, 153, 153);">6</li><li style="color: rgb(153, 153, 153);">7</li><li style="color: rgb(153, 153, 153);">8</li></ul></pre> <p><strong>有了正排索引和倒排索引,如何通过关键词获取文档?</strong><br> 通过关键词,经过unordered_map获取映射的倒排拉链,倒排拉链中有多个<strong>struct InvertedElem</strong>,每个都包含了文档ID。因为文档ID是<strong>vector forward_index</strong>的下标,所以就能获取文档</p> <p><strong>整体框架:</strong></p> <pre data-index="11" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token keyword">namespace</span> ns_index <span class="token punctuation">{<!-- --></span> <span class="token keyword">struct</span> <span class="token class-name">DocInfo</span> <span class="token punctuation">{<!-- --></span> std<span class="token double-colon punctuation">::</span>string title<span class="token punctuation">;</span> <span class="token comment">//文档标题</span> std<span class="token double-colon punctuation">::</span>string content<span class="token punctuation">;</span> <span class="token comment">//文档去标签之后所对应的内容</span> std<span class="token double-colon punctuation">::</span>string url<span class="token punctuation">;</span> <span class="token comment">//官网文档url</span> <span class="token keyword">uint64_t</span> doc_id<span class="token punctuation">;</span> <span class="token comment">//文档ID</span> <span class="token punctuation">}</span><span class="token punctuation">;</span> <span class="token keyword">struct</span> <span class="token class-name">InvertedElem</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">int</span> doc_id<span class="token punctuation">;</span> <span class="token comment">//文档ID</span> std<span class="token double-colon punctuation">::</span>string word<span class="token punctuation">;</span> <span class="token comment">//关键词</span> <span class="token keyword">int</span> weight<span class="token punctuation">;</span> <span class="token comment">//权重,用来排序</span> <span class="token punctuation">}</span><span class="token punctuation">;</span> <span class="token comment">//倒排拉链</span> <span class="token keyword">typedef</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>InvertedElem<span class="token operator">></span> InvertedList<span class="token punctuation">;</span> <span class="token keyword">class</span> <span class="token class-name">Index</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">private</span><span class="token operator">:</span> <span class="token comment">//正排索引的数据结构用数组,因为数组的下标是天然的文档ID</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>DocInfo<span class="token operator">></span> forward_index<span class="token punctuation">;</span> <span class="token comment">//倒排索引一定是一个关键词和一组(多个)InvertedElem相对应(关键词和倒排拉链的映射关系)</span> std<span class="token double-colon punctuation">::</span>unordered_map<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token punctuation">,</span> InvertedList<span class="token operator">></span> inverted_index<span class="token punctuation">;</span> <span class="token keyword">public</span><span class="token operator">:</span> <span class="token function">Index</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span><span class="token punctuation">}</span> <span class="token operator">~</span><span class="token function">Index</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span><span class="token punctuation">}</span> <span class="token comment">//根据doc_id找到文档内容</span> DocInfo <span class="token operator">*</span><span class="token function">GetForwardIndex</span><span class="token punctuation">(</span><span class="token keyword">uint64_t</span> doc_id<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">return</span> <span class="token keyword">nullptr</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">//根据关键词string获取倒排拉链</span> InvertedList <span class="token operator">*</span><span class="token function">GetInvertedList</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>word<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">return</span> <span class="token keyword">nullptr</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">//根据去标签,格式化之后的文档,构建正排索引和倒排索引</span> <span class="token comment">// data/raw_html/raw.txt</span> <span class="token keyword">bool</span> <span class="token function">BuildIndex</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>input<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">return</span> <span class="token boolean">true</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span> <div class="hljs-button signin" data-title="登录后复制" data-report-click="{"spm":"1001.2101.3001.4334"}"></div></code><div class="hide-preCode-box"><span class="hide-preCode-bt" data-report-view="{"spm":"1001.2101.3001.7365"}"><img class="look-more-preCode contentImg-no-view" src="https://1000bd.com/contentImg/2022/06/27/191644837.png" alt="" title=""></span></div><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li><li style="color: rgb(153, 153, 153);">2</li><li style="color: rgb(153, 153, 153);">3</li><li style="color: rgb(153, 153, 153);">4</li><li style="color: rgb(153, 153, 153);">5</li><li style="color: rgb(153, 153, 153);">6</li><li style="color: rgb(153, 153, 153);">7</li><li style="color: rgb(153, 153, 153);">8</li><li style="color: rgb(153, 153, 153);">9</li><li style="color: rgb(153, 153, 153);">10</li><li style="color: rgb(153, 153, 153);">11</li><li style="color: rgb(153, 153, 153);">12</li><li style="color: rgb(153, 153, 153);">13</li><li style="color: rgb(153, 153, 153);">14</li><li style="color: rgb(153, 153, 153);">15</li><li style="color: rgb(153, 153, 153);">16</li><li style="color: rgb(153, 153, 153);">17</li><li style="color: rgb(153, 153, 153);">18</li><li style="color: rgb(153, 153, 153);">19</li><li style="color: rgb(153, 153, 153);">20</li><li style="color: rgb(153, 153, 153);">21</li><li style="color: rgb(153, 153, 153);">22</li><li style="color: rgb(153, 153, 153);">23</li><li style="color: rgb(153, 153, 153);">24</li><li style="color: rgb(153, 153, 153);">25</li><li style="color: rgb(153, 153, 153);">26</li><li style="color: rgb(153, 153, 153);">27</li><li style="color: rgb(153, 153, 153);">28</li><li style="color: rgb(153, 153, 153);">29</li><li style="color: rgb(153, 153, 153);">30</li><li style="color: rgb(153, 153, 153);">31</li><li style="color: rgb(153, 153, 153);">32</li><li style="color: rgb(153, 153, 153);">33</li><li style="color: rgb(153, 153, 153);">34</li><li style="color: rgb(153, 153, 153);">35</li><li style="color: rgb(153, 153, 153);">36</li><li style="color: rgb(153, 153, 153);">37</li><li style="color: rgb(153, 153, 153);">38</li><li style="color: rgb(153, 153, 153);">39</li><li style="color: rgb(153, 153, 153);">40</li><li style="color: rgb(153, 153, 153);">41</li><li style="color: rgb(153, 153, 153);">42</li><li style="color: rgb(153, 153, 153);">43</li><li style="color: rgb(153, 153, 153);">44</li><li style="color: rgb(153, 153, 153);">45</li><li style="color: rgb(153, 153, 153);">46</li><li style="color: rgb(153, 153, 153);">47</li><li style="color: rgb(153, 153, 153);">48</li><li style="color: rgb(153, 153, 153);">49</li><li style="color: rgb(153, 153, 153);">50</li></ul></pre> <h2><a name="t16"></a><a id="62_BuildIndex_564"></a>6.2 BuildIndex的编写</h2> <p>思路:首先打开经过清洗的html文件,然后一行一行的读取,每读取一行,就表示读取了一个html网页,因为在存在html网页时,是以\n作为分割符的。读取到的html文件包括文档ID和html内容,根据ID可以建立正排索引,根据html内容建立倒排索引</p> <pre data-index="12" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token comment">//根据去标签,格式化之后的文档,构建正排索引和倒排索引</span> <span class="token comment">// data/raw_html/raw.txt</span> <span class="token keyword">bool</span> <span class="token function">BuildIndex</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>input<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> std<span class="token double-colon punctuation">::</span>ifstream <span class="token function">in</span><span class="token punctuation">(</span>input<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>ios<span class="token double-colon punctuation">::</span>in <span class="token operator">|</span> std<span class="token double-colon punctuation">::</span>ios<span class="token double-colon punctuation">::</span>binary<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token operator">!</span>in<span class="token punctuation">.</span><span class="token function">is_open</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token function">LOG</span><span class="token punctuation">(</span>FATAL<span class="token punctuation">,</span> input <span class="token operator">+</span> <span class="token string">"open error"</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// std::cerr << "sorry " << input << "open error" << std::endl;</span> <span class="token keyword">return</span> <span class="token boolean">false</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> std<span class="token double-colon punctuation">::</span>string line<span class="token punctuation">;</span> <span class="token keyword">int</span> count <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> <span class="token keyword">while</span> <span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span><span class="token function">getline</span><span class="token punctuation">(</span>in<span class="token punctuation">,</span> line<span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token comment">//建立正排索引</span> DocInfo <span class="token operator">*</span>doc <span class="token operator">=</span> <span class="token function">BuildForwardIndex</span><span class="token punctuation">(</span>line<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>doc <span class="token operator">==</span> <span class="token keyword">nullptr</span><span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> std<span class="token double-colon punctuation">::</span>cerr <span class="token operator"><<</span> <span class="token string">"bulid: "</span> <span class="token operator"><<</span> line <span class="token operator"><<</span> <span class="token string">" error"</span> <span class="token operator"><<</span> std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span> <span class="token comment">// fro debug</span> <span class="token keyword">continue</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">//建立倒排索引</span> <span class="token function">BuildInvertedIndex</span><span class="token punctuation">(</span><span class="token operator">*</span>doc<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token keyword">return</span> <span class="token boolean">true</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <div class="hljs-button signin" data-title="登录后复制" data-report-click="{"spm":"1001.2101.3001.4334"}"></div></code><div class="hide-preCode-box"><span class="hide-preCode-bt" data-report-view="{"spm":"1001.2101.3001.7365"}"><img class="look-more-preCode contentImg-no-view" src="https://1000bd.com/contentImg/2022/06/27/191644837.png" alt="" title=""></span></div><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li><li style="color: rgb(153, 153, 153);">2</li><li style="color: rgb(153, 153, 153);">3</li><li style="color: rgb(153, 153, 153);">4</li><li style="color: rgb(153, 153, 153);">5</li><li style="color: rgb(153, 153, 153);">6</li><li style="color: rgb(153, 153, 153);">7</li><li style="color: rgb(153, 153, 153);">8</li><li style="color: rgb(153, 153, 153);">9</li><li style="color: rgb(153, 153, 153);">10</li><li style="color: rgb(153, 153, 153);">11</li><li style="color: rgb(153, 153, 153);">12</li><li style="color: rgb(153, 153, 153);">13</li><li style="color: rgb(153, 153, 153);">14</li><li style="color: rgb(153, 153, 153);">15</li><li style="color: rgb(153, 153, 153);">16</li><li style="color: rgb(153, 153, 153);">17</li><li style="color: rgb(153, 153, 153);">18</li><li style="color: rgb(153, 153, 153);">19</li><li style="color: rgb(153, 153, 153);">20</li><li style="color: rgb(153, 153, 153);">21</li><li style="color: rgb(153, 153, 153);">22</li><li style="color: rgb(153, 153, 153);">23</li><li style="color: rgb(153, 153, 153);">24</li><li style="color: rgb(153, 153, 153);">25</li><li style="color: rgb(153, 153, 153);">26</li><li style="color: rgb(153, 153, 153);">27</li><li style="color: rgb(153, 153, 153);">28</li></ul></pre> <h3><a name="t17"></a><a id="621___598"></a>6.2.1 建立正排索引</h3> <p>通过前面的步骤,已经成功提取到了html网页内容,而内容中的标题、正文和url链接通过’\3’作为分隔符。因此,我们需要用Boost库中的split函数将其解析出来,从而形成DocInfo,再插入forwardindex中,返回DocInfo的地址。<br> ns_util::StringUtil::Split是对Boost库中split的封装</p> <pre data-index="13" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;">DocInfo <span class="token operator">*</span><span class="token function">BuildForwardIndex</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>line<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token comment">// 1.解析line,字符串切分</span> <span class="token comment">// line->title content url</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> results<span class="token punctuation">;</span> <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string sep <span class="token operator">=</span> <span class="token string">"\3"</span><span class="token punctuation">;</span> <span class="token comment">//行内分隔符</span> ns_util<span class="token double-colon punctuation">::</span><span class="token class-name">StringUtil</span><span class="token double-colon punctuation">::</span><span class="token function">Split</span><span class="token punctuation">(</span>line<span class="token punctuation">,</span> <span class="token operator">&</span>results<span class="token punctuation">,</span> sep<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">//说明不包含完整的title content url</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>results<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">!=</span> <span class="token number">3</span><span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">return</span> <span class="token keyword">nullptr</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">// 2.字符串进行填充到DocInfo</span> DocInfo doc<span class="token punctuation">;</span> doc<span class="token punctuation">.</span>title <span class="token operator">=</span> results<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">;</span> doc<span class="token punctuation">.</span>content <span class="token operator">=</span> results<span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">;</span> doc<span class="token punctuation">.</span>url <span class="token operator">=</span> results<span class="token punctuation">[</span><span class="token number">2</span><span class="token punctuation">]</span><span class="token punctuation">;</span> doc<span class="token punctuation">.</span>doc_id <span class="token operator">=</span> forward_index<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">//在没插入之前,forward_index.size()就是doc_id(下标)</span> <span class="token comment">// 3.插入到正排索引的vector中</span> forward_index<span class="token punctuation">.</span><span class="token function">push_back</span><span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span><span class="token function">move</span><span class="token punctuation">(</span>doc<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// move提高效率</span> <span class="token keyword">return</span> <span class="token operator">&</span>forward_index<span class="token punctuation">.</span><span class="token function">back</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token keyword">class</span> <span class="token class-name">StringUtil</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">public</span><span class="token operator">:</span> <span class="token keyword">static</span> <span class="token keyword">void</span> <span class="token function">Split</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>target<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> <span class="token operator">*</span>out<span class="token punctuation">,</span> <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>sep<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token comment">// boost split</span> <span class="token comment">//第一个参数:存放结果的地方</span> <span class="token comment">//第二个参数:需要进行分割的数据源</span> <span class="token comment">//第三个参数:分隔符</span> <span class="token comment">//第四个参数:是否对进行压缩 例如:fff\3\3\3lll不压缩->fff lll fff\3\3\3lll压缩->ffflll</span> boost<span class="token double-colon punctuation">::</span><span class="token function">split</span><span class="token punctuation">(</span><span class="token operator">*</span>out<span class="token punctuation">,</span> target<span class="token punctuation">,</span> boost<span class="token double-colon punctuation">::</span><span class="token function">is_any_of</span><span class="token punctuation">(</span>sep<span class="token punctuation">)</span><span class="token punctuation">,</span> boost<span class="token double-colon punctuation">::</span>token_compress_on <span class="token comment">/*需要压缩*/</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span><span class="token punctuation">;</span> <div class="hljs-button signin" data-title="登录后复制" data-report-click="{"spm":"1001.2101.3001.4334"}"></div></code><div class="hide-preCode-box"><span class="hide-preCode-bt" data-report-view="{"spm":"1001.2101.3001.7365"}"><img class="look-more-preCode contentImg-no-view" src="https://1000bd.com/contentImg/2022/06/27/191644837.png" alt="" title=""></span></div><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li><li style="color: rgb(153, 153, 153);">2</li><li style="color: rgb(153, 153, 153);">3</li><li style="color: rgb(153, 153, 153);">4</li><li style="color: rgb(153, 153, 153);">5</li><li style="color: rgb(153, 153, 153);">6</li><li style="color: rgb(153, 153, 153);">7</li><li style="color: rgb(153, 153, 153);">8</li><li style="color: rgb(153, 153, 153);">9</li><li style="color: rgb(153, 153, 153);">10</li><li style="color: rgb(153, 153, 153);">11</li><li style="color: rgb(153, 153, 153);">12</li><li style="color: rgb(153, 153, 153);">13</li><li style="color: rgb(153, 153, 153);">14</li><li style="color: rgb(153, 153, 153);">15</li><li style="color: rgb(153, 153, 153);">16</li><li style="color: rgb(153, 153, 153);">17</li><li style="color: rgb(153, 153, 153);">18</li><li style="color: rgb(153, 153, 153);">19</li><li style="color: rgb(153, 153, 153);">20</li><li style="color: rgb(153, 153, 153);">21</li><li style="color: rgb(153, 153, 153);">22</li><li style="color: rgb(153, 153, 153);">23</li><li style="color: rgb(153, 153, 153);">24</li><li style="color: rgb(153, 153, 153);">25</li><li style="color: rgb(153, 153, 153);">26</li><li style="color: rgb(153, 153, 153);">27</li><li style="color: rgb(153, 153, 153);">28</li><li style="color: rgb(153, 153, 153);">29</li><li style="color: rgb(153, 153, 153);">30</li><li style="color: rgb(153, 153, 153);">31</li><li style="color: rgb(153, 153, 153);">32</li><li style="color: rgb(153, 153, 153);">33</li><li style="color: rgb(153, 153, 153);">34</li><li style="color: rgb(153, 153, 153);">35</li><li style="color: rgb(153, 153, 153);">36</li><li style="color: rgb(153, 153, 153);">37</li></ul></pre> <p>以上就是建立正排索引的全部过程</p> <h3><a name="t18"></a><a id="622___642"></a>6.2.2 建立倒排索引</h3> <p>因为搜索出的文档最终需要进行排序,那么就需要用到权重值。而关键词在标题和正文出现的不同,所以需要一下结构来记录关键词分别在标题和正文中出现的次数</p> <pre data-index="14" class="prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token keyword">struct</span> <span class="token class-name">word_cnt</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">int</span> title_cnt<span class="token punctuation">;</span> <span class="token comment">//关键词在标题出现的次数</span> <span class="token keyword">int</span> content_cnt<span class="token punctuation">;</span> <span class="token comment">//关键词在正文出现的次数</span> <span class="token function">word_cnt</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">:</span> <span class="token function">title_cnt</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">content_cnt</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span><span class="token punctuation">}</span> <span class="token punctuation">}</span><span class="token punctuation">;</span> <div class="hljs-button signin" data-title="登录后复制" data-report-click="{"spm":"1001.2101.3001.4334"}"></div></code><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li><li style="color: rgb(153, 153, 153);">2</li><li style="color: rgb(153, 153, 153);">3</li><li style="color: rgb(153, 153, 153);">4</li><li style="color: rgb(153, 153, 153);">5</li><li style="color: rgb(153, 153, 153);">6</li></ul></pre> <p>不仅如此,我们还需要一个结构用来保存关键词和struct word_cnt的映射关系,所以使用unordered_map</p> <pre data-index="15" class="prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;">std<span class="token double-colon punctuation">::</span>unordered_map<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token punctuation">,</span> word_cnt<span class="token operator">></span> word_map<span class="token punctuation">;</span> <div class="hljs-button signin" data-title="登录后复制" data-report-click="{"spm":"1001.2101.3001.4334"}"></div></code><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li></ul></pre> <p>有了以上结构作为基础,我们就能对关键词进行词频统计<br> 首先需要对标题和正文进行分词,分词的结果放在不同的vector<string>,通过遍历的方式,增加word_map中的struct word_cnt中的title_cnt和 content_cnt。<br> 为了不区分大小写,统一转为小写处理。</p> <p>这样就获得了完成的 word_map(用来暂存词频的映射表)。我们已经有了文档ID、关键词和词频记录,就能构建一个struct InvertedElem,然后插入到对应的倒排拉链中,至此,整个建立倒排拉链的过程就已经完成。</p> <pre data-index="16" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token keyword">bool</span> <span class="token function">BuildInvertedIndex</span><span class="token punctuation">(</span><span class="token keyword">const</span> DocInfo <span class="token operator">&</span>doc<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token comment">// DocInfo{title, content, url, doc_id}</span> <span class="token comment">// word->倒排拉链</span> <span class="token keyword">struct</span> <span class="token class-name">word_cnt</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">int</span> title_cnt<span class="token punctuation">;</span> <span class="token comment">//关键词在标题出现的次数</span> <span class="token keyword">int</span> content_cnt<span class="token punctuation">;</span> <span class="token comment">//关键词在正文出现的次数</span> <span class="token function">word_cnt</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">:</span> <span class="token function">title_cnt</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">content_cnt</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span><span class="token punctuation">}</span> <span class="token punctuation">}</span><span class="token punctuation">;</span> std<span class="token double-colon punctuation">::</span>unordered_map<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token punctuation">,</span> word_cnt<span class="token operator">></span> word_map<span class="token punctuation">;</span> <span class="token comment">//用来暂存词频的映射表</span> <span class="token comment">//对标题进行分词,结果放在title_words中</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> title_words<span class="token punctuation">;</span> ns_util<span class="token double-colon punctuation">::</span><span class="token class-name">JiebaUtil</span><span class="token double-colon punctuation">::</span><span class="token function">CutString</span><span class="token punctuation">(</span>doc<span class="token punctuation">.</span>title<span class="token punctuation">,</span> <span class="token operator">&</span>title_words<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">//对标题进行词频统计</span> <span class="token keyword">for</span> <span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span>string s <span class="token operator">:</span> title_words<span class="token punctuation">)</span> <span class="token comment">//这里不用&s的原因是将s转换为小写时,不需要改变原始数据</span> <span class="token punctuation">{<!-- --></span> <span class="token comment">//进行词频统计时,将单词全都转换为小写,因为搜索不区分大小写</span> boost<span class="token double-colon punctuation">::</span><span class="token function">to_lower</span><span class="token punctuation">(</span>s<span class="token punctuation">)</span><span class="token punctuation">;</span> word_map<span class="token punctuation">[</span>s<span class="token punctuation">]</span><span class="token punctuation">.</span>title_cnt<span class="token operator">++</span><span class="token punctuation">;</span> <span class="token comment">//如果word_map中存在s,就返回value的引用,否则就创建</span> <span class="token punctuation">}</span> <span class="token comment">//对文档内容进行分词,结果放在content_words中</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> content_words<span class="token punctuation">;</span> ns_util<span class="token double-colon punctuation">::</span><span class="token class-name">JiebaUtil</span><span class="token double-colon punctuation">::</span><span class="token function">CutString</span><span class="token punctuation">(</span>doc<span class="token punctuation">.</span>content<span class="token punctuation">,</span> <span class="token operator">&</span>content_words<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">//对文档内容进行词频统计</span> <span class="token keyword">for</span> <span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span>string s <span class="token operator">:</span> content_words<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> boost<span class="token double-colon punctuation">::</span><span class="token function">to_lower</span><span class="token punctuation">(</span>s<span class="token punctuation">)</span><span class="token punctuation">;</span> word_map<span class="token punctuation">[</span>s<span class="token punctuation">]</span><span class="token punctuation">.</span>content_cnt<span class="token operator">++</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">define</span> <span class="token macro-name">X</span> <span class="token expression"><span class="token number">10</span></span></span> <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">define</span> <span class="token macro-name">Y</span> <span class="token expression"><span class="token number">1</span></span></span> <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">auto</span> <span class="token operator">&</span>word_pair <span class="token operator">:</span> word_map<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> InvertedElem item<span class="token punctuation">;</span> item<span class="token punctuation">.</span>doc_id <span class="token operator">=</span> doc<span class="token punctuation">.</span>doc_id<span class="token punctuation">;</span> item<span class="token punctuation">.</span>word <span class="token operator">=</span> word_pair<span class="token punctuation">.</span>first<span class="token punctuation">;</span> item<span class="token punctuation">.</span>weight <span class="token operator">=</span> X <span class="token operator">*</span> word_pair<span class="token punctuation">.</span>second<span class="token punctuation">.</span>title_cnt <span class="token operator">+</span> Y <span class="token operator">*</span> word_pair<span class="token punctuation">.</span>second<span class="token punctuation">.</span>content_cnt<span class="token punctuation">;</span> <span class="token comment">//相关性(权重)</span> InvertedList <span class="token operator">&</span>inverted_list <span class="token operator">=</span> inverted_index<span class="token punctuation">[</span>word_pair<span class="token punctuation">.</span>first<span class="token punctuation">]</span><span class="token punctuation">;</span> <span class="token comment">//如果不存在就创建,并且返归value的引用。如果存在,直接返回value的引用</span> inverted_list<span class="token punctuation">.</span><span class="token function">push_back</span><span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span><span class="token function">move</span><span class="token punctuation">(</span>item<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">//将item插入到所对应的倒排拉链中</span> <span class="token punctuation">}</span> <span class="token keyword">return</span> <span class="token boolean">true</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <div class="hljs-button signin" data-title="登录后复制" data-report-click="{"spm":"1001.2101.3001.4334"}"></div></code><div class="hide-preCode-box"><span class="hide-preCode-bt" data-report-view="{"spm":"1001.2101.3001.7365"}"><img class="look-more-preCode contentImg-no-view" src="https://1000bd.com/contentImg/2022/06/27/191644837.png" alt="" title=""></span></div><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li><li style="color: rgb(153, 153, 153);">2</li><li style="color: rgb(153, 153, 153);">3</li><li style="color: rgb(153, 153, 153);">4</li><li style="color: rgb(153, 153, 153);">5</li><li style="color: rgb(153, 153, 153);">6</li><li style="color: rgb(153, 153, 153);">7</li><li style="color: rgb(153, 153, 153);">8</li><li style="color: rgb(153, 153, 153);">9</li><li style="color: rgb(153, 153, 153);">10</li><li style="color: rgb(153, 153, 153);">11</li><li style="color: rgb(153, 153, 153);">12</li><li style="color: rgb(153, 153, 153);">13</li><li style="color: rgb(153, 153, 153);">14</li><li style="color: rgb(153, 153, 153);">15</li><li style="color: rgb(153, 153, 153);">16</li><li style="color: rgb(153, 153, 153);">17</li><li style="color: rgb(153, 153, 153);">18</li><li style="color: rgb(153, 153, 153);">19</li><li style="color: rgb(153, 153, 153);">20</li><li style="color: rgb(153, 153, 153);">21</li><li style="color: rgb(153, 153, 153);">22</li><li style="color: rgb(153, 153, 153);">23</li><li style="color: rgb(153, 153, 153);">24</li><li style="color: rgb(153, 153, 153);">25</li><li style="color: rgb(153, 153, 153);">26</li><li style="color: rgb(153, 153, 153);">27</li><li style="color: rgb(153, 153, 153);">28</li><li style="color: rgb(153, 153, 153);">29</li><li style="color: rgb(153, 153, 153);">30</li><li style="color: rgb(153, 153, 153);">31</li><li style="color: rgb(153, 153, 153);">32</li><li style="color: rgb(153, 153, 153);">33</li><li style="color: rgb(153, 153, 153);">34</li><li style="color: rgb(153, 153, 153);">35</li><li style="color: rgb(153, 153, 153);">36</li><li style="color: rgb(153, 153, 153);">37</li><li style="color: rgb(153, 153, 153);">38</li><li style="color: rgb(153, 153, 153);">39</li><li style="color: rgb(153, 153, 153);">40</li><li style="color: rgb(153, 153, 153);">41</li><li style="color: rgb(153, 153, 153);">42</li><li style="color: rgb(153, 153, 153);">43</li><li style="color: rgb(153, 153, 153);">44</li><li style="color: rgb(153, 153, 153);">45</li><li style="color: rgb(153, 153, 153);">46</li><li style="color: rgb(153, 153, 153);">47</li><li style="color: rgb(153, 153, 153);">48</li></ul></pre> <p>因为这个功能使用到了cppjieba库中的分词功能,所以还需要引入cppjieba库</p> <blockquote> <p>// 项目源文档:<br> https://github.com/yanyiwu/cppjieba<br> // 安装命令<br> git clone https://github.com/yanyiwu/cppjieba.git</p> </blockquote> <pre data-index="17" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token keyword">const</span> <span class="token keyword">char</span> <span class="token operator">*</span><span class="token keyword">const</span> DICT_PATH <span class="token operator">=</span> <span class="token string">"./dict/jieba.dict.utf8"</span><span class="token punctuation">;</span> <span class="token keyword">const</span> <span class="token keyword">char</span> <span class="token operator">*</span><span class="token keyword">const</span> HMM_PATH <span class="token operator">=</span> <span class="token string">"./dict/hmm_model.utf8"</span><span class="token punctuation">;</span> <span class="token keyword">const</span> <span class="token keyword">char</span> <span class="token operator">*</span><span class="token keyword">const</span> USER_DICT_PATH <span class="token operator">=</span> <span class="token string">"./dict/user.dict.utf8"</span><span class="token punctuation">;</span> <span class="token keyword">const</span> <span class="token keyword">char</span> <span class="token operator">*</span><span class="token keyword">const</span> IDF_PATH <span class="token operator">=</span> <span class="token string">"./dict/idf.utf8"</span><span class="token punctuation">;</span> <span class="token keyword">const</span> <span class="token keyword">char</span> <span class="token operator">*</span><span class="token keyword">const</span> STOP_WORD_PATH <span class="token operator">=</span> <span class="token string">"./dict/stop_words.utf8"</span><span class="token punctuation">;</span> <span class="token keyword">class</span> <span class="token class-name">JiebaUtil</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">private</span><span class="token operator">:</span> <span class="token keyword">static</span> cppjieba<span class="token double-colon punctuation">::</span>Jieba jieba<span class="token punctuation">;</span> std<span class="token double-colon punctuation">::</span>unordered_set<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> stop_words<span class="token punctuation">;</span> <span class="token function">JiebaUtil</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">:</span> <span class="token function">jieba</span><span class="token punctuation">(</span>DICT_PATH<span class="token punctuation">,</span> HMM_PATH<span class="token punctuation">,</span> USER_DICT_PATH<span class="token punctuation">,</span> IDF_PATH<span class="token punctuation">,</span> STOP_WORD_PATH<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span><span class="token punctuation">}</span> <span class="token function">JiebaUtil</span><span class="token punctuation">(</span><span class="token keyword">const</span> JiebaUtil <span class="token operator">&</span><span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token keyword">delete</span><span class="token punctuation">;</span> JiebaUtil <span class="token keyword">operator</span><span class="token operator">=</span><span class="token punctuation">(</span><span class="token keyword">const</span> JiebaUtil <span class="token operator">&</span><span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token keyword">delete</span><span class="token punctuation">;</span> <span class="token keyword">static</span> JiebaUtil <span class="token operator">*</span>instance<span class="token punctuation">;</span> <span class="token keyword">public</span><span class="token operator">:</span> <span class="token comment">//获取暂停词</span> <span class="token keyword">void</span> <span class="token function">InitJiebaUtil</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> std<span class="token double-colon punctuation">::</span>ifstream <span class="token function">in</span><span class="token punctuation">(</span>STOP_WORD_PATH<span class="token punctuation">)</span><span class="token punctuation">;</span> std<span class="token double-colon punctuation">::</span>string line<span class="token punctuation">;</span> <span class="token keyword">while</span> <span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span><span class="token function">getline</span><span class="token punctuation">(</span>in<span class="token punctuation">,</span> line<span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> stop_words<span class="token punctuation">.</span><span class="token function">insert</span><span class="token punctuation">(</span>line<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> in<span class="token punctuation">.</span><span class="token function">close</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token keyword">void</span> <span class="token function">CutStringHelper</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>src<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> <span class="token operator">*</span>out<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> jieba<span class="token punctuation">.</span><span class="token function">CutForSearch</span><span class="token punctuation">(</span>src<span class="token punctuation">,</span> <span class="token operator">*</span>out<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">//去掉暂停词会耗费大量的时间</span> <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">auto</span> iter <span class="token operator">=</span> out<span class="token operator">-></span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> iter <span class="token operator">!=</span> out<span class="token operator">-></span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span><span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token comment">//查看暂停词中是否存在当前词,如果存在就从out中删除这个词</span> <span class="token keyword">auto</span> it <span class="token operator">=</span> stop_words<span class="token punctuation">.</span><span class="token function">find</span><span class="token punctuation">(</span><span class="token operator">*</span>iter<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>it <span class="token operator">!=</span> stop_words<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> iter <span class="token operator">=</span> out<span class="token operator">-></span><span class="token function">erase</span><span class="token punctuation">(</span>iter<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token keyword">else</span> <span class="token punctuation">{<!-- --></span> <span class="token operator">++</span>iter<span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span> <span class="token keyword">static</span> <span class="token keyword">void</span> <span class="token function">CutString</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>src<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> <span class="token operator">*</span>out<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> jieba<span class="token punctuation">.</span><span class="token function">CutStringHelper</span><span class="token punctuation">(</span>src<span class="token punctuation">,</span> <span class="token operator">*</span>out<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span><span class="token punctuation">;</span> <div class="hljs-button signin" data-title="登录后复制" data-report-click="{"spm":"1001.2101.3001.4334"}"></div></code><div class="hide-preCode-box"><span class="hide-preCode-bt" data-report-view="{"spm":"1001.2101.3001.7365"}"><img class="look-more-preCode contentImg-no-view" src="https://1000bd.com/contentImg/2022/06/27/191644837.png" alt="" title=""></span></div><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li><li style="color: rgb(153, 153, 153);">2</li><li style="color: rgb(153, 153, 153);">3</li><li style="color: rgb(153, 153, 153);">4</li><li style="color: rgb(153, 153, 153);">5</li><li style="color: rgb(153, 153, 153);">6</li><li style="color: rgb(153, 153, 153);">7</li><li style="color: rgb(153, 153, 153);">8</li><li style="color: rgb(153, 153, 153);">9</li><li style="color: rgb(153, 153, 153);">10</li><li style="color: rgb(153, 153, 153);">11</li><li style="color: rgb(153, 153, 153);">12</li><li style="color: rgb(153, 153, 153);">13</li><li style="color: rgb(153, 153, 153);">14</li><li style="color: rgb(153, 153, 153);">15</li><li style="color: rgb(153, 153, 153);">16</li><li style="color: rgb(153, 153, 153);">17</li><li style="color: rgb(153, 153, 153);">18</li><li style="color: rgb(153, 153, 153);">19</li><li style="color: rgb(153, 153, 153);">20</li><li style="color: rgb(153, 153, 153);">21</li><li style="color: rgb(153, 153, 153);">22</li><li style="color: rgb(153, 153, 153);">23</li><li style="color: rgb(153, 153, 153);">24</li><li style="color: rgb(153, 153, 153);">25</li><li style="color: rgb(153, 153, 153);">26</li><li style="color: rgb(153, 153, 153);">27</li><li style="color: rgb(153, 153, 153);">28</li><li style="color: rgb(153, 153, 153);">29</li><li style="color: rgb(153, 153, 153);">30</li><li style="color: rgb(153, 153, 153);">31</li><li style="color: rgb(153, 153, 153);">32</li><li style="color: rgb(153, 153, 153);">33</li><li style="color: rgb(153, 153, 153);">34</li><li style="color: rgb(153, 153, 153);">35</li><li style="color: rgb(153, 153, 153);">36</li><li style="color: rgb(153, 153, 153);">37</li><li style="color: rgb(153, 153, 153);">38</li><li style="color: rgb(153, 153, 153);">39</li><li style="color: rgb(153, 153, 153);">40</li><li style="color: rgb(153, 153, 153);">41</li><li style="color: rgb(153, 153, 153);">42</li><li style="color: rgb(153, 153, 153);">43</li><li style="color: rgb(153, 153, 153);">44</li><li style="color: rgb(153, 153, 153);">45</li><li style="color: rgb(153, 153, 153);">46</li><li style="color: rgb(153, 153, 153);">47</li><li style="color: rgb(153, 153, 153);">48</li><li style="color: rgb(153, 153, 153);">49</li><li style="color: rgb(153, 153, 153);">50</li><li style="color: rgb(153, 153, 153);">51</li><li style="color: rgb(153, 153, 153);">52</li></ul></pre> <h2><a name="t19"></a><a id="63__Index_777"></a>6.3 将Index设置为单例</h2> <p>因为我们对建立索引的html文件不会做任何修改,由于索引是由整个html文件构成,所以十分消耗内存,由此可见,只要需要存在一份即可。<br> 因此,可以把Index设置为单例模式</p> <pre data-index="18" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token keyword">private</span><span class="token operator">:</span> <span class="token function">Index</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span><span class="token punctuation">}</span> <span class="token comment">//一定要有函数体,不能delete,用于第一次构造</span> <span class="token function">Index</span><span class="token punctuation">(</span><span class="token keyword">const</span> Index <span class="token operator">&</span><span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token keyword">delete</span><span class="token punctuation">;</span> Index <span class="token operator">&</span><span class="token keyword">operator</span><span class="token operator">=</span><span class="token punctuation">(</span><span class="token keyword">const</span> Index <span class="token operator">&</span><span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token keyword">delete</span><span class="token punctuation">;</span> <span class="token comment">//单例模式</span> <span class="token keyword">static</span> Index <span class="token operator">*</span>instance<span class="token punctuation">;</span> <span class="token keyword">static</span> std<span class="token double-colon punctuation">::</span>mutex mtx<span class="token punctuation">;</span> <span class="token keyword">public</span><span class="token operator">:</span> <span class="token operator">~</span><span class="token function">Index</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span><span class="token punctuation">}</span> <span class="token keyword">static</span> Index <span class="token operator">*</span><span class="token function">GetInstance</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token comment">//双判定加锁,提高效率</span> <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token keyword">nullptr</span> <span class="token operator">==</span> instance<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> mtx<span class="token punctuation">.</span><span class="token function">lock</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token keyword">nullptr</span> <span class="token operator">==</span> instance<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> instance <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token function">Index</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> mtx<span class="token punctuation">.</span><span class="token function">unlock</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token keyword">return</span> instance<span class="token punctuation">;</span> <span class="token punctuation">}</span> <div class="hljs-button signin" data-title="登录后复制" data-report-click="{"spm":"1001.2101.3001.4334"}"></div></code><div class="hide-preCode-box"><span class="hide-preCode-bt" data-report-view="{"spm":"1001.2101.3001.7365"}"><img class="look-more-preCode contentImg-no-view" src="https://1000bd.com/contentImg/2022/06/27/191644837.png" alt="" title=""></span></div><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li><li style="color: rgb(153, 153, 153);">2</li><li style="color: rgb(153, 153, 153);">3</li><li style="color: rgb(153, 153, 153);">4</li><li style="color: rgb(153, 153, 153);">5</li><li style="color: rgb(153, 153, 153);">6</li><li style="color: rgb(153, 153, 153);">7</li><li style="color: rgb(153, 153, 153);">8</li><li style="color: rgb(153, 153, 153);">9</li><li style="color: rgb(153, 153, 153);">10</li><li style="color: rgb(153, 153, 153);">11</li><li style="color: rgb(153, 153, 153);">12</li><li style="color: rgb(153, 153, 153);">13</li><li style="color: rgb(153, 153, 153);">14</li><li style="color: rgb(153, 153, 153);">15</li><li style="color: rgb(153, 153, 153);">16</li><li style="color: rgb(153, 153, 153);">17</li><li style="color: rgb(153, 153, 153);">18</li><li style="color: rgb(153, 153, 153);">19</li><li style="color: rgb(153, 153, 153);">20</li><li style="color: rgb(153, 153, 153);">21</li><li style="color: rgb(153, 153, 153);">22</li><li style="color: rgb(153, 153, 153);">23</li><li style="color: rgb(153, 153, 153);">24</li><li style="color: rgb(153, 153, 153);">25</li></ul></pre> <h1><a name="t20"></a><a id="7_Searcher_810"></a>7、 编写搜索引擎模块Searcher</h1> <h2><a name="t21"></a><a id="71___811"></a>7.1 整体框架</h2> <p>到此,我们能通过关键词获取倒排拉链。但是获取的倒排拉链中可能存在重复文档,因为搜索的可能是一条语句被切分的多个关键词。<br> 例如:文档内容为今天天气真热,而搜索的语句为今天天气真热,被切分为三个词,今天、天气、真热。所以搜索出的文档会出现三次。<br> 所以,我们需要用到unordered_map对搜索出的文档进行去重,并将其添加到vector中,方便根据权重进行排序<br> 最后再根据查找出来的结果,构建json串,对于正文部分,只需要截取一部分即可</p> <pre data-index="19" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token keyword">namespace</span> ns_searcher <span class="token punctuation">{<!-- --></span> <span class="token keyword">struct</span> <span class="token class-name">InvertedElemPrint</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">int</span> doc_id<span class="token punctuation">;</span> <span class="token comment">//文档ID</span> <span class="token keyword">int</span> weight<span class="token punctuation">;</span> <span class="token comment">//权重,用来排序</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> words<span class="token punctuation">;</span> <span class="token function">InvertedElemPrint</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">:</span> <span class="token function">doc_id</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">weight</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span><span class="token punctuation">}</span> <span class="token punctuation">}</span><span class="token punctuation">;</span> <span class="token keyword">class</span> <span class="token class-name">Searcher</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">private</span><span class="token operator">:</span> ns_index<span class="token double-colon punctuation">::</span>Index <span class="token operator">*</span>index<span class="token punctuation">;</span> <span class="token comment">//供系统进行查找的索引</span> <span class="token keyword">public</span><span class="token operator">:</span> <span class="token comment">// query:搜索关键词</span> <span class="token comment">// json_string:返回给用户浏览器的搜索结果</span> <span class="token keyword">void</span> <span class="token function">Search</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>query<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">*</span>json_string<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token comment">// 1.[分词]:对我们的query进行按照searcher的要求进行分词</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> words<span class="token punctuation">;</span> ns_util<span class="token double-colon punctuation">::</span><span class="token class-name">JiebaUtil</span><span class="token double-colon punctuation">::</span><span class="token function">CutString</span><span class="token punctuation">(</span>query<span class="token punctuation">,</span> <span class="token operator">&</span>words<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// 2.[触发]:就是根据分词的各个“词”进行index查找-----建立index是忽略大小写的,所以搜索时,关键词也要忽略大小写(全转为小写)</span> <span class="token comment">//关键词可能被分为多个词,多个词对应多个倒排拉链,将这些倒排拉链都存放在inverted_list_all中</span> <span class="token comment">// ns_index::InvertedList inverted_list_all; //内部存放的是InvertedElem</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>InvertedElemPrint<span class="token operator">></span> inverted_list_all<span class="token punctuation">;</span> <span class="token comment">//内部存放的是InvertedElemPrint</span> std<span class="token double-colon punctuation">::</span>unordered_map<span class="token operator"><</span><span class="token keyword">int</span><span class="token punctuation">,</span> InvertedElemPrint<span class="token operator">></span> tokens_map<span class="token punctuation">;</span> <span class="token keyword">for</span> <span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span>string word <span class="token operator">:</span> words<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> boost<span class="token double-colon punctuation">::</span><span class="token function">to_lower</span><span class="token punctuation">(</span>word<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// struct InvertedElem</span> <span class="token comment">// {<!-- --></span> <span class="token comment">// int doc_id; //文档ID</span> <span class="token comment">// std::string word; //关键词</span> <span class="token comment">// int weight; //权重,用来排序</span> <span class="token comment">// };</span> <span class="token comment">// //倒排拉链</span> <span class="token comment">// typedef std::vector<InvertedElem> InvertedList;</span> <span class="token comment">//通过关键词获取倒排拉链</span> ns_index<span class="token double-colon punctuation">::</span>InvertedList <span class="token operator">*</span>inverted_list <span class="token operator">=</span> index<span class="token operator">-></span><span class="token function">GetInvertedList</span><span class="token punctuation">(</span>word<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">//倒排拉链可能为空,为空就继续获取下一个关键词的倒排拉链</span> <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token keyword">nullptr</span> <span class="token operator">==</span> inverted_list<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">continue</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">//不完美,可能会存在重复文档</span> <span class="token comment">// inverted_list_all.insert(inverted_list_all.end(), inverted_list->begin(), inverted_list->end());</span> <span class="token comment">//fl/是/一个/程序员-------4个词可能指向了同一个文档,所以需要去重</span> <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">const</span> <span class="token keyword">auto</span> <span class="token operator">&</span>elem <span class="token operator">:</span> <span class="token operator">*</span>inverted_list<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">auto</span> <span class="token operator">&</span>item <span class="token operator">=</span> tokens_map<span class="token punctuation">[</span>elem<span class="token punctuation">.</span>doc_id<span class="token punctuation">]</span><span class="token punctuation">;</span> <span class="token comment">//返回InvertedElemPrint的引用</span> item<span class="token punctuation">.</span>doc_id <span class="token operator">=</span> elem<span class="token punctuation">.</span>doc_id<span class="token punctuation">;</span> item<span class="token punctuation">.</span>weight <span class="token operator">+=</span> elem<span class="token punctuation">.</span>weight<span class="token punctuation">;</span> item<span class="token punctuation">.</span>words<span class="token punctuation">.</span><span class="token function">push_back</span><span class="token punctuation">(</span>elem<span class="token punctuation">.</span>word<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span> <span class="token comment">//将已经去重的文档存放在inverted_list_all中</span> <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">const</span> <span class="token keyword">auto</span> <span class="token operator">&</span>item <span class="token operator">:</span> tokens_map<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> inverted_list_all<span class="token punctuation">.</span><span class="token function">push_back</span><span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span><span class="token function">move</span><span class="token punctuation">(</span>item<span class="token punctuation">.</span>second<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">// 3.[合并排序]:汇总查找结果,按照相关性[weight]进行降序排序</span> <span class="token comment">// std::sort(inverted_list_all.begin(), inverted_list_all.end(), [](const ns_index::InvertedElem &e1, const ns_index::InvertedElem &e2)</span> <span class="token comment">// { return e1.weight > e2.weight; });</span> std<span class="token double-colon punctuation">::</span><span class="token function">sort</span><span class="token punctuation">(</span>inverted_list_all<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> inverted_list_all<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token punctuation">(</span><span class="token keyword">const</span> InvertedElemPrint <span class="token operator">&</span>e1<span class="token punctuation">,</span> <span class="token keyword">const</span> InvertedElemPrint <span class="token operator">&</span>e2<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">return</span> e1<span class="token punctuation">.</span>weight <span class="token operator">></span> e2<span class="token punctuation">.</span>weight<span class="token punctuation">;</span> <span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// 4.[构建]:根据查找出来的结果,构建json串</span> Json<span class="token double-colon punctuation">::</span>Value root<span class="token punctuation">;</span> <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">auto</span> <span class="token operator">&</span>item <span class="token operator">:</span> inverted_list_all<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> ns_index<span class="token double-colon punctuation">::</span>DocInfo <span class="token operator">*</span>doc <span class="token operator">=</span> index<span class="token operator">-></span><span class="token function">GetForwardIndex</span><span class="token punctuation">(</span>item<span class="token punctuation">.</span>doc_id<span class="token punctuation">)</span><span class="token punctuation">;</span> Json<span class="token double-colon punctuation">::</span>Value elem<span class="token punctuation">;</span> elem<span class="token punctuation">[</span><span class="token string">"title"</span><span class="token punctuation">]</span> <span class="token operator">=</span> doc<span class="token operator">-></span>title<span class="token punctuation">;</span> elem<span class="token punctuation">[</span><span class="token string">"desc"</span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token function">GetDesc</span><span class="token punctuation">(</span>doc<span class="token operator">-></span>content<span class="token punctuation">,</span> item<span class="token punctuation">.</span>words<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// content是文档的去标签的结果,但不是我们想要的结果,要的是一部分(摘要)</span> elem<span class="token punctuation">[</span><span class="token string">"url"</span><span class="token punctuation">]</span> <span class="token operator">=</span> doc<span class="token operator">-></span>url<span class="token punctuation">;</span> <span class="token comment">// for debug</span> <span class="token comment">// elem["id"] = (int)doc->doc_id;</span> <span class="token comment">// elem["weight"] = item.weight;</span> <span class="token comment">//因为item是有序的,doc也是有序的,elem也是有序的,所以root里面的数据也是有序的</span> root<span class="token punctuation">.</span><span class="token function">append</span><span class="token punctuation">(</span>elem<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">// Json::StyledWriter writer;</span> Json<span class="token double-colon punctuation">::</span>FastWriter writer<span class="token punctuation">;</span> <span class="token operator">*</span>json_string <span class="token operator">=</span> writer<span class="token punctuation">.</span><span class="token function">write</span><span class="token punctuation">(</span>root<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">//获取内容摘要</span> std<span class="token double-colon punctuation">::</span>string <span class="token function">GetDesc</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>html_content<span class="token punctuation">,</span> <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>word<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token comment">//找到word在html_content中首次出现,然后往前找50字节(如果没有,从begin开始),往后找100字节(如果没有,到end就可以)</span> <span class="token comment">//截取出这部分内容</span> <span class="token comment">// 1.找到首次出现</span> <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>size_t prev_step <span class="token operator">=</span> <span class="token number">50</span><span class="token punctuation">;</span> <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>size_t next_step <span class="token operator">=</span> <span class="token number">100</span><span class="token punctuation">;</span> <span class="token comment">//为什么不用string.find?</span> <span class="token comment">//因为搜索关键词为小写,但原始文档没有改变,可能是大写,此时就存在找不到的情况</span> <span class="token keyword">auto</span> iter <span class="token operator">=</span> std<span class="token double-colon punctuation">::</span><span class="token function">search</span><span class="token punctuation">(</span>html_content<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> html_content<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> word<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> word<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token punctuation">(</span><span class="token keyword">int</span> x<span class="token punctuation">,</span> <span class="token keyword">int</span> y<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">return</span> <span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span><span class="token function">tolower</span><span class="token punctuation">(</span>x<span class="token punctuation">)</span> <span class="token operator">==</span> std<span class="token double-colon punctuation">::</span><span class="token function">tolower</span><span class="token punctuation">(</span>y<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>iter <span class="token operator">==</span> html_content<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">return</span> <span class="token string">"None"</span><span class="token punctuation">;</span> <span class="token comment">//这种情况是不能存在的,因为倒排拉链是对应word,这里只是为了保险起见</span> <span class="token punctuation">}</span> <span class="token comment">// distance计算两个迭代器之间的距离</span> std<span class="token double-colon punctuation">::</span>size_t pos <span class="token operator">=</span> std<span class="token double-colon punctuation">::</span><span class="token function">distance</span><span class="token punctuation">(</span>html_content<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> iter<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// 2.获取begin,end</span> std<span class="token double-colon punctuation">::</span>size_t start <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> std<span class="token double-colon punctuation">::</span>size_t end <span class="token operator">=</span> html_content<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">-</span> <span class="token number">1</span><span class="token punctuation">;</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>start <span class="token operator">+</span> prev_step <span class="token operator"><</span> pos<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> start <span class="token operator">=</span> pos <span class="token operator">-</span> prev_step<span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>pos <span class="token operator">+</span> next_step <span class="token operator"><</span> end<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> end <span class="token operator">=</span> pos <span class="token operator">+</span> next_step<span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">// 3.截取子串,然后return</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>start <span class="token operator">>=</span> end<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">return</span> <span class="token string">"None"</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> std<span class="token double-colon punctuation">::</span>string desc <span class="token operator">=</span> html_content<span class="token punctuation">.</span><span class="token function">substr</span><span class="token punctuation">(</span>start<span class="token punctuation">,</span> end <span class="token operator">-</span> start<span class="token punctuation">)</span><span class="token punctuation">;</span> desc <span class="token operator">+=</span> <span class="token string">"..."</span><span class="token punctuation">;</span> <span class="token keyword">return</span> desc<span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span><span class="token punctuation">;</span> <div class="hljs-button signin" data-title="登录后复制" data-report-click="{"spm":"1001.2101.3001.4334"}"></div></code><div class="hide-preCode-box"><span class="hide-preCode-bt" data-report-view="{"spm":"1001.2101.3001.7365"}"><img class="look-more-preCode contentImg-no-view" src="https://1000bd.com/contentImg/2022/06/27/191644837.png" alt="" title=""></span></div><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li><li style="color: rgb(153, 153, 153);">2</li><li style="color: rgb(153, 153, 153);">3</li><li style="color: rgb(153, 153, 153);">4</li><li style="color: rgb(153, 153, 153);">5</li><li style="color: rgb(153, 153, 153);">6</li><li style="color: rgb(153, 153, 153);">7</li><li style="color: rgb(153, 153, 153);">8</li><li style="color: rgb(153, 153, 153);">9</li><li style="color: rgb(153, 153, 153);">10</li><li style="color: rgb(153, 153, 153);">11</li><li style="color: rgb(153, 153, 153);">12</li><li style="color: rgb(153, 153, 153);">13</li><li style="color: rgb(153, 153, 153);">14</li><li style="color: rgb(153, 153, 153);">15</li><li style="color: rgb(153, 153, 153);">16</li><li style="color: rgb(153, 153, 153);">17</li><li style="color: rgb(153, 153, 153);">18</li><li style="color: rgb(153, 153, 153);">19</li><li style="color: rgb(153, 153, 153);">20</li><li style="color: rgb(153, 153, 153);">21</li><li style="color: rgb(153, 153, 153);">22</li><li style="color: rgb(153, 153, 153);">23</li><li style="color: rgb(153, 153, 153);">24</li><li style="color: rgb(153, 153, 153);">25</li><li style="color: rgb(153, 153, 153);">26</li><li style="color: rgb(153, 153, 153);">27</li><li style="color: rgb(153, 153, 153);">28</li><li style="color: rgb(153, 153, 153);">29</li><li style="color: rgb(153, 153, 153);">30</li><li style="color: rgb(153, 153, 153);">31</li><li style="color: rgb(153, 153, 153);">32</li><li style="color: rgb(153, 153, 153);">33</li><li style="color: rgb(153, 153, 153);">34</li><li style="color: rgb(153, 153, 153);">35</li><li style="color: rgb(153, 153, 153);">36</li><li style="color: rgb(153, 153, 153);">37</li><li style="color: rgb(153, 153, 153);">38</li><li style="color: rgb(153, 153, 153);">39</li><li style="color: rgb(153, 153, 153);">40</li><li style="color: rgb(153, 153, 153);">41</li><li style="color: rgb(153, 153, 153);">42</li><li style="color: rgb(153, 153, 153);">43</li><li style="color: rgb(153, 153, 153);">44</li><li style="color: rgb(153, 153, 153);">45</li><li style="color: rgb(153, 153, 153);">46</li><li style="color: rgb(153, 153, 153);">47</li><li style="color: rgb(153, 153, 153);">48</li><li style="color: rgb(153, 153, 153);">49</li><li style="color: rgb(153, 153, 153);">50</li><li style="color: rgb(153, 153, 153);">51</li><li style="color: rgb(153, 153, 153);">52</li><li style="color: rgb(153, 153, 153);">53</li><li style="color: rgb(153, 153, 153);">54</li><li style="color: rgb(153, 153, 153);">55</li><li style="color: rgb(153, 153, 153);">56</li><li style="color: rgb(153, 153, 153);">57</li><li style="color: rgb(153, 153, 153);">58</li><li style="color: rgb(153, 153, 153);">59</li><li style="color: rgb(153, 153, 153);">60</li><li style="color: rgb(153, 153, 153);">61</li><li style="color: rgb(153, 153, 153);">62</li><li style="color: rgb(153, 153, 153);">63</li><li style="color: rgb(153, 153, 153);">64</li><li style="color: rgb(153, 153, 153);">65</li><li style="color: rgb(153, 153, 153);">66</li><li style="color: rgb(153, 153, 153);">67</li><li style="color: rgb(153, 153, 153);">68</li><li style="color: rgb(153, 153, 153);">69</li><li style="color: rgb(153, 153, 153);">70</li><li style="color: rgb(153, 153, 153);">71</li><li style="color: rgb(153, 153, 153);">72</li><li style="color: rgb(153, 153, 153);">73</li><li style="color: rgb(153, 153, 153);">74</li><li style="color: rgb(153, 153, 153);">75</li><li style="color: rgb(153, 153, 153);">76</li><li style="color: rgb(153, 153, 153);">77</li><li style="color: rgb(153, 153, 153);">78</li><li style="color: rgb(153, 153, 153);">79</li><li style="color: rgb(153, 153, 153);">80</li><li style="color: rgb(153, 153, 153);">81</li><li style="color: rgb(153, 153, 153);">82</li><li style="color: rgb(153, 153, 153);">83</li><li style="color: rgb(153, 153, 153);">84</li><li style="color: rgb(153, 153, 153);">85</li><li style="color: rgb(153, 153, 153);">86</li><li style="color: rgb(153, 153, 153);">87</li><li style="color: rgb(153, 153, 153);">88</li><li style="color: rgb(153, 153, 153);">89</li><li style="color: rgb(153, 153, 153);">90</li><li style="color: rgb(153, 153, 153);">91</li><li style="color: rgb(153, 153, 153);">92</li><li style="color: rgb(153, 153, 153);">93</li><li style="color: rgb(153, 153, 153);">94</li><li style="color: rgb(153, 153, 153);">95</li><li style="color: rgb(153, 153, 153);">96</li><li style="color: rgb(153, 153, 153);">97</li><li style="color: rgb(153, 153, 153);">98</li><li style="color: rgb(153, 153, 153);">99</li><li style="color: rgb(153, 153, 153);">100</li><li style="color: rgb(153, 153, 153);">101</li><li style="color: rgb(153, 153, 153);">102</li><li style="color: rgb(153, 153, 153);">103</li><li style="color: rgb(153, 153, 153);">104</li><li style="color: rgb(153, 153, 153);">105</li><li style="color: rgb(153, 153, 153);">106</li><li style="color: rgb(153, 153, 153);">107</li><li style="color: rgb(153, 153, 153);">108</li><li style="color: rgb(153, 153, 153);">109</li><li style="color: rgb(153, 153, 153);">110</li><li style="color: rgb(153, 153, 153);">111</li><li style="color: rgb(153, 153, 153);">112</li><li style="color: rgb(153, 153, 153);">113</li><li style="color: rgb(153, 153, 153);">114</li><li style="color: rgb(153, 153, 153);">115</li><li style="color: rgb(153, 153, 153);">116</li><li style="color: rgb(153, 153, 153);">117</li><li style="color: rgb(153, 153, 153);">118</li><li style="color: rgb(153, 153, 153);">119</li><li style="color: rgb(153, 153, 153);">120</li><li style="color: rgb(153, 153, 153);">121</li><li style="color: rgb(153, 153, 153);">122</li><li style="color: rgb(153, 153, 153);">123</li><li style="color: rgb(153, 153, 153);">124</li><li style="color: rgb(153, 153, 153);">125</li><li style="color: rgb(153, 153, 153);">126</li><li style="color: rgb(153, 153, 153);">127</li><li style="color: rgb(153, 153, 153);">128</li><li style="color: rgb(153, 153, 153);">129</li><li style="color: rgb(153, 153, 153);">130</li><li style="color: rgb(153, 153, 153);">131</li><li style="color: rgb(153, 153, 153);">132</li><li style="color: rgb(153, 153, 153);">133</li><li style="color: rgb(153, 153, 153);">134</li><li style="color: rgb(153, 153, 153);">135</li><li style="color: rgb(153, 153, 153);">136</li><li style="color: rgb(153, 153, 153);">137</li><li style="color: rgb(153, 153, 153);">138</li></ul></pre> <h2><a name="t22"></a><a id="72___958"></a>7.2 分词</h2> <p>对我们的query进行按照searcher的要求进行分词</p> <pre data-index="20" class="prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;">std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> words<span class="token punctuation">;</span> ns_util<span class="token double-colon punctuation">::</span><span class="token class-name">JiebaUtil</span><span class="token double-colon punctuation">::</span><span class="token function">CutString</span><span class="token punctuation">(</span>query<span class="token punctuation">,</span> <span class="token operator">&</span>words<span class="token punctuation">)</span><span class="token punctuation">;</span> <div class="hljs-button signin" data-title="登录后复制" data-report-click="{"spm":"1001.2101.3001.4334"}"></div></code><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li><li style="color: rgb(153, 153, 153);">2</li></ul></pre> <h2><a name="t23"></a><a id="73___966"></a>7.3 触发</h2> <p>建立index是忽略大小写的,所以搜索时,关键词也要忽略大小写(全转为小写)<br> 关键词可能被分为多个词,多个词对应多个倒排拉链,将这些倒排拉链都存放在inverted_list_all中</p> <pre data-index="21" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;">std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>InvertedElemPrint<span class="token operator">></span> inverted_list_all<span class="token punctuation">;</span> <span class="token comment">//内部存放的是InvertedElemPrint</span> std<span class="token double-colon punctuation">::</span>unordered_map<span class="token operator"><</span><span class="token keyword">int</span><span class="token punctuation">,</span> InvertedElemPrint<span class="token operator">></span> tokens_map<span class="token punctuation">;</span> <span class="token keyword">for</span> <span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span>string word <span class="token operator">:</span> words<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token comment">//全部转为小写</span> boost<span class="token double-colon punctuation">::</span><span class="token function">to_lower</span><span class="token punctuation">(</span>word<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">//通过关键词获取倒排拉链</span> ns_index<span class="token double-colon punctuation">::</span>InvertedList <span class="token operator">*</span>inverted_list <span class="token operator">=</span> index<span class="token operator">-></span><span class="token function">GetInvertedList</span><span class="token punctuation">(</span>word<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">//倒排拉链可能为空,为空就继续获取下一个关键词的倒排拉链</span> <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token keyword">nullptr</span> <span class="token operator">==</span> inverted_list<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">continue</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">//不完美,可能会存在重复文档</span> <span class="token comment">// inverted_list_all.insert(inverted_list_all.end(), inverted_list->begin(), inverted_list->end());</span> <span class="token comment">// fl/是/一个/程序员-------4个词可能指向了同一个文档,所以需要去重</span> <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">const</span> <span class="token keyword">auto</span> <span class="token operator">&</span>elem <span class="token operator">:</span> <span class="token operator">*</span>inverted_list<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">auto</span> <span class="token operator">&</span>item <span class="token operator">=</span> tokens_map<span class="token punctuation">[</span>elem<span class="token punctuation">.</span>doc_id<span class="token punctuation">]</span><span class="token punctuation">;</span> <span class="token comment">//返回InvertedElemPrint的引用</span> item<span class="token punctuation">.</span>doc_id <span class="token operator">=</span> elem<span class="token punctuation">.</span>doc_id<span class="token punctuation">;</span> item<span class="token punctuation">.</span>weight <span class="token operator">+=</span> elem<span class="token punctuation">.</span>weight<span class="token punctuation">;</span> item<span class="token punctuation">.</span>words<span class="token punctuation">.</span><span class="token function">push_back</span><span class="token punctuation">(</span>elem<span class="token punctuation">.</span>word<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span> <span class="token comment">//将已经去重的文档存放在inverted_list_all中</span> <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">const</span> <span class="token keyword">auto</span> <span class="token operator">&</span>item <span class="token operator">:</span> tokens_map<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> inverted_list_all<span class="token punctuation">.</span><span class="token function">push_back</span><span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span><span class="token function">move</span><span class="token punctuation">(</span>item<span class="token punctuation">.</span>second<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <div class="hljs-button signin" data-title="登录后复制" data-report-click="{"spm":"1001.2101.3001.4334"}"></div></code><div class="hide-preCode-box"><span class="hide-preCode-bt" data-report-view="{"spm":"1001.2101.3001.7365"}"><img class="look-more-preCode contentImg-no-view" src="https://1000bd.com/contentImg/2022/06/27/191644837.png" alt="" title=""></span></div><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li><li style="color: rgb(153, 153, 153);">2</li><li style="color: rgb(153, 153, 153);">3</li><li style="color: rgb(153, 153, 153);">4</li><li style="color: rgb(153, 153, 153);">5</li><li style="color: rgb(153, 153, 153);">6</li><li style="color: rgb(153, 153, 153);">7</li><li style="color: rgb(153, 153, 153);">8</li><li style="color: rgb(153, 153, 153);">9</li><li style="color: rgb(153, 153, 153);">10</li><li style="color: rgb(153, 153, 153);">11</li><li style="color: rgb(153, 153, 153);">12</li><li style="color: rgb(153, 153, 153);">13</li><li style="color: rgb(153, 153, 153);">14</li><li style="color: rgb(153, 153, 153);">15</li><li style="color: rgb(153, 153, 153);">16</li><li style="color: rgb(153, 153, 153);">17</li><li style="color: rgb(153, 153, 153);">18</li><li style="color: rgb(153, 153, 153);">19</li><li style="color: rgb(153, 153, 153);">20</li><li style="color: rgb(153, 153, 153);">21</li><li style="color: rgb(153, 153, 153);">22</li><li style="color: rgb(153, 153, 153);">23</li><li style="color: rgb(153, 153, 153);">24</li><li style="color: rgb(153, 153, 153);">25</li><li style="color: rgb(153, 153, 153);">26</li><li style="color: rgb(153, 153, 153);">27</li><li style="color: rgb(153, 153, 153);">28</li><li style="color: rgb(153, 153, 153);">29</li><li style="color: rgb(153, 153, 153);">30</li><li style="color: rgb(153, 153, 153);">31</li><li style="color: rgb(153, 153, 153);">32</li></ul></pre> <h2><a name="t24"></a><a id="74___1005"></a>7.4 合并排序</h2> <p>利用sort和lambda表示对文档按照权值进行降序排序</p> <pre data-index="22" class="prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;">std<span class="token double-colon punctuation">::</span><span class="token function">sort</span><span class="token punctuation">(</span>inverted_list_all<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> inverted_list_all<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token punctuation">(</span><span class="token keyword">const</span> InvertedElemPrint <span class="token operator">&</span>e1<span class="token punctuation">,</span> <span class="token keyword">const</span> InvertedElemPrint <span class="token operator">&</span>e2<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">return</span> e1<span class="token punctuation">.</span>weight <span class="token operator">></span> e2<span class="token punctuation">.</span>weight<span class="token punctuation">;</span> <span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <div class="hljs-button signin" data-title="登录后复制" data-report-click="{"spm":"1001.2101.3001.4334"}"></div></code><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li><li style="color: rgb(153, 153, 153);">2</li></ul></pre> <h2><a name="t25"></a><a id="75__jsoncppjson_1013"></a>7.5 利用jsoncpp构建json串</h2> <p>在网路中传输的数据都是已经序列化的字符串,所以我们需要利用jsoncpp对传输的数据进行序列化。<br> 因为jsoncpp是第三方库,也需要自行安装</p> <blockquote> <p>sudo yum install -y jsoncpp-devel</p> </blockquote> <pre data-index="23" class="prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;">Json<span class="token double-colon punctuation">::</span>Value root<span class="token punctuation">;</span> <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">auto</span> <span class="token operator">&</span>item <span class="token operator">:</span> inverted_list_all<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> ns_index<span class="token double-colon punctuation">::</span>DocInfo <span class="token operator">*</span>doc <span class="token operator">=</span> index<span class="token operator">-></span><span class="token function">GetForwardIndex</span><span class="token punctuation">(</span>item<span class="token punctuation">.</span>doc_id<span class="token punctuation">)</span><span class="token punctuation">;</span> Json<span class="token double-colon punctuation">::</span>Value elem<span class="token punctuation">;</span> elem<span class="token punctuation">[</span><span class="token string">"title"</span><span class="token punctuation">]</span> <span class="token operator">=</span> doc<span class="token operator">-></span>title<span class="token punctuation">;</span> elem<span class="token punctuation">[</span><span class="token string">"desc"</span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token function">GetDesc</span><span class="token punctuation">(</span>doc<span class="token operator">-></span>content<span class="token punctuation">,</span> item<span class="token punctuation">.</span>words<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// content是文档的去标签的结果,但不是我们想要的结果,要的是一部分(摘要)</span> elem<span class="token punctuation">[</span><span class="token string">"url"</span><span class="token punctuation">]</span> <span class="token operator">=</span> doc<span class="token operator">-></span>url<span class="token punctuation">;</span> <span class="token comment">//因为item是有序的,doc也是有序的,elem也是有序的,所以root里面的数据也是有序的</span> root<span class="token punctuation">.</span><span class="token function">append</span><span class="token punctuation">(</span>elem<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">// Json::StyledWriter writer;</span> Json<span class="token double-colon punctuation">::</span>FastWriter writer<span class="token punctuation">;</span> <span class="token operator">*</span>json_string <span class="token operator">=</span> writer<span class="token punctuation">.</span><span class="token function">write</span><span class="token punctuation">(</span>root<span class="token punctuation">)</span><span class="token punctuation">;</span> <div class="hljs-button signin" data-title="登录后复制" data-report-click="{"spm":"1001.2101.3001.4334"}"></div></code><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li><li style="color: rgb(153, 153, 153);">2</li><li style="color: rgb(153, 153, 153);">3</li><li style="color: rgb(153, 153, 153);">4</li><li style="color: rgb(153, 153, 153);">5</li><li style="color: rgb(153, 153, 153);">6</li><li style="color: rgb(153, 153, 153);">7</li><li style="color: rgb(153, 153, 153);">8</li><li style="color: rgb(153, 153, 153);">9</li><li style="color: rgb(153, 153, 153);">10</li><li style="color: rgb(153, 153, 153);">11</li><li style="color: rgb(153, 153, 153);">12</li><li style="color: rgb(153, 153, 153);">13</li><li style="color: rgb(153, 153, 153);">14</li><li style="color: rgb(153, 153, 153);">15</li></ul></pre> <h3><a name="t26"></a><a id="751___1036"></a>7.5.1 获取内容摘要</h3> <p>因为在网页不需要显示的正文全部分,所以需要对正文进行摘要。摘要的原则是找到关键词在正文中首次出现的位置,然后往前找50字节(如果没有,从begin开始),往后找100字节(如果没有,到end就可以)</p> <pre data-index="24" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;">std<span class="token double-colon punctuation">::</span>string <span class="token function">GetDesc</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>html_content<span class="token punctuation">,</span> <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>word<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token comment">//找到word在html_content中首次出现的位置,然后往前找50字节(如果没有,从begin开始),往后找100字节(如果没有,到end就可以)</span> <span class="token comment">//截取出这部分内容</span> <span class="token comment">// 1.找到首次出现</span> <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>size_t prev_step <span class="token operator">=</span> <span class="token number">50</span><span class="token punctuation">;</span> <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>size_t next_step <span class="token operator">=</span> <span class="token number">100</span><span class="token punctuation">;</span> <span class="token comment">//为什么不用string.find?</span> <span class="token comment">//因为搜索关键词为小写,但原始文档没有改变,可能是大写,此时就存在找不到的情况</span> <span class="token keyword">auto</span> iter <span class="token operator">=</span> std<span class="token double-colon punctuation">::</span><span class="token function">search</span><span class="token punctuation">(</span>html_content<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> html_content<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> word<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> word<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token punctuation">(</span><span class="token keyword">int</span> x<span class="token punctuation">,</span> <span class="token keyword">int</span> y<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">return</span> <span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span><span class="token function">tolower</span><span class="token punctuation">(</span>x<span class="token punctuation">)</span> <span class="token operator">==</span> std<span class="token double-colon punctuation">::</span><span class="token function">tolower</span><span class="token punctuation">(</span>y<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>iter <span class="token operator">==</span> html_content<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">return</span> <span class="token string">"None"</span><span class="token punctuation">;</span> <span class="token comment">//这种情况是不能存在的,因为倒排拉链是对应word,这里只是为了保险起见</span> <span class="token punctuation">}</span> <span class="token comment">// distance计算两个迭代器之间的距离</span> std<span class="token double-colon punctuation">::</span>size_t pos <span class="token operator">=</span> std<span class="token double-colon punctuation">::</span><span class="token function">distance</span><span class="token punctuation">(</span>html_content<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> iter<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// 2.获取begin,end</span> std<span class="token double-colon punctuation">::</span>size_t start <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> std<span class="token double-colon punctuation">::</span>size_t end <span class="token operator">=</span> html_content<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">-</span> <span class="token number">1</span><span class="token punctuation">;</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>start <span class="token operator">+</span> prev_step <span class="token operator"><</span> pos<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> start <span class="token operator">=</span> pos <span class="token operator">-</span> prev_step<span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>pos <span class="token operator">+</span> next_step <span class="token operator"><</span> end<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> end <span class="token operator">=</span> pos <span class="token operator">+</span> next_step<span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">// 3.截取子串,然后return</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>start <span class="token operator">>=</span> end<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">return</span> <span class="token string">"None"</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> std<span class="token double-colon punctuation">::</span>string desc <span class="token operator">=</span> html_content<span class="token punctuation">.</span><span class="token function">substr</span><span class="token punctuation">(</span>start<span class="token punctuation">,</span> end <span class="token operator">-</span> start<span class="token punctuation">)</span><span class="token punctuation">;</span> desc <span class="token operator">+=</span> <span class="token string">"..."</span><span class="token punctuation">;</span> <span class="token keyword">return</span> desc<span class="token punctuation">;</span> <span class="token punctuation">}</span> <div class="hljs-button signin" data-title="登录后复制" data-report-click="{"spm":"1001.2101.3001.4334"}"></div></code><div class="hide-preCode-box"><span class="hide-preCode-bt" data-report-view="{"spm":"1001.2101.3001.7365"}"><img class="look-more-preCode contentImg-no-view" src="https://1000bd.com/contentImg/2022/06/27/191644837.png" alt="" title=""></span></div><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li><li style="color: rgb(153, 153, 153);">2</li><li style="color: rgb(153, 153, 153);">3</li><li style="color: rgb(153, 153, 153);">4</li><li style="color: rgb(153, 153, 153);">5</li><li style="color: rgb(153, 153, 153);">6</li><li style="color: rgb(153, 153, 153);">7</li><li style="color: rgb(153, 153, 153);">8</li><li style="color: rgb(153, 153, 153);">9</li><li style="color: rgb(153, 153, 153);">10</li><li style="color: rgb(153, 153, 153);">11</li><li style="color: rgb(153, 153, 153);">12</li><li style="color: rgb(153, 153, 153);">13</li><li style="color: rgb(153, 153, 153);">14</li><li style="color: rgb(153, 153, 153);">15</li><li style="color: rgb(153, 153, 153);">16</li><li style="color: rgb(153, 153, 153);">17</li><li style="color: rgb(153, 153, 153);">18</li><li style="color: rgb(153, 153, 153);">19</li><li style="color: rgb(153, 153, 153);">20</li><li style="color: rgb(153, 153, 153);">21</li><li style="color: rgb(153, 153, 153);">22</li><li style="color: rgb(153, 153, 153);">23</li><li style="color: rgb(153, 153, 153);">24</li><li style="color: rgb(153, 153, 153);">25</li><li style="color: rgb(153, 153, 153);">26</li><li style="color: rgb(153, 153, 153);">27</li><li style="color: rgb(153, 153, 153);">28</li><li style="color: rgb(153, 153, 153);">29</li><li style="color: rgb(153, 153, 153);">30</li><li style="color: rgb(153, 153, 153);">31</li><li style="color: rgb(153, 153, 153);">32</li><li style="color: rgb(153, 153, 153);">33</li><li style="color: rgb(153, 153, 153);">34</li><li style="color: rgb(153, 153, 153);">35</li><li style="color: rgb(153, 153, 153);">36</li><li style="color: rgb(153, 153, 153);">37</li><li style="color: rgb(153, 153, 153);">38</li><li style="color: rgb(153, 153, 153);">39</li><li style="color: rgb(153, 153, 153);">40</li><li style="color: rgb(153, 153, 153);">41</li></ul></pre> <h1><a name="t27"></a><a id="8http_server__1083"></a>8、编写http_server 模块</h1> <pre data-index="25" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string root_path <span class="token operator">=</span> <span class="token string">"./wwwroot"</span><span class="token punctuation">;</span> <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string input <span class="token operator">=</span> <span class="token string">"data/raw_html/raw.txt"</span><span class="token punctuation">;</span> <span class="token keyword">int</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> ns_searcher<span class="token double-colon punctuation">::</span>Searcher search<span class="token punctuation">;</span> search<span class="token punctuation">.</span><span class="token function">InitSearcher</span><span class="token punctuation">(</span>input<span class="token punctuation">)</span><span class="token punctuation">;</span> httplib<span class="token double-colon punctuation">::</span>Server svr<span class="token punctuation">;</span> svr<span class="token punctuation">.</span><span class="token function">set_base_dir</span><span class="token punctuation">(</span>root_path<span class="token punctuation">.</span><span class="token function">c_str</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> svr<span class="token punctuation">.</span><span class="token function">Get</span><span class="token punctuation">(</span><span class="token string">"/s"</span><span class="token punctuation">,</span> <span class="token punctuation">[</span><span class="token operator">&</span>search<span class="token punctuation">]</span><span class="token punctuation">(</span><span class="token keyword">const</span> httplib<span class="token double-colon punctuation">::</span>Request <span class="token operator">&</span>req<span class="token punctuation">,</span> httplib<span class="token double-colon punctuation">::</span>Response <span class="token operator">&</span>rsp<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token operator">!</span>req<span class="token punctuation">.</span><span class="token function">has_param</span><span class="token punctuation">(</span><span class="token string">"word"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> rsp<span class="token punctuation">.</span><span class="token function">set_content</span><span class="token punctuation">(</span><span class="token string">"必须要有关键字"</span><span class="token punctuation">,</span><span class="token string">"text/plain:charset=utf-8"</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">return</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> std<span class="token double-colon punctuation">::</span>string word <span class="token operator">=</span> req<span class="token punctuation">.</span><span class="token function">get_param_value</span><span class="token punctuation">(</span><span class="token string">"word"</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token function">LOG</span><span class="token punctuation">(</span>NORMAL<span class="token punctuation">,</span><span class="token string">"用户搜索:"</span> <span class="token operator">+</span> word<span class="token punctuation">)</span><span class="token punctuation">;</span> std<span class="token double-colon punctuation">::</span>string json_string<span class="token punctuation">;</span> search<span class="token punctuation">.</span><span class="token function">Search</span><span class="token punctuation">(</span>word<span class="token punctuation">,</span> <span class="token operator">&</span>json_string<span class="token punctuation">)</span><span class="token punctuation">;</span> rsp<span class="token punctuation">.</span><span class="token function">set_content</span><span class="token punctuation">(</span>json_string<span class="token punctuation">,</span> <span class="token string">"application/json"</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token function">LOG</span><span class="token punctuation">(</span>NORMAL<span class="token punctuation">,</span><span class="token string">"服务器启动成功..."</span><span class="token punctuation">)</span><span class="token punctuation">;</span> svr<span class="token punctuation">.</span><span class="token function">listen</span><span class="token punctuation">(</span><span class="token string">"0.0.0.0"</span><span class="token punctuation">,</span> <span class="token number">8080</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <div class="hljs-button signin" data-title="登录后复制" data-report-click="{"spm":"1001.2101.3001.4334"}"></div></code><div class="hide-preCode-box"><span class="hide-preCode-bt" data-report-view="{"spm":"1001.2101.3001.7365"}"><img class="look-more-preCode contentImg-no-view" src="https://1000bd.com/contentImg/2022/06/27/191644837.png" alt="" title=""></span></div><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li><li style="color: rgb(153, 153, 153);">2</li><li style="color: rgb(153, 153, 153);">3</li><li style="color: rgb(153, 153, 153);">4</li><li style="color: rgb(153, 153, 153);">5</li><li style="color: rgb(153, 153, 153);">6</li><li style="color: rgb(153, 153, 153);">7</li><li style="color: rgb(153, 153, 153);">8</li><li style="color: rgb(153, 153, 153);">9</li><li style="color: rgb(153, 153, 153);">10</li><li style="color: rgb(153, 153, 153);">11</li><li style="color: rgb(153, 153, 153);">12</li><li style="color: rgb(153, 153, 153);">13</li><li style="color: rgb(153, 153, 153);">14</li><li style="color: rgb(153, 153, 153);">15</li><li style="color: rgb(153, 153, 153);">16</li><li style="color: rgb(153, 153, 153);">17</li><li style="color: rgb(153, 153, 153);">18</li><li style="color: rgb(153, 153, 153);">19</li><li style="color: rgb(153, 153, 153);">20</li><li style="color: rgb(153, 153, 153);">21</li><li style="color: rgb(153, 153, 153);">22</li><li style="color: rgb(153, 153, 153);">23</li><li style="color: rgb(153, 153, 153);">24</li><li style="color: rgb(153, 153, 153);">25</li><li style="color: rgb(153, 153, 153);">26</li><li style="color: rgb(153, 153, 153);">27</li></ul></pre> <h1><a name="t28"></a><a id="9_1115"></a>9、添加简易日志</h1> <p>系统日志策略可以在故障刚刚发生时就向你发送警告信息,系统日志帮助你在最短的时间内发现问题。为此,我们实现一个简单的日志系统来记录搜索引擎的运行情况。</p> <h2><a name="t29"></a><a id="91__loghpp_1118"></a>9.1 编写log.hpp</h2> <p>采用C语言中内置的宏,包装出一个宏函数,用于日志调用,主要信息包含日志等级、日期、日志内容和具体运行位置</p> <pre data-index="26" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token comment">//日志等级</span> <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">define</span> <span class="token macro-name">NORMAL</span> <span class="token expression"><span class="token number">1</span></span></span> <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">define</span> <span class="token macro-name">WARNING</span> <span class="token expression"><span class="token number">2</span></span></span> <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">define</span> <span class="token macro-name">DEBUG</span> <span class="token expression"><span class="token number">3</span></span></span> <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">define</span> <span class="token macro-name">FATAL</span> <span class="token expression"><span class="token number">4</span></span></span> <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">define</span> <span class="token macro-name function">LOG</span><span class="token expression"><span class="token punctuation">(</span>LEVEL<span class="token punctuation">,</span> MESSAGE<span class="token punctuation">)</span> <span class="token function">log</span><span class="token punctuation">(</span>#LEVEL</span><span class="token comment">/*将宏参数转为字符串*/</span><span class="token expression"><span class="token punctuation">,</span> MESSAGE</span><span class="token comment">/*日志内容*/</span><span class="token expression"><span class="token punctuation">,</span> <span class="token constant">__FILE__</span></span><span class="token comment">/*文件名*/</span><span class="token expression"><span class="token punctuation">,</span> <span class="token constant">__LINE__</span></span><span class="token comment">/*行号*/</span><span class="token expression"><span class="token punctuation">)</span></span></span> <span class="token keyword">void</span> <span class="token function">log</span><span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span>string level<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>string message<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>string file<span class="token punctuation">,</span> <span class="token keyword">int</span> line<span class="token punctuation">)</span> <span class="token punctuation">{<!-- --></span> time_t nowtime<span class="token punctuation">;</span> <span class="token keyword">struct</span> <span class="token class-name">tm</span> <span class="token operator">*</span>t<span class="token punctuation">;</span> <span class="token function">time</span><span class="token punctuation">(</span><span class="token operator">&</span>nowtime<span class="token punctuation">)</span><span class="token punctuation">;</span> t <span class="token operator">=</span> <span class="token function">localtime</span><span class="token punctuation">(</span><span class="token operator">&</span>nowtime<span class="token punctuation">)</span><span class="token punctuation">;</span> std<span class="token double-colon punctuation">::</span>cout <span class="token operator"><<</span> <span class="token string">"["</span> <span class="token operator"><<</span> level <span class="token operator"><<</span> <span class="token string">"]"</span> <span class="token operator"><<</span> <span class="token string">"["</span> <span class="token operator"><<</span> t<span class="token operator">-></span>tm_year <span class="token operator">+</span> <span class="token number">1900</span> <span class="token operator"><<</span> <span class="token string">"-"</span> <span class="token operator"><<</span> t<span class="token operator">-></span>tm_mon <span class="token operator">+</span> <span class="token number">1</span> <span class="token operator"><<</span> <span class="token string">"-"</span> <span class="token operator"><<</span> t<span class="token operator">-></span>tm_mday <span class="token operator">+</span> <span class="token number">1</span> <span class="token operator"><<</span> <span class="token string">" "</span> <span class="token operator"><<</span> t<span class="token operator">-></span>tm_hour <span class="token operator"><<</span> <span class="token string">":"</span> <span class="token operator"><<</span> t<span class="token operator">-></span>tm_min <span class="token operator"><<</span> <span class="token string">":"</span> <span class="token operator"><<</span> t<span class="token operator">-></span>tm_sec <span class="token operator"><<</span> <span class="token string">"]"</span> <span class="token operator"><<</span> <span class="token string">"["</span> <span class="token operator"><<</span> message <span class="token operator"><<</span> <span class="token string">"]"</span> <span class="token operator"><<</span> <span class="token string">"["</span> <span class="token operator"><<</span> file <span class="token operator"><<</span> <span class="token string">" : "</span> <span class="token operator"><<</span> line <span class="token operator"><<</span> <span class="token string">"]"</span> <span class="token operator"><<</span> std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span> <span class="token punctuation">}</span> <div class="hljs-button signin" data-title="登录后复制" data-report-click="{"spm":"1001.2101.3001.4334"}"></div></code><div class="hide-preCode-box"><span class="hide-preCode-bt" data-report-view="{"spm":"1001.2101.3001.7365"}"><img class="look-more-preCode contentImg-no-view" src="https://1000bd.com/contentImg/2022/06/27/191644837.png" alt="" title=""></span></div><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li><li style="color: rgb(153, 153, 153);">2</li><li style="color: rgb(153, 153, 153);">3</li><li style="color: rgb(153, 153, 153);">4</li><li style="color: rgb(153, 153, 153);">5</li><li style="color: rgb(153, 153, 153);">6</li><li style="color: rgb(153, 153, 153);">7</li><li style="color: rgb(153, 153, 153);">8</li><li style="color: rgb(153, 153, 153);">9</li><li style="color: rgb(153, 153, 153);">10</li><li style="color: rgb(153, 153, 153);">11</li><li style="color: rgb(153, 153, 153);">12</li><li style="color: rgb(153, 153, 153);">13</li><li style="color: rgb(153, 153, 153);">14</li><li style="color: rgb(153, 153, 153);">15</li><li style="color: rgb(153, 153, 153);">16</li><li style="color: rgb(153, 153, 153);">17</li><li style="color: rgb(153, 153, 153);">18</li><li style="color: rgb(153, 153, 153);">19</li><li style="color: rgb(153, 153, 153);">20</li><li style="color: rgb(153, 153, 153);">21</li></ul></pre> <h2><a name="t30"></a><a id="92___linux__1145"></a>9.2 部署服务到 linux 上</h2> <blockquote> <p>nohup命令 ,英文全称no hang up (不挂起),用于在系统后台不挂断地运行命令,退出终端不会影响程序的运行。nohup命令,在默认情况下(非重定向时),会输出-一个名叫nohup.out的文件到当前目录下,如果当前目录的nohup out文件不可写,输出重定向到$HOME/nohup.out文件中。</p> </blockquote> <p>使用命令:<code>nohup ./http_server > log/log.txt 2>&1 &</code>将服务部署到linux上<br> <code>2>&1 解释</code>:将标准错误2重定向到标准输出1,标准输出1再重定向输入到 log/log.txt 文件中<br> <code>最后一个&</code>解释:表示后台运行</p> <p>部署成功后,如果需要关闭服务,则使用kill -9 命令,根据进程pid结束进程即可。</p> <h1><a name="t31"></a><a id="10_1155"></a>10、整体逻辑框架</h1> <ul><li>第一步:对html文件进行数据清洗。</li><li>第二步:构建正排索引和倒排索引。</li><li>第三步:对搜索的语句进行分词处理,形成多个关键词。</li><li>第四步:获取每个关键词所对应的倒排拉链。</li><li>第五步:对所有倒排拉链中的html文件进行去重。</li><li>第六步:根据权重对html文件进行排序。</li><li>第七步:根据文档ID+正排索引,获取文档内容,形成关键词的摘要,构建json串,返回给前端页面</li></ul> </div> </div> </li> <li class="list-group-item ul-li"> <b>相关阅读:</b><br> <nobr> <a href="/Article/Index/612360">Citus 分布式 PostgreSQL 集群 - SQL Reference(摄取、修改数据 DML)</a> <br /> <a href="/Article/Index/784014">百战RHCE(第四十八战:运维工程师必会技-Ansible学习3-构建Ansible清单)</a> <br /> <a href="/Article/Index/1449997">使用pixy计算群体遗传学统计量</a> <br /> <a href="/Article/Index/1452406">千帆SDK开源到GitHub,开发者可免费下载使用!</a> <br /> <a href="/Article/Index/1099987">异常检测:Towards Total Recall in Industrial Anomaly Detection</a> <br /> <a href="/Article/Index/1495447">Xcode中App图标和APP名称的修改</a> <br /> <a href="/Article/Index/923505">搜索引擎项目的详细介绍</a> <br /> <a href="/Article/Index/1736979">Hive基础知识(十八):Hive 函数的使用</a> <br /> <a href="/Article/Index/1175380">在gstreamer中做线程同步</a> <br /> <a href="/Article/Index/712539">Linux基础——定时任务</a> <br /> </nobr> </li> <li class="list-group-item from-a mb-2"> 原文地址:https://blog.csdn.net/qq_56044032/article/details/126336377 </li> </ul> </div> <div class="col-lg-4 col-sm-12"> <ul class="list-group" style="word-break:break-all;"> <li class="list-group-item ul-li-bg" aria-current="true"> 最新文章 </li> <li class="list-group-item ul-li"> <nobr> <a href="/Article/Index/1484446">攻防演习之三天拿下官网站群</a> <br /> <a href="/Article/Index/1515268">数据安全治理学习——前期安全规划和安全管理体系建设</a> <br /> <a href="/Article/Index/1759065">企业安全 | 企业内一次钓鱼演练准备过程</a> <br /> <a href="/Article/Index/1485036">内网渗透测试 | Kerberos协议及其部分攻击手法</a> <br /> <a href="/Article/Index/1877332">0day的产生 | 不懂代码的"代码审计"</a> <br /> <a href="/Article/Index/1887576">安装scrcpy-client模块av模块异常,环境问题解决方案</a> <br /> <a href="/Article/Index/1887578">leetcode hot100【LeetCode 279. 完全平方数】java实现</a> <br /> <a href="/Article/Index/1887512">OpenWrt下安装Mosquitto</a> <br /> <a href="/Article/Index/1887520">AnatoMask论文汇总</a> <br /> <a href="/Article/Index/1887496">【AI日记】24.11.01 LangChain、openai api和github copilot</a> <br /> </nobr> </li> </ul> <ul class="list-group pt-2" style="word-break:break-all;"> <li class="list-group-item ul-li-bg" aria-current="true"> 热门文章 </li> <li class="list-group-item ul-li"> <nobr> <a href="/Article/Index/888177">十款代码表白小特效 一个比一个浪漫 赶紧收藏起来吧!!!</a> <br /> <a href="/Article/Index/797680">奉劝各位学弟学妹们,该打造你的技术影响力了!</a> <br /> <a href="/Article/Index/888183">五年了,我在 CSDN 的两个一百万。</a> <br /> <a href="/Article/Index/888179">Java俄罗斯方块,老程序员花了一个周末,连接中学年代!</a> <br /> <a href="/Article/Index/797730">面试官都震惊,你这网络基础可以啊!</a> <br /> <a href="/Article/Index/797725">你真的会用百度吗?我不信 — 那些不为人知的搜索引擎语法</a> <br /> <a href="/Article/Index/797702">心情不好的时候,用 Python 画棵樱花树送给自己吧</a> <br /> <a href="/Article/Index/797709">通宵一晚做出来的一款类似CS的第一人称射击游戏Demo!原来做游戏也不是很难,连憨憨学妹都学会了!</a> <br /> <a href="/Article/Index/797716">13 万字 C 语言从入门到精通保姆级教程2021 年版</a> <br /> <a href="/Article/Index/888192">10行代码集2000张美女图,Python爬虫120例,再上征途</a> <br /> </nobr> </li> </ul> </div> </div> </div> <!-- 主体 --> <!--body结束--> <!--这里是footer模板--> <!--footer--> <nav class="navbar navbar-inverse navbar-fixed-bottom"> <div class="container"> <div class="row"> <div class="col-md-12"> <div class="text-muted center foot-height"> Copyright © 2022 侵权请联系<a href="mailto:2656653265@qq.com">2656653265@qq.com</a>    <a href="https://beian.miit.gov.cn/" target="_blank">京ICP备2022015340号-1</a> </div> <div style="width:300px;margin:0 auto; padding:0px 5px;"> <a href="/regex.html">正则表达式工具</a> <a href="/cron.html">cron表达式工具</a> <a href="/pwdcreator.html">密码生成工具</a> </div> <div style="width:300px;margin:0 auto; padding:5px 0;"> <a target="_blank" href="http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11010502049817" style="display:inline-block;text-decoration:none;height:20px;line-height:20px;"> <img src="" style="float:left;" /><p style="float:left;height:20px;line-height:20px;margin: 0px 0px 0px 5px; color:#939393;">京公网安备 11010502049817号</p></a> </div> </div> </div> </div> </nav> <!--footer--> <!--footer模板结束--> <script src="/js/plugins/jquery/jquery.js"></script> <script src="/js/bootstrap.min.js"></script> <!--这里是scripts模板--> <!--scripts模板结束--> </body> </html>