• [Project] A Site Search Engine Based on Boost





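    1. Introduction

    Common search engines include Baidu, Google, and Bing, as well as the many apps that ship with a built-in search feature.

    Building a conventional, web-wide search engine single-handedly is clearly unrealistic, but a simple engine that searches within one site is well within reach.

    For example, cplusplus, a site commonly used when learning C++, offers site search. Site search is more vertical (a narrower scope with more relevant results) and works on a much smaller data set.

    Boost has no site search of its own, so we can build one ourselves.

    For each hit, the finished search engine will display the page title, an excerpt of the page content, and the url.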

    The Big Picture of the Search Engine

    [Figure: overall architecture of the search engine]

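    Tech Stack and Project Environment

    • Tech stack:

      • Back end: C/C++, C++11, STL, Boost, Jsoncpp, the cppjieba word segmentation library, the cpp-httplib open-source library
      • Front end: HTML5, CSS, JavaScript, jQuery, Ajax
    • Project environment: CentOS 7 cloud server, vim/gcc(g++)/Makefile, VS2019/VS Code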

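    Forward Index and Inverted Index

    Forward index: looking up an entity by its key.

    • For example, finding a document's content from the document name:

      Document name                        Document content
      XXX Company 2021 financial report    XXX's total revenue in 2021…
      XXX Company 2021 product sales       Sales volume of product A in 2021…
    • For example, a user table:

      t_user(uid,name,passwd,age,gender)

      Looking up the whole row by uid is a forward index lookup.

    • For example, a web page store:

      t_web_page(url, page_content)

      Looking up the whole page by url is also a forward index lookup.

    Word segmentation: once an entity's content has been segmented, it corresponds to a list of words. A simple forward index can therefore be thought of as Map<url, list<item>>. (Each key is unique.)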

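    • As an example, suppose there are three web pages:

      url1 -> "我爱北京" (I love Beijing)

      url2 -> "我爱宏伟的天安门" (I love the magnificent Tiananmen)

      url3 -> "长城真宏伟啊" (The Great Wall is truly magnificent)

      This is a forward index, Map<url, page_content>.

      After word segmentation:

      url1 -> {我,爱,北京}

      url2 -> {我,爱,宏伟,天安门}

      url3 -> {长城,宏伟}

      This is a segmented forward index, Map<url, list<item>>.

    Stop words: words such as 了, 的, 吗, 啊, a, and the can usually be ignored during segmentation.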
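
    As a rough illustration of stop-word filtering, the sketch below drops stop words from a word list that has already been segmented. RemoveStopWords and its tiny stop-word list are hypothetical and use only the words named above; the segmentation step itself (cppjieba in this project) is not shown.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: drop stop words from an already-segmented word list.
// The stop-word list is just the handful of words named in the text.
std::vector<std::string> RemoveStopWords(const std::vector<std::string>& words)
{
    static const std::unordered_set<std::string> stop_words = {
        "了", "的", "吗", "啊", "a", "the"
    };

    std::vector<std::string> kept;
    for (const std::string& w : words)
    {
        // keep the word only if it is not in the stop-word set
        if (stop_words.find(w) == stop_words.end())
        {
            kept.push_back(w);
        }
    }
    return kept;
}
```

    Applied to the words of url2 before filtering, {我,爱,宏伟,的,天安门}, this leaves {我,爱,宏伟,天安门}, which matches the segmented entry for url2.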

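    Inverted index: looking up keys from an entity's content, that is, going from a word to the pages that contain it.

    • For example, a web page store:

      Given a query word, quickly find the pages that contain it.

      Inverted index after segmentation:

      我 -> {url1,url2}

      爱 -> {url1,url2}

      北京 -> {url1}

      宏伟 -> {url2,url3}

      天安门 -> {url2}

      长城 -> {url3}

    The mapping from a search term item to the pages that contain it, Map<item, list<url>>, is the inverted index.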
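
    The relationship between the two indexes can be sketched in a few lines of C++. This is only an illustration under simple assumptions: the pages are already segmented, both indexes are plain std::map objects, and the names ForwardIndex, InvertedIndex, BuildInverted, and ExampleForward are made up for this example rather than taken from the project.

```cpp
#include <map>
#include <string>
#include <vector>

// Forward index after segmentation: url -> words on that page.
typedef std::map<std::string, std::vector<std::string> > ForwardIndex;
// Inverted index: word -> urls of the pages that contain it.
typedef std::map<std::string, std::vector<std::string> > InvertedIndex;

// Hypothetical helper: invert a segmented forward index.
InvertedIndex BuildInverted(const ForwardIndex& forward)
{
    InvertedIndex inverted;
    for (ForwardIndex::const_iterator it = forward.begin(); it != forward.end(); ++it)
    {
        const std::string& url = it->first;
        for (const std::string& word : it->second)
        {
            std::vector<std::string>& urls = inverted[word];
            // Pages are handled one at a time, so a repeated word on the
            // same page shows up as the most recently added url.
            if (urls.empty() || urls.back() != url)
            {
                urls.push_back(url);
            }
        }
    }
    return inverted;
}

// The three example pages from the text, already segmented.
ForwardIndex ExampleForward()
{
    ForwardIndex forward;
    forward["url1"] = {"我", "爱", "北京"};
    forward["url2"] = {"我", "爱", "宏伟", "天安门"};
    forward["url3"] = {"长城", "宏伟"};
    return forward;
}
```

    Calling BuildInverted(ExampleForward()) maps 宏伟 to url2 and url3 and 北京 to url1 only.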

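    Simulating a Lookup

    The user enters the keyword 宏伟 -> inverted index -> pages {url2,url3} are extracted -> forward index -> the content of each page is fetched -> a title + content + url result is built for each page -> results are presented to the user in priority order according to their weight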
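
    The lookup flow above could look roughly like the following sketch. Everything here is an assumption made for illustration: the Page, Posting, and SearchResult structs, the Search function, and the integer weight (larger meaning more relevant) are hypothetical, and how the weight is computed is left open.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Forward index entry: the stored page for a url.
struct Page
{
    std::string title;
    std::string content;
};

// Inverted index entry: a url plus the weight of the word on that page.
struct Posting
{
    std::string url;
    int weight; // assumption: larger means more relevant
};

// One search hit as it would be shown to the user.
struct SearchResult
{
    std::string title;
    std::string content;
    std::string url;
    int weight;
};

typedef std::map<std::string, Page> ForwardIndex;
typedef std::map<std::string, std::vector<Posting> > InvertedIndex;

// Hypothetical lookup: word -> postings -> pages -> results sorted by weight.
std::vector<SearchResult> Search(const std::string& word,
                                 const ForwardIndex& forward,
                                 const InvertedIndex& inverted)
{
    std::vector<SearchResult> results;
    InvertedIndex::const_iterator hit = inverted.find(word);
    if (hit == inverted.end())
    {
        return results; // no page contains the word
    }
    for (const Posting& p : hit->second)
    {
        ForwardIndex::const_iterator page = forward.find(p.url);
        if (page == forward.end())
        {
            continue; // the inverted index names a page we do not have
        }
        SearchResult r;
        r.title = page->second.title;
        r.content = page->second.content;
        r.url = p.url;
        r.weight = p.weight;
        results.push_back(r);
    }
    // highest weight first; equal weights keep their original order
    std::stable_sort(results.begin(), results.end(),
                     [](const SearchResult& a, const SearchResult& b) {
                         return a.weight > b.weight;
                     });
    return results;
}
```

    Keeping the weight inside each posting means the sort needs no second lookup; whether the real project stores it there is not stated in the text.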

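    2. Tag Removal and Data Cleaning Module: Parser

    The data source is downloaded directly from the Boost website.

    On the cloud server, create the project folder and use the rz command to upload the package downloaded earlier.

    Extract it with tar.

    For now only the HTML files under boost_1_79_0/doc/html are needed; the index will be built from them.

    So create a data/input directory and copy everything under Boost's doc/html/ into it.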

    [sjl@VM-16-6-centos boost_searcher]$ cp -rf boost_1_79_0/doc/html/* data/input/
    

    Tag Removal: parser.cc

    Create the tag-removal program:

    [sjl@VM-16-6-centos boost_searcher]$ touch parser.cc
    // raw data --> data with tags removed
    

    In an HTML file, anything enclosed in <> is a tag. Tags have no value for searching, so they need to be removed.

    <td align="center"><a href="../../libs/libraries.htm">Libraries</a></td>
    

    The HTML data with tags removed will be stored in the raw_html directory:

    [sjl@VM-16-6-centos data]$ mkdir raw_html
    [sjl@VM-16-6-centos data]$ ll
    total 16
    drwxrwxr-x 58 sjl sjl 16384 Jul 19 16:37 input      // original HTML documents
    drwxrwxr-x  2 sjl sjl  4096 Jul 19 20:37 raw_html   // HTML documents after tag removal
    

    To see how many HTML files the data directory currently contains:

    [sjl@VM-16-6-centos data]$ ls -Rl|grep -E *.html|wc -l
    8172
    

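    grep: text search command; -E enables extended regular expressions

    wc: reports counts for its input; -l counts lines

    Because this pipeline counts matching lines of ls output rather than files, the number is only approximate.

    Goal

    Strip the tags from every HTML file and write all results into a single file. To keep the file easy to read back, each document occupies one line: the fields of a document are separated by \3 and documents are separated by \n.

    For example:

    title\3content\3url \n title\3content\3url \n title\3content\3url \n

    The getline function reads one line at a time, so a single call retrieves a complete document: title\3content\3url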
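
    To make the benefit of this layout concrete, here is a minimal sketch of reading the file back. ParseLine and ReadDocs are hypothetical names that do not appear in the project, and the DocInfo struct is redeclared only so the example stands on its own.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct DocInfo
{
    std::string title;
    std::string content;
    std::string url;
};

// Hypothetical reader for one record: title\3content\3url
bool ParseLine(const std::string& line, DocInfo* doc)
{
    const char sep = '\3';
    std::size_t first = line.find(sep);
    if (first == std::string::npos) return false;
    std::size_t second = line.find(sep, first + 1);
    if (second == std::string::npos) return false;

    doc->title = line.substr(0, first);
    doc->content = line.substr(first + 1, second - first - 1);
    doc->url = line.substr(second + 1);
    return true;
}

// Read every well-formed record from a stream; std::getline consumes
// exactly one document per call because documents are separated by '\n'.
std::vector<DocInfo> ReadDocs(std::istream& in)
{
    std::vector<DocInfo> docs;
    std::string line;
    while (std::getline(in, line))
    {
        DocInfo doc;
        if (ParseLine(line, &doc))
        {
            docs.push_back(doc);
        }
    }
    return docs;
}
```

    ReadDocs takes any std::istream, so it works on an std::ifstream opened on data/raw_html/raw.txt as well as on an std::istringstream in a quick test. The scheme only works if no field contains \3 or \n.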

    Code Structure of parser.cc

    #include <iostream>
    #include <string>
    #include <vector>
    
    const std::string src_path="data/input";
    const std::string output="data/raw_html/raw.txt";
    
    typedef struct DocInfo
    {
        std::string title;    // document title
        std::string content;  // document content
        std::string url;      // url of this document on the official site
    }DocInfo_t;
    
    // const & : input
    // *       : output
    // &       : input and output
    
    bool EnumFile(const std::string& src_path,std::vector<std::string>* files_list);
    
    bool ParseHtml(const std::vector<std::string>& files_list,std::vector<DocInfo_t>* results);
    
    bool SaveHtml(const std::vector<DocInfo_t>& results,const std::string& output);
    
    int main()
    {
        // list of file names
        std::vector<std::string> files_list;
    
        // Step 1: recursively collect the name (with path) of every HTML file into files_list, so the files can be read later
        if(!EnumFile(src_path,&files_list))
        {
            std::cerr<<"enum file name error"<<std::endl;
            return 1;
        }
    
        // Step 2: read each file named in files_list and parse it into title + content + url
        std::vector<DocInfo_t> results; // tag-stripped result for every file in files_list
        if(!ParseHtml(files_list,&results))
        {
            std::cerr<<"parse html error"<<std::endl;
            return 2;
        }
    
        // Step 3: write the parsed documents to output, with \3 between the fields of a document and \n between documents
        if(!SaveHtml(results,output))
        {
            std::cerr<<"Save html error"<<std::endl;
            return 3;
        }
        return 0;
    }
    

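    EnumFile(): Enumerating and Filtering HTML Files

    The C++11 standard library has no filesystem API (std::filesystem only arrived in C++17), so the filesystem module of Boost is used here.

    • Installing the Boost development package
    [sjl@VM-16-6-centos boost_searcher]$ sudo yum install -y boost-devel 

    Also include the header in parser.cc:

    #include <boost/filesystem.hpp>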
    
    • The code is as follows
    bool EnumFile(const std::string& src_path,std::vector<std::string>* files_list)
    {
        namespace fs=boost::filesystem;
        fs::path root_path(src_path);
    
        // if the path does not exist there is nothing more to do
        if(!fs::exists(root_path))
        {
            std::cerr<<src_path<<" not exists"<<std::endl;
            return false;
        }
    
        // a default-constructed iterator marks the end of the recursive traversal
        fs::recursive_directory_iterator end;
        for(fs::recursive_directory_iterator iter(root_path);iter!=end;iter++)
        {
            // keep only regular files (skip directories); HTML files are regular files
            if(!fs::is_regular_file(*iter))
            {
                continue;
            }
            // skip files whose extension is not ".html"
            if(iter->path().extension()!=".html")
            {
                continue;
            }
    
            // debug output
            std::cout<<"debug: "<<iter->path().string()<<std::endl; 
      
            // at this point the path is a regular file ending in ".html"
            files_list->push_back(iter->path().string());// store the path of the HTML file as a string in files_list
    
        }
    
        return true;
    }
    

    The Makefile is as follows (note that it links against boost_system and boost_filesystem):

    cc=g++
    
    parser:parser.cc
    	$(cc) -o $@ $^ -std=c++11 -lboost_system -lboost_filesystem
    
    
    .PHONY:clean
    clean:
    	rm -rf parser
    

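    After make, check which shared libraries parser links against (for example with ldd parser).

    Run the parser executable (with the other two functions temporarily stubbed to return true) and look at the output.

    The HTML files have now been filtered out: 8171 files in total.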

    ParseHtml(): Parsing the HTML Files

    After the filtering above, files_list holds only the paths of HTML files.

    The overall structure of ParseHtml() is as follows.

    Function Skeleton

    bool ParseHtml(const std::vector<std::string>& files_list,std::vector<DocInfo_t>* results)
    {
        for(const std::string &file: files_list)
        {
            // 1. read the file with ReadFile
            std::string result;
            if(!ns_tool::FileTool::ReadFile(file,&result))
            {
                continue;
            }
      
            DocInfo_t doc;
            // 2. parse the file and extract the title
            if(!ParseTitle(result,&doc.title))
            {
                continue;
            }
      
            // 3. parse the file and extract the content, i.e. strip the tags
            if(!ParseContent(result,&doc.content))
            {
                continue;
            }
            // 4. turn the local file path into the official-site url
            if(!ParseUrl(file,&doc.url))
            {
                continue;
            }
            // done: parsing succeeded and everything for this document is now stored in doc
            // append it to results; std::move moves doc's strings instead of copying them
            results->push_back(std::move(doc));
        }
        return true;
    }
    

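    Explanation

    The function does four things: it reads each file's content by path, extracts the title, extracts the content, and builds the url.

    1. Reading the file

    It iterates over the file names stored in files_list and reads each file's content into result. This is done by ReadFile().

    ReadFile() is defined in the FileTool class in the header tool.hpp, so parser.cc must include that header.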

    //tool.hpp
    #pragma once
    #include <iostream>
    #include <string>
    #include <fstream>
    namespace ns_tool
    {
        class FileTool
        {
         public:
      
             // read the whole content of the file at file_path into out
             static bool ReadFile(const std::string& file_path,std::string *out)
             {
                std::ifstream in(file_path,std::ifstream::in);
      
                // check that the file was opened
                if(!in.is_open())
                {
                    std::cerr<<"open file "<<file_path<<" error"<<std::endl;
                    return false;
                }
      
                // read line by line; std::getline drops the '\n', so lines are joined directly
                std::string line;
                while(std::getline(in,line))
                {
                    *out+=line; 
                }// std::getline returns the stream, which converts to bool; it becomes false once a read fails at end of file
       
    
                in.close();
                return true;
             }
    
        };
    }
    
    2. Extracting the title: ParseTitle()

    Open any of the HTML files and you can see that the title we want is the text enclosed by the title tag.

    This relies on the function bool ParseTitle(const std::string& result, std::string* title), which is defined in parser.cc.

    // parse the title
    static bool ParseTitle(const std::string& result,std::string* title)
    {
        std::size_t begin=result.find("<title>");
    
        if(begin==std::string::npos)
        {
            return false;
        }
        std::size_t end=result.find("</title>");
    
        if(end==std::string::npos)
        {
            return false;
        }
    
        // skip past the opening tag so that begin points at the first character of the title
        begin+=std::string("<title>").size();
    
        if(begin>end)
        {
            return false;
        }
    
        *title = result.substr(begin,end-begin);
        return true;
    }
    <ol start="3"><li><strong>Extracting the content, which really means removing the tags</strong>: <code>ParseContent()</code></li></ol> 
    <p>That is, remove every angle bracket together with everything it encloses.</p> 
    <p><img src="https://1000bd.com/contentImg/2022/08/14/024950532.png" alt=""></p> 
    <p>While scanning, a <code>&gt;</code> means the current tag has ended, and a <code>&lt;</code> means a new tag is starting.</p> 
    <p>This relies on the function <code>bool ParseContent(const std::string&amp; result, std::string* content)</code>, which is defined in parser.cc.</p> 
    <pre data-index="13" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;">// strip the tags
    static bool ParseContent(const std::string&amp; result,std::string* content)
    {
        // a simple two-state state machine
        enum status 
        {
            LABEL,   // inside a tag
            CONTENT  // inside text content
        };
        // an HTML file normally starts with a tag, so start in the LABEL state
        enum status s=LABEL;
        for(char c:result)
        {
            switch(s)
            {
                case LABEL:
                    if(c=='&gt;')s=CONTENT;
                    break;
                case CONTENT:
                    if(c=='&lt;') s=LABEL;
                    else 
                    {
                        // do not keep '\n': it is the separator between documents in the output file
                        if(c=='\n') c=' ';
                        content-&gt;push_back(c);
                    }
                    break;
                default:
                    break;
            }
        }
        return true;
    }
    </code></pre> 
    <ol start="4"><li><strong>Building the official-site url</strong></li></ol> 
    <p>The url of a Boost page on the website corresponds directly to the path of the same file in the documentation we downloaded.</p> 
    <p>For example:</p> 
    <p>Looking up <code>Accumulators</code> on the official site gives this <strong>official url</strong>:</p> 
    <p>https://www.boost.org/doc/libs/1_79_0/doc/html/accumulators.html</p> 
    <p>In the downloaded documentation, the same page is found at this path:</p> 
    <p><img src="https://1000bd.com/contentImg/2022/08/14/024950733.png" alt=""></p> 
    <p>All of the project's source data was copied into the <code>data/input</code> directory, so within the project the path of this page is:</p> 
    <pre data-index="14" class="prettyprint"><code class="prism language-bash has-numbering" onclick="mdcp.signin(event)" style="position: unset;">data/input/accumulators.html
    </code></pre> 
    <p>The url can therefore be assembled from two parts:</p> 
    <p>url_head = https://www.boost.org/doc/libs/1_79_0/doc/html</p> 
    <p>url_tail = <s>data/input</s>/accumulators.html</p> 
    <pre data-index="15" class="prettyprint"><code class="prism language-bash has-numbering" onclick="mdcp.signin(event)" style="position: unset;">url=url_head + url_tail   # together they form the official-site link
    </code></pre> 
    <p>This relies on the function <code>bool ParseUrl(const std::string&amp; file_path, std::string* url)</code>, which is defined in parser.cc.</p> 
    <pre data-index="16" class="prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;">// build the official-site url: url_head + url_tail
    static bool ParseUrl(const std::string&amp; file_path,std::string* url)
    {
        std::string url_head="https://www.boost.org/doc/libs/1_79_0/doc/html";
        // drop the local prefix (src_path, i.e. "data/input") and keep the rest of the path
        std::string url_tail=file_path.substr(src_path.size());
        *url=url_head+url_tail;
        return true; 
    }
    </code></pre> 
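    <p>As a worked instance of the path-to-url mapping, the sketch below takes the local root as a parameter and rejects paths that are not under it. BuildUrl is a hypothetical variant written for illustration, not the function used in parser.cc.</p> 

```cpp
#include <string>

// Hypothetical variant of the url construction: the local root is passed in,
// and paths that do not start with it are rejected instead of being cut blindly.
bool BuildUrl(const std::string& local_root, const std::string& file_path, std::string* url)
{
    const std::string url_head = "https://www.boost.org/doc/libs/1_79_0/doc/html";
    if (file_path.compare(0, local_root.size(), local_root) != 0)
    {
        return false; // the file is not under the local root
    }
    // what remains after the local root is the tail of the url
    *url = url_head + file_path.substr(local_root.size());
    return true;
}
```

    <p>With <code>data/input</code> as the root, <code>data/input/accumulators.html</code> maps to https://www.boost.org/doc/libs/1_79_0/doc/html/accumulators.html, the address given in the example.</p> 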
    <h3><a name="t10"></a><a id="SaveHtml____548"></a>SaveHtml(): Saving the Tag-Stripped Documents</h3> 
    <pre data-index="17" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;">bool SaveHtml(const std::vector&lt;DocInfo_t&gt;&amp; results,const std::string&amp; output)
    {
    #define SEP '\3'
        // std::ofstream needs &lt;fstream&gt;, which tool.hpp already includes
        // open in binary mode so that the bytes are written exactly as given
        std::ofstream out(output,std::ios::out|std::ios::binary);
        if(!out.is_open())
        {
            std::cerr&lt;&lt;"open "&lt;&lt;output&lt;&lt;" error"&lt;&lt;std::endl;
            return false;
        }
    
        // write the documents to disk, one per line: title\3content\3url\n
        for(auto&amp; item:results)
        {
            std::string out_string;
            out_string = item.title;
            out_string += SEP;
            out_string += item.content;
            out_string += SEP;
            out_string += item.url;
            out_string += '\n';
            out.write(out_string.c_str(),out_string.size());
        }
    
        out.close();
        return true;
    }
    </code></pre> 
    <h3><a name="t11"></a><a id="_580"></a>Testing</h3> 
    <p>Build parser.cc with make to get the parser executable, then run it. If everything succeeds, <code>raw.txt</code> in the <code>data/raw_html</code> directory will contain all of the processed HTML documents.</p> 
    <pre data-index="18" class="prettyprint"><code class="prism language-bash has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token punctuation">[</span>sjl@VM-16-6-centos boost_searcher<span class="token punctuation">]</span>$ <span class="token function">make</span>
    g++ -o parser parser.cc -std<span class="token operator">=</span>c++11 -lboost_system -lboost_filesystem
    <span class="token punctuation">[</span>sjl@VM-16-6-centos boost_searcher<span class="token punctuation">]</span>$ ll
    total <span class="token number">136</span>
    drwxr-xr-x <span class="token number">8</span> sjl sjl   <span class="token number">4096</span> Apr  <span class="token number">7</span> 05:33 boost_1_79_0
    drwxrwxr-x <span class="token number">4</span> sjl sjl   <span class="token number">4096</span> Jul <span class="token number">19</span> <span class="token number">20</span>:37 data
    -rw-rw-r-- <span class="token number">1</span> sjl sjl    <span class="token number">124</span> Jul <span class="token number">20</span> <span class="token number">20</span>:03 Makefile
    -rwxrwxr-x <span class="token number">1</span> sjl sjl <span class="token number">112408</span> Jul <span class="token number">22</span> <span class="token number">12</span>:36 parser
    -rw-rw-r-- <span class="token number">1</span> sjl sjl   <span class="token number">6088</span> Jul <span class="token number">22</span> <span class="token number">12</span>:31 parser.cc
    -rw-rw-r-- <span class="token number">1</span> sjl sjl    <span class="token number">889</span> Jul <span class="token number">21</span> <span class="token number">21</span>:27 tool.hpp
    <span class="token punctuation">[</span>sjl@VM-16-6-centos boost_searcher<span class="token punctuation">]</span>$ <span class="token function">cat</span> data/raw_html/raw.txt <span class="token operator">|</span> <span class="token function">wc</span> -l
    <span class="token number">8171</span>
    </code></pre> 
    <p>Each HTML document occupies one line, so the line count matches the number of HTML files before processing.</p> 
    <p><img src="https://1000bd.com/contentImg/2022/08/14/024950895.png" alt="在这里插入图片描述"></p> 
    <p><code>'\3'</code> is ASCII code 3 (the ETX control character), which is displayed as <code>^C</code>.</p> 
    <h1><a name="t12"></a><a id="3___Index_606"></a>3. Index Module: Index</h1> 
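    <p>Since each line of <code>raw.txt</code> holds one document in the form title, content, url separated by <code>'\3'</code>, a record can be cut back into its fields with a simple split. The following is only a minimal sketch using the standard library; <code>SplitRecord</code> is a hypothetical helper for illustration, not the <code>CutString</code> function from the article's <code>tool.hpp</code>.</p>

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not the article's tool.hpp): split `line` on a
// single-character separator, keeping empty fields.
std::vector<std::string> SplitRecord(const std::string& line, char sep = '\3')
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true)
    {
        std::string::size_type pos = line.find(sep, start);
        if (pos == std::string::npos)
        {
            fields.push_back(line.substr(start)); // last field
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}
```

    <p>A well-formed record should yield exactly three fields; anything else can be treated as a damaged line and skipped.</p>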
    <pre data-index="19" class="prettyprint"><code class="prism language-bash has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token punctuation">[</span>sjl@VM-16-6-centos boost_searcher<span class="token punctuation">]</span>$ <span class="token function">touch</span> index.hpp
    </code></pre> 
    <p>This header is responsible for three things: 1. building the index, 2. forward index lookup, 3. inverted index lookup.</p> 
    <p>Block diagram of the design:</p> 
    <p><img src="https://1000bd.com/contentImg/2022/08/14/024951051.png" alt="在这里插入图片描述"></p> 
    <pre data-index="20" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">pragma</span> <span class="token expression">once </span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><iostream></span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><string></span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><vector></span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><unordered_map></span></span>
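    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><fstream></span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><mutex></span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><cstdint></span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><cstdio></span></span>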
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"tool.hpp"</span></span>
    
    <span class="token keyword">namespace</span> ns_index
    <span class="token punctuation">{<!-- --></span>
        <span class="token keyword">struct</span> <span class="token class-name">DocInfo</span>
        <span class="token punctuation">{<!-- --></span>
            std<span class="token double-colon punctuation">::</span>string title<span class="token punctuation">;</span>    <span class="token comment">// document title</span>
            std<span class="token double-colon punctuation">::</span>string content<span class="token punctuation">;</span>  <span class="token comment">// document content with tags removed</span>
            std<span class="token double-colon punctuation">::</span>string url<span class="token punctuation">;</span>      <span class="token comment">// URL of the document on the official site</span>
            <span class="token keyword">uint64_t</span> doc_id<span class="token punctuation">;</span>      <span class="token comment">// document ID</span>
        <span class="token punctuation">}</span><span class="token punctuation">;</span>
    
    
        <span class="token comment">// element of the inverted index</span>
        <span class="token keyword">struct</span> <span class="token class-name">InvertedElem</span>
        <span class="token punctuation">{<!-- --></span>
            <span class="token keyword">uint64_t</span> doc_id<span class="token punctuation">;</span>   <span class="token comment">// document ID</span>
            std<span class="token double-colon punctuation">::</span>string word<span class="token punctuation">;</span> <span class="token comment">// keyword associated with the document</span>
            <span class="token keyword">int</span> weight<span class="token punctuation">;</span>        <span class="token comment">// weight of the document for this keyword</span>
        <span class="token punctuation">}</span><span class="token punctuation">;</span>
      
        <span class="token comment">// inverted list (posting list)</span>
        <span class="token keyword">typedef</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>InvertedElem<span class="token operator">></span> InvertedList<span class="token punctuation">;</span>
      
    
        <span class="token keyword">class</span> <span class="token class-name">Index</span>
        <span class="token punctuation">{<!-- --></span>
            <span class="token keyword">private</span><span class="token operator">:</span>
                <span class="token comment">// the forward index is stored as an array whose subscript is the document ID</span>
                std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>DocInfo<span class="token operator">></span> forward_index<span class="token punctuation">;</span> <span class="token comment">// forward index: document ID -> document content</span>
    
    
                <span class="token comment">// inverted index: maps each keyword to its inverted list (a group of InvertedElem)</span>
                std<span class="token double-colon punctuation">::</span>unordered_map<span class="token operator"><</span> std<span class="token double-colon punctuation">::</span>string <span class="token punctuation">,</span> InvertedList <span class="token operator">></span> inverted_index<span class="token punctuation">;</span>
    
            <span class="token keyword">private</span><span class="token operator">:</span>
                <span class="token comment">// Index is a singleton</span>
                <span class="token function">Index</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">{<!-- --></span><span class="token punctuation">}</span>
                <span class="token function">Index</span><span class="token punctuation">(</span><span class="token keyword">const</span> Index<span class="token operator">&</span> <span class="token punctuation">)</span><span class="token operator">=</span><span class="token keyword">delete</span><span class="token punctuation">;</span>
                Index<span class="token operator">&</span> <span class="token keyword">operator</span><span class="token operator">=</span><span class="token punctuation">(</span><span class="token keyword">const</span> Index<span class="token operator">&</span> <span class="token punctuation">)</span><span class="token operator">=</span><span class="token keyword">delete</span><span class="token punctuation">;</span>
                <span class="token keyword">static</span> Index<span class="token operator">*</span> instance<span class="token punctuation">;</span>
                <span class="token keyword">static</span> std<span class="token double-colon punctuation">::</span>mutex mtx<span class="token punctuation">;</span>
            <span class="token keyword">public</span><span class="token operator">:</span>
                <span class="token comment">// get the singleton, creating it on first use</span>
                <span class="token keyword">static</span> Index<span class="token operator">*</span> <span class="token function">Getinstance</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
                <span class="token punctuation">{<!-- --></span>
                    <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token keyword">nullptr</span><span class="token operator">==</span>instance<span class="token punctuation">)</span>
                    <span class="token punctuation">{<!-- --></span>
                        <span class="token comment">// instance is a shared (critical) resource, so protect it with the mutex</span>
                        mtx<span class="token punctuation">.</span><span class="token function">lock</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
                        <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token keyword">nullptr</span><span class="token operator">==</span>instance<span class="token punctuation">)</span>
                        <span class="token punctuation">{<!-- --></span>
                            instance<span class="token operator">=</span><span class="token keyword">new</span> <span class="token function">Index</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
                        <span class="token punctuation">}</span>
                        mtx<span class="token punctuation">.</span><span class="token function">unlock</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
                    <span class="token punctuation">}</span>
                    <span class="token keyword">return</span> instance<span class="token punctuation">;</span>
                <span class="token punctuation">}</span>
    
                <span class="token operator">~</span><span class="token function">Index</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
                <span class="token punctuation">{<!-- --></span><span class="token punctuation">}</span>
            <span class="token keyword">public</span><span class="token operator">:</span>
                <span class="token comment">// forward index lookup: get the document for a given doc_id</span>
                DocInfo<span class="token operator">*</span> <span class="token function">GetForwardIndex</span><span class="token punctuation">(</span><span class="token keyword">uint64_t</span> doc_id<span class="token punctuation">)</span> 
                <span class="token punctuation">{<!-- --></span>
                    <span class="token keyword">return</span> <span class="token keyword">nullptr</span><span class="token punctuation">;</span>
                <span class="token punctuation">}</span>
    
                <span class="token comment">// inverted index lookup: get the inverted list for the keyword word</span>
                InvertedList<span class="token operator">*</span> <span class="token function">GetInvertedList</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> word<span class="token punctuation">)</span>
                <span class="token punctuation">{<!-- --></span>
                    <span class="token keyword">return</span> <span class="token keyword">nullptr</span><span class="token punctuation">;</span>
                <span class="token punctuation">}</span>
    
                <span class="token comment">// build the index</span>
                <span class="token comment">// uses the documents produced by Parser to build the forward and inverted indexes</span>
                <span class="token comment">// the Parser output is stored at: data/raw_html/raw.txt</span>
                <span class="token keyword">bool</span> <span class="token function">BuildIndex</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> parsed_path<span class="token punctuation">)</span>
                <span class="token punctuation">{<!-- --></span>
                    <span class="token keyword">return</span> <span class="token boolean">true</span><span class="token punctuation">;</span>
                <span class="token punctuation">}</span>
    
        <span class="token punctuation">}</span><span class="token punctuation">;</span>
    
    <span class="token punctuation">}</span>
    </code></pre> 
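    <p>One detail the skeleton leaves open: <code>instance</code> and <code>mtx</code> are only declared inside the class, so they still need a definition at namespace scope somewhere, and <code>std::mutex</code> needs <code>&lt;mutex&gt;</code>. The sketch below shows just the singleton part on a hypothetical stand-in class, <code>MiniIndex</code>. As a simplification it takes the lock on every call with <code>std::lock_guard</code> instead of using the double check from the skeleton, which trades a little speed for simpler code.</p>

```cpp
#include <mutex>

// Hypothetical stand-in for ns_index::Index, reduced to the singleton part.
class MiniIndex
{
private:
    MiniIndex() {}
    MiniIndex(const MiniIndex&) = delete;
    MiniIndex& operator=(const MiniIndex&) = delete;

    static MiniIndex* instance; // declaration only
    static std::mutex mtx;      // declaration only

public:
    static MiniIndex* GetInstance()
    {
        // lock_guard releases the mutex automatically when it goes out of
        // scope, even if `new` throws.
        std::lock_guard<std::mutex> guard(mtx);
        if (nullptr == instance)
        {
            instance = new MiniIndex();
        }
        return instance;
    }
};

// Static data members need exactly one definition outside the class.
MiniIndex* MiniIndex::instance = nullptr;
std::mutex MiniIndex::mtx;
```

    <p>If definitions like these live in a header, that header should be included from only one translation unit, otherwise the linker sees duplicate definitions.</p>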
    <p>With the basic design in place, we can start writing the functions.</p> 
    <h2><a name="t13"></a><a id="_715"></a>Getting the forward index</h2> 
    <p>Assuming <code>forward_index</code> has already been built, the lookup function is easy to write.</p> 
    <pre data-index="21" class="prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token comment">// Get the document for a given doc_id</span>
    DocInfo<span class="token operator">*</span> <span class="token function">GetForwardIndex</span><span class="token punctuation">(</span><span class="token keyword">uint64_t</span> doc_id<span class="token punctuation">)</span> 
    <span class="token punctuation">{<!-- --></span>
        <span class="token keyword">if</span><span class="token punctuation">(</span>doc_id<span class="token operator">>=</span>forward_index<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
        <span class="token punctuation">{<!-- --></span>
            std<span class="token double-colon punctuation">::</span>cerr<span class="token operator"><<</span><span class="token string">"doc_id out of range!"</span><span class="token operator"><<</span>std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span>
            <span class="token keyword">return</span> <span class="token keyword">nullptr</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        <span class="token keyword">return</span> <span class="token operator">&</span>forward_index<span class="token punctuation">[</span>doc_id<span class="token punctuation">]</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    </code></pre> 
    <h2><a name="t14"></a><a id="_732"></a>Getting the inverted index</h2> 
    <pre data-index="22" class="prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token comment">// Get the inverted list for the keyword word</span>
    InvertedList<span class="token operator">*</span> <span class="token function">GetInvertedList</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> word<span class="token punctuation">)</span>
    <span class="token punctuation">{<!-- --></span>
        std<span class="token double-colon punctuation">::</span>unordered_map<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token punctuation">,</span>InvertedList<span class="token operator">></span><span class="token double-colon punctuation">::</span>iterator iter<span class="token operator">=</span>inverted_index<span class="token punctuation">.</span><span class="token function">find</span><span class="token punctuation">(</span>word<span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token keyword">if</span><span class="token punctuation">(</span>iter<span class="token operator">==</span>inverted_index<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
        <span class="token punctuation">{<!-- --></span>
            <span class="token comment">// the keyword is not in the index</span>
            std<span class="token double-colon punctuation">::</span>cerr<span class="token operator"><<</span>word<span class="token operator"><<</span><span class="token string">" has no InvertedList"</span><span class="token operator"><<</span>std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span>
            <span class="token keyword">return</span> <span class="token keyword">nullptr</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
    
        <span class="token keyword">return</span> <span class="token operator">&</span><span class="token punctuation">(</span>iter<span class="token operator">-></span>second<span class="token punctuation">)</span><span class="token punctuation">;</span> 
    <span class="token punctuation">}</span>
    </code></pre> 
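    <p>To see how the two lookups fit together, here is a small sketch of the caller side: find the inverted list for a word, then use each <code>doc_id</code> as a subscript into the forward index. The types <code>Doc</code> and <code>Posting</code> and the function <code>TitlesFor</code> are hypothetical, simplified stand-ins rather than the article's classes.</p>

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified stand-ins for DocInfo and InvertedElem (illustration only).
struct Doc { std::string title; uint64_t doc_id; };
struct Posting { uint64_t doc_id; int weight; };

// Returns the titles of all documents containing `word`,
// or an empty vector if the word is not indexed.
std::vector<std::string> TitlesFor(
    const std::string& word,
    const std::vector<Doc>& forward,
    const std::unordered_map<std::string, std::vector<Posting>>& inverted)
{
    std::vector<std::string> titles;
    auto iter = inverted.find(word);
    if (iter == inverted.end())
    {
        return titles; // same "not found" case as the nullptr return above
    }
    for (const Posting& p : iter->second)
    {
        if (p.doc_id < forward.size()) // same bounds check as GetForwardIndex
        {
            titles.push_back(forward[p.doc_id].title);
        }
    }
    return titles;
}
```

    <p>Both failure cases (unknown word, out-of-range ID) are handled by the caller here, which mirrors why the article's functions return <code>nullptr</code> instead of throwing.</p>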
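    <h2><a name="t15"></a><a id="_750"></a>Building the index</h2> 
    <p>The hard part of this module is building the index, and <strong>building the index runs in exactly the opposite direction to a user's search</strong>.</p> 
    <p>Approach: iterate over the documents one by one; for each document, build its forward index entry first, then its inverted index entries.</p> 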
    <p><img src="https://1000bd.com/contentImg/2022/08/14/024951254.png" alt="在这里插入图片描述"></p> 
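    <p>The order "forward first, inverted second" can be illustrated on a toy index. This is only a sketch under loud assumptions: words are split on whitespace (standing in for real word segmentation), the weight is simply the number of occurrences, and <code>ToyDoc</code>, <code>ToyElem</code> and <code>ToyIndex</code> are hypothetical names, not the article's types.</p>

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

struct ToyDoc  { std::string content; uint64_t doc_id; };
struct ToyElem { uint64_t doc_id; std::string word; int weight; };

struct ToyIndex
{
    std::vector<ToyDoc> forward;                                    // doc_id -> document
    std::unordered_map<std::string, std::vector<ToyElem>> inverted; // word -> inverted list

    void Add(const std::string& content)
    {
        // Forward index first: the new document's ID is its position in the vector.
        ToyDoc doc;
        doc.content = content;
        doc.doc_id = forward.size();
        forward.push_back(doc);

        // Count each word (assumption: whitespace split instead of real segmentation).
        std::unordered_map<std::string, int> counts;
        std::istringstream iss(content);
        std::string word;
        while (iss >> word)
        {
            counts[word]++;
        }

        // Inverted index second: one element per distinct word in this document.
        for (const auto& kv : counts)
        {
            ToyElem elem;
            elem.doc_id = doc.doc_id;
            elem.word = kv.first;
            elem.weight = kv.second; // assumption: weight = occurrence count
            inverted[kv.first].push_back(elem);
        }
    }
};
```

    <p>Because the document ID is assigned while filling the forward index, the inverted index step always has a valid ID to record, which is why the forward step has to come first.</p>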
    <p>The <code>BuildIndex</code> code is as follows:</p> 
    <pre data-index="23" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token comment">// Build the forward and inverted indexes from the documents produced by Parser</span>
    <span class="token comment">// The Parser output is stored at: data/raw_html/raw.txt</span>
    <span class="token keyword">bool</span> <span class="token function">BuildIndex</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> parsed_path<span class="token punctuation">)</span>
    <span class="token punctuation">{<!-- --></span>
        <span class="token comment">// Open the file produced by Parser</span>
        std<span class="token double-colon punctuation">::</span>ifstream <span class="token function">in</span><span class="token punctuation">(</span>parsed_path<span class="token punctuation">,</span>std<span class="token double-colon punctuation">::</span>ios<span class="token double-colon punctuation">::</span>in<span class="token operator">|</span>std<span class="token double-colon punctuation">::</span>ios<span class="token double-colon punctuation">::</span>binary<span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token operator">!</span>in<span class="token punctuation">.</span><span class="token function">is_open</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
        <span class="token punctuation">{<!-- --></span>
            std<span class="token double-colon punctuation">::</span>cerr<span class="token operator"><<</span>parsed_path<span class="token operator"><<</span><span class="token string">" open failed"</span><span class="token operator"><<</span>std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span>
            <span class="token keyword">return</span> <span class="token boolean">false</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
      
        std<span class="token double-colon punctuation">::</span>string line<span class="token punctuation">;</span>
        <span class="token keyword">int</span> count<span class="token operator">=</span><span class="token number">0</span><span class="token punctuation">;</span><span class="token comment">// number of documents indexed so far</span>
        <span class="token keyword">while</span><span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span><span class="token function">getline</span><span class="token punctuation">(</span>in<span class="token punctuation">,</span>line<span class="token punctuation">)</span><span class="token punctuation">)</span>
        <span class="token punctuation">{<!-- --></span> 
            <span class="token comment">// Build the forward index: load the parsed document into it</span>
            DocInfo<span class="token operator">*</span> doc<span class="token operator">=</span><span class="token function">BuildForwardIndex</span><span class="token punctuation">(</span>line<span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token keyword">nullptr</span><span class="token operator">==</span>doc<span class="token punctuation">)</span>
            <span class="token punctuation">{<!-- --></span>
                std<span class="token double-colon punctuation">::</span>cerr<span class="token operator"><<</span><span class="token string">"build "</span><span class="token operator"><<</span>line<span class="token operator"><<</span><span class="token string">" error"</span><span class="token operator"><<</span>std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span><span class="token comment">//for debug</span>
                <span class="token keyword">continue</span><span class="token punctuation">;</span>
            <span class="token punctuation">}</span>
    
            <span class="token comment">// Build the inverted index for this document</span>
            <span class="token function">BuildInvertedIndex</span><span class="token punctuation">(</span><span class="token operator">*</span>doc<span class="token punctuation">)</span><span class="token punctuation">;</span>
    
            <span class="token comment">// Progress indicator: print the number of documents indexed so far</span>
            count<span class="token operator">++</span><span class="token punctuation">;</span>
            <span class="token function">printf</span><span class="token punctuation">(</span><span class="token string">"Indexed %d documents: %d%%\r"</span><span class="token punctuation">,</span>count<span class="token punctuation">,</span>count<span class="token operator">*</span><span class="token number">100</span><span class="token operator">/</span><span class="token number">8171</span><span class="token punctuation">)</span><span class="token punctuation">;</span><span class="token comment">// 8171 is the number of parsed documents</span>
            <span class="token function">fflush</span><span class="token punctuation">(</span><span class="token constant">stdout</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        <span class="token keyword">return</span> <span class="token boolean">true</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    </code></pre> 
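    <p>The progress percentage above divides by the literal 8171, which only stays correct while <code>raw.txt</code> holds exactly that many lines. One possible alternative, sketched here with two hypothetical helpers (<code>CountLines</code> and <code>Percent</code>), is to count the lines once before indexing and divide by that total.</p>

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: count the records (lines) in the parsed file.
// Returns 0 if the file cannot be opened.
int CountLines(const std::string& path)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    int total = 0;
    std::string line;
    while (std::getline(in, line))
    {
        ++total;
    }
    return total;
}

// Integer percentage of `done` out of `total`; returns 0 when total is 0
// so that an empty or missing file cannot cause a division by zero.
int Percent(int done, int total)
{
    return total > 0 ? done * 100 / total : 0;
}
```

    <p>The cost is one extra pass over the file, which is small compared with the indexing work itself.</p>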
    <h3><a name="t16"></a><a id="_796"></a>构建正排索引</h3> 
    <pre data-index="24" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token keyword">private</span><span class="token operator">:</span>
        DocInfo<span class="token operator">*</span> <span class="token function">BuildForwardIndex</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> line<span class="token punctuation">)</span>
        <span class="token punctuation">{<!-- --></span>
            <span class="token comment">// 1. Parse line by splitting the string on the separator</span>
            <span class="token comment">//line -> title+content+url </span>
            std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> results<span class="token punctuation">;</span>
            <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string sep<span class="token operator">=</span><span class="token string">"\3"</span><span class="token punctuation">;</span>
            ns_tool<span class="token double-colon punctuation">::</span><span class="token class-name">StringTool</span><span class="token double-colon punctuation">::</span><span class="token function">CutString</span><span class="token punctuation">(</span>line<span class="token punctuation">,</span><span class="token operator">&</span>results<span class="token punctuation">,</span>sep<span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token keyword">if</span><span class="token punctuation">(</span>results<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token operator">!=</span><span class="token number">3</span><span class="token punctuation">)</span>
            <span class="token punctuation">{<!-- --></span>
                <span class="token keyword">return</span> <span class="token keyword">nullptr</span><span class="token punctuation">;</span>
            <span class="token punctuation">}</span>
    
            <span class="token comment">// 2. Fill a DocInfo with the split fields</span>
            DocInfo doc<span class="token punctuation">;</span>
            doc<span class="token punctuation">.</span>title<span class="token operator">=</span>results<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">;</span>
            doc<span class="token punctuation">.</span>content<span class="token operator">=</span>results<span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">;</span>
            doc<span class="token punctuation">.</span>url<span class="token operator">=</span>results<span class="token punctuation">[</span><span class="token number">2</span><span class="token punctuation">]</span><span class="token punctuation">;</span>
            doc<span class="token punctuation">.</span>doc_id<span class="token operator">=</span>forward_index<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
      
            <span class="token comment">// 3. Insert the DocInfo into the forward index (forward_index)</span>
            forward_index<span class="token punctuation">.</span><span class="token function">push_back</span><span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span><span class="token function">move</span><span class="token punctuation">(</span>doc<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token keyword">return</span> <span class="token operator">&</span>forward_index<span class="token punctuation">.</span><span class="token function">back</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
    </code></pre> 
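    <p>One detail worth noting about the function above: the pointer returned by <code>&forward_index.back()</code> stays valid only until a later <code>push_back</code> makes the vector reallocate. Since <code>doc_id</code> is set to the element's position in the vector, the id itself is a stable handle. The following is a minimal sketch of that idea, not the article's code; the class name <code>ForwardIndex</code> and the methods <code>Add</code> and <code>Get</code> are hypothetical.</p>

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Mirrors the DocInfo fields used in this article.
struct DocInfo {
    std::string title;
    std::string content;
    std::string url;
    std::uint64_t doc_id = 0;
};

// Hypothetical wrapper (not from the article): doc_id == index in the vector.
class ForwardIndex {
public:
    // Appends a document and returns its doc_id.
    std::uint64_t Add(std::string title, std::string content, std::string url) {
        DocInfo doc;
        doc.title = std::move(title);
        doc.content = std::move(content);
        doc.url = std::move(url);
        doc.doc_id = docs_.size();  // next free position becomes the id
        docs_.push_back(std::move(doc));
        return docs_.back().doc_id;
    }

    // O(1) lookup by id; returns nullptr when the id is out of range.
    const DocInfo* Get(std::uint64_t doc_id) const {
        if (doc_id >= docs_.size()) return nullptr;
        return &docs_[doc_id];
    }

private:
    std::vector<DocInfo> docs_;
};
```

    <p>Callers that keep the returned id and call <code>Get</code> when they need the document are unaffected by vector growth, whereas a stored pointer could dangle.</p>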
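    <p>The <code>CutString</code> function is defined in <code>tool.hpp</code>.</p> 
    <p>Boost's <code>split</code> function makes it easy to split the string. Earlier, in the parser, we separated title/content/url with <code>\3</code>.</p> 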
    <pre data-index="25" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token comment">//tool.hpp</span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">pragma</span> <span class="token expression">once</span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><iostream></span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><string></span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><vector></span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><fstream></span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><boost/algorithm/string.hpp></span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"cppjieba/Jieba.hpp"</span></span>
    <span class="token keyword">namespace</span> ns_tool
    <span class="token punctuation">{<!-- --></span>
        <span class="token comment">//...</span>
    
        <span class="token keyword">class</span> <span class="token class-name">StringTool</span>
        <span class="token punctuation">{<!-- --></span>
        <span class="token keyword">public</span><span class="token operator">:</span>
            <span class="token keyword">static</span> <span class="token keyword">void</span> <span class="token function">CutString</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> src<span class="token punctuation">,</span>std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span><span class="token operator">*</span> dst<span class="token punctuation">,</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> sep <span class="token punctuation">)</span>
            <span class="token punctuation">{<!-- --></span>
                <span class="token comment">//boost split</span>
                boost<span class="token double-colon punctuation">::</span><span class="token function">split</span><span class="token punctuation">(</span><span class="token operator">*</span>dst<span class="token punctuation">,</span>src<span class="token punctuation">,</span>boost<span class="token double-colon punctuation">::</span><span class="token function">is_any_of</span><span class="token punctuation">(</span>sep<span class="token punctuation">)</span><span class="token punctuation">,</span>boost<span class="token double-colon punctuation">::</span>token_compress_on<span class="token punctuation">)</span><span class="token punctuation">;</span>
                <span class="token comment">// token_compress_on enables compression: a run of consecutive separators is treated as a single separator</span>
            <span class="token punctuation">}</span>
        <span class="token punctuation">}</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    </code></pre> 
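    <p>To make the effect of compression concrete, here is a standard-library-only sketch that approximates what <code>CutString</code> does with <code>token_compress_on</code>. The helper name <code>SplitCompress</code> is hypothetical, and its behavior at the very start or end of the string is an assumption, not a claim about Boost.</p>

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for CutString (standard library only).
// Any character in `seps` is a separator; a run of separators counts as one.
std::vector<std::string> SplitCompress(const std::string& src, const std::string& seps) {
    std::vector<std::string> out;
    std::size_t pos = 0;
    while (true) {
        std::size_t end = src.find_first_of(seps, pos);
        if (end == std::string::npos) {
            out.push_back(src.substr(pos));  // last field
            break;
        }
        out.push_back(src.substr(pos, end - pos));
        // Skip the whole run of separators in one step.
        pos = src.find_first_not_of(seps, end);
        if (pos == std::string::npos) {
            out.push_back("");  // input ended with a separator
            break;
        }
    }
    return out;
}
```

    <p>With this sketch, <code>"title\3content\3url"</code> yields three fields, while a record with an empty middle field such as <code>"title\3\3url"</code> collapses to two. In <code>BuildForwardIndex</code> such a line would fail the <code>results.size()!=3</code> check and be skipped.</p>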
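    <h3><a name="t17"></a><a id="_855"></a>Building the inverted index</h3> 
    <p>Building the inverted index is the hard part of index construction.</p> 
    <p><strong>How it works</strong>:</p> 
    <ol><li>Start from a DocInfo that has already been built</li></ol> 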
    <pre data-index="26" class="prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token keyword">struct</span> <span class="token class-name">DocInfo</span>
    <span class="token punctuation">{<!-- --></span>
        std<span class="token double-colon punctuation">::</span>string title <span class="token punctuation">;</span>   <span class="token comment">// document title</span>
        std<span class="token double-colon punctuation">::</span>string content<span class="token punctuation">;</span>  <span class="token comment">// document content with tags removed</span>
        std<span class="token double-colon punctuation">::</span>string url<span class="token punctuation">;</span>      <span class="token comment">// URL of the document on the official site</span>
        <span class="token keyword">uint64_t</span> doc_id<span class="token punctuation">;</span>      <span class="token comment">// document ID</span>
    <span class="token punctuation">}</span><span class="token punctuation">;</span>
    </code></pre> 
    <p>For example:</p> 
    <pre data-index="27" class="prettyprint"><code class="prism language-txt has-numbering" onclick="mdcp.signin(event)" style="position: unset;">title: 吃葡萄
    content:吃葡萄不吐葡萄皮
    url:http://xxxx
    doc_id:123
    </code></pre> 
    <ol start="2"><li>From the document content held in the DocInfo, build one or more InvertedElem entries.</li></ol> 
    <pre data-index="28" class="prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token comment">// Element of the inverted index</span>
    <span class="token keyword">struct</span> <span class="token class-name">InvertedElem</span>
    <span class="token punctuation">{<!-- --></span>
        <span class="token keyword">uint64_t</span> doc_id<span class="token punctuation">;</span>   <span class="token comment">// document ID</span>
        std<span class="token double-colon punctuation">::</span>string word<span class="token punctuation">;</span> <span class="token comment">// keyword associated with the document</span>
        <span class="token keyword">int</span> weight<span class="token punctuation">;</span>        <span class="token comment">// weight of the word in this document</span>
    <span class="token punctuation">}</span><span class="token punctuation">;</span>
    
    <span class="token comment">// Inverted list (posting list)</span>
    <span class="token keyword">typedef</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>InvertedElem<span class="token operator">></span> InvertedList<span class="token punctuation">;</span>
    </code></pre> 
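    <p>The two types above are typically held in a map from keyword to posting list. A minimal sketch of such a container follows; the class name <code>InvertedIndex</code> and its methods <code>Add</code> and <code>Get</code> are hypothetical and only illustrate the shape of the data.</p>

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct InvertedElem {
    std::uint64_t doc_id = 0;  // document ID
    std::string word;          // keyword
    int weight = 0;            // weight of the word in this document
};
typedef std::vector<InvertedElem> InvertedList;

// Hypothetical container (not from the article): keyword -> posting list.
class InvertedIndex {
public:
    // Appends one (word, doc_id, weight) entry to the word's posting list.
    void Add(const std::string& word, std::uint64_t doc_id, int weight) {
        InvertedElem elem;
        elem.doc_id = doc_id;
        elem.word = word;
        elem.weight = weight;
        index_[word].push_back(elem);
    }

    // Returns the posting list for `word`, or nullptr if it was never indexed.
    // find() is used instead of operator[] so a miss does not insert an empty list.
    const InvertedList* Get(const std::string& word) const {
        auto it = index_.find(word);
        return it == index_.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<std::string, InvertedList> index_;
};
```

    <p>One keyword can accumulate entries from many documents, which is why the mapped value is a list and not a single element.</p>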
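    <p>Because we process documents one at a time, and a single document contains many words, every word extracted from it maps to the current doc_id.</p> 
    <p><strong>2.1</strong> First, segment title && content into words using <code>jieba</code> (a third-party segmentation library)</p> 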
    <p>title: 吃/葡萄/吃葡萄 (<code>title_word</code>)</p> 
    <p>content:吃/葡萄/不吐/葡萄皮( <code>content_word</code> )</p> 
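    <p><strong>2.2</strong> Word frequency counting</p> 
    <p>This measures the relevance between a word and a document: a word that occurs more often, or that appears in the title, can be considered more relevant.</p> 
    <p>Pseudocode:</p> 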
    <pre data-index="29" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token comment">// After segmentation, count how often each word occurs in the title and in the content</span>
    <span class="token keyword">struct</span> <span class="token class-name">word_cnt</span>
    <span class="token punctuation">{<!-- --></span>
        <span class="token keyword">int</span> title_cnt <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span>    <span class="token comment">// occurrences in the title</span>
        <span class="token keyword">int</span> content_cnt <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span>  <span class="token comment">// occurrences in the content</span>
    <span class="token punctuation">}</span><span class="token punctuation">;</span>
    
    <span class="token comment">// Each word and its frequency statistics are stored in a map</span>
    std<span class="token double-colon punctuation">::</span>unordered_map<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token punctuation">,</span> word_cnt<span class="token operator">></span> word_stat<span class="token punctuation">;</span>
    
    <span class="token comment">// Walk title_word and count each word's frequency in the title</span>
    <span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">auto</span><span class="token operator">&</span> word<span class="token operator">:</span>title_word<span class="token punctuation">)</span>
    <span class="token punctuation">{<!-- --></span>
        word_stat<span class="token punctuation">[</span>word<span class="token punctuation">]</span><span class="token punctuation">.</span>title_cnt<span class="token operator">++</span><span class="token punctuation">;</span><span class="token comment">// 吃(1)/葡萄(1)/吃葡萄(1)</span>
    <span class="token punctuation">}</span>
    
    <span class="token comment">// Walk content_word and count each word's frequency in the content</span>
    <span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">auto</span><span class="token operator">&</span> word<span class="token operator">:</span>content_word<span class="token punctuation">)</span>
    <span class="token punctuation">{<!-- --></span>
        word_stat<span class="token punctuation">[</span>word<span class="token punctuation">]</span><span class="token punctuation">.</span>content_cnt<span class="token operator">++</span><span class="token punctuation">;</span><span class="token comment">// 吃(1)/葡萄(1)/不吐(1)/葡萄皮(1)</span>
    <span class="token punctuation">}</span>
    </code></pre> 
    <p>At this point we know the frequency of every word in the document's title and content.</p> 
    <p><strong>2.3</strong> Custom relevance</p> 
    <p>Pseudocode:</p> 
    <pre data-index="30" class="prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">auto</span><span class="token operator">&</span> word<span class="token operator">:</span>word_stat<span class="token punctuation">)</span>
    <span class="token punctuation">{<!-- --></span>
        <span class="token comment">// Relationship between one specific word and the document (ID: 123)</span>
        <span class="token keyword">struct</span> <span class="token class-name">InvertedElem</span> elem<span class="token punctuation">;</span>
        elem<span class="token punctuation">.</span>doc_id<span class="token operator">=</span><span class="token number">123</span><span class="token punctuation">;</span>
        elem<span class="token punctuation">.</span>word<span class="token operator">=</span>word<span class="token punctuation">.</span>first<span class="token punctuation">;</span>  
    
        <span class="token comment">// When one word points to several document IDs, relevance decides which is shown first</span>
        elem<span class="token punctuation">.</span>weight<span class="token operator">=</span><span class="token number">10</span><span class="token operator">*</span>word<span class="token punctuation">.</span>second<span class="token punctuation">.</span>title_cnt <span class="token operator">+</span> word<span class="token punctuation">.</span>second<span class="token punctuation">.</span>content_cnt <span class="token punctuation">;</span>
        <span class="token comment">// Relevance (how to balance the weights) is a hard problem; this is a deliberate simplification</span>
      
        <span class="token comment">// Append to this word's inverted list; one word can map to many documents</span>
        inverted_index<span class="token punctuation">[</span>word<span class="token punctuation">.</span>first<span class="token punctuation">]</span><span class="token punctuation">.</span><span class="token function">push_back</span><span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span><span class="token function">move</span><span class="token punctuation">(</span>elem<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    </code></pre> 
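    <p>Steps 2.2 and 2.3 can be put together in a small runnable sketch. It assumes the words have already been produced by a segmenter, uses the same simplified weight (10 per title occurrence, 1 per content occurrence), and sorts a posting list by weight at lookup time. The names <code>WordCnt</code>, <code>IndexDocument</code> and <code>Lookup</code> are hypothetical, and English tokens stand in for the segmented Chinese words.</p>

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct WordCnt { int title_cnt = 0; int content_cnt = 0; };
struct InvertedElem { std::uint64_t doc_id = 0; std::string word; int weight = 0; };
typedef std::vector<InvertedElem> InvertedList;
typedef std::unordered_map<std::string, InvertedList> InvertedIndex;

// Hypothetical helper: indexes one document whose text is already segmented.
void IndexDocument(std::uint64_t doc_id,
                   const std::vector<std::string>& title_words,
                   const std::vector<std::string>& content_words,
                   InvertedIndex* index) {
    // Step 2.2: per-word frequency in title and content.
    std::unordered_map<std::string, WordCnt> word_stat;
    for (const auto& w : title_words) word_stat[w].title_cnt++;
    for (const auto& w : content_words) word_stat[w].content_cnt++;

    // Step 2.3: one InvertedElem per distinct word, with the simplified weight.
    for (const auto& kv : word_stat) {
        InvertedElem elem;
        elem.doc_id = doc_id;
        elem.word = kv.first;
        elem.weight = 10 * kv.second.title_cnt + kv.second.content_cnt;
        (*index)[kv.first].push_back(elem);
    }
}

// Returns a copy of the posting list for `word`, highest weight first.
InvertedList Lookup(const InvertedIndex& index, const std::string& word) {
    auto it = index.find(word);
    if (it == index.end()) return InvertedList();
    InvertedList hits = it->second;
    std::sort(hits.begin(), hits.end(),
              [](const InvertedElem& a, const InvertedElem& b) {
                  return a.weight > b.weight;
              });
    return hits;
}
```

    <p>For instance, a word that appears once in a document's title and twice in its content gets weight 12 for that document, so it ranks ahead of a document where the same word appears only once in the content (weight 1).</p>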
    <ol start="3"><li>Using jieba word segmentation: cppjieba</li></ol> 
    <p>Download the cppjieba library.</p> 
    <p>It can be obtained with:</p> 
    <pre data-index="31" class="prettyprint"><code class="prism language-bash has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token function">git</span> clone https://github.com/yanyiwu/cppjieba
    </code></pre> 
    <p>After downloading cppjieba there is one more detail: manually copy <code>cppjieba/deps/limonp/</code> into the <code>cppjieba/include/cppjieba/</code> directory, otherwise compilation fails.</p> 
    <p><img src="https://1000bd.com/contentImg/2022/08/14/024951457.png" alt="在这里插入图片描述"></p> 
    <p>We can try out this third-party library; the function we mainly use is <code>CutForSearch()</code>.</p> 
    <pre data-index="32" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos test<span class="token punctuation">]</span>$ ll
    total <span class="token number">372</span>
    <span class="token operator">-</span>rwxrwxr<span class="token operator">-</span>x <span class="token number">1</span> sjl sjl <span class="token number">366424</span> Jul <span class="token number">23</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">02</span> a<span class="token punctuation">.</span>out
    drwxrwxr<span class="token operator">-</span>x <span class="token number">8</span> sjl sjl   <span class="token number">4096</span> Jul <span class="token number">23</span> <span class="token number">16</span><span class="token operator">:</span><span class="token number">11</span> cppjieba
    <span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl    <span class="token number">857</span> Jul <span class="token number">23</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">07</span> demo<span class="token punctuation">.</span>cpp
    lrwxrwxrwx <span class="token number">1</span> sjl sjl     <span class="token number">14</span> Jul <span class="token number">23</span> <span class="token number">16</span><span class="token operator">:</span><span class="token number">23</span> dict <span class="token operator">-></span> cppjieba<span class="token operator">/</span>dict<span class="token operator">/</span>
    lrwxrwxrwx <span class="token number">1</span> sjl sjl     <span class="token number">17</span> Jul <span class="token number">23</span> <span class="token number">16</span><span class="token operator">:</span><span class="token number">26</span> inc <span class="token operator">-></span> cppjieba<span class="token operator">/</span>include<span class="token operator">/</span>
    <span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl    <span class="token number">424</span> Jul <span class="token number">23</span> <span class="token number">00</span><span class="token operator">:</span><span class="token number">34</span> test<span class="token punctuation">.</span>cc
    <span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos test<span class="token punctuation">]</span>$ cat demo<span class="token punctuation">.</span>cpp 
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"inc/cppjieba/Jieba.hpp"</span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><iostream></span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><vector></span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><string></span></span>
    
    <span class="token keyword">using</span> <span class="token keyword">namespace</span> std<span class="token punctuation">;</span>
    
    <span class="token keyword">const</span> <span class="token keyword">char</span><span class="token operator">*</span> <span class="token keyword">const</span> DICT_PATH <span class="token operator">=</span> <span class="token string">"./dict/jieba.dict.utf8"</span><span class="token punctuation">;</span>
    <span class="token keyword">const</span> <span class="token keyword">char</span><span class="token operator">*</span> <span class="token keyword">const</span> HMM_PATH <span class="token operator">=</span> <span class="token string">"./dict/hmm_model.utf8"</span><span class="token punctuation">;</span>
    <span class="token keyword">const</span> <span class="token keyword">char</span><span class="token operator">*</span> <span class="token keyword">const</span> USER_DICT_PATH <span class="token operator">=</span> <span class="token string">"./dict/user.dict.utf8"</span><span class="token punctuation">;</span>
    <span class="token keyword">const</span> <span class="token keyword">char</span><span class="token operator">*</span> <span class="token keyword">const</span> IDF_PATH <span class="token operator">=</span> <span class="token string">"./dict/idf.utf8"</span><span class="token punctuation">;</span>
    <span class="token keyword">const</span> <span class="token keyword">char</span><span class="token operator">*</span> <span class="token keyword">const</span> STOP_WORD_PATH <span class="token operator">=</span> <span class="token string">"./dict/stop_words.utf8"</span><span class="token punctuation">;</span>
    
    <span class="token keyword">int</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token keyword">int</span> argc<span class="token punctuation">,</span> <span class="token keyword">char</span><span class="token operator">*</span><span class="token operator">*</span> argv<span class="token punctuation">)</span> 
    <span class="token punctuation">{<!-- --></span>
        cppjieba<span class="token double-colon punctuation">::</span>Jieba <span class="token function">jieba</span><span class="token punctuation">(</span>DICT_PATH<span class="token punctuation">,</span>
                HMM_PATH<span class="token punctuation">,</span>
                USER_DICT_PATH<span class="token punctuation">,</span>
                IDF_PATH<span class="token punctuation">,</span>
                STOP_WORD_PATH<span class="token punctuation">)</span><span class="token punctuation">;</span>
        vector<span class="token operator"><</span>string<span class="token operator">></span> words<span class="token punctuation">;</span>
        string s<span class="token punctuation">;</span>
    
        s <span class="token operator">=</span> <span class="token string">"小明硕士毕业于中国科学院计算所,后在日本京都大学深造"</span><span class="token punctuation">;</span>
        cout <span class="token operator"><<</span> s <span class="token operator"><<</span> endl<span class="token punctuation">;</span>
        cout <span class="token operator"><<</span> <span class="token string">"[demo] CutForSearch"</span> <span class="token operator"><<</span> endl<span class="token punctuation">;</span>
        jieba<span class="token punctuation">.</span><span class="token function">CutForSearch</span><span class="token punctuation">(</span>s<span class="token punctuation">,</span> words<span class="token punctuation">)</span><span class="token punctuation">;</span>
        cout <span class="token operator"><<</span> limonp<span class="token double-colon punctuation">::</span><span class="token function">Join</span><span class="token punctuation">(</span>words<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> words<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string">"/"</span><span class="token punctuation">)</span> <span class="token operator"><<</span> endl<span class="token punctuation">;</span>
    
        <span class="token keyword">return</span> EXIT_SUCCESS<span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    <span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos test<span class="token punctuation">]</span>$ <span class="token punctuation">.</span><span class="token operator">/</span>a<span class="token punctuation">.</span>out 
    小明硕士毕业于中国科学院计算所,后在日本京都大学深造
    <span class="token punctuation">[</span>demo<span class="token punctuation">]</span> CutForSearch
    小明<span class="token operator">/</span>硕士<span class="token operator">/</span>毕业<span class="token operator">/</span>于<span class="token operator">/</span>中国<span class="token operator">/</span>科学<span class="token operator">/</span>学院<span class="token operator">/</span>科学院<span class="token operator">/</span>中国科学院<span class="token operator">/</span>计算<span class="token operator">/</span>计算所<span class="token operator">/</span>,<span class="token operator">/</span>后<span class="token operator">/</span>在<span class="token operator">/</span>日本<span class="token operator">/</span>京都<span class="token operator">/</span>大学<span class="token operator">/</span>日本京都大学<span class="token operator">/</span>深造
    </code></pre> 
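    <p>As the output shows, the sentence is segmented well. <code>CutForSearch</code> emits both long words and the shorter words they contain (for example 科学院 alongside 中国科学院).</p> 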
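    <p><strong>Next, we bring in the jieba library to write the inverted-index code.</strong></p> 
    <p>Place the cppjieba library in the third-party directory <code>thirdpart</code> under the home directory (<code>~/thirdpart</code>), then <strong>create symbolic links</strong> in this project's directory to <strong>the library's header files and dictionaries</strong>:</p> 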
    <pre data-index="33" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ ll
    total <span class="token number">148</span>
    drwxr<span class="token operator">-</span>xr<span class="token operator">-</span>x <span class="token number">8</span> sjl sjl   <span class="token number">4096</span> Apr  <span class="token number">7</span> <span class="token number">05</span><span class="token operator">:</span><span class="token number">33</span> boost_1_79_0
    drwxrwxr<span class="token operator">-</span>x <span class="token number">4</span> sjl sjl   <span class="token number">4096</span> Jul <span class="token number">19</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">37</span> data
    <span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl   <span class="token number">4399</span> Jul <span class="token number">23</span> <span class="token number">00</span><span class="token operator">:</span><span class="token number">44</span> index<span class="token punctuation">.</span>hpp
    <span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl    <span class="token number">124</span> Jul <span class="token number">20</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">03</span> Makefile
    <span class="token operator">-</span>rwxrwxr<span class="token operator">-</span>x <span class="token number">1</span> sjl sjl <span class="token number">112408</span> Jul <span class="token number">22</span> <span class="token number">12</span><span class="token operator">:</span><span class="token number">36</span> parser
    <span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl   <span class="token number">6088</span> Jul <span class="token number">22</span> <span class="token number">12</span><span class="token operator">:</span><span class="token number">31</span> parser<span class="token punctuation">.</span>cc
    drwxrwxr<span class="token operator">-</span>x <span class="token number">3</span> sjl sjl   <span class="token number">4096</span> Jul <span class="token number">23</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">02</span> test
    <span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl   <span class="token number">1244</span> Jul <span class="token number">23</span> <span class="token number">00</span><span class="token operator">:</span><span class="token number">44</span> tool<span class="token punctuation">.</span>hpp
    <span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ ln <span class="token operator">-</span>s <span class="token operator">~</span><span class="token operator">/</span>thirdpart<span class="token operator">/</span>cppjieba<span class="token operator">/</span>include<span class="token operator">/</span>cppjieba<span class="token operator">/</span> cppjieba
    <span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ ln <span class="token operator">-</span>s <span class="token operator">~</span><span class="token operator">/</span>thirdpart<span class="token operator">/</span>cppjieba<span class="token operator">/</span>dict<span class="token operator">/</span> dict
    <span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ ll
    total <span class="token number">148</span>
    drwxr<span class="token operator">-</span>xr<span class="token operator">-</span>x <span class="token number">8</span> sjl sjl   <span class="token number">4096</span> Apr  <span class="token number">7</span> <span class="token number">05</span><span class="token operator">:</span><span class="token number">33</span> boost_1_79_0
    lrwxrwxrwx <span class="token number">1</span> sjl sjl     <span class="token number">46</span> Jul <span class="token number">23</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">46</span> cppjieba <span class="token operator">-></span> <span class="token operator">/</span>home<span class="token operator">/</span>sjl<span class="token operator">/</span>thirdpart<span class="token operator">/</span>cppjieba<span class="token operator">/</span>include<span class="token operator">/</span>cppjieba<span class="token operator">/</span>
    drwxrwxr<span class="token operator">-</span>x <span class="token number">4</span> sjl sjl   <span class="token number">4096</span> Jul <span class="token number">19</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">37</span> data
    lrwxrwxrwx <span class="token number">1</span> sjl sjl     <span class="token number">34</span> Jul <span class="token number">23</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">47</span> dict <span class="token operator">-></span> <span class="token operator">/</span>home<span class="token operator">/</span>sjl<span class="token operator">/</span>thirdpart<span class="token operator">/</span>cppjieba<span class="token operator">/</span>dict<span class="token operator">/</span>
    <span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl   <span class="token number">4399</span> Jul <span class="token number">23</span> <span class="token number">00</span><span class="token operator">:</span><span class="token number">44</span> index<span class="token punctuation">.</span>hpp
    <span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl    <span class="token number">124</span> Jul <span class="token number">20</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">03</span> Makefile
    <span class="token operator">-</span>rwxrwxr<span class="token operator">-</span>x <span class="token number">1</span> sjl sjl <span class="token number">112408</span> Jul <span class="token number">22</span> <span class="token number">12</span><span class="token operator">:</span><span class="token number">36</span> parser
    <span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl   <span class="token number">6088</span> Jul <span class="token number">22</span> <span class="token number">12</span><span class="token operator">:</span><span class="token number">31</span> parser<span class="token punctuation">.</span>cc
    drwxrwxr<span class="token operator">-</span>x <span class="token number">3</span> sjl sjl   <span class="token number">4096</span> Jul <span class="token number">23</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">02</span> test
    <span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl   <span class="token number">1244</span> Jul <span class="token number">23</span> <span class="token number">00</span><span class="token operator">:</span><span class="token number">44</span> tool<span class="token punctuation">.</span>hpp
    <span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ ls cppjieba<span class="token operator">/</span>
    DictTrie<span class="token punctuation">.</span>hpp     HMMModel<span class="token punctuation">.</span>hpp    Jieba<span class="token punctuation">.</span>hpp             limonp          MPSegment<span class="token punctuation">.</span>hpp  PreFilter<span class="token punctuation">.</span>hpp     SegmentBase<span class="token punctuation">.</span>hpp    TextRankExtractor<span class="token punctuation">.</span>hpp  Unicode<span class="token punctuation">.</span>hpp
    FullSegment<span class="token punctuation">.</span>hpp  HMMSegment<span class="token punctuation">.</span>hpp  KeywordExtractor<span class="token punctuation">.</span>hpp  MixSegment<span class="token punctuation">.</span>hpp  PosTagger<span class="token punctuation">.</span>hpp  QuerySegment<span class="token punctuation">.</span>hpp  SegmentTagged<span class="token punctuation">.</span>hpp  Trie<span class="token punctuation">.</span>hpp
    <span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ ls dict<span class="token operator">/</span>
    hmm_model<span class="token punctuation">.</span>utf8  idf<span class="token punctuation">.</span>utf8  jieba<span class="token punctuation">.</span>dict<span class="token punctuation">.</span>utf8  pos_dict  README<span class="token punctuation">.</span>md  stop_words<span class="token punctuation">.</span>utf8  user<span class="token punctuation">.</span>dict<span class="token punctuation">.</span>utf8
    </code></pre> 
    <p>We keep the word-segmentation code in the header <code>tool.hpp</code> as a general-purpose utility. The segmentation function looks like this:</p> 
    <pre data-index="34" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token comment">//tool.hpp</span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">pragma</span> <span class="token expression">once</span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><iostream></span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><string></span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><vector></span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><fstream></span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><boost/algorithm/string.hpp></span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"cppjieba/Jieba.hpp"</span></span>
    <span class="token keyword">namespace</span> ns_tool
    <span class="token punctuation">{<!-- --></span>
        <span class="token comment">//...</span>
    
    
        <span class="token comment">// Word-segmentation utility</span>
        <span class="token keyword">const</span> <span class="token keyword">char</span><span class="token operator">*</span> <span class="token keyword">const</span> DICT_PATH <span class="token operator">=</span> <span class="token string">"./dict/jieba.dict.utf8"</span><span class="token punctuation">;</span>
        <span class="token keyword">const</span> <span class="token keyword">char</span><span class="token operator">*</span> <span class="token keyword">const</span> HMM_PATH <span class="token operator">=</span> <span class="token string">"./dict/hmm_model.utf8"</span><span class="token punctuation">;</span>
        <span class="token keyword">const</span> <span class="token keyword">char</span><span class="token operator">*</span> <span class="token keyword">const</span> USER_DICT_PATH <span class="token operator">=</span> <span class="token string">"./dict/user.dict.utf8"</span><span class="token punctuation">;</span>
        <span class="token keyword">const</span> <span class="token keyword">char</span><span class="token operator">*</span> <span class="token keyword">const</span> IDF_PATH <span class="token operator">=</span> <span class="token string">"./dict/idf.utf8"</span><span class="token punctuation">;</span>
        <span class="token keyword">const</span> <span class="token keyword">char</span><span class="token operator">*</span> <span class="token keyword">const</span> STOP_WORD_PATH <span class="token operator">=</span> <span class="token string">"./dict/stop_words.utf8"</span><span class="token punctuation">;</span>
        <span class="token keyword">class</span> <span class="token class-name">JiebaTool</span>
        <span class="token punctuation">{<!-- --></span>
        <span class="token keyword">private</span><span class="token operator">:</span>
            <span class="token keyword">static</span> cppjieba<span class="token double-colon punctuation">::</span>Jieba jieba<span class="token punctuation">;</span>
    
        <span class="token keyword">public</span><span class="token operator">:</span>
            <span class="token keyword">static</span> <span class="token keyword">void</span> <span class="token function">SplitToWord</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>src<span class="token punctuation">,</span>std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span><span class="token operator">*</span> out<span class="token punctuation">)</span>
            <span class="token punctuation">{<!-- --></span>
                <span class="token comment">// Segment src with the jieba library and store the tokens in out</span>
                jieba<span class="token punctuation">.</span><span class="token function">CutForSearch</span><span class="token punctuation">(</span>src<span class="token punctuation">,</span><span class="token operator">*</span>out<span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token punctuation">}</span>
        <span class="token punctuation">}</span><span class="token punctuation">;</span>
    
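        <span class="token comment">// Static member definition. Because it lives in this header, include tool.hpp from only one .cc file</span>
        <span class="token comment">// (or move this definition into a .cc file) to avoid multiple-definition link errors.</span>
        cppjieba<span class="token double-colon punctuation">::</span>Jieba <span class="token class-name">JiebaTool</span><span class="token double-colon punctuation">::</span><span class="token function">jieba</span><span class="token punctuation">(</span>DICT_PATH<span class="token punctuation">,</span>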
                HMM_PATH<span class="token punctuation">,</span>
                USER_DICT_PATH<span class="token punctuation">,</span>
                IDF_PATH<span class="token punctuation">,</span>
                STOP_WORD_PATH<span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    </code></pre> 
    <p>The complete code for building the inverted index is then as follows:</p> 
    <pre data-index="35" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token keyword">private</span><span class="token operator">:</span>
        <span class="token keyword">bool</span> <span class="token function">BuildInvertedIndex</span><span class="token punctuation">(</span><span class="token keyword">const</span> DocInfo <span class="token operator">&</span>doc<span class="token punctuation">)</span>
        <span class="token punctuation">{<!-- --></span>
            <span class="token comment">// The forward index is already built; at this point DocInfo holds [title, content, url, doc_id]</span>
            <span class="token comment">// word -> inverted list (posting list)</span>
      
            <span class="token comment">// Frequency counts for each word within this document</span>
            <span class="token keyword">struct</span> <span class="token class-name">word_cnt</span>
            <span class="token punctuation">{<!-- --></span>
                <span class="token keyword">int</span> title_cnt<span class="token punctuation">;</span>
                <span class="token keyword">int</span> content_cnt<span class="token punctuation">;</span>
                <span class="token function">word_cnt</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token operator">:</span><span class="token function">title_cnt</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">,</span><span class="token function">content_cnt</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span>
                <span class="token punctuation">{<!-- --></span><span class="token punctuation">}</span>
            <span class="token punctuation">}</span><span class="token punctuation">;</span>
            std<span class="token double-colon punctuation">::</span>unordered_map<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string <span class="token punctuation">,</span> word_cnt<span class="token operator">></span> word_stat<span class="token punctuation">;</span><span class="token comment">// temporary map from each keyword to its counts</span>
      
    
            <span class="token comment">// Segment the title</span>
            std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> title_word<span class="token punctuation">;</span>
            ns_tool<span class="token double-colon punctuation">::</span><span class="token class-name">JiebaTool</span><span class="token double-colon punctuation">::</span><span class="token function">SplitToWord</span><span class="token punctuation">(</span>doc<span class="token punctuation">.</span>title<span class="token punctuation">,</span><span class="token operator">&</span>title_word<span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token comment">// Count word frequencies in the title</span>
            <span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">auto</span> s<span class="token operator">:</span>title_word<span class="token punctuation">)</span>
            <span class="token punctuation">{<!-- --></span>
                <span class="token comment">// Lowercase every title keyword so counting is case-insensitive (s is a copy, so the original token is untouched)</span>
                boost<span class="token double-colon punctuation">::</span><span class="token function">to_lower</span><span class="token punctuation">(</span>s<span class="token punctuation">)</span><span class="token punctuation">;</span>
                word_stat<span class="token punctuation">[</span>s<span class="token punctuation">]</span><span class="token punctuation">.</span>title_cnt<span class="token operator">++</span><span class="token punctuation">;</span>
            <span class="token punctuation">}</span>
      
            <span class="token comment">// Segment the content</span>
            std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> content_word<span class="token punctuation">;</span>
            ns_tool<span class="token double-colon punctuation">::</span><span class="token class-name">JiebaTool</span><span class="token double-colon punctuation">::</span><span class="token function">SplitToWord</span><span class="token punctuation">(</span>doc<span class="token punctuation">.</span>content<span class="token punctuation">,</span><span class="token operator">&</span>content_word<span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token comment">// Count word frequencies in the content</span>
            <span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">auto</span> s<span class="token operator">:</span>content_word<span class="token punctuation">)</span>
            <span class="token punctuation">{<!-- --></span>
                <span class="token comment">// Lowercase every content keyword so counting is case-insensitive (s is a copy, so the original token is untouched)</span>
                boost<span class="token double-colon punctuation">::</span><span class="token function">to_lower</span><span class="token punctuation">(</span>s<span class="token punctuation">)</span><span class="token punctuation">;</span>
                word_stat<span class="token punctuation">[</span>s<span class="token punctuation">]</span><span class="token punctuation">.</span>content_cnt<span class="token operator">++</span><span class="token punctuation">;</span>
            <span class="token punctuation">}</span>
      
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">define</span> <span class="token macro-name">X</span> <span class="token expression"><span class="token number">10</span></span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">define</span> <span class="token macro-name">Y</span> <span class="token expression"><span class="token number">1</span></span></span>
            <span class="token comment">// Add this doc to the inverted list of every keyword it contains</span>
            <span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">auto</span><span class="token operator">&</span>word_pair<span class="token operator">:</span>word_stat<span class="token punctuation">)</span>
            <span class="token punctuation">{<!-- --></span>
                InvertedElem elem<span class="token punctuation">;</span>
                elem<span class="token punctuation">.</span>doc_id<span class="token operator">=</span>doc<span class="token punctuation">.</span>doc_id<span class="token punctuation">;</span>
                elem<span class="token punctuation">.</span>word<span class="token operator">=</span>word_pair<span class="token punctuation">.</span>first<span class="token punctuation">;</span>
                <span class="token comment">//Custom relevance: each title hit counts X, each content hit counts Y</span>
                elem<span class="token punctuation">.</span>weight<span class="token operator">=</span>word_pair<span class="token punctuation">.</span>second<span class="token punctuation">.</span>title_cnt<span class="token operator">*</span>X<span class="token operator">+</span>word_pair<span class="token punctuation">.</span>second<span class="token punctuation">.</span>content_cnt<span class="token operator">*</span>Y<span class="token punctuation">;</span>
          
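                <span class="token comment">//Push the inverted element for this keyword onto its inverted list in the inverted index</span>
                <span class="token comment">//(keywords were lowercased when counting frequency, so the user's query words must also be lowercased before searching)</span>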
    
                InvertedList <span class="token operator">&</span>inverted_list<span class="token operator">=</span>inverted_index<span class="token punctuation">[</span>word_pair<span class="token punctuation">.</span>first<span class="token punctuation">]</span><span class="token punctuation">;</span>
                inverted_list<span class="token punctuation">.</span><span class="token function">push_back</span><span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span><span class="token function">move</span><span class="token punctuation">(</span>elem<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token punctuation">}</span>
            <span class="token keyword">return</span> <span class="token boolean">true</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
    </code></pre> 
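    <p>To make the weighting rule concrete, here is a minimal, self-contained sketch of the same idea. It is not the article's <code>Index</code> class: <code>WordCount</code>, <code>Posting</code>, <code>ToLower</code>, <code>BuildPostings</code>, <code>kTitleWeight</code> and <code>kContentWeight</code> are hypothetical names, the words are assumed to be segmented already, and <code>std::tolower</code> stands in for <code>boost::to_lower</code>.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the article's per-word counter and InvertedElem.
struct WordCount { int title_cnt = 0; int content_cnt = 0; };
struct Posting { int doc_id; std::string word; int weight; };

constexpr int kTitleWeight = 10;   // plays the role of X
constexpr int kContentWeight = 1;  // plays the role of Y

// ASCII-only lowercase copy; assumed substitute for boost::to_lower.
std::string ToLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Builds one posting per distinct (lowercased) word of a single document.
std::vector<Posting> BuildPostings(int doc_id,
                                   const std::vector<std::string>& title_words,
                                   const std::vector<std::string>& content_words) {
    assert(doc_id >= 0 && "this sketch assumes non-negative document ids");

    std::unordered_map<std::string, WordCount> stat;
    for (const auto& w : title_words) stat[ToLower(w)].title_cnt++;
    for (const auto& w : content_words) stat[ToLower(w)].content_cnt++;

    std::vector<Posting> postings;
    postings.reserve(stat.size());
    for (const auto& kv : stat) {
        const int weight = kv.second.title_cnt * kTitleWeight +
                           kv.second.content_cnt * kContentWeight;
        postings.push_back({doc_id, kv.first, weight});
    }
    return postings;  // order is unspecified because stat is an unordered_map
}
```

    <p>For a document whose title words are Boost, Filesystem and whose content words are boost, path, BOOST, the word boost gets 1*10 + 2*1 = 12, filesystem gets 10, and path gets 1.</p>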
    <h1><a name="t18"></a><a id="4___Searcher_1168"></a>4. Search Engine Module: Searcher</h1> 
    <p>Basic approach</p> 
    <pre data-index="36" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token comment">//searcher.hpp</span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"index.hpp"</span></span>
    
    <span class="token keyword">namespace</span> ns_searcher
    <span class="token punctuation">{<!-- --></span>
        <span class="token keyword">class</span> <span class="token class-name">Searcher</span>
        <span class="token punctuation">{<!-- --></span>
        <span class="token keyword">private</span><span class="token operator">:</span>
            ns_index<span class="token double-colon punctuation">::</span>Index <span class="token operator">*</span>index<span class="token punctuation">;</span>
        <span class="token keyword">public</span><span class="token operator">:</span>
            <span class="token keyword">void</span> <span class="token function">InitSearcher</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>input<span class="token punctuation">)</span>
            <span class="token punctuation">{<!-- --></span>
                <span class="token comment">//1. Obtain the index object (singleton)</span>
                <span class="token comment">//2. Build the index through that object</span>
            <span class="token punctuation">}</span>
    
            <span class="token comment">//Search</span>
            <span class="token comment">//json_string: output parameter holding the search results returned to the user's browser</span>
            <span class="token keyword">void</span> <span class="token function">Search</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> query<span class="token punctuation">,</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">*</span> json_string<span class="token punctuation">)</span>
            <span class="token punctuation">{<!-- --></span>
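                <span class="token comment">//1. [Tokenize]: the query must also be segmented on the server side before looking up the index</span>
                <span class="token comment">//2. [Trigger]: look up the index for each segmented word</span>
                <span class="token comment">//3. [Merge and sort]: gather the lookup results and sort them by relevance (weight) in descending order</span>
                <span class="token comment">//4. [Build]: serialize the sorted results into a JSON string with jsoncpp</span>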
            <span class="token punctuation">}</span>
        <span class="token punctuation">}</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    </code></pre> 
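    <h2><a name="t19"></a><a id="__InitSearcher_1202"></a>Initializing the searcher: InitSearcher</h2> 
    <p>This function does two things: it obtains the index object and then builds the index.</p> 
    <p><code>Index</code> is a singleton, so the object is obtained through its instance accessor (spelled <code>Getinstance</code> in the code below).</p> 
    <p><code>BuildIndex</code> is then called to build the index.</p> 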
    <pre data-index="37" class="prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token keyword">void</span> <span class="token function">InitSearcher</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>input<span class="token punctuation">)</span>
    <span class="token punctuation">{<!-- --></span>
        <span class="token comment">//1. Obtain the index object (singleton)</span>
        index<span class="token operator">=</span>ns_index<span class="token double-colon punctuation">::</span><span class="token class-name">Index</span><span class="token double-colon punctuation">::</span><span class="token function">Getinstance</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
        std<span class="token double-colon punctuation">::</span>cout<span class="token operator"><<</span><span class="token string">"index singleton created..."</span><span class="token operator"><<</span>std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span>
        <span class="token comment">//2. Build the index through that object (input is the path of the tag-stripped, cleaned data file)</span>
        index<span class="token operator">-></span><span class="token function">BuildIndex</span><span class="token punctuation">(</span>input<span class="token punctuation">)</span><span class="token punctuation">;</span>
        std<span class="token double-colon punctuation">::</span>cout<span class="token operator"><<</span><span class="token string">"index built..."</span><span class="token operator"><<</span>std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    </code></pre> 
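    <h2><a name="t20"></a><a id="__Search_1222"></a>Search function: Search</h2> 
    <ul><li> <p>[Tokenize]</p> <p>The user's query is segmented with <code>SplitToWord</code>, the same jieba-based helper used when building the index.</p> </li><li> <p>[Trigger]</p> <p><code>GetInvertedList()</code> is called for each resulting word to fetch that word's inverted list.</p> </li><li> <p>[Merge and sort]</p> <p>All inverted elements from those lists are gathered (elements that share a document ID are deduplicated) and sorted by weight in descending order.</p> </li><li> <p>[Build]<br> Each inverted element is resolved through the forward index to its document, and an excerpt is taken from that document's content. Once all documents are collected, the JSON library serializes them into a string for network transmission.</p> <p>How much of content to excerpt is a rule we define ourselves: find pos, the first occurrence of the keyword in content, then take 50 bytes before it (or start from begin if fewer than 50 are available) and 100 bytes after it (or stop at end if fewer are available).</p> </li></ul> 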
    <p><img src="https://1000bd.com/contentImg/2022/08/14/024951637.png" alt="在这里插入图片描述"></p> 
    <h3><a name="t21"></a><a id="json_1241"></a>Installing the JSON library and a usage example</h3> 
    <pre data-index="38" class="prettyprint"><code class="prism language-bash has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token function">sudo</span> yum <span class="token function">install</span> -y jsoncpp-devel
    </code></pre> 
    <p>Using the JSON library</p> 
    <pre data-index="39" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><iostream></span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><string></span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><jsoncpp/json/json.h></span></span>
    
    <span class="token comment">//Value, Reader (deserialization), Writer (serialization)</span>
    <span class="token keyword">int</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token punctuation">{<!-- --></span>
        Json<span class="token double-colon punctuation">::</span>Value root<span class="token punctuation">;</span>
        Json<span class="token double-colon punctuation">::</span>Value item1<span class="token punctuation">;</span>
        item1<span class="token punctuation">[</span><span class="token string">"key1"</span><span class="token punctuation">]</span><span class="token operator">=</span><span class="token string">"value11"</span><span class="token punctuation">;</span>
        item1<span class="token punctuation">[</span><span class="token string">"key2"</span><span class="token punctuation">]</span><span class="token operator">=</span><span class="token string">"value12"</span><span class="token punctuation">;</span>
    
        Json<span class="token double-colon punctuation">::</span>Value item2<span class="token punctuation">;</span>
        item2<span class="token punctuation">[</span><span class="token string">"key1"</span><span class="token punctuation">]</span><span class="token operator">=</span><span class="token string">"value21"</span><span class="token punctuation">;</span>
        item2<span class="token punctuation">[</span><span class="token string">"key2"</span><span class="token punctuation">]</span><span class="token operator">=</span><span class="token string">"value22"</span><span class="token punctuation">;</span>
    
        root<span class="token punctuation">.</span><span class="token function">append</span><span class="token punctuation">(</span>item1<span class="token punctuation">)</span><span class="token punctuation">;</span>
        root<span class="token punctuation">.</span><span class="token function">append</span><span class="token punctuation">(</span>item2<span class="token punctuation">)</span><span class="token punctuation">;</span>
    
        Json<span class="token double-colon punctuation">::</span>StyledWriter writer<span class="token punctuation">;</span>
        <span class="token comment">//Json::FastWriter writer;</span>
        std<span class="token double-colon punctuation">::</span>string s<span class="token operator">=</span>writer<span class="token punctuation">.</span><span class="token function">write</span><span class="token punctuation">(</span>root<span class="token punctuation">)</span><span class="token punctuation">;</span>
        std<span class="token double-colon punctuation">::</span>cout<span class="token operator"><<</span>s<span class="token operator"><<</span>std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span>
        <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    </code></pre> 
    <p><img src="https://1000bd.com/contentImg/2022/08/14/024951799.png" alt="在这里插入图片描述"></p> 
    <h3><a name="t22"></a><a id="Search__1280"></a>Complete Search code</h3> 
    <pre data-index="40" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token keyword">public</span><span class="token operator">:</span>
        <span class="token comment">//Search</span>
        <span class="token comment">//json_string: output parameter holding the search results returned to the user's browser</span>
        <span class="token keyword">void</span> <span class="token function">Search</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> query<span class="token punctuation">,</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">*</span> json_string<span class="token punctuation">)</span>
        <span class="token punctuation">{<!-- --></span>
            <span class="token comment">//1. [Tokenize]: the query must also be segmented on the server side before looking up the index</span>
            std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> words<span class="token punctuation">;</span>
            ns_tool<span class="token double-colon punctuation">::</span><span class="token class-name">JiebaTool</span><span class="token double-colon punctuation">::</span><span class="token function">SplitToWord</span><span class="token punctuation">(</span>query<span class="token punctuation">,</span><span class="token operator">&</span>words<span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token comment">//2. [Trigger]: look up the index for each segmented word; matching is case-insensitive, so each word is lowercased first</span>
            ns_index<span class="token double-colon punctuation">::</span>InvertedList inverted_list_all<span class="token punctuation">;</span>
            <span class="token keyword">for</span><span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span>string word<span class="token operator">:</span>words<span class="token punctuation">)</span>
            <span class="token punctuation">{<!-- --></span>
                boost<span class="token double-colon punctuation">::</span><span class="token function">to_lower</span><span class="token punctuation">(</span>word<span class="token punctuation">)</span><span class="token punctuation">;</span>
                <span class="token comment">//Fetch the inverted list for this word</span>
                ns_index<span class="token double-colon punctuation">::</span>InvertedList <span class="token operator">*</span>inverted_list<span class="token operator">=</span>index<span class="token operator">-></span><span class="token function">GetInvertedList</span><span class="token punctuation">(</span>word<span class="token punctuation">)</span><span class="token punctuation">;</span>
                <span class="token comment">//Skip words that have no inverted list</span>
                <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token keyword">nullptr</span><span class="token operator">==</span>inverted_list<span class="token punctuation">)</span>
                <span class="token punctuation">{<!-- --></span>
                    <span class="token keyword">continue</span><span class="token punctuation">;</span>
                <span class="token punctuation">}</span>
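                <span class="token comment">//Append this word's inverted elements to the combined list</span>
                <span class="token comment">//Known limitation: when several query words occur in the same document, that document ID appears in the combined list more than once</span>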
                inverted_list_all<span class="token punctuation">.</span><span class="token function">insert</span><span class="token punctuation">(</span>inverted_list_all<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span>inverted_list<span class="token operator">-></span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span>inverted_list<span class="token operator">-></span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token punctuation">}</span>
        
            <span class="token comment">//3. [Merge and sort]: sort the combined results by relevance (weight) in descending order</span>
            std<span class="token double-colon punctuation">::</span><span class="token function">sort</span><span class="token punctuation">(</span>inverted_list_all<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span>inverted_list_all<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span><span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token punctuation">(</span><span class="token keyword">const</span> ns_index<span class="token double-colon punctuation">::</span>InvertedElem<span class="token operator">&</span> e1<span class="token punctuation">,</span><span class="token keyword">const</span> ns_index<span class="token double-colon punctuation">::</span>InvertedElem<span class="token operator">&</span> e2<span class="token punctuation">)</span><span class="token operator">-></span><span class="token keyword">bool</span><span class="token punctuation">{<!-- --></span>
                <span class="token keyword">return</span> e1<span class="token punctuation">.</span>weight<span class="token operator">></span>e2<span class="token punctuation">.</span>weight<span class="token punctuation">;</span>
                <span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    
            <span class="token comment">//4. [Build]: serialize the results into a JSON string; jsoncpp handles serialization and deserialization</span>
            Json<span class="token double-colon punctuation">::</span>Value root<span class="token punctuation">;</span>
            <span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">auto</span><span class="token operator">&</span> item<span class="token operator">:</span>inverted_list_all<span class="token punctuation">)</span>
            <span class="token punctuation">{<!-- --></span>
                <span class="token comment">//Fetch the document through the forward index</span>
                ns_index<span class="token double-colon punctuation">::</span>DocInfo<span class="token operator">*</span> doc<span class="token operator">=</span>index<span class="token operator">-></span><span class="token function">GetForwardIndex</span><span class="token punctuation">(</span>item<span class="token punctuation">.</span>doc_id<span class="token punctuation">)</span><span class="token punctuation">;</span>
                <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token keyword">nullptr</span><span class="token operator">==</span>doc<span class="token punctuation">)</span>
                <span class="token punctuation">{<!-- --></span>
                    <span class="token keyword">continue</span><span class="token punctuation">;</span>
                <span class="token punctuation">}</span>
                Json<span class="token double-colon punctuation">::</span>Value elem<span class="token punctuation">;</span>
                elem<span class="token punctuation">[</span><span class="token string">"title"</span><span class="token punctuation">]</span><span class="token operator">=</span>doc<span class="token operator">-></span>title<span class="token punctuation">;</span>
                <span class="token comment">//content is the tag-stripped document text; it is too long to return whole, so GetAbstract extracts an excerpt</span>
                elem<span class="token punctuation">[</span><span class="token string">"abstract"</span><span class="token punctuation">]</span><span class="token operator">=</span><span class="token function">GetAbstract</span><span class="token punctuation">(</span>doc<span class="token operator">-></span>content<span class="token punctuation">,</span>item<span class="token punctuation">.</span>word<span class="token punctuation">)</span><span class="token punctuation">;</span>
                elem<span class="token punctuation">[</span><span class="token string">"url"</span><span class="token punctuation">]</span><span class="token operator">=</span>doc<span class="token operator">-></span>url<span class="token punctuation">;</span>
    
                <span class="token comment">//for debug: check that the results are sorted by weight in descending order</span>
                elem<span class="token punctuation">[</span><span class="token string">"doc_id"</span><span class="token punctuation">]</span><span class="token operator">=</span><span class="token punctuation">(</span><span class="token keyword">int</span><span class="token punctuation">)</span>item<span class="token punctuation">.</span>doc_id<span class="token punctuation">;</span>
                elem<span class="token punctuation">[</span><span class="token string">"weight"</span><span class="token punctuation">]</span><span class="token operator">=</span>item<span class="token punctuation">.</span>weight<span class="token punctuation">;</span>
    
                root<span class="token punctuation">.</span><span class="token function">append</span><span class="token punctuation">(</span>elem<span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token punctuation">}</span>
    
            Json<span class="token double-colon punctuation">::</span>StyledWriter writer<span class="token punctuation">;</span>
            <span class="token operator">*</span>json_string<span class="token operator">=</span>writer<span class="token punctuation">.</span><span class="token function">write</span><span class="token punctuation">(</span>root<span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
    </code></pre> 
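    <p>The <code>Search</code> code above appends every inverted element to one list, so a document that matches several query words is returned more than once. One possible way to merge by document ID before sorting is sketched below. This is an assumption about how the deduplication could be done, not the article's implementation: <code>Posting</code> is a hypothetical stand-in for <code>ns_index::InvertedElem</code>, and <code>MergedHit</code> and <code>MergeByDoc</code> are made-up names. Weights of the same document are summed, and ties are broken by document ID so the order is deterministic.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for ns_index::InvertedElem.
struct Posting { uint64_t doc_id; std::string word; int weight; };

// One merged result per document: summed weight plus every word that hit it.
struct MergedHit {
    uint64_t doc_id = 0;
    int weight = 0;
    std::vector<std::string> words;
};

// Collapses postings that share a doc_id, then sorts by weight descending.
std::vector<MergedHit> MergeByDoc(const std::vector<Posting>& postings) {
    std::unordered_map<uint64_t, MergedHit> by_doc;
    for (const auto& p : postings) {
        MergedHit& hit = by_doc[p.doc_id];  // default-constructed on first use
        hit.doc_id = p.doc_id;
        hit.weight += p.weight;
        hit.words.push_back(p.word);
    }

    std::vector<MergedHit> merged;
    merged.reserve(by_doc.size());
    for (auto& kv : by_doc) merged.push_back(std::move(kv.second));

    auto by_weight_desc = [](const MergedHit& a, const MergedHit& b) {
        if (a.weight != b.weight) return a.weight > b.weight;
        return a.doc_id < b.doc_id;  // deterministic order for equal weights
    };
    std::sort(merged.begin(), merged.end(), by_weight_desc);
    assert(std::is_sorted(merged.begin(), merged.end(), by_weight_desc));
    return merged;
}
```

    <p>Note that if the segmented query contains the same word twice, its postings would be summed twice; deduplicating the query words first avoids that. Any of the stored words can then be passed to <code>GetAbstract</code> when building the excerpt.</p>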
    <p>Extracting the abstract</p> 
    <pre data-index="41" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token keyword">public</span><span class="token operator">:</span>
        std<span class="token double-colon punctuation">::</span>string <span class="token function">GetAbstract</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> html_content<span class="token punctuation">,</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> word<span class="token punctuation">)</span>
        <span class="token punctuation">{<!-- --></span>
            <span class="token comment">// Find the first occurrence of word in html_content, then take the text from</span>
            <span class="token comment">// 50 bytes before it (or from the beginning) to 100 bytes after it (or to the end).</span>
        
            <span class="token keyword">const</span> <span class="token keyword">int</span> prev_step<span class="token operator">=</span><span class="token number">50</span><span class="token punctuation">;</span>
            <span class="token keyword">const</span> <span class="token keyword">int</span> post_step<span class="token operator">=</span><span class="token number">100</span><span class="token punctuation">;</span>
            <span class="token comment">// 1. Find the first occurrence with std::search, comparing bytes case-insensitively.</span>
            <span class="token comment">//    The lambda takes unsigned char because calling std::tolower on a negative char value is undefined behavior.</span>
            <span class="token keyword">auto</span> iter<span class="token operator">=</span>std<span class="token double-colon punctuation">::</span><span class="token function">search</span><span class="token punctuation">(</span>html_content<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span>html_content<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span>word<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span>word<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span><span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token punctuation">(</span><span class="token keyword">unsigned</span> <span class="token keyword">char</span> a<span class="token punctuation">,</span><span class="token keyword">unsigned</span> <span class="token keyword">char</span> b<span class="token punctuation">)</span><span class="token punctuation">{</span>
                <span class="token keyword">return</span> <span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span><span class="token function">tolower</span><span class="token punctuation">(</span>a<span class="token punctuation">)</span><span class="token operator">==</span>std<span class="token double-colon punctuation">::</span><span class="token function">tolower</span><span class="token punctuation">(</span>b<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
                <span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token keyword">if</span><span class="token punctuation">(</span>iter<span class="token operator">==</span>html_content<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
            <span class="token punctuation">{<!-- --></span>
                <span class="token keyword">return</span> <span class="token string">"Not Found"</span><span class="token punctuation">;</span>
            <span class="token punctuation">}</span>
            <span class="token keyword">int</span> pos<span class="token operator">=</span>std<span class="token double-colon punctuation">::</span><span class="token function">distance</span><span class="token punctuation">(</span>html_content<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span>iter<span class="token punctuation">)</span><span class="token punctuation">;</span>
    
            <span class="token comment">// 2. Compute the start and last positions of the window</span>
            <span class="token keyword">int</span> start<span class="token operator">=</span><span class="token number">0</span><span class="token punctuation">;</span>
            <span class="token keyword">int</span> last<span class="token operator">=</span>html_content<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token operator">-</span><span class="token number">1</span><span class="token punctuation">;</span>
            <span class="token comment">// If more than 50 bytes precede pos, move start forward</span>
            <span class="token keyword">if</span><span class="token punctuation">(</span>pos<span class="token operator">></span>start<span class="token operator">+</span>prev_step<span class="token punctuation">)</span>
            <span class="token punctuation">{<!-- --></span>
                start<span class="token operator">=</span>pos<span class="token operator">-</span>prev_step<span class="token punctuation">;</span>
            <span class="token punctuation">}</span>
            <span class="token comment">// If more than 100 bytes follow pos, move last back</span>
            <span class="token keyword">if</span><span class="token punctuation">(</span>pos<span class="token operator">+</span>post_step<span class="token operator"><</span>last<span class="token punctuation">)</span>
            <span class="token punctuation">{<!-- --></span>
                last<span class="token operator">=</span>pos<span class="token operator">+</span>post_step<span class="token punctuation">;</span>
            <span class="token punctuation">}</span>
            <span class="token comment">// 3. Return the substring [start, last]; last is an inclusive index, hence the +1</span>
            <span class="token keyword">if</span><span class="token punctuation">(</span>start<span class="token operator">&gt;</span>last<span class="token punctuation">)</span> <span class="token keyword">return</span> <span class="token string">"None"</span><span class="token punctuation">;</span> 
            <span class="token keyword">return</span> html_content<span class="token punctuation">.</span><span class="token function">substr</span><span class="token punctuation">(</span>start<span class="token punctuation">,</span>last<span class="token operator">-</span>start<span class="token operator">+</span><span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span> 
    </code></pre> 
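    <p>The windowing logic above can also be tried in isolation. The following is a minimal sketch, not the project's code: <code>MakeAbstract</code> is a hypothetical free function, the two step sizes are passed as parameters, and it returns an empty string instead of "Not Found" when the word is absent. All of these are assumptions made for the illustration.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical helper (not part of the original Searcher class): returns a
// window of text around the first case-insensitive match of `word`,
// or an empty string when `word` does not occur in `content`.
std::string MakeAbstract(const std::string& content, const std::string& word,
                         std::size_t prev_step = 50, std::size_t post_step = 100)
{
    assert(!word.empty());  // std::search would "match" an empty word at position 0

    // Compare bytes as unsigned char so std::tolower never sees a negative value.
    auto iter = std::search(content.begin(), content.end(), word.begin(), word.end(),
                            [](unsigned char a, unsigned char b) {
                                return std::tolower(a) == std::tolower(b);
                            });
    if (iter == content.end()) return "";

    const std::size_t pos =
        static_cast<std::size_t>(std::distance(content.begin(), iter));
    // Clamp the window to the string: at most prev_step bytes before the match...
    const std::size_t start = pos > prev_step ? pos - prev_step : 0;
    // ...and up to post_step bytes counted from the start of the match.
    // `end` is one past the last byte kept.
    const std::size_t end = std::min(content.size(), pos + post_step);
    return content.substr(start, end - start);
}
```

    <p>For example, <code>MakeAbstract("0123456789", "5", 2, 3)</code> yields <code>34567</code>. Because the window is cut by bytes, a multi-byte UTF-8 character at either edge can be split; this is true of the byte-based approach in general, not only of this sketch.</p>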
    <h2><a name="t23"></a><a id="_1381"></a>Testing</h2> 
    <p>Before building the network module, we can test locally whether searching for a keyword returns the expected results:</p> 
    <pre data-index="42" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token comment">//debug.cc</span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"searcher.hpp"</span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">&lt;iostream&gt;</span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">&lt;cstdio&gt;</span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">&lt;string&gt;</span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">&lt;cstring&gt;</span></span>
    <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string input<span class="token operator">=</span><span class="token string">"data/raw_html/raw.txt"</span><span class="token punctuation">;</span>
    <span class="token keyword">int</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token punctuation">{<!-- --></span>
        <span class="token comment">//for test</span>
        ns_searcher<span class="token double-colon punctuation">::</span>Searcher <span class="token operator">*</span>search<span class="token operator">=</span><span class="token keyword">new</span> ns_searcher<span class="token double-colon punctuation">::</span>Searcher<span class="token punctuation">;</span>
        search<span class="token operator">-></span><span class="token function">InitSearcher</span><span class="token punctuation">(</span>input<span class="token punctuation">)</span><span class="token punctuation">;</span>
    
        std<span class="token double-colon punctuation">::</span>string query<span class="token punctuation">;</span>
        <span class="token keyword">char</span> buffer<span class="token punctuation">[</span><span class="token number">1024</span><span class="token punctuation">]</span><span class="token punctuation">;</span>
        <span class="token keyword">while</span><span class="token punctuation">(</span><span class="token boolean">true</span><span class="token punctuation">)</span>
        <span class="token punctuation">{<!-- --></span>
            std<span class="token double-colon punctuation">::</span>cout<span class="token operator"><<</span><span class="token string">"please enter the query"</span><span class="token operator"><<</span>std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span>
            <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token function">fgets</span><span class="token punctuation">(</span>buffer<span class="token punctuation">,</span><span class="token keyword">sizeof</span><span class="token punctuation">(</span>buffer<span class="token punctuation">)</span><span class="token punctuation">,</span><span class="token constant">stdin</span><span class="token punctuation">)</span><span class="token operator">==</span><span class="token keyword">nullptr</span><span class="token punctuation">)</span> <span class="token keyword">break</span><span class="token punctuation">;</span><span class="token comment">// stop on EOF or a read error</span>
            buffer<span class="token punctuation">[</span><span class="token function">strcspn</span><span class="token punctuation">(</span>buffer<span class="token punctuation">,</span><span class="token string">"\n"</span><span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token operator">=</span><span class="token number">0</span><span class="token punctuation">;</span><span class="token comment">// strip the trailing newline, if there is one</span>
            query<span class="token operator">=</span>buffer<span class="token punctuation">;</span>
          
            std<span class="token double-colon punctuation">::</span>string ans<span class="token punctuation">;</span>
            search<span class="token operator">-></span><span class="token function">Search</span><span class="token punctuation">(</span>query<span class="token punctuation">,</span><span class="token operator">&</span>ans<span class="token punctuation">)</span><span class="token punctuation">;</span>
            std<span class="token double-colon punctuation">::</span>cout<span class="token operator"><<</span>ans<span class="token operator"><<</span>std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    </code></pre> 
    <h1><a name="t24"></a><a id="5___http_server__1416"></a>5. Building the server: the http_server module</h1> 
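    <p>The cpp-httplib library: https://gitee.com/sumert/cpp-httplib/tree/v0.7.15</p> 
    <p>(If the link no longer works, search for <code>cpp-httplib</code> on gitee.)</p> 
    <p>Note: cpp-httplib must be built with a reasonably recent gcc; otherwise compilation fails.</p> 
    <p>The cloud server we use ships with gcc 4.8.5 by default:</p> 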
    <pre data-index="43" class="prettyprint"><code class="prism language-bash has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token punctuation">[</span>sjl@VM-16-6-centos ~<span class="token punctuation">]</span>$ gcc -v
    Using built-in specs.
    <span class="token assign-left variable">COLLECT_GCC</span><span class="token operator">=</span>gcc
    <span class="token assign-left variable">COLLECT_LTO_WRAPPER</span><span class="token operator">=</span>/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
    Target: x86_64-redhat-linux
    Configured with: <span class="token punctuation">..</span>/configure --prefix<span class="token operator">=</span>/usr --mandir<span class="token operator">=</span>/usr/share/man --infodir<span class="token operator">=</span>/usr/share/info --with-bugurl<span class="token operator">=</span>http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads<span class="token operator">=</span>posix --enable-checking<span class="token operator">=</span>release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style<span class="token operator">=</span>gnu --enable-languages<span class="token operator">=</span>c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl<span class="token operator">=</span>/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog<span class="token operator">=</span>/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune<span class="token operator">=</span>generic --with-arch_32<span class="token operator">=</span>x86-64 --build<span class="token operator">=</span>x86_64-redhat-linux
    Thread model: posix
    gcc version <span class="token number">4.8</span>.5 <span class="token number">20150623</span> <span class="token punctuation">(</span>Red Hat <span class="token number">4.8</span>.5-44<span class="token punctuation">)</span> <span class="token punctuation">(</span>GCC<span class="token punctuation">)</span> 
    </code></pre> 
    <p><strong>So we need to upgrade gcc:</strong></p> 
    <p><a href="https://segmentfault.com/a/1190000019557540">Upgrading/installing gcc on CentOS 7</a></p> 
    <pre data-index="44" class="prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token comment">// Install scl</span>
    <span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos <span class="token operator">~</span><span class="token punctuation">]</span>$ sudo yum install centos<span class="token operator">-</span>release<span class="token operator">-</span>scl scl<span class="token operator">-</span>utils<span class="token operator">-</span>build
    
    <span class="token comment">// Install the newer gcc and g++ (the g++ package is named devtoolset-7-gcc-c++)</span>
    <span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos <span class="token operator">~</span><span class="token punctuation">]</span>$ sudo yum install <span class="token operator">-</span>y devtoolset<span class="token operator">-</span><span class="token number">7</span><span class="token operator">-</span>gcc devtoolset<span class="token operator">-</span><span class="token number">7</span><span class="token operator">-</span>gcc<span class="token operator">-</span>c<span class="token operator">++</span>
    
    <span class="token comment">// List the installed toolsets</span>
    <span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos <span class="token operator">~</span><span class="token punctuation">]</span>$ ls <span class="token operator">/</span>opt<span class="token operator">/</span>rh
    devtoolset<span class="token operator">-</span><span class="token number">7</span>
    </code></pre> 
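    <p>This does not replace the system's default gcc, so the new toolset has to be enabled manually.</p> 
    <p>Enabling it from the command line only lasts for the current session.</p> 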
    <pre data-index="45" class="prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos <span class="token operator">~</span><span class="token punctuation">]</span>$ scl enable devtoolset<span class="token operator">-</span><span class="token number">7</span> bash
    <span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos <span class="token operator">~</span><span class="token punctuation">]</span>$ gcc <span class="token operator">-</span>v
    </code></pre> 
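    <p>To confirm from within a program which compiler actually built it, the GCC version macros can be inspected. This is a small sketch, not part of the project: <code>VersionAtLeast</code> and <code>CompilerVersion</code> are hypothetical helpers, and using 7.0 as the target is an assumption taken from the devtoolset-7 name, not a documented cpp-httplib requirement.</p>

```cpp
#include <string>

// Hypothetical helper: true when version (major, minor) is at least
// (req_major, req_minor).
bool VersionAtLeast(int major, int minor, int req_major, int req_minor)
{
    if (major != req_major) return major > req_major;
    return minor >= req_minor;
}

// Returns the version reported by the compiler building this file as
// "major.minor.patch", or "unknown" when the GCC version macros are not
// defined. Clang defines these macros too, for compatibility.
std::string CompilerVersion()
{
#if defined(__GNUC__)
    return std::to_string(__GNUC__) + "." + std::to_string(__GNUC_MINOR__) + "." +
           std::to_string(__GNUC_PATCHLEVEL__);
#else
    return "unknown";
#endif
}
```

    <p>In a program built after <code>scl enable devtoolset-7 bash</code>, <code>VersionAtLeast(__GNUC__, __GNUC_MINOR__, 7, 0)</code> is expected to be true, whereas under the stock 4.8.5 toolchain it is false. With gcc 4.8.5 the sketch itself needs <code>-std=c++11</code> because of <code>std::to_string</code>.</p>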
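    <p>To make this permanent, the command has to run automatically at login. Add the following line to <code>~/.bash_profile</code>:</p> 
    <p><code>scl enable devtoolset-7 bash</code></p> 
    <p>Note that this line starts a nested bash inside every login shell; <code>source /opt/rh/devtoolset-7/enable</code> is a common alternative that enables the toolset in the current shell instead.</p> 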
    <pre data-index="46" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos <span class="token operator">~</span><span class="token punctuation">]</span>$ vim <span class="token operator">~</span><span class="token operator">/</span><span class="token punctuation">.</span>bash_profile 
    <span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos <span class="token operator">~</span><span class="token punctuation">]</span>$ cat <span class="token operator">~</span><span class="token operator">/</span><span class="token punctuation">.</span>bash_profile 
    # <span class="token punctuation">.</span>bash_profile
    
    <span class="token macro property"><span class="token directive-hash">#</span> <span class="token expression">Get the aliases <span class="token operator">and</span> functions</span></span>
    <span class="token keyword">if</span> <span class="token punctuation">[</span> <span class="token operator">-</span>f <span class="token operator">~</span><span class="token operator">/</span><span class="token punctuation">.</span>bashrc <span class="token punctuation">]</span><span class="token punctuation">;</span> then
    	<span class="token punctuation">.</span> <span class="token operator">~</span><span class="token operator">/</span><span class="token punctuation">.</span>bashrc
    fi
    
    <span class="token macro property"><span class="token directive-hash">#</span> <span class="token expression">User specific environment <span class="token operator">and</span> startup programs</span></span>
    
    PATH<span class="token operator">=</span>$PATH<span class="token operator">:</span>$HOME<span class="token operator">/</span><span class="token punctuation">.</span>local<span class="token operator">/</span>bin<span class="token operator">:</span>$HOME<span class="token operator">/</span>bin
    
    <span class="token keyword">export</span> PATH
    
    
    # Run this scl command at every login
    scl enable devtoolset<span class="token operator">-</span><span class="token number">7</span> bash
    </code></pre> 
    <p><strong>Installing <code>cpp-httplib</code></strong></p> 
    <p>If gcc is not particularly recent, the library may hit run-time errors.</p> 
    <p>We therefore recommend <a href="https://gitee.com/sumert/cpp-httplib/tree/v0.7.15">cpp-httplib 0.7.15</a>.</p> 
    <p>Follow the link to download it:</p> 
    <p><img src="https://1000bd.com/contentImg/2022/08/14/024951799.png" alt="在这里插入图片描述"></p> 
    <p>Put the archive in the <code>thirdpart</code> folder and extract it with <code>unzip</code>:</p> 
    <pre data-index="47" class="prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos thirdpart<span class="token punctuation">]</span>$ ll
    total <span class="token number">8</span>
    drwxrwxr<span class="token operator">-</span>x <span class="token number">6</span> sjl sjl <span class="token number">4096</span> Jul <span class="token number">28</span> <span class="token number">15</span><span class="token operator">:</span><span class="token number">50</span> cpp<span class="token operator">-</span>httplib<span class="token operator">-</span>v0<span class="token punctuation">.</span><span class="token number">7.15</span>
    drwxrwxr<span class="token operator">-</span>x <span class="token number">8</span> sjl sjl <span class="token number">4096</span> Jul <span class="token number">23</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">45</span> cppjieba
    <span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos thirdpart<span class="token punctuation">]</span>$ 
    </code></pre> 
    <p>Create a symbolic link in the project folder:</p> 
    <pre data-index="48" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ ln <span class="token operator">-</span>s <span class="token operator">~</span><span class="token operator">/</span>thirdpart<span class="token operator">/</span>cpp<span class="token operator">-</span>httplib<span class="token operator">-</span>v0<span class="token punctuation">.</span><span class="token number">7.15</span><span class="token operator">/</span> cpp<span class="token operator">-</span>httplib
    <span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ ll
    total <span class="token number">1532</span>
    drwxr<span class="token operator">-</span>xr<span class="token operator">-</span>x <span class="token number">8</span> sjl sjl   <span class="token number">4096</span> Apr  <span class="token number">7</span> <span class="token number">05</span><span class="token operator">:</span><span class="token number">33</span> boost_1_79_0
    lrwxrwxrwx <span class="token number">1</span> sjl sjl     <span class="token number">40</span> Jul <span class="token number">28</span> <span class="token number">15</span><span class="token operator">:</span><span class="token number">54</span> cpp<span class="token operator">-</span>httplib <span class="token operator">-></span> <span class="token operator">/</span>home<span class="token operator">/</span>sjl<span class="token operator">/</span>thirdpart<span class="token operator">/</span>cpp<span class="token operator">-</span>httplib<span class="token operator">-</span>v0<span class="token punctuation">.</span><span class="token number">7.15</span><span class="token operator">/</span>
    lrwxrwxrwx <span class="token number">1</span> sjl sjl     <span class="token number">46</span> Jul <span class="token number">23</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">46</span> cppjieba <span class="token operator">-></span> <span class="token operator">/</span>home<span class="token operator">/</span>sjl<span class="token operator">/</span>thirdpart<span class="token operator">/</span>cppjieba<span class="token operator">/</span>include<span class="token operator">/</span>cppjieba<span class="token operator">/</span>
    drwxrwxr<span class="token operator">-</span>x <span class="token number">4</span> sjl sjl   <span class="token number">4096</span> Jul <span class="token number">19</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">37</span> data
    <span class="token operator">-</span>rwxrwxr<span class="token operator">-</span>x <span class="token number">1</span> sjl sjl <span class="token number">608144</span> Jul <span class="token number">28</span> <span class="token number">12</span><span class="token operator">:</span><span class="token number">44</span> debug
    <span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl    <span class="token number">640</span> Jul <span class="token number">28</span> <span class="token number">01</span><span class="token operator">:</span><span class="token number">05</span> debug<span class="token punctuation">.</span>cc
    lrwxrwxrwx <span class="token number">1</span> sjl sjl     <span class="token number">34</span> Jul <span class="token number">23</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">47</span> dict <span class="token operator">-></span> <span class="token operator">/</span>home<span class="token operator">/</span>sjl<span class="token operator">/</span>thirdpart<span class="token operator">/</span>cppjieba<span class="token operator">/</span>dict<span class="token operator">/</span>
    <span class="token operator">-</span>rwxrwxr<span class="token operator">-</span>x <span class="token number">1</span> sjl sjl <span class="token number">409408</span> Jul <span class="token number">28</span> <span class="token number">12</span><span class="token operator">:</span><span class="token number">44</span> http_server
    <span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl     <span class="token number">58</span> Jul <span class="token number">28</span> <span class="token number">12</span><span class="token operator">:</span><span class="token number">44</span> http_server<span class="token punctuation">.</span>cc
    <span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl   <span class="token number">7489</span> Jul <span class="token number">27</span> <span class="token number">16</span><span class="token operator">:</span><span class="token number">08</span> index<span class="token punctuation">.</span>hpp
    <span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl    <span class="token number">360</span> Jul <span class="token number">28</span> <span class="token number">12</span><span class="token operator">:</span><span class="token number">44</span> Makefile
    <span class="token operator">-</span>rwxrwxr<span class="token operator">-</span>x <span class="token number">1</span> sjl sjl <span class="token number">492840</span> Jul <span class="token number">28</span> <span class="token number">12</span><span class="token operator">:</span><span class="token number">44</span> parser
    <span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl   <span class="token number">6088</span> Jul <span class="token number">22</span> <span class="token number">12</span><span class="token operator">:</span><span class="token number">31</span> parser<span class="token punctuation">.</span>cc
    <span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl   <span class="token number">4654</span> Jul <span class="token number">28</span> <span class="token number">00</span><span class="token operator">:</span><span class="token number">17</span> searcher<span class="token punctuation">.</span>hpp
    drwxrwxr<span class="token operator">-</span>x <span class="token number">3</span> sjl sjl   <span class="token number">4096</span> Jul <span class="token number">28</span> <span class="token number">15</span><span class="token operator">:</span><span class="token number">47</span> test
    <span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl   <span class="token number">2047</span> Jul <span class="token number">27</span> <span class="token number">00</span><span class="token operator">:</span><span class="token number">43</span> tool<span class="token punctuation">.</span>hpp
    <span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ 
    
    </code></pre> 
    <p>Create the web root directory (it will later hold the home page and other resources), then create an HTML file inside WWWROOT:</p> 
    <pre data-index="49" class="prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ mkdir WWWROOT
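    <span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ cd WWWROOT
    <span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos WWWROOT<span class="token punctuation">]</span>$ touch index<span class="token punctuation">.</span>html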
    </code></pre> 
    <h2><a name="t25"></a><a id="cpphttplib__1541"></a>Basic usage test of cpp-httplib</h2> 
    <pre data-index="50" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token comment">//http_server.cc</span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"searcher.hpp"</span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"cpp-httplib/httplib.h"</span></span>
    
    <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string root_path<span class="token operator">=</span><span class="token string">"./WWWROOT"</span><span class="token punctuation">;</span>
    <span class="token keyword">int</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        httplib<span class="token double-colon punctuation">::</span>Server svr<span class="token punctuation">;</span>
    
        <span class="token comment">// Set the web root directory (home page and static resources)</span>
        svr<span class="token punctuation">.</span><span class="token function">set_base_dir</span><span class="token punctuation">(</span>root_path<span class="token punctuation">.</span><span class="token function">c_str</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    
        svr<span class="token punctuation">.</span><span class="token function">Get</span><span class="token punctuation">(</span><span class="token string">"/hi"</span><span class="token punctuation">,</span><span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token punctuation">(</span><span class="token keyword">const</span> httplib<span class="token double-colon punctuation">::</span>Request <span class="token operator">&</span>req<span class="token punctuation">,</span>httplib<span class="token double-colon punctuation">::</span>Response <span class="token operator">&</span>rsp<span class="token punctuation">)</span><span class="token punctuation">{</span>
            rsp<span class="token punctuation">.</span><span class="token function">set_content</span><span class="token punctuation">(</span><span class="token string">"gogogogogo"</span><span class="token punctuation">,</span><span class="token string">"text/plain; charset=utf-8"</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
        svr<span class="token punctuation">.</span><span class="token function">listen</span><span class="token punctuation">(</span><span class="token string">"0.0.0.0"</span><span class="token punctuation">,</span><span class="token number">8081</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    </code></pre> 
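    <p>The call <code>svr.Get("/hi", ...)</code> in the listing above registers a handler for a path. As a rough mental model only, that idea can be sketched as a map from path to callback. <code>MiniRouter</code> and <code>MiniResponse</code> are hypothetical names made up for this illustration; the real <code>httplib::Server</code> does far more (pattern matching instead of exact string lookup, HTTP parsing, sockets, threads), so this is not its implementation.</p> 

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical, simplified stand-in for an HTTP response.
struct MiniResponse
{
    int status = 404;          // stays 404 unless a route matches
    std::string body;
    std::string content_type;
};

// Hypothetical router: exact-match lookup from path to handler.
class MiniRouter
{
public:
    using Handler = std::function<void(MiniResponse &)>;

    // Register a handler for GET requests on `path`.
    void Get(const std::string &path, Handler handler)
    {
        assert(static_cast<bool>(handler)); // an empty handler is a caller bug
        routes_[path] = std::move(handler);
    }

    // Look up the handler for `path` and run it; unknown paths yield 404.
    MiniResponse Dispatch(const std::string &path) const
    {
        MiniResponse rsp;
        auto it = routes_.find(path);
        if (it != routes_.end())
        {
            rsp.status = 200;
            it->second(rsp);
        }
        return rsp;
    }

private:
    std::map<std::string, Handler> routes_;
};
```

    <p>Registering a lambda under <code>"/hi"</code> and then calling <code>Dispatch("/hi")</code> runs that lambda, while any unregistered path leaves the status at 404.</p> 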
    <pre data-index="51" class="prettyprint"><code class="prism language-html has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token comment"><!-- index.html --></span>
    
    <span class="token doctype"><span class="token punctuation"><!</span><span class="token doctype-tag">DOCTYPE</span> <span class="token name">html</span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>html</span><span class="token punctuation">></span></span>
        <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>head</span><span class="token punctuation">></span></span>
            <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>meta</span> <span class="token attr-name">charset</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>UTF-8<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
            <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>title</span><span class="token punctuation">></span></span> for test <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>title</span><span class="token punctuation">></span></span>
        <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>head</span><span class="token punctuation">></span></span>
        <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>body</span><span class="token punctuation">></span></span>
            <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>h1</span><span class="token punctuation">></span></span>Hello World!<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>h1</span><span class="token punctuation">></span></span>
            <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>p</span><span class="token punctuation">></span></span>This is an httplib test<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>p</span><span class="token punctuation">></span></span>
        <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>body</span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>html</span><span class="token punctuation">></span></span>
    
    </code></pre> 
    <p>Compile and run:</p> 
    <pre data-index="52" class="prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ g<span class="token operator">++</span> <span class="token operator">-</span>o http_server http_server<span class="token punctuation">.</span>cc <span class="token operator">-</span>std<span class="token operator">=</span>c<span class="token operator">++</span><span class="token number">11</span> <span class="token operator">-</span>ljsoncpp <span class="token operator">-</span>lpthread
    <span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ <span class="token punctuation">.</span><span class="token operator">/</span>http_server
    </code></pre> 
    <p><img src="https://1000bd.com/contentImg/2022/08/14/024952109.png" alt="在这里插入图片描述"></p> 
    <p><img src="https://1000bd.com/contentImg/2022/08/14/024952281.png" alt="在这里插入图片描述"></p> 
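    <p>Setting a base directory means that request paths are resolved against <code>./WWWROOT</code>, so a request for <code>/</code> ends up at <code>index.html</code>. The sketch below illustrates that mapping under stated assumptions: <code>resolve_static_path</code> is a hypothetical helper written for this post, not a cpp-httplib function, and the rule of rejecting any <code>..</code> segment is one simple way to keep requests inside the web root.</p> 

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: map a URL path onto a file below `root`.
// Returns false for paths that are malformed or try to leave the root.
bool resolve_static_path(const std::string &root,
                         const std::string &url_path,
                         std::string *file_path)
{
    assert(file_path != nullptr);

    // A request path must start with '/'.
    if (url_path.empty() || url_path[0] != '/')
        return false;

    // Walk the '/'-separated segments and refuse any ".." segment.
    std::size_t start = 1;
    while (start <= url_path.size())
    {
        std::size_t end = url_path.find('/', start);
        if (end == std::string::npos)
            end = url_path.size();
        if (url_path.compare(start, end - start, "..") == 0)
            return false;
        start = end + 1;
    }

    *file_path = root + url_path;

    // A path ending in '/' refers to a directory: serve its index.html.
    if (file_path->back() == '/')
        *file_path += "index.html";
    return true;
}
```

    <p>With <code>root</code> set to <code>"./WWWROOT"</code>, the path <code>/</code> resolves to <code>./WWWROOT/index.html</code>, whereas <code>/../etc/passwd</code> is rejected.</p> 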
    <h2><a name="t26"></a><a id="_HttpServer__1593"></a>Writing the HttpServer module</h2> 
    <pre data-index="53" class="set-code-hide prettyprint"><code class="prism language-cpp has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"searcher.hpp"</span></span>
    <span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"cpp-httplib/httplib.h"</span></span>
    
    <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string root_path<span class="token operator">=</span><span class="token string">"./WWWROOT"</span><span class="token punctuation">;</span>
    <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string input<span class="token operator">=</span><span class="token string">"data/raw_html/raw.txt"</span><span class="token punctuation">;</span>
    
    <span class="token keyword">int</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        <span class="token comment">// Create the searcher and initialize it</span>
        ns_searcher<span class="token double-colon punctuation">::</span>Searcher search<span class="token punctuation">;</span>
        search<span class="token punctuation">.</span><span class="token function">InitSearcher</span><span class="token punctuation">(</span>input<span class="token punctuation">)</span><span class="token punctuation">;</span>
    
        httplib<span class="token double-colon punctuation">::</span>Server svr<span class="token punctuation">;</span>
        <span class="token comment">// Set the web root directory (home page and static resources)</span>
        svr<span class="token punctuation">.</span><span class="token function">set_base_dir</span><span class="token punctuation">(</span>root_path<span class="token punctuation">.</span><span class="token function">c_str</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    
        svr<span class="token punctuation">.</span><span class="token function">Get</span><span class="token punctuation">(</span><span class="token string">"/s"</span><span class="token punctuation">,</span><span class="token punctuation">[</span><span class="token operator">&</span>search<span class="token punctuation">]</span><span class="token punctuation">(</span><span class="token keyword">const</span> httplib<span class="token double-colon punctuation">::</span>Request <span class="token operator">&</span>req<span class="token punctuation">,</span>httplib<span class="token double-colon punctuation">::</span>Response <span class="token operator">&</span>rsp<span class="token punctuation">)</span><span class="token punctuation">{</span>
            <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token operator">!</span>req<span class="token punctuation">.</span><span class="token function">has_param</span><span class="token punctuation">(</span><span class="token string">"word"</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token comment">// the request has no "word" parameter</span>
            <span class="token punctuation">{</span>
                rsp<span class="token punctuation">.</span><span class="token function">set_content</span><span class="token punctuation">(</span><span class="token string">"Please enter a search term!"</span><span class="token punctuation">,</span><span class="token string">"text/plain; charset=utf-8"</span><span class="token punctuation">)</span><span class="token punctuation">;</span><span class="token comment">// reply with a plain-text Content-Type</span>
                <span class="token keyword">return</span><span class="token punctuation">;</span>
            <span class="token punctuation">}</span>
            std<span class="token double-colon punctuation">::</span>string word<span class="token operator">=</span>req<span class="token punctuation">.</span><span class="token function">get_param_value</span><span class="token punctuation">(</span><span class="token string">"word"</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
            std<span class="token double-colon punctuation">::</span>cout<span class="token operator"><<</span><span class="token string">"User search term: "</span><span class="token operator"><<</span>word<span class="token operator"><<</span>std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span>
            <span class="token comment">// Run the search</span>
            std<span class="token double-colon punctuation">::</span>string json_string<span class="token punctuation">;</span>
            search<span class="token punctuation">.</span><span class="token function">Search</span><span class="token punctuation">(</span>word<span class="token punctuation">,</span><span class="token operator">&</span>json_string<span class="token punctuation">)</span><span class="token punctuation">;</span>
            rsp<span class="token punctuation">.</span><span class="token function">set_content</span><span class="token punctuation">(</span>json_string<span class="token punctuation">,</span><span class="token string">"application/json"</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
        svr<span class="token punctuation">.</span><span class="token function">listen</span><span class="token punctuation">(</span><span class="token string">"0.0.0.0"</span><span class="token punctuation">,</span><span class="token number">8081</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    </code></pre> 
    <p><img src="https://1000bd.com/contentImg/2022/08/14/024953531.gif" alt="在这里插入图片描述"></p> 
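    <p>The handler above relies on <code>req.has_param("word")</code> and <code>req.get_param_value("word")</code> to read the search term from a request such as <code>/s?word=split</code>. To make clear what that involves, here is a hand-rolled sketch of query-string lookup with percent-decoding. The names <code>percent_decode</code> and <code>get_query_param</code> are hypothetical and exist only in this example; cpp-httplib has its own parsing, and this sketch ignores details such as <code>#fragment</code> handling.</p> 

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: decode "%XX" escapes and treat '+' as a space.
std::string percent_decode(const std::string &in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        if (in[i] == '+')
        {
            out += ' ';
        }
        else if (in[i] == '%' && i + 2 < in.size() &&
                 std::isxdigit(static_cast<unsigned char>(in[i + 1])) &&
                 std::isxdigit(static_cast<unsigned char>(in[i + 2])))
        {
            // Two hex digits follow: turn them into one byte.
            out += static_cast<char>(std::stoi(in.substr(i + 1, 2), nullptr, 16));
            i += 2;
        }
        else
        {
            out += in[i];
        }
    }
    return out;
}

// Hypothetical helper: find `key` in the query part of `target`
// (for example "/s?word=split") and store its decoded value.
bool get_query_param(const std::string &target,
                     const std::string &key,
                     std::string *value)
{
    assert(value != nullptr);

    std::size_t q = target.find('?');
    if (q == std::string::npos)
        return false; // no query string at all

    // Split the query on '&' and each pair on the first '='.
    std::size_t start = q + 1;
    while (start <= target.size())
    {
        std::size_t end = target.find('&', start);
        if (end == std::string::npos)
            end = target.size();

        std::string pair = target.substr(start, end - start);
        std::size_t eq = pair.find('=');
        std::string name = pair.substr(0, eq);
        std::string raw = (eq == std::string::npos) ? "" : pair.substr(eq + 1);

        if (percent_decode(name) == key)
        {
            *value = percent_decode(raw);
            return true;
        }
        start = end + 1;
    }
    return false;
}
```

    <p>Decoding matters for this project because multi-byte search terms and spaces arrive percent-encoded in the URL; the server has to see the decoded bytes before handing the term to the searcher.</p> 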
    <p>OK, the back end is now essentially complete; the front end comes next.</p> 
    <h1><a name="t27"></a><a id="6__1636"></a>6. Front-end module</h1> 
    <h2><a name="t28"></a><a id="HTML__1638"></a>HTML page skeleton</h2> 
    <pre data-index="54" class="set-code-hide prettyprint"><code class="prism language-html has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><span class="token doctype"><span class="token punctuation"><!</span><span class="token doctype-tag">DOCTYPE</span> <span class="token name">html</span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>html</span> <span class="token attr-name">lang</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>en<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>head</span><span class="token punctuation">></span></span>
        <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>meta</span> <span class="token attr-name">charset</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>UTF-8<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
        <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>meta</span> <span class="token attr-name">http-equiv</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>X-UA-Compatible<span class="token punctuation">"</span></span> <span class="token attr-name">content</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>IE=edge<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
        <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>meta</span> <span class="token attr-name">name</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>viewport<span class="token punctuation">"</span></span> <span class="token attr-name">content</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>width=device-width, initial-scale=1.0<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
        <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>title</span><span class="token punctuation">></span></span>Boost Search Engine<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>title</span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>head</span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>body</span><span class="token punctuation">></span></span>
        <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>div</span> <span class="token attr-name">class</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>container<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
            <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>div</span> <span class="token attr-name">class</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>search<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>input</span> <span class="token attr-name">type</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>text<span class="token punctuation">"</span></span> <span class="token attr-name">value</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>Enter search keywords<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>button</span><span class="token punctuation">></span></span>Search<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>button</span><span class="token punctuation">></span></span>
            <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>div</span><span class="token punctuation">></span></span>
    
            <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>div</span> <span class="token attr-name">class</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>result<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>div</span> <span class="token attr-name">class</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>item<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
                    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>a</span> <span class="token attr-name">href</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>#<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>This is the title<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>a</span><span class="token punctuation">></span></span>
                    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>p</span><span class="token punctuation">></span></span>This is the abstract. This is the abstract. This is the abstract.<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>p</span><span class="token punctuation">></span></span>
                    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>i</span><span class="token punctuation">></span></span>https://www.boost.org/doc/libs/1_79_0/doc/html/boost/algorithm/make_split_iterator.html<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>i</span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>div</span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>div</span> <span class="token attr-name">class</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>item<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
                    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>a</span> <span class="token attr-name">href</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>#<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>This is the title<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>a</span><span class="token punctuation">></span></span>
                    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>p</span><span class="token punctuation">></span></span>This is the abstract. This is the abstract. This is the abstract.<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>p</span><span class="token punctuation">></span></span>
                    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>i</span><span class="token punctuation">></span></span>https://www.boost.org/doc/libs/1_79_0/doc/html/boost/algorithm/make_split_iterator.html<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>i</span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>div</span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>div</span> <span class="token attr-name">class</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>item<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
                    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>a</span> <span class="token attr-name">href</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>#<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>This is the title<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>a</span><span class="token punctuation">></span></span>
                    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>p</span><span class="token punctuation">></span></span>This is the abstract. This is the abstract. This is the abstract.<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>p</span><span class="token punctuation">></span></span>
                    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>i</span><span class="token punctuation">></span></span>https://www.boost.org/doc/libs/1_79_0/doc/html/boost/algorithm/make_split_iterator.html<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>i</span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>div</span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>div</span> <span class="token attr-name">class</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>item<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
                    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>a</span> <span class="token attr-name">href</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>#<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>This is the title<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>a</span><span class="token punctuation">></span></span>
                    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>p</span><span class="token punctuation">></span></span>This is the abstract. This is the abstract. This is the abstract.<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>p</span><span class="token punctuation">></span></span>
                    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>i</span><span class="token punctuation">></span></span>https://www.boost.org/doc/libs/1_79_0/doc/html/boost/algorithm/make_split_iterator.html<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>i</span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>div</span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>div</span> <span class="token attr-name">class</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>item<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
                    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>a</span> <span class="token attr-name">href</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>#<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>This is the title<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>a</span><span class="token punctuation">></span></span>
                    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>p</span><span class="token punctuation">></span></span>This is the abstract. This is the abstract. This is the abstract.<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>p</span><span class="token punctuation">></span></span>
                    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>i</span><span class="token punctuation">></span></span>https://www.boost.org/doc/libs/1_79_0/doc/html/boost/algorithm/make_split_iterator.html<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>i</span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>div</span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>div</span> <span class="token attr-name">class</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>item<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
                    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>a</span> <span class="token attr-name">href</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>#<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>This is the title<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>a</span><span class="token punctuation">></span></span>
                    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>p</span><span class="token punctuation">></span></span>This is the abstract. This is the abstract. This is the abstract.<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>p</span><span class="token punctuation">></span></span>
                    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>i</span><span class="token punctuation">></span></span>https://www.boost.org/doc/libs/1_79_0/doc/html/boost/algorithm/make_split_iterator.html<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>i</span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>div</span><span class="token punctuation">></span></span>
            <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>div</span><span class="token punctuation">></span></span>
        <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>div</span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>body</span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>html</span><span class="token punctuation">></span></span>
    </code></pre> 
    <p><img src="https://1000bd.com/contentImg/2022/08/14/024953817.png" alt="Screenshot"></p> 
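    <h2><a name="t29"></a><a id="CSS__1696"></a>Customizing the Page with CSS</h2> 
    <p>Styling comes down to selecting a tag and setting its properties (edit directly after the title in the HTML code).</p> 
    <ol><li>Select specific tags: class selectors, tag selectors, compound selectors</li><li>Set the properties of the selected tags</li></ol> 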
    <pre data-index="55" class="set-code-hide prettyprint"><code class="prism language-css has-numbering" onclick="mdcp.signin(event)" style="position: unset;"><!DOCTYPE html>
    <html lang=<span class="token string">"en"</span>>
    <head>
        <meta charset=<span class="token string">"UTF-8"</span>>
        <meta http-equiv=<span class="token string">"X-UA-Compatible"</span> content=<span class="token string">"IE=edge"</span>>
        <meta name=<span class="token string">"viewport"</span> content=<span class="token string">"width=device-width, initial-scale=1.0"</span>>
        <title>BOOST Search Engine
        /* CSS design */
        
    
    /* ... */
    </code></pre>
    

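    Writing JavaScript to Implement Navigation

    Native JS (XMLHttpRequest) takes more effort to use, so jQuery is used here.

    Add an external link in the HTML to load the jQuery library:

    <script src="http://code.jquery.com/jquery-2.1.1.min.js"></script>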
    

    Insert the following code into the HTML file:

    
    </div>
        <script>  
            function Search(){
                // alert() opens a browser pop-up dialog; handy for quick debugging
                // alert("hello js!");
              
                // 1. Read the query; $ can be thought of as an alias for jQuery
                let query = $(".container .search input").val();
                console.log("query = " + query); // console output appears in the browser's developer console
    
                // 2. Send an HTTP request (upload the keyword to the server); $.ajax is jQuery's function for exchanging data with a server
                $.ajax({
                    type:"GET",
                    url:"/s?word="+query,
                    // On success, print the data returned by the server (the server keeps running in the background)
                    success:function(data){
                        console.log(data);
                        // Build the page content from the results
                        BuildHtml(data);
                    }
                });
            }
    
            function BuildHtml(data)
            {
                if(data=="" || data==null)
                {
                    document.write("No results found");
                    return ;
                }
                // Get the result container
                let result_label = $(".container .result");
                // Clear the results of the previous search
                result_label.empty();
    
                for(let elem of data)
                {
                    console.log(elem.title);
                    console.log(elem.url);
    
                    let a_label=$("<a>",{
                        text: elem.title,
                        // link target
                        href: elem.url,
                        // open the link in a new tab
                        target: "_blank"
                    });
                    let p_label=$("<p>",{
                        text: elem.abstract
                    });
                    let i_label=$("<i>",{
                        text: elem.url
                    });
                    let div_label=$("<div>",{
                        class:"item"
                    });
                    a_label.appendTo(div_label);
                    p_label.appendTo(div_label);
                    i_label.appendTo(div_label);
                    div_label.appendTo(result_label);
                }
            }
        </script>
    </body>
    </html>

    This completes all of the front-end code.

    Overall Result

    All of the project's files are listed below:

    [Screenshot]

    The Makefile is as follows:

    PARSER=parser
    DUG=debug
    HTTP_SERVER=http_server
    cc=g++
    
    .PHONY:all
    all:$(PARSER) $(DUG) $(HTTP_SERVER)
    
    $(PARSER):parser.cc
    	$(cc) -o $@ $^ -std=c++11 -lboost_system -lboost_filesystem
    
    $(DUG):debug.cc
    	$(cc) -o $@ $^ -std=c++11 -ljsoncpp
    
    $(HTTP_SERVER):http_server.cc
    	$(cc) -o $@ $^ -std=c++11 -ljsoncpp -lpthread
    
    .PHONY:clean
    clean:
    		rm -rf $(PARSER) $(DUG) $(HTTP_SERVER)
    
    

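    After make, running ./parser writes all of the processed HTML files into raw.txt.

    Then start the server program: ./http_server

    Then open a browser and enter your server's IP address:

    [Screenshot]

    7. Back-End Optimization

    Deduplicating Search Results

    As discussed in the earlier search module, the inverted lists (posting lists) can produce duplicates: different keywords may come from the same document, so the same document can show up more than once in the search results.

    To test this, we create a new file, test.html, and try searching for its content.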

    • test.html
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html>
      <head>
      
        <title>Test case</title>
        <meta http-equiv="refresh" content="0; URL=http://www.boost.org/doc/libs/master/doc/html/hash.html">
      </head>
      <body>
        Today is a sunny day
        <a href="http://www.boost.org/doc/libs/master/doc/html/hash.html">http://www.boost.org/doc/libs/master/doc/html/hash.html</a>
      </body>
    </html>
    

    Place test.html under the input directory, then rebuild and run:

    [sjl@VM-16-6-centos boost_searcher]$ make
    g++ -o parser parser.cc -std=c++11 -lboost_system -lboost_filesystem
    g++ -o debug debug.cc -std=c++11 -ljsoncpp
    g++ -o http_server http_server.cc -std=c++11 -ljsoncpp -lpthread
    [sjl@VM-16-6-centos boost_searcher]$ ./parser 
    [sjl@VM-16-6-centos boost_searcher]$ ./http_server 
    Index singleton created...
    Index build complete....: 100%
    
    

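    [Screenshot]

    [Screenshot]

    As you can see, the results are duplicated!

    We need to prevent this.

    Modify searcher.hpp accordingly; see the project code link at the end of the article for details.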
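    The author's actual change lives in the linked repository and is not reproduced here. As a rough illustration of the idea only, the sketch below merges the postings hit by all query keywords so that each document appears once, with its keyword weights added together. The names Posting, MergedHit and MergeByDocId are hypothetical and are not taken from the project code.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical posting: one keyword hit pointing at one document.
struct Posting {
    uint64_t doc_id;
    std::string word;
    int weight;
};

// Hypothetical merged hit: one entry per document, with all matched keywords.
struct MergedHit {
    uint64_t doc_id = 0;
    int weight = 0;
    std::vector<std::string> words;
};

// Merge the postings of every query keyword so each document appears once.
std::vector<MergedHit> MergeByDocId(const std::vector<Posting>& postings) {
    std::unordered_map<uint64_t, MergedHit> by_doc;
    for (const Posting& p : postings) {
        MergedHit& hit = by_doc[p.doc_id];  // created on first sight of this doc_id
        hit.doc_id = p.doc_id;
        hit.weight += p.weight;             // accumulate the weights of all matched keywords
        hit.words.push_back(p.word);
    }

    std::vector<MergedHit> merged;
    merged.reserve(by_doc.size());
    for (auto& kv : by_doc) {
        merged.push_back(std::move(kv.second));
    }

    // Highest total weight first; ties are broken by doc_id for a deterministic order.
    std::sort(merged.begin(), merged.end(),
              [](const MergedHit& a, const MergedHit& b) {
                  if (a.weight != b.weight) return a.weight > b.weight;
                  return a.doc_id < b.doc_id;
              });
    return merged;
}
```

    With this shape, a query whose keywords "sunny" and "day" both occur in test.html yields a single result for that document rather than two.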

    After the change:

    [Screenshot]

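    Removing Stop Words

    The jieba segmentation library ships with a stop-word dictionary:

    [Screenshot]

    Modify tool.hpp

    Load the stop-word dictionary into memory; after jieba finishes segmenting, filter the keywords against the dictionary to remove the stop words.

    See tool.hpp in the project code linked at the end of the article for details.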

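    Result:

    Searching for a stop word now returns no results.

    [Screenshot]

    Building the index is slower up front because every keyword has to be checked against the stop-word list, but once the index is built, searching is much faster because stop words are never indexed or looked up.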

    Adding Logging

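    // log.hpp
    #pragma once
    
    #include <iostream>
    #include <string>
    #include <ctime>
    
    #define NORMAL  1
    #define WARNING 2
    #define DEBUG   3
    #define FATAL   4
     
    // #LEVEL stringifies the level name, so LOG(NORMAL, ...) prints "NORMAL"
    #define LOG(LEVEL,MESSAGE) log(#LEVEL,MESSAGE,__FILE__,__LINE__)
    
    // inline keeps the definition safe if this header is included from more than one translation unit
    inline void log(std::string level ,std::string message,std::string file,int line)
    {
        // Format: [level][unix timestamp][message][file : line]
        std::cout<<"["<<level<<"]"<<"["<<time(nullptr)<<"]"<<"["<<message<<"]"<<"["<<file<<" : "<<line<<"]"<<std::endl;
    
    }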
    

    Call the LOG macro at every error-handling site and informational message, passing an appropriate severity level and message.
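    To make the line layout easier to check, the sketch below builds the same [level][timestamp][message][file : line] string but returns it instead of printing it. FormatLogLine and FORMAT_LOG are hypothetical names introduced only for this illustration; they are not part of the project's log.hpp.

```cpp
#include <ctime>
#include <sstream>
#include <string>

// Hypothetical helper: builds one line in the layout used by log.hpp,
// [LEVEL][unix time][message][file : line], and returns it instead of printing.
inline std::string FormatLogLine(const std::string& level, const std::string& message,
                                 const std::string& file, int line, std::time_t now) {
    std::ostringstream out;
    out << "[" << level << "]"
        << "[" << static_cast<long long>(now) << "]"
        << "[" << message << "]"
        << "[" << file << " : " << line << "]";
    return out.str();
}

// Hypothetical macro mirroring LOG: #LEVEL turns the level name into a string,
// and __FILE__ / __LINE__ record the call site.
#define FORMAT_LOG(LEVEL, MESSAGE) \
    FormatLogLine(#LEVEL, MESSAGE, __FILE__, __LINE__, std::time(nullptr))
```

    A call such as FORMAT_LOG(FATAL, "failed to open raw.txt") would sit in an error branch, while NORMAL-level calls mark progress messages such as the index being built.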

    Deploying the Service

    Run the server in the background and write its log output to log.txt (2>&1 redirects standard error to the same file):

    [sjl@VM-16-6-centos boost_searcher]$ nohup ./http_server > log.txt 2>&1 &
    

    After entering a few search terms:

    [sjl@VM-16-6-centos boost_searcher]$ cat log.txt 
    nohup: ignoring input
    Index singleton created...
    [NORMAL][1659167339][Index singleton created...][searcher.hpp : 24]
    Index build complete....: 100%
    [NORMAL][1659167389][Index build complete...][searcher.hpp : 28]
    User search term: vector
    [NORMAL][1659168113][User search term: vector][http_server.cc : 25]
    User search term: split
    [NORMAL][1659168141][User search term: split][http_server.cc : 25]
    User search term: filestream
    [NORMAL][1659168148][User search term: filestream][http_server.cc : 25]
    
    

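    Directions for extending the project:

    1. The project currently indexes only the HTML files under boost_1_79_0/doc/html/, so it could be extended to index the entire site.
    2. The data source could be refreshed by crawling the site periodically, or by having the site signal when it is updated so the pages are crawled again. This calls for an online update scheme (multi-threaded or multi-process).
    3. Design and implement each component yourself instead of relying on existing libraries.
    4. Add paid ranking (bid-based placement).
    5. Hot-word statistics and smart search suggestions (trie, priority queue).
    6. Add login and registration.

    Project Code

    Uploaded to: https://gitee.com/bigmulberry/search-engine-based-on—boost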

  • Original article: https://blog.csdn.net/qq_43041053/article/details/126266871