MapReduce Inverted Index

Problem Description

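The inverted index is a central index structure in Elasticsearch: it maps the words in documents to document IDs. Inverted indexes arose from the practical need to look up records by the value of an attribute. Each entry in such an index consists of an attribute value and the addresses of all records that have that value. Because the attribute value determines the location of the records, rather than the records determining the attribute value, the structure is called an inverted index. In practice, inverted indexes are used mainly in search engines, where they map words to documents so that pages relevant to a user's query can be found quickly.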
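As a minimal illustration of the idea, the sketch below keeps an in-memory map from each word to the set of documents that contain it, so a lookup is a single map access. The class name `TinyInvertedIndex`, its method names, and the document IDs `doc1` and `doc2` are made up for this example; it says nothing about how Elasticsearch stores its index internally.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TinyInvertedIndex {
  // word -> IDs of the documents that contain it (sorted for stable output)
  private final Map<String, Set<String>> postings = new HashMap<>();

  // Index one document: register its ID under every word it contains.
  void add(String docId, String text) {
    for (String word : text.toLowerCase().split("\\s+")) {
      if (!word.isEmpty()) {
        postings.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
      }
    }
  }

  // Look up a word: the attribute value leads directly to the matching records.
  Set<String> lookup(String word) {
    return postings.getOrDefault(word.toLowerCase(), Collections.emptySet());
  }

  public static void main(String[] args) {
    TinyInvertedIndex index = new TinyInvertedIndex();
    index.add("doc1", "Ukraine invasion");
    index.add("doc2", "Ukraine maps");
    System.out.println(index.lookup("ukraine")); // [doc1, doc2]
    System.out.println(index.lookup("maps"));    // [doc2]
  }
}
```

Without the index, answering the same query would require scanning every document for the word.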
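This problem asks you to implement the construction of an inverted index. Specifically, given a set of English documents, split each document into words on spaces (the documents contain no punctuation), convert every word to lowercase, discard stop words, and build an inverted index over the remaining words. The output key is the word; the output value is a string made up of file names and the number of times the word occurs in each file, with the entries for different files separated by ";" (see the example).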
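Before moving to Hadoop, it can help to have a plain-Java reference for the expected result. The sketch below applies the same rules (split on whitespace, lowercase, drop stop words, count per file) entirely in memory. The class `LocalInvertedIndex`, the method `build`, and the file names `a.txt` and `b.txt` are assumptions made for illustration; the sketch requires Java 9 or later for `Set.of`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class LocalInvertedIndex {
  // Returns word -> "file1:count1;file2:count2;" for the given documents.
  static Map<String, String> build(Map<String, String> files, Set<String> stopwords) {
    // word -> (file name -> occurrences); file names are kept sorted.
    Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
    for (Map.Entry<String, String> file : files.entrySet()) {
      for (String token : file.getValue().trim().split("\\s+")) {
        String word = token.toLowerCase();
        if (word.isEmpty() || stopwords.contains(word)) {
          continue;
        }
        counts.computeIfAbsent(word, w -> new TreeMap<>())
            .merge(file.getKey(), 1, Integer::sum);
      }
    }
    // Format each word's per-file counts in the required output shape.
    Map<String, String> index = new LinkedHashMap<>();
    for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
      StringBuilder sb = new StringBuilder();
      e.getValue().forEach((name, n) -> sb.append(name).append(':').append(n).append(';'));
      index.put(e.getKey(), sb.toString());
    }
    return index;
  }

  public static void main(String[] args) {
    Map<String, String> files = new LinkedHashMap<>();
    files.put("a.txt", "Ukraine invasion Can China do more to stop war in Ukraine");
    files.put("b.txt", "Ukraine maps Ukraine says Russian ceasefire offer immoral");
    Set<String> stopwords = Set.of("can", "and", "to", "in");
    build(files, stopwords).forEach((word, postings) -> System.out.println(word + " " + postings));
  }
}
```

For the two hypothetical files above, the entry for `ukraine` is `a.txt:2;b.txt:2;`, and the stop words `can`, `to`, and `in` do not appear in the result.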

Example

Input

// The input consists of the text of several files; the contents of two files are shown below
//www.bbc.comnewsworld-asia-china-60615280
Ukraine invasion Can China do more to stop Russia’s war in Ukraine
//www.bbc.comnewsworld-europe-60506682
Ukraine maps Ukraine says Russian ceasefire offer immoral
// stopwords.txt
can
and
to
in

Output

// Output format: word filename1:count1;filename2:count2;
ukraine www.bbc.comnewsworld-asia-china-60615280:2;www.bbc.comnewsworld-europe-60506682:2;
invasion www.bbc.comnewsworld-asia-china-60615280:1;
china www.bbc.comnewsworld-asia-china-60615280:1;
do www.bbc.comnewsworld-asia-china-60615280:1;
more www.bbc.comnewsworld-asia-china-60615280:1;
stop www.bbc.comnewsworld-asia-china-60615280:1;
russia’s www.bbc.comnewsworld-asia-china-60615280:1;
war www.bbc.comnewsworld-asia-china-60615280:1;
maps www.bbc.comnewsworld-europe-60506682:1;
says www.bbc.comnewsworld-europe-60506682:1;
russian www.bbc.comnewsworld-europe-60506682:1;
ceasefire www.bbc.comnewsworld-europe-60506682:1;
offer www.bbc.comnewsworld-europe-60506682:1;
immoral www.bbc.comnewsworld-europe-60506682:1;

Create the folder (package) DSPPCode.mapreduce.inverted_index.impl. In it, create InvertedIndexMapperImpl, which extends InvertedIndexMapper, and implement the abstract method; likewise create InvertedIndexReducerImpl, which extends InvertedIndexReducer, and implement the abstract method.

Code

InvertedIndexMapperImpl.java

package DSPPCode.mapreduce.inverted_index.impl;

import DSPPCode.mapreduce.inverted_index.question.InvertedIndexMapper;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndexMapperImpl extends InvertedIndexMapper {
  // Reused output objects, to avoid allocating a new Text per record.
  private final Text outKey = new Text();
  private final Text outValue = new Text();
  // Stop words, loaded from the distributed cache on the first map() call.
  private Set<String> stopwords;

  @Override
  public void map(Object key, Text value, Mapper<Object, Text, Text, Text>.Context context)
      throws IOException, InterruptedException {
    if (stopwords == null) {
      stopwords = loadStopwords(context);
    }

    // The name of the input file identifies the document in the index.
    FileSplit split = (FileSplit) context.getInputSplit();
    outValue.set(split.getPath().getName());

    // Split on whitespace, lowercase each word, and skip stop words.
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      String word = tokens.nextToken().toLowerCase();
      if (stopwords.contains(word)) {
        continue;
      }
      // Emit (word, file name) once per occurrence; the reducer does the counting.
      outKey.set(word);
      context.write(outKey, outValue);
    }
  }

  // Reads the stop-word list (one word per line) from the first cache file.
  private Set<String> loadStopwords(Mapper<Object, Text, Text, Text>.Context context)
      throws IOException {
    URI uri = context.getCacheFiles()[0];
    FileSystem fs = FileSystem.get(uri, context.getConfiguration());
    Set<String> words = new HashSet<>();
    // try-with-resources closes the reader and the underlying stream.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path(uri)), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        words.add(line.trim().toLowerCase());
      }
    }
    return words;
  }
}

InvertedIndexReducerImpl.java

package DSPPCode.mapreduce.inverted_index.impl;

import DSPPCode.mapreduce.inverted_index.question.InvertedIndexReducer;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndexReducerImpl extends InvertedIndexReducer {
  @Override
  public void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context)
      throws IOException, InterruptedException {
    // Each value is a file name; count how many times each one occurs for this word.
    // A TreeMap keeps the file names sorted, so the output order is deterministic.
    Map<String, Integer> counts = new TreeMap<>();
    for (Text value : values) {
      counts.merge(value.toString(), 1, Integer::sum);
    }

    // Format the postings as "file1:count1;file2:count2;".
    StringBuilder postings = new StringBuilder();
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      postings.append(entry.getKey()).append(':').append(entry.getValue()).append(';');
    }

    context.write(key, new Text(postings.toString()));
  }
}
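
To see why the reducer only has to count repeated file names, the plain-Java sketch below mimics the data flow between the two classes without Hadoop: a map step emits one (word, file name) pair per occurrence, a shuffle step groups the pairs by word, and a reduce step counts the file names in each group. The class `ShuffleSketch`, its three methods, and the file names `a.txt` and `b.txt` are hypothetical; stop-word filtering is left out to keep the sketch short, and the real framework also sorts and partitions the keys across machines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShuffleSketch {
  // "Map" step: one (word, fileName) pair per occurrence of a word.
  static List<String[]> map(String fileName, String line) {
    List<String[]> pairs = new ArrayList<>();
    for (String token : line.trim().split("\\s+")) {
      if (!token.isEmpty()) {
        pairs.add(new String[] {token.toLowerCase(), fileName});
      }
    }
    return pairs;
  }

  // "Shuffle" step: collect all values that share a key.
  static Map<String, List<String>> shuffle(List<String[]> pairs) {
    Map<String, List<String>> grouped = new TreeMap<>();
    for (String[] pair : pairs) {
      grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
    }
    return grouped;
  }

  // "Reduce" step: count how often each file name repeats for one word.
  static String reduce(List<String> fileNames) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String name : fileNames) {
      counts.merge(name, 1, Integer::sum);
    }
    StringBuilder sb = new StringBuilder();
    counts.forEach((name, n) -> sb.append(name).append(':').append(n).append(';'));
    return sb.toString();
  }

  public static void main(String[] args) {
    List<String[]> pairs = new ArrayList<>();
    pairs.addAll(map("a.txt", "Ukraine maps Ukraine"));
    pairs.addAll(map("b.txt", "Ukraine war"));
    shuffle(pairs).forEach((word, names) -> System.out.println(word + " " + reduce(names)));
  }
}
```

In this run the group for `ukraine` holds three values (`a.txt` twice and `b.txt` once), which the reduce step turns into `a.txt:2;b.txt:1;`.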

Original article: https://blog.csdn.net/weixin_45975575/article/details/125447121