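This article shows how to implement an inverted index with MapReduce. It first traces the data through the map, combine, and reduce stages, then gives the Mapper, Combiner, Reducer, and driver code.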
Requirement: build an inverted index over the words in three text files, a, b, and c.
Contents of the input files:
a:
hello world
hello hadoop
hello world
b:
spark hadoop
hello hadoop
world hadoop
c:
spark world
hello world
hello spark
Map stage
context.write("hello:a", "1");
context.write("hello:a", "1");
context.write("hello:a", "1");
Map stage output, grouped by key: <"hello:a",{1,1,1}>
Combine stage
context.write("hello", "a:3");
context.write("hello", "b:1");
context.write("hello", "c:2");
Combine stage output, grouped by key: <"hello",{"a:3","b:1","c:2"}>
Reduce stage
context.write("hello", "a:3,b:1,c:2");
Reduce stage output: <"hello","a:3,b:1,c:2"> (the Reducer code in this article joins the entries with tabs rather than commas)
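The three stages above can be traced without a cluster. The following is a minimal plain-Java sketch that simulates the same data flow on the sample files a, b, and c. It does not use the Hadoop API; the class name InvertedIndexSimulation and its method names are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InvertedIndexSimulation {

    // Map stage: emit one <"word:file", "1"> pair per word occurrence.
    static List<String[]> mapStage(Map<String, List<String>> files) {
        List<String[]> pairs = new ArrayList<>();
        for (Map.Entry<String, List<String>> file : files.entrySet()) {
            for (String line : file.getValue()) {
                for (String word : line.split(" ")) {
                    pairs.add(new String[] {word + ":" + file.getKey(), "1"});
                }
            }
        }
        return pairs;
    }

    // Combine stage: sum the counts per "word:file" key, then re-key by word.
    static Map<String, List<String>> combineStage(List<String[]> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] pair : pairs) {
            counts.merge(pair[0], Integer.parseInt(pair[1]), Integer::sum);
        }
        Map<String, List<String>> byWord = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            String[] parts = entry.getKey().split(":");
            byWord.computeIfAbsent(parts[0], k -> new ArrayList<>())
                  .add(parts[1] + ":" + entry.getValue());
        }
        return byWord;
    }

    // Reduce stage: join all "file:count" entries of a word into one string.
    static Map<String, String> reduceStage(Map<String, List<String>> byWord) {
        Map<String, String> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : byWord.entrySet()) {
            index.put(entry.getKey(), String.join(",", entry.getValue()));
        }
        return index;
    }

    // The sample input from the article: file name -> lines.
    static Map<String, List<String>> sampleFiles() {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("a", Arrays.asList("hello world", "hello hadoop", "hello world"));
        files.put("b", Arrays.asList("spark hadoop", "hello hadoop", "world hadoop"));
        files.put("c", Arrays.asList("spark world", "hello world", "hello spark"));
        return files;
    }

    public static void main(String[] args) {
        Map<String, String> index = reduceStage(combineStage(mapStage(sampleFiles())));
        index.forEach((word, postings) -> System.out.println(word + "\t" + postings));
    }
}
```

For the word hello this produces a:3,b:1,c:2, matching the walkthrough. The simulation sorts entries by file name, whereas a real job does not promise any particular order of values within a key.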
Define the Mapper class. It extends org.apache.hadoop.mapreduce.Mapper and overrides the map() method.
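import java.io.IOException;

// Assumption: the original does not show its imports; StringUtils is taken to be Apache Commons Lang.
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class IIMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = StringUtils.split(line, " ");
        // Get the input split (the file slice being read) from the context
        FileSplit inputSplit = (FileSplit) context.getInputSplit();
        // Get the file's absolute path from the input split
        String path = inputSplit.getPath().toString();
        int index = path.lastIndexOf("/");
        // Cut the file name out of the path
        String fileName = path.substring(index + 1);
        for (String word : words) {
            context.write(new Text(word + ":" + fileName), new Text("1"));
        }
        // Map output, grouped by key: <"hello:a",{1,1,1}>
    }
}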
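Define the Combiner class. It extends org.apache.hadoop.mapreduce.Reducer and overrides the reduce() method.
The combine stage sits between the map stage and the reduce stage.
Note that Hadoop does not guarantee how many times a combiner runs (it may run zero times or more than once). This combiner rewrites the key, so the job as written relies on the combiner running exactly once over each key's map output.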
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IICombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // The incoming key has the form "word:fileName"
        String[] data = key.toString().split(":");
        String word = data[0];
        String fileName = data[1];
        int count = 0;
        for (Text value : values) {
            count += Integer.parseInt(value.toString());
        }
        context.write(new Text(word), new Text(fileName + ":" + count));
        // Combine output, grouped by key: <"hello",{"a:3","b:1","c:2"}>
    }
}
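Because a combiner may run zero, one, or several times, one commonly used alternative is to keep the key unchanged and make the merge step safe to repeat. The sketch below is not the approach used in this article: it assumes a variant in which the map stage emits <word, "file:1"> pairs, and shows, in plain Java, a merge function that could serve as the body of both a combiner and a reducer. The class name PostingMerger and its method are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

class PostingMerger {

    // Merges the values of one word. Each value is either a single "file:count"
    // entry or a tab-separated list of such entries (the output of an earlier
    // merge), so applying merge() to its own output gives the same totals.
    static String merge(List<String> values) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String value : values) {
            for (String posting : value.split("\t")) {
                if (posting.isEmpty()) {
                    continue; // skip empty fragments such as a trailing tab
                }
                int sep = posting.lastIndexOf(':');
                String file = posting.substring(0, sep);
                int count = Integer.parseInt(posting.substring(sep + 1));
                totals.merge(file, count, Integer::sum);
            }
        }
        StringJoiner joined = new StringJoiner("\t");
        totals.forEach((file, count) -> joined.add(file + ":" + count));
        return joined.toString();
    }
}
```

With this shape, merging ["a:1", "a:1", "b:1", "a:1"] in one pass gives the same result as merging ["a:1", "a:1"] first and then merging that partial result with the remaining values, which is the property a combiner needs.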
Define the Reducer class. It extends org.apache.hadoop.mapreduce.Reducer and overrides the reduce() method.
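import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IIReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder sb = new StringBuilder();
        // Append every "fileName:count" entry, each followed by a tab
        for (Text value : values) {
            sb.append(value.toString()).append("\t");
        }
        context.write(key, new Text(sb.toString()));
        // Reduce output: the entries "a:3", "b:1", "c:2" for "hello", separated by tabs
    }
}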
Test the inverted index
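import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InverseIndexRunner {
    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(InverseIndexRunner.class); // Main class of the job
        job.setMapperClass(IIMapper.class);          // Mapper class
        job.setCombinerClass(IICombiner.class);      // Combiner class
        job.setReducerClass(IIReducer.class);        // Reducer class
        job.setMapOutputKeyClass(Text.class);        // Key type of the map output
        job.setMapOutputValueClass(Text.class);      // Value type of the map output
        job.setOutputKeyClass(Text.class);           // Key type of the reduce output
        job.setOutputValueClass(Text.class);         // Value type of the reduce output
        // Input path of the job, taken from the first command-line argument
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Output path of the job, taken from the second command-line argument
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true); // Submit the job and wait for it to finish
    }
}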
Result file written by the job:
hadoop a:1 b:3
hello b:1 c:2 a:3
spark b:1 c:2
world c:2 b:1 a:2
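To use the result, each line can be read back as a word followed by its file:count entries. The following plain-Java sketch parses lines in the shape shown above and looks up a word. It assumes the fields are separated by whitespace and that words contain no whitespace; the class name IndexLookup is hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class IndexLookup {

    // Parses lines such as "hello b:1 c:2 a:3" into word -> (file -> count).
    static Map<String, Map<String, Integer>> parse(List<String> lines) {
        Map<String, Map<String, Integer>> index = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields[0].isEmpty()) {
                continue; // blank line
            }
            Map<String, Integer> postings = new LinkedHashMap<>();
            for (int i = 1; i < fields.length; i++) {
                int sep = fields[i].lastIndexOf(':');
                postings.put(fields[i].substring(0, sep),
                             Integer.parseInt(fields[i].substring(sep + 1)));
            }
            index.put(fields[0], postings);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "hadoop a:1 b:3",
                "hello b:1 c:2 a:3",
                "spark b:1 c:2",
                "world c:2 b:1 a:2");
        Map<String, Map<String, Integer>> index = parse(lines);
        System.out.println(index.get("hello")); // prints {b=1, c=2, a=3}
    }
}
```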