MapReduce Aggregation: Data-Intensive Text Processing with MapReduce

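This post continues the series on implementing the algorithms found in the book Data-Intensive Text Processing with MapReduce. The first part can be found here. In the previous post we discussed using local aggregation to reduce the amount of data shuffled and transferred across the network. Reducing the amount of data transferred is one of the main ways to improve the efficiency of a MapReduce job. A word-count MapReduce job was used to demonstrate local aggregation. Because the result only requires a total, we could reuse the reducer as the combiner, since changing the order or grouping of the addends does not affect the sum.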

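But what if you want an average? Then the same approach will not work, because the average of averages is not, in general, equal to the average of the original set of numbers. With a little insight, though, we can still use local aggregation. For these examples we will use a sample of the NCDC weather dataset used in the book Hadoop: The Definitive Guide. We will calculate the average temperature for each month of 1901. The averaging algorithms for the combiner and in-mapper combining options can be found in section 3.1.3 of Data-Intensive Text Processing with MapReduce.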
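To make the problem concrete, here is a small plain-Java sketch (no Hadoop involved). The class name, method names, and readings are hypothetical, chosen only to illustrate the arithmetic: averaging per-group averages gives a different answer from carrying sums and counts and dividing once.

```java
import java.util.Arrays;
import java.util.List;

public class AverageOfAverages {

    // Arithmetic mean of one group of readings.
    public static double mean(List<Integer> values) {
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return (double) sum / values.size();
    }

    // Naive local aggregation: average each group, then average the averages.
    public static double meanOfMeans(List<List<Integer>> groups) {
        double total = 0;
        for (List<Integer> group : groups) {
            total += mean(group);
        }
        return total / groups.size();
    }

    // Carry a sum and a count across all groups and divide only once at the end.
    public static double meanFromSumsAndCounts(List<List<Integer>> groups) {
        long sum = 0;
        long count = 0;
        for (List<Integer> group : groups) {
            for (int v : group) {
                sum += v;
            }
            count += group.size();
        }
        return (double) sum / count;
    }

    public static void main(String[] args) {
        // Hypothetical readings, split unevenly across two map tasks.
        List<List<Integer>> groups = Arrays.asList(
                Arrays.asList(10, 20, 30, 40),
                Arrays.asList(100));
        System.out.println("mean of means:        " + meanOfMeans(groups));
        System.out.println("from sums and counts: " + meanFromSumsAndCounts(groups));
    }
}
```

With these readings the two methods disagree (62.5 versus 40.0), because the single reading in the second group is given the same weight as the four readings in the first.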

One Size Does Not Fit All

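Last time we described two approaches for reducing data in a MapReduce job: Hadoop combiners and the in-mapper combining approach. The Hadoop framework treats combiners as an optimization and gives no guarantee about how many times a combiner will be called, if at all. As a result, the mapper must emit data in the form the reducer expects, so that the final result does not change if the combiner is never involved. To adapt this for calculating averages, we need to go back to the mapper and change its output.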

Mapper Changes

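In the word-count example, the unoptimized mapper simply emitted the word and a count of 1. The combiner and the in-mapper combining mapper optimized this output by keeping each word as a key in a hash map with the running total as the value; each time a word was seen, its count was incremented by 1. With this setup, if the combiner was not called, the reducer would receive the word as the key and a long list of 1's to add together, resulting in the same output. (Of course, the in-mapper combining mapper avoids this issue, because the combining is guaranteed to happen as part of the mapper code.) To compute an average, our base mapper will emit a string key (the year and month of the weather observation concatenated together) and a custom Writable object called TemperatureAveragingPair. The TemperatureAveragingPair object holds two numbers (IntWritables): the temperature reading and a count of one. We will take the MaximumTemperatureMapper from Hadoop: The Definitive Guide and use it as inspiration for creating an AverageTemperatureMapper: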

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AverageTemperatureMapper extends Mapper<LongWritable, Text, Text, TemperatureAveragingPair> {
    // sample line of weather data
    // 0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF10899199999999999

    private Text outText = new Text();
    private TemperatureAveragingPair pair = new TemperatureAveragingPair();
    private static final int MISSING = 9999;

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        // Characters 15-20 hold the year and month, e.g. "190101".
        String yearMonth = line.substring(15, 21);

        // The temperature field starts at position 87 with a sign; skip an explicit '+'.
        int tempStartPosition = 87;
        if (line.charAt(tempStartPosition) == '+') {
            tempStartPosition += 1;
        }

        int temp = Integer.parseInt(line.substring(tempStartPosition, 92));

        // Emit (yearMonth, (temp, 1)) for every valid reading.
        if (temp != MISSING) {
            outText.set(yearMonth);
            pair.set(temp, 1);
            context.write(outText, pair);
        }
    }
}
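To check what the fixed offsets pick out of a record, the same parsing can be tried outside Hadoop. The following is a minimal sketch; the class and method names are hypothetical, while the offsets and the sample line are the ones shown in the mapper.

```java
public class WeatherRecordParser {

    // Sample line of weather data, copied from the mapper's comment.
    public static final String SAMPLE =
            "0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF10899199999999999";

    // Year and month of the observation, e.g. "190101".
    public static String yearMonth(String line) {
        return line.substring(15, 21);
    }

    // Signed temperature reading in tenths of a degree Celsius.
    public static int temperature(String line) {
        int start = 87;
        // Skip an explicit plus sign, as the mapper does.
        if (line.charAt(start) == '+') {
            start += 1;
        }
        return Integer.parseInt(line.substring(start, 92));
    }

    public static void main(String[] args) {
        System.out.println(yearMonth(SAMPLE) + " " + temperature(SAMPLE));
    }
}
```

For the sample line this yields the key 190101 and a reading of -78, that is, -7.8 degrees Celsius.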

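By having the mapper output a key and a TemperatureAveragingPair object, our MapReduce program is guaranteed to produce the correct result regardless of whether the combiner is called.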
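The post does not show the TemperatureAveragingPair class itself. As a rough idea of what such a value type has to do, here is a simplified stand-in under stated assumptions: it uses plain int fields and only java.io so that it compiles without Hadoop on the classpath, and the class name TemperaturePairSketch and the roundTrip helper are hypothetical. The real class would presumably implement org.apache.hadoop.io.Writable (whose write and readFields methods have the signatures used here) and, per the text, hold IntWritables.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class TemperaturePairSketch {
    private int temp;   // temperature, or a running sum of temperatures
    private int count;  // number of readings folded into temp

    public void set(int temp, int count) {
        this.temp = temp;
        this.count = count;
    }

    public int getTemp() {
        return temp;
    }

    public int getCount() {
        return count;
    }

    // Serialize both fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeInt(temp);
        out.writeInt(count);
    }

    // Deserialize the fields in the same order they were written.
    public void readFields(DataInput in) throws IOException {
        temp = in.readInt();
        count = in.readInt();
    }

    // Hypothetical helper: serialize a pair to bytes and read it back.
    public static TemperaturePairSketch roundTrip(TemperaturePairSketch pair) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            pair.write(new DataOutputStream(bytes));
            TemperaturePairSketch copy = new TemperaturePairSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TemperaturePairSketch pair = new TemperaturePairSketch();
        pair.set(-78, 1);
        TemperaturePairSketch copy = roundTrip(pair);
        System.out.println(copy.getTemp() + " " + copy.getCount());
    }
}
```

Because the same type carries either a single reading with a count of 1 or a partial sum with a larger count, the mapper, the combiner, and the reducer can all exchange it without the reducer needing to know which stage produced it.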

Combiner

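We need to reduce the amount of data sent, so we will sum the temperatures and sum the counts, storing the two separately. This way we reduce the data sent while preserving the format needed to calculate a correct average. If and when the combiner is called, it takes all of the TemperatureAveragingPair objects passed in for a key and emits a single TemperatureAveragingPair for that key, containing the summed temperatures and summed counts. Here is the code for the combiner: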

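import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageTemperatureCombiner extends Reducer<Text, TemperatureAveragingPair, Text, TemperatureAveragingPair> {
    private TemperatureAveragingPair pair = new TemperatureAveragingPair();

    @Override
    protected void reduce(Text key, Iterable<TemperatureAveragingPair> values, Context context) throws IOException, InterruptedException {
        int temp = 0;
        int count = 0;
        // Sum temperatures and counts separately; no division happens here.
        for (TemperatureAveragingPair value : values) {
            temp += value.getTemp().get();
            count += value.getCount().get();
        }
        // Emit one partial (sum, count) pair per key, in the same format the mapper uses.
        pair.set(temp, count);
        context.write(key, pair);
    }
}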
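The reason this combiner is safe to run zero or more times can be checked with a small simulation outside Hadoop. This is only a sketch: the class name, the use of int arrays of the form {temperatureSum, count} in place of TemperatureAveragingPair, and the four readings are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class CombinerInvariance {

    // Combiner step: sum temperatures and counts separately.
    // Each pair is {temperatureSum, count}.
    public static int[] combine(List<int[]> pairs) {
        int temp = 0;
        int count = 0;
        for (int[] p : pairs) {
            temp += p[0];
            count += p[1];
        }
        return new int[] {temp, count};
    }

    // Reducer step: sum whatever pairs arrive, then divide once.
    public static int average(List<int[]> pairs) {
        int[] total = combine(pairs);
        return total[0] / total[1];
    }

    public static void main(String[] args) {
        // Hypothetical mapper output for one key: four readings, each with count 1.
        List<int[]> raw = Arrays.asList(
                new int[] {-78, 1}, new int[] {-72, 1},
                new int[] {-94, 1}, new int[] {-61, 1});

        // Case 1: the combiner never runs; the reducer sees the raw pairs.
        int withoutCombiner = average(raw);

        // Case 2: the combiner runs once on each half of the mapper output.
        List<int[]> combined = Arrays.asList(
                combine(raw.subList(0, 2)),
                combine(raw.subList(2, 4)));
        int withCombiner = average(combined);

        System.out.println(withoutCombiner + " " + withCombiner);
    }
}
```

Both paths arrive at a total of (-305, 4) before the single division, so the reducer computes the same average either way.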

But what we really want is a guarantee that the amount of data sent to the reducers is reduced, so next we will look at how to achieve that.

In-Mapper Combining Averages

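As in the word-count example, to calculate averages the in-mapper combining mapper uses a hash map with the concatenated year+month as the key and a TemperatureAveragingPair as the value. Each time we see the same year+month combination, we fetch the pair object from the map, add the temperature, and increment the count by one. Once the cleanup method is called, we emit all of the pairs with their respective keys: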

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AverageTemperatureCombiningMapper extends Mapper<LongWritable, Text, Text, TemperatureAveragingPair> {
    // sample line of weather data
    // 0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF10899199999999999

    private static final int MISSING = 9999;
    // State kept across map() calls: yearMonth -> running (temperature sum, count).
    private Map<String, TemperatureAveragingPair> pairMap = new HashMap<String, TemperatureAveragingPair>();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String yearMonth = line.substring(15, 21);

        int tempStartPosition = 87;
        if (line.charAt(tempStartPosition) == '+') {
            tempStartPosition += 1;
        }

        int temp = Integer.parseInt(line.substring(tempStartPosition, 92));

        if (temp != MISSING) {
            // Nothing is emitted here; the reading is folded into the running pair.
            TemperatureAveragingPair pair = pairMap.get(yearMonth);
            if (pair == null) {
                pair = new TemperatureAveragingPair();
                pairMap.put(yearMonth, pair);
            }
            int temps = pair.getTemp().get() + temp;
            int count = pair.getCount().get() + 1;
            pair.set(temps, count);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit one (yearMonth, pair) record per key once all input has been processed.
        Set<String> keys = pairMap.keySet();
        Text keyText = new Text();
        for (String key : keys) {
            keyText.set(key);
            context.write(keyText, pairMap.get(key));
        }
    }
}

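By following the same pattern of keeping track of data between map calls, we achieve reliable data reduction with the in-mapper combining strategy. The same caveats apply about keeping state across all calls to the mapper, but given the gains in processing efficiency, this approach merits consideration.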
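The pattern of holding state between map calls and emitting only at cleanup can be sketched without Hadoop. The class below is a hypothetical stand-in: it takes already-parsed (yearMonth, temperature) inputs, stores int arrays of the form {temperatureSum, count} instead of TemperatureAveragingPair objects, and returns the map from cleanup rather than writing to a Context.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class InMapperCombiningSketch {

    // yearMonth -> {temperatureSum, count}; this map outlives individual map() calls.
    private final Map<String, int[]> pairMap = new LinkedHashMap<>();

    // Called once per input record: fold the reading into the running pair, emit nothing.
    public void map(String yearMonth, int temp) {
        int[] pair = pairMap.computeIfAbsent(yearMonth, k -> new int[2]);
        pair[0] += temp;
        pair[1] += 1;
    }

    // Called once at the end: everything held here is what would be emitted.
    public Map<String, int[]> cleanup() {
        return pairMap;
    }

    public static void main(String[] args) {
        InMapperCombiningSketch mapper = new InMapperCombiningSketch();
        // Hypothetical parsed records: two readings for January, one for February.
        mapper.map("190101", -78);
        mapper.map("190101", -72);
        mapper.map("190102", -94);

        for (Map.Entry<String, int[]> e : mapper.cleanup().entrySet()) {
            System.out.println(e.getKey() + " sum=" + e.getValue()[0] + " count=" + e.getValue()[1]);
        }
    }
}
```

Three input records become two output records here. The memory cost is one map entry per distinct key seen by the mapper, which is where the caveat about keeping state comes from.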

Reducer

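At this point, writing our reducer is easy: for each key, take the list of pairs, sum all of the temperatures and counts, then divide the sum of the temperatures by the sum of the counts.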

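import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageTemperatureReducer extends Reducer<Text, TemperatureAveragingPair, Text, IntWritable> {
    private IntWritable average = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<TemperatureAveragingPair> values, Context context) throws IOException, InterruptedException {
        int temp = 0;
        int count = 0;
        // Pairs may be raw readings (count 1) or partial sums from a combiner.
        for (TemperatureAveragingPair pair : values) {
            temp += pair.getTemp().get();
            count += pair.getCount().get();
        }
        // Integer division: the fractional part of the average is discarded.
        average.set(temp / count);
        context.write(key, average);
    }
}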
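One detail worth noting is that the reducer divides two ints, so Java truncates the average toward zero. The post does not say whether that is intended; the sketch below, with hypothetical names and numbers, only shows how the truncated result compares with an exact or rounded one.

```java
public class AverageRounding {

    // What the reducer does: int division, truncating toward zero.
    public static int truncated(int sum, int count) {
        return sum / count;
    }

    // Exact average as a double.
    public static double exact(int sum, int count) {
        return (double) sum / count;
    }

    // Average rounded to the nearest integer.
    public static long rounded(int sum, int count) {
        return Math.round((double) sum / count);
    }

    public static void main(String[] args) {
        int sum = -307;  // hypothetical temperature sum in tenths of a degree
        int count = 4;   // hypothetical number of readings
        System.out.println("truncated: " + truncated(sum, count));
        System.out.println("exact:     " + exact(sum, count));
        System.out.println("rounded:   " + rounded(sum, count));
    }
}
```

For this input the exact average is -76.75, the reducer's style of division gives -76, and rounding gives -77. If fractional precision matters, the reducer's output type could be changed to a floating-point Writable, though that is a design change the post does not make.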

Results

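The results are predictable: the combiner and in-mapper combining mapper options both significantly reduce the data output.
Unoptimized mapper option: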

12/10/10 23:05:28 INFO mapred.JobClient:     Reduce input groups=12
12/10/10 23:05:28 INFO mapred.JobClient:     Combine output records=0
12/10/10 23:05:28 INFO mapred.JobClient:     Map input records=6565
12/10/10 23:05:28 INFO mapred.JobClient:     Reduce shuffle bytes=111594
12/10/10 23:05:28 INFO mapred.JobClient:     Reduce output records=12
12/10/10 23:05:28 INFO mapred.JobClient:     Spilled Records=13128
12/10/10 23:05:28 INFO mapred.JobClient:     Map output bytes=98460
12/10/10 23:05:28 INFO mapred.JobClient:     Total committed heap usage (bytes)=269619200
12/10/10 23:05:28 INFO mapred.JobClient:     Combine input records=0
12/10/10 23:05:28 INFO mapred.JobClient:     Map output records=6564
12/10/10 23:05:28 INFO mapred.JobClient:     SPLIT_RAW_BYTES=108
12/10/10 23:05:28 INFO mapred.JobClient:     Reduce input records=6564

Combiner option:

12/10/10 23:07:19 INFO mapred.JobClient:     Reduce input groups=12
12/10/10 23:07:19 INFO mapred.JobClient:     Combine output records=12
12/10/10 23:07:19 INFO mapred.JobClient:     Map input records=6565
12/10/10 23:07:19 INFO mapred.JobClient:     Reduce shuffle bytes=210
12/10/10 23:07:19 INFO mapred.JobClient:     Reduce output records=12
12/10/10 23:07:19 INFO mapred.JobClient:     Spilled Records=24
12/10/10 23:07:19 INFO mapred.JobClient:     Map output bytes=98460
12/10/10 23:07:19 INFO mapred.JobClient:     Total committed heap usage (bytes)=269619200
12/10/10 23:07:19 INFO mapred.JobClient:     Combine input records=6564
12/10/10 23:07:19 INFO mapred.JobClient:     Map output records=6564
12/10/10 23:07:19 INFO mapred.JobClient:     SPLIT_RAW_BYTES=108
12/10/10 23:07:19 INFO mapred.JobClient:     Reduce input records=12

In-mapper combining option:

12/10/10 23:09:09 INFO mapred.JobClient:     Reduce input groups=12
12/10/10 23:09:09 INFO mapred.JobClient:     Combine output records=0
12/10/10 23:09:09 INFO mapred.JobClient:     Map input records=6565
12/10/10 23:09:09 INFO mapred.JobClient:     Reduce shuffle bytes=210
12/10/10 23:09:09 INFO mapred.JobClient:     Reduce output records=12
12/10/10 23:09:09 INFO mapred.JobClient:     Spilled Records=24
12/10/10 23:09:09 INFO mapred.JobClient:     Map output bytes=180
12/10/10 23:09:09 INFO mapred.JobClient:     Total committed heap usage (bytes)=269619200
12/10/10 23:09:09 INFO mapred.JobClient:     Combine input records=0
12/10/10 23:09:09 INFO mapred.JobClient:     Map output records=12
12/10/10 23:09:09 INFO mapred.JobClient:     SPLIT_RAW_BYTES=108
12/10/10 23:09:09 INFO mapred.JobClient:     Reduce input records=12

Calculated results:
(Note: the temperatures in the sample file are in degrees Celsius * 10)

Month	Unoptimized	Combiner	In-mapper combining
190101	-25	-25	-25
190102	-91	-91	-91
190103	-49	-49	-49
190104	22	22	22
190105	76	76	76
190106	146	146	146
190107	192	192	192
190108	170	170	170
190109	114	114	114
190110	86	86	86
190111	-16	-16	-16
190112	-77	-77	-77

Conclusion

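We have covered local aggregation for both the simple case, where the reducer can be reused as the combiner, and a more complicated case, where some insight into how the data is structured is needed to still gain the benefits of aggregating data locally for increased processing efficiency.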

Further Reading