
The Data Skew Problem

news 2026/5/17 21:33:03

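Data skew: when a MapReduce (MR) job runs, one reduce task processes far more data than the others, while the remaining reduce tasks have almost nothing to do. This imbalance is called data skew.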

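Formal definition: data skew occurs when some keys are distributed so unevenly that certain nodes process significantly more data than others. This creates a performance bottleneck, hampers parallel execution, and lengthens the overall job time. It is especially noticeable in Hadoop MapReduce jobs, where some reduce tasks receive far more data than others, which lowers the processing efficiency of the whole cluster.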
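To make the definition concrete, here is a minimal plain-Java sketch (no Hadoop dependency) of hash-modulo partitioning applied to a stream of identical keys. The class name `HashPartitionDemo` and its methods are hypothetical, and `String.hashCode()` is used as a stand-in for the hash that Hadoop computes on a `Text` key, so the partition number it picks may differ from a real job. The point it illustrates holds either way: identical keys always land in the same partition.

```java
import java.util.Collections;
import java.util.List;

class HashPartitionDemo {

    // Hash-modulo partitioning; String.hashCode() is an assumed stand-in
    // for the hash Hadoop would compute on the real key type.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Count how many keys each partition would receive.
    static int[] countPerPartition(List<String> keys, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (String key : keys) {
            counts[partitionFor(key, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // 8 million copies of the same word, as in the example scenario.
        List<String> keys = Collections.nCopies(8_000_000, "hadoop");
        int[] counts = countPerPartition(keys, 3);
        for (int i = 0; i < counts.length; i++) {
            System.out.println("partition " + i + ": " + counts[i]);
        }
    }
}
```

Running it shows one partition holding all 8,000,000 records and the other two holding none, which is exactly the skewed situation described above.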

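As an example, take a text file whose content is the same word repeated: hadoop, hadoop, hadoop, hadoop, ... Assume it holds 8 million records, which makes the skew easy to see. Because every word is identical, the default hash-modulo partitioning sends every record to the same reduce task, so it is clearly a poor fit. The fix is to write a custom partitioner that overrides the partitioning method, and to configure several reduce tasks (3 here).

The core idea is to append a suffix to the skewed key, and to do so in the map phase: the mapper appends a random suffix (0, 1, or 2) to every hadoop it emits. As each map output record is written, the custom partitioner SkewPartitioner inspects its key. The partition number defaults to 0, a key with suffix 1 goes to partition 1, and a key with suffix 2 goes to partition 2. With three partitions in total (0, 1, 2), the records are spread roughly evenly across the three reduce tasks, which relieves the skew. Note that the job output then contains separate counts for hadoop_0, hadoop_1, and hadoop_2, which have to be added together to get the total for hadoop.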
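The suffix technique described above can be simulated without a cluster. The following is a minimal sketch in plain Java; `SaltingDemo`, `salt`, `partitionFor`, and `simulate` are hypothetical names, and the partition rule is a slight generalization of the article's approach (suffix modulo the number of partitions, rather than a hard-coded case per suffix). A fixed seed is used only to make the run repeatable.

```java
import java.util.Random;

class SaltingDemo {

    // The key known to be skewed (taken from the article's example).
    static final String SKEWED_KEY = "hadoop";

    // Map side: append a random suffix in [0, buckets) to the skewed key only.
    static String salt(String key, Random rnd, int buckets) {
        return SKEWED_KEY.equals(key) ? key + "_" + rnd.nextInt(buckets) : key;
    }

    // Partitioner side: route salted keys by suffix, everything else by hash.
    static int partitionFor(String saltedKey, int numPartitions) {
        String prefix = SKEWED_KEY + "_";
        if (saltedKey.startsWith(prefix)) {
            try {
                int suffix = Integer.parseInt(saltedKey.substring(prefix.length()));
                return Math.floorMod(suffix, numPartitions);
            } catch (NumberFormatException e) {
                // Not a salted key after all; fall through to hashing.
            }
        }
        return (saltedKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Salt `records` copies of the skewed key and count records per partition.
    static int[] simulate(int records, int numPartitions, long seed) {
        Random rnd = new Random(seed);
        int[] counts = new int[numPartitions];
        for (int i = 0; i < records; i++) {
            String salted = salt(SKEWED_KEY, rnd, numPartitions);
            counts[partitionFor(salted, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = simulate(9000, 3, 42L);
        for (int i = 0; i < counts.length; i++) {
            System.out.println("partition " + i + ": " + counts[i]);
        }
    }
}
```

Each partition should receive roughly a third of the records. The split is only approximately even because the suffix is random; a round-robin counter in the mapper would be one way to make it exact, at the cost of keeping state.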

Note: register the custom partitioner on the Job in the driver: job.setPartitionerClass(SkewPartitioner.class)

The full code is as follows:

package com.shujia.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class Demo05SkewDataMR {

    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            // Split each line on runs of commas and whitespace
            for (String word : line.split("[,\\s]+")) {
                // Skip the empty token produced when a line starts with a separator
                if (word.isEmpty()) {
                    continue;
                }
                // Send each word downstream as (word, 1) via context.write,
                // adding a random suffix to the skewed key
                if ("hadoop".equals(word)) {
                    // Randomly pick 0, 1, or 2
                    int suffix = (int) (Math.random() * 3);
                    context.write(new Text(word + "_" + suffix), new IntWritable(1));
                } else {
                    context.write(new Text(word), new IntWritable(1));
                }
            }
        }
    }

    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Count the occurrences of each word
            int cnt = 0;
            for (IntWritable value : values) {
                cnt = cnt + value.get();
            }
            context.write(key, new IntWritable(cnt));
        }
    }

    // Driver: assembles (schedules) and configures the job.
    // Parameters are received through args.
    // This job takes two parameters: the input path and the output path.
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        // Create the Job
        Job job = Job.getInstance(conf);

        // Configure the job
        job.setJobName("Demo05SkewDataMR");
        job.setJarByClass(Demo05SkewDataMR.class);

        // Register the custom partitioner
        job.setPartitionerClass(SkewPartitioner.class);

        // Set the number of reduce tasks manually.
        // The number of files written to HDFS equals the number of reduce tasks.
        job.setNumReduceTasks(3);

        // Configure the map side
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Configure the reduce side
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Validate the number of arguments
        if (args.length != 2) {
            System.out.println("Please pass the input and output directories!");
            return;
        }
        String input = args[0];
        String output = args[1];

        // Configure the input and output paths
        FileInputFormat.addInputPath(job, new Path(input));
        Path outputPath = new Path(output);

        // Use FileSystem to emulate overwrite: delete the output directory if it exists
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true);
        }
        // The output directory must not already exist; the job creates it
        // and fails immediately if it is already there
        FileOutputFormat.setOutputPath(job, outputPath);

        // Submit the job and wait for it to finish
        job.waitForCompletion(true);
    }
}

// Custom partitioner: the map phase adds a random suffix to the skewed key,
// and the partition number is chosen based on that suffix
class SkewPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text text, IntWritable intWritable, int numPartitions) {
        String key = text.toString();
        int partitions = 0;
        // Only the skewed key gets special treatment
        if ("hadoop".equals(key.split("_")[0])) {
            switch (key) {
                // "hadoop_0" needs no case: it keeps the default partition 0
                case "hadoop_1":
                    partitions = 1;
                    break;
                case "hadoop_2":
                    partitions = 2;
                    break;
            }
        } else {
            // Normal keys still use the default hash-modulo partitioning
            partitions = (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
        return partitions;
    }
}
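Because the mapper emits hadoop_0, hadoop_1, and hadoop_2 as distinct keys, the job above writes one partial count per suffix rather than a single total for hadoop. A follow-up aggregation is not part of the original code; the sketch below shows, in plain Java, one way such a merge step could look. `SuffixMergeDemo`, `stripSalt`, and `merge` are hypothetical names, and the sample counts are made up for illustration. In practice this logic would typically live in a second, much smaller MapReduce job or a post-processing script.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class SuffixMergeDemo {

    // The key that was salted in the map phase (taken from the article's example).
    static final String SKEWED_KEY = "hadoop";

    // Remove the numeric suffix from salted keys; leave every other key untouched.
    static String stripSalt(String key) {
        return key.matches(SKEWED_KEY + "_\\d+") ? SKEWED_KEY : key;
    }

    // Sum partial counts that belong to the same original key.
    static Map<String, Integer> merge(Map<String, Integer> partialCounts) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : partialCounts.entrySet()) {
            totals.merge(stripSalt(entry.getKey()), entry.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Illustrative partial counts, not real job output.
        Map<String, Integer> partial = new LinkedHashMap<>();
        partial.put("hadoop_0", 3);
        partial.put("hadoop_1", 4);
        partial.put("hadoop_2", 5);
        partial.put("spark", 2);
        System.out.println(merge(partial)); // prints {hadoop=12, spark=2}
    }
}
```

The merge step only touches one record per suffix, so it does not reintroduce the skew that the first job avoided.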
