当前位置：首页 > news >正文

《斯坦福数据挖掘教程·第三版》读书笔记（英文版） Chapter 2 MapReduce and the New Software Stack

news 2025/6/11 7:44:23

来源：《斯坦福数据挖掘教程·第三版》对应的公开英文书和PPT

Chapter 2 MapReduce and the New Software Stack

Computing cluster means large collections of commodity hardware, including conventional processors (“compute nodes”) connected by Ethernet cables or inexpensive switches.

The software stack begins with a new form of file system, called a “distributed file system,” which features much larger units than the disk blocks in a conventional operating system. Distributed file systems also provide replication of data or redundancy to protect against the frequent media failures that occur when data is distributed over thousands of low-cost compute nodes.

Compute nodes are stored on racks, perhaps 8–64 on a rack. The nodes on a single rack are connected by a network, typically gigabit Ethernet. There can be many racks of compute nodes, and racks are connected by another level of network or a switch. The bandwidth of inter-rack communication is somewhat greater than the inter-rack Ethernet, but given the number of pairs of nodes that might need to communicate between racks, this bandwidth may be essential.

在这里插入图片描述

Some important calculations take minutes or even hours on thousands of compute nodes. If we had to abort and restart the computation every time one component failed, then the computation might never complete successfully.
The solution to this problem takes two forms:

Files must be stored redundantly. If we did not duplicate the file at several compute nodes, then if one node failed, all its files would be unavailable until the node is replaced. If we did not back up the files at all, and the disk crashes, the files would be lost forever.
Computations must be divided into tasks, such that if any one task fails to execute to completion, it can be restarted without affecting other tasks. This strategy is followed by the MapReduce programming system.

在这里插入图片描述

Summary of Chapter 2

Cluster Computing: A common architecture for very large-scale applications is a cluster of compute nodes (processor chip, main memory, and disk). Compute nodes are mounted in racks, and the nodes on a rack are connected, typically by gigabit Ethernet. Racks are also connected by a high-speed network or switch.
Distributed File Systems: An architecture for very large-scale file systems has developed recently. Files are composed of chunks of about 64 megabytes, and each chunk is replicated several times, on different compute nodes or racks.
MapReduce: This programming system allows one to exploit parallelism inherent in cluster computing, and manages the hardware failures that can occur during a long computation on many nodes. Many Map tasks and many Reduce tasks are managed by a Master process. Tasks on a failed compute node are rerun by the Master.
The Map Function: This function is written by the user. It takes a collection of input objects and turns each into zero or more key-value pairs. Keys are not necessarily unique.
The Reduce Function: A MapReduce programming system sorts all the key-value pairs produced by all the Map tasks, forms all the values associated with a given key into a list and distributes key-list pairs to Reduce tasks. Each Reduce task combines the elements on each list, by applying the function written by the user. The results produced by all the Reduce tasks form the output of the MapReduce process.
Reducers: It is often convenient to refer to the application of the Reduce function to a single key and its associated value list as a “reducer.”
Hadoop: This programming system is an open-source implementation of a distributed file system (HDFS, the Hadoop Distributed File System) and MapReduce (Hadoop itself). It is available through the Apache Foundation.
Managing Compute-Node Failures: MapReduce systems support restart of tasks that fail because their compute node, or the rack containing that node, fail. Because Map and Reduce tasks deliver their output only after they finish (the blocking property), it is possible to restart a failed task without concern for possible repetition of the effects of that task. It is necessary to restart the entire job only if the node at which the Master
executes fails.
Applications of MapReduce: While not all parallel algorithms are suitable for implementation in the MapReduce framework, there are simple implementations of matrix-vector and matrix-matrix multiplication. Also, the principal operators of relational algebra are easily implemented in MapReduce.
Workflow Systems: MapReduce has been generalized to systems that support any acyclic collection of functions, each of which can be instantiated by any number of tasks, each responsible for executing that function on a portion of the data.
Spark: This popular workflow system introduces Resilient, Distributed Datasets (RDD’s) and a language in which many common operations on RDD’s can be written. Spark has a number of efficiencies, including lazy evaluation of RDD’s to avoid secondary storage of intermediate results and the recording of lineage for RDD’s so they can be reconstructed as needed.
TensorFlow: This workflow system is specifically designed to support machine-learning. Data is represented as multidimensional arrays, or tensors, and built-in operations perform many powerful operations, such as linear algebra and model training.
Recursive Workflows: When implementing a recursive collection of functions, it is not always possible to preserve the ability to restart any failed task, because recursive tasks may have produced output that was consumed by another task before the failure. A number of schemes for checkpointing parts of the computation to allow restart of single tasks, or restart all tasks from a recent point, have been proposed.
Communication-Cost: Many applications of MapReduce or similar systems do very simple things for each task. Then, the dominant cost is usually the cost of transporting data from where it is created to where it is used. In these cases, efficiency of a MapReduce algorithm can be estimated by calculating the sum of the sizes of the inputs to all the tasks.
Multiway Joins: It is sometimes more efficient to replicate tuples of the relations involved in a join and have the join of three or more relations computed as a single MapReduce job. The technique of Lagrangean multipliers can be used to optimize the degree of replication for each of the participating relations.
Star Joins: Analytic queries often involve a very large fact table joined with smaller dimension tables. These joins can always be done efficiently by the multiway-join technique. An alternative is to distribute the fact table and replicate the dimension tables permanently, using the same strategy as would be used if we were taking the multiway join of the fact table and every dimension table.
Replication Rate and Reducer Size: It is often convenient to measure communication by the replication rate, which is the communication per input. Also, the reducer size is the maximum number of inputs associated with any reducer. For many problems, it is possible to derive a lower bound on replication rate as a function of the reducer size.
Representing Problems as Graphs: It is possible to represent many problems that are amenable to MapReduce computation by a graph in which nodes represent inputs and outputs. An output is connected to all the inputs that are needed to compute that output.
Mapping Schemas: Given the graph of a problem, and given a reducer size, a mapping schema is an assignment of the inputs to one or more reducers so that no reducer is assigned more inputs than the reducer size permits, and yet for every output there is some reducer that gets all the inputs needed to compute that output. The requirement that there be a mapping schema for any MapReduce algorithm is a good expression of what makes MapReduce algorithms different from general parallel computations.
Matrix Multiplication by MapReduce: There is a family of one-pass MapReduce algorithms that performs multiplication of n × n matrices with the minimum possible replication rate $=\frac {2n^2}q$ , where q is the reducer size. On the other hand, a two-pass MapReduce algorithm for the same problem with the same reducer size can use up to a factor of n less communication.

END

《斯坦福数据挖掘教程·第三版》读书笔记（英文版） Chapter 2 MapReduce and the New Software Stack

Chapter 2 MapReduce and the New Software Stack

Summary of Chapter 2

相关文章：

《斯坦福数据挖掘教程·第三版》读书笔记（英文版） Chapter 2 MapReduce and the New Software Stack

HTML零基础快速入门（详细教程）

Kubernetes第5天

RK3568平台开发系列讲解（调试篇）debugfs 分析手段

【Spring框架全系列】SpringBoot配置日志文件

事务 ---MySQL的总结（六）

22 标准模板库STL之容器适配器

目标检测YOLO实战应用案例100讲-基于深度学习的自动驾驶目标检测算法研究

服务网关Gateway

（4）定时器

项目实现读写分离操作（mysql）

VP记录:Educational Codeforces Round 148 (Rated for Div. 2) A~D1

无线模块|如何选择天线和设计天线电路

（11）LCD1602液晶显示屏

类和对象下

【云计算•云原生】4.云原生之什么是Kubernetes

云厂商降价潮背后：来中小企业战场「拼刺刀」

2-单片机GPIO相关知识点及流水灯和按键采集小实验

基础知识（王爽老师书第一章）

非煤矿山电子封条建设算法 yolov8

第 86 场周赛：矩阵中的幻方、钥匙和房间、将数组拆分成斐波那契序列、猜猜这个单词

大数据学习（132）-HIve数据分析

稳定币的深度剖析与展望

使用Matplotlib创建炫酷的3D散点图：数据可视化的新维度

华为OD机考-机房布局

WPF八大法则：告别模态窗口卡顿

ZYNQ学习记录FPGA(一)ZYNQ简介

在鸿蒙HarmonyOS 5中使用DevEco Studio实现指南针功能

【Java多线程从青铜到王者】单例设计模式(八)

Fractal Generative Models论文阅读笔记与代码分析