
Reading Notes (in English) on the Stanford Data Mining Textbook, 3rd Edition: Chapter 2 MapReduce and the New Software Stack

Source: the publicly available English book and slides corresponding to the Stanford Data Mining Textbook, 3rd Edition.

Chapter 2 MapReduce and the New Software Stack

A computing cluster is a large collection of commodity hardware: conventional processors (“compute nodes”) connected by Ethernet cables or inexpensive switches.

The software stack begins with a new form of file system, called a “distributed file system,” which features much larger units than the disk blocks in a conventional operating system. Distributed file systems also provide replication of data or redundancy to protect against the frequent media failures that occur when data is distributed over thousands of low-cost compute nodes.
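To make the chunking-and-replication idea concrete, here is a toy Python sketch (not any particular DFS implementation): it splits a file into fixed-size chunks and assigns each chunk to several distinct nodes. The chunk size, replication factor, node names, and round-robin placement policy are all invented for illustration.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, the chunk size mentioned above
REPLICATION = 3                 # each chunk is replicated on several distinct nodes

def place_chunks(file_size_bytes, nodes):
    """Toy placement: split a file into chunks and give each chunk REPLICATION distinct replicas."""
    num_chunks = (file_size_bytes + CHUNK_SIZE - 1) // CHUNK_SIZE   # ceiling division
    placements = []
    for chunk_id in range(num_chunks):
        # Round-robin over the node list; a real DFS would also spread replicas across racks.
        replicas = [nodes[(chunk_id * REPLICATION + r) % len(nodes)] for r in range(REPLICATION)]
        placements.append((chunk_id, replicas))
    return placements

nodes = ["node%d" % i for i in range(8)]
for chunk_id, replicas in place_chunks(200 * 1024 * 1024, nodes):
    print(chunk_id, replicas)
# A 200 MB file becomes 4 chunks, each stored on 3 distinct nodes.
```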

Compute nodes are mounted on racks, perhaps 8–64 on a rack. The nodes on a single rack are connected by a network, typically gigabit Ethernet. There can be many racks of compute nodes, and racks are connected by another level of network or a switch. The bandwidth of inter-rack communication is somewhat greater than the intra-rack Ethernet, but given the number of pairs of nodes that might need to communicate between racks, this bandwidth may be essential.


Some important calculations take minutes or even hours on thousands of compute nodes. If we had to abort and restart the computation every time one component failed, then the computation might never complete successfully.
The solution to this problem takes two forms:

  1. Files must be stored redundantly. If we did not duplicate a file at several compute nodes, then if one node failed, all of its files would be unavailable until the node was replaced. If we did not back up the files at all and a disk crashed, those files would be lost forever.
  2. Computations must be divided into tasks, such that if any one task fails to execute to completion, it can be restarted without affecting other tasks. This strategy is followed by the MapReduce programming system; a minimal sketch follows this list.
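As a minimal, single-machine sketch of the Map/Reduce division of labor, the Python snippet below runs the classic word-count example. The function names and the toy driver are illustrative only; they mimic the map, group-by-key, and reduce phases rather than the API of Hadoop or any real MapReduce system.

```python
from collections import defaultdict
from typing import Iterable, Iterator, Tuple

def map_fn(document: str) -> Iterator[Tuple[str, int]]:
    """Map phase: turn one input object (a document) into (word, 1) key-value pairs."""
    for word in document.split():
        yield (word, 1)

def reduce_fn(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce phase: combine the list of values associated with one key."""
    return (word, sum(counts))

def toy_mapreduce(documents):
    """Illustrative driver: map every input, group values by key, then reduce each key."""
    groups = defaultdict(list)
    for doc in documents:                 # in a real system, many Map tasks run in parallel
        for key, value in map_fn(doc):
            groups[key].append(value)     # the system's shuffle/group-by-key step
    return [reduce_fn(k, vs) for k, vs in groups.items()]  # one "reducer" per key

if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(sorted(toy_mapreduce(docs)))
    # [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```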


Summary of Chapter 2

  • Cluster Computing: A common architecture for very large-scale applications is a cluster of compute nodes (processor chip, main memory, and disk). Compute nodes are mounted in racks, and the nodes on a rack are connected, typically by gigabit Ethernet. Racks are also connected by a high-speed network or switch.
  • Distributed File Systems: An architecture for very large-scale file systems has developed recently. Files are composed of chunks of about 64 megabytes, and each chunk is replicated several times, on different compute nodes or racks.
  • MapReduce: This programming system allows one to exploit parallelism inherent in cluster computing, and manages the hardware failures that can occur during a long computation on many nodes. Many Map tasks and many Reduce tasks are managed by a Master process. Tasks on a failed compute node are rerun by the Master.
  • The Map Function: This function is written by the user. It takes a collection of input objects and turns each into zero or more key-value pairs. Keys are not necessarily unique.
  • The Reduce Function: A MapReduce programming system sorts all the key-value pairs produced by all the Map tasks, forms all the values associated with a given key into a list and distributes key-list pairs to Reduce tasks. Each Reduce task combines the elements on each list, by applying the function written by the user. The results produced by all the Reduce tasks form the output of the MapReduce process.
  • Reducers: It is often convenient to refer to the application of the Reduce function to a single key and its associated value list as a “reducer.”
  • Hadoop: This programming system is an open-source implementation of a distributed file system (HDFS, the Hadoop Distributed File System) and MapReduce (Hadoop itself). It is available through the Apache Foundation.
  • Managing Compute-Node Failures: MapReduce systems support restart of tasks that fail because their compute node, or the rack containing that node, fails. Because Map and Reduce tasks deliver their output only after they finish (the blocking property), it is possible to restart a failed task without concern for possible repetition of the effects of that task. It is necessary to restart the entire job only if the node at which the Master executes fails.
  • Applications of MapReduce: While not all parallel algorithms are suitable for implementation in the MapReduce framework, there are simple implementations of matrix-vector and matrix-matrix multiplication. Also, the principal operators of relational algebra are easily implemented in MapReduce; a natural-join sketch appears after this summary.
  • Workflow Systems: MapReduce has been generalized to systems that support any acyclic collection of functions, each of which can be instantiated by any number of tasks, each responsible for executing that function on a portion of the data.
  • Spark: This popular workflow system introduces Resilient Distributed Datasets (RDDs) and a language in which many common operations on RDDs can be written. Spark has a number of efficiencies, including lazy evaluation of RDDs to avoid writing intermediate results to secondary storage, and the recording of lineage for RDDs so they can be reconstructed as needed; a short PySpark word count follows this summary.
  • TensorFlow: This workflow system is specifically designed to support machine learning. Data is represented as multidimensional arrays, or tensors, and built-in operations support powerful computations such as linear algebra and model training.
  • Recursive Workflows: When implementing a recursive collection of functions, it is not always possible to preserve the ability to restart any failed task, because recursive tasks may have produced output that was consumed by another task before the failure. A number of schemes for checkpointing parts of the computation to allow restart of single tasks, or restart all tasks from a recent point, have been proposed.
  • Communication-Cost: Many applications of MapReduce or similar systems do very simple things for each task. Then, the dominant cost is usually the cost of transporting data from where it is created to where it is used. In these cases, efficiency of a MapReduce algorithm can be estimated by calculating the sum of the sizes of the inputs to all the tasks.
  • Multiway Joins: It is sometimes more efficient to replicate tuples of the relations involved in a join and have the join of three or more relations computed as a single MapReduce job. The technique of Lagrangean multipliers can be used to optimize the degree of replication for each of the participating relations.
  • Star Joins: Analytic queries often involve a very large fact table joined with smaller dimension tables. These joins can always be done efficiently by the multiway-join technique. An alternative is to distribute the fact table and replicate the dimension tables permanently, using the same strategy as would be used if we were taking the multiway join of the fact table and every dimension table.
  • Replication Rate and Reducer Size: It is often convenient to measure communication by the replication rate, which is the communication per input. Also, the reducer size is the maximum number of inputs associated with any reducer. For many problems, it is possible to derive a lower bound on replication rate as a function of the reducer size.
  • Representing Problems as Graphs: It is possible to represent many problems that are amenable to MapReduce computation by a graph in which nodes represent inputs and outputs. An output is connected to all the inputs that are needed to compute that output.
  • Mapping Schemas: Given the graph of a problem, and given a reducer size, a mapping schema is an assignment of the inputs to one or more reducers so that no reducer is assigned more inputs than the reducer size permits, and yet for every output there is some reducer that gets all the inputs needed to compute that output. The requirement that there be a mapping schema for any MapReduce algorithm is a good expression of what makes MapReduce algorithms different from general parallel computations.
  • Matrix Multiplication by MapReduce: There is a family of one-pass MapReduce algorithms that performs multiplication of n × n matrices with the minimum possible replication rate $r = \frac{2n^2}{q}$, where q is the reducer size. On the other hand, a two-pass MapReduce algorithm for the same problem with the same reducer size can use up to a factor of n less communication. A toy simulation of the one-pass algorithm appears after this summary.
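As a companion to the relational-algebra point in the summary, the sketch below simulates the standard MapReduce natural join R(A, B) ⋈ S(B, C): each Map task tags a tuple with the relation it came from and emits the join attribute B as the key, and each reducer pairs up the R-values and S-values that share that key. The relation names and tuples are invented for illustration.

```python
from collections import defaultdict

# Toy relations R(A, B) and S(B, C); the data is made up for the example.
R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2")]
S = [("b1", "c1"), ("b2", "c2"), ("b3", "c3")]

def map_join(r_tuples, s_tuples):
    """Map phase: key each tuple by the join attribute B, tagged with its relation."""
    for a, b in r_tuples:
        yield (b, ("R", a))
    for b, c in s_tuples:
        yield (b, ("S", c))

def reduce_join(b, tagged_values):
    """Reduce phase: combine every R-value with every S-value sharing key b."""
    a_values = [v for tag, v in tagged_values if tag == "R"]
    c_values = [v for tag, v in tagged_values if tag == "S"]
    return [(a, b, c) for a in a_values for c in c_values]

groups = defaultdict(list)
for key, value in map_join(R, S):
    groups[key].append(value)            # the shuffle/group-by-key step

joined = [t for key, vals in groups.items() for t in reduce_join(key, vals)]
print(sorted(joined))
# [('a1', 'b1', 'c1'), ('a2', 'b1', 'c1'), ('a3', 'b2', 'c2')]
```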
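For the Spark bullet, the PySpark fragment below writes the same word count over RDDs. Declaring the transformations does no work; Spark records the lineage (textFile → flatMap → map → reduceByKey) and evaluates it lazily only when the `collect()` action runs, which is also what allows a lost partition to be rebuilt. The input path is a placeholder, and a local installation of pyspark is assumed.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sketch")

# Transformations only build the lineage of the RDD; nothing is computed yet.
lines = sc.textFile("hdfs:///path/to/input.txt")       # placeholder path
counts = (lines.flatMap(lambda line: line.split())     # words
               .map(lambda word: (word, 1))            # key-value pairs
               .reduceByKey(lambda a, b: a + b))       # combine values per key

# The action forces evaluation of the whole lineage.
for word, n in counts.collect():
    print(word, n)

sc.stop()
```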
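Finally, to ground the matrix-multiplication bullet, here is a small single-machine simulation of the one-pass algorithm in its simplest configuration: one reducer per output element (i, k), i.e. reducer size q = 2n and replication rate r = 2n²/q = n. Each element m_ij is replicated to every reducer in output row i, each n_jk to every reducer in output column k, and the reducer for (i, k) sums the products over j. This is a sketch of the idea, not a distributed implementation.

```python
from collections import defaultdict

def one_pass_matrix_multiply(M, N):
    """Simulate the one-pass MapReduce algorithm with one reducer per output (i, k)."""
    n = len(M)
    groups = defaultdict(list)

    # Map phase: replicate each element of M across its output row,
    # and each element of N across its output column.
    for i in range(n):
        for j in range(n):
            for k in range(n):
                groups[(i, k)].append(("M", j, M[i][j]))
    for j in range(n):
        for k in range(n):
            for i in range(n):
                groups[(i, k)].append(("N", j, N[j][k]))

    # Reduce phase: for key (i, k), pair m_ij with n_jk on matching j and sum.
    P = [[0] * n for _ in range(n)]
    for (i, k), values in groups.items():
        m_vals = {j: v for tag, j, v in values if tag == "M"}
        n_vals = {j: v for tag, j, v in values if tag == "N"}
        P[i][k] = sum(m_vals[j] * n_vals[j] for j in m_vals)
    return P

M = [[1, 2], [3, 4]]
N = [[5, 6], [7, 8]]
print(one_pass_matrix_multiply(M, N))   # [[19, 22], [43, 50]]
```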

END
