当前位置: 首页 > news >正文

《斯坦福数据挖掘教程·第三版》读书笔记(英文版) Chapter 2 MapReduce and the New Software Stack

来源:《斯坦福数据挖掘教程·第三版》对应的公开英文书和PPT

Chapter 2 MapReduce and the New Software Stack

Computing cluster means large collections of commodity hardware, including conventional processors (“compute nodes”) connected by Ethernet cables or inexpensive switches.

The software stack begins with a new form of file system, called a “distributed file system,” which features much larger units than the disk blocks in a conventional operating system. Distributed file systems also provide replication of data or redundancy to protect against the frequent media failures that occur when data is distributed over thousands of low-cost compute nodes.

Compute nodes are stored on racks, perhaps 8–64 on a rack. The nodes on a single rack are connected by a network, typically gigabit Ethernet. There can be many racks of compute nodes, and racks are connected by another level of network or a switch. The bandwidth of inter-rack communication is somewhat greater than the inter-rack Ethernet, but given the number of pairs of nodes that might need to communicate between racks, this bandwidth may be essential.

在这里插入图片描述

Some important calculations take minutes or even hours on thousands of compute nodes. If we had to abort and restart the computation every time one component failed, then the computation might never complete successfully.
The solution to this problem takes two forms:

  1. Files must be stored redundantly. If we did not duplicate the file at several compute nodes, then if one node failed, all its files would be unavailable until the node is replaced. If we did not back up the files at all, and the disk crashes, the files would be lost forever.
  2. Computations must be divided into tasks, such that if any one task fails to execute to completion, it can be restarted without affecting other tasks. This strategy is followed by the MapReduce programming system.

在这里插入图片描述
在这里插入图片描述

在这里插入图片描述

在这里插入图片描述

在这里插入图片描述

在这里插入图片描述

在这里插入图片描述

在这里插入图片描述

Summary of Chapter 2

  • Cluster Computing: A common architecture for very large-scale applications is a cluster of compute nodes (processor chip, main memory, and disk). Compute nodes are mounted in racks, and the nodes on a rack are connected, typically by gigabit Ethernet. Racks are also connected by a high-speed network or switch.
  • Distributed File Systems: An architecture for very large-scale file systems has developed recently. Files are composed of chunks of about 64 megabytes, and each chunk is replicated several times, on different compute nodes or racks.
  • MapReduce: This programming system allows one to exploit parallelism inherent in cluster computing, and manages the hardware failures that can occur during a long computation on many nodes. Many Map tasks and many Reduce tasks are managed by a Master process. Tasks on a failed compute node are rerun by the Master.
  • The Map Function: This function is written by the user. It takes a collection of input objects and turns each into zero or more key-value pairs. Keys are not necessarily unique.
  • The Reduce Function: A MapReduce programming system sorts all the key-value pairs produced by all the Map tasks, forms all the values associated with a given key into a list and distributes key-list pairs to Reduce tasks. Each Reduce task combines the elements on each list, by applying the function written by the user. The results produced by all the Reduce tasks form the output of the MapReduce process.
  • Reducers: It is often convenient to refer to the application of the Reduce function to a single key and its associated value list as a “reducer.”
  • Hadoop: This programming system is an open-source implementation of a distributed file system (HDFS, the Hadoop Distributed File System) and MapReduce (Hadoop itself). It is available through the Apache Foundation.
  • Managing Compute-Node Failures: MapReduce systems support restart of tasks that fail because their compute node, or the rack containing that node, fail. Because Map and Reduce tasks deliver their output only after they finish (the blocking property), it is possible to restart a failed task without concern for possible repetition of the effects of that task. It is necessary to restart the entire job only if the node at which the Master
    executes fails.
  • Applications of MapReduce: While not all parallel algorithms are suitable for implementation in the MapReduce framework, there are simple implementations of matrix-vector and matrix-matrix multiplication. Also, the principal operators of relational algebra are easily implemented in MapReduce.
  • Workflow Systems: MapReduce has been generalized to systems that support any acyclic collection of functions, each of which can be instantiated by any number of tasks, each responsible for executing that function on a portion of the data.
  • Spark: This popular workflow system introduces Resilient, Distributed Datasets (RDD’s) and a language in which many common operations on RDD’s can be written. Spark has a number of efficiencies, including lazy evaluation of RDD’s to avoid secondary storage of intermediate results and the recording of lineage for RDD’s so they can be reconstructed as needed.
  • TensorFlow: This workflow system is specifically designed to support machine-learning. Data is represented as multidimensional arrays, or tensors, and built-in operations perform many powerful operations, such as linear algebra and model training.
  • Recursive Workflows: When implementing a recursive collection of functions, it is not always possible to preserve the ability to restart any failed task, because recursive tasks may have produced output that was consumed by another task before the failure. A number of schemes for checkpointing parts of the computation to allow restart of single tasks, or restart all tasks from a recent point, have been proposed.
  • Communication-Cost: Many applications of MapReduce or similar systems do very simple things for each task. Then, the dominant cost is usually the cost of transporting data from where it is created to where it is used. In these cases, efficiency of a MapReduce algorithm can be estimated by calculating the sum of the sizes of the inputs to all the tasks.
  • Multiway Joins: It is sometimes more efficient to replicate tuples of the relations involved in a join and have the join of three or more relations computed as a single MapReduce job. The technique of Lagrangean multipliers can be used to optimize the degree of replication for each of the participating relations.
  • Star Joins: Analytic queries often involve a very large fact table joined with smaller dimension tables. These joins can always be done efficiently by the multiway-join technique. An alternative is to distribute the fact table and replicate the dimension tables permanently, using the same strategy as would be used if we were taking the multiway join of the fact table and every dimension table.
  • Replication Rate and Reducer Size: It is often convenient to measure communication by the replication rate, which is the communication per input. Also, the reducer size is the maximum number of inputs associated with any reducer. For many problems, it is possible to derive a lower bound on replication rate as a function of the reducer size.
  • Representing Problems as Graphs: It is possible to represent many problems that are amenable to MapReduce computation by a graph in which nodes represent inputs and outputs. An output is connected to all the inputs that are needed to compute that output.
  • Mapping Schemas: Given the graph of a problem, and given a reducer size, a mapping schema is an assignment of the inputs to one or more reducers so that no reducer is assigned more inputs than the reducer size permits, and yet for every output there is some reducer that gets all the inputs needed to compute that output. The requirement that there be a mapping schema for any MapReduce algorithm is a good expression of what makes MapReduce algorithms different from general parallel computations.
  • Matrix Multiplication by MapReduce: There is a family of one-pass MapReduce algorithms that performs multiplication of n × n matrices with the minimum possible replication rate r = 2 n 2 q r =\frac {2n^2}q r=q2n2, where q is the reducer size. On the other hand, a two-pass MapReduce algorithm for the same problem with the same reducer size can use up to a factor of n less communication.

END

相关文章:

《斯坦福数据挖掘教程·第三版》读书笔记(英文版) Chapter 2 MapReduce and the New Software Stack

来源:《斯坦福数据挖掘教程第三版》对应的公开英文书和PPT Chapter 2 MapReduce and the New Software Stack Computing cluster means large collections of commodity hardware, including conventional processors (“compute nodes”) connected by Ethernet …...

HTML零基础快速入门(详细教程)

1&#xff0c;HTML代码特点 <html><head></head><body>hello world!</body> </html>HTML代码有以下特点&#xff1a; html代码是通过标签来组织的&#xff0c;而标签是由尖括号< >组织的&#xff0c;也可被叫作元素&#xff08;ele…...

Kubernetes第5天

第七章 Service详解 本章节主要介绍kubernetes的流量负载组件&#xff1a;Service和Ingress。 Service介绍 ​ 在kubernetes中&#xff0c;pod是应用程序的载体&#xff0c;我们可以通过pod的ip来访问应用程序&#xff0c;但是pod的ip地址不是固定的&#xff0c;这也就意味着…...

RK3568平台开发系列讲解(调试篇)debugfs 分析手段

🚀返回专栏总目录 文章目录 一、enable debugfs二、debugfs API三、使用示例沉淀、分享、成长,让自己和他人都能有所收获!😄 📢Linux 上有一些典型的问题分析手段,从这些基本的分析方法入手,你可以一步步判断出问题根因。这些分析手段,可以简单地归纳为下图: 从这…...

【Spring框架全系列】SpringBoot配置日志文件

&#x1f367;&#x1f367;哈喽&#xff0c;大家好&#xff0c;我是小浪。那么上篇博客我们学习了SpringBoot配置文件的相关操作&#xff0c;本篇博客我们将学习一个新的知识点&#xff0c;SpringBoot日志文件。&#x1f5a5;&#x1f5a5; &#x1f4f2;目录 一、日志是什么…...

事务 ---MySQL的总结(六)

事务 多进程进行并改变同一个数据&#xff0c;如果没有进行版本控制&#xff0c;就会出现数据不确定的问题&#xff0c;为此引入了事务的概念。可以进行数据回滚&#xff0c;解决潜在的问题。 事务的概念 一组的DML组成&#xff0c;这一些的DML要么同时成功&#xff0c;要么同…...

22 标准模板库STL之容器适配器

概述 提到适配器,我们的第一印象是想到设计模式中的适配器模式:将一个类的接口转化为另一个类的接口,使原本不兼容而不能合作的两个类,可以一起工作。STL中的容器适配器与此类似,是一个封装了序列容器的类模板,它在一般序列容器的基础上提供了一些不同的功能和接口。之所…...

目标检测YOLO实战应用案例100讲-基于深度学习的自动驾驶目标检测算法研究

目录 基于深度学习的自动驾驶目标检测算法研究 相关理论基础 2.1 卷积神经网络基本原理...

服务网关Gateway

前言 API 网关出现的原因是微服务架构的出现&#xff0c;不同的微服务一般会有不同的网络地址&#xff0c;而外部客户端可能需要调用多个服务的接口才能完成一个业务需求&#xff0c;如果让客户端直接与各个微服务通信&#xff0c;会有以下的问题&#xff1a; 破坏了服务无状态…...

(4)定时器

51单片机的定时器属于单片机的内部资源&#xff0c;其电路的连接和运转均在单片机内部完成 作用&#xff1a; 用于计时系统替代长时间Delay&#xff0c;提高运行效率和速度任务切换 STC89C52定时器资源&#xff1a; 定时器个数&#xff1a;3个&#xff08;T0,T1,T2&#xf…...

项目实现读写分离操作(mysql)

读写分离 1.问题说明 2.读写分离 Master&#xff08;主库&#xff09;----(数据同步)—> Slave&#xff08;从库&#xff09; Mysql主从复制 mysql主从复制 介绍 mysql主从复制是一个异步的复制过程&#xff0c;底层是基于mysql数据库自带的二进制日志功能。就是一台或多台…...

VP记录:Educational Codeforces Round 148 (Rated for Div. 2) A~D1

传送门:CF 前题提要:本人临近期中,时间较紧,且关于D2暂时没有想到优化算法,因此准备留着以后有时间再继续解决 A题:A. New Palindrome 简单的模拟题,考虑记录每一个字母出现的次数.很容易发现奇数次的数字只能出现一次.因为最多只能在正中间放一个.并且因为不能和初始字符相…...

无线模块|如何选择天线和设计天线电路

无线模块的通信距离是一项重要指标&#xff0c;如何把有效通信距离最大化一直是大家疑惑的问题。本文根据调试经验及对天线的选择与使用方法做了一些说明&#xff0c;希望对工程师快速调试通信距离有所帮助。 一、天线的种类 随着技术的进步&#xff0c;为了节省研发周期&…...

(11)LCD1602液晶显示屏

LCD1602&#xff08;Liquid Crystal Display&#xff09;液晶显示屏是一种字符型液晶显示模块&#xff0c;可以显示ASCII码的标准字符和其它的一些内置特殊字符&#xff0c;还可以有8个自定义字符&#xff0c;自带芯片扫描 显示容量&#xff1a;162个字符&#xff0c;每个字符…...

类和对象下

文章目录 一、初始化列表1、语法&#xff1a;2、初始化顺序 二、static成员三、友元1、友元函数2、友元类 四、拷贝对象时的编译器优化例1、例2、例3、 一、初始化列表 1、语法&#xff1a; 初始化列表&#xff1a; 以一个冒号开始&#xff0c;接着是一个以逗号分隔的数据成员…...

【云计算•云原生】4.云原生之什么是Kubernetes

文章目录 Kubernetes概念Kubernetes核心概念集群podConfigMap Kubernetes架构master节点的组件worker节点组件 Kubernetes网络架构内部网络外部网络 k8s各端口含义 Kubernetes概念 K8S就是Kubernetes&#xff0c;Kubernetes首字母为K&#xff0c;末尾为s&#xff0c;中间一共有…...

云厂商降价潮背后:来中小企业战场「拼刺刀」

如果说过往云厂商的降价打响的是从C端进军B端的营销战&#xff0c;那么在这一轮降价潮背后&#xff0c;对应的则是云厂商从大型KA客户向中小企业进军的信号&#xff0c;强被集成&#xff0c;强获客。 云厂商又一轮降价潮袭来。 5月16日&#xff0c;移动云宣布部分产品线最高降…...

2-单片机GPIO相关知识点及流水灯和按键采集小实验

目录 小问题 &#xff1a;单片机上电后第一个执行的程序是&#xff1f; 【1】GPIO 1.定义 2.应用 I - Input 输入采集 O - Output 输出控制 3.GPIO结构框图 4.功能描述 输入功能 5.相关寄存器 【2】输出控制实验 实验&#xff1a;点亮一盏LED灯 1.实验…...

基础知识(王爽老师书第一章)

文章目录 基础知识1.1 引言1.2 机器语言1.2 引言汇编语言的产生1.3 汇编语言的组成1.4 存储器1.5 指令和数据1.6 存储单元1.7 CPU对存储器的读写1.8 地址总线1.9 数据总线1.10 控制总线小结检测点1.11.11 内存地址空间1.12 主板1.13 接口卡1.14 各类存储器芯片1.15 内存地址空间…...

非煤矿山电子封条建设算法 yolov8

非煤矿山电子封条建设算法模型通过yolov8网络模型AI视频智能分析技术&#xff0c;算法模型对作业状态以及出井入井人员数量变化、人员睡岗离岗等情况实时监测分析&#xff0c;及时发现异常动态&#xff0c;自动推送生成的违规截图报警信息。现代目标检测器大部分都会在正负样本…...

内存分配函数malloc kmalloc vmalloc

内存分配函数malloc kmalloc vmalloc malloc实现步骤: 1)请求大小调整:首先,malloc 需要调整用户请求的大小,以适应内部数据结构(例如,可能需要存储额外的元数据)。通常,这包括对齐调整,确保分配的内存地址满足特定硬件要求(如对齐到8字节或16字节边界)。 2)空闲…...

FFmpeg 低延迟同屏方案

引言 在实时互动需求激增的当下&#xff0c;无论是在线教育中的师生同屏演示、远程办公的屏幕共享协作&#xff0c;还是游戏直播的画面实时传输&#xff0c;低延迟同屏已成为保障用户体验的核心指标。FFmpeg 作为一款功能强大的多媒体框架&#xff0c;凭借其灵活的编解码、数据…...

高频面试之3Zookeeper

高频面试之3Zookeeper 文章目录 高频面试之3Zookeeper3.1 常用命令3.2 选举机制3.3 Zookeeper符合法则中哪两个&#xff1f;3.4 Zookeeper脑裂3.5 Zookeeper用来干嘛了 3.1 常用命令 ls、get、create、delete、deleteall3.2 选举机制 半数机制&#xff08;过半机制&#xff0…...

css的定位(position)详解:相对定位 绝对定位 固定定位

在 CSS 中&#xff0c;元素的定位通过 position 属性控制&#xff0c;共有 5 种定位模式&#xff1a;static&#xff08;静态定位&#xff09;、relative&#xff08;相对定位&#xff09;、absolute&#xff08;绝对定位&#xff09;、fixed&#xff08;固定定位&#xff09;和…...

MySQL中【正则表达式】用法

MySQL 中正则表达式通过 REGEXP 或 RLIKE 操作符实现&#xff08;两者等价&#xff09;&#xff0c;用于在 WHERE 子句中进行复杂的字符串模式匹配。以下是核心用法和示例&#xff1a; 一、基础语法 SELECT column_name FROM table_name WHERE column_name REGEXP pattern; …...

mysql已经安装,但是通过rpm -q 没有找mysql相关的已安装包

文章目录 现象&#xff1a;mysql已经安装&#xff0c;但是通过rpm -q 没有找mysql相关的已安装包遇到 rpm 命令找不到已经安装的 MySQL 包时&#xff0c;可能是因为以下几个原因&#xff1a;1.MySQL 不是通过 RPM 包安装的2.RPM 数据库损坏3.使用了不同的包名或路径4.使用其他包…...

初学 pytest 记录

安装 pip install pytest用例可以是函数也可以是类中的方法 def test_func():print()class TestAdd: # def __init__(self): 在 pytest 中不可以使用__init__方法 # self.cc 12345 pytest.mark.api def test_str(self):res add(1, 2)assert res 12def test_int(self):r…...

Hive 存储格式深度解析:从 TextFile 到 ORC,如何选对数据存储方案?

在大数据处理领域&#xff0c;Hive 作为 Hadoop 生态中重要的数据仓库工具&#xff0c;其存储格式的选择直接影响数据存储成本、查询效率和计算资源消耗。面对 TextFile、SequenceFile、Parquet、RCFile、ORC 等多种存储格式&#xff0c;很多开发者常常陷入选择困境。本文将从底…...

以光量子为例,详解量子获取方式

光量子技术获取量子比特可在室温下进行。该方式有望通过与名为硅光子学&#xff08;silicon photonics&#xff09;的光波导&#xff08;optical waveguide&#xff09;芯片制造技术和光纤等光通信技术相结合来实现量子计算机。量子力学中&#xff0c;光既是波又是粒子。光子本…...

Go 并发编程基础:通道(Channel)的使用

在 Go 中&#xff0c;Channel 是 Goroutine 之间通信的核心机制。它提供了一个线程安全的通信方式&#xff0c;用于在多个 Goroutine 之间传递数据&#xff0c;从而实现高效的并发编程。 本章将介绍 Channel 的基本概念、用法、缓冲、关闭机制以及 select 的使用。 一、Channel…...