
Source Code Analysis of the Kubernetes CNI Framework

news 2025/12/4 14:12:24

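A Deep Dive into the Kubernetes Network Model and Network Communication

Kubernetes defines a simple, consistent network model built on a flat network. Pods communicate efficiently without mapping container ports to host ports and without other components forwarding traffic between them. The model also makes it easy to migrate applications from virtual machines or physical hosts to pods managed by Kubernetes.

This article explores the Kubernetes network model and explains how containers and pods communicate. Implementations of the model are covered later.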

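The Kubernetes network model

The model specifies that:

Every pod has its own IP address, and that IP is reachable across the cluster.
All containers in a pod share the pod's IP address (and MAC address) and can reach one another over localhost.
A pod can communicate with any other pod on any node in the cluster using pod IP addresses, without NAT.
Kubernetes components can communicate with each other and with pods.
Network isolation can be implemented with network policies.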

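The definition above mentions several components:

Pod: a pod in Kubernetes is somewhat like a virtual machine with a unique IP address; the containers in the same pod share network and storage.
Container: a pod is a group of containers that share one network namespace. Containers in a pod are like processes on a virtual machine and can talk to each other over localhost. Each container has its own filesystem, CPU, memory, and process space. Containers are created by creating a pod.
Node: pods run on nodes, and a cluster contains one or more nodes. Each pod's network namespace is connected to the node's network namespace so that traffic can flow between them.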

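How network namespaces work

On k3s, a Kubernetes distribution, create a pod with two containers: a curl container that sends requests and an httpbin container that serves HTTP.

Although this is a distribution, it still follows the Kubernetes network model, so it works just as well for studying the model.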

apiVersion: v1
kind: Pod
metadata:
  name: multi-container-pod
spec:
  containers:
  - image: curlimages/curl
    name: curl
    command: ["sleep", "365d"]
  - image: kennethreitz/httpbin
    name: httpbin

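Log in to the node and list the network namespaces on the host with lsns -t net. No httpbin process shows up, but one namespace has /pause as its command. This pause process is the sandbox container that exists, invisibly, in every pod. The role of the sandbox container is covered later, in the part on container networking and CNI.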

lsns -t net
        NS TYPE NPROCS    PID USER     NETNSID NSFS                                                COMMAND
4026531992 net     126      1 root  unassigned                                                     /lib/systemd/systemd --system --deserialize 31
4026532247 net       1  83224 uuidd unassigned                                                     /usr/sbin/uuidd --socket-activation
4026532317 net       4 129820 65535          0 /run/netns/cni-607c5530-b6d8-ba57-420e-a467d7b10c56 /pause

Since each container has its own process space, switch the command to list PID namespaces instead:

lsns -t pid
        NS TYPE NPROCS    PID USER            COMMAND
4026531836 pid     127      1 root            /lib/systemd/systemd --system --deserialize 31
4026532387 pid       1 129820 65535           /pause
4026532389 pid       1 129855 systemd-network sleep 365d
4026532391 pid       2 129889 root            /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent

Using PID 129889, we can find the network namespace the process belongs to:

ip netns identify 129889
cni-607c5530-b6d8-ba57-420e-a467d7b10c56

We can then run commands inside that namespace with exec:

ip netns exec cni-607c5530-b6d8-ba57-420e-a467d7b10c56 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0@if17: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether f2:c8:17:b6:5f:e5 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.42.1.14/24 brd 10.42.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::f0c8:17ff:feb6:5fe5/64 scope link
       valid_lft forever preferred_lft forever

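The output shows that the pod IP 10.42.1.14 is bound to interface eth0, and that eth0 is connected to interface number 17 (the @if17 suffix).

On the node, look at interface 17. veth7912056b is a virtual Ethernet device in the host's root namespace. It is the tunnel that connects the pod network to the node network, and its peer is the eth0 interface inside the pod's namespace.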

ip link | grep -A1 ^17
17: veth7912056b@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP mode DEFAULT group default
    link/ether d6:5e:54:7f:df:af brd ff:ff:ff:ff:ff:ff link-netns cni-607c5530-b6d8-ba57-420e-a467d7b10c56

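The output also shows that this veth is attached to a network bridge, cni0 (master cni0).

A bridge works at the data link layer (layer 2 of the OSI model) and connects multiple networks, possibly multiple segments. When a request reaches the bridge, the bridge asks every attached interface (here the pods are attached to the bridge through their veths) whether it owns the IP address in the original request. If an interface answers, the bridge records the match (IP -> veth) and forwards the data to it.

What if no interface answers? That depends on how each network plugin is implemented. Commonly used network plugins such as Calico, Flannel, and Cilium are left for later articles.

Next, let's look at how network communication in Kubernetes is carried out. There are three cases:

Communication between containers in the same pod
Communication between pods on the same node
Communication between pods on different nodes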

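How Kubernetes networking works

Container-to-container communication within a pod

Communication between containers in the same pod is the simplest case. The containers share a network namespace, every namespace has a lo loopback interface, and the containers can talk over localhost.

Pod-to-pod communication on the same node

Suppose the curl container and the httpbin container run in two separate pods. The two pods may be scheduled onto the same node. A request sent by curl follows the routing table inside the container and reaches the pod's eth0 interface. It then travels through veth1, the tunnel connected to eth0, into the node's root network namespace.

veth1 is connected through the bridge cni0 to the virtual Ethernet interfaces (vethX) of the other pods. The bridge asks every attached interface whether it owns the IP address in the original request (10.42.1.9 in this example). Once it gets an answer, the bridge records the mapping (10.42.1.9 => veth0) and forwards the data. The data finally passes through the veth0 tunnel into the httpbin pod.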
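The lookup described above can be sketched as a toy model in Go. This is only an illustration of the "ask every attached interface, then remember the answer" behaviour as the article describes it, not how the Linux bridge is implemented. The Bridge type, its fields, and the curl pod address 10.42.1.8 are assumptions made up for the example.

```go
package main

import "fmt"

// Bridge is a toy model of the lookup described above, not a real L2 bridge.
type Bridge struct {
	ports map[string]string // attached veth -> pod IP behind it (hypothetical data)
	cache map[string]string // learned mapping: pod IP -> veth
	asked int               // number of times every attached port had to be asked
}

func NewBridge(ports map[string]string) *Bridge {
	return &Bridge{ports: ports, cache: make(map[string]string)}
}

// Forward returns the veth that leads to dstIP. The boolean is false when no
// attached interface claims the address, i.e. the destination is not local.
func (b *Bridge) Forward(dstIP string) (string, bool) {
	if veth, ok := b.cache[dstIP]; ok {
		return veth, true // already learned, no need to ask again
	}
	b.asked++
	for veth, ip := range b.ports {
		if ip == dstIP {
			b.cache[dstIP] = veth
			return veth, true
		}
	}
	return "", false
}

func main() {
	br := NewBridge(map[string]string{
		"veth1": "10.42.1.8", // curl pod (address is made up for the example)
		"veth0": "10.42.1.9", // httpbin pod
	})
	for _, dst := range []string{"10.42.1.9", "10.42.1.9", "10.42.2.7"} {
		veth, ok := br.Forward(dst)
		fmt.Printf("dst=%s veth=%q local=%v lookups=%d\n", dst, veth, ok, br.asked)
	}
}
```

The third destination is not behind any local veth, which is the case handled by the cross-node path.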

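Pod-to-pod communication across nodes

Communication between pods on different nodes is more involved, and different network plugins handle it differently. One easy-to-follow approach is used here as a simple illustration.

The first half of the flow is similar to same-node communication. When the request reaches the bridge, the bridge asks which pod owns the IP but gets no answer. The flow then falls through to the host's route lookup and moves up to the cluster level.

At the cluster level there is a routing table that stores the pod IP range of every node. When a node joins the cluster it is assigned a pod range (Pod CIDR). In k3s, for example, the default cluster Pod CIDR is 10.42.0.0/16, and nodes receive 10.42.0.0/24, 10.42.1.0/24, 10.42.2.0/24, and so on. From these per-node ranges the node that owns the requested IP can be determined, and the request is sent to that node.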
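The per-node Pod CIDR lookup can be sketched with the standard net/netip package. This is a minimal illustration under stated assumptions: the node names and the nodeForPodIP helper are hypothetical, and a real cluster keeps this information in kernel routes or plugin state, not in a Go map.

```go
package main

import (
	"fmt"
	"net/netip"
)

// nodeForPodIP returns the node whose Pod CIDR contains ip.
// podCIDRs maps a (hypothetical) node name to the range assigned to it.
func nodeForPodIP(podCIDRs map[string]string, ip string) (string, error) {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return "", err
	}
	for node, cidr := range podCIDRs {
		prefix, err := netip.ParsePrefix(cidr)
		if err != nil {
			return "", err
		}
		if prefix.Contains(addr) {
			return node, nil
		}
	}
	return "", fmt.Errorf("no node owns %s", ip)
}

func main() {
	routes := map[string]string{
		"node-a": "10.42.0.0/24",
		"node-b": "10.42.1.0/24",
		"node-c": "10.42.2.0/24",
	}
	node, err := nodeForPodIP(routes, "10.42.1.14")
	fmt.Println(node, err)
}
```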

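Summary

By now you should have a basic understanding of network communication in Kubernetes.

The whole path relies on a number of different components. These components share the pod's life cycle and are created and destroyed together with the pod. The kubelet delegates container management to the container runtime, and the container runtime in turn delegates the setup of the container's network namespace to network plugins. That work includes:

Creating the network namespace for the pod (container)
Creating interfaces
Creating veth pairs
Setting up networking in the namespace
Setting up static routes
Configuring the Ethernet bridge
Assigning IP addresses
Creating NAT rules
…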

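Getting to know the Container Network Interface (CNI)

As mentioned earlier, different network plugins implement the Kubernetes network model differently, mainly in how they handle pod-to-pod communication across nodes. Users can choose the network plugin that fits their needs, and this is made possible by CNI (Container Network Interface). These network plugins all implement the CNI standard, so they integrate well with container orchestrators and runtimes.

What is CNI

CNI is a CNCF project. Besides the specification, which is its most important part, it provides libraries for integrating CNI into applications, the cnitool CLI for executing CNI plugins, and a set of reference plugins. At the time of writing, the latest version is v1.1.2.

CNI is concerned only with connecting containers to the network and with cleaning up or releasing allocated resources when a container is destroyed. Because of this narrow focus, CNI has stayed simple and widely supported even as containers have evolved quickly.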

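The CNI specification

The CNI specification covers the following parts:

The format of the network configuration file
The protocol by which container runtimes interact with network plugins
The procedure for executing plugins
The procedure for delegating work to other plugins
The data types of the results returned to the runtime

Network configuration format

The configuration example from the specification is reproduced here. The specification defines the format of the network configuration, including required fields, optional fields, and what each field does. The example defines a network named dbnet and configures three plugins: bridge, tuning, and portmap.

CNI plugins generally fall into two kinds:

Interface plugins: create network interfaces, such as bridge in the example.
Chained plugins: adjust an interface that has already been created, such as tuning in the example.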

{
  "cniVersion": "1.0.0",
  "name": "dbnet",
  "plugins": [
    {
      "type": "bridge",
      // plugin specific parameters
      "bridge": "cni0",
      "keyA": ["some more", "plugin specific", "configuration"],
      "ipam": {
        "type": "host-local",
        // ipam specific
        "subnet": "10.1.0.0/16",
        "gateway": "10.1.0.1",
        "routes": [
          {"dst": "0.0.0.0/0"}
        ]
      },
      "dns": {
        "nameservers": [ "10.1.0.1" ]
      }
    },
    {
      "type": "tuning",
      "capabilities": {
        "mac": true
      },
      "sysctl": {
        "net.core.somaxconn": "500"
      }
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true}
    }
  ]
}
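To make the structure of the configuration concrete, here is a minimal Go sketch that parses a plugin list like dbnet with encoding/json. The NetConfList and PluginConf types and the two helper functions are simplified stand-ins written for this example, not the types libcni actually uses. The embedded JSON also drops the // comments from the specification example, because encoding/json does not accept them.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// PluginConf keeps only the fields this sketch needs; unknown keys are ignored.
type PluginConf struct {
	Type         string          `json:"type"`
	Capabilities map[string]bool `json:"capabilities,omitempty"`
	IPAM         json.RawMessage `json:"ipam,omitempty"`
}

// NetConfList is a simplified view of a CNI network configuration list.
type NetConfList struct {
	CNIVersion string       `json:"cniVersion"`
	Name       string       `json:"name"`
	Plugins    []PluginConf `json:"plugins"`
}

const dbnet = `{
  "cniVersion": "1.0.0",
  "name": "dbnet",
  "plugins": [
    {"type": "bridge", "bridge": "cni0",
     "ipam": {"type": "host-local", "subnet": "10.1.0.0/16", "gateway": "10.1.0.1"}},
    {"type": "tuning", "capabilities": {"mac": true},
     "sysctl": {"net.core.somaxconn": "500"}},
    {"type": "portmap", "capabilities": {"portMappings": true}}
  ]
}`

// parseConfList decodes the list and checks the fields every entry must have.
func parseConfList(data []byte) (*NetConfList, error) {
	var list NetConfList
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, err
	}
	if list.Name == "" {
		return nil, errors.New("missing network name")
	}
	for i, p := range list.Plugins {
		if p.Type == "" {
			return nil, fmt.Errorf("plugin %d has no type", i)
		}
	}
	return &list, nil
}

// pluginTypes lists the plugin types in execution order for ADD.
func pluginTypes(list *NetConfList) []string {
	types := make([]string, 0, len(list.Plugins))
	for _, p := range list.Plugins {
		types = append(types, p.Type)
	}
	return types
}

func main() {
	list, err := parseConfList([]byte(dbnet))
	if err != nil {
		panic(err)
	}
	fmt.Println(list.Name, list.CNIVersion, pluginTypes(list))
}
```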

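The protocol between container runtimes and network plugins

CNI offers the container runtime four operations:

ADD: add the container to a network, or apply modifications
DEL: remove the container from a network, or undo modifications
CHECK: check that the container's networking is as expected, and return an error if something is wrong
VERSION: report the versions supported by the plugin

The specification defines the inputs and outputs of each operation. The main parameters, which are passed to the plugin as environment variables, are:

CNI_COMMAND: one of the four operations above
CNI_CONTAINERID: the container ID
CNI_NETNS: the container's isolation domain; when network namespaces are used, this is the path to the network namespace
CNI_IFNAME: the name of the interface to create inside the container, for example eth0
CNI_ARGS: extra arguments passed in at invocation time
CNI_PATH: the list of paths to search for plugin executables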
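As a rough sketch of the plugin side of this protocol, the following Go code collects the variables listed above into a struct. The Invocation type and parseInvocation function are names invented for this example. It assumes CNI_ARGS uses the semicolon-separated key=value form given in the specification (for example FOO=BAR;ABC=123) and that CNI_PATH uses the operating system's path list separator.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Invocation holds the parameters a runtime passes through CNI_* variables.
type Invocation struct {
	Command     string
	ContainerID string
	NetNS       string
	IfName      string
	Args        map[string]string
	Path        []string
}

// parseInvocation reads the CNI_* variables through getenv, which has the same
// shape as os.Getenv so a real plugin could pass that function directly.
func parseInvocation(getenv func(string) string) (*Invocation, error) {
	inv := &Invocation{
		Command:     getenv("CNI_COMMAND"),
		ContainerID: getenv("CNI_CONTAINERID"),
		NetNS:       getenv("CNI_NETNS"),
		IfName:      getenv("CNI_IFNAME"),
		Args:        map[string]string{},
		Path:        filepath.SplitList(getenv("CNI_PATH")),
	}
	switch inv.Command {
	case "ADD", "DEL", "CHECK", "VERSION":
	default:
		return nil, fmt.Errorf("unknown CNI_COMMAND %q", inv.Command)
	}
	// CNI_ARGS holds semicolon-separated key=value pairs.
	if raw := getenv("CNI_ARGS"); raw != "" {
		for _, pair := range strings.Split(raw, ";") {
			key, value, ok := strings.Cut(pair, "=")
			if !ok {
				return nil, fmt.Errorf("malformed CNI_ARGS entry %q", pair)
			}
			inv.Args[key] = value
		}
	}
	return inv, nil
}

func main() {
	// A fake environment keeps the example deterministic.
	env := map[string]string{
		"CNI_COMMAND":     "ADD",
		"CNI_CONTAINERID": "example-container-id",
		"CNI_NETNS":       "/run/netns/cni-607c5530-b6d8-ba57-420e-a467d7b10c56",
		"CNI_IFNAME":      "eth0",
		"CNI_ARGS":        "FOO=BAR;ABC=123",
		"CNI_PATH":        "/opt/cni/bin",
	}
	inv, err := parseInvocation(func(key string) string { return env[key] })
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", *inv)
}
```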

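Plugin execution flow

CNI calls the ADD, DEL, and CHECK operations applied to a container's network configuration an attachment.

Configuring a container's network takes one or more plugins working together, so plugins run in a defined order. In the example configuration above, the interface has to be created before it can be tuned.

Take ADD as an example. The interface plugin normally runs first, followed by the chained plugins. The output of the previous plugin (PrevResult) and the configuration of the next plugin together form the input of the next plugin. The first plugin receives the network configuration as part of its input. A plugin may pass the previous plugin's PrevResult through as its own output, or update it to reflect its own changes. The output of the last plugin is returned to the container runtime as the result of the CNI call, and the runtime saves that result and uses it as input for other operations.

DEL runs in exactly the opposite order from ADD. The configuration on the interface has to be removed, or the allocated IP released, before the container's network interface can finally be deleted. The input to DEL is the ADD result saved by the container runtime.

Besides the order of plugins within a single operation, the CNI specification also describes how parallel and repeated operations are handled.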
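The ADD and DEL ordering described above can be sketched as follows. This is a toy model under stated assumptions: Result, Plugin, exampleChain, runAdd, and runDel are all invented for the illustration, the plugins are plain Go closures, and the address 10.1.0.5/16 is made up. A real runtime executes separate plugin binaries.

```go
package main

import "fmt"

// Result is a heavily simplified stand-in for a CNI result.
type Result struct {
	Interfaces []string
	IPs        []string
}

// Plugin models one entry of the plugin list as a pair of functions.
type Plugin struct {
	Name string
	Add  func(prev Result) Result // receives the previous plugin's result
	Del  func(cached Result)      // receives the result saved after ADD
}

// exampleChain mirrors the dbnet example: an interface plugin, then a chained one.
func exampleChain() []Plugin {
	return []Plugin{
		{
			Name: "bridge",
			Add: func(prev Result) Result {
				prev.Interfaces = append(prev.Interfaces, "eth0")
				prev.IPs = append(prev.IPs, "10.1.0.5/16")
				return prev
			},
			Del: func(Result) {},
		},
		{
			Name: "tuning",
			// A chained plugin may pass the previous result through unchanged.
			Add: func(prev Result) Result { return prev },
			Del: func(Result) {},
		},
	}
}

// runAdd executes the chain front to back, threading the result through.
func runAdd(chain []Plugin) (Result, []string) {
	var prev Result
	var order []string
	for _, p := range chain {
		prev = p.Add(prev)
		order = append(order, p.Name)
	}
	return prev, order
}

// runDel executes the chain back to front, giving every plugin the cached ADD result.
func runDel(chain []Plugin, cached Result) []string {
	var order []string
	for i := len(chain) - 1; i >= 0; i-- {
		chain[i].Del(cached)
		order = append(order, chain[i].Name)
	}
	return order
}

func main() {
	chain := exampleChain()
	result, addOrder := runAdd(chain)
	fmt.Println("ADD order:", addOrder, "result:", result)
	fmt.Println("DEL order:", runDel(chain, result))
}
```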

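Plugin delegation

Some operations cannot, for whatever reason, reasonably be implemented as a discrete chained plugin. Instead, a CNI plugin may want to delegate some functionality to another plugin. A common example is IP address management (IPAM), which mainly covers assigning and reclaiming IP addresses for container interfaces, managing routes, and so on.

CNI defines a third kind of plugin for this: the IPAM plugin. A CNI plugin can call the IPAM plugin at the appropriate moment, and the IPAM plugin returns its result to the caller. The IPAM plugin does its work (assigning IPs, setting gateways, routes, and so on) based on a given protocol (such as dhcp), data in local files, or the ipam section of the network configuration file.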

"ipam": {
  "type": "host-local",
  // ipam specific
  "subnet": "10.1.0.0/16",
  "gateway": "10.1.0.1",
  "routes": [
    {"dst": "0.0.0.0/0"}
  ]
}
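To give a feel for what an IPAM plugin does with a configuration like this, here is a toy in-memory allocator in Go. It is only a sketch in the spirit of host-local and makes several simplifying assumptions: the Allocator type and its methods are invented for the example, state lives in a map (the text above notes that real plugins may rely on local files), and the broadcast address is not excluded.

```go
package main

import (
	"fmt"
	"net/netip"
)

// Allocator hands out addresses from one subnet, skipping the gateway.
type Allocator struct {
	subnet  netip.Prefix
	gateway netip.Addr
	used    map[netip.Addr]string // allocated IP -> container ID
}

func NewAllocator(subnet, gateway string) (*Allocator, error) {
	prefix, err := netip.ParsePrefix(subnet)
	if err != nil {
		return nil, err
	}
	gw, err := netip.ParseAddr(gateway)
	if err != nil {
		return nil, err
	}
	if !prefix.Contains(gw) {
		return nil, fmt.Errorf("gateway %s is outside %s", gw, prefix)
	}
	return &Allocator{subnet: prefix.Masked(), gateway: gw, used: map[netip.Addr]string{}}, nil
}

// Allocate returns the lowest free address, written with the subnet's prefix length.
func (a *Allocator) Allocate(containerID string) (netip.Prefix, error) {
	// Start just after the network address and walk until we leave the subnet.
	for ip := a.subnet.Addr().Next(); a.subnet.Contains(ip); ip = ip.Next() {
		if ip == a.gateway {
			continue
		}
		if _, taken := a.used[ip]; taken {
			continue
		}
		a.used[ip] = containerID
		return netip.PrefixFrom(ip, a.subnet.Bits()), nil
	}
	return netip.Prefix{}, fmt.Errorf("subnet %s is exhausted", a.subnet)
}

// Release frees every address held by containerID, as a DEL would.
func (a *Allocator) Release(containerID string) {
	for ip, id := range a.used {
		if id == containerID {
			delete(a.used, ip)
		}
	}
}

func main() {
	alloc, err := NewAllocator("10.1.0.0/16", "10.1.0.1")
	if err != nil {
		panic(err)
	}
	first, _ := alloc.Allocate("container-1")
	second, _ := alloc.Allocate("container-2")
	fmt.Println(first, second)
	alloc.Release("container-1")
	reused, _ := alloc.Allocate("container-3")
	fmt.Println(reused)
}
```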

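Execution results

A plugin returns one of the following three results, and the specification defines the format of each.

Success: also carries the PrevResult information, for example the PrevResult returned to the container runtime after an ADD.
Error: contains the necessary error information.
Version: the result of the VERSION operation.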
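As an illustration of the Success and Error shapes, the sketch below models a subset of their fields in Go and extracts the addresses bound to eth0 from a sample Success result. The field names follow the specification (interfaces, ips, the interface index inside ips, and code, msg, details for errors). The Go types, the helper functions, the gateway 10.42.1.1, and the error text are assumptions made for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type Interface struct {
	Name    string `json:"name"`
	Mac     string `json:"mac,omitempty"`
	Sandbox string `json:"sandbox,omitempty"`
}

type IPConfig struct {
	Address   string `json:"address"`
	Gateway   string `json:"gateway,omitempty"`
	Interface *int   `json:"interface,omitempty"` // index into Interfaces
}

// SuccessResult models a subset of the Success result.
type SuccessResult struct {
	CNIVersion string      `json:"cniVersion"`
	Interfaces []Interface `json:"interfaces,omitempty"`
	IPs        []IPConfig  `json:"ips,omitempty"`
}

// ErrorResult models the Error result.
type ErrorResult struct {
	CNIVersion string `json:"cniVersion"`
	Code       uint   `json:"code"`
	Msg        string `json:"msg"`
	Details    string `json:"details,omitempty"`
}

// The gateway below is made up; the other values come from the earlier output.
const sample = `{
  "cniVersion": "1.0.0",
  "interfaces": [
    {"name": "veth7912056b", "mac": "d6:5e:54:7f:df:af"},
    {"name": "eth0", "mac": "f2:c8:17:b6:5f:e5",
     "sandbox": "/run/netns/cni-607c5530-b6d8-ba57-420e-a467d7b10c56"}
  ],
  "ips": [{"address": "10.42.1.14/24", "gateway": "10.42.1.1", "interface": 1}]
}`

func parseSuccess(data []byte) (*SuccessResult, error) {
	var res SuccessResult
	if err := json.Unmarshal(data, &res); err != nil {
		return nil, err
	}
	return &res, nil
}

// addressesOn returns the addresses whose interface index points at ifName.
func addressesOn(res *SuccessResult, ifName string) []string {
	var out []string
	for _, ip := range res.IPs {
		if ip.Interface == nil {
			continue
		}
		idx := *ip.Interface
		if idx >= 0 && idx < len(res.Interfaces) && res.Interfaces[idx].Name == ifName {
			out = append(out, ip.Address)
		}
	}
	return out
}

func main() {
	res, err := parseSuccess([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Println(addressesOn(res, "eth0"))

	failure, _ := json.Marshal(ErrorResult{CNIVersion: "1.0.0", Code: 100, Msg: "example failure"})
	fmt.Println(string(failure))
}
```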

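CNI call flow

When the kubelet sees that a pod has been scheduled to its node, it calls the CRI runtime (containerd, CRI-O, and so on) over RPC. The CRI runtime creates the sandbox container and initializes cgroups and namespaces, then calls the CNI plugin to allocate an IP, and finally creates and starts the containers.

Unlike CRI and CSI, which communicate over RPC, CNI is invoked by executing plugin binaries, with the network configuration passed through environment variables and standard input. The Flannel CNI plugin, for example, chains CNI plugin calls to allocate the pod's IP and configure its network.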
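The "environment variables plus standard input" calling convention can be sketched from the runtime's side with os/exec. This is not libcni's implementation. The attachment type and pluginCommand helper are invented for the example, the container ID and namespace path are placeholders, and the command is only built and printed rather than executed, so the sketch does not require a plugin binary to be installed.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// attachment is a local stand-in for the per-container runtime parameters.
type attachment struct {
	ContainerID string
	NetNS       string
	IfName      string
}

// pluginCommand prepares the invocation of one plugin binary: the operation
// and container details go into CNI_* variables, the JSON config into stdin.
func pluginCommand(pluginPath, command, cniPath string, at attachment, netConf []byte) *exec.Cmd {
	cmd := exec.Command(pluginPath)
	cmd.Env = []string{
		"CNI_COMMAND=" + command,
		"CNI_CONTAINERID=" + at.ContainerID,
		"CNI_NETNS=" + at.NetNS,
		"CNI_IFNAME=" + at.IfName,
		"CNI_PATH=" + cniPath,
	}
	cmd.Stdin = bytes.NewReader(netConf)
	return cmd
}

func main() {
	conf := []byte(`{"cniVersion":"1.0.0","name":"dbnet","type":"bridge","bridge":"cni0"}`)
	at := attachment{ContainerID: "example-id", NetNS: "/run/netns/example", IfName: "eth0"}
	cmd := pluginCommand("/opt/cni/bin/bridge", "ADD", "/opt/cni/bin", at, conf)

	// The command is only printed here. Calling cmd.Output() would execute the
	// plugin and return whatever result JSON it writes to standard output.
	fmt.Println(cmd.Path)
	for _, kv := range cmd.Env {
		fmt.Println(kv)
	}
}
```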

The CNI interface

The CNI library is libcni. It is used to integrate CNI into applications and defines the CNI-related interfaces and configuration.

type CNI interface {
	AddNetworkList(ctx context.Context, net *NetworkConfigList, rt *RuntimeConf) (types.Result, error)
	CheckNetworkList(ctx context.Context, net *NetworkConfigList, rt *RuntimeConf) error
	DelNetworkList(ctx context.Context, net *NetworkConfigList, rt *RuntimeConf) error
	GetNetworkListCachedResult(net *NetworkConfigList, rt *RuntimeConf) (types.Result, error)
	GetNetworkListCachedConfig(net *NetworkConfigList, rt *RuntimeConf) ([]byte, *RuntimeConf, error)

	AddNetwork(ctx context.Context, net *NetworkConfig, rt *RuntimeConf) (types.Result, error)
	CheckNetwork(ctx context.Context, net *NetworkConfig, rt *RuntimeConf) error
	DelNetwork(ctx context.Context, net *NetworkConfig, rt *RuntimeConf) error
	GetNetworkCachedResult(net *NetworkConfig, rt *RuntimeConf) (types.Result, error)
	GetNetworkCachedConfig(net *NetworkConfig, rt *RuntimeConf) ([]byte, *RuntimeConf, error)

	ValidateNetworkList(ctx context.Context, net *NetworkConfigList) ([]string, error)
	ValidateNetwork(ctx context.Context, net *NetworkConfig) ([]string, error)
}

Take part of the code that adds a network as an example:

func (c *CNIConfig) addNetwork(ctx context.Context, name, cniVersion string, net *NetworkConfig, prevResult types.Result, rt *RuntimeConf) (types.Result, error) {
	...
	return invoke.ExecPluginWithResult(ctx, pluginPath, newConf.Bytes, c.args("ADD", rt), c.exec)
}

Put simply, the logic is:

Find the plugin executable
Load the network configuration
Execute the ADD operation
Process the result

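Interfaces a CNI plugin needs to implement:

A CNI (Container Network Interface) plugin written in Go typically implements the following handler functions and registers them through the skel package:

cmdAdd(args *skel.CmdArgs) error:
1. Called for the ADD operation, when the runtime attaches a container to the network. It receives a skel.CmdArgs value that carries the plugin's inputs, such as the network configuration, the interface name, and the network namespace.
2. It performs the network setup, including creating virtual network devices, configuring IP addresses, and setting up routes. It returns an error to indicate whether the operation succeeded.
cmdDel(args *skel.CmdArgs) error:
1. Called for the DEL operation, when the container's network is torn down. It receives a skel.CmdArgs value that carries the plugin's inputs.
2. It should clean up, including deleting virtual network devices and removing IP addresses and routes. It returns an error to indicate whether the operation succeeded.
Version information (listed in the original text as GetCapabilities() (*capabilities.Capabilities, error)):
1. The skel package does not define a GetCapabilities function. What a plugin reports for the VERSION operation is the set of CNI versions it supports.
2. That set is declared as a version.PluginInfo value (for example version.All) and passed to skel.PluginMain together with the handlers. The plugin type ("bridge", "ipvlan", and so on) is given by the type field of the network configuration.
cmdCheck(args *skel.CmdArgs) error (optional):
1. Called for the CHECK operation to verify that the container's networking matches what the configuration and the earlier ADD result describe. It receives a skel.CmdArgs value that carries the plugin's inputs.
2. It returns an error to indicate the outcome of the check.

These handlers are the core functions of a CNI plugin. A concrete plugin may include other helper functions or structs as needed. A CNI plugin also provides an entry point, a main function, that initializes the plugin and calls the appropriate handler according to the CNI_COMMAND environment variable.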
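The shape of such a plugin can be sketched with the standard library alone. The code below imitates what the skel package does, reading CNI_COMMAND and dispatching to a handler, but it is a simplified stand-in and not the skel API: CmdArgs here is a local type with a subset of the fields of skel.CmdArgs, and Handlers and dispatch are names invented for the example. A real plugin would import github.com/containernetworking/cni/pkg/skel and let it do this work.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// CmdArgs is a local stand-in with a subset of the fields of skel.CmdArgs.
type CmdArgs struct {
	ContainerID string
	Netns       string
	IfName      string
	StdinData   []byte
}

// Handlers groups the functions a plugin author writes.
type Handlers struct {
	Add, Check, Del func(*CmdArgs) error
}

// dispatch calls the handler that matches the CNI_COMMAND value.
func dispatch(command string, args *CmdArgs, h Handlers) error {
	switch command {
	case "ADD":
		return h.Add(args)
	case "CHECK":
		return h.Check(args)
	case "DEL":
		return h.Del(args)
	case "VERSION":
		fmt.Println(`{"cniVersion":"1.0.0","supportedVersions":["1.0.0"]}`)
		return nil
	default:
		return fmt.Errorf("unsupported CNI_COMMAND %q", command)
	}
}

func cmdAdd(args *CmdArgs) error {
	// A real plugin would create and configure the interface in args.Netns
	// here. This sketch only prints a minimal result to standard output.
	result := map[string]any{
		"cniVersion": "1.0.0",
		"interfaces": []map[string]string{{"name": args.IfName, "sandbox": args.Netns}},
	}
	return json.NewEncoder(os.Stdout).Encode(result)
}

func cmdCheck(args *CmdArgs) error { return nil }

func cmdDel(args *CmdArgs) error { return nil }

func main() {
	command := os.Getenv("CNI_COMMAND")
	if command == "" {
		fmt.Fprintln(os.Stderr, "CNI_COMMAND is not set; this binary is meant to be run by a container runtime")
		return
	}
	stdin, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	args := &CmdArgs{
		ContainerID: os.Getenv("CNI_CONTAINERID"),
		Netns:       os.Getenv("CNI_NETNS"),
		IfName:      os.Getenv("CNI_IFNAME"),
		StdinData:   stdin,
	}
	if err := dispatch(command, args, Handlers{Add: cmdAdd, Check: cmdCheck, Del: cmdDel}); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```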

CNI plugin examples

https://qingwave.github.io/how-to-write-k8s-cni/

https://github.com/qingwave/mycni

CNI source code analysis

Source version

release/1.7 (the excerpts below are from containerd's CRI plugin)

Creating the sandbox

criService.RunPodSandbox

// RunPodSandbox creates and starts a pod-level sandbox. Runtimes should ensure
// the sandbox is in ready state.
func (c *criService) RunPodSandbox(ctx context.Context, r *runtime.RunPodSandboxRequest) (_ *runtime.RunPodSandboxResponse, retErr error) {
	config := r.GetConfig()
	log.G(ctx).Debugf("Sandbox config %+v", config)

	// Generate unique id and name for the sandbox and reserve the name.
	id := util.GenerateID()
	metadata := config.GetMetadata()
	if metadata == nil {
		return nil, errors.New("sandbox config must include metadata")
	}
	// The sandbox name is built from the name, namespace, id, etc. in the metadata.
	name := makeSandboxName(metadata)
	log.G(ctx).WithField("podsandboxid", id).Debugf("generated id for sandbox name %q", name)

	// cleanupErr records the last error returned by the critical cleanup operations in deferred functions,
	// like CNI teardown and stopping the running sandbox task.
	// If cleanup is not completed for some reason, the CRI-plugin will leave the sandbox
	// in a not-ready state, which can later be cleaned up by the next execution of the kubelet's syncPod workflow.
	var cleanupErr error

	// Reserve the sandbox name to avoid concurrent `RunPodSandbox` request starting the
	// same sandbox.
	if err := c.sandboxNameIndex.Reserve(name, id); err != nil {
		return nil, fmt.Errorf("failed to reserve sandbox name %q: %w", name, err)
	}
	defer func() {
		// Release the name if the function returns with an error and all the resource cleanup is done.
		// When cleanupErr != nil, the name will be cleaned in sandbox_remove.
		if retErr != nil && cleanupErr == nil {
			c.sandboxNameIndex.ReleaseByName(name)
		}
	}()

	// Create initial internal sandbox object.
	// Create the sandbox instance.
	sandbox := sandboxstore.NewSandbox(
		sandboxstore.Metadata{
			ID:             id,
			Name:           name,
			Config:         config,
			RuntimeHandler: r.GetRuntimeHandler(),
		},
		sandboxstore.Status{
			State:     sandboxstore.StateUnknown,
			CreatedAt: time.Now().UTC(),
		},
	)

	// Make sure the pause image needed to create the sandbox exists.
	// Ensure sandbox container image snapshot.
	image, err := c.ensureImageExists(ctx, c.config.SandboxImage, config)
	if err != nil {
		return nil, fmt.Errorf("failed to get sandbox image %q: %w", c.config.SandboxImage, err)
	}
	containerdImage, err := c.toContainerdImage(ctx, *image)
	if err != nil {
		return nil, fmt.Errorf("failed to get image from containerd %q: %w", image.ID, err)
	}

	// Get the runtime.
	ociRuntime, err := c.getSandboxRuntime(config, r.GetRuntimeHandler())
	if err != nil {
		return nil, fmt.Errorf("failed to get sandbox runtime: %w", err)
	}
	log.G(ctx).WithField("podsandboxid", id).Debugf("use OCI runtime %+v", ociRuntime)

	runtimeStart := time.Now()
	// Create sandbox container.
	// NOTE: sandboxContainerSpec SHOULD NOT have side
	// effect, e.g. accessing/creating files, so that we can test
	// it safely.
	// NOTE: the network namespace path will be created later and update through updateNetNamespacePath function
	spec, err := c.sandboxContainerSpec(id, config, &image.ImageSpec.Config, "", ociRuntime.PodAnnotations)
	if err != nil {
		return nil, fmt.Errorf("failed to generate sandbox container spec: %w", err)
	}
	log.G(ctx).WithField("podsandboxid", id).Debugf("sandbox container spec: %#+v", spew.NewFormatter(spec))
	sandbox.ProcessLabel = spec.Process.SelinuxLabel
	defer func() {
		if retErr != nil {
			selinux.ReleaseLabel(sandbox.ProcessLabel)
		}
	}()

	// handle any KVM based runtime
	// KVM-based runtimes such as Kata, which start containers using virtualization, need special handling.
	if err := modifyProcessLabel(ociRuntime.Type, spec); err != nil {
		return nil, err
	}

	if config.GetLinux().GetSecurityContext().GetPrivileged() {
		// If privileged don't set selinux label, but we still record the MCS label so that
		// the unused label can be freed later.
		spec.Process.SelinuxLabel = ""
	}

	// Start creating the pause container.
	// Generate spec options that will be applied to the spec later.
	specOpts, err := c.sandboxContainerSpecOpts(config, &image.ImageSpec.Config)
	if err != nil {
		return nil, fmt.Errorf("failed to generate sandbox container spec options: %w", err)
	}

	sandboxLabels := buildLabels(config.Labels, image.ImageSpec.Config.Labels, containerKindSandbox)

	runtimeOpts, err := generateRuntimeOptions(ociRuntime, c.config)
	if err != nil {
		return nil, fmt.Errorf("failed to generate runtime options: %w", err)
	}

	sOpts := []snapshots.Opt{snapshots.WithLabels(snapshots.FilterInheritedLabels(config.Annotations))}
	extraSOpts, err := sandboxSnapshotterOpts(config)
	if err != nil {
		return nil, err
	}
	sOpts = append(sOpts, extraSOpts...)

	opts := []containerd.NewContainerOpts{
		containerd.WithSnapshotter(c.runtimeSnapshotter(ctx, ociRuntime)),
		customopts.WithNewSnapshot(id, containerdImage, sOpts...),
		containerd.WithSpec(spec, specOpts...),
		containerd.WithContainerLabels(sandboxLabels),
		containerd.WithContainerExtension(sandboxMetadataExtension, &sandbox.Metadata),
		containerd.WithRuntime(ociRuntime.Type, runtimeOpts)}

	container, err := c.client.NewContainer(ctx, id, opts...)
	if err != nil {
		return nil, fmt.Errorf("failed to create containerd container: %w", err)
	}

	// pause lives in the pod but is a special container, distinct from the workload containers; associate it with the sandbox here.
	// Add container into sandbox store in INIT state.
	sandbox.Container = container

	//...

	if !hostNetwork(config) && userNsEnabled {
		// If userns is enabled, then the netns was created by the OCI runtime
		// when creating "task". The OCI runtime needs to create the netns
		// because, if userns is in use, the netns needs to be owned by the
		// userns. So, let the OCI runtime just handle this for us.
		// If the netns is not owned by the userns several problems will happen.
		// For instance, the container will lack permission (even if
		// capabilities are present) to modify the netns or, even worse, the OCI
		// runtime will fail to mount sysfs:
		//  https://github.com/torvalds/linux/commit/7dc5dbc879bd0779924b5132a48b731a0bc04a1e#diff-4839664cd0c8eab716e064323c7cd71fR1164
		netStart := time.Now()

		// Set up the network namespace.
		// If it is not in host network namespace then create a namespace and set the sandbox
		// handle. NetNSPath in sandbox metadata and NetNS is non empty only for non host network
		// namespaces. If the pod is in host network namespace then both are empty and should not
		// be used.
		var netnsMountDir = "/var/run/netns"
		if c.config.NetNSMountsUnderStateDir {
			netnsMountDir = filepath.Join(c.config.StateDir, "netns")
		}
		sandbox.NetNS, err = netns.NewNetNSFromPID(netnsMountDir, task.Pid())
		if err != nil {
			return nil, fmt.Errorf("failed to create network namespace for sandbox %q: %w", id, err)
		}

		// Verify task is still in created state.
		if st, err := task.Status(ctx); err != nil || st.Status != containerd.Created {
			return nil, fmt.Errorf("failed to create pod sandbox %q: err is %v - status is %q and is expected %q", id, err, st.Status, containerd.Created)
		}
		sandbox.NetNSPath = sandbox.NetNS.GetPath()

		defer func() {
			// Remove the network namespace only if all the resource cleanup is done.
			if retErr != nil && cleanupErr == nil {
				if cleanupErr = sandbox.NetNS.Remove(); cleanupErr != nil {
					log.G(ctx).WithError(cleanupErr).Errorf("Failed to remove network namespace %s for sandbox %q", sandbox.NetNSPath, id)
					return
				}
				sandbox.NetNSPath = ""
			}
		}()

		// Update network namespace in the container's spec
		c.updateNetNamespacePath(spec, sandbox.NetNSPath)

		if err := container.Update(ctx,
			// Update spec of the container
			containerd.UpdateContainerOpts(containerd.WithSpec(spec)),
			// Update sandbox metadata to include NetNS info
			containerd.UpdateContainerOpts(containerd.WithContainerExtension(sandboxMetadataExtension, &sandbox.Metadata))); err != nil {
			return nil, fmt.Errorf("failed to update the network namespace for the sandbox container %q: %w", id, err)
		}

		// Define this defer to teardownPodNetwork prior to the setupPodNetwork function call.
		// This is because in setupPodNetwork the resource is allocated even if it returns error, unlike other resource creation functions.
		defer func() {
			// Teardown the network only if all the resource cleanup is done.
			if retErr != nil && cleanupErr == nil {
				deferCtx, deferCancel := util.DeferContext()
				defer deferCancel()
				// Teardown network if an error is returned.
				if cleanupErr = c.teardownPodNetwork(deferCtx, sandbox); cleanupErr != nil {
					log.G(ctx).WithError(cleanupErr).Errorf("Failed to destroy network for sandbox %q", id)
				}
			}
		}()

		// Setup network for sandbox.
		// Certain VM based solutions like clear containers (Issue containerd/cri-containerd#524)
		// rely on the assumption that CRI shim will not be querying the network namespace to check the
		// network states such as IP.
		// In future runtime implementation should avoid relying on CRI shim implementation details.
		// In this case however caching the IP will add a subtle performance enhancement by avoiding
		// calls to network namespace of the pod to query the IP of the veth interface on every
		// SandboxStatus request.
		// Set up the sandbox network.
		if err := c.setupPodNetwork(ctx, &sandbox); err != nil {
			return nil, fmt.Errorf("failed to setup network for sandbox %q: %w", id, err)
		}
		sandboxCreateNetworkTimer.UpdateSince(netStart)
	}

	err = c.nri.RunPodSandbox(ctx, &sandbox)
	if err != nil {
		return nil, fmt.Errorf("NRI RunPodSandbox failed: %w", err)
	}

	defer func() {
		if retErr != nil {
			deferCtx, deferCancel := util.DeferContext()
			defer deferCancel()
			c.nri.RemovePodSandbox(deferCtx, &sandbox)
		}
	}()

	// Start the sandbox container.
	if err := task.Start(ctx); err != nil {
		return nil, fmt.Errorf("failed to start sandbox container task %q: %w", id, err)
	}

	if err := sandbox.Status.Update(func(status sandboxstore.Status) (sandboxstore.Status, error) {
		// Set the pod sandbox as ready after successfully start sandbox container.
		status.Pid = task.Pid()
		status.State = sandboxstore.StateReady
		status.CreatedAt = info.CreatedAt
		return status, nil
	}); err != nil {
		return nil, fmt.Errorf("failed to update sandbox status: %w", err)
	}

	if err := c.sandboxStore.Add(sandbox); err != nil {
		return nil, fmt.Errorf("failed to add sandbox %+v into store: %w", sandbox, err)
	}

	// Send CONTAINER_CREATED event with both ContainerId and SandboxId equal to SandboxId.
	// Note that this has to be done after sandboxStore.Add() because we need to get
	// SandboxStatus from the store and include it in the event.
	c.generateAndSendContainerEvent(ctx, id, id, runtime.ContainerEventType_CONTAINER_CREATED_EVENT)

	// start the monitor after adding sandbox into the store, this ensures
	// that sandbox is in the store, when event monitor receives the TaskExit event.
	//
	// TaskOOM from containerd may come before sandbox is added to store,
	// but we don't care about sandbox TaskOOM right now, so it is fine.
	c.eventMonitor.startSandboxExitMonitor(context.Background(), id, task.Pid(), exitCh)

	// Send CONTAINER_STARTED event with both ContainerId and SandboxId equal to SandboxId.
	c.generateAndSendContainerEvent(ctx, id, id, runtime.ContainerEventType_CONTAINER_STARTED_EVENT)

	sandboxRuntimeCreateTimer.WithValues(ociRuntime.Type).UpdateSince(runtimeStart)

	return &runtime.RunPodSandboxResponse{PodSandboxId: id}, nil
}

criService.setupPodNetwork

The pod's network is mainly created by the CNI plugin during sandbox creation.

// setupPodNetwork setups up the network for a pod
func (c *criService) setupPodNetwork(ctx context.Context, sandbox *sandboxstore.Sandbox) error {
	var (
		id     = sandbox.ID
		config = sandbox.Config
		path   = sandbox.NetNSPath
		// Get the loaded CNI plugin.
		netPlugin = c.getNetworkPlugin(sandbox.RuntimeHandler)
		err       error
		result    *cni.Result
	)
	if netPlugin == nil {
		return errors.New("cni config not initialized")
	}

	opts, err := cniNamespaceOpts(id, config)
	if err != nil {
		return fmt.Errorf("get cni namespace options: %w", err)
	}
	log.G(ctx).WithField("podsandboxid", id).Debugf("begin cni setup")
	netStart := time.Now()
	// Create the network using the CNI plugin.
	if c.config.CniConfig.NetworkPluginSetupSerially {
		result, err = netPlugin.SetupSerially(ctx, id, path, opts...)
	} else {
		result, err = netPlugin.Setup(ctx, id, path, opts...)
	}
	networkPluginOperations.WithValues(networkSetUpOp).Inc()
	networkPluginOperationsLatency.WithValues(networkSetUpOp).UpdateSince(netStart)
	if err != nil {
		networkPluginOperationsErrors.WithValues(networkSetUpOp).Inc()
		return err
	}
	logDebugCNIResult(ctx, id, result)
	// Check if the default interface has IP config
	if configs, ok := result.Interfaces[defaultIfName]; ok && len(configs.IPConfigs) > 0 {
		sandbox.IP, sandbox.AdditionalIPs = selectPodIPs(ctx, configs.IPConfigs, c.config.IPPreference)
		sandbox.CNIResult = result
		return nil
	}
	return fmt.Errorf("failed to find network info for sandbox %q", id)
}

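CNI plugins

Loading CNI plugins

NewCRIService

The CNI configuration is loaded when the CRI service is created. Note that NewCRIService belongs to the container runtime's CRI plugin, not to the kubelet.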

// NewCRIService returns a new instance of CRIService
func NewCRIService(config criconfig.Config, client *containerd.Client, nri *nri.API, warn warning.Service) (CRIService, error) {
	...
	c.cniNetConfMonitor = make(map[string]*cniNetConfSyncer)
	for name, i := range c.netPlugin {
		path := c.config.NetworkPluginConfDir
		if name != defaultNetworkPlugin {
			if rc, ok := c.config.Runtimes[name]; ok {
				path = rc.NetworkPluginConfDir
			}
		}
		if path != "" {
			// Load the CNI configuration following the default flow.
			m, err := newCNINetConfSyncer(path, i, c.cniLoadOptions())
			if err != nil {
				return nil, fmt.Errorf("failed to create cni conf monitor for %s: %w", name, err)
			}
			c.cniNetConfMonitor[name] = m
		}
	}

	// Preload base OCI specs
	c.baseOCISpecs, err = loadBaseOCISpecs(&config)
	if err != nil {
		return nil, err
	}

	// Load all sandbox controllers(pod sandbox controller and remote shim controller)
	c.sandboxControllers[criconfig.ModePodSandbox] = podsandbox.New(config, client, c.sandboxStore, c.os, c, c.baseOCISpecs)
	c.sandboxControllers[criconfig.ModeShim] = client.SandboxController()

	c.nri = nri

	return c, nil
}

defaultCNIConfig

When the user has not configured CNI directories, the default configuration is used. The constants below show the default locations of the CNI configuration files and plugin binaries.

func defaultCNIConfig() *libcni {
	return &libcni{
		config: config{
			// Default directories from which CNI plugin resources are loaded.
			pluginDirs:       []string{DefaultCNIDir},
			pluginConfDir:    DefaultNetDir,
			pluginMaxConfNum: DefaultMaxConfNum,
			prefix:           DefaultPrefix,
		},
		cniConfig: cnilibrary.NewCNIConfig(
			[]string{
				DefaultCNIDir,
			},
			&invoke.DefaultExec{
				RawExec:       &invoke.RawExec{Stderr: os.Stderr},
				PluginDecoder: version.PluginDecoder{},
			},
		),
		networkCount: 1,
	}
}

const (
	DefaultNetDir        = "/etc/cni/net.d"
	DefaultCNIDir        = "/opt/cni/bin"
	VendorCNIDirTemplate = "%s/opt/%s/bin"
)

Invoking CNI plugins

CNIConfig.addNetwork

Following the call chain libcni.Setup -> attachNetworks -> asynchAttach -> Attach -> AddNetworkList -> addNetwork,

the low-level function that executes the CNI plugin is:

func (c *CNIConfig) addNetwork(ctx context.Context, name, cniVersion string, net *NetworkConfig, prevResult types.Result, rt *RuntimeConf) (types.Result, error) {
	c.ensureExec()
	// Locate the plugin executable.
	pluginPath, err := c.exec.FindInPath(net.Network.Type, c.Path)
	if err != nil {
		return nil, err
	}
	if err := utils.ValidateContainerID(rt.ContainerID); err != nil {
		return nil, err
	}
	if err := utils.ValidateNetworkName(name); err != nil {
		return nil, err
	}
	if err := utils.ValidateInterfaceName(rt.IfName); err != nil {
		return nil, err
	}

	newConf, err := buildOneConfig(name, cniVersion, net, prevResult, rt)
	if err != nil {
		return nil, err
	}

	// Execute the CNI plugin binary found at that path.
	// The ADD argument tells the plugin to run its cmdAdd handler.
	return invoke.ExecPluginWithResult(ctx, pluginPath, newConf.Bytes, c.args("ADD", rt), c.exec)
}

References

https://atbug.com/how-kubelete-container-runtime-work-with-cni/

https://atbug.com/deep-dive-k8s-network-mode-and-communication/

https://www.cnblogs.com/lianngkyle/p/15171630.html

https://blog.csdn.net/weixin_40056921/article/details/129157735