
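This tutorial introduces the NUMA architecture of multi-core processor systems and shows how core binding can be applied to improve the performance of parallel workloads. It focuses on numactl and CPU core binding, with a pointer to MPI runner binding.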

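When running training or inference on multiple GPUs, I have found that binding processes to CPU cores ("core binding") often yields better performance numbers (throughput), and that how well the binding is done can itself affect the final metrics. Having read through quite a few related questions recently, I plan to summarize what I know about core binding here.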

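Here, "core binding" means using numactl to bind a process to specific CPU cores. Besides numactl, binding can also be done through an MPI runner, SLURM, or OpenMP parameters. Each parallel computing framework has its own binding mechanism, and those are not covered in this article.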

For MPI runners, see "NUMA Processors and Process Binding" (《NUMA 处理器与进程绑定》).

Why core binding is needed

Before explaining why binding with numactl is needed, a brief introduction to multi-core processor systems is in order.

Multi-core processor systems

A modern multi-core processor system is typically organized into several levels:

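Socket / package: a CPU that is physically and mechanically separable. Home PCs usually have a single socket. Servers commonly support 2 sockets (often called "dual-socket"), and 4- or 8-socket systems also exist. Sockets are usually connected by a high-speed interconnect, such as Intel QPI or AMD Infinity Fabric.
Core: a complete processing unit on the CPU that can execute a control flow independently, and the unit onto which the operating system schedules processes. Common home and server processors today are all multi-core: one AMD EPYC 7763 has 64 cores, and one Intel Xeon Platinum 8380 has 40 cores.
Hyper-threading (HT): a technique that presents a single physical CPU core as multiple logical cores in order to make fuller use of its execution units. Current processors usually provide 2 hardware threads per core. The "thread" or "hardware thread (hwt)" count quoted for a CPU includes hyper-threads (not to be confused with software threads in the operating system). For example, the Intel Core i7-10700K has 8 cores and 16 threads, written 8C16T.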

Because the word "CPU" can refer to a unit at any of these levels, using it on its own is discouraged to avoid ambiguity.
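To make the core versus hardware-thread distinction concrete, here is a minimal Python sketch that groups logical CPUs into physical cores. It assumes the sibling information comes from Linux's /sys/devices/system/cpu/cpuN/topology/thread_siblings_list files. The sample data below is made up for a hypothetical 4C8T part, and the helper names are illustrative only.

```python
def parse_cpulist(text):
    """Expand a kernel-style CPU list such as "0-3,8,10-11" into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def group_hardware_threads(siblings_by_cpu):
    """Group logical CPUs into physical cores.

    siblings_by_cpu maps a logical CPU id to the text of its
    thread_siblings_list file, e.g. {0: "0,4", 4: "0,4"}.
    """
    cores = {tuple(parse_cpulist(s)) for s in siblings_by_cpu.values()}
    return sorted(cores)


# Illustrative data for a hypothetical 4C8T processor (not read from a real machine).
sample = {cpu: f"{cpu % 4},{cpu % 4 + 4}" for cpu in range(8)}
cores = group_hardware_threads(sample)
print(f"{len(cores)}C{len(sample)}T", cores)
```

Each tuple is one physical core together with the logical CPUs that share it, which is the information needed to avoid placing two busy processes on the same core.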

NUMA (Non-Uniform Memory Access)

NUMA architecture and core binding in multi-core processor systems

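Because SMP (symmetric multiprocessing) is limited in how far it can scale, people began looking for ways to scale effectively and build large systems, and NUMA is one result of that effort. With NUMA, dozens or even hundreds of CPUs can be combined in a single server. A NUMA server is characterized by multiple CPU modules, each consisting of several CPUs (for example, 4) with its own local memory, I/O slots, and so on. Because the nodes are connected and exchange information through an interconnect module (such as a crossbar switch), every CPU can access the memory of the whole system, which is an important difference between NUMA and MPP systems. Accessing local memory is considerably faster than accessing remote memory (memory on other nodes in the system), and this is where the name non-uniform memory access comes from.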

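Modern multi-socket systems generally use a NUMA architecture: each socket reaches its local memory through its own memory controller, and reaches remote memory, which belongs to other sockets, over the inter-socket interconnect. A set of directly connected CPU cores, memory, and other devices (such as NICs and GPUs) is called a NUMA domain (or NUMA node). Memory access performance (both bandwidth and latency) within a domain (intra-domain) is usually significantly better than across domains (inter-domain). This is known as the NUMA effect.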

NUMA effect

NUMA also distinguishes three kinds of nodes:

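Local node: for all CPUs in a node, that node is the local node.
Neighbor node: a node adjacent to the local node.
Remote node: a node that is neither the local node nor a neighbor node.
Neighbor nodes and remote nodes are both called non-local nodes (off node).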

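A CPU accesses memory on different kinds of nodes at different speeds: local-node access is fastest and remote-node access is slowest. In other words, access speed depends on the distance between nodes, and the farther away the node, the slower the access. This distance is called the node distance. Applications should minimize interaction between different CPU modules. If an application can be confined to a single CPU module, its performance can improve considerably.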
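The local / neighbor / remote distinction can be sketched from a node distance matrix. The following Python example is only an illustration: the 4-node matrix is hypothetical, and the rule "neighbor means the nearest non-local nodes, everything else is remote" is an assumption made for this sketch rather than a definition taken from any tool.

```python
def classify_nodes(distances, local):
    """Split node ids into local / neighbor / remote as seen from `local`.

    Assumption for this sketch: "neighbor" means the non-local nodes at the
    smallest distance, and every other non-local node counts as "remote".
    """
    row = distances[local]
    others = [n for n in range(len(row)) if n != local]
    nearest = min((row[n] for n in others), default=None)
    return {
        "local": [local],
        "neighbor": [n for n in others if row[n] == nearest],
        "remote": [n for n in others if row[n] != nearest],
    }


# Hypothetical 4-node distance matrix, laid out like the "node distances" table.
distances = [
    [10, 16, 32, 32],
    [16, 10, 32, 32],
    [32, 32, 10, 16],
    [32, 32, 16, 10],
]
print(classify_nodes(distances, local=0))
```

On a two-node system every non-local node is at the same distance, so under this rule the other node is a neighbor and the remote list is empty.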

The purpose of core binding is to make a process's access to resources, and its execution, more efficient.

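When a processor has multiple CPU cores, the operating system scheduler migrates processes among the available cores to try to keep the load balanced. On a multi-NUMA system, if a process is migrated to a NUMA domain different from the one where it was created, performance may suffer (Linux has put some effort into NUMA-aware scheduling and NUMA load balancing, but for various reasons the results are not ideal). In addition, a process must be paused while it migrates, and on the new core it inevitably pays cold-start costs in the cache, branch predictor, and other components, which causes performance fluctuations. For compute-intensive programs, therefore, processes and threads are usually bound (binding / pinning) to CPU cores, that is, their CPU affinity is controlled, to eliminate these effects.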
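Affinity can also be set from inside a program. The following is a minimal Python sketch using os.sched_getaffinity and os.sched_setaffinity, which exist only on some platforms (notably Linux). The helper names choose_target and pin_current_process are hypothetical, and the demo simply pins to the lowest-numbered allowed CPU before restoring the original mask.

```python
import os


def choose_target(requested, allowed):
    """Return the requested CPUs that the process is actually allowed to use."""
    target = set(requested) & set(allowed)
    if not target:
        raise ValueError("none of the requested CPUs are available to this process")
    return target


def pin_current_process(requested):
    """Pin the calling process (pid 0 means "self") and return its new CPU set."""
    target = choose_target(requested, os.sched_getaffinity(0))
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)


# The affinity calls are not available everywhere, so guard the demo.
if hasattr(os, "sched_setaffinity"):
    original = os.sched_getaffinity(0)
    pinned = pin_current_process([min(original)])
    print("pinned to", sorted(pinned))
    os.sched_setaffinity(0, original)  # restore the previous affinity
```

Unlike numactl, this only controls where the process runs. It does not set a memory placement policy.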

How to bind with numactl

First, numactl -H shows the NUMA topology of the current system.

numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107
node 0 size: 64079 MB
node 0 free: 61836 MB
node 1 cpus: 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
node 1 size: 64484 MB
node 1 free: 63580 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

This shows that the system has two NUMA nodes: node 0 contains CPUs 0-35 and 72-107, and node 1 contains CPUs 36-71 and 108-143.
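When binding is scripted, the node-to-CPU mapping can be extracted from this output. Below is a minimal Python sketch that parses the "node N cpus:" lines and compresses each list into ranges. The sample text is a shortened, made-up example in the same format, and the function names are illustrative.

```python
import re


def parse_numactl_hardware(text):
    """Map each NUMA node id to its list of CPU ids from `numactl -H` output."""
    nodes = {}
    for line in text.splitlines():
        match = re.match(r"node (\d+) cpus:(.*)", line.strip())
        if match:
            nodes[int(match.group(1))] = [int(c) for c in match.group(2).split()]
    return nodes


def to_ranges(cpus):
    """Compress CPU ids into a range string, e.g. [0, 1, 2, 8] -> "0-2,8"."""
    spans = []
    for cpu in sorted(cpus):
        if spans and cpu == spans[-1][1] + 1:
            spans[-1][1] = cpu  # extend the current run
        else:
            spans.append([cpu, cpu])  # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in spans)


# Shortened, illustrative output (not from a real machine).
sample = """available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 8 9 10 11
node 0 size: 64079 MB
node 1 cpus: 4 5 6 7 12 13 14 15
node 1 size: 64484 MB
"""
topology = parse_numactl_hardware(sample)
for node, cpus in topology.items():
    print(f"node {node}: {to_ranges(cpus)}")
```

On a real system the text could be obtained with subprocess.run(["numactl", "-H"], capture_output=True, text=True).stdout, assuming numactl is installed.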

Next, numastat -p can be used to query how much memory a specific process is using on each NUMA node.
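As a complementary check, per-node page counts can also be read from /proc/<pid>/numa_maps on Linux, where each mapping line carries tokens of the form N<node>=<pages>. The Python sketch below sums those tokens. The two sample lines are made up for illustration, and the helper name is hypothetical.

```python
import re
from collections import Counter


def pages_per_node(numa_maps_text):
    """Sum the N<node>=<pages> tokens of a numa_maps dump, per NUMA node."""
    totals = Counter()
    for line in numa_maps_text.splitlines():
        for node, pages in re.findall(r"\bN(\d+)=(\d+)", line):
            totals[int(node)] += int(pages)
    return dict(totals)


# Made-up sample in the numa_maps line format (not from a real process).
sample = """00400000 default file=/usr/bin/python3 mapped=12 N0=12 kernelpagesize_kB=4
7f2c00000000 bind:0 anon=300 dirty=300 N0=280 N1=20 kernelpagesize_kB=4
"""
print(pages_per_node(sample))
```

The counts are in pages, and the page size can differ between mappings, so treat the totals as a rough placement check rather than an exact byte count.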

The three numactl options most commonly used for binding are:

--physcpubind / -C n: run only on the CPU numbered n
--cpunodebind / -N n: run only on the CPUs of NUMA node n (often, but not always, one socket)
--membind / -m n: allocate memory only from NUMA node n
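When launching from a script, these three options can be assembled programmatically. Here is a minimal Python sketch that builds a numactl argument list. The function name numactl_command and its parameters are hypothetical, and the program being wrapped (python main.py) is only a placeholder.

```python
import shlex


def numactl_command(program, cpus=None, cpu_node=None, mem_node=None):
    """Build an argv list that runs `program` under numactl with the given binding."""
    argv = ["numactl"]
    if cpus is not None:
        argv.append("--physcpubind=" + ",".join(str(c) for c in cpus))
    if cpu_node is not None:
        argv.append(f"--cpunodebind={cpu_node}")
    if mem_node is not None:
        argv.append(f"--membind={mem_node}")
    return argv + list(program)


# Bind a placeholder program to CPUs 0 and 1, with memory from node 0.
cmd = numactl_command(["python", "main.py"], cpus=[0, 1], mem_node=0)
print(shlex.join(cmd))
```

The resulting list can be passed directly to subprocess.run, which avoids shell quoting issues.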

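In practice, for a multi-GPU, multi-process AI inference workload, each inference process can be bound to specific cores and to the memory of the corresponding NUMA domain.
Every two CPU cores serve the inference task of one GPU, so that the processes run in parallel as efficiently as possible.
For example: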

parallel --lb -d, --tagstring "[{#}]" ::: \
"numactl -C 0-1 -m 0 python main.py -a resnet50 -e -b 1 --pretrained --gpu0 0 $DATASETS --int8 1 --benchmark 1 --num-iterations 5000 --dummy" \
"numactl -C 2-3 -m 0 python main.py -a resnet50 -e -b 1 --pretrained --gpu1 0 $DATASETS --int8 1 --benchmark 1 --num-iterations 5000 --dummy" \
"numactl -C 4-5 -m 0 python main.py -a resnet50 -e -b 1 --pretrained --gpu3 0 $DATASETS --int8 1 --benchmark 1 --num-iterations 5000 --dummy" \
"numactl -C 6-7 -m 0 python main.py -a resnet50 -e -b 1 --pretrained --gpu4 0 $DATASETS --int8 1 --benchmark 1 --num-iterations 5000 --dummy"

The numactl binding above can still run into memory allocation problems. See "numactl membind not taking effect" (《numactl 内存绑定(membind) 失效》).
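The "two cores per GPU process" layout used in the example command can be generated rather than typed by hand. The Python sketch below splits the CPUs of one NUMA node into fixed-size groups. The function name assign_cores is hypothetical, and the CPU range 0-35 is taken from node 0 of the numactl -H output shown earlier as an assumption.

```python
def assign_cores(node_cpus, num_workers, cores_per_worker=2):
    """Give each worker its own contiguous group of CPUs from one NUMA node."""
    cpus = sorted(node_cpus)
    needed = num_workers * cores_per_worker
    if needed > len(cpus):
        raise ValueError(f"need {needed} CPUs but the node only has {len(cpus)}")
    return [
        cpus[i * cores_per_worker:(i + 1) * cores_per_worker]
        for i in range(num_workers)
    ]


# Four workers on node 0, assumed here to own CPUs 0-35.
plan = assign_cores(range(0, 36), num_workers=4)
for worker, cpus in enumerate(plan):
    cpu_list = ",".join(str(c) for c in cpus)
    print(f"worker {worker}: numactl -C {cpu_list} -m 0 <command>")
```

Whether consecutive CPU numbers are separate physical cores or hyper-thread siblings depends on the machine, so the grouping is worth checking against the actual topology before use.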

More options:

--all, -a
              Unset default cpuset awareness, so user can use all
              possible CPUs/nodes for following policy settings.
--interleave=nodes, -i nodes
              Set a memory interleave policy. Memory will be allocated
              using round robin on nodes.  When memory cannot be
              allocated on the current interleave target fall back to
              other nodes.  Multiple nodes may be specified on
              --interleave, --membind and --cpunodebind.
--membind=nodes, -m nodes
              Only allocate memory from nodes.  Allocation will fail
              when there is not enough memory available on these nodes.
              nodes may be specified as noted above.
--cpunodebind=nodes, -N nodes
              Only execute command on the CPUs of nodes.  Note that
              nodes may consist of multiple CPUs.  nodes may be
              specified as noted above.
--physcpubind=cpus, -C cpus
              Only execute process on cpus.  This accepts cpu numbers as
              shown in the processor fields of /proc/cpuinfo, or
              relative cpus as in relative to the current cpuset.  You
              may specify "all", which means all cpus in the current
              cpuset.  Physical cpus may be specified as N,N,N or  N-N
              or N,N-N or  N-N,N-N and so forth.  Relative cpus may be
              specified as +N,N,N or  +N-N or +N,N-N and so forth. The +
              indicates that the cpu numbers are relative to the
              process' set of allowed cpus in its current cpuset.  A !N-
              N notation indicates the inverse of N-N, in other words
              all cpus except N-N.  If used with + notation, specify
              !+N-N.
--localalloc, -l
              Try to allocate on the current node of the process, but if
              memory cannot be allocated there fall back to other nodes.
--preferred=node
              Preferably allocate memory on node, but if memory cannot
              be allocated there fall back to other nodes.  This option
              takes only a single node number.  Relative notation may be
              used.
--balancing, -b
              Enable Linux kernel NUMA balancing for the process if it
              is supported by kernel.  This should only be used with
              --membind, -m only, otherwise ignored.
--preferred-many=node
              Preferably allocate memory on nodes, but if memory cannot
              be allocated there fall back to other nodes.  This option
              takes a mask of preferred nodes where the closest node to
              local is considered most preferred.  Relative notation may
              be used.
--show, -s
              Show NUMA policy settings of the current process.
--hardware, -H
              Show inventory of available nodes on the system.

![image.png](https://p1-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/a9a2f193f4cb4533af91ee28c331860b~tplv-k3u1fbpfcp-jj-mark:0:0:0:0:q75.jpg)

Reference:

"NUMA Processors and Process Binding" (《NUMA 处理器与进程绑定》)
"SMP and NUMA Architectures" (《SMP、NUMA体系结构》)
