TL,DR: 短代码cmov快 长代码jmp快 cmov如果依赖特别严重会有性能衰退
原文 https://kristerw.github.io/2022/05/24/branchless/
加了一些自己的理解, 点击原文可以获得更好的跳转体验
所谓的branch-free编程 或者说 branch-less编程,就是一种优化if分支降低branch miss的手段
比如借助冒号表达式,借助 =等等
而这种优化通常都会优化成cmov
大概十几年前 linus对于cmov不屑一顾 https://yarchive.net/comp/linux/cmov.html,性能表现不行
测试代码在这里
代码语言:javascript复制#include <stdio.h>
/* How many iterations? */
#define ITERATIONS (100000000)
/* Which bit of the counter to test? */
#define BIT 1
#ifdef CMOV
#define choose(i, a, b) ({
unsigned long result;
asm("testl %1,%2 ; cmovne %3,%0"
:"=r" (result)
:"i" (BIT),
"g" (i),
"rm" (a),
"0" (b));
result; })
#else
#define choose(i, a, b) ({
unsigned long result;
asm("testl %1,%2 ; je 1f ; mov %3,%0n1:"
:"=r" (result)
:"i" (BIT),
"g" (i),
"g" (a),
"0" (b));
result; })
#endif
int main(int argc, char **argv)
{
int i;
unsigned long sum = 0;
for (i = 0; i < ITERATIONS; i ) {
unsigned long a = 5, b = 7;
sum = choose(i, a, b);
}
printf("%lun", sum);
return 0;
}
我测试了一下
代码语言:javascript复制gcc -Wall -O2 cmov.c
time ./a.out
600000000
real 0m0.540s
user 0m0.065s
sys 0m0.001s
gcc -Wall -O2 -DCMOV cmov.c
time ./a.out
600000000
real 0m0.522s
user 0m0.055s
sys 0m0.000s
显然cmov更快 linus这个结论过时了,毕竟07年。
新机器cmov是快的。这个测试比较简单没有引入cacheline的影响
这里有个测试显然cmov也是快的 https://github.com/marcin-osowski/cmov/
Agner Fog的文档里也说了 https://www.agner.org/optimize/optimizing_assembly.pdf
(for “a = b > c ? d : e;”): As a rule of thumb, we can say that a conditional jump is faster than a conditional move if the code is part of a dependency chain and the prediction rate is better than 75%. A conditional jump is also preferred if we can avoid a lengthy calculation of d or e when the other operand is chosen.
简单说就是依赖性弱cmov快否则jmp快 (你问chatgpt也是这个回答)
那么 __builtin_expect 会生成cmov吗?
答案是不能,但是__builtin_expect_with_probability 可以
godbolt https://godbolt.org/z/Gzr7bEaef
代码语言:javascript复制#define LIKELY(x) (__builtin_expect(!!(x), 1))
#define VERY_LIKELY(x) (__builtin_expect_with_probability(!!(x), 1, 0.999))
int foo1(int a, int b, int c)
{
if (a == b)
a &= c;
return a;
}
/*
foo1:
and edx, edi
mov eax, edi
cmp edi, esi
cmove eax, edx
ret
*/
int foo2(int a, int b, int c)
{
if (LIKELY(a == b))
a &= c;
return a;
}
/*
foo2:
and edx, edi
mov eax, edi
cmp edi, esi
cmove eax, edx
ret
*/
int foo3(int a, int b, int c)
{
if (VERY_LIKELY(a == b))
a &= c;
return a;
}
/*
foo3:
mov eax, edi
cmp edi, esi
jne .L7
and eax, edx
.L7:
ret
*/
冒号表达式一定能cmov吗?
这个例子来自lemire https://lemire.me/blog/2021/07/14/faster-sorted-array-unions-by-reducing-branches/
godbolt https://godbolt.org/z/ehhfzYqed
代码语言:javascript复制while ((pos1 < size1) & (pos2 < size2)) {
v1 = input1[pos1];
v2 = input2[pos2];
output_buffer[pos ] = (v1 <= v2) ? v1 : v2;
pos1 = (v1 <= v2) ? pos1 1 : pos1;
pos2 = (v1 >= v2) ? pos2 1 : pos2;
}
这个代码无法预测,没法优化成cmov,一会大于一会小于,怎么假设啊
但是我们可以利用交换 =
代码语言:javascript复制while ((pos1 < size1) & (pos2 < size2)) {
v1 = input1[pos1];
v2 = input2[pos2];
output_buffer[pos ] = (v1 <= v2) ? v1 : v2;
pos1 = (v1 <= v2);
pos2 = (v1 >= v2);
}
LLVM 后端引入分支避免生成cmov
godbolt https://godbolt.org/z/qxv96ofbM
代码语言:javascript复制
#include <cstddef>
#include <cstdint>
#include <utility>
void downHeap(uint64_t* top, size_t size, size_t pos) {
size_t parent = pos;
size_t child;
while ((child = 2 * parent 2) < size) {
auto left = child - 1;
child = top[left] < top[child] ? child : left; // <<-- Unpredictable, should be CMOV
if (top[parent] < top[child]) {
std::swap(top[parent], top[child]);
parent = child;
} else {
return;
}
}
if (--child < size && top[parent] < top[child]) {
std::swap(top[parent], top[child]);
}
}
用了cmov 但是性能反倒是更差 为了避免这个bug https://github.com/llvm/llvm-project/issues/39374
引入 __builtin_unpredictable 的出发点是可以屏蔽掉cmov生成,但是只能影响middle-end
后段select会不会优化成cmov不能通过__builtin_unpredictable修正
无分支代码一定会生成cmov吗
不一定,可能会插入分支
比如 本应该生成cmov llvm给加分支了 https://godbolt.org/z/oM5aWeYW4
godbolt https://godbolt.org/z/oM5aWeYW4
代码语言:javascript复制#include <cstddef>
#include <cstdint>
// Select an element from an array in constant time.
uint64_t constant_time_lookup(const size_t secret_idx,
const uint64_t table[8]) {
uint64_t result = 0;
for (size_t i = 0; i < 8; i ) {
const bool cond = i == secret_idx;
const uint64_t mask = (-(int64_t)cond);
result |= table[i] & mask;
}
return result;
}
这个例子我不是很理解,貌似是密码学防止攻击,故意不优化成cmov
gcc一个例子 godbolt https://godbolt.org/z/3GqMsY6v3
代码语言:javascript复制unsigned foo(unsigned a, unsigned b)
{
unsigned t = ((a > 2) != 0) << 1;
t |= ((a < 10) != 0) << 2;
return b >> t;
}
再比如
代码语言:javascript复制unsigned r = ((a & 0xffff0000) != 0) << 4;
会改成
代码语言:javascript复制unsigned r;
if (a > 65535)
r = 16;
else
r = 0;
结论
- • cmov不是万能药,短代码cmov快 长代码jmp快 cmov如果依赖特别严重会有性能衰退
- • __builtin_expect是优化预测分支的,不是优化无分支的,__builtin_expect_with_probability可以优化无分支
- • __builtin_unpredictable 想帮忙屏蔽cmov,但没啥用
- • 编译器可能会给简单的无分支代码加上分支
其他
- • Speculative问题 对cmov的影响这里就不讨论了,扯太远了 https://llvm.org/docs/SpeculativeLoadHardening.html
- • https://danluu.com/branch-prediction/ 介绍分支预测的,不错