聊聊cmov

2024-07-30 15:08:02 浏览数 (1)

TL,DR: 短代码cmov快 长代码jmp快 cmov如果依赖特别严重会有性能衰退

原文 https://kristerw.github.io/2022/05/24/branchless/

加了一些自己的理解, 点击原文可以获得更好的跳转体验

所谓的branch-free编程 或者说 branch-less编程,就是一种优化if分支降低branch miss的手段

比如借助冒号表达式,借助 =等等

而这种优化通常都会优化成cmov

大概十几年前 linus对于cmov不屑一顾 https://yarchive.net/comp/linux/cmov.html,性能表现不行

测试代码在这里

代码语言:javascript复制
#include <stdio.h>

/* How many iterations? */
#define ITERATIONS (100000000)

/* Which bit of the counter to test? */
#define BIT 1

#ifdef CMOV

#define choose(i, a, b) ({   
    unsigned long result;   
    asm("testl %1,%2 ; cmovne %3,%0" 
        :"=r" (result)   
        :"i" (BIT),   
         "g" (i),   
         "rm" (a),   
         "0" (b));   
    result; })

#else

#define choose(i, a, b) ({   
    unsigned long result;   
    asm("testl %1,%2 ; je 1f ; mov %3,%0n1:" 
        :"=r" (result)   
        :"i" (BIT),   
         "g" (i),   
         "g" (a),   
         "0" (b));   
    result; })

#endif

int main(int argc, char **argv)
{
    int i;
    unsigned long sum = 0;

    for (i = 0; i < ITERATIONS; i  ) {
        unsigned long a = 5, b = 7;
        sum  = choose(i, a, b);
    }
    printf("%lun", sum);
    return 0;
}

我测试了一下

代码语言:javascript复制
gcc -Wall -O2 cmov.c

time ./a.out
600000000

real 0m0.540s
user 0m0.065s
sys 0m0.001s

gcc -Wall -O2 -DCMOV cmov.c
time ./a.out

600000000

real 0m0.522s
user 0m0.055s
sys 0m0.000s

显然cmov更快 linus这个结论过时了,毕竟07年。

新机器cmov是快的。这个测试比较简单没有引入cacheline的影响

这里有个测试显然cmov也是快的 https://github.com/marcin-osowski/cmov/

Agner Fog的文档里也说了 https://www.agner.org/optimize/optimizing_assembly.pdf

(for “a = b > c ? d : e;”): As a rule of thumb, we can say that a conditional jump is faster than a conditional move if the code is part of a dependency chain and the prediction rate is better than 75%. A conditional jump is also preferred if we can avoid a lengthy calculation of d or e when the other operand is chosen.

简单说就是依赖性弱cmov快否则jmp快 (你问chatgpt也是这个回答)

那么 __builtin_expect 会生成cmov吗?

答案是不能,但是__builtin_expect_with_probability 可以

godbolt https://godbolt.org/z/Gzr7bEaef

代码语言:javascript复制
#define LIKELY(x) (__builtin_expect(!!(x), 1))
#define VERY_LIKELY(x) (__builtin_expect_with_probability(!!(x), 1, 0.999))

int foo1(int a, int b, int c)
{
    if (a == b)
        a &= c;
    return a;    
}

/*
foo1:
        and     edx, edi
        mov     eax, edi
        cmp     edi, esi
        cmove   eax, edx
        ret
*/

int foo2(int a, int b, int c)
{
    if (LIKELY(a == b))
        a &= c;
    return a;    
}
/*
foo2:
        and     edx, edi
        mov     eax, edi
        cmp     edi, esi
        cmove   eax, edx
        ret
*/
int foo3(int a, int b, int c)
{
    if (VERY_LIKELY(a == b))
        a &= c;
    return a;    
}


/*
foo3:
        mov     eax, edi
        cmp     edi, esi
        jne     .L7
        and     eax, edx
.L7:
        ret
*/

冒号表达式一定能cmov吗?

这个例子来自lemire https://lemire.me/blog/2021/07/14/faster-sorted-array-unions-by-reducing-branches/

godbolt https://godbolt.org/z/ehhfzYqed

代码语言:javascript复制
while ((pos1 < size1) & (pos2 < size2)) {
  v1 = input1[pos1];
  v2 = input2[pos2];
  output_buffer[pos  ] = (v1 <= v2) ? v1 : v2;
  pos1 = (v1 <= v2) ? pos1   1 : pos1;
  pos2 = (v1 >= v2) ? pos2   1 : pos2;
}

这个代码无法预测,没法优化成cmov,一会大于一会小于,怎么假设啊

但是我们可以利用交换 =

代码语言:javascript复制
while ((pos1 < size1) & (pos2 < size2)) {
  v1 = input1[pos1];
  v2 = input2[pos2];
  output_buffer[pos  ] = (v1 <= v2) ? v1 : v2;
  pos1  = (v1 <= v2);
  pos2  = (v1 >= v2);
}

LLVM 后端引入分支避免生成cmov

godbolt https://godbolt.org/z/qxv96ofbM

代码语言:javascript复制

#include <cstddef>
#include <cstdint>
#include <utility>

void downHeap(uint64_t* top, size_t size, size_t pos) {
  size_t parent = pos;
  size_t child;
  while ((child = 2 * parent   2) < size) {
    auto left = child - 1;
    child = top[left] < top[child] ? child : left;  // <<-- Unpredictable, should be CMOV
    if (top[parent] < top[child]) {
      std::swap(top[parent], top[child]);
      parent = child;
    } else {
      return;
    }
  }
  if (--child < size && top[parent] < top[child]) {
    std::swap(top[parent], top[child]);
  }
}

用了cmov 但是性能反倒是更差 为了避免这个bug https://github.com/llvm/llvm-project/issues/39374

引入 __builtin_unpredictable 的出发点是可以屏蔽掉cmov生成,但是只能影响middle-end

后段select会不会优化成cmov不能通过__builtin_unpredictable修正

无分支代码一定会生成cmov吗

不一定,可能会插入分支

比如 本应该生成cmov llvm给加分支了 https://godbolt.org/z/oM5aWeYW4

godbolt https://godbolt.org/z/oM5aWeYW4

代码语言:javascript复制
#include <cstddef>
#include <cstdint>

// Select an element from an array in constant time.
uint64_t constant_time_lookup(const size_t secret_idx,
                              const uint64_t table[8]) {
  uint64_t result = 0;
  for (size_t i = 0; i < 8; i  ) {
    const bool cond = i == secret_idx;
    const uint64_t mask = (-(int64_t)cond);
    result |= table[i] & mask;
  }
  return result;
}

这个例子我不是很理解,貌似是密码学防止攻击,故意不优化成cmov

gcc一个例子 godbolt https://godbolt.org/z/3GqMsY6v3

代码语言:javascript复制
unsigned foo(unsigned a, unsigned b)
{
  unsigned t = ((a > 2) != 0) << 1;
  t |= ((a < 10) != 0) << 2;
  return b >> t;
}

再比如

代码语言:javascript复制
unsigned r = ((a & 0xffff0000) != 0) << 4;

会改成

代码语言:javascript复制
unsigned r;
if (a > 65535)
  r = 16;
else
  r = 0;

结论

  • • cmov不是万能药,短代码cmov快 长代码jmp快 cmov如果依赖特别严重会有性能衰退
  • • __builtin_expect是优化预测分支的,不是优化无分支的,__builtin_expect_with_probability可以优化无分支
  • • __builtin_unpredictable 想帮忙屏蔽cmov,但没啥用
  • • 编译器可能会给简单的无分支代码加上分支

其他

  • • Speculative问题 对cmov的影响这里就不讨论了,扯太远了 https://llvm.org/docs/SpeculativeLoadHardening.html
  • • https://danluu.com/branch-prediction/ 介绍分支预测的,不错

0 人点赞