C++ 中文周刊第134期

RSS https://github.com/wanghenshui/cppweeklynews/releases.atom

欢迎投稿，推荐或自荐文章/软件/资源等

请提交 issue[2]

感谢不语赞助

和群友讨论的指针乱分类

指针定义九宫格	定义纯粹派必须是指针	定义中立派有指针的能力	定义自由派没有能力也行
形式纯粹派必须有*	void *	operator*()	"" 是char当然是指针
形式中立派得有指向含义	智能指针	引用也是指针	fd/handle也是指针当然数组也是指针
形式自由派有指针就行	表针也是指针指南针更是指针鼠标也是指针	针灸也是指针东方不败自然也是指针手指也是指针	广告也是指针地址通讯录也是指针酒店小卡片也是指针

资讯

标准委员会动态/ide/编译器信息放在这里

编译器信息最新动态推荐关注hellogcc公众号 OSDT Weekly 2023-10-11 第223期

brpc rpcz功能存在xss跨站漏洞，建议尽快升级 1.6.1

如果无法升级，可以打补丁 https://github.com/apache/brpc/pull/2411/files

文章

GCC Preparing To Introduce "-fhardened" Security Hardening Option[3]

Mick235711 投稿

gcc 14.1的新选项-fhardened会自动开启大部分相对轻量级的安全检查，包括3级保护（常见C库函数，比如strcpy等等的内存溢出诊断），标准库断言（这里面包括std::span和其他容器的operator[]边界检查），栈溢出检查等等

How to compare signed and unsigned integers in C 20?[4]

介绍 std::cmp_xx的之前也介绍过

使用 jegdb 来调试内存相关 crash

通过jemalloc meta信息反查bug，有点东西

flat_map性能调研[5]

了解个大概

C 26静态反射提案解析[6]

看一乐

视频

接着上期的内容

cppcon 2022

• A-Faster-Serialization-Library-Based-on-Compile-time-Reflection-and-C-20-Yu-Qi-CppCon-2022

介绍qimosmos的strucpack的。挺有意思。资料也很多，这里就不罗列了，感兴趣的可以搜一下，标记个TODO，咱们以后有时间单独讲一下

• binary-search-cppcon

优化二分

二分谁都知道吧

代码语言：javascript复制

int lower_bound(int x) {
    int l = 0, r = n - 1;
    while (l < r) {
        int m = (l   r) / 2;
        if (t[m] >= x)
            r = m;
        else
            l = m   1;
    }
    return t[l];
}

代码语言：javascript复制

不会写的深蹲十个

CPU执行基本流程还记得吧 fetch decode execute write。然后CPU有流水线pipeline优化，整个流程15-20cycle

流水线优化能并发上面的步骤，什么能阻碍流水线优化？

• structural hazard 太多指令用相同的CPU地址，这个是fetch decode机器码环节，无解
• data hazard 需要等待前面的数据，这个其实是软件问题，也没啥好办法
• control hazard CPU不知道下一次该执行什么指令，存在依赖分支miss的场景，这个是可以挽救修正的

我们回头再看二分的代码

while循环还好说，里面的if是非常重的

怎么改成无分支版本？换一种思路，我们不挪动index，我们挪动数组地址

代码语言：javascript复制

int lower_bound(int x) {
  int* base = t ,len = n;
  while(len> 1) {
    int half = len / 2;
    if (base[half - 1] < x) {
      base  = half;
      len = len - half; // = ceil(len / 2)
    } else {
      len = half; // = floor(len / 2)
    }
  }
  return *base;
}

代码语言：javascript复制

注意到重复代码，这样能把else去掉

代码语言：javascript复制

int lower_bound(int x) {
  int* base = t ,len = n;
  while(len> 1) {
    int half = len / 2;
    if (base[half - 1] < x) {
      base  = half;
    }
    len -= half; //  = ceil(len / 2)
  }
  return *base;
}

代码语言：javascript复制

显然，这个if也能优化掉

代码语言：javascript复制

  while(len> 1) {
    int half = len / 2;
    base  = (base[half - 1] < x) * half; // will be replaced with a "cmov"
    len -= half; //  = ceil(len / 2)
  }

代码语言：javascript复制

改成无分支版本，性能直接提升一大截，但是，对于大数组，性能是下降的，怎么办？prefetch

代码语言：javascript复制

  while(len > 1) {
    int half = len / 2;
    len -= half;
    __builtin_prefetch(&base[len / 2 - 1]);
    // middle of the left half
    __builtin_prefetch(&base[half   len / 2 - 1]); // middle of the right half
    base  = (base[half - 1] < x) * half;
  }

代码语言：javascript复制

接下来要上强度了朋友们

prefetch实际上也解决不了特大数组的问题，因为二分，一开始的块必然很大，你怎么prefetch也白搭

我们需要从另一种角度解决问题，比如二叉树堆线段树的特性

利用树的特点，以及树的局部性友好，对于二分开头有明显的加速效果

二叉树的的特点就决定了，肯定不需要手写分支

那怎么构造堆呢

代码语言：javascript复制

int a[n];
alignas(64) int t[n   1]; //the original sorted array and the eytzinger array we build
//^ we need one element more because of one-based indexing
void eytzinger(int k = 1) {
  static int i = 0;
  if (k <= n) {
    eytzinger(2 * k);
    t[k] = a[i  ];
    eytzinger(2 * k   1);
  }
}
int lower_bound(int x) {
  int k = 1;
  while (k <= n) {
    __builtin_prefetch(&t[k * 16]);
    k = 2 * k   (t[k]< x);
  }
  k >>= __builtin_ffs(~k);
  return t[k];
}

代码语言：javascript复制

性能好起来了，但感觉有优化空间

• prefetch感觉有点多
• 带宽bandwidth换延迟，如果内存带宽没这么富裕怎么办

考虑b树，深度更低，局部性更好，跳转更少, 降低带宽

如何构造

代码语言：javascript复制

const int B = 16, nblocks = (n   B - 1) / B;
int btree[nblocks][B];
int go(int k, int i) { return k * (B   1)   i   1; }
void build(int k = 0) {
  static int t = 0;
  while (k < nblocks) {
    for (int i = 0; i < B; i  ) {
      build(go(k, i));
      btree[k][i] = (t < n ? a[t  ] : INT_MAX);
    }
    build(go(k, B));
  }
}

代码语言：javascript复制

如何找节点的二分？

代码语言：javascript复制

// compute the "local" lower bound in a node
int rank(int x, int *node) {
  for (int i = 0; i < B; i  )
    if (node[i] >= x)
      return i;
  return B;
}

代码语言：javascript复制

优化if

代码语言：javascript复制

int rank(int x, int *node) {
  int mask = (1 << B);
  for (int i = 0; i < B; i  )
    mask |= (btree[k][i] >= x) << i;
  return __builtin_ffs(mask) - 1;
}

代码语言：javascript复制

优化for循环,SIMD

代码语言：javascript复制

typedef __m256i reg;
// compute a 8-bit mask corresponding to "<" elements
int cmp(reg x_vec, int* y_ptr) {
  reg y_vec = _mm256_load_si256((reg*) y_ptr); // load 8 sorted elements
  reg mask = _mm256_cmpgt_epi32(x_vec, y_vec); // compare against the key
  return _mm256_movemask_ps((__m256)mask); // extract the 8-bit mask
}
int rank(reg x_vec, int *node) {
  int mask = ~(
    cmp(x, node)  
    (cmp(x, node   8) << 8)
  );
  return __builtin_ffs(mask) - 1; // alternative: popcount
}

代码语言：javascript复制

最终代码

代码语言：javascript复制

int lower_bound(int _x) {
  int k = 0, res = INT_MAX;
  reg x = _mm256_set1_epi32(_x);
  while (k < nblocks) {
    int i = rank(x,btree[k]);
    if (i < B)// a local lower bound may not exist in the leaf node
      res = btree[k][i];
    k = go(k, i) ;
  }
  return res;
}

代码语言：javascript复制

这个if很难受，怎么优化？

考虑b 树，说实话我已经汗流浃背了。就不考虑了

作者还探索了其他树，优化更彻底

代码在这 https://github.com/sslotin/amh-code/blob/main/binsearch

文章在这里 https://en.algorithmica.org/hpc/data-structures/binary-search/

他的博客咱们推荐过很多次。写的很好，就是太深了得研究半天，这里标记个TODO，后面再看

见识到思路其实是很巧妙的，换种角度考虑问题

• Fast-High-Quality-Pseudo-Random-Numbers-CPPCon2022-Roth-Michaels

简单来说，PCG32 Xoshiro128比标准库的rand以及mt19937快得多

• HPX-A-C-Standard-Library-for-Parallelism-and-Concurrency-CppCon-2022-1

介绍HPX的，基本每年都介绍

介绍c 20线程相关的组件,jthread就不说了

stop resource

代码语言：javascript复制

void stoppable_func(std::stop_token st){
    while(!st.stop_requested()){
        do_stuff();
    }
}
void stopper(std::stop_source source){
    while(!done()){
        do_something();
    }
    source.request_stop();
}

代码语言：javascript复制

也可以定制

代码语言：javascript复制

Data read_file(std::stop_token st, std::filesystem::path filename ){
    auto handle=open_file(filename);
    std::stop_callback cb(st,[&]{ cancel_io(handle);});
    return read_data(handle); // blocking
}

代码语言：javascript复制

latch

代码语言：javascript复制

void foo(){
    unsigned const thread_count=...;
    std::latch done(thread_count);
    std::vector<std::optional<my_data>> data(thread_count);
    std::vector<std::jthread> threads;
    for(unsigned i=0;i<thread_count;  i)
        threads.push_back(std::jthread([&,i]{
            data[i]=make_data(i);
            done.count_down();
            do_more_stuff();
        }));
    done.wait();
    process_data(data);
}

代码语言：javascript复制

barrier，感觉就是latch加上callback了

代码语言：javascript复制

unsigned const num_threads=...;
void finish_task();
std::barrier<std::function<void()>> b(num_threads,finish_task);
void worker_thread(std::stop_token st,unsigned i){
    while(!st.stop_requested()){
        do_stuff(i);
        b.arrive_and_wait();
    }
}

代码语言：javascript复制

mutex 一种死锁场景

代码语言：javascript复制

class account {
std::mutex m;
currency_value balance;
 public:
  friend void transfer(account& from,account& to, currency_value amount) {
    std::scoped_lock lock_from(from.m);
    std::scoped_lock lock_to(to.m);
    from.balance -= amount;
    to.balance  = amount;
    }
};

代码语言：javascript复制

相信各位也看出来什么场景会死锁（同时发生互相转账）

c 20之后 scoped_lock可以同时锁多个锁

代码语言：javascript复制

friend void transfer(account& from,account& to, currency_value amount)
{
    std::scoped_lock locks(from.m,to.m);
    from.balance -= amount;
    to.balance  = amount;
}

代码语言：javascript复制

间接规避了死锁的问题其实相当于两把锁合成一个来锁。

相当于要么同时锁上，要么等待。避免了两个上锁之间的间隔，也就避免了循环死锁问题。增加点耗时就是了，反正不出错

还有一些别的。没啥新鲜的东西，就不说了

• Managing APIs in Enterprise Systems

这个是通过visit来合并不同API的

场景是两个不同的Response，通过一个接口处理

• Optimization-Remarks

Rpass llvm-opt-report opt-viewer 三板斧

opt viewer godbolt上也集成了 https://godbolt.org/z/jG5jq7c9a

作者写了个optview2

如何看懂optview告警

Symptom	Probable cause	Action
Inlining Failure		Add header / forceinline /increase threshold
Clobbered by store	Aliasing	restrict / force type diff
Clobbered by load	Escape	Attributes pure / const /noescape (typically before the remark site)
Failed to move load loop invariant	Escape	All the above copy to local
其他场景	看不懂	最小代码段扔进godbolt再看

• The-Surprising-Complexity-of-Formatting-Ranges

介绍 fmt 占位符解析实现的。很长

• Type-Erasure-The-Implementation-Details-Klaus-Iglberger-CppCon-2022

介绍type erasure技术（function，any），这个技术大家都知道，还介绍了一些优化，比如SBO

所谓SBO就是给对象加一个数组buffer，当对象足够小，就用buffer placement new，避免系统new

代码大概这样

代码语言：javascript复制

static constexpr size_t buffersize = 128UL;
static constexpr size_t alignment = 16UL;
alignas(alignment) std::array<std::byte,buffersize> buffer;
template< typename ShapeT >
Shape( ShapeT const& x ) {
    using M = Model<ShapeT>;
    static_assert( sizeof(M) <= buffersize, "Given type is too large" );
    static_assert( alignof(M) <= alignment, "Given type is overaligned" );
    ::new (pimpl()) M( shape );
}

代码语言：javascript复制

还有就是手工绑定 manual virtual dispatch MVD

去掉虚函数，定义好的接口直接绑定函数指针, 得绑定的很深，不灵活。用虚函数就是为了灵活

代码语言：javascript复制

class ShapeConstRef{
 public:
  template< typename ShapeT >
  ShapeConstRef( ShapeT const& shape )
  : shape_{ std::addressof(shape) },
    draw_{ []( void const* shape ){
            draw( *static_cast<ShapeT const*>(shape) );} 
        }
    {}
 private:
  friend void draw( ShapeConstRef const& shape ) {
      shape.draw_( shape.shape_ );
  }
  using DrawOperation = void(void const*);
  void const* shape_{ nullptr };
  DrawOperation* draw_{ nullptr };
};

class Circle { /*...*/ };
class Square { /*...*/ };
void draw( ShapeConstRef shape ) {
    /* Drawing the given shape */
}

int main(){
    Circle circle( 2.3 );
    Square square( 1.2 );
    draw( circle );
    draw( square );
    // ...
}

代码语言：javascript复制

当然从性能上考虑MVD是最快的

• Using-Modern-C-to-Eliminate-Virtual-Functions-Jonathan-Gopel-CppCon-2022

给了另一种多态方案，类似variant但是还不太一样，也没有virtual，我直接把两种代码贴出来，挺有意思，但是得预设写死

用virtual的版本

代码语言：javascript复制

struct FooInterface {
  [[nodiscard]] virtual auto func() const -> int = 0;
  FooInterface() = default;
  FooInterface(const FooInterface&) = default;
  FooInterface(FooInterface&&) = default;
  FooInterface& operator=(const FooInterface&) = default;
  FooInterface& operator=(FooInterface&&) = default;
  virtual ~FooInterface() = default;
};


class Baz {
public:
  auto store(std::unique_ptr<FooInterface> value) -> void {
    data.push_back(std::move(value));
  }
private:
  std::vector<std::unique_ptr<FooInterface>> data{};
};

代码语言：javascript复制

不用virtual的版本

代码语言：javascript复制

template <typename T>
concept CFoo = requires(T foo) {
{ foo.func() } -> std::intergral;
};

template <typename T, typename... Ts>
concept same_as_any = (... or std::same_as<T, Ts>);

template <CFoo... TFoos>
class Baz {
public:
template <same_as_any<TFoos...> T>
  auto store(T value) {
    return std::get<std::vector<T>>(data).push_back(value);
  }
private:
  std::tuple<std::vector<TFoos>...> data{};
};

代码语言：javascript复制

使用效果

代码语言：javascript复制

// with virtual
Baz baz{};
baz.store(std::make_unique<Foo1>());
baz.store(std::make_unique<Foo2>());
// without virtual
Baz<Foo1, Foo2> baz{};
baz.store(Foo1{});
baz.store(Foo2{})

代码语言：javascript复制

concept相当于做了绑定的工作，其实和上一个提到的MVD差不多，问题在于concept得写死，不够动态，如果能预设好的话concept非常干净

• Your-Compiler-Understands-It-But-Does-Anyone-Else

介绍了一些重构经验

多用excepted，多用lazy iteration，https://github.com/andreasbuhr/cppcoro

测试使用注入 https://github.com/ybainier/Hypodermic 说实话没看懂怎么用

• vittorio_romeo_pragmatic_simplicity

介绍了几个代码设计上的问题

比如noexcept怎么用，如果成员函数都加上noexcept合适吗？

auto怎么用，循环里的auto能随便用吗？

emplace_back一定是好的吗？

scope_lock如果参数为空有啥影响？

irange和手写循环哪个更好？

都是案例，这里直接留作讨论题了，欢迎读者思考

作者最后不忘卖书 Embracing Modern C Safely

这个国内我猜会引进翻译吧，pdf zlib可以搜到这里就不传播了

• What I Learned From Sockets Applying the Unix Readiness Model When Composing Concurrent Operations in C

这个讲网络库封装成sender，还算有点意思

• Taking_a_Byte_Out_of_C

这个是讲二进制问题的，start_lifitime_as接口。可能我理解的不够透彻，后面再看看

开源项目需要人手

• asteria[7] 一个脚本语言，可嵌入，长期找人，希望胖友们帮帮忙，也可以加群384042845和作者对线
• Unilang[8] deepin的一个通用编程语言，点子有点意思，也缺人，感兴趣的可以github讨论区或者deepin论坛看一看。这里也挂着长期推荐了
• gcc-mcf[9] 懂的都懂
• https://gitee.com/okstar-org/ok-edu-desktop 一个IM通信软件，做IM的可以关注，现在正在做全面整合阶段，开始组建商业团队阶段，年底开始融资，你参加了就快发财了，会的快来

互动环节

上次收集了一些面试题，加上群里讨论的面试题，这里放到最后，感兴趣的可以思考一下，浪费一下大家的时间

• blacktail投稿空class是几个字节？

我猜问提这个问题的面试官可能看了深度探索c 对象模型，可能不知道compressed_pair和no_unique_address

• yangyang投稿 leetcode979 在二叉树中分配硬币

说实话看题目反应半天

引用链接

[1] 手机qq点击进入: https://qm.qq.com/q/6NGizNPyG4 [2] 提交 issue: https://github.com/wanghenshui/cppweeklynews/issues [3] GCC Preparing To Introduce "-fhardened" Security Hardening Option: https://www.phoronix.com/news/GCC-fhardened-Hardening-Option [4] How to compare signed and unsigned integers in C 20?: https://www.sandordargo.com/blog/2023/10/11/cpp20-intcmp-utilities [5] flat_map性能调研: https://zhuanlan.zhihu.com/p/661418250?utm_psn=1698007939397492736 [6] C 26静态反射提案解析: https://zhuanlan.zhihu.com/p/661692275?utm_psn=1698007742550265856 [7] asteria: https://github.com/lhmouse/asteria [8] Unilang: https://github.com/linuxdeepin/unilang [9] gcc-mcf: https://gcc-mcf.lhmouse.com/

int 性能优化指针 c++

0 人点赞

C++ 中文周刊 第134期

资讯

文章

GCC Preparing To Introduce "-fhardened" Security Hardening Option[3]

How to compare signed and unsigned integers in C 20?[4]

使用 jegdb 来调试内存相关 crash

flat_map性能调研[5]

C 26静态反射提案解析[6]

视频

cppcon 2022

开源项目需要人手

互动环节

引用链接

C++ 中文周刊第134期