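C++ Chinese Weekly, Issue 89
News
Standards committee updates, IDE news, and compiler news are collected here
For the latest compiler news, follow the hellogcc WeChat official account. This week's update: 2022-11-16, issue 176
Articles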
- target_clones is a trap
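target_clones helps you generate SIMD code that stays compatible across platforms. Concretely, the compiler emits N copies of the function's assembly, one per target, like this: https://godbolt.org/z/of5d6v
The catch is that some platforms, some libc implementations, and some compilers either do not support it or support it only partially, so you can use it and have it silently do nothing. Keep that in mind when you reach for it.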
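As a rough illustration of the attribute discussed above, here is a minimal sketch. The function name `sum_ints`, the `MULTIVERSION` macro, and the chosen target list are hypothetical, and the guard assumes GCC (or a recent Clang) on x86-64 Linux with glibc, where the clones are dispatched through ifunc. On any other setup the macro expands to nothing, which is exactly the "silently not in effect" situation the article warns about.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: multiversioning is only enabled for GCC-compatible compilers on
// x86-64 Linux with glibc. Elsewhere a single plain version is built.
#if defined(__GNUC__) && defined(__x86_64__) && defined(__linux__) && defined(__GLIBC__)
#define MULTIVERSION __attribute__((target_clones("avx2", "sse4.2", "default")))
#else
#define MULTIVERSION
#endif

// Hypothetical example: the compiler emits one clone per listed target plus a
// resolver that picks a clone when the program is loaded.
MULTIVERSION
int sum_ints(const int* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

Inspecting the generated assembly (for example on godbolt) is the reliable way to confirm that the clones really exist for your toolchain.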
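- C++ Performance Optimization: Branch Prediction
A walkthrough of tracking down a performance problem with perf. And yes, an if inside a loop body is rarely a good idea: it makes it harder for the compiler to unroll the loop.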
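To make the point about an if inside a loop concrete, here is a small sketch of the same count written with and without a branch in the loop body. The function names and the sample data are made up for illustration, and an optimizing compiler may well generate identical code for both, so treat it as a pattern to measure rather than a guaranteed win.

```cpp
#include <cstddef>

// Branchy version: the `if` in the body is a conditional jump that the CPU
// has to predict on every iteration.
constexpr std::size_t count_above_branchy(const int* v, std::size_t n, int threshold) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (v[i] > threshold) {
            ++count;
        }
    }
    return count;
}

// Branchless version: the comparison result (0 or 1) is added directly, which
// leaves a straight-line loop body that is easier to unroll and vectorize.
constexpr std::size_t count_above_branchless(const int* v, std::size_t n, int threshold) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        count += static_cast<std::size_t>(v[i] > threshold);
    }
    return count;
}

// Hypothetical sample data: three values (5, 8, 9) are greater than 4.
constexpr int kSample[] = {1, 5, 3, 8, 2, 9};
static_assert(count_above_branchy(kSample, 6, 4) == 3, "branchy count");
static_assert(count_above_branchless(kSample, 6, 4) == 3, "branchless count");
```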
- For Software Performance, the Way Data is Accessed Matters!
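There is a lot to know about how a loop walks through its data, and it ties directly into loop optimization. As noted above, conditionals inside a loop body are a poor fit.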
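A classic example of access order mattering is traversing a row-major matrix by rows versus by columns. The sketch below is not taken from the article; the sizes, names, and sample values are assumptions. Both functions compute the same sum, but the first touches memory contiguously while the second jumps by a full row on every step, which is what hurts caches on large matrices.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kRows = 3;  // illustrative sizes only
constexpr std::size_t kCols = 4;
using Matrix = std::array<std::array<int, kCols>, kRows>;

// Row-major traversal: the inner loop walks consecutive addresses.
constexpr long sum_row_major(const Matrix& m) {
    long total = 0;
    for (std::size_t r = 0; r < kRows; ++r) {
        for (std::size_t c = 0; c < kCols; ++c) {
            total += m[r][c];
        }
    }
    return total;
}

// Column-major traversal of the same layout: the inner loop strides by kCols
// elements, so consecutive accesses land in different rows.
constexpr long sum_column_major(const Matrix& m) {
    long total = 0;
    for (std::size_t c = 0; c < kCols; ++c) {
        for (std::size_t r = 0; r < kRows; ++r) {
            total += m[r][c];
        }
    }
    return total;
}

// Hypothetical sample: the values 1..12, which add up to 78.
constexpr Matrix kMatrix = {{ {{1, 2, 3, 4}}, {{5, 6, 7, 8}}, {{9, 10, 11, 12}} }};
static_assert(sum_row_major(kMatrix) == 78, "row-major sum");
static_assert(sum_column_major(kMatrix) == 78, "column-major sum");
```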
- Did you know that tuple can be implement just with lambdas?
```cpp
// Needs C++20; the [[nodiscard]] placement on the lambdas is C++23 syntax.
#include <cstddef>
#include <utility>

constexpr auto tuple = [][[nodiscard]](auto... args) {
  return [=][[nodiscard]](auto fn) { return fn(args...); };
};

constexpr auto apply(auto fn, auto t) { return t(fn); }

static_assert(0 == apply([](auto... args) { return sizeof...(args); }, tuple()));
static_assert(1 == apply([](auto... args) { return sizeof...(args); }, tuple(1)));
static_assert(2 == apply([](auto... args) { return sizeof...(args); }, tuple(1, 2)));

namespace detail {
template <std::size_t N, typename T> struct elem_by_index { T &ref; };
template <typename T> struct elem_by_type { T &ref; };
} // namespace detail

// get by index: overload resolution picks the elem_by_index<N, U> base
template <auto N> [[nodiscard]] constexpr auto get(auto t) {
  return t([]<typename... Ts>(Ts... elems) {
    return [&]<std::size_t... Is>(std::index_sequence<Is...>) {
      struct all_elems : detail::elem_by_index<Is, Ts>... {};
      return []<typename U>(const detail::elem_by_index<N, U> &elem) {
        return elem.ref;
      }(all_elems{elems...});
    }(std::index_sequence_for<Ts...>{});
  });
}

// get by type: converts to the unique elem_by_type<T> base
template <class T> [[nodiscard]] constexpr auto get(auto t) {
  return t([]<typename... Ts>(Ts... elems) {
    struct all_elems : detail::elem_by_type<Ts>... {};
    return [](const detail::elem_by_type<T> &elem) {
      return elem.ref;
    }(all_elems{elems...});
  });
}
```
I can't follow this one.
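A stripped-down sketch may make the core trick easier to see. The names `make_tuple_like`, `size_of`, `first_of`, and `sum_of` are hypothetical and not from the original tip, and this version only needs C++17. The idea is that the outer lambda captures the elements in a closure, and the only way to read them back is to hand the closure a callable that receives all of them as arguments.

```cpp
// The "tuple" is a closure that stores the arguments by copy. Calling it with
// a function `fn` invokes `fn` with every stored element.
constexpr auto make_tuple_like = [](auto... args) {
    return [=](auto fn) { return fn(args...); };
};

// Everything else is built by passing in a suitable visitor.
constexpr auto size_of = [](auto t) {
    return t([](auto... elems) { return sizeof...(elems); });
};
constexpr auto first_of = [](auto t) {
    return t([](auto head, auto...) { return head; });
};
constexpr auto sum_of = [](auto t) {
    return t([](auto... elems) { return (elems + ... + 0); });
};

static_assert(size_of(make_tuple_like()) == 0);
static_assert(size_of(make_tuple_like(1, 2, 3)) == 3);
static_assert(first_of(make_tuple_like(7, 'x')) == 7);
static_assert(sum_of(make_tuple_like(1, 2, 3)) == 6);
```

The `get<N>` and `get<T>` functions in the original go one step further: inside the visitor they build a local struct that inherits from one tagged base per element, then let overload resolution select the base that matches the requested index or type.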
- Modern vector programming with masked loads and stores
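The starting point is this snippet, where data[2] reads past the end of a two-element array:
```cpp
int f() {
  int* data = new int[2];
  data[0] = 1;
  data[1] = 2;
  int x = data[0];
  int y = data[1];
  int z = data[2]; // out of bounds: only data[0] and data[1] exist
  delete[] data;
  return x + y;
}
```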
First version, using AVX-512:
```cpp
#include <immintrin.h>
#include <cstddef>

float dot512fma(float *x1, float *x2, size_t length) {
  // create a vector of 16 32-bit floats (zeroed)
  __m512 sum = _mm512_setzero_ps();
  for (size_t i = 0; i < length; i += 16) {
    // load 16 32-bit floats
    __m512 v1 = _mm512_loadu_ps(x1 + i);
    // load 16 32-bit floats
    __m512 v2 = _mm512_loadu_ps(x2 + i);
    // sum[k] += v1[k] * v2[k] for each of the 16 lanes (fused multiply-add)
    sum = _mm512_fmadd_ps(v1, v2, sum);
  }
  // reduce: sums all elements
  return _mm512_reduce_add_ps(sum);
}
```
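The problem is the loop condition i < length: when length is not a multiple of 16, the last iteration reads past the end of the arrays. How do we fix that? This is where the masked loads and stores from the title come in: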
```cpp
float dot512fma(float *x1, float *x2, size_t length) {
  // create a vector of 16 32-bit floats (zeroed)
  __m512 sum = _mm512_setzero_ps();
  size_t i = 0;
  for (; i + 16 <= length; i += 16) {
    // load 16 32-bit floats
    __m512 v1 = _mm512_loadu_ps(x1 + i);
    // load 16 32-bit floats
    __m512 v2 = _mm512_loadu_ps(x2 + i);
    // sum[k] += v1[k] * v2[k] for each of the 16 lanes (fused multiply-add)
    sum = _mm512_fmadd_ps(v1, v2, sum);
  }
  if (i < length) {
    // load 16 32-bit floats, load only the first length-i floats
    // other floats are automatically set to zero
    __m512 v1 = _mm512_maskz_loadu_ps((1<<(length-i))-1, x1 + i);
    // load 16 32-bit floats, load only the first length-i floats
    __m512 v2 = _mm512_maskz_loadu_ps((1<<(length-i))-1, x2 + i);
    // sum[k] += v1[k] * v2[k]; the masked-off lanes contribute zero
    sum = _mm512_fmadd_ps(v1, v2, sum);
  }
  // reduce: sums all elements
  return _mm512_reduce_add_ps(sum);
}
```
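The tail handling above hinges on the expression (1<<(length-i))-1, which builds a mask with the low length-i bits set. The following portable sketch models that mask and a zero-masked load in plain scalar code, so it can be checked without AVX-512 hardware. The helper names and sample arrays are hypothetical, and this is a model of the semantics, not a replacement for the intrinsics.

```cpp
#include <cstddef>
#include <cstdint>

// Mask with the low `remaining` bits set, valid for 0 <= remaining <= 16.
constexpr std::uint16_t tail_mask16(std::size_t remaining) {
    return static_cast<std::uint16_t>((1u << remaining) - 1u);
}

// Scalar model of one lane of a zero-masked load: lane k is read from memory
// only when bit k of the mask is set, otherwise it yields 0 without touching p[k].
constexpr float masked_lane(const float* p, std::uint16_t mask, std::size_t k) {
    return ((mask >> k) & 1u) ? p[k] : 0.0f;
}

// Scalar model of the masked tail of the dot product over 16 lanes.
constexpr float dot_tail(const float* x1, const float* x2, std::size_t remaining) {
    const std::uint16_t mask = tail_mask16(remaining);
    float sum = 0.0f;
    for (std::size_t k = 0; k < 16; ++k) {
        sum += masked_lane(x1, mask, k) * masked_lane(x2, mask, k);
    }
    return sum;
}

// Hypothetical three-element inputs: 1*4 + 2*5 + 3*6 = 32.
constexpr float kA[] = {1.0f, 2.0f, 3.0f};
constexpr float kB[] = {4.0f, 5.0f, 6.0f};
static_assert(tail_mask16(3) == 0x7, "three low bits set");
static_assert(dot_tail(kA, kB, 3) == 32.0f, "only the first three lanes are read");
```

Because the evaluation happens at compile time, any read beyond the three valid elements would be rejected by the compiler, which is a convenient way to convince yourself that the mask really protects the out-of-bounds lanes.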
What about ARM? With SVE:
```cpp
#include <arm_sve.h>
#include <cstdint>

float dotsve(float *x1, float *x2, int64_t length) {
  int64_t i = 0;
  svfloat32_t sum = svdup_n_f32(0);
  // main loop: full vectors, all lanes active
  while(i + svcntw() <= length) {
    svfloat32_t in1 = svld1_f32(svptrue_b32(), x1 + i);
    svfloat32_t in2 = svld1_f32(svptrue_b32(), x2 + i);
    sum = svmad_f32_m(svptrue_b32(), in1, in2, sum);
    i += svcntw();
  }
  // tail: the predicate enables only lanes with index below length
  svbool_t while_mask = svwhilelt_b32(i, length);
  do {
    svfloat32_t in1 = svld1_f32(while_mask, x1 + i);
    svfloat32_t in2 = svld1_f32(while_mask, x2 + i);
    sum = svmad_f32_m(svptrue_b32(), in1, in2, sum);
    i += svcntw();
    while_mask = svwhilelt_b32(i, length);
  } while (svptest_any(svptrue_b32(), while_mask));
  return svaddv_f32(svptrue_b32(),sum);
}
```
The code is here if you want to play with it: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2022/11/08
- Using final in C++ to improve performance
Encourages using final more often. This is common knowledge by now.
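For readers who have not seen why final can help, here is a minimal sketch with made-up class names. When a class is marked final, the compiler knows no further override can exist, so a virtual call through a reference to that class can be resolved statically and possibly inlined. Whether that actually happens depends on the compiler and optimization level.

```cpp
#include <type_traits>

struct Shape {
    virtual ~Shape() = default;
    virtual int sides() const = 0;
};

// `final`: nothing can derive from Square, so Square::sides is the last overrider.
struct Square final : Shape {
    int sides() const override { return 4; }
};

// Not final: a further derived class could still override sides().
struct Triangle : Shape {
    int sides() const override { return 3; }
};

// The static type is a final class, so the compiler may call Square::sides directly.
inline int sides_of_square(const Square& s) { return s.sides(); }

// The dynamic type is unknown here, so this generally stays a virtual dispatch.
inline int sides_via_base(const Shape& s) { return s.sides(); }

static_assert(std::is_final<Square>::value, "Square is final");
static_assert(!std::is_final<Triangle>::value, "Triangle is not final");
```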
- Writing a Compiler - Part 1 - Defining The Language: a hands-on guide to writing a compiler. The code is here: godbolt
- Improving my C++ time queue
A code refactoring. Not much to say.
- ODR violation detection
Covers how to detect ODR (One Definition Rule) violations. It is a long read.
- C++ constexpr parlor tricks: How can I obtain the length of a string at compile time?
Given a compile-time string, how do you get its length at compile time? strlen is not constexpr, so it won't do.
```cpp
#include <cstddef>
#include <string>

// char_traits<...>::length is constexpr since C++17
constexpr std::size_t constexpr_strlen(const char* s)
{
    return std::char_traits<char>::length(s);
    // return std::string::traits_type::length(s);
}

constexpr std::size_t constexpr_wcslen(const wchar_t* s)
{
    return std::char_traits<wchar_t>::length(s);
    // return std::wstring::traits_type::length(s);
}
```
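A short usage sketch follows, repeating the helper so the snippet stands on its own. The `literal_length` template is a hypothetical addition for comparison: for string literals the array size is already part of the type, so the length can be read off without scanning. The two approaches differ when the literal contains an embedded NUL.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Same helper as above: scans up to the first NUL (constexpr since C++17).
constexpr std::size_t constexpr_strlen(const char* s) {
    return std::char_traits<char>::length(s);
}

// Hypothetical alternative for literals: N counts the terminating NUL.
template <std::size_t N>
constexpr std::size_t literal_length(const char (&)[N]) {
    return N - 1;
}

static_assert(constexpr_strlen("hello") == 5);
static_assert(literal_length("hello") == 5);
static_assert(std::string_view("hello").size() == 5);

// With an embedded NUL the scan stops early, while the array size does not.
static_assert(constexpr_strlen("a\0b") == 1);
static_assert(literal_length("a\0b") == 3);
```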
- C++20 Coroutines and io_uring - Part 1/3: a hands-on guide to wrapping io_uring
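- Exploring Clang's Enum implementation and How we Catch Undefined Behavior
An enum has a range of valid values, so:
```cpp
enum E1 {e1=0}; // Range of values [0,1]
void f() {
    E1 x = static_cast<E1>(2); // undefined behavior, 2 is outside the range of values
}
```
This is UB, and UBSan can catch it (-fsanitize=enum). Note that -fstrict-enums is an optimization flag: it lets the compiler assume enum values stay within that range, so it is relevant to this UB but does not report it by itself.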
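One way to sidestep the narrow value range is to give the enum a fixed underlying type. The sketch below uses hypothetical enum names alongside the E1 from the example; with a fixed underlying type (or a scoped enum, which defaults to int), every value of that type is a valid enum value, so the same cast is well defined.

```cpp
#include <type_traits>

enum E1 { e1 = 0 };        // no fixed underlying type: range of values is [0, 1]
enum E2 : int { e2 = 0 };  // fixed underlying type: every int value is valid
enum class E3 { e3 = 0 };  // scoped enum: underlying type defaults to int

// Well defined: 2 is representable in the underlying type of E2 and E3.
constexpr E2 ok2 = static_cast<E2>(2);
constexpr E3 ok3 = static_cast<E3>(2);

static_assert(static_cast<int>(ok2) == 2);
static_assert(static_cast<int>(ok3) == 2);
static_assert(std::is_same<std::underlying_type_t<E3>, int>::value);

// By contrast, `constexpr E1 bad = static_cast<E1>(2);` has undefined behavior
// inside a constant expression, so a conforming compiler may reject it.
```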
- PL Pragmatics #1
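Covers a range of language-related optimizations, papers, and ideas, such as GC and JIT, plus recent progress in the JDK and Python. Worth a look if you are interested.
Videos
Didn't watch anything this week. Recommendations are welcome. CppCon 2022 published two new coroutine tutorials, which I have not watched yet.
- Understanding C++ Coroutines by Example: Generators (Part 2 of 2) - Pavel Novikov - CppCon 2022
Open-source projects looking for contributors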
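- asteria: an embeddable scripting language. The project has a long-standing call for contributors; please lend a hand, or join group 384042845 to spar with the author
- pika: a NoSQL store, Redis over RocksDB. It badly needs code contributions; if you are interested, join group 294254078 and come spar
New projects / version updates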
- cparse LR(1) parser generator