前言
近期发现项目组使用新版本的 opentelemetry-cpp 的时候偶现崩溃。崩溃的位置在STL的 std::future
析构的地方,而这个 std::future
由 std::async
创建。 比较违反直觉,这里记录分享一下分析和解决过程方面其他碰到的小伙伴们。
问题分析
相关代码和规范
首先我们来看下相关代码:
代码语言:javascript复制bool PeriodicExportingMetricReader::CollectAndExportOnce()
{
std::atomic<bool> cancel_export_for_timeout{false};
auto future_receive = std::async(std::launch::async, [this, &cancel_export_for_timeout] {
Collect([this, &cancel_export_for_timeout](ResourceMetrics &metric_data) {
if (cancel_export_for_timeout)
{
OTEL_INTERNAL_LOG_ERROR(
"[Periodic Exporting Metric Reader] Collect took longer configured time: "
<< export_timeout_millis_.count() << " ms, and timed out");
return false;
}
this->exporter_->Export(metric_data);
return true;
});
});
std::future_status status;
do
{
status = future_receive.wait_for(std::chrono::milliseconds(export_timeout_millis_));
if (status == std::future_status::timeout)
{
cancel_export_for_timeout = true;
break;
}
} while (status != std::future_status::ready);
bool notify_force_flush = is_force_flush_pending_.exchange(false, std::memory_order_acq_rel);
if (notify_force_flush)
{
is_force_flush_notified_.store(true, std::memory_order_release);
force_flush_cv_.notify_one();
}
return true;
}
这里的逻辑简单得说就是在收集上报指标的时候,在一个子线程执行导出。如果超时了就标记timeout中断上报流程。 按照 https://en.cppreference.com/w/cpp/thread/async 和 https://en.cppreference.com/w/cpp/thread/future/~future 的对标准的描述。
Async invocation If the async flag is set (i.e. (policy & std::launch::async) != 0), then
std::async
callsINVOKE(decay-copy(std::forward<F>(f)), decay-copy(std::forward<Args>(args))...)
as if in a new thread of execution represented by a std::thread object. (until C 23)std::async
callsstd::invoke(auto(std::forward<F>(f)), auto(std::forward<Args>(args))...)
as if in a new thread of execution represented by a std::thread object. (since C 23) The calls of decay-copy are evaluated(until C 23)The values produced by auto are materialized(since C 23) in the current thread. If the function f returns a value or throws an exception, it is stored in the shared state accessible through the std::future that std::async returns to the caller.
还有
- these actions will not block for the shared state to become ready, except that they may block if all of the following are true:
- the shared state was created by a call to std::async,
- the shared state is not yet ready, and
- the current object was the last reference to the shared state.
std::launch::async
模式的调用将在另一个线程中执行。且在 future
(设立没有把state share出去所以这里的满足 the current object was the last reference to the shared state.
)析构的时候会阻塞等到 future
ready。
由于栈上后构造的对象先释放,所以这里lambda里引用了栈上变量也不会有什么问题。但是这里Crash了,那么我们来看看崩溃栈。
崩溃栈
代码语言:javascript复制(gdb) bt
#0 0x00007f75ddb24387 in raise () from /lib64/libc.so.6
#1 0x00007f75ddb25a78 in abort () from /lib64/libc.so.6
#2 0x00007f75de21ea95 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc .so.6
#3 0x00007f75de21ca06 in ?? () from /lib64/libstdc .so.6
#4 0x00007f75de21b9b9 in ?? () from /lib64/libstdc .so.6
#5 0x00007f75de21c624 in __gxx_personality_v0 () from /lib64/libstdc .so.6
#6 0x00007f75e9a928e3 in ?? () from /lib64/libgcc_s.so.1
#7 0x00007f75e9a92c7b in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
#8 0x00007f75de21cc46 in __cxa_throw () from /lib64/libstdc .so.6
#9 0x00007f75de271f30 in std::__throw_system_error(int) () from /lib64/libstdc .so.6
#10 0x00007f75de2730f8 in std::thread::join() () from /lib64/libstdc .so.6
#11 0x00007f75df2e420b in __pthread_once_slow () from /lib64/libpthread.so.0
#12 0x00007f75de21abd7 in ?? () from /lib64/libstdc .so.6
#13 0x00007f75de21acc4 in std::__future_base::_Async_state_common::~_Async_state_common() () from /lib64/libstdc .so.6
#14 0x00007f75da6487f8 in std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>::_Async_state_impl (__fn=...,
this=0x7f75cca00018) at /usr/include/c /4.8.2/future:1495
#15 __gnu_cxx::new_allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >::construct<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()> > (this=<optimized out>, __p=0x7f75cca00018) at /usr/include/c /4.8.2/ext/new_allocator.h:120
#16 std::allocator_traits<std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> > >::_S_construct<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()> > (__a=..., __p=0x7f75cca00018) at /usr/include/c /4.8.2/bits/alloc_traits.h:254
#17 std::allocator_traits<std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> > >::construct<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()> > (__a=..., __p=0x7f75cca00018) at /usr/include/c /4.8.2/bits/alloc_traits.h:393
#18 std::_Sp_counted_ptr_inplace<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, (__gnu_cxx::_Lock_policy)2u>::_Sp_counted_ptr_inplace<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()> > (__a=..., this=0x7f75cca00000) at /usr/include/c /4.8.2/bits/shared_ptr_base.h:399
#19 __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, (__gnu_cxx::_Lock_policy)2u> >::construct<std::_Sp_counted_ptr_inplace<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, (__gnu_cxx::_Lock_policy)2u>, const std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()> > (this=<synthetic pointer>, __p=<optimized out>) at /usr/include/c /4.8.2/ext/new_allocator.h:120
#20 std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, (__gnu_cxx::_Lock_policy)2u> > >::_S_construct<std::_Sp_counted_ptr_inplace<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, (__gnu_cxx::_Lock_policy)2u>, const std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()> > (__a=<synthetic pointer>..., __p=<optimized out>) at /usr/include/c /4.8.2/bits/alloc_traits.h:254
#21 std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, (__gnu_cxx::_Lock_policy)2u> > >::construct<std::_Sp_counted_ptr_inplace<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, (__gnu_cxx::_Lock_policy)2u>, const std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()> > (__a=<synthetic pointer>..., __p=<optimized out>) at /usr/include/c /4.8.2/bits/alloc_traits.h:393
#22 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()> > (__a=..., this=<optimized out>) at /usr/include/c /4.8.2/bits/shared_ptr_base.h:502
#23 std::__shared_ptr<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, (__gnu_cxx::_Lock_policy)2u>::__shared_ptr<std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()> > (__a=..., this=<optimized out>, __tag=...) at /usr/include/c /4.8.2/bits/shared_ptr_base.h:957
#24 std::shared_ptr<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >::shared_ptr<std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()> > (__a=..., __tag=..., this=<optimized out>) at /usr/include/c /4.8.2/bits/shared_ptr.h:316
#25 std::allocate_shared<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::Collec--Type <RET> for more, q to quit, c to continue without paging--
tAndExportOnce()::__lambda11()> > (__a=...) at /usr/include/c /4.8.2/bits/shared_ptr.h:598
#26 std::make_shared<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()> > () at /usr/include/c /4.8.2/bits/shared_ptr.h:614
#27 std::__future_base::_S_make_async_state<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()> > (__fn=...) at /usr/include/c /4.8.2/future:1525
#28 std::async<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11> (__policy=std::launch::async, __fn=...) at /usr/include/c /4.8.2/future:1538
#29 opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce (this=this@entry=0x7f75d760c000)
at /data/devops/workspace/p-4943d876cb734a609b55ad5c740c36ff/src/pool/0/server/main/third_party/packages/opentelemetry-cpp-v1.15.0/sdk/src/metrics/export/periodic_exporting_metric_reader.cc:89
#30 0x00007f75da648a38 in opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::DoBackgroundWork (this=0x7f75d760c000)
at /data/devops/workspace/p-4943d876cb734a609b55ad5c740c36ff/src/pool/0/server/main/third_party/packages/opentelemetry-cpp-v1.15.0/sdk/src/metrics/export/periodic_exporting_metric_reader.cc:55
#31 0x00007f75de273340 in ?? () from /lib64/libstdc .so.6
#32 0x00007f75df2e5ea5 in start_thread () from /lib64/libpthread.so.0
#33 0x00007f75ddbecb0d in clone () from /lib64/libc.so.6
然后我们来看数据层面。
代码语言:javascript复制(gdb) f 14
#14 0x00007f7c4a1ebc93 in _Async_state_impl (__fn=<unknown type in /data/home/user00/tgf-server/d-8/tgf/matchsvr/bin/../../lib64/6689e6af/lib64/libopentelemetry_metrics.so, CU 0x3cc73e, DIE 0x415b2f>, this=0x7f7acd401018)
at /usr/include/c /4.8.2/future:1495
1495 : _M_result(new _Result<_Res>()), _M_fn(std::move(__fn))
(gdb) p _M_thread
$9 = {_M_id = {_M_thread = 0}}
(gdb) p _M_once
$10 = {_M_once = 1}
(gdb) p _M_result
$11 = std::unique_ptr<std::__future_base::_Result<void>> containing 0x0
(gdb)
可以看到,这个尝试 join 的 _M_thread
已经为空了 。
相关代码
代码语言:javascript复制 class __future_base::_Async_state_common : public __future_base::_State_base
{
protected:
#ifdef _GLIBCXX_ASYNC_ABI_COMPAT
~_Async_state_common();
#else
~_Async_state_common() = default;
#endif
// Allow non-timed waiting functions to block until the thread completes,
// as if joined.
virtual void _M_run_deferred() { _M_join(); }
void _M_join() { std::call_once(_M_once, &thread::join, ref(_M_thread)); }
thread _M_thread;
once_flag _M_once;
};
template<typename _BoundFn, typename _Res>
class __future_base::_Async_state_impl final
: public __future_base::_Async_state_common
{
public:
explicit
_Async_state_impl(_BoundFn&& __fn)
: _M_result(new _Result<_Res>()), _M_fn(std::move(__fn))
{
_M_thread = std::thread{ [this] {
_M_set_result(_S_task_setter(_M_result, _M_fn));
} };
}
~_Async_state_impl() { _M_join(); }
private:
typedef __future_base::_Ptr<_Result<_Res>> _Ptr_type;
_Ptr_type _M_result;
_BoundFn _M_fn;
};
另外 std::__future_base::_Async_state_common::~_Async_state_common()
的代码在头文件里没有,所以得去源仓库找 https://github.com/gcc-mirror/gcc/blob/releases/gcc-4.8.2/libstdc++-v3/src/c++11/compatibility-thread-c++0x.cc#L89 。这里直接贴出代码
namespace std _GLIBCXX_VISIBILITY(default)
{
_GLIBCXX_BEGIN_NAMESPACE_VERSION
__future_base::_Async_state_common::~_Async_state_common() { _M_join(); }
// 后续无关代码不再贴出
}
按道理来说,_M_join()
里已经保证了 _M_thread.join()
只被调用一次,但是实际上析构函数里的 _M_thread.join()
确被诡异地触发了。
这个问题也有一些外部参考链接:
- [C 11] Segmentation fault with std::async and released shared state
- Using C 11 futures: Nested calls of std::async crash: Compiler/Standard library bug? 。
- https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49204
- https://sourceware.org/bugzilla/show_bug.cgi?id=12683
这个问题只是偶现的,所以可能和上面链接里前几个都无关,可能和最后一个线程安全的边界条件有关。实际上我参与开源社区 opentelemetry-cpp 的时候也发下过几个 std::condition_variable
几处临界条件有问题的地方,这个以后再分享。
最后
要沿着线程安全的方向分析,可能就需要花费大量时间了,我暂时没有更深一步去分析。 目前要适配这个GCC STL实现BUG,最快的方法是绕过去,然后记住不要使用(C 又一坑)。相关的修复PR已经推送 https://github.com/open-telemetry/opentelemetry-cpp/issues/2982 并且已经合入。 我们项目组的开源构建系统 cmake-toolset 也已经Patch接入的 otel-cpp 版本。(详见: https://github.com/owent/cmake-toolset/commit/46643e8b695fff9c7f13d0da1ecca07c2f8e9444)
欢迎有兴趣的小伙伴互相交流研究。