前言

近期发现项目组使用新版本的 opentelemetry-cpp 的时候偶现崩溃。崩溃的位置在STL的 std::future 析构的地方,而这个 std::futurestd::async创建。 比较违反直觉,这里记录分享一下分析和解决过程方面其他碰到的小伙伴们。

问题分析

相关代码和规范

首先我们来看下相关代码:

bool PeriodicExportingMetricReader::CollectAndExportOnce()
{
  std::atomic<bool> cancel_export_for_timeout{false};
  auto future_receive = std::async(std::launch::async, [this, &cancel_export_for_timeout] {
    Collect([this, &cancel_export_for_timeout](ResourceMetrics &metric_data) {
      if (cancel_export_for_timeout)
      {
        OTEL_INTERNAL_LOG_ERROR(
            "[Periodic Exporting Metric Reader] Collect took longer configured time: "
            << export_timeout_millis_.count() << " ms, and timed out");
        return false;
      }
      this->exporter_->Export(metric_data);
      return true;
    });
  });

  std::future_status status;
  do
  {
    status = future_receive.wait_for(std::chrono::milliseconds(export_timeout_millis_));
    if (status == std::future_status::timeout)
    {
      cancel_export_for_timeout = true;
      break;
    }
  } while (status != std::future_status::ready);
  bool notify_force_flush = is_force_flush_pending_.exchange(false, std::memory_order_acq_rel);
  if (notify_force_flush)
  {
    is_force_flush_notified_.store(true, std::memory_order_release);
    force_flush_cv_.notify_one();
  }

  return true;
}

这里的逻辑简单得说就是在收集上报指标的时候,在一个子线程执行导出。如果超时了就标记timeout中断上报流程。 按照 https://en.cppreference.com/w/cpp/thread/asynchttps://en.cppreference.com/w/cpp/thread/future/%7Efuture 的对标准的描述。

Async invocation

If the async flag is set (i.e. (policy & std::launch::async) != 0), then

std::async calls INVOKE(decay-copy(std::forward<F>(f)), decay-copy(std::forward<Args>(args))...) as if in a new thread of execution represented by a std::thread object. (until C++23) std::async calls std::invoke(auto(std::forward<F>(f)), auto(std::forward<Args>(args))...) as if in a new thread of execution represented by a std::thread object. (since C++23)

The calls of decay-copy are evaluated(until C++23)The values produced by auto are materialized(since C++23) in the current thread. If the function f returns a value or throws an exception, it is stored in the shared state accessible through the std::future that std::async returns to the caller.

还有

  1. these actions will not block for the shared state to become ready, except that they may block if all of the following are true:
  2. the shared state was created by a call to std::async,
  3. the shared state is not yet ready, and
  4. the current object was the last reference to the shared state.

std::launch::async 模式的调用将在另一个线程中执行。且在 future (设立没有把state share出去所以这里的满足 the current object was the last reference to the shared state.)析构的时候会阻塞等到 future ready。

由于栈上后构造的对象先释放,所以这里lambda里引用了栈上变量也不会有什么问题。但是这里Crash了,那么我们来看看崩溃栈。

崩溃栈

(gdb) bt
#0  0x00007f75ddb24387 in raise () from /lib64/libc.so.6
#1  0x00007f75ddb25a78 in abort () from /lib64/libc.so.6
#2  0x00007f75de21ea95 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#3  0x00007f75de21ca06 in ?? () from /lib64/libstdc++.so.6
#4  0x00007f75de21b9b9 in ?? () from /lib64/libstdc++.so.6
#5  0x00007f75de21c624 in __gxx_personality_v0 () from /lib64/libstdc++.so.6
#6  0x00007f75e9a928e3 in ?? () from /lib64/libgcc_s.so.1
#7  0x00007f75e9a92c7b in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
#8  0x00007f75de21cc46 in __cxa_throw () from /lib64/libstdc++.so.6
#9  0x00007f75de271f30 in std::__throw_system_error(int) () from /lib64/libstdc++.so.6
#10 0x00007f75de2730f8 in std::thread::join() () from /lib64/libstdc++.so.6
#11 0x00007f75df2e420b in __pthread_once_slow () from /lib64/libpthread.so.0
#12 0x00007f75de21abd7 in ?? () from /lib64/libstdc++.so.6
#13 0x00007f75de21acc4 in std::__future_base::_Async_state_common::~_Async_state_common() () from /lib64/libstdc++.so.6
#14 0x00007f75da6487f8 in std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>::_Async_state_impl (__fn=..., 
    this=0x7f75cca00018) at /usr/include/c++/4.8.2/future:1495
#15 __gnu_cxx::new_allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >::construct<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()> > (this=<optimized out>, __p=0x7f75cca00018) at /usr/include/c++/4.8.2/ext/new_allocator.h:120
#16 std::allocator_traits<std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> > >::_S_construct<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()> > (__a=..., __p=0x7f75cca00018) at /usr/include/c++/4.8.2/bits/alloc_traits.h:254
#17 std::allocator_traits<std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> > >::construct<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()> > (__a=..., __p=0x7f75cca00018) at /usr/include/c++/4.8.2/bits/alloc_traits.h:393
#18 std::_Sp_counted_ptr_inplace<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, (__gnu_cxx::_Lock_policy)2u>::_Sp_counted_ptr_inplace<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()> > (__a=..., this=0x7f75cca00000) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:399
#19 __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, (__gnu_cxx::_Lock_policy)2u> >::construct<std::_Sp_counted_ptr_inplace<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, (__gnu_cxx::_Lock_policy)2u>, const std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()> > (this=<synthetic pointer>, __p=<optimized out>) at /usr/include/c++/4.8.2/ext/new_allocator.h:120
#20 std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, (__gnu_cxx::_Lock_policy)2u> > >::_S_construct<std::_Sp_counted_ptr_inplace<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, (__gnu_cxx::_Lock_policy)2u>, const std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()> > (__a=<synthetic pointer>..., __p=<optimized out>) at /usr/include/c++/4.8.2/bits/alloc_traits.h:254
#21 std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, (__gnu_cxx::_Lock_policy)2u> > >::construct<std::_Sp_counted_ptr_inplace<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, (__gnu_cxx::_Lock_policy)2u>, const std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()> > (__a=<synthetic pointer>..., __p=<optimized out>) at /usr/include/c++/4.8.2/bits/alloc_traits.h:393
#22 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()> > (__a=..., this=<optimized out>) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:502
#23 std::__shared_ptr<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, (__gnu_cxx::_Lock_policy)2u>::__shared_ptr<std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()> > (__a=..., this=<optimized out>, __tag=...) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:957
#24 std::shared_ptr<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >::shared_ptr<std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()> > (__a=..., __tag=..., this=<optimized out>) at /usr/include/c++/4.8.2/bits/shared_ptr.h:316
#25 std::allocate_shared<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::allocator<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void> >, std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::Collec--Type <RET> for more, q to quit, c to continue without paging--
tAndExportOnce()::__lambda11()> > (__a=...) at /usr/include/c++/4.8.2/bits/shared_ptr.h:598
#26 std::make_shared<std::__future_base::_Async_state_impl<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()>, void>, std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()> > () at /usr/include/c++/4.8.2/bits/shared_ptr.h:614
#27 std::__future_base::_S_make_async_state<std::_Bind_simple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11()> > (__fn=...) at /usr/include/c++/4.8.2/future:1525
#28 std::async<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::__lambda11> (__policy=std::launch::async, __fn=...) at /usr/include/c++/4.8.2/future:1538
#29 opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce (this=this@entry=0x7f75d760c000)
    at /data/devops/workspace/p-4943d876cb734a609b55ad5c740c36ff/src/pool/0/server/main/third_party/packages/opentelemetry-cpp-v1.15.0/sdk/src/metrics/export/periodic_exporting_metric_reader.cc:89
#30 0x00007f75da648a38 in opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::DoBackgroundWork (this=0x7f75d760c000)
    at /data/devops/workspace/p-4943d876cb734a609b55ad5c740c36ff/src/pool/0/server/main/third_party/packages/opentelemetry-cpp-v1.15.0/sdk/src/metrics/export/periodic_exporting_metric_reader.cc:55
#31 0x00007f75de273340 in ?? () from /lib64/libstdc++.so.6
#32 0x00007f75df2e5ea5 in start_thread () from /lib64/libpthread.so.0
#33 0x00007f75ddbecb0d in clone () from /lib64/libc.so.6

然后我们来看数据层面。

(gdb) f 14
#14 0x00007f7c4a1ebc93 in _Async_state_impl (__fn=<unknown type in /data/home/user00/tgf-server/d-8/tgf/matchsvr/bin/../../lib64/6689e6af/lib64/libopentelemetry_metrics.so, CU 0x3cc73e, DIE 0x415b2f>, this=0x7f7acd401018)
    at /usr/include/c++/4.8.2/future:1495
1495          : _M_result(new _Result<_Res>()), _M_fn(std::move(__fn))
(gdb) p _M_thread
$9 = {_M_id = {_M_thread = 0}}
(gdb) p _M_once
$10 = {_M_once = 1}
(gdb) p _M_result
$11 = std::unique_ptr<std::__future_base::_Result<void>> containing 0x0
(gdb)

可以看到,这个尝试 join 的 _M_thread 已经为空了 。

相关代码

 class __future_base::_Async_state_common : public __future_base::_State_base
  {
  protected:
#ifdef _GLIBCXX_ASYNC_ABI_COMPAT
    ~_Async_state_common();
#else
    ~_Async_state_common() = default;
#endif

    // Allow non-timed waiting functions to block until the thread completes,
    // as if joined.
    virtual void _M_run_deferred() { _M_join(); }

    void _M_join() { std::call_once(_M_once, &thread::join, ref(_M_thread)); }

    thread _M_thread;
    once_flag _M_once;
  };

  template<typename _BoundFn, typename _Res>
    class __future_base::_Async_state_impl final
    : public __future_base::_Async_state_common
    {
    public:
      explicit
      _Async_state_impl(_BoundFn&& __fn)
      : _M_result(new _Result<_Res>()), _M_fn(std::move(__fn))
      {
        _M_thread = std::thread{ [this] {
          _M_set_result(_S_task_setter(_M_result, _M_fn));
        } };
      }

      ~_Async_state_impl() { _M_join(); }

    private:
      typedef __future_base::_Ptr<_Result<_Res>> _Ptr_type;
      _Ptr_type _M_result;
      _BoundFn _M_fn;
    };

另外 std::__future_base::_Async_state_common::~_Async_state_common() 的代码在头文件里没有,所以得去源仓库找 https://github.com/gcc-mirror/gcc/blob/releases/gcc-4.8.2/libstdc%2B%2B-v3/src/c%2B%2B11/compatibility-thread-c%2B%2B0x.cc#L89 。这里直接贴出代码

namespace std _GLIBCXX_VISIBILITY(default)
{
_GLIBCXX_BEGIN_NAMESPACE_VERSION
  __future_base::_Async_state_common::~_Async_state_common() { _M_join(); }

// 后续无关代码不再贴出
}

按道理来说,_M_join() 里已经保证了 _M_thread.join() 只被调用一次,但是实际上析构函数里的 _M_thread.join() 确被诡异地触发了。

这个问题也有一些外部参考链接:

这个问题只是偶现的,所以可能和上面链接里前几个都无关,可能和最后一个线程安全的边界条件有关。实际上我参与开源社区 opentelemetry-cpp 的时候也发下过几个 std::condition_variable 几处临界条件有问题的地方,这个以后再分享。

最后

要沿着线程安全的方向分析,可能就需要花费大量时间了,我暂时没有更深一步去分析。 目前要适配这个GCC STL实现BUG,最快的方法是绕过去,然后记住不要使用(C++ 又一坑)。相关的修复PR已经推送 https://github.com/open-telemetry/opentelemetry-cpp/issues/2982 并且已经合入。 我们项目组的开源构建系统 cmake-toolset 也已经Patch接入的 otel-cpp 版本。(详见: https://github.com/owent/cmake-toolset/commit/46643e8b695fff9c7f13d0da1ecca07c2f8e9444

欢迎有兴趣的小伙伴互相交流研究。