Preface

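A long time ago, neither the performance of floating-point arithmetic nor its consistency across platforms and hardware architectures could be guaranteed. For that reason, game servers that need strong consistency and high performance have traditionally banned floating-point numbers and used a home-grown fixed-point type instead. Many years have passed, so a while ago I wanted to see whether modern hardware still has performance problems and whether consistency can now be guaranteed. I ran some simple tests and am recording the results here.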

On Consistency

First, a few reference documents:

According to these materials, several factors affect the result of a floating-point computation.

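First, floating-point numbers in C/C++ follow IEEE 754 on practically every mainstream platform (the language standards themselves do not strictly require it), but different hardware architectures may add their own extensions. For example, the x87 floating-point registers on x86 and x86_64 are 80 bits wide, and a value is only rounded to its IEEE 754 format when it is written to memory. So when compiler optimizations reorder the code, the final result may change. For example, if A*C+B*C is optimized to (A+B)*C, the two are mathematically equal but can produce different results because of rounding, and different compilers may apply different optimization strategies here. Fortunately, mainstream compilers all have options to turn this optimization off, such as /fp:precise for MSVC and -fno-fast-math for Clang/GCC.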
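
To make the A*C+B*C versus (A+B)*C point concrete, here is a minimal sketch. The function names and the input values are my own choices for illustration and are not part of the test program later in this post. The volatile stores are only there to force every intermediate to be rounded to a 32-bit float, assuming the default round-to-nearest mode.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// a*c + b*c with every intermediate rounded to a 32-bit float.
// The volatile stores force that rounding even on x87 extended registers.
float distributed(float a, float b, float c) {
  volatile float ac = a * c;
  volatile float bc = b * c;
  volatile float sum = ac + bc;
  return sum;
}

// (a+b)*c, again with every intermediate rounded to a 32-bit float.
float factored(float a, float b, float c) {
  volatile float ab = a + b;
  volatile float product = ab * c;
  return product;
}

// Prints both results for inputs chosen so that the two forms round differently.
void print_reassociation_demo() {
  const float a = 1.0f;
  const float b = std::ldexp(1.0f, -24);  // 2^-24, half an ulp of 1.0f
  const float c = 3.0f;
  const float lhs = distributed(a, b, c);
  const float rhs = factored(a, b, c);
  assert(lhs != rhs);  // holds under the default round-to-nearest mode
  std::printf("a*c+b*c = %.9g\n(a+b)*c = %.9g\n", lhs, rhs);
}
```

With these inputs, a+b rounds back to exactly 1.0f, so the factored form yields exactly 3, while the distributed form keeps the small term long enough to land on the next float above 3. A compiler that is allowed to rewrite one form into the other can therefore change the last bit of the result.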

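Second, some platforms provide compiler options or runtime library calls to set the hardware floating-point control word, which can be used to unify the behavior of the floating-point unit. Examples are the _controlfp_s interface on Windows for x86 and, on Linux: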

fpu_control_t cw = (_FPU_DEFAULT & ~_FPU_EXTENDED) | _FPU_RC_NEAREST | _FPU_SINGLE;
_FPU_SETCW(cw);

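However, according to the documentation, these control words do not affect SSE instructions. Frustratingly, the math libraries used with GCC and Clang rely on SSE instructions, and SSE/SSE2 floating-point behavior is controlled by a separate register, MXCSR. In addition, the ARM architecture has no precision-control field for floating point; only the rounding mode can be set.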
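
Since the rounding mode is the one setting that can be controlled everywhere, a portable way to experiment with it is the standard <cfenv> interface, which is also what the test code in this post uses on ARM (fesetround(FE_TONEAREST)). The helper below is a hypothetical sketch of mine: it performs one division under a chosen rounding mode and then restores the previous mode. As far as I know, glibc's fesetround on x86_64 updates both the x87 control word and MXCSR, but treat that as an assumption to verify on your platform.

```cpp
#include <cassert>
#include <cfenv>

// Computes x / y at run time under the requested rounding mode, then restores
// the previous mode. volatile prevents compile-time folding of the division.
float divide_with_rounding(float x, float y, int mode) {
  const int saved = std::fegetround();
  const int rc = std::fesetround(mode);
  assert(rc == 0);  // non-zero means the mode is not supported
  (void)rc;
  volatile float vx = x;
  volatile float vy = y;
  volatile float quotient = vx / vy;
  std::fesetround(saved);
  return quotient;
}
```

Calling divide_with_rounding(1.0f, 3.0f, FE_DOWNWARD) gives a value one ulp below the FE_TONEAREST result, which shows why every participant in a deterministic simulation needs the same rounding mode.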

ARM floating-point control register documentation:

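One more thing needs special attention: some coroutine libraries do not save the floating-point control word. If the system uses coroutines, this needs to be checked and confirmed separately.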
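
One defensive pattern, sketched here under my own assumptions, is to save the floating-point environment before handing control to code that might change it and to restore it afterwards. The names scoped_fenv and run_isolated are hypothetical. This does not fix a coroutine library that skips the control word during a context switch; it only keeps the caller's settings intact around a call.

```cpp
#include <cassert>
#include <cfenv>

// Saves the floating-point environment (rounding mode, exception flags, and
// whatever else the platform keeps there) and restores it on destruction.
class scoped_fenv {
 public:
  scoped_fenv() { std::fegetenv(&saved_); }
  ~scoped_fenv() {
    const int rc = std::fesetenv(&saved_);
    assert(rc == 0);
    (void)rc;
  }
  scoped_fenv(const scoped_fenv &) = delete;
  scoped_fenv &operator=(const scoped_fenv &) = delete;

 private:
  std::fenv_t saved_;
};

// Runs fn and guarantees the caller's floating-point environment afterwards.
template <class F>
void run_isolated(F &&fn) {
  scoped_fenv guard;
  fn();
}
```

Wrapping a resume or callback in run_isolated means a changed rounding mode inside it cannot leak into the rest of the simulation step.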

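Finally, GCC has one more compiler option, -ffloat-store. By default, GCC tries to keep intermediate floating-point results in registers, and because the x87 floating-point registers on x86 are 80 bits wide, this can also cause precision differences. The option forces intermediate results to be written to memory as well, so that they are rounded to IEEE 754 formats. Equally frustrating, Clang does not support this option: passing it makes Clang warn that the option is ignored, and in my measurements Clang's performance is the same with or without it, matching GCC without the option. This also means the option is unavailable on macOS, iOS, and Android (recent Android NDKs have removed GCC).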
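
A quick way to check whether a given build evaluates intermediates in a wider format at all is the standard FLT_EVAL_METHOD macro from <cfloat>. The two helper functions below are hypothetical names of mine, written as a small sketch rather than as part of the test program.

```cpp
#include <cfloat>

// Describes how this build evaluates floating-point expressions.
const char *describe_float_eval_method() {
  switch (FLT_EVAL_METHOD) {
    case 0:
      return "each operation is evaluated in the precision of its own type";
    case 1:
      return "float operations are evaluated as double";
    case 2:
      return "all operations are evaluated as long double";
    default:
      return "indeterminate or implementation-defined";
  }
}

// True when intermediates may carry more precision than the declared type,
// which is the situation -ffloat-store is meant to work around.
constexpr bool intermediates_may_be_wider() { return FLT_EVAL_METHOD != 0; }
```

Logging describe_float_eval_method() once at startup on every target makes it obvious which builds might be affected.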

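After enabling the options above and handling these differences, my test cases produced identical results on every platform (x86, x86_64, armv7, arm64). That said, my test cases are fairly simple, so I cannot rule out that differences only show up in more complex computations or in scenarios involving more complex compiler optimizations.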
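
One caveat when comparing results across platforms: printing a float with the default six significant digits can hide a difference in the last bits. A stricter check is to compare the raw bit patterns. The helpers below are a sketch with hypothetical names, assuming a 32-bit float; the checksum uses the well-known 32-bit FNV-1a constants.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Returns the raw 32-bit pattern of a float without violating aliasing rules.
std::uint32_t float_bits(float v) {
  static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
  std::uint32_t bits;
  std::memcpy(&bits, &v, sizeof(bits));
  return bits;
}

// FNV-1a over the bit patterns, fed byte by byte from the least significant
// byte so that the checksum does not depend on host endianness.
std::uint32_t checksum_floats(const std::vector<float> &values) {
  std::uint32_t hash = 2166136261u;  // FNV-1a 32-bit offset basis
  for (float v : values) {
    const std::uint32_t bits = float_bits(v);
    for (int shift = 0; shift < 32; shift += 8) {
      hash ^= (bits >> shift) & 0xffu;
      hash *= 16777619u;  // FNV-1a 32-bit prime
    }
  }
  return hash;
}
```

Printing one checksum per result vector on each platform makes a single-bit divergence visible, including cases such as 0.0f versus -0.0f that compare equal with operator==.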

Floating-Point Performance

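On modern hardware, floating-point performance is already quite good. If intermediate results may be kept in registers, floating-point addition and subtraction perform about the same as their integer counterparts. In these tests, multiplication is still several times to an order of magnitude slower than integer multiplication, while division is sometimes even faster than integer division.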
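
For readers who want to reproduce this kind of measurement on their own machine, a minimal timing helper might look like the sketch below. It is not the harness used for the numbers in this post (that code appears in the Test Code section and uses std::chrono::system_clock); here I assume std::chrono::steady_clock because it is monotonic. Both function names are hypothetical.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>

// Runs fn once and returns the elapsed wall time in microseconds.
template <class F>
std::int64_t measure_microseconds(F &&fn) {
  const auto begin = std::chrono::steady_clock::now();
  fn();
  const auto end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count();
}

// Hypothetical workload: sums count floats and returns the total so that the
// caller can consume it and the loop is not optimized away.
float sum_floats(const float *data, std::size_t count) {
  float total = 0.0f;
  for (std::size_t i = 0; i < count; ++i) {
    total += data[i];
  }
  return total;
}
```

A call such as measure_microseconds([&] { total = sum_floats(data, n); }) gives one sample; repeating it and printing total afterwards keeps the compiler from discarding the work.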

The test machines were:

  • Windows 11 / CPU: Intel i7-8700 3.2GHz / MEM: 48GB
  • Linux CentOS 8 / CPU: AMD EPYC 7K62 48-Core Processor 2.6GHz / MEM: 32GB
  • Xiaomi 6 / CPU: Snapdragon 835
  • Xiaomi 12 / CPU: Snapdragon 8 Gen 1

Windows MSVC x64

Compiler options: /O2 /DNDEBUG /Z7 /nologo /DWIN32 /D_WINDOWS /utf-8 /MP /W4 /wd4100 /wd4125 /wd4566 /wd4127 /wd4512 /GR- /Gy- /Zc:__cplusplus /fp:precise

Default:
Control word: 8001f
0.1*0.1=0.01

Current:
Control word: 8001f
0.1*0.1=0.01

Progress: 0/1
Progress: 4833/4833
Integer add: 3235us
  2439629427  2898576995  3714915387  4286696855   378431623  2252452707  2752353767   771443067  1608538503   614657667
Integer sub: 2519us
   146874035  1872938867  1488311003  3933596871  1332511591  2743417179   808845979   255589207  3248385047    59316451
Integer mul: 4038us
  3058914187   300037571   705100673   233718125  3947941235  3996371957  2978946479  2859514979  1770852851  1904696835
Integer div: 35407us
    12726319    17123513    35291036   745375340    22809512     6164687    31464879    15696064     5218240    24524558
Float add: 3534us
   1.40684e+14   1.40869e+14    1.4102e+14   1.40896e+14   1.41232e+14   1.40765e+14   1.41651e+14   1.41237e+14   1.40636e+14   1.41185e+14
Float sub: 3256us
  -1.23029e+14  -1.23465e+14  -1.23569e+14  -1.23352e+14   -1.2344e+14  -1.23229e+14  -1.23999e+14  -1.23416e+14  -1.23077e+14  -1.23381e+14
Float mul: 55184us
   2.03832e+09   5.93147e+35      8.98e+08   1.04652e+07   1.34553e+09   6.19712e+08   1.37178e+35   2.89617e+18   4.08849e+25   5.28501e+08
Float div: 29846us
    3.1405e+10   3.07084e+18       29468.8   1.66257e+20   1.01115e+12   1.24143e+08   1.52193e+23   2.17134e+12   1.06556e+12    6.0793e+16
Float sqrt: 26628us
    2.7177e+09   2.96204e+09     9.308e+15    1.9039e+08   6.62332e+17   1.95626e+09   7.31856e+10    2.9189e+09    9.3298e+08   3.39628e+13
Float sin:
         0 0.0922683  0.183749  0.273663  0.361241  0.445738  0.526432  0.602634  0.673695  0.739008  0.798017  0.850217  0.895163  0.932472  0.961825  0.982973
Float cos:
         1  0.995734  0.982973  0.961826  0.932472  0.895163  0.850217  0.798018  0.739009  0.673696  0.602635  0.526433  0.445739  0.361243  0.273664  0.183751

Windows MSVC x86

Compiler options: /O2 /DNDEBUG /Z7 /nologo /DWIN32 /D_WINDOWS /utf-8 /MP /W4 /wd4100 /wd4125 /wd4566 /wd4127 /wd4512 /GR- /Gy- /Zc:__cplusplus /fp:precise

Default:
Control word: 9001f
0.1*0.1=0.01

Current:
Control word: a001f
0.1*0.1=0.01

Progress: 0/1
Progress: 4833/4833
Integer add: 5199us
  2439629427  2898576995  3714915387  4286696855   378431623  2252452707  2752353767   771443067  1608538503   614657667
Integer sub: 4042us
   146874035  1872938867  1488311003  3933596871  1332511591  2743417179   808845979   255589207  3248385047    59316451
Integer mul: 5217us
  3058914187   300037571   705100673   233718125  3947941235  3996371957  2978946479  2859514979  1770852851  1904696835
Integer div: 35150us
    12726319    17123513    35291036   745375340    22809512     6164687    31464879    15696064     5218240    24524558
Float add: 3680us
   1.40684e+14   1.40869e+14    1.4102e+14   1.40896e+14   1.41232e+14   1.40765e+14   1.41651e+14   1.41237e+14   1.40636e+14   1.41185e+14
Float sub: 3402us
  -1.23029e+14  -1.23465e+14  -1.23569e+14  -1.23352e+14   -1.2344e+14  -1.23229e+14  -1.23999e+14  -1.23416e+14  -1.23077e+14  -1.23381e+14
Float mul: 51159us
   2.03832e+09   5.93147e+35      8.98e+08   1.04652e+07   1.34553e+09   6.19712e+08   1.37178e+35   2.89617e+18   4.08849e+25   5.28501e+08
Float div: 40696us
    3.1405e+10   3.07084e+18       29468.8   1.66257e+20   1.01115e+12   1.24143e+08   1.52193e+23   2.17134e+12   1.06556e+12    6.0793e+16
Float sqrt: 45395us
    2.7177e+09   2.96204e+09     9.308e+15    1.9039e+08   6.62332e+17   1.95626e+09   7.31856e+10    2.9189e+09    9.3298e+08   3.39628e+13
Float sin:
         0 0.0922683  0.183749  0.273663  0.361241  0.445738  0.526432  0.602634  0.673695  0.739008  0.798017  0.850217  0.895163  0.932472  0.961825  0.982973
Float cos:
         1  0.995734  0.982973  0.961826  0.932472  0.895163  0.850217  0.798018  0.739009  0.673696  0.602635  0.526433  0.445739  0.361243  0.273664  0.183751

Linux GCC x86_64

Compiler options: -O2 -fno-rtti -fno-fast-math -ffp-contract=off -pthread -mieee-fp -ffloat-store

The compiler option -mpc32 has the same effect as writing the following in code:

fpu_control_t cw = (_FPU_DEFAULT & ~_FPU_EXTENDED) | _FPU_RC_NEAREST | _FPU_SINGLE;
_FPU_SETCW(cw);

The math library depends on SSE instructions; with SSE disabled, compilation fails:

/data/home/owent/workspace/test/test_fpu/test_fpu.cpp: In function ‘void {anonymous}::start_benchmark_worker(size_t, size_t, benchmark_thread_data&, std::atomic<long unsigned int>&, std::atomic<long unsigned int>&)’:
/data/home/owent/workspace/test/test_fpu/test_fpu.cpp:483:60: error: SSE register return with SSE disabled
  483 |       data.result.float_sin_final_result.push_back(std::sin(3.14159f / 34 * i));
      |                                                    ~~~~~~~~^~~~~~~~~~~~~~~~~~~
gmake[2]: *** [CMakeFiles/clangconsole.dir/test_fpu.cpp.o] Error 1

-ffloat-store keeps intermediate results out of registers. The x87 floating-point registers on x86 are 10 bytes wide, which may affect the final result. See: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options

With -ffloat-store:

Default:
Control word: 37f
0.1*0.1=0.01

Current:
Control word: 7f
0.1*0.1=0.01

Progress: 0/1
Progress: 4833/4833
Integer add: 3617us
  2439629427  2898576995  3714915387  4286696855   378431623  2252452707  2752353767   771443067  1608538503   614657667
Integer sub: 2544us
   146874035  1872938867  1488311003  3933596871  1332511591  2743417179   808845979   255589207  3248385047    59316451
Integer mul: 2238us
  3058914187   300037571   705100673   233718125  3947941235  3996371957  2978946479  2859514979  1770852851  1904696835
Integer div: 48207us
    12726319    17123513    35291036   745375340    22809512     6164687    31464879    15696064     5218240    24524558
Float add: 15854us
   1.40684e+14   1.40869e+14    1.4102e+14   1.40896e+14   1.41232e+14   1.40765e+14   1.41651e+14   1.41237e+14   1.40636e+14   1.41185e+14
Float sub: 15853us
  -1.23029e+14  -1.23465e+14  -1.23569e+14  -1.23352e+14   -1.2344e+14  -1.23229e+14  -1.23999e+14  -1.23416e+14  -1.23077e+14  -1.23381e+14
Float mul: 22212us
   2.03832e+09   5.93147e+35      8.98e+08   1.04652e+07   1.34553e+09   6.19712e+08   1.37178e+35   2.89617e+18   4.08849e+25   5.28501e+08
Float div: 54831us
    3.1405e+10   3.07084e+18       29468.8   1.66257e+20   1.01115e+12   1.24143e+08   1.52193e+23   2.17134e+12   1.06556e+12    6.0793e+16
Float sqrt: 61222us
    2.7177e+09   2.96204e+09     9.308e+15    1.9039e+08   6.62332e+17   1.95626e+09   7.31856e+10    2.9189e+09    9.3298e+08   3.39628e+13
Float sin: 
         0 0.0922683  0.183749  0.273663  0.361241  0.445738  0.526432  0.602634  0.673695  0.739008  0.798017  0.850217  0.895163  0.932472  0.961825  0.982973
Float cos: 
         1  0.995734  0.982973  0.961826  0.932472  0.895163  0.850217  0.798018  0.739009  0.673696  0.602635  0.526433  0.445739  0.361243  0.273664  0.183751

Without -ffloat-store:

Default:
Control word: 37f
0.1*0.1=0.01

Current:
Control word: 7f
0.1*0.1=0.01

Progress: 0/1
Progress: 4833/4833
Integer add: 3300us
  2439629427  2898576995  3714915387  4286696855   378431623  2252452707  2752353767   771443067  1608538503   614657667
Integer sub: 3056us
   146874035  1872938867  1488311003  3933596871  1332511591  2743417179   808845979   255589207  3248385047    59316451
Integer mul: 2486us
  3058914187   300037571   705100673   233718125  3947941235  3996371957  2978946479  2859514979  1770852851  1904696835
Integer div: 48256us
    12726319    17123513    35291036   745375340    22809512     6164687    31464879    15696064     5218240    24524558
Float add: 4813us
   1.40684e+14   1.40869e+14    1.4102e+14   1.40896e+14   1.41232e+14   1.40765e+14   1.41651e+14   1.41237e+14   1.40636e+14   1.41185e+14
Float sub: 4506us
  -1.23029e+14  -1.23465e+14  -1.23569e+14  -1.23352e+14   -1.2344e+14  -1.23229e+14  -1.23999e+14  -1.23416e+14  -1.23077e+14  -1.23381e+14
Float mul: 11396us
   2.03832e+09   5.93147e+35      8.98e+08   1.04652e+07   1.34553e+09   6.19712e+08   1.37178e+35   2.89617e+18   4.08849e+25   5.28501e+08
Float div: 43665us
    3.1405e+10   3.07084e+18       29468.8   1.66257e+20   1.01115e+12   1.24143e+08   1.52193e+23   2.17134e+12   1.06556e+12    6.0793e+16
Float sqrt: 34217us
    2.7177e+09   2.96204e+09     9.308e+15    1.9039e+08   6.62332e+17   1.95626e+09   7.31856e+10    2.9189e+09    9.3298e+08   3.39628e+13
Float sin: 
         0 0.0922683  0.183749  0.273663  0.361241  0.445738  0.526432  0.602634  0.673695  0.739008  0.798017  0.850217  0.895163  0.932472  0.961825  0.982973
Float cos: 
         1  0.995734  0.982973  0.961826  0.932472  0.895163  0.850217  0.798018  0.739009  0.673696  0.602635  0.526433  0.445739  0.361243  0.273664  0.183751

Linux Clang x86_64

Compiler options: -O2 -fno-rtti -fno-fast-math -ffp-contract=off -pthread -mieee-fp

  • Clang currently does not support -ffloat-store: clang-13: warning: optimization flag '-ffloat-store' is not supported [-Wignored-optimization-argument]
  • Clang currently does not support the -mpc32 compiler option; the precision can only be set in code via _FPU_SETCW(...)

The math library depends on SSE instructions; with SSE disabled, compilation fails:

In file included from /data/home/owent/workspace/test/test_fpu/test_fpu.cpp:18:
In file included from /opt/llvm-13.0/bin/../include/c++/v1/cmath:308:
/opt/llvm-13.0/bin/../include/c++/v1/math.h:946:107: error: SSE register return with SSE disabled
inline _LIBCPP_INLINE_VISIBILITY float       frexp(float __lcpp_x, int* __lcpp_e) _NOEXCEPT       {return ::frexpf(__lcpp_x, __lcpp_e);}
                                                                                                          ^
/opt/llvm-13.0/bin/../include/c++/v1/math.h:946:107: error: SSE register return with SSE disabled

Default:
Control word: 37f
0.1*0.1=0.01

Current:
Control word: 7f
0.1*0.1=0.01

Progress: 0/1
Progress: 4833/4833
Integer add: 3853us
  2439629427  2898576995  3714915387  4286696855   378431623  2252452707  2752353767   771443067  1608538503   614657667
Integer sub: 2705us
   146874035  1872938867  1488311003  3933596871  1332511591  2743417179   808845979   255589207  3248385047    59316451
Integer mul: 2339us
  3058914187   300037571   705100673   233718125  3947941235  3996371957  2978946479  2859514979  1770852851  1904696835
Integer div: 48371us
    12726319    17123513    35291036   745375340    22809512     6164687    31464879    15696064     5218240    24524558
Float add: 4003us
   1.40684e+14   1.40869e+14    1.4102e+14   1.40896e+14   1.41232e+14   1.40765e+14   1.41651e+14   1.41237e+14   1.40636e+14   1.41185e+14
Float sub: 3735us
  -1.23029e+14  -1.23465e+14  -1.23569e+14  -1.23352e+14   -1.2344e+14  -1.23229e+14  -1.23999e+14  -1.23416e+14  -1.23077e+14  -1.23381e+14
Float mul: 10257us
   2.03832e+09   5.93147e+35      8.98e+08   1.04652e+07   1.34553e+09   6.19712e+08   1.37178e+35   2.89617e+18   4.08849e+25   5.28501e+08
Float div: 42894us
    3.1405e+10   3.07084e+18       29468.8   1.66257e+20   1.01115e+12   1.24143e+08   1.52193e+23   2.17134e+12   1.06556e+12    6.0793e+16
Float sqrt: 34741us
    2.7177e+09   2.96204e+09     9.308e+15    1.9039e+08   6.62332e+17   1.95626e+09   7.31856e+10    2.9189e+09    9.3298e+08   3.39628e+13
Float sin: 
         0 0.0922683  0.183749  0.273663  0.361241  0.445738  0.526432  0.602634  0.673695  0.739008  0.798017  0.850217  0.895163  0.932472  0.961825  0.982973
Float cos: 
         1  0.995734  0.982973  0.961826  0.932472  0.895163  0.850217  0.798018  0.739009  0.673696  0.602635  0.526433  0.445739  0.361243  0.273664  0.183751

Android arm64

Compiler options: -O2 -fno-rtti -fno-fast-math -ffp-contract=off -pthread

  • Clang currently does not support -ffloat-store
  • ARM cannot set the floating-point precision; the rounding mode is kept the same as on x86 above

Snapdragon 835 (MI 6)

Default:
Rounding word: 0
0.1*0.1=0.01

Current:
Rounding word: 0
0.1*0.1=0.01

Progress: 0/1
Progress: 4833/4833
Integer add: 9570us
  2439629427  2898576995  3714915387  4286696855  378431623  2252452707  2752353767  771443067  1608538503  614657667
Integer sub: 12413us
  146874035  1872938867  1488311003  3933596871  1332511591  2743417179  808845979  255589207  3248385047  59316451
Integer mul: 7995us
  3058914187  300037571  705100673  233718125  3947941235  3996371957  2978946479  2859514979  1770852851  1904696835
Integer div: 62027us
  12726319  17123513  35291036  745375340  22809512  6164687  31464879  15696064  5218240  24524558
Float add: 9716us
  1.40684e+14  1.40869e+14  1.4102e+14  1.40896e+14  1.41232e+14  1.40765e+14  1.41651e+14  1.41237e+14  1.40636e+14  1.41185e+14
Float sub: 9313us
  -1.23029e+14  -1.23465e+14  -1.23569e+14  -1.23352e+14  -1.2344e+14  -1.23229e+14  -1.23999e+14  -1.23416e+14  -1.23077e+14  -1.23381e+14
Float mul: 51298us
  2.03832e+09  5.93147e+35  8.98e+08  1.04652e+07  1.34553e+09  6.19712e+08  1.37178e+35  2.89617e+18  4.08849e+25  5.28501e+08
Float div: 98013us
  3.1405e+10  3.07084e+18  29468.8  1.66257e+20  1.01115e+12  1.24143e+08  1.52193e+23  2.17134e+12  1.06556e+12  6.0793e+16
Float sqrt: 70492us
  2.7177e+09  2.96204e+09  9.308e+15  1.9039e+08  6.62332e+17  1.95626e+09  7.31856e+10  2.9189e+09  9.3298e+08  3.39628e+13
Float sin: 
 0 0.0922683 0.183749 0.273663 0.361241 0.445738 0.526432 0.602634 0.673695 0.739008 0.798017 0.850217 0.895163 0.932472 0.961825 0.982973
Float cos: 
 1 0.995734 0.982973 0.961826 0.932472 0.895163 0.850217 0.798018 0.739009 0.673696 0.602635 0.526433 0.445739 0.361243 0.273664 0.183751

Snapdragon 8 Gen 1 (MI 12)

Default:
Rounding word: 0
0.1*0.1=0.01

Current:
Rounding word: 0
0.1*0.1=0.01

Progress: 0/1
Progress: 4833/4833
Integer add: 9783us
  2439629427  2898576995  3714915387  4286696855  378431623  2252452707  2752353767  771443067  1608538503  614657667
Integer sub: 8433us
  146874035  1872938867  1488311003  3933596871  1332511591  2743417179  808845979  255589207  3248385047  59316451
Integer mul: 6595us
  3058914187  300037571  705100673  233718125  3947941235  3996371957  2978946479  2859514979  1770852851  1904696835
Integer div: 53200us
  12726319  17123513  35291036  745375340  22809512  6164687  31464879  15696064  5218240  24524558
Float add: 9711us
  1.40684e+14  1.40869e+14  1.4102e+14  1.40896e+14  1.41232e+14  1.40765e+14  1.41651e+14  1.41237e+14  1.40636e+14  1.41185e+14
Float sub: 18869us
  -1.23029e+14  -1.23465e+14  -1.23569e+14  -1.23352e+14  -1.2344e+14  -1.23229e+14  -1.23999e+14  -1.23416e+14  -1.23077e+14  -1.23381e+14
Float mul: 72346us
  2.03832e+09  5.93147e+35  8.98e+08  1.04652e+07  1.34553e+09  6.19712e+08  1.37178e+35  2.89617e+18  4.08849e+25  5.28501e+08
Float div: 73249us
  3.1405e+10  3.07084e+18  29468.8  1.66257e+20  1.01115e+12  1.24143e+08  1.52193e+23  2.17134e+12  1.06556e+12  6.0793e+16
Float sqrt: 40105us
  2.7177e+09  2.96204e+09  9.308e+15  1.9039e+08  6.62332e+17  1.95626e+09  7.31856e+10  2.9189e+09  9.3298e+08  3.39628e+13
Float sin: 
 0 0.0922683 0.183749 0.273663 0.361241 0.445738 0.526432 0.602634 0.673695 0.739008 0.798017 0.850217 0.895163 0.932472 0.961825 0.982973
Float cos: 
 1 0.995734 0.982973 0.961826 0.932472 0.895163 0.850217 0.798018 0.739009 0.673696 0.602635 0.526433 0.445739 0.361243 0.273664 0.183751

Test Code

Here is the test code: three main source files and one CMake project file.

test_fpu.h

// Copyright 2022 Tencent

#pragma once

#include <stdint.h>
#include <chrono>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>  // std::pair, used by get_benchmark_progress
#include <vector>

bool init_fpu();

std::string dump_current_controlfp();

struct benchmark_handle;

struct benchmark_result {
  std::chrono::system_clock::duration integer_add_cost;
  std::chrono::system_clock::duration integer_sub_cost;
  std::chrono::system_clock::duration integer_mul_cost;
  std::chrono::system_clock::duration integer_div_cost;

  std::vector<uint32_t> integer_add_final_result;
  std::vector<uint32_t> integer_sub_final_result;
  std::vector<uint32_t> integer_mul_final_result;
  std::vector<uint32_t> integer_div_final_result;

  std::chrono::system_clock::duration float_add_cost;
  std::chrono::system_clock::duration float_sub_cost;
  std::chrono::system_clock::duration float_mul_cost;
  std::chrono::system_clock::duration float_div_cost;
  std::chrono::system_clock::duration float_sqrt_cost;

  std::vector<float> float_add_final_result;
  std::vector<float> float_sub_final_result;
  std::vector<float> float_mul_final_result;
  std::vector<float> float_div_final_result;
  std::vector<float> float_sqrt_final_result;

  std::vector<float> float_sin_final_result;
  std::vector<float> float_cos_final_result;
};

std::shared_ptr<benchmark_handle> start_benchmark(size_t thread_count = 8, size_t round = 10);

bool is_benchmark_running(const std::shared_ptr<benchmark_handle> &handle);

std::pair<size_t, size_t> get_benchmark_progress(const std::shared_ptr<benchmark_handle> &handle);

size_t get_benchmark_running_thread(const std::shared_ptr<benchmark_handle> &handle);

size_t get_benchmark_thread_count(const std::shared_ptr<benchmark_handle> &handle);

void pick_benchmark_result(const std::shared_ptr<benchmark_handle> &handle, std::vector<benchmark_result> &result);

test_fpu.cpp

// Copyright 2022 Tencent

#include "test_fpu.h"

#include <assert.h>

#if defined(_WIN32)
#  include <float.h>
#else
#  if defined(__aarch64__) || defined(__arm__)
#    include <fenv.h>
#  else  // defined(__i386__) || defined(__x86_64__)
#    include <fpu_control.h>
#  endif
#endif
#include <atomic>
#include <chrono>
#include <cmath>
#include <memory>
#include <random>
#include <sstream>
#include <string>
#include <thread>
#include <type_traits>

bool init_fpu() {
#if defined(_MSC_VER)
  unsigned int control_word;
  int err;
  err = _controlfp_s(&control_word, 0, 0);
  if (err) {
    return false;
  }

#  if !defined(_M_X64)
  err = _controlfp_s(&control_word, PC_24, MCW_PC);
  if (err) {
    return false;
  }
#  endif
  err = _controlfp_s(&control_word, RC_NEAR, MCW_RC);
  if (err) {
    return false;
  }
  return true;
#else
#  if defined(__aarch64__) || defined(__arm__)
  fesetround(FE_TONEAREST);
#  else
  fpu_control_t cw = (_FPU_DEFAULT & ~_FPU_EXTENDED) | _FPU_RC_NEAREST | _FPU_SINGLE;
  _FPU_SETCW(cw);
#  endif
  return true;
#endif
}

std::string dump_current_controlfp() {
  std::stringstream ss;
  float a = 0.1f;

#if defined(_MSC_VER)
  unsigned int control_word;
  int err = _controlfp_s(&control_word, 0, 0);
  if (err) {
    ss << "Got error code: " << err;
    return ss.str();
  }

  ss << "Control word: " << std::hex << control_word << std::endl;
  float b = a * a;
  ss << a << "*" << a << "=" << b << std::endl;
#else
#  if defined(__aarch64__) || defined(__arm__)
  ss << "Rounding word: " << std::hex << fegetround() << std::endl;
#  else
  fpu_control_t cw;
  _FPU_GETCW(cw);
  ss << "Control word: " << std::hex << cw << std::endl;
#  endif
  float b = a * a;
  ss << a << "*" << a << "=" << b << std::endl;
#endif

  return ss.str();
}

struct benchmark_thread_data {
  std::unique_ptr<std::thread> thread;

  benchmark_result result;
};

struct benchmark_handle {
  size_t max_round;
  std::atomic<size_t> running_thread;
  std::atomic<size_t> progress_total;
  std::atomic<size_t> progress_done;
  std::vector<benchmark_thread_data> datas;
  std::unique_ptr<std::thread> controller_thread;

  ~benchmark_handle() {
    if (controller_thread && controller_thread->joinable()) {
      controller_thread->join();
    }
  }
};

namespace {
static constexpr size_t kMaxParameterCount = 1 << 20;
static constexpr size_t kMaxParameterArraySize = kMaxParameterCount * 2;
static uint32_t g_integer_parameters_odd[kMaxParameterArraySize] = {0};
static uint32_t g_integer_parameters_even[kMaxParameterArraySize] = {0};
static float g_float_parameters_odd[kMaxParameterArraySize] = {0};
static float g_float_parameters_even[kMaxParameterArraySize] = {0};

static void initialize_parameters(std::atomic<size_t> &progress_total, std::atomic<size_t> &progress_done) {
  if (g_integer_parameters_even[std::extent<decltype(g_integer_parameters_even)>::value - 1] != 0) {
    return;
  }

  progress_total += kMaxParameterArraySize >> 9;

  std::mt19937 rnd{9999991};
  size_t index = 0;
  while (index < kMaxParameterArraySize * 2) {
    uint32_t r = rnd();
    if (r < 9999991) {
      continue;
    }
    r = (r << 1) & 0x7ffffffe;
    if (index & 0x1) {
      g_integer_parameters_odd[index >> 1] = r | 0x1;
      g_float_parameters_odd[index >> 1] = static_cast<float>(r | 0x1);
    } else {
      g_integer_parameters_even[index >> 1] = r;
      g_float_parameters_even[index >> 1] = static_cast<float>(r);
    }

    ++index;
    if (0 == (index & ((1 << 10) - 1))) {
      ++progress_done;
    }
  }
}

template <class TDATA>
static inline void benchmark_add(TDATA odd[], TDATA even[], TDATA &final_result, size_t start_parameter_idx) {
  size_t s1 = start_parameter_idx;
  size_t s2 = start_parameter_idx;
  final_result += odd[s1 + 1] + odd[s1 + 2] + odd[s1 + 3] + odd[s1 + 4] + odd[s1 + 5] + odd[s1 + 6] + odd[s1 + 7] +
                  odd[s1 + 8] + odd[s1 + 9] + odd[s1 + 10] + odd[s1 + 11] + odd[s1 + 12] + odd[s1 + 13] + odd[s1 + 14] +
                  odd[s1 + 15] + odd[s1];
  final_result += even[s2 + 1] + even[s2 + 2] + even[s2 + 3] + even[s2 + 4] + even[s2 + 5] + even[s2 + 6] +
                  even[s2 + 7] + even[s2 + 8] + even[s2 + 9] + even[s2 + 10] + even[s2 + 11] + even[s2 + 12] +
                  even[s2 + 13] + even[s2 + 14] + even[s2 + 15] + even[s2];
}

template <class TDATA>
static inline void benchmark_sub(TDATA odd[], TDATA even[], TDATA &final_result, size_t start_parameter_idx) {
  size_t s1 = start_parameter_idx;
  size_t s2 = start_parameter_idx;
  final_result += odd[s1 + 1] - odd[s1 + 2] - odd[s1 + 3] - odd[s1 + 4] - odd[s1 + 5] - odd[s1 + 6] - odd[s1 + 7] -
                  odd[s1 + 8] - odd[s1 + 9] - odd[s1 + 10] - odd[s1 + 11] - odd[s1 + 12] - odd[s1 + 13] - odd[s1 + 14] -
                  odd[s1 + 15] - odd[s1];
  final_result += even[s2 + 1] - even[s2 + 2] - even[s2 + 3] - even[s2 + 4] - even[s2 + 5] - even[s2 + 6] -
                  even[s2 + 7] - even[s2 + 8] - even[s2 + 9] - even[s2 + 10] - even[s2 + 11] - even[s2 + 12] -
                  even[s2 + 13] - even[s2 + 14] - even[s2 + 15] - even[s2];
}

template <class>
struct benchmark_mul_helper;

template <>
struct benchmark_mul_helper<uint32_t> {
  static inline void do_operator(uint32_t odd[], uint32_t &final_result, size_t start_parameter_idx) {
    final_result *= odd[start_parameter_idx];
    final_result *= odd[start_parameter_idx++];
  }
};

template <>
struct benchmark_mul_helper<float> {
  static inline void do_operator(float odd[], float &final_result, size_t start_parameter_idx) {
    if (std::isinf(final_result * odd[start_parameter_idx])) {
      int exp;
      final_result = std::frexp(final_result, &exp);
      // memset(&final_result, 0, 1);
      // *(reinterpret_cast<uint8_t *>(&final_result) + sizeof(float) - 1) = 0;
    }
    final_result *= odd[start_parameter_idx++];
  }
};

template <class TDATA>
static inline void benchmark_mul(TDATA odd[], TDATA even[], TDATA &final_result, size_t start_parameter_idx) {
  benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
}

template <class>
struct benchmark_div_helper;

template <>
struct benchmark_div_helper<uint32_t> {
  static inline void do_operator(uint32_t odd[], uint32_t even[], uint32_t &final_result, size_t start_parameter_idx) {
    final_result *= even[start_parameter_idx];
    uint32_t devided = (odd[start_parameter_idx] & 0xff);
    if (final_result > devided) {
      final_result /= devided;
    } else {
      final_result %= devided;
    }
  }
};

template <>
struct benchmark_div_helper<float> {
  static inline void do_operator(float odd[], float even[], float &final_result, size_t start_parameter_idx) {
    float r = final_result * even[start_parameter_idx];
    if (!std::isinf(r)) {
      final_result = r;
    }
    if (final_result > odd[start_parameter_idx]) {
      final_result /= odd[start_parameter_idx];
    } else {
      int exp;
      float devided = std::frexp(odd[start_parameter_idx], &exp);
      final_result /= devided;
    }
  }
};

template <class TDATA>
static inline void benchmark_div(TDATA odd[], TDATA even[], TDATA &final_result, size_t start_parameter_idx) {
  benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
  benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
  benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
  benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
  benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
  benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
  benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
  benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
  benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
  benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
  benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
  benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
  benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
  benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
  benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
  benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
}

template <class>
struct benchmark_sqrt_helper;

template <>
struct benchmark_sqrt_helper<float> {
  static inline void do_operator(float odd[], float &final_result, size_t start_parameter_idx) {
    float v = odd[start_parameter_idx];
    if (start_parameter_idx & 0xc) {
      final_result = std::sqrt(final_result * v + v * v);
    } else if (start_parameter_idx & 0x3) {
      final_result = std::sqrt(final_result * v * v);
    } else {
      final_result = std::sqrt(final_result * final_result * v);
    }
  }
};

template <class TDATA>
static inline void benchmark_sqrt(TDATA odd[], TDATA even[], TDATA &final_result, size_t start_parameter_idx) {
  benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
  benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
}

static void start_benchmark_worker(size_t idx, size_t max_round, benchmark_thread_data &data,
                                   std::atomic<size_t> &progress_total, std::atomic<size_t> &progress_done) {
  constexpr const size_t step = 1 << 4;
  constexpr const size_t iterator_count = kMaxParameterCount >> 4;
  progress_total += max_round * 9;
  progress_total += 2;  // sin + cos

  // integer add
  {
    data.result.integer_add_final_result.resize(max_round);
    auto begin = std::chrono::system_clock::now();
    for (size_t round = 0; round < max_round; ++round) {
      size_t start_index = kMaxParameterCount / max_round * round;
      size_t iterator_end = start_index + iterator_count;
      uint32_t result = g_integer_parameters_odd[start_index];
      for (size_t i = start_index; i < iterator_end; i += step) {
        benchmark_add(g_integer_parameters_odd, g_integer_parameters_even, result, i);
      }
      ++progress_done;

      data.result.integer_add_final_result[round] = result;
    }
    auto end = std::chrono::system_clock::now();
    data.result.integer_add_cost = end - begin;
  }

  // integer sub
  {
    data.result.integer_sub_final_result.resize(max_round);
    auto begin = std::chrono::system_clock::now();
    for (size_t round = 0; round < max_round; ++round) {
      size_t start_index = kMaxParameterCount / max_round * round;
      size_t iterator_end = start_index + iterator_count;
      uint32_t result = g_integer_parameters_odd[start_index];
      for (size_t i = start_index; i < iterator_end; i += step) {
        benchmark_sub(g_integer_parameters_odd, g_integer_parameters_even, result, i);
      }
      ++progress_done;

      data.result.integer_sub_final_result[round] = result;
    }
    auto end = std::chrono::system_clock::now();
    data.result.integer_sub_cost = end - begin;
  }

  // integer mul
  {
    data.result.integer_mul_final_result.resize(max_round);
    auto begin = std::chrono::system_clock::now();
    for (size_t round = 0; round < max_round; ++round) {
      size_t start_index = kMaxParameterCount / max_round * round;
      size_t iterator_end = start_index + iterator_count;
      uint32_t result = g_integer_parameters_odd[start_index];
      for (size_t i = start_index; i < iterator_end; i += step) {
        benchmark_mul(g_integer_parameters_odd, g_integer_parameters_even, result, i);
      }
      ++progress_done;

      data.result.integer_mul_final_result[round] = result;
    }
    auto end = std::chrono::system_clock::now();
    data.result.integer_mul_cost = end - begin;
  }

  // integer div
  {
    data.result.integer_div_final_result.resize(max_round);
    auto begin = std::chrono::system_clock::now();
    for (size_t round = 0; round < max_round; ++round) {
      size_t start_index = kMaxParameterCount / max_round * round;
      size_t iterator_end = start_index + iterator_count;
      uint32_t result = g_integer_parameters_odd[start_index];
      for (size_t i = start_index; i < iterator_end; i += step) {
        benchmark_div(g_integer_parameters_odd, g_integer_parameters_even, result, i);
      }
      ++progress_done;

      data.result.integer_div_final_result[round] = result;
    }
    auto end = std::chrono::system_clock::now();
    data.result.integer_div_cost = end - begin;
  }

  // float add
  {
    data.result.float_add_final_result.resize(max_round);
    auto begin = std::chrono::system_clock::now();
    for (size_t round = 0; round < max_round; ++round) {
      size_t start_index = kMaxParameterCount / max_round * round;
      size_t iterator_end = start_index + iterator_count;
      float result = g_float_parameters_odd[start_index];
      for (size_t i = start_index; i < iterator_end; i += step) {
        benchmark_add(g_float_parameters_odd, g_float_parameters_even, result, i);
      }
      ++progress_done;

      data.result.float_add_final_result[round] = result;
    }
    auto end = std::chrono::system_clock::now();
    data.result.float_add_cost = end - begin;
  }

  // float sub
  {
    data.result.float_sub_final_result.resize(max_round);
    auto begin = std::chrono::system_clock::now();
    for (size_t round = 0; round < max_round; ++round) {
      size_t start_index = kMaxParameterCount / max_round * round;
      size_t iterator_end = start_index + iterator_count;
      float result = g_float_parameters_odd[start_index];
      for (size_t i = start_index; i < iterator_end; i += step) {
        benchmark_sub(g_float_parameters_odd, g_float_parameters_even, result, i);
      }
      ++progress_done;

      data.result.float_sub_final_result[round] = result;
    }
    auto end = std::chrono::system_clock::now();
    data.result.float_sub_cost = end - begin;
  }

  // float mul
  {
    data.result.float_mul_final_result.resize(max_round);
    auto begin = std::chrono::system_clock::now();
    for (size_t round = 0; round < max_round; ++round) {
      size_t start_index = kMaxParameterCount / max_round * round;
      size_t iterator_end = start_index + iterator_count;
      float result = g_float_parameters_odd[start_index];
      for (size_t i = start_index; i < iterator_end; i += step) {
        benchmark_mul(g_float_parameters_odd, g_float_parameters_even, result, i);
      }
      ++progress_done;

      data.result.float_mul_final_result[round] = result;
    }
    auto end = std::chrono::system_clock::now();
    data.result.float_mul_cost = end - begin;
  }

  // float div
  {
    data.result.float_div_final_result.resize(max_round);
    auto begin = std::chrono::system_clock::now();
    for (size_t round = 0; round < max_round; ++round) {
      size_t start_index = kMaxParameterCount / max_round * round;
      size_t iterator_end = start_index + iterator_count;
      float result = g_float_parameters_odd[start_index];
      for (size_t i = start_index; i < iterator_end; i += step) {
        benchmark_div(g_float_parameters_odd, g_float_parameters_even, result, i);
      }
      ++progress_done;
      data.result.float_div_final_result[round] = result;
    }
    auto end = std::chrono::system_clock::now();
    data.result.float_div_cost = end - begin;
  }

  // float sqrt
  {
    data.result.float_sqrt_final_result.resize(max_round);
    auto begin = std::chrono::system_clock::now();
    for (size_t round = 0; round < max_round; ++round) {
      size_t start_index = kMaxParameterCount / max_round * round;
      size_t iterator_end = start_index + iterator_count;
      float result = g_float_parameters_odd[start_index];
      for (size_t i = start_index; i < iterator_end; i += step) {
        benchmark_sqrt(g_float_parameters_odd, g_float_parameters_even, result, i);
      }
      ++progress_done;
      data.result.float_sqrt_final_result[round] = result;
    }
    auto end = std::chrono::system_clock::now();
    data.result.float_sqrt_cost = end - begin;
  }

  // float sin
  {
    for (int i = 0; i < 16; ++i) {
      data.result.float_sin_final_result.push_back(std::sin(3.14159f / 34 * i));
    }
    ++progress_done;
  }

  // float cos
  {
    for (int i = 0; i < 16; ++i) {
      data.result.float_cos_final_result.push_back(std::cos(3.14159f / 34 * i));
    }
    ++progress_done;
  }
}

static void start_benchmark_controller(std::shared_ptr<benchmark_handle> handle) {
  initialize_parameters(handle->progress_total, handle->progress_done);

  size_t idx = 0;
  for (auto &data : handle->datas) {
    data.result.float_add_cost = std::chrono::system_clock::duration::zero();
    data.result.float_sub_cost = std::chrono::system_clock::duration::zero();
    data.result.float_mul_cost = std::chrono::system_clock::duration::zero();
    data.result.float_div_cost = std::chrono::system_clock::duration::zero();
    data.result.float_sqrt_cost = std::chrono::system_clock::duration::zero();
    data.result.integer_add_cost = std::chrono::system_clock::duration::zero();
    data.result.integer_sub_cost = std::chrono::system_clock::duration::zero();
    data.result.integer_mul_cost = std::chrono::system_clock::duration::zero();
    data.result.integer_div_cost = std::chrono::system_clock::duration::zero();
    data.thread = std::unique_ptr<std::thread>(new std::thread([idx, &data, &handle]() {
      ++handle->running_thread;
      start_benchmark_worker(idx, handle->max_round, data, handle->progress_total, handle->progress_done);
      --handle->running_thread;
    }));
    ++idx;
  }

  for (auto &data : handle->datas) {
    if (data.thread && data.thread->joinable()) {
      data.thread->join();
    }
  }
}
}  // namespace

std::shared_ptr<benchmark_handle> start_benchmark(size_t thread_count, size_t round) {
  if (thread_count > 32) {
    thread_count = 32;
  }

  std::shared_ptr<benchmark_handle> ret = std::make_shared<benchmark_handle>();
  if (!ret) {
    return ret;
  }

  ret->max_round = round;
  ret->running_thread.store(0);
  ret->progress_total.store(1);
  ret->progress_done.store(0);
  ret->datas.resize(thread_count);
  ret->controller_thread = std::unique_ptr<std::thread>(new std::thread([ret]() {
    start_benchmark_controller(ret);
    ++ret->progress_done;
  }));
  return ret;
}

bool is_benchmark_running(const std::shared_ptr<benchmark_handle> &handle) {
  if (!handle) {
    return false;
  }

  if (!handle->controller_thread) {
    return false;
  }

  return handle->progress_done.load() < handle->progress_total.load();
}

std::pair<size_t, size_t> get_benchmark_progress(const std::shared_ptr<benchmark_handle> &handle) {
  if (!handle) {
    return std::pair<size_t, size_t>{0, 0};
  }

  return std::pair<size_t, size_t>{handle->progress_done, handle->progress_total};
}

size_t get_benchmark_running_thread(const std::shared_ptr<benchmark_handle> &handle) {
  if (!handle) {
    return 0;
  }

  return handle->running_thread.load();
}

size_t get_benchmark_thread_count(const std::shared_ptr<benchmark_handle> &handle) {
  if (!handle) {
    return 0;
  }

  return handle->datas.size();
}

void pick_benchmark_result(const std::shared_ptr<benchmark_handle> &handle, std::vector<benchmark_result> &result) {
  if (!handle) {
    return;
  }

  result.reserve(handle->datas.size());
  for (auto &data : handle->datas) {
    result.push_back(data.result);
  }
}
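
The index arithmetic in start_benchmark_worker is easy to misread: every round starts at kMaxParameterCount / max_round * round and scans kMaxParameterCount >> 4 entries in strides of 16. The sketch below reproduces that arithmetic in a standalone helper and adds a clamp. The helper name round_window, its parameters, and the clamp itself are illustrative assumptions and not part of the original code. It also assumes that the parameter tables hold exactly `total` entries and that each inner-loop step consumes `step` consecutive entries.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>

// Hypothetical helper (not in the original source): returns the [begin, end)
// slice of a parameter table that one round would scan.
// `total` stands in for kMaxParameterCount, `step` for the inner-loop stride.
inline std::pair<std::size_t, std::size_t> round_window(std::size_t total, std::size_t max_round,
                                                        std::size_t round, std::size_t step) {
  if (max_round == 0 || round >= max_round || step == 0) {
    return {0, 0};
  }
  // Same start offset as the worker: the integer division happens first.
  const std::size_t begin = total / max_round * round;
  // Same 1/16 window as the worker, but clamped to what is left in the table.
  std::size_t length = std::min(total >> 4, total - begin);
  // Keep only whole steps so that the last step cannot run past the table.
  length -= length % step;
  return {begin, begin + length};
}
```

For example, with a 1600-entry table, round 31 of 32 starts at 1550, and the unclamped arithmetic would end at 1550 + 100 = 1650. If the tables really are sized to kMaxParameterCount, a round count above 16 on the command line may therefore deserve a check like this one. Note that the sketch rounds the window down to whole steps, which scans slightly fewer entries than the original loops do.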

main.cpp

// Copyright 2022 Tencent

#include <chrono>
#include <cstdlib>
#include <cstring>
#include <functional>
#include <iomanip>
#include <iostream>
#include <memory>
#include <thread>
#include <type_traits>
#include <vector>

#include "test_fpu.h"

int main(int argc, char* argv[]) {
  std::cout << "Default:" << std::endl;
  std::cout << dump_current_controlfp() << std::endl;
  init_fpu();
  std::cout << "Current:" << std::endl;
  std::cout << dump_current_controlfp() << std::endl;

  // Defaults; both can be overridden on the command line: testfpu [thread_count] [round]
  size_t thread_count = 8;
  size_t round = 10;

  if (argc > 1) {
    thread_count = std::strtoul(argv[1], nullptr, 10);
  }

  if (argc > 2) {
    round = std::strtoul(argv[2], nullptr, 10);
  }

  auto benchmark = start_benchmark(thread_count, round);

  while (is_benchmark_running(benchmark)) {
    auto progress = get_benchmark_progress(benchmark);
    std::cout << "Progress: " << progress.first << "/" << progress.second << std::endl;

    std::this_thread::sleep_for(std::chrono::milliseconds(500));
  }

  {
    auto progress = get_benchmark_progress(benchmark);
    std::cout << "Progress: " << progress.first << "/" << progress.second << std::endl;
  }

  std::vector<benchmark_result> results;
  pick_benchmark_result(benchmark, results);

  std::ios_base::sync_with_stdio(false);

  {
    std::chrono::system_clock::duration total_cost = std::chrono::system_clock::duration::zero();
    for (auto& result : results) {
      total_cost += result.integer_add_cost;
    }
    std::cout << "Integer add: " << std::chrono::duration_cast<std::chrono::microseconds>(total_cost).count() << "us"
              << std::endl;
    for (auto& result : results) {
      for (auto& final_result : result.integer_add_final_result) {
        std::cout << "  " << std::setw(10) << final_result;
      }
      std::cout << std::endl;
      break;
    }
  }

  {
    std::chrono::system_clock::duration total_cost = std::chrono::system_clock::duration::zero();
    for (auto& result : results) {
      total_cost += result.integer_sub_cost;
    }
    std::cout << "Integer sub: " << std::chrono::duration_cast<std::chrono::microseconds>(total_cost).count() << "us"
              << std::endl;
    for (auto& result : results) {
      for (auto& final_result : result.integer_sub_final_result) {
        std::cout << "  " << std::setw(10) << final_result;
      }
      std::cout << std::endl;
      break;
    }
  }

  {
    std::chrono::system_clock::duration total_cost = std::chrono::system_clock::duration::zero();
    for (auto& result : results) {
      total_cost += result.integer_mul_cost;
    }
    std::cout << "Integer mul: " << std::chrono::duration_cast<std::chrono::microseconds>(total_cost).count() << "us"
              << std::endl;
    for (auto& result : results) {
      for (auto& final_result : result.integer_mul_final_result) {
        std::cout << "  " << std::setw(10) << final_result;
      }
      std::cout << std::endl;
      break;
    }
  }

  {
    std::chrono::system_clock::duration total_cost = std::chrono::system_clock::duration::zero();
    for (auto& result : results) {
      total_cost += result.integer_div_cost;
    }
    std::cout << "Integer div: " << std::chrono::duration_cast<std::chrono::microseconds>(total_cost).count() << "us"
              << std::endl;
    for (auto& result : results) {
      for (auto& final_result : result.integer_div_final_result) {
        std::cout << "  " << std::setw(10) << final_result;
      }
      std::cout << std::endl;
      break;
    }
  }

  {
    std::chrono::system_clock::duration total_cost = std::chrono::system_clock::duration::zero();
    for (auto& result : results) {
      total_cost += result.float_add_cost;
    }
    std::cout << "Float add: " << std::chrono::duration_cast<std::chrono::microseconds>(total_cost).count() << "us"
              << std::endl;
    for (auto& result : results) {
      for (auto& final_result : result.float_add_final_result) {
        std::cout << "  " << std::setw(12) << final_result;
      }
      std::cout << std::endl;
      break;
    }
  }

  {
    std::chrono::system_clock::duration total_cost = std::chrono::system_clock::duration::zero();
    for (auto& result : results) {
      total_cost += result.float_sub_cost;
    }
    std::cout << "Float sub: " << std::chrono::duration_cast<std::chrono::microseconds>(total_cost).count() << "us"
              << std::endl;
    for (auto& result : results) {
      for (auto& final_result : result.float_sub_final_result) {
        std::cout << "  " << std::setw(12) << final_result;
      }
      std::cout << std::endl;
      break;
    }
  }

  {
    std::chrono::system_clock::duration total_cost = std::chrono::system_clock::duration::zero();
    for (auto& result : results) {
      total_cost += result.float_mul_cost;
    }
    std::cout << "Float mul: " << std::chrono::duration_cast<std::chrono::microseconds>(total_cost).count() << "us"
              << std::endl;
    for (auto& result : results) {
      for (auto& final_result : result.float_mul_final_result) {
        std::cout << "  " << std::setw(12) << final_result;
      }
      std::cout << std::endl;
      break;
    }
  }

  {
    std::chrono::system_clock::duration total_cost = std::chrono::system_clock::duration::zero();
    for (auto& result : results) {
      total_cost += result.float_div_cost;
    }
    std::cout << "Float div: " << std::chrono::duration_cast<std::chrono::microseconds>(total_cost).count() << "us"
              << std::endl;
    for (auto& result : results) {
      for (auto& final_result : result.float_div_final_result) {
        std::cout << "  " << std::setw(12) << final_result;
      }
      std::cout << std::endl;
      break;
    }
  }

  {
    std::chrono::system_clock::duration total_cost = std::chrono::system_clock::duration::zero();
    for (auto& result : results) {
      total_cost += result.float_sqrt_cost;
    }
    std::cout << "Float sqrt: " << std::chrono::duration_cast<std::chrono::microseconds>(total_cost).count() << "us"
              << std::endl;
    for (auto& result : results) {
      for (auto& final_result : result.float_sqrt_final_result) {
        std::cout << "  " << std::setw(12) << final_result;
      }
      std::cout << std::endl;
      break;
    }
  }

  {
    std::cout << "Float sin: " << std::endl;
    for (auto& result : results) {
      for (auto& final_result : result.float_sin_final_result) {
        std::cout << " " << std::setw(9) << final_result;
      }
      std::cout << std::endl;
      break;
    }
  }

  {
    std::cout << "Float cos: " << std::endl;
    for (auto& result : results) {
      for (auto& final_result : result.float_cos_final_result) {
        std::cout << " " << std::setw(9) << final_result;
      }
      std::cout << std::endl;
      break;
    }
  }

  std::cout.flush();
  return 0;
}
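
One caveat about the output above: operator<< prints a float with 6 significant digits by default, while a 32-bit float needs up to 9 digits to round-trip, so two platforms could print the same text for results that differ in the last bits. A minimal sketch of a bit-exact dump follows. The helper name float_bits_hex is hypothetical and not part of the original program, and the sketch assumes a 32-bit IEEE 754 float.

```cpp
#include <cstdint>
#include <cstring>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper (not in the original source): renders the raw bit
// pattern of a float as 8 lowercase hex digits.
inline std::string float_bits_hex(float value) {
  static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
  std::uint32_t bits = 0;
  // memcpy is the well-defined way to reinterpret the bytes of an object.
  std::memcpy(&bits, &value, sizeof(bits));
  std::ostringstream out;
  out << std::hex << std::setw(8) << std::setfill('0') << bits;
  return out.str();
}
```

Printing float_bits_hex(final_result) next to the decimal value in the float loops of main would let outputs from different platforms be compared with a plain text diff.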

CMakeLists.txt

cmake_minimum_required(VERSION 3.16)
project(testfpu)

set(CMAKE_CXX_STANDARD 20)

add_executable(testfpu main.cpp test_fpu.h test_fpu.cpp)

target_include_directories(testfpu PRIVATE ${CMAKE_CURRENT_LIST_DIR})

if(CMAKE_CXX_COMPILER_ID MATCHES "Clang|AppleClang")
  target_link_libraries(testfpu PRIVATE c++ c++abi)
  target_compile_options(testfpu PRIVATE -stdlib=libc++)
endif()

include(CheckCCompilerFlag)
if(MSVC)
  set(ADDITIONAL_COMPILE_OPTIONS
      "$<$<NOT:$<CONFIG:Debug>>:/O2;/DNDEBUG>"
      /Z7
      /nologo
      /DWIN32
      /D_WINDOWS
      /utf-8
      /MP
      /W4
      /wd4100
      /wd4125
      /wd4566
      /wd4127
      /wd4512
      /GR-
      /Gy-
      /Zc:__cplusplus
      # Flags for floating point
      /fp:precise)
else()
  set(ADDITIONAL_COMPILE_OPTIONS
      -O2
      -fno-rtti
      -g
      -ggdb
      -Wall
      -Wextra
      -Wno-implicit-fallthrough
      -Wno-unused-local-typedefs
      -fno-fast-math
      -ffp-contract=off)
  find_package(Threads)
  if(TARGET Threads::Threads)
    target_link_libraries(testfpu PRIVATE Threads::Threads)
  endif()
  if(CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
    list(APPEND ADDITIONAL_COMPILE_OPTIONS -fdiagnostics-color=auto
         -ffloat-store)
    # -mpc32 is the same as: fpu_control_t cw = (_FPU_DEFAULT & ~_FPU_EXTENDED)
    # | _FPU_RC_NEAREST | _FPU_SINGLE; _FPU_SETCW(cw);
  elseif(CMAKE_CXX_COMPILER_ID MATCHES "Clang|AppleClang")
    # list(APPEND ADDITIONAL_COMPILE_OPTIONS -ffloat-store)
  endif()
  check_c_compiler_flag(-Wno-unused-parameter
                        CFLAGS_FLAGS_NO_UNUSED_PARAMETER_AVAILABLE)
  if(CFLAGS_FLAGS_NO_UNUSED_PARAMETER_AVAILABLE)
    list(APPEND ADDITIONAL_COMPILE_OPTIONS -Wno-unused-parameter)
  endif()
  check_c_compiler_flag(-rdynamic LD_FLAGS_RDYNAMIC_AVAILABLE)
  if(LD_FLAGS_RDYNAMIC_AVAILABLE)
    message(STATUS "Check Flag: -rdynamic -- yes")
    list(APPEND ADDITIONAL_COMPILE_OPTIONS -rdynamic)
  else()
    message(STATUS "Check Flag: -rdynamic -- no")
  endif()
  if(CMAKE_SYSTEM_PROCESSOR MATCHES "x86|x64|x86_64|AMD64")
    list(APPEND ADDITIONAL_COMPILE_OPTIONS -mieee-fp)
  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES
         "armv7|armv7s|armeabi|armeabi-v7a|arm64-v8a|aarch64|arm64")
    check_c_compiler_flag(-mhard-float CFLAGS_FLAGS_MHARD_FLOAT_AVAILABLE)
    if(CFLAGS_FLAGS_MHARD_FLOAT_AVAILABLE)
      list(APPEND ADDITIONAL_COMPILE_OPTIONS -mhard-float)
    endif()
  endif()
endif()

set_source_files_properties(
  test_fpu.h test_fpu.cpp PROPERTIES COMPILE_OPTIONS
                                     "${ADDITIONAL_COMPILE_OPTIONS}")

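Final thoughts

My personal recommendation: if performance is the only concern, and floating-point numbers are used only for simple work such as formula evaluation and storage, they can now be used in game servers. If the goal is to guarantee cross-platform consistency, or if the workload involves heavy multiplication, caution is still warranted. Anyone interested is welcome to discuss and share, and concrete test code related to consistency is especially welcome.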
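
As one possible starting point for such consistency tests, the per-round results could be folded into a single fingerprint that is trivial to compare between machines. The sketch below hashes the raw bit patterns with 64-bit FNV-1a. The function name fingerprint_floats and the choice of FNV-1a are assumptions made for illustration, not something the original benchmark does, and a 32-bit IEEE 754 float is assumed.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not in the original source): 64-bit FNV-1a over the
// bit patterns of a sequence of floats.
inline std::uint64_t fingerprint_floats(const std::vector<float>& values) {
  static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
  std::uint64_t hash = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
  for (float value : values) {
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof(bits));
    // Feed the bytes least significant first, regardless of host byte order,
    // so little-endian and big-endian machines produce the same fingerprint.
    for (int shift = 0; shift < 32; shift += 8) {
      hash ^= (bits >> shift) & 0xFFu;
      hash *= 1099511628211ULL;  // FNV-1a 64-bit prime
    }
  }
  return hash;
}
```

Two runs agree bit for bit only if their fingerprints match. Because the hash works on bit patterns, it also separates values that compare equal numerically, such as 0.0f and -0.0f.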