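Preface
A long time ago, neither the performance of floating-point arithmetic nor its consistency across platforms and hardware architectures could be guaranteed. For that reason, game servers that needed strong consistency and high performance usually banned floating point and used a hand-written fixed-point type instead. Many years on, I wanted to check whether modern hardware still has the performance problem and whether consistency can now be guaranteed, so I ran some simple tests and am recording the results here.
On consistency
First, some reference material: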
- http://christian-seiler.de/projekte/fpmath/
- https://gafferongames.com/post/floating_point_determinism/
- https://randomascii.wordpress.com/2013/07/16/floating-point-determinism/
- https://docs.microsoft.com/en-us/cpp/build/reference/fp-specify-floating-point-behavior?view=msvc-170
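According to these materials, several factors affect the result of a floating-point computation.
First, floating-point types in C/C++ follow IEEE 754 on mainstream platforms (the language standards themselves do not strictly require it), but different hardware may add extensions. For example, the x87 floating-point registers on x86 (also available on x86_64) are 80 bits wide, and values are rounded to the IEEE 754 formats only when they are written to memory. With compiler optimizations enabled, reordering the code can therefore change the final result. For example, if A*C+B*C is optimized into (A+B)*C, the transformation is mathematically valid but rounding may produce a different result, and different compilers may apply different optimization strategies here. Fortunately, mainstream compilers have options to turn this optimization off, such as /fp:precise for MSVC and -fno-fast-math for Clang/GCC.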
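To make the A*C+B*C versus (A+B)*C point concrete, here is a minimal sketch. The struct and function names are hypothetical, and it assumes 32-bit IEEE 754 float with the default round-to-nearest mode. Every intermediate is stored in a volatile float so that it is rounded to single precision and the compiler cannot rewrite the expression.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper type: holds both ways of evaluating the same formula.
struct DistributiveResult {
  float expanded;  // a*c + b*c
  float factored;  // (a+b)*c
};

// Evaluates both forms step by step. The volatile stores force each
// intermediate value to be rounded to float and keep the compiler from
// folding or reassociating the arithmetic.
DistributiveResult compare_distributive(float a, float b, float c) {
  volatile float va = a, vb = b, vc = c;
  volatile float ac = va * vc;
  volatile float bc = vb * vc;
  volatile float expanded = ac + bc;
  volatile float sum = va + vb;
  volatile float factored = sum * vc;
  return {expanded, factored};
}
```

With a = 1, b = 2^-24 and c = 3, the sum a+b is a tie that rounds back to 1, so the factored form yields exactly 3, while the expanded form rounds up to 3 + 2^-22. Both answers are correctly rounded for the operations performed; they simply differ.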
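Second, some platforms provide compile options or runtime library calls that set the hardware floating-point control word, which can be used to make the behavior of the floating-point unit uniform. Examples are the _controlfp_s interface on Windows for x86 and, on Linux: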
fpu_control_t cw = (_FPU_DEFAULT & ~_FPU_EXTENDED) | _FPU_RC_NEAREST | _FPU_SINGLE;
_FPU_SETCW(cw);
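However, according to the documentation, these control words do not affect SSE instructions. Frustratingly, the math libraries used with both GCC and Clang rely on SSE instructions, and the control bits for SSE/SSE2 instructions live in a separate register, MXCSR. In addition, the ARM architecture has no precision-control field for floating point; only the rounding rule can be set.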
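As a rough illustration of MXCSR being separate from the x87 control word, the sketch below reads and sets the SSE rounding-control bits through the _MM_GET_ROUNDING_MODE and _MM_SET_ROUNDING_MODE macros from <xmmintrin.h>. The HAS_MXCSR macro and both function names are made up for this example, and on targets without SSE the helpers simply do nothing.

```cpp
#include <cassert>

// HAS_MXCSR is a hypothetical macro for this sketch: 1 when SSE intrinsics
// are available on the target, 0 otherwise.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#  include <xmmintrin.h>
#  define HAS_MXCSR 1
#else
#  define HAS_MXCSR 0
#endif

// Returns the rounding-control bits of MXCSR, or -1 when the target has no SSE.
int sse_rounding_mode() {
#if HAS_MXCSR
  return static_cast<int>(_MM_GET_ROUNDING_MODE());
#else
  return -1;
#endif
}

// Sets the SSE rounding mode to round-to-nearest; a no-op without SSE.
void set_sse_round_to_nearest() {
#if HAS_MXCSR
  _MM_SET_ROUNDING_MODE(_MM_ROUND_NEAREST);
#endif
}
```

On an SSE target, sse_rounding_mode() returns 0 (the value of _MM_ROUND_NEAREST) after set_sse_round_to_nearest() has been called. Note that this touches only MXCSR; the x87 control word still has to be handled with _FPU_SETCW or _controlfp_s.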
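ARM floating-point control word documentation:
- ARM - 32-bit
- ARM - 64-bit
One more thing needs special attention: some coroutine libraries do not save the floating-point control word when switching contexts. If the system uses coroutines, this has to be checked and confirmed separately.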
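One defensive pattern for the coroutine concern is to pin the rounding mode around the code that depends on it and restore the previous mode afterwards. The sketch below uses only the standard <cfenv> functions fegetround and fesetround. The class and function names are hypothetical, and it assumes the platform defines FE_UPWARD and FE_DOWNWARD (x86 and ARM toolchains normally do).

```cpp
#include <cassert>
#include <cfenv>

// Hypothetical RAII guard: sets a rounding mode on construction and restores
// the previously active mode on destruction.
class ScopedRoundingMode {
 public:
  explicit ScopedRoundingMode(int mode) : saved_(std::fegetround()) {
    std::fesetround(mode);
  }
  ~ScopedRoundingMode() { std::fesetround(saved_); }
  ScopedRoundingMode(const ScopedRoundingMode &) = delete;
  ScopedRoundingMode &operator=(const ScopedRoundingMode &) = delete;

 private:
  int saved_;
};

// Divides x by y under the given rounding mode. The volatile loads and the
// volatile store keep the division at run time and between the two
// fesetround() calls made by the guard.
float divide_with_rounding(float x, float y, int mode) {
  ScopedRoundingMode guard(mode);
  volatile float vx = x, vy = y;
  volatile float quotient = vx / vy;
  return quotient;
}
```

Calling divide_with_rounding(1.0f, 3.0f, FE_UPWARD) gives a slightly larger value than the same call with FE_DOWNWARD, and the caller's rounding mode is unchanged after either call. Reading fegetround() before and after a coroutine switch is a cheap way to check whether a given coroutine library preserves the setting.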
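Finally, GCC has one more compile option, -ffloat-store. By default GCC tries to keep intermediate floating-point results in registers, and because the x87 floating-point registers are 80 bits wide, this too can cause precision differences. The option forces intermediate results to be written to memory as well, so that they are rounded to the IEEE 754 formats. Equally frustrating, Clang does not support it: with the option set, Clang prints a warning that it will be ignored, and in my measurements Clang performs the same with or without the option, matching GCC without it. This also means the option cannot be used on macOS, iOS or Android (recent Android NDK releases have removed GCC).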
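Whether intermediate results can carry extra precision on a given build can be checked with the standard FLT_EVAL_METHOD macro from <cfloat>. The sketch below only reports its value; the function name is hypothetical, and it assumes a toolchain whose <cfloat> defines the macro (GCC and Clang do).

```cpp
#include <cassert>
#include <cfloat>
#include <string>

// Hypothetical helper: turns FLT_EVAL_METHOD into a readable description.
// 0 means no excess precision; 1 and 2 mean intermediates may be wider than
// the declared type, which is the situation -ffloat-store is meant to address.
std::string describe_float_eval_method() {
  switch (static_cast<int>(FLT_EVAL_METHOD)) {
    case 0:
      return "each operation is evaluated in the precision of its own type";
    case 1:
      return "float and double operations are evaluated as double";
    case 2:
      return "all operations are evaluated as long double";
    default:
      return "evaluation method is indeterminable or implementation-defined";
  }
}
```

A value of 2 is what a 32-bit x86 build using the x87 unit typically reports, whereas builds that use SSE for scalar math typically report 0.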
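After enabling the options above and handling these differences, my test cases produced identical results on every platform (x86, x86_64, armv7, arm64). That said, my test cases are fairly simple, and differences may only show up in more complex computations or under more aggressive compiler optimizations.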
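Floating-point performance
On modern hardware, floating-point performance is already quite good. When intermediate results may be kept in registers, floating-point addition and subtraction perform about the same as integer addition and subtraction. Multiplication is still slower than integer multiplication, by up to roughly an order of magnitude in these tests (note that the float multiplication loop also performs an overflow check on every step, so this is not a pure instruction-level comparison). Division is sometimes even faster than integer division.
The devices I tested on:
- Windows 11 / CPU: Intel i7-8700 3.2 GHz / MEM: 48 GB
- Linux CentOS 8 / CPU: AMD EPYC 7K62 48-Core Processor 2.6 GHz / MEM: 32 GB
- Xiaomi Mi 6 / CPU: Snapdragon 835
- Xiaomi 12 / CPU: Snapdragon 8 Gen 1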
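For readers who want to reproduce this kind of measurement, a minimal timing sketch follows. It is not the harness used for the numbers in this post; measure_us is a hypothetical helper, and it uses std::chrono::steady_clock, which is monotonic and therefore generally a safer choice for measuring intervals than a wall clock.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper: runs `work` once and returns the elapsed time in
// microseconds, measured with the monotonic steady_clock.
template <class F>
std::int64_t measure_us(F &&work) {
  auto begin = std::chrono::steady_clock::now();
  work();
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count();
}
```

A call such as measure_us([&] { run_float_add(); }), where run_float_add stands for whatever loop is being measured, returns the cost of one run. Results should be written somewhere observable (as the benchmark in this post does with its final-result vectors) so that the compiler cannot remove the loop.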
Windows MSVC x64
Compile options: /O2 /DNDEBUG /Z7 /nologo /DWIN32 /D_WINDOWS /utf-8 /MP /W4 /wd4100 /wd4125 /wd4566 /wd4127 /wd4512 /GR- /Gy- /Zc:__cplusplus /fp:precise
Default:
Control word: 8001f
0.1*0.1=0.01
Current:
Control word: 8001f
0.1*0.1=0.01
Progress: 0/1
Progress: 4833/4833
Integer add: 3235us
2439629427 2898576995 3714915387 4286696855 378431623 2252452707 2752353767 771443067 1608538503 614657667
Integer sub: 2519us
146874035 1872938867 1488311003 3933596871 1332511591 2743417179 808845979 255589207 3248385047 59316451
Integer mul: 4038us
3058914187 300037571 705100673 233718125 3947941235 3996371957 2978946479 2859514979 1770852851 1904696835
Integer div: 35407us
12726319 17123513 35291036 745375340 22809512 6164687 31464879 15696064 5218240 24524558
Float add: 3534us
1.40684e+14 1.40869e+14 1.4102e+14 1.40896e+14 1.41232e+14 1.40765e+14 1.41651e+14 1.41237e+14 1.40636e+14 1.41185e+14
Float sub: 3256us
-1.23029e+14 -1.23465e+14 -1.23569e+14 -1.23352e+14 -1.2344e+14 -1.23229e+14 -1.23999e+14 -1.23416e+14 -1.23077e+14 -1.23381e+14
Float mul: 55184us
2.03832e+09 5.93147e+35 8.98e+08 1.04652e+07 1.34553e+09 6.19712e+08 1.37178e+35 2.89617e+18 4.08849e+25 5.28501e+08
Float div: 29846us
3.1405e+10 3.07084e+18 29468.8 1.66257e+20 1.01115e+12 1.24143e+08 1.52193e+23 2.17134e+12 1.06556e+12 6.0793e+16
Float sqrt: 26628us
2.7177e+09 2.96204e+09 9.308e+15 1.9039e+08 6.62332e+17 1.95626e+09 7.31856e+10 2.9189e+09 9.3298e+08 3.39628e+13
Float sin:
0 0.0922683 0.183749 0.273663 0.361241 0.445738 0.526432 0.602634 0.673695 0.739008 0.798017 0.850217 0.895163 0.932472 0.961825 0.982973
Float cos:
1 0.995734 0.982973 0.961826 0.932472 0.895163 0.850217 0.798018 0.739009 0.673696 0.602635 0.526433 0.445739 0.361243 0.273664 0.183751
Windows MSVC x86
Compile options: /O2 /DNDEBUG /Z7 /nologo /DWIN32 /D_WINDOWS /utf-8 /MP /W4 /wd4100 /wd4125 /wd4566 /wd4127 /wd4512 /GR- /Gy- /Zc:__cplusplus /fp:precise
Default:
Control word: 9001f
0.1*0.1=0.01
Current:
Control word: a001f
0.1*0.1=0.01
Progress: 0/1
Progress: 4833/4833
Integer add: 5199us
2439629427 2898576995 3714915387 4286696855 378431623 2252452707 2752353767 771443067 1608538503 614657667
Integer sub: 4042us
146874035 1872938867 1488311003 3933596871 1332511591 2743417179 808845979 255589207 3248385047 59316451
Integer mul: 5217us
3058914187 300037571 705100673 233718125 3947941235 3996371957 2978946479 2859514979 1770852851 1904696835
Integer div: 35150us
12726319 17123513 35291036 745375340 22809512 6164687 31464879 15696064 5218240 24524558
Float add: 3680us
1.40684e+14 1.40869e+14 1.4102e+14 1.40896e+14 1.41232e+14 1.40765e+14 1.41651e+14 1.41237e+14 1.40636e+14 1.41185e+14
Float sub: 3402us
-1.23029e+14 -1.23465e+14 -1.23569e+14 -1.23352e+14 -1.2344e+14 -1.23229e+14 -1.23999e+14 -1.23416e+14 -1.23077e+14 -1.23381e+14
Float mul: 51159us
2.03832e+09 5.93147e+35 8.98e+08 1.04652e+07 1.34553e+09 6.19712e+08 1.37178e+35 2.89617e+18 4.08849e+25 5.28501e+08
Float div: 40696us
3.1405e+10 3.07084e+18 29468.8 1.66257e+20 1.01115e+12 1.24143e+08 1.52193e+23 2.17134e+12 1.06556e+12 6.0793e+16
Float sqrt: 45395us
2.7177e+09 2.96204e+09 9.308e+15 1.9039e+08 6.62332e+17 1.95626e+09 7.31856e+10 2.9189e+09 9.3298e+08 3.39628e+13
Float sin:
0 0.0922683 0.183749 0.273663 0.361241 0.445738 0.526432 0.602634 0.673695 0.739008 0.798017 0.850217 0.895163 0.932472 0.961825 0.982973
Float cos:
1 0.995734 0.982973 0.961826 0.932472 0.895163 0.850217 0.798018 0.739009 0.673696 0.602635 0.526433 0.445739 0.361243 0.273664 0.183751
Linux GCC x86_64
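Compile options: -O2 -fno-rtti -fno-fast-math -ffp-contract=off -pthread -mieee-fp -ffloat-store
The compile option -mpc32 has the same effect as writing the following in code:
fpu_control_t cw = (_FPU_DEFAULT & ~_FPU_EXTENDED) | _FPU_RC_NEAREST | _FPU_SINGLE;
_FPU_SETCW(cw);
The math library depends on SSE instructions; building with SSE disabled fails: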
/data/home/owent/workspace/test/test_fpu/test_fpu.cpp: In function ‘void {anonymous}::start_benchmark_worker(size_t, size_t, benchmark_thread_data&, std::atomic<long unsigned int>&, std::atomic<long unsigned int>&)’:
/data/home/owent/workspace/test/test_fpu/test_fpu.cpp:483:60: error: SSE register return with SSE disabled
483 | data.result.float_sin_final_result.push_back(std::sin(3.14159f / 34 * i));
| ~~~~~~~~^~~~~~~~~~~~~~~~~~~
gmake[2]: *** [CMakeFiles/clangconsole.dir/test_fpu.cpp.o] Error 1
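-ffloat-store keeps intermediate results out of registers. The x87 floating-point registers are 10 bytes wide, which can otherwise affect the final result. See: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options
With -ffloat-store: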
Default:
Control word: 37f
0.1*0.1=0.01
Current:
Control word: 7f
0.1*0.1=0.01
Progress: 0/1
Progress: 4833/4833
Integer add: 3617us
2439629427 2898576995 3714915387 4286696855 378431623 2252452707 2752353767 771443067 1608538503 614657667
Integer sub: 2544us
146874035 1872938867 1488311003 3933596871 1332511591 2743417179 808845979 255589207 3248385047 59316451
Integer mul: 2238us
3058914187 300037571 705100673 233718125 3947941235 3996371957 2978946479 2859514979 1770852851 1904696835
Integer div: 48207us
12726319 17123513 35291036 745375340 22809512 6164687 31464879 15696064 5218240 24524558
Float add: 15854us
1.40684e+14 1.40869e+14 1.4102e+14 1.40896e+14 1.41232e+14 1.40765e+14 1.41651e+14 1.41237e+14 1.40636e+14 1.41185e+14
Float sub: 15853us
-1.23029e+14 -1.23465e+14 -1.23569e+14 -1.23352e+14 -1.2344e+14 -1.23229e+14 -1.23999e+14 -1.23416e+14 -1.23077e+14 -1.23381e+14
Float mul: 22212us
2.03832e+09 5.93147e+35 8.98e+08 1.04652e+07 1.34553e+09 6.19712e+08 1.37178e+35 2.89617e+18 4.08849e+25 5.28501e+08
Float div: 54831us
3.1405e+10 3.07084e+18 29468.8 1.66257e+20 1.01115e+12 1.24143e+08 1.52193e+23 2.17134e+12 1.06556e+12 6.0793e+16
Float sqrt: 61222us
2.7177e+09 2.96204e+09 9.308e+15 1.9039e+08 6.62332e+17 1.95626e+09 7.31856e+10 2.9189e+09 9.3298e+08 3.39628e+13
Float sin:
0 0.0922683 0.183749 0.273663 0.361241 0.445738 0.526432 0.602634 0.673695 0.739008 0.798017 0.850217 0.895163 0.932472 0.961825 0.982973
Float cos:
1 0.995734 0.982973 0.961826 0.932472 0.895163 0.850217 0.798018 0.739009 0.673696 0.602635 0.526433 0.445739 0.361243 0.273664 0.183751
Without -ffloat-store:
Default:
Control word: 37f
0.1*0.1=0.01
Current:
Control word: 7f
0.1*0.1=0.01
Progress: 0/1
Progress: 4833/4833
Integer add: 3300us
2439629427 2898576995 3714915387 4286696855 378431623 2252452707 2752353767 771443067 1608538503 614657667
Integer sub: 3056us
146874035 1872938867 1488311003 3933596871 1332511591 2743417179 808845979 255589207 3248385047 59316451
Integer mul: 2486us
3058914187 300037571 705100673 233718125 3947941235 3996371957 2978946479 2859514979 1770852851 1904696835
Integer div: 48256us
12726319 17123513 35291036 745375340 22809512 6164687 31464879 15696064 5218240 24524558
Float add: 4813us
1.40684e+14 1.40869e+14 1.4102e+14 1.40896e+14 1.41232e+14 1.40765e+14 1.41651e+14 1.41237e+14 1.40636e+14 1.41185e+14
Float sub: 4506us
-1.23029e+14 -1.23465e+14 -1.23569e+14 -1.23352e+14 -1.2344e+14 -1.23229e+14 -1.23999e+14 -1.23416e+14 -1.23077e+14 -1.23381e+14
Float mul: 11396us
2.03832e+09 5.93147e+35 8.98e+08 1.04652e+07 1.34553e+09 6.19712e+08 1.37178e+35 2.89617e+18 4.08849e+25 5.28501e+08
Float div: 43665us
3.1405e+10 3.07084e+18 29468.8 1.66257e+20 1.01115e+12 1.24143e+08 1.52193e+23 2.17134e+12 1.06556e+12 6.0793e+16
Float sqrt: 34217us
2.7177e+09 2.96204e+09 9.308e+15 1.9039e+08 6.62332e+17 1.95626e+09 7.31856e+10 2.9189e+09 9.3298e+08 3.39628e+13
Float sin:
0 0.0922683 0.183749 0.273663 0.361241 0.445738 0.526432 0.602634 0.673695 0.739008 0.798017 0.850217 0.895163 0.932472 0.961825 0.982973
Float cos:
1 0.995734 0.982973 0.961826 0.932472 0.895163 0.850217 0.798018 0.739009 0.673696 0.602635 0.526433 0.445739 0.361243 0.273664 0.183751
Linux Clang x86_64
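Compile options: -O2 -fno-rtti -fno-fast-math -ffp-contract=off -pthread -mieee-fp
- Clang currently does not support -ffloat-store: clang-13: warning: optimization flag '-ffloat-store' is not supported [-Wignored-optimization-argument]
- Clang currently does not support the -mpc32 compile option; the precision can only be set in code with _FPU_SETCW(...)
The math library depends on SSE instructions; building with SSE disabled fails: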
In file included from /data/home/owent/workspace/test/test_fpu/test_fpu.cpp:18:
In file included from /opt/llvm-13.0/bin/../include/c++/v1/cmath:308:
/opt/llvm-13.0/bin/../include/c++/v1/math.h:946:107: error: SSE register return with SSE disabled
inline _LIBCPP_INLINE_VISIBILITY float frexp(float __lcpp_x, int* __lcpp_e) _NOEXCEPT {return ::frexpf(__lcpp_x, __lcpp_e);}
^
/opt/llvm-13.0/bin/../include/c++/v1/math.h:946:107: error: SSE register return with SSE disabled
Default:
Control word: 37f
0.1*0.1=0.01
Current:
Control word: 7f
0.1*0.1=0.01
Progress: 0/1
Progress: 4833/4833
Integer add: 3853us
2439629427 2898576995 3714915387 4286696855 378431623 2252452707 2752353767 771443067 1608538503 614657667
Integer sub: 2705us
146874035 1872938867 1488311003 3933596871 1332511591 2743417179 808845979 255589207 3248385047 59316451
Integer mul: 2339us
3058914187 300037571 705100673 233718125 3947941235 3996371957 2978946479 2859514979 1770852851 1904696835
Integer div: 48371us
12726319 17123513 35291036 745375340 22809512 6164687 31464879 15696064 5218240 24524558
Float add: 4003us
1.40684e+14 1.40869e+14 1.4102e+14 1.40896e+14 1.41232e+14 1.40765e+14 1.41651e+14 1.41237e+14 1.40636e+14 1.41185e+14
Float sub: 3735us
-1.23029e+14 -1.23465e+14 -1.23569e+14 -1.23352e+14 -1.2344e+14 -1.23229e+14 -1.23999e+14 -1.23416e+14 -1.23077e+14 -1.23381e+14
Float mul: 10257us
2.03832e+09 5.93147e+35 8.98e+08 1.04652e+07 1.34553e+09 6.19712e+08 1.37178e+35 2.89617e+18 4.08849e+25 5.28501e+08
Float div: 42894us
3.1405e+10 3.07084e+18 29468.8 1.66257e+20 1.01115e+12 1.24143e+08 1.52193e+23 2.17134e+12 1.06556e+12 6.0793e+16
Float sqrt: 34741us
2.7177e+09 2.96204e+09 9.308e+15 1.9039e+08 6.62332e+17 1.95626e+09 7.31856e+10 2.9189e+09 9.3298e+08 3.39628e+13
Float sin:
0 0.0922683 0.183749 0.273663 0.361241 0.445738 0.526432 0.602634 0.673695 0.739008 0.798017 0.850217 0.895163 0.932472 0.961825 0.982973
Float cos:
1 0.995734 0.982973 0.961826 0.932472 0.895163 0.850217 0.798018 0.739009 0.673696 0.602635 0.526433 0.445739 0.361243 0.273664 0.183751
Android arm64
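Compile options: -O2 -fno-rtti -fno-fast-math -ffp-contract=off -pthread
- Clang currently does not support -ffloat-store
- ARM cannot set floating-point precision; the rounding rule is kept the same as in the x86 runs above
Snapdragon 835 (MI 6)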
Default:
Rounding word: 0
0.1*0.1=0.01
Current:
Rounding word: 0
0.1*0.1=0.01
Progress: 0/1
Progress: 4833/4833
Integer add: 9570us
2439629427 2898576995 3714915387 4286696855 378431623 2252452707 2752353767 771443067 1608538503 614657667
Integer sub: 12413us
146874035 1872938867 1488311003 3933596871 1332511591 2743417179 808845979 255589207 3248385047 59316451
Integer mul: 7995us
3058914187 300037571 705100673 233718125 3947941235 3996371957 2978946479 2859514979 1770852851 1904696835
Integer div: 62027us
12726319 17123513 35291036 745375340 22809512 6164687 31464879 15696064 5218240 24524558
Float add: 9716us
1.40684e+14 1.40869e+14 1.4102e+14 1.40896e+14 1.41232e+14 1.40765e+14 1.41651e+14 1.41237e+14 1.40636e+14 1.41185e+14
Float sub: 9313us
-1.23029e+14 -1.23465e+14 -1.23569e+14 -1.23352e+14 -1.2344e+14 -1.23229e+14 -1.23999e+14 -1.23416e+14 -1.23077e+14 -1.23381e+14
Float mul: 51298us
2.03832e+09 5.93147e+35 8.98e+08 1.04652e+07 1.34553e+09 6.19712e+08 1.37178e+35 2.89617e+18 4.08849e+25 5.28501e+08
Float div: 98013us
3.1405e+10 3.07084e+18 29468.8 1.66257e+20 1.01115e+12 1.24143e+08 1.52193e+23 2.17134e+12 1.06556e+12 6.0793e+16
Float sqrt: 70492us
2.7177e+09 2.96204e+09 9.308e+15 1.9039e+08 6.62332e+17 1.95626e+09 7.31856e+10 2.9189e+09 9.3298e+08 3.39628e+13
Float sin:
0 0.0922683 0.183749 0.273663 0.361241 0.445738 0.526432 0.602634 0.673695 0.739008 0.798017 0.850217 0.895163 0.932472 0.961825 0.982973
Float cos:
1 0.995734 0.982973 0.961826 0.932472 0.895163 0.850217 0.798018 0.739009 0.673696 0.602635 0.526433 0.445739 0.361243 0.273664 0.183751
Snapdragon 8 Gen 1 (MI 12)
Default:
Rounding word: 0
0.1*0.1=0.01
Current:
Rounding word: 0
0.1*0.1=0.01
Progress: 0/1
Progress: 4833/4833
Integer add: 9783us
2439629427 2898576995 3714915387 4286696855 378431623 2252452707 2752353767 771443067 1608538503 614657667
Integer sub: 8433us
146874035 1872938867 1488311003 3933596871 1332511591 2743417179 808845979 255589207 3248385047 59316451
Integer mul: 6595us
3058914187 300037571 705100673 233718125 3947941235 3996371957 2978946479 2859514979 1770852851 1904696835
Integer div: 53200us
12726319 17123513 35291036 745375340 22809512 6164687 31464879 15696064 5218240 24524558
Float add: 9711us
1.40684e+14 1.40869e+14 1.4102e+14 1.40896e+14 1.41232e+14 1.40765e+14 1.41651e+14 1.41237e+14 1.40636e+14 1.41185e+14
Float sub: 18869us
-1.23029e+14 -1.23465e+14 -1.23569e+14 -1.23352e+14 -1.2344e+14 -1.23229e+14 -1.23999e+14 -1.23416e+14 -1.23077e+14 -1.23381e+14
Float mul: 72346us
2.03832e+09 5.93147e+35 8.98e+08 1.04652e+07 1.34553e+09 6.19712e+08 1.37178e+35 2.89617e+18 4.08849e+25 5.28501e+08
Float div: 73249us
3.1405e+10 3.07084e+18 29468.8 1.66257e+20 1.01115e+12 1.24143e+08 1.52193e+23 2.17134e+12 1.06556e+12 6.0793e+16
Float sqrt: 40105us
2.7177e+09 2.96204e+09 9.308e+15 1.9039e+08 6.62332e+17 1.95626e+09 7.31856e+10 2.9189e+09 9.3298e+08 3.39628e+13
Float sin:
0 0.0922683 0.183749 0.273663 0.361241 0.445738 0.526432 0.602634 0.673695 0.739008 0.798017 0.850217 0.895163 0.932472 0.961825 0.982973
Float cos:
1 0.995734 0.982973 0.961826 0.932472 0.895163 0.850217 0.798018 0.739009 0.673696 0.602635 0.526433 0.445739 0.361243 0.273664 0.183751
Test code
Here is the test code: mainly three source files and one CMake project file.
test_fpu.h
// Copyright 2022 Tencent
#pragma once
#include <stdint.h>
#include <chrono>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>  // std::pair, used by get_benchmark_progress
#include <vector>
bool init_fpu();
std::string dump_current_controlfp();
struct benchmark_handle;
struct benchmark_result {
std::chrono::system_clock::duration integer_add_cost;
std::chrono::system_clock::duration integer_sub_cost;
std::chrono::system_clock::duration integer_mul_cost;
std::chrono::system_clock::duration integer_div_cost;
std::vector<uint32_t> integer_add_final_result;
std::vector<uint32_t> integer_sub_final_result;
std::vector<uint32_t> integer_mul_final_result;
std::vector<uint32_t> integer_div_final_result;
std::chrono::system_clock::duration float_add_cost;
std::chrono::system_clock::duration float_sub_cost;
std::chrono::system_clock::duration float_mul_cost;
std::chrono::system_clock::duration float_div_cost;
std::chrono::system_clock::duration float_sqrt_cost;
std::vector<float> float_add_final_result;
std::vector<float> float_sub_final_result;
std::vector<float> float_mul_final_result;
std::vector<float> float_div_final_result;
std::vector<float> float_sqrt_final_result;
std::vector<float> float_sin_final_result;
std::vector<float> float_cos_final_result;
};
std::shared_ptr<benchmark_handle> start_benchmark(size_t thread_count = 8, size_t round = 10);
bool is_benchmark_running(const std::shared_ptr<benchmark_handle> &handle);
std::pair<size_t, size_t> get_benchmark_progress(const std::shared_ptr<benchmark_handle> &handle);
size_t get_benchmark_running_thread(const std::shared_ptr<benchmark_handle> &handle);
size_t get_benchmark_thread_count(const std::shared_ptr<benchmark_handle> &handle);
void pick_benchmark_result(const std::shared_ptr<benchmark_handle> &handle, std::vector<benchmark_result> &result);
test_fpu.cpp
// Copyright 2022 Tencent
#include "test_fpu.h"
#include <assert.h>
#if defined(_WIN32)
# include <float.h>
#else
# if defined(__aarch64__) || defined(__arm__)
# include <fenv.h>
# else // defined(__i386__) || defined(__x86_64__)
# include <fpu_control.h>
# endif
#endif
#include <atomic>
#include <chrono>
#include <cmath>
#include <memory>
#include <random>
#include <sstream>
#include <string>
#include <thread>
#include <type_traits>
bool init_fpu() {
#if defined(_MSC_VER)
unsigned int control_word;
int err;
err = _controlfp_s(&control_word, 0, 0);
if (err) {
return false;
}
# if !defined(_M_X64)
err = _controlfp_s(&control_word, PC_24, MCW_PC);
if (err) {
return false;
}
# endif
err = _controlfp_s(&control_word, RC_NEAR, MCW_RC);
if (err) {
return false;
}
return true;
#else
# if defined(__aarch64__) || defined(__arm__)
fesetround(FE_TONEAREST);
# else
fpu_control_t cw = (_FPU_DEFAULT & ~_FPU_EXTENDED) | _FPU_RC_NEAREST | _FPU_SINGLE;
_FPU_SETCW(cw);
# endif
return true;
#endif
}
std::string dump_current_controlfp() {
std::stringstream ss;
float a = 0.1f;
#if defined(_MSC_VER)
unsigned int control_word;
int err = _controlfp_s(&control_word, 0, 0);
if (err) {
ss << "Got error code: " << err;
return ss.str();
}
ss << "Control word: " << std::hex << control_word << std::endl;
float b = a * a;
ss << a << "*" << a << "=" << b << std::endl;
#else
# if defined(__aarch64__) || defined(__arm__)
ss << "Rounding word: " << std::hex << fegetround() << std::endl;
# else
fpu_control_t cw;
_FPU_GETCW(cw);
ss << "Control word: " << std::hex << cw << std::endl;
# endif
float b = a * a;
ss << a << "*" << a << "=" << b << std::endl;
#endif
return ss.str();
}
struct benchmark_thread_data {
std::unique_ptr<std::thread> thread;
benchmark_result result;
};
struct benchmark_handle {
size_t max_round;
std::atomic<size_t> running_thread;
std::atomic<size_t> progress_total;
std::atomic<size_t> progress_done;
std::vector<benchmark_thread_data> datas;
std::unique_ptr<std::thread> controller_thread;
~benchmark_handle() {
if (controller_thread && controller_thread->joinable()) {
controller_thread->join();
}
}
};
namespace {
static constexpr size_t kMaxParameterCount = 1 << 20;
static constexpr size_t kMaxParameterArraySize = kMaxParameterCount * 2;
static uint32_t g_integer_parameters_odd[kMaxParameterArraySize] = {0};
static uint32_t g_integer_parameters_even[kMaxParameterArraySize] = {0};
static float g_float_parameters_odd[kMaxParameterArraySize] = {0};
static float g_float_parameters_even[kMaxParameterArraySize] = {0};
static void initialize_parameters(std::atomic<size_t> &progress_total, std::atomic<size_t> &progress_done) {
if (g_integer_parameters_even[std::extent<decltype(g_integer_parameters_even)>::value - 1] != 0) {
return;
}
progress_total += kMaxParameterArraySize >> 9;
std::mt19937 rnd{9999991};
size_t index = 0;
while (index < kMaxParameterArraySize * 2) {
uint32_t r = rnd();
if (r < 9999991) {
continue;
}
r = (r << 1) & 0x7ffffffe;
if (index & 0x1) {
g_integer_parameters_odd[index >> 1] = r | 0x1;
g_float_parameters_odd[index >> 1] = static_cast<float>(r | 0x1);
} else {
g_integer_parameters_even[index >> 1] = r;
g_float_parameters_even[index >> 1] = static_cast<float>(r);
}
++index;
if (0 == (index & ((1 << 10) - 1))) {
++progress_done;
}
}
}
template <class TDATA>
static inline void benchmark_add(TDATA odd[], TDATA even[], TDATA &final_result, size_t start_parameter_idx) {
size_t s1 = start_parameter_idx;
size_t s2 = start_parameter_idx;
final_result += odd[s1 + 1] + odd[s1 + 2] + odd[s1 + 3] + odd[s1 + 4] + odd[s1 + 5] + odd[s1 + 6] + odd[s1 + 7] +
odd[s1 + 8] + odd[s1 + 9] + odd[s1 + 10] + odd[s1 + 11] + odd[s1 + 12] + odd[s1 + 13] + odd[s1 + 14] +
odd[s1 + 15] + odd[s1];
final_result += even[s2 + 1] + even[s2 + 2] + even[s2 + 3] + even[s2 + 4] + even[s2 + 5] + even[s2 + 6] +
even[s2 + 7] + even[s2 + 8] + even[s2 + 9] + even[s2 + 10] + even[s2 + 11] + even[s2 + 12] +
even[s2 + 13] + even[s2 + 14] + even[s2 + 15] + even[s2];
}
template <class TDATA>
static inline void benchmark_sub(TDATA odd[], TDATA even[], TDATA &final_result, size_t start_parameter_idx) {
size_t s1 = start_parameter_idx;
size_t s2 = start_parameter_idx;
final_result += odd[s1 + 1] - odd[s1 + 2] - odd[s1 + 3] - odd[s1 + 4] - odd[s1 + 5] - odd[s1 + 6] - odd[s1 + 7] -
odd[s1 + 8] - odd[s1 + 9] - odd[s1 + 10] - odd[s1 + 11] - odd[s1 + 12] - odd[s1 + 13] - odd[s1 + 14] -
odd[s1 + 15] - odd[s1];
final_result += even[s2 + 1] - even[s2 + 2] - even[s2 + 3] - even[s2 + 4] - even[s2 + 5] - even[s2 + 6] -
even[s2 + 7] - even[s2 + 8] - even[s2 + 9] - even[s2 + 10] - even[s2 + 11] - even[s2 + 12] -
even[s2 + 13] - even[s2 + 14] - even[s2 + 15] - even[s2];
}
template <class>
struct benchmark_mul_helper;
template <>
struct benchmark_mul_helper<uint32_t> {
static inline void do_operator(uint32_t odd[], uint32_t &final_result, size_t start_parameter_idx) {
final_result *= odd[start_parameter_idx];
final_result *= odd[start_parameter_idx++];
}
};
template <>
struct benchmark_mul_helper<float> {
static inline void do_operator(float odd[], float &final_result, size_t start_parameter_idx) {
if (std::isinf(final_result * odd[start_parameter_idx])) {
int exp;
final_result = std::frexp(final_result, &exp);
// memset(&final_result, 0, 1);
// *(reinterpret_cast<uint8_t *>(&final_result) + sizeof(float) - 1) = 0;
}
final_result *= odd[start_parameter_idx++];
}
};
template <class TDATA>
static inline void benchmark_mul(TDATA odd[], TDATA even[], TDATA &final_result, size_t start_parameter_idx) {
benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_mul_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
}
template <class>
struct benchmark_div_helper;
template <>
struct benchmark_div_helper<uint32_t> {
static inline void do_operator(uint32_t odd[], uint32_t even[], uint32_t &final_result, size_t start_parameter_idx) {
final_result *= even[start_parameter_idx];
uint32_t devided = (odd[start_parameter_idx] & 0xff);
if (final_result > devided) {
final_result /= devided;
} else {
final_result %= devided;
}
}
};
template <>
struct benchmark_div_helper<float> {
static inline void do_operator(float odd[], float even[], float &final_result, size_t start_parameter_idx) {
float r = final_result * even[start_parameter_idx];
if (!std::isinf(r)) {
final_result = r;
}
if (final_result > odd[start_parameter_idx]) {
final_result /= odd[start_parameter_idx];
} else {
int exp;
float devided = std::frexp(odd[start_parameter_idx], &exp);
final_result /= devided;
}
}
};
template <class TDATA>
static inline void benchmark_div(TDATA odd[], TDATA even[], TDATA &final_result, size_t start_parameter_idx) {
benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
benchmark_div_helper<TDATA>::do_operator(odd, even, final_result, start_parameter_idx++);
}
template <class>
struct benchmark_sqrt_helper;
template <>
struct benchmark_sqrt_helper<float> {
static inline void do_operator(float odd[], float &final_result, size_t start_parameter_idx) {
float v = odd[start_parameter_idx];
if (start_parameter_idx & 0xc) {
final_result = std::sqrt(final_result * v + v * v);
} else if (start_parameter_idx & 0x3) {
final_result = std::sqrt(final_result * v * v);
} else {
final_result = std::sqrt(final_result * final_result * v);
}
}
};
template <class TDATA>
static inline void benchmark_sqrt(TDATA odd[], TDATA even[], TDATA &final_result, size_t start_parameter_idx) {
benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
benchmark_sqrt_helper<TDATA>::do_operator(odd, final_result, start_parameter_idx++);
}
static void start_benchmark_worker(size_t idx, size_t max_round, benchmark_thread_data &data,
std::atomic<size_t> &progress_total, std::atomic<size_t> &progress_done) {
constexpr const size_t step = 1 << 4;
constexpr const size_t iterator_count = kMaxParameterCount >> 4;
progress_total += max_round * 9;
progress_total += 2; // sin + cos
// integer add
{
data.result.integer_add_final_result.resize(max_round);
auto begin = std::chrono::system_clock::now();
for (size_t round = 0; round < max_round; ++round) {
size_t start_index = kMaxParameterCount / max_round * round;
size_t iterator_end = start_index + iterator_count;
uint32_t result = g_integer_parameters_odd[start_index];
for (size_t i = start_index; i < iterator_end; i += step) {
benchmark_add(g_integer_parameters_odd, g_integer_parameters_even, result, i);
}
++progress_done;
data.result.integer_add_final_result[round] = result;
}
auto end = std::chrono::system_clock::now();
data.result.integer_add_cost = end - begin;
}
// integer sub
{
data.result.integer_sub_final_result.resize(max_round);
auto begin = std::chrono::system_clock::now();
for (size_t round = 0; round < max_round; ++round) {
size_t start_index = kMaxParameterCount / max_round * round;
size_t iterator_end = start_index + iterator_count;
uint32_t result = g_integer_parameters_odd[start_index];
for (size_t i = start_index; i < iterator_end; i += step) {
benchmark_sub(g_integer_parameters_odd, g_integer_parameters_even, result, i);
}
++progress_done;
data.result.integer_sub_final_result[round] = result;
}
auto end = std::chrono::system_clock::now();
data.result.integer_sub_cost = end - begin;
}
// integer mul
{
data.result.integer_mul_final_result.resize(max_round);
auto begin = std::chrono::system_clock::now();
for (size_t round = 0; round < max_round; ++round) {
size_t start_index = kMaxParameterCount / max_round * round;
size_t iterator_end = start_index + iterator_count;
uint32_t result = g_integer_parameters_odd[start_index];
for (size_t i = start_index; i < iterator_end; i += step) {
benchmark_mul(g_integer_parameters_odd, g_integer_parameters_even, result, i);
}
++progress_done;
data.result.integer_mul_final_result[round] = result;
}
auto end = std::chrono::system_clock::now();
data.result.integer_mul_cost = end - begin;
}
// integer div
{
data.result.integer_div_final_result.resize(max_round);
auto begin = std::chrono::system_clock::now();
for (size_t round = 0; round < max_round; ++round) {
size_t start_index = kMaxParameterCount / max_round * round;
size_t iterator_end = start_index + iterator_count;
uint32_t result = g_integer_parameters_odd[start_index];
for (size_t i = start_index; i < iterator_end; i += step) {
benchmark_div(g_integer_parameters_odd, g_integer_parameters_even, result, i);
}
++progress_done;
data.result.integer_div_final_result[round] = result;
}
auto end = std::chrono::system_clock::now();
data.result.integer_div_cost = end - begin;
}
// float add
{
data.result.float_add_final_result.resize(max_round);
auto begin = std::chrono::system_clock::now();
for (size_t round = 0; round < max_round; ++round) {
size_t start_index = kMaxParameterCount / max_round * round;
size_t iterator_end = start_index + iterator_count;
float result = g_float_parameters_odd[start_index];
for (size_t i = start_index; i < iterator_end; i += step) {
benchmark_add(g_float_parameters_odd, g_float_parameters_even, result, i);
}
++progress_done;
data.result.float_add_final_result[round] = result;
}
auto end = std::chrono::system_clock::now();
data.result.float_add_cost = end - begin;
}
// float sub
{
data.result.float_sub_final_result.resize(max_round);
auto begin = std::chrono::system_clock::now();
for (size_t round = 0; round < max_round; ++round) {
size_t start_index = kMaxParameterCount / max_round * round;
size_t iterator_end = start_index + iterator_count;
float result = g_float_parameters_odd[start_index];
for (size_t i = start_index; i < iterator_end; i += step) {
benchmark_sub(g_float_parameters_odd, g_float_parameters_even, result, i);
}
++progress_done;
data.result.float_sub_final_result[round] = result;
}
auto end = std::chrono::system_clock::now();
data.result.float_sub_cost = end - begin;
}
// float mul
{
data.result.float_mul_final_result.resize(max_round);
auto begin = std::chrono::system_clock::now();
for (size_t round = 0; round < max_round; ++round) {
size_t start_index = kMaxParameterCount / max_round * round;
size_t iterator_end = start_index + iterator_count;
float result = g_float_parameters_odd[start_index];
for (size_t i = start_index; i < iterator_end; i += step) {
benchmark_mul(g_float_parameters_odd, g_float_parameters_even, result, i);
}
++progress_done;
data.result.float_mul_final_result[round] = result;
}
auto end = std::chrono::system_clock::now();
data.result.float_mul_cost = end - begin;
}
// float div
{
data.result.float_div_final_result.resize(max_round);
auto begin = std::chrono::system_clock::now();
for (size_t round = 0; round < max_round; ++round) {
size_t start_index = kMaxParameterCount / max_round * round;
size_t iterator_end = start_index + iterator_count;
float result = g_float_parameters_odd[start_index];
for (size_t i = start_index; i < iterator_end; i += step) {
benchmark_div(g_float_parameters_odd, g_float_parameters_even, result, i);
}
++progress_done;
data.result.float_div_final_result[round] = result;
}
auto end = std::chrono::system_clock::now();
data.result.float_div_cost = end - begin;
}
// float sqrt
{
data.result.float_sqrt_final_result.resize(max_round);
auto begin = std::chrono::system_clock::now();
for (size_t round = 0; round < max_round; ++round) {
size_t start_index = kMaxParameterCount / max_round * round;
size_t iterator_end = start_index + iterator_count;
float result = g_float_parameters_odd[start_index];
for (size_t i = start_index; i < iterator_end; i += step) {
benchmark_sqrt(g_float_parameters_odd, g_float_parameters_even, result, i);
}
++progress_done;
data.result.float_sqrt_final_result[round] = result;
}
auto end = std::chrono::system_clock::now();
data.result.float_sqrt_cost = end - begin;
}
// float sin (results are recorded for comparison only; this block is not timed)
{
for (int i = 0; i < 16; ++i) {
data.result.float_sin_final_result.push_back(std::sin(3.14159f / 34 * i));
}
++progress_done;
}
// float cos (results are recorded for comparison only; this block is not timed)
{
for (int i = 0; i < 16; ++i) {
data.result.float_cos_final_result.push_back(std::cos(3.14159f / 34 * i));
}
++progress_done;
}
}
static void start_benchmark_controller(std::shared_ptr<benchmark_handle> handle) {
initialize_parameters(handle->progress_total, handle->progress_done);
size_t idx = 0;
for (auto &data : handle->datas) {
data.result.float_add_cost = std::chrono::system_clock::duration::zero();
data.result.float_sub_cost = std::chrono::system_clock::duration::zero();
data.result.float_mul_cost = std::chrono::system_clock::duration::zero();
data.result.float_div_cost = std::chrono::system_clock::duration::zero();
data.result.float_sqrt_cost = std::chrono::system_clock::duration::zero();
data.result.integer_add_cost = std::chrono::system_clock::duration::zero();
data.result.integer_sub_cost = std::chrono::system_clock::duration::zero();
data.result.integer_mul_cost = std::chrono::system_clock::duration::zero();
data.result.integer_div_cost = std::chrono::system_clock::duration::zero();
data.thread = std::unique_ptr<std::thread>(new std::thread([idx, &data, &handle]() {
++handle->running_thread;
start_benchmark_worker(idx, handle->max_round, data, handle->progress_total, handle->progress_done);
--handle->running_thread;
}));
++idx;
}
for (auto &data : handle->datas) {
if (data.thread && data.thread->joinable()) {
data.thread->join();
}
}
}
} // namespace
std::shared_ptr<benchmark_handle> start_benchmark(size_t thread_count, size_t round) {
if (thread_count > 32) {
thread_count = 32;
}
std::shared_ptr<benchmark_handle> ret = std::make_shared<benchmark_handle>();
// std::make_shared throws std::bad_alloc on failure and never returns null, so no null check is needed here.
ret->max_round = round;
ret->running_thread.store(0);
ret->progress_total.store(1);
ret->progress_done.store(0);
ret->datas.resize(thread_count);
ret->controller_thread = std::unique_ptr<std::thread>(new std::thread([ret]() {
start_benchmark_controller(ret);
++ret->progress_done;
}));
return ret;
}
bool is_benchmark_running(const std::shared_ptr<benchmark_handle> &handle) {
if (!handle) {
return false;
}
if (!handle->controller_thread) {
return false;
}
return handle->progress_done.load() < handle->progress_total.load();
}
std::pair<size_t, size_t> get_benchmark_progress(const std::shared_ptr<benchmark_handle> &handle) {
if (!handle) {
return std::pair<size_t, size_t>{0, 0};
}
return std::pair<size_t, size_t>{handle->progress_done.load(), handle->progress_total.load()};
}
size_t get_benchmark_running_thread(const std::shared_ptr<benchmark_handle> &handle) {
if (!handle) {
return 0;
}
return handle->running_thread.load();
}
size_t get_benchmark_thread_count(const std::shared_ptr<benchmark_handle> &handle) {
if (!handle) {
return 0;
}
return handle->datas.size();
}
void pick_benchmark_result(const std::shared_ptr<benchmark_handle> &handle, std::vector<benchmark_result> &result) {
if (!handle) {
return;
}
result.reserve(handle->datas.size());
for (auto &data : handle->datas) {
result.push_back(data.result);
}
}
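The per-operation blocks in the worker above all follow the same shape: pick a start index, fold one operation over a slice of the parameter tables, store the final value, and record the elapsed time. As a rough sketch only, that shape could be factored into one template. The names `run_rounds` and `op`, the use of `std::vector` for the parameter tables, the `op(result, a, b)` signature, and the switch to `std::chrono::steady_clock` are all assumptions made for illustration; none of them are taken from the original test_fpu.cpp.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the original sources).
// Runs `rounds` rounds; each round starts at its own offset in `lhs`, folds
// `op` over up to `per_round` indices with stride `step`, and stores the
// final accumulator in `finals[round]`. Returns the elapsed time.
template <class T, class Op>
std::chrono::steady_clock::duration run_rounds(const std::vector<T>& lhs,
                                               const std::vector<T>& rhs,
                                               std::size_t rounds,
                                               std::size_t per_round,
                                               std::size_t step,
                                               std::vector<T>& finals,
                                               Op op) {
  finals.assign(rounds, T{});
  if (rounds == 0 || step == 0 || lhs.empty()) {
    return std::chrono::steady_clock::duration::zero();
  }
  // Same partitioning idea as `kMaxParameterCount / max_round * round`.
  const std::size_t stride = lhs.size() / rounds;
  const auto begin = std::chrono::steady_clock::now();
  for (std::size_t round = 0; round < rounds; ++round) {
    const std::size_t start = stride * round;
    const std::size_t end = start + per_round;
    T result = lhs[start];
    for (std::size_t i = start; i < end && i < lhs.size() && i < rhs.size(); i += step) {
      op(result, lhs[i], rhs[i]);  // assumed signature: op(accumulator, a, b)
    }
    finals[round] = result;
  }
  return std::chrono::steady_clock::now() - begin;
}
```

`steady_clock` is used in this sketch because it is monotonic, so a wall-clock adjustment during a run cannot skew the measured interval; the original code measures with `system_clock`.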
main.cpp
// Copyright 2022 Tencent
#include <chrono>
#include <cstdlib>
#include <cstring>
#include <functional>
#include <iomanip>
#include <iostream>
#include <memory>
#include <thread>
#include <type_traits>
#include <vector>
#include "test_fpu.h"
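int main(int argc, char* argv[]) {
// Must run before any I/O on the standard streams; calling it after output has occurred is implementation-defined.
std::ios_base::sync_with_stdio(false);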
std::cout << "Default:" << std::endl;
std::cout << dump_current_controlfp() << std::endl;
init_fpu();
std::cout << "Current:" << std::endl;
std::cout << dump_current_controlfp() << std::endl;
size_t thread_count = 8;
size_t round = 10;
if (argc > 1) {
thread_count = strtoul(argv[1], nullptr, 10);
}
if (argc > 2) {
round = strtoul(argv[2], nullptr, 10);
}
auto benchmark = start_benchmark(thread_count, round);
while (is_benchmark_running(benchmark)) {
auto progress = get_benchmark_progress(benchmark);
std::cout << "Progress: " << progress.first << "/" << progress.second << std::endl;
std::this_thread::sleep_for(std::chrono::milliseconds(500));
}
{
auto progress = get_benchmark_progress(benchmark);
std::cout << "Progress: " << progress.first << "/" << progress.second << std::endl;
}
std::vector<benchmark_result> results;
pick_benchmark_result(benchmark, results);
{
std::chrono::system_clock::duration total_cost = std::chrono::system_clock::duration::zero();
for (auto& result : results) {
total_cost += result.integer_add_cost;
}
std::cout << "Integer add: " << std::chrono::duration_cast<std::chrono::microseconds>(total_cost).count() << "us"
<< std::endl;
for (auto& result : results) {
for (auto& final_result : result.integer_add_final_result) {
std::cout << " " << std::setw(10) << final_result;
}
std::cout << std::endl;
break;
}
}
{
std::chrono::system_clock::duration total_cost = std::chrono::system_clock::duration::zero();
for (auto& result : results) {
total_cost += result.integer_sub_cost;
}
std::cout << "Integer sub: " << std::chrono::duration_cast<std::chrono::microseconds>(total_cost).count() << "us"
<< std::endl;
for (auto& result : results) {
for (auto& final_result : result.integer_sub_final_result) {
std::cout << " " << std::setw(10) << final_result;
}
std::cout << std::endl;
break;
}
}
{
std::chrono::system_clock::duration total_cost = std::chrono::system_clock::duration::zero();
for (auto& result : results) {
total_cost += result.integer_mul_cost;
}
std::cout << "Integer mul: " << std::chrono::duration_cast<std::chrono::microseconds>(total_cost).count() << "us"
<< std::endl;
for (auto& result : results) {
for (auto& final_result : result.integer_mul_final_result) {
std::cout << " " << std::setw(10) << final_result;
}
std::cout << std::endl;
break;
}
}
{
std::chrono::system_clock::duration total_cost = std::chrono::system_clock::duration::zero();
for (auto& result : results) {
total_cost += result.integer_div_cost;
}
std::cout << "Integer div: " << std::chrono::duration_cast<std::chrono::microseconds>(total_cost).count() << "us"
<< std::endl;
for (auto& result : results) {
for (auto& final_result : result.integer_div_final_result) {
std::cout << " " << std::setw(10) << final_result;
}
std::cout << std::endl;
break;
}
}
{
std::chrono::system_clock::duration total_cost = std::chrono::system_clock::duration::zero();
for (auto& result : results) {
total_cost += result.float_add_cost;
}
std::cout << "Float add: " << std::chrono::duration_cast<std::chrono::microseconds>(total_cost).count() << "us"
<< std::endl;
for (auto& result : results) {
for (auto& final_result : result.float_add_final_result) {
std::cout << " " << std::setw(12) << final_result;
}
std::cout << std::endl;
break;
}
}
{
std::chrono::system_clock::duration total_cost = std::chrono::system_clock::duration::zero();
for (auto& result : results) {
total_cost += result.float_sub_cost;
}
std::cout << "Float sub: " << std::chrono::duration_cast<std::chrono::microseconds>(total_cost).count() << "us"
<< std::endl;
for (auto& result : results) {
for (auto& final_result : result.float_sub_final_result) {
std::cout << " " << std::setw(12) << final_result;
}
std::cout << std::endl;
break;
}
}
{
std::chrono::system_clock::duration total_cost = std::chrono::system_clock::duration::zero();
for (auto& result : results) {
total_cost += result.float_mul_cost;
}
std::cout << "Float mul: " << std::chrono::duration_cast<std::chrono::microseconds>(total_cost).count() << "us"
<< std::endl;
for (auto& result : results) {
for (auto& final_result : result.float_mul_final_result) {
std::cout << " " << std::setw(12) << final_result;
}
std::cout << std::endl;
break;
}
}
{
std::chrono::system_clock::duration total_cost = std::chrono::system_clock::duration::zero();
for (auto& result : results) {
total_cost += result.float_div_cost;
}
std::cout << "Float div: " << std::chrono::duration_cast<std::chrono::microseconds>(total_cost).count() << "us"
<< std::endl;
for (auto& result : results) {
for (auto& final_result : result.float_div_final_result) {
std::cout << " " << std::setw(12) << final_result;
}
std::cout << std::endl;
break;
}
}
{
std::chrono::system_clock::duration total_cost = std::chrono::system_clock::duration::zero();
for (auto& result : results) {
total_cost += result.float_sqrt_cost;
}
std::cout << "Float sqrt: " << std::chrono::duration_cast<std::chrono::microseconds>(total_cost).count() << "us"
<< std::endl;
for (auto& result : results) {
for (auto& final_result : result.float_sqrt_final_result) {
std::cout << " " << std::setw(12) << final_result;
}
std::cout << std::endl;
break;
}
}
{
std::cout << "Float sin: " << std::endl;
for (auto& result : results) {
for (auto& final_result : result.float_sin_final_result) {
std::cout << " " << std::setw(9) << final_result;
}
std::cout << std::endl;
break;
}
}
{
std::cout << "Float cos: " << std::endl;
for (auto& result : results) {
for (auto& final_result : result.float_cos_final_result) {
std::cout << " " << std::setw(9) << final_result;
}
std::cout << std::endl;
break;
}
}
std::cout.flush();
return 0;
}
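One caveat about the output above: `std::cout` prints a `float` with 6 significant digits by default, while a 32-bit float needs up to 9 to round-trip, so two platforms could print identical text for values that differ in the last bit. A minimal sketch of printing the raw bit pattern instead is shown below. The helper names `float_bits` and `float_bits_hex` are hypothetical and are not part of the original main.cpp.

```cpp
#include <cstdint>
#include <cstring>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: returns the raw 32-bit pattern of a float.
// std::memcpy is used so the conversion does not violate aliasing rules.
inline std::uint32_t float_bits(float value) {
  static_assert(sizeof(float) == sizeof(std::uint32_t), "float is expected to be 32 bits");
  std::uint32_t bits = 0;
  std::memcpy(&bits, &value, sizeof(bits));
  return bits;
}

// Hypothetical helper: formats the bit pattern as fixed-width hexadecimal,
// which makes last-bit differences between platforms visible in a diff.
inline std::string float_bits_hex(float value) {
  std::ostringstream out;
  out << "0x" << std::hex << std::setw(8) << std::setfill('0') << float_bits(value);
  return out.str();
}
```

Printing `float_bits_hex(final_result)` next to the decimal value would let outputs from different platforms be compared with a plain text diff.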
CMakeLists.txt
cmake_minimum_required(VERSION 3.16)
project(testfpu)
set(CMAKE_CXX_STANDARD 20)
add_executable(testfpu main.cpp test_fpu.h test_fpu.cpp)
target_include_directories(testfpu PRIVATE ${CMAKE_CURRENT_LIST_DIR})
if(CMAKE_CXX_COMPILER_ID MATCHES "Clang|AppleClang")
target_link_libraries(testfpu PRIVATE c++ c++abi)
target_compile_options(testfpu PRIVATE -stdlib=libc++)
endif()
include(CheckCCompilerFlag)
if(MSVC)
set(ADDITIONAL_COMPILE_OPTIONS
"$<$<NOT:$<CONFIG:Debug>>:/O2;/DNDEBUG>"
/Z7
/nologo
/DWIN32
/D_WINDOWS
/utf-8
/MP
/W4
/wd4100
/wd4125
/wd4566
/wd4127
/wd4512
/GR-
/Gy-
/Zc:__cplusplus
# Flags for floating point
/fp:precise)
else()
set(ADDITIONAL_COMPILE_OPTIONS
-O2
-fno-rtti
-g
-ggdb
-Wall
-Wextra
-Wno-implicit-fallthrough
-Wno-unused-local-typedefs
-fno-fast-math
-ffp-contract=off)
find_package(Threads)
if(TARGET Threads::Threads)
target_link_libraries(testfpu PRIVATE Threads::Threads)
endif()
if(CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
list(APPEND ADDITIONAL_COMPILE_OPTIONS -fdiagnostics-color=auto
-ffloat-store)
# -mpc32 is the same as: fpu_control_t cw = (_FPU_DEFAULT & ~_FPU_EXTENDED)
# | _FPU_RC_NEAREST | _FPU_SINGLE; _FPU_SETCW(cw);
elseif(CMAKE_CXX_COMPILER_ID MATCHES "Clang|AppleClang")
# list(APPEND ADDITIONAL_COMPILE_OPTIONS -ffloat-store)
endif()
check_c_compiler_flag(-Wno-unused-parameter
CFLAGS_FLAGS_NO_UNUSED_PARAMETER_AVAILABLE)
if(CFLAGS_FLAGS_NO_UNUSED_PARAMETER_AVAILABLE)
list(APPEND ADDITIONAL_COMPILE_OPTIONS -Wno-unused-parameter)
endif()
check_c_compiler_flag(-rdynamic LD_FLAGS_RDYNAMIC_AVAILABLE)
if(LD_FLAGS_RDYNAMIC_AVAILABLE)
message(STATUS "Check Flag: -rdynamic -- yes")
list(APPEND ADDITIONAL_COMPILE_OPTIONS -rdynamic)
else()
message(STATUS "Check Flag: -rdynamic -- no")
endif()
if(CMAKE_SYSTEM_PROCESSOR MATCHES "x86|x64|x86_64|AMD64")
list(APPEND ADDITIONAL_COMPILE_OPTIONS -mieee-fp)
elseif(CMAKE_SYSTEM_PROCESSOR MATCHES
"armv7|armv7s|armeabi|armeabi-v7a|arm64-v8a|aarch64|arm64")
check_c_compiler_flag(-mhard-float CFLAGS_FLAGS_MHARD_FLOAT_AVAILABLE)
if(CFLAGS_FLAGS_MHARD_FLOAT_AVAILABLE)
list(APPEND ADDITIONAL_COMPILE_OPTIONS -mhard-float)
endif()
endif()
endif()
set_source_files_properties(
test_fpu.h test_fpu.cpp PROPERTIES COMPILE_OPTIONS
"${ADDITIONAL_COMPILE_OPTIONS}")
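Conclusion
My personal recommendation: if performance is the only concern, and floating point is used only for simple work such as formula evaluation and storage, it is now fine to use floating-point numbers in game servers. If the goal is guaranteed cross-platform consistency, or the workload involves complex multiplication-heavy computation, caution is still warranted. Anyone interested is welcome to discuss and share, and concrete test code related to consistency is especially welcome.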
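As one small example of a consistency-related test, the reordering case mentioned earlier (`A*C+B*C` versus `(A+B)*C`) can be reproduced with three hand-picked values. This is only a sketch: `compare_reordering`, `reorder_result`, and `bits_of` are hypothetical names, and the expected bit patterns assume IEEE 754 binary32 arithmetic with round-to-nearest-even. Every intermediate is stored to a `volatile float` so that the compiler neither folds the expressions nor keeps a wider register value.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical result type: raw bit patterns of the two evaluation orders.
struct reorder_result {
  std::uint32_t distributed_bits;  // A*C + B*C
  std::uint32_t factored_bits;     // (A + B) * C
};

inline std::uint32_t bits_of(float value) {
  std::uint32_t bits = 0;
  std::memcpy(&bits, &value, sizeof(bits));
  return bits;
}

inline reorder_result compare_reordering() {
  volatile float a = 1.0f;
  volatile float b = std::ldexp(1.0f, -24);  // 2^-24, half an ulp of 1.0f
  volatile float c = 3.0f;

  // Distributed form: both products are exact, and the sum 3 + 3*2^-24 lies
  // above the halfway point, so it rounds up to the next float after 3.0f.
  volatile float ac = a * c;
  volatile float bc = b * c;
  volatile float distributed = ac + bc;

  // Factored form: 1 + 2^-24 is an exact tie and rounds to even (1.0f),
  // so the product is exactly 3.0f.
  volatile float sum = a + b;
  volatile float factored = sum * c;

  return {bits_of(distributed), bits_of(factored)};
}
```

Under those assumptions the two forms differ by one unit in the last place (`0x40400001` versus `0x40400000`), which is why an optimizer that is allowed to rewrite one form into the other can change results.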