libcopp v2的第一波优化完成

之前测出来libcopp还有一些列优化点，但是要破坏之前的API，所以整理了一下优化的想法和方案。

预留空间和合并分配

之前有太多的堆内存分配了，导致很多碎片。那么第一个想法就是协程对象可以分配在栈上，runner也可以分配在栈上。然后还可以加一个自定义预留长度。每个对象对齐到sizeof(long)，总长度对齐到64 Bytes。

然后暴露出一个很小的足够编译优化时放进寄存器的对象，直接拷贝这个对象来传递协程（类似boost.context的continuation）。

原来栈分配和协程对象分配是分开的，现在必须合并了。并且创建时可以指定分配多少预留空间。栈空间回收的操作必须切只能在析构结束最后执行，也就是智能指针的析构需要自己定义。

仍然保留cotask和copp两个部分，但是cotask分配在copp的预留空间中，同样，action也分配在那里。结构如下

|......call stack(执行栈)........Xcotask(协程任务)...action(协程任务可执行对象)...copp(协程上下文对象)...private buffer(私有数据内存块)...随机偏移|
                                ^
                       执行栈起始地址从这里开始

所有数据都对齐到sizeof(std::max_align_t)或sizeof(long double)，即64位系统下对齐到8/16字节。

侵入式智能指针

原来很多地方用得std::shared_ptr，这些也会导致额外的内存分配。这个目的本来是为了线程安全和生命周期管理，可以用make_shared来合并对象和管理区。如果我把协程对象分配在栈上，那就不能用make_shared了，那么内存碎片也就又出现了。所以我想再引入侵入式智能指针来解决这个问题。同时还可以选择要不要考虑线程安全，如果可以不考虑线程安全就可以节省掉原子操作的L1 cache miss的开销。

关于缓存命中率和return stack buffer问题

我和call_in_stack的作者聊了聊，有不少收获。但是他使用的避免return stack buffer失效方法我感觉有点暴力了，直接重定义掉了ret的汇编指令。这种怕是跨架构并且和后面各种编译器行为和编译优化相关性都很强。虽然说效果很明显，但是我觉得还是要放弃这种实现。不过关于缓存命中率的问题倒是可以加入一个随机长度的栈起始offset来解决。现在很多CPU的L1 cache都是32KB，然后4-8路。所以保险点的话大约有64-128ways吧。反正如果是64的话，64和0是一样的，所以可以offset从*(0-127) « 3*，然后就可以命中在不同的cache line上了。

但是后来我测试的时候发现，其实在第一项优化之后，由于不同的cotask（包括其内部的协程上下文和task runner）的地址间隔比较远。导致L1 Cache Read Miss增高，反而实际切换开销增大了。但是实际使用过程中，协程内部的逻辑应该会更容易导致切换时的L1 Cache Read Miss，所以这个数值应该更具有参考意义。在这种情况下，那个带offset的减少cache miss的优化根本上没有效果，所以后来又去除了。

减少原子操作

之前写libcopp merge boost 1.64和相关优化的blog里有提到这一点。主要是原子操作会导致cache miss。原来由于是用智能指针来维护生命周期，很容易传递这个协程的时候发生原子操作，另外就是每次切换上下文前，由于要检查状态，都避免不了一次load、一次CAS和一次restore。然后cotask要做一次这个操作，copp也要做一次。然后在当时的测试里，使用的是栈池，而栈池为了保证线程安全，又有一次splin lock的加锁和解锁。这导致cotask的开销比copp多了40+ns。而精简的情况下，协程操作才70ns左右。这就增加了非常多了。

所以减少原子操作，一方面是要简化cotask和copp里的两次，可以优化成一次。另外可以开启一个非线程安全选项，这时候吧原子整数换成普通整数。因为我们游戏进程里都是单线程的服务，这种开销也不需要。

右值引用

本来大部分逻辑是可以用简单的方法，无视掉低量的复制消耗的。但是随着原子操作的消耗增加，我们就不得不注意类似智能指针复制导致的这种原子开销。所以就得更加智能地判定编译器并且使用右值，因为使用右值转移智能指针的话，不涉及计数器的增减，所以也就没有这个开销。

其他方面使用右值倒不是特别重要，因为传递的东西几乎都是只有几个指针的长度，很容易在编译优化的时候放进寄存器。这样就很快了。

新的单元测试

结构的变化必然涉及新的单元测试用例，我稍微列举了下，记在这里以便后续一个一个地实现。

private buffer边界和内容测试
支持区分task_action_impl和非task_action_impl的仿函数，并且测试on_finished接口的是否被正确调用

API调整

移除了不含参数的callback支持，现在所有的托管回掉的形式都是 int (void*)。（以前接入新模型后向前兼容地保留了int ()的接口，这次涉及接口调整，所以一起移除了这个历史包袱）
不再允许把协程对象声明在栈上，现在必须分配在栈上，并且使用侵入式智能指针
协程任务系统使用侵入式智能指针，不再使用shared_ptr
不再支持共享task action，实际应用中没什么作用
允许预定义分配一段私有数据空间，保存用户数据

初步成果

目前v2版本的基本改造和流程已经实现完毕，CI上的结果如下:

###################### context coroutine (stack using stack pool) ###################
########## Cmd: ./sample_benchmark_coroutine_stack_pool 1000 1000 2048
### Round: 1 ###
allocate 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1000 coroutine, cost time: 0 s, clock time: 5 ms, avg: 5000 ns
switch 1000 coroutine contest 1000000 times, cost time: 0 s, clock time: 83 ms, avg: 83 ns
remove 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 2 ###
allocate 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1000 coroutine contest 1000000 times, cost time: 0 s, clock time: 82 ms, avg: 82 ns
remove 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 3 ###
allocate 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1000 coroutine contest 1000000 times, cost time: 0 s, clock time: 83 ms, avg: 83 ns
remove 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 4 ###
allocate 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1000 coroutine contest 1000000 times, cost time: 0 s, clock time: 82 ms, avg: 82 ns
remove 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 5 ###
allocate 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1000 coroutine contest 1000000 times, cost time: 0 s, clock time: 83 ms, avg: 83 ns
remove 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns


###################### context coroutine (stack using stack pool) ###################
########## Cmd: ./sample_benchmark_coroutine_stack_pool 1 3000000 16
### Round: 1 ###
allocate 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 coroutine contest 3000000 times, cost time: 0 s, clock time: 162 ms, avg: 54 ns
remove 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 2 ###
allocate 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 coroutine contest 3000000 times, cost time: 1 s, clock time: 162 ms, avg: 54 ns
remove 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 3 ###
allocate 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 coroutine contest 3000000 times, cost time: 0 s, clock time: 162 ms, avg: 54 ns
remove 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 4 ###
allocate 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 coroutine contest 3000000 times, cost time: 0 s, clock time: 163 ms, avg: 54 ns
remove 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 5 ###
allocate 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 coroutine contest 3000000 times, cost time: 0 s, clock time: 163 ms, avg: 54 ns
remove 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns


###################### context coroutine (stack using stack pool) ###################
########## Cmd: ./sample_benchmark_coroutine_stack_pool 30000 100 64
### Round: 1 ###
allocate 30000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 30000 coroutine, cost time: 0 s, clock time: 145 ms, avg: 4833 ns
switch 30000 coroutine contest 3000000 times, cost time: 1 s, clock time: 771 ms, avg: 257 ns
remove 30000 coroutine, cost time: 0 s, clock time: 6 ms, avg: 200 ns
### Round: 2 ###
allocate 30000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 30000 coroutine, cost time: 0 s, clock time: 10 ms, avg: 333 ns
switch 30000 coroutine contest 3000000 times, cost time: 1 s, clock time: 812 ms, avg: 270 ns
remove 30000 coroutine, cost time: 0 s, clock time: 4 ms, avg: 133 ns
### Round: 3 ###
allocate 30000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 30000 coroutine, cost time: 0 s, clock time: 9 ms, avg: 300 ns
switch 30000 coroutine contest 3000000 times, cost time: 1 s, clock time: 793 ms, avg: 264 ns
remove 30000 coroutine, cost time: 0 s, clock time: 4 ms, avg: 133 ns
### Round: 4 ###
allocate 30000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 30000 coroutine, cost time: 0 s, clock time: 9 ms, avg: 300 ns
switch 30000 coroutine contest 3000000 times, cost time: 0 s, clock time: 787 ms, avg: 262 ns
remove 30000 coroutine, cost time: 0 s, clock time: 5 ms, avg: 166 ns
### Round: 5 ###
allocate 30000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 30000 coroutine, cost time: 0 s, clock time: 9 ms, avg: 300 ns
switch 30000 coroutine contest 3000000 times, cost time: 1 s, clock time: 804 ms, avg: 268 ns
remove 30000 coroutine, cost time: 0 s, clock time: 4 ms, avg: 133 ns


###################### task (stack using stack pool) ###################
########## Cmd: ./sample_benchmark_task_stack_pool 1000 1000 2048
### Round: 1 ###
create 1000 task, cost time: 0 s, clock time: 5 ms, avg: 5000 ns
switch 1000 tasks 1000000 times, cost time: 0 s, clock time: 160 ms, avg: 160 ns
remove 1000 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 2 ###
create 1000 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1000 tasks 1000000 times, cost time: 0 s, clock time: 160 ms, avg: 160 ns
remove 1000 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 3 ###
create 1000 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1000 tasks 1000000 times, cost time: 0 s, clock time: 160 ms, avg: 160 ns
remove 1000 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 4 ###
create 1000 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1000 tasks 1000000 times, cost time: 0 s, clock time: 159 ms, avg: 159 ns
remove 1000 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 5 ###
create 1000 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1000 tasks 1000000 times, cost time: 0 s, clock time: 160 ms, avg: 160 ns
remove 1000 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns


###################### task (stack using stack pool) ###################
########## Cmd: ./sample_benchmark_task_stack_pool 1 3000000 16
### Round: 1 ###
create 1 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 tasks 3000000 times, cost time: 1 s, clock time: 351 ms, avg: 117 ns
remove 1 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 2 ###
create 1 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 tasks 3000000 times, cost time: 0 s, clock time: 347 ms, avg: 115 ns
remove 1 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 3 ###
create 1 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 tasks 3000000 times, cost time: 0 s, clock time: 346 ms, avg: 115 ns
remove 1 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 4 ###
create 1 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 tasks 3000000 times, cost time: 1 s, clock time: 346 ms, avg: 115 ns
remove 1 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 5 ###
create 1 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 tasks 3000000 times, cost time: 0 s, clock time: 346 ms, avg: 115 ns
remove 1 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns


###################### task (stack using stack pool) ###################
########## Cmd: ./sample_benchmark_task_stack_pool 30000 100 64
### Round: 1 ###
create 30000 task, cost time: 1 s, clock time: 150 ms, avg: 5000 ns
switch 30000 tasks 3000000 times, cost time: 1 s, clock time: 1407 ms, avg: 469 ns
remove 30000 tasks, cost time: 0 s, clock time: 13 ms, avg: 433 ns
### Round: 2 ###
create 30000 task, cost time: 0 s, clock time: 14 ms, avg: 466 ns
switch 30000 tasks 3000000 times, cost time: 1 s, clock time: 1404 ms, avg: 468 ns
remove 30000 tasks, cost time: 0 s, clock time: 13 ms, avg: 433 ns
### Round: 3 ###
create 30000 task, cost time: 0 s, clock time: 14 ms, avg: 466 ns
switch 30000 tasks 3000000 times, cost time: 2 s, clock time: 1420 ms, avg: 473 ns
remove 30000 tasks, cost time: 0 s, clock time: 14 ms, avg: 466 ns
### Round: 4 ###
create 30000 task, cost time: 0 s, clock time: 16 ms, avg: 533 ns
switch 30000 tasks 3000000 times, cost time: 1 s, clock time: 1419 ms, avg: 473 ns
remove 30000 tasks, cost time: 0 s, clock time: 13 ms, avg: 433 ns
### Round: 5 ###
create 30000 task, cost time: 0 s, clock time: 14 ms, avg: 466 ns
switch 30000 tasks 3000000 times, cost time: 2 s, clock time: 1389 ms, avg: 463 ns
remove 30000 tasks, cost time: 0 s, clock time: 13 ms, avg: 433 ns

这是CI里Linux上的结果，osx和windows上的结果就不贴了，都没linux上的数据好，不过相差不大。

这一批优化过后，其实切换性能并没有提高，反而下降了，我查了一下是原先对read的L1 cache命中率也比较高，但是现在下降了。不过我认为这种miss的情况可能更能说明问题，因为现在是几乎0逻辑，一旦有逻辑之后也容易增加cache miss。究其原因，应该是新的组织方式把一个协程（特别是task）的维护数据放在了栈上，那么切换的时候切换地址就会比较遥远。而之前的方式，只要多个协程对象靠得近，而协程对象又足够小（task对象里的数据比较大，但是coroutine里的比较小），因为只有压测代码，所以new的时候malloc一般会把类似大小的对象放一起，这也增加了命中率。但是实际应用的话，必然中间会插入很多其他对象的，所以这个情况也会不贴近实际场景。但是另一方面，v2的优化也使得在使用栈池的情况下，分配开销被大幅降低了，我们的项目里由于开始需要大量的定时器，定时器回调的时候也需要在协程内进行，这也给大量定时器的触发的时候提供了更好的支持和保障。

不过我认为现在的v2任然有可以再优化的地方，本来想在优化完一波再放出这篇总结，但是一方面还没找到什么特别好的不破坏设计的点。另一方面最近看了下SSR-libev的源码，觉得他那种方式的适配的扩展性更好，所以后面可能回去优化服务器框架asf4g-co的网关层部分，可能要有比较大的改动，改变密钥协商流程，让atgateway支持协商加密方式、更多的加密算法和支持椭圆双曲线协商。所以libcopp的优化暂时搁置吧，不过以后会主要以v2版本的api为准，基本不会再变更接口了吧。

I'm OWenT

Challenge Everything