Toro kernel

A dedicated kernel for multi-threading applications.

Friday, January 12, 2018

Toro compiles with -O2 option

Hello everyone, I spent the last few days trying to compile the Toro kernel with the -O2 option. This option tells the compiler to optimise execution by using the CPU registers, which means that the compiler keeps data in registers instead of in memory. This is a huge performance improvement since access to registers is far faster than access to memory. Let me give more details about this issue.

The problem that I faced was that some assembler functions were violating the x64 ABI of Windows. In other words, these functions were not restoring the value of registers that the compiler uses. In addition, the IRQ handlers were not doing so either. After fixing all these issues, I am able to compile with -O2 and the result is already in master. A simple comparison of TorowithFilesystem.pas shows a speedup of ~12%! Compilation with -O3 is also possible but I have not benchmarked it yet.


Sunday, December 24, 2017

Toro compiled with FPC 3.0.4 and Lazarus 1.8.0

I am glad to announce that Toro compiles smoothly with FPC 3.0.4 from Lazarus 1.8.0. There is only one issue when running qemu from the IDE, which I am going to report. Apart from that, no other issue has been observed.


Friday, December 15, 2017

Toro in FOSDEM'18!

I am glad to announce that Toro will be at FOSDEM'18. This time I will talk about the improvements in the scheduler that reduce the impact of idle loops, thus reducing the CPU usage of VMs. For more information, you can find the abstract here.

Friday, December 08, 2017

Reducing CPU usage on VMs that run Toro

Over the last few days I worked on reducing the CPU usage of Toro. I observed that VMs running Toro consume 100% of a CPU, which makes any Toro-based solution unusable in production. I identified four situations in which an idle loop is the issue:

  1. While waiting on a spin lock
  2. When there is no thread in the ready state
  3. When there are no threads at all
  4. When threads only poll a variable

Cases 1, 2 and 3 are in the kernel code. Case 4, however, is when a user thread does idle work by polling a variable. In this case the solution is harder, since the scheduler has to figure out that the thread is only polling. Intel proposes different mechanisms to reduce the impact of idle loops [1, 2]. In particular, I was interested in the use of the monitor/mwait instructions. However, I found that these are not well supported by all hypervisors, so I had to rely on the hlt (halt) and pause instructions instead. I want to highlight that hlt is a privileged instruction, so only ring 0 can use it. However, since in Toro both the kernel and the application run in ring 0, hlt can be used either by the kernel or by the user. The following cases correspond to its use by the kernel.

First, I tackled case 1 by introducing the pause instruction inside the loop. This relaxes the CPU while a thread waits for exclusive access to a resource. Cases 2 and 3 were improved by using hlt, which halts the CPU until the next interrupt. To tackle case 4, I proposed two APIs to tell the scheduler when a thread is polling a variable. When the scheduler figures out that all the threads in a core are polling a variable, it just turns the core off.

I tested this on my bare-metal host (4 cores, 8 GB, 2 GHz) in Scaleway with KVM and a VM running Toro. I also installed Monitorix to monitor the state of the host. To stress the system, I generated HTTP traffic and monitored the CPU usage. A few seconds after the stress stops, the CPU usage of the qemu process is only about 1%. This usage goes up and down depending on the stress. Running top, I get a pattern like this:

%CPU   %MEM   TIME+     COMMAND
6.6    0.6    0:46.92   qemu-system-x86
63.8   0.6    0:48.84   qemu-system-x86 (stress)
99.7   0.6    0:51.84   qemu-system-x86 (stress)
45.5   0.6    0:53.21   qemu-system-x86 (stress)
2.3    0.6    0:53.28   qemu-system-x86
4.0    0.6    0:53.40   qemu-system-x86
6.6    0.6    0:53.60   qemu-system-x86

This is not always the case and sometimes Toro takes longer to turn the core off. This may happen when a socket is not correctly closed and it ends up being closed by timeout. I still need to experiment to measure how much responsiveness the system loses when the core is halted. This recent work, however, seems very promising!

[2] Intel Software Developer's Manual, Volume 3, Sections 8.10.2 and 8.10.4