A dedicated kernel for multi-threading applications.

Monday, December 12, 2011

Fixed an important bug in the emigrate procedure

This is just a brief post about a recent change in the way Toro migrates threads.
Previously, when a thread running on core #0 wanted to create a new thread on core #1, the function ThreadCreate allocated the TThread structure, the TLS and the stack, and then migrated the whole TThread structure to core #1.
The main problem with this mechanism was that all the memory blocks were allocated on the parent core. This is a serious violation of the NUMA model: the TThread structure, the TLS and the stack are no longer local memory for the core that runs the thread.
Thus, I rewrote the way threads are migrated. When a thread wants to create a new one remotely, Toro still invokes ThreadCreate, but it is now executed on the remote core. Instead of migrating the TThread structure, Toro migrates a set of arguments to be passed to ThreadCreate. When ThreadCreate finishes, the parent thread retrieves the TThreadID value, or nil if it failed.
As we can see, while a local thread is created immediately when ThreadCreate is invoked, creating a remote thread costs two steps of latency: one to migrate the parameters and another to retrieve the result.
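
The two-step scheme can be sketched in C as follows. This is only an illustration, not Toro's code (Toro is written in Pascal): the types thread_t and create_request_t, the mailbox array and all function names are hypothetical, and the model is single-threaded, so the synchronization a real kernel needs between cores is left out.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_CORES 4

/* Hypothetical stand-in for TThread: records which core allocated it. */
typedef struct {
    int   owner_core;            /* core whose local memory holds the thread */
    void *stack;                 /* stack allocated by that same core */
    void (*entry)(void *);
    void *arg;
} thread_t;

/* The arguments that travel to the remote core instead of a thread_t. */
typedef struct {
    int       pending;           /* 1 while a request waits to be served */
    int       done;              /* 1 once the result field is valid */
    size_t    stack_size;
    void    (*entry)(void *);
    void     *arg;
    thread_t *result;            /* new thread, or NULL on failure */
} create_request_t;

create_request_t mailbox[MAX_CORES];   /* one request slot per core */

/* Always runs on the core that will own the thread, so memory is local. */
thread_t *thread_create_local(int core, size_t stack_size,
                              void (*entry)(void *), void *arg)
{
    thread_t *t = malloc(sizeof *t);
    if (t == NULL)
        return NULL;
    t->stack = malloc(stack_size);
    if (t->stack == NULL) {
        free(t);
        return NULL;
    }
    t->owner_core = core;
    t->entry = entry;
    t->arg = arg;
    return t;
}

/* Step 1: the parent core migrates only the arguments. */
void post_create_request(int target_core, size_t stack_size,
                         void (*entry)(void *), void *arg)
{
    create_request_t *r = &mailbox[target_core];
    r->stack_size = stack_size;
    r->entry = entry;
    r->arg = arg;
    r->result = NULL;
    r->done = 0;
    r->pending = 1;
}

/* Runs on the target core: serve the request with a local allocation. */
void serve_create_request(int this_core)
{
    create_request_t *r = &mailbox[this_core];
    if (!r->pending)
        return;
    r->result = thread_create_local(this_core, r->stack_size,
                                    r->entry, r->arg);
    r->pending = 0;
    r->done = 1;
}

/* Step 2: the parent retrieves the result (NULL if not ready or failed). */
thread_t *fetch_create_result(int target_core)
{
    create_request_t *r = &mailbox[target_core];
    return r->done ? r->result : NULL;
}
```

In this sketch the parent would call post_create_request(1, ...), core #1 would call serve_create_request(1) from its own loop, and the parent would then call fetch_create_result(1). The thread structure and the stack are only ever allocated by the core that will run the thread.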

Matias E. Vara

Thursday, August 25, 2011

Patching GDB 7.3 for QEMU remote kernel debug

This time I will try to explain how to patch GDB 7.3 in order to debug a kernel running in QEMU through remote debugging. If we try to debug remotely with an unpatched GDB, we get an error message like:

Remote packet too long: 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 ...

I am not sure about the exact cause, but I suppose it is related to the register size. When the virtual machine jumps from real mode to protected or long mode, the register size changes but GDB doesn't know that. Thus, when GDB receives a bigger packet than it expects, it fails. The patch simply enlarges the expected packet size in those cases.
The first step is to download GDB 7.3 from http://www.gnu.org/s/gdb/download/. I implemented the patch on version 7.3, but I think it works on older versions too.
Once it is downloaded and uncompressed, edit the file gdb-7.3/gdb/remote.c and go to line 5693. That is the process_g_packet procedure. Now find the sanity check and replace the original source with the following lines:

/* Further sanity checks, with knowledge of the architecture. */
//if (buf_len > 2 * rsa->sizeof_g_packet)
//  error (_("Remote 'g' packet reply is too long: %s"), rs->buf);
if (buf_len > 2 * rsa->sizeof_g_packet)
  {
    /* Grow the expected 'g' packet size instead of failing. */
    rsa->sizeof_g_packet = buf_len;
    for (i = 0; i < gdbarch_num_regs (gdbarch); i++)
      {
        /* Skip registers that have no remote protocol number. */
        if (rsa->regs[i].pnum == -1)
          continue;
        /* A register is in the packet only if its offset fits. */
        if (rsa->regs[i].offset >= rsa->sizeof_g_packet)
          rsa->regs[i].in_g_packet = 0;
        else
          rsa->regs[i].in_g_packet = 1;
      }
  }

Finally, all that remains is to run:

$ ./configure
$ make

On some systems it may be necessary to install the termcap library first. To do so, simply run:

$ sudo apt-get install libncurses5-dev

After compilation, the binary can be found in gdb-7.3/gdb/gdb. That should be enough to run GDB correctly.

Matias E. Vara

Tuesday, August 23, 2011

Toro in Microelectronic Conference (UNLP)

Toro will be presented at the Microelectronics Conference at the University of La Plata, Argentina. In my presentation I will show the kernel's capabilities and a few tests comparing Toro with a general-purpose operating system. The conference takes place on the 8th of September in "Sala A" at 16:20.

Matias E. Vara

Saturday, July 30, 2011

Toro in Ubuntu 11.04!

It is now possible to compile and test TORO easily on Ubuntu Linux 11.04. In fact, I am moving the whole project to a Linux environment. In addition, I have started to use Git in order to simplify development. I have updated the wiki with instructions to compile and test TORO on Ubuntu.

Matias E. Vara

Friday, June 24, 2011

Toro bootloader

How do I start? The bootloader is a project in itself; if you want to write a hobby OS, you do not have to start with the bootloader. First, it will take you a lot of time; second, it is very hard to debug, so you will quickly become discouraged and you won't finish. I think the important and interesting things happen inside the kernel. Anyway, there are a few crazy guys who want to write one. For them, I have just started to write some documentation about Toro's bootloader in the wiki. I hope you find it interesting and appreciate the effort (yes, I don't like writing documentation, but I know it is very important ;) ).

Matias E. Vara

Sunday, April 17, 2011

Memory Protection in a multicore environment

This post is taken from the final paper of Matias Vara, “Paralelizacion de Algoritmos Numericos con TORO Kernel”, submitted for the degree in Electronic Engineering at Universidad de La Plata. These theoretical documents help to understand the kernel design.


When a kernel is designed for a multicore system, shared memory must be protected from concurrent write accesses. Memory protection increases the complexity of the kernel code and decreases the operating system's performance. If two or more processors access the same data at the same time, mutual exclusion is required to protect the shared data.

In a single-processor multitasking system the scheduler switches tasks often, so the only risk is that the scheduler takes the task off the CPU while it is modifying the data. Protection in this case is easy: disable the scheduler while the task is in a critical section and enable it again afterwards.
In a multiprocessor system that solution cannot be used. When tasks run in parallel, two or more tasks may execute the same line at the same time; hence the scheduler state does not matter.
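
The single-processor scheme can be sketched in C as below. It is a simulation under stated assumptions, not Toro's code: scheduler_disable, scheduler_enable and timer_tick are hypothetical names, and the timer interrupt is simulated by an ordinary function call placed in the middle of the update.

```c
#include <stdbool.h>

/* Hypothetical single-processor scheduler state. */
bool scheduler_enabled = true;
bool switch_deferred = false;    /* a tick arrived while disabled */
int  context_switches = 0;
long shared_total = 0;           /* data shared between tasks */

void scheduler_disable(void)
{
    scheduler_enabled = false;
}

void scheduler_enable(void)
{
    scheduler_enabled = true;
    if (switch_deferred) {       /* perform the switch that was postponed */
        switch_deferred = false;
        context_switches++;
    }
}

/* Called from the timer interrupt: switch tasks only when allowed. */
void timer_tick(void)
{
    if (scheduler_enabled)
        context_switches++;
    else
        switch_deferred = true;
}

/* Critical section: no task switch can happen between read and write. */
void add_to_shared_total(long amount)
{
    scheduler_disable();
    long tmp = shared_total;
    timer_tick();                /* simulated interrupt in mid-update */
    shared_total = tmp + amount;
    scheduler_enable();
}
```

The tick that arrives during the update is postponed until scheduler_enable runs, so no other task on the same processor sees the half-finished update. On a multiprocessor this gives no protection, because a task on another processor never waits for this scheduler.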

Resources protection

To protect resources in a multiprocessing system we need to define atomic operations. Each one is implemented as a single assembly instruction, although it takes several clock cycles.

Atomic operations

On these processors, a single aligned read or write of a machine word is atomic. This means that while the operation is executing, no other processor is using that memory location.
For certain kinds of operations the processor locks the memory. For this purpose the processor provides the LOCK# signal, which is used for critical memory operations. While this signal is asserted, requests from other processors are blocked.
Memory bus access is non-deterministic: whichever processor requests the bus first gets it. All the processors compete for the bus, so in a system with many processors it becomes a bottleneck.
But why do we need atomic operations? Suppose that we have to increment a counter. The Pascal source is:

counter := counter +1;

If this line is executed at the same time on several processors, the result will be incorrect unless the operation is atomic. For example, if two processors execute it on a counter that starts at 0, both may read 0 and both write 1, while the correct value is 2.
With atomic operations the processors access the variable one at a time and the result is correct. The synchronization time increases with the number of processors. The most common atomic operations are "TEST and SET" and "COMPARE and SWAP".
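
As an illustration, the same increment can be written with the standard C11 <stdatomic.h> operations. This is a sketch in C rather than Toro's Pascal; the variable and function names are hypothetical, while atomic_fetch_add, atomic_compare_exchange_weak and atomic_flag_test_and_set are the standard C11 calls.

```c
#include <stdatomic.h>

/* Shared counter; atomic_int makes every access to it atomic. */
atomic_int counter = 0;

/* Atomic equivalent of "counter := counter + 1"; returns the new value. */
int increment_fetch_add(void)
{
    return atomic_fetch_add(&counter, 1) + 1;
}

/* The same increment built from COMPARE and SWAP: retry until no other
   processor changed the counter between our read and our write. */
int increment_cas(void)
{
    int seen = atomic_load(&counter);
    while (!atomic_compare_exchange_weak(&counter, &seen, seen + 1)) {
        /* on failure, seen now holds the current value; try again */
    }
    return seen + 1;
}

/* TEST and SET: a minimal spinlock guarding a non-atomic counter. */
atomic_flag counter_lock = ATOMIC_FLAG_INIT;
int plain_counter = 0;

void increment_with_spinlock(void)
{
    while (atomic_flag_test_and_set(&counter_lock)) {
        /* another processor holds the lock; spin */
    }
    plain_counter = plain_counter + 1;
    atomic_flag_clear(&counter_lock);
}
```

All three functions give the correct total when called from several processors at once. The first two rely on a single atomic read-modify-write, while the third serializes an ordinary increment behind a lock.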

Impact of atomic operations

In a system with a few processors, atomic operations are not a big deal and they are a fast solution to the shared memory problem. However, if we increase the number of processors, they become a bottleneck.
Consider a computer with 8 cores running at 1.45 GHz [1]: while the average instruction time is 0.24 ns, an atomic increment takes 42.09 ns. The time spent on locking becomes critical.
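
To put the two figures quoted from [1] side by side, their ratio can be worked out directly. It uses only the numbers above:

```latex
\frac{t_{\text{atomic}}}{t_{\text{instr}}}
  = \frac{42.09\ \text{ns}}{0.24\ \text{ns}}
  \approx 175
```

In other words, on that machine one atomic increment costs roughly as much as 175 ordinary instructions.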

[1] Paul McKenney: RCU vs. Locking Performance on Different Types of CPUs.
http://www.rdrop.com/users/paulmck/RCU/LCA2004.02.13a.pdf, 2005

Tuesday, April 05, 2011

Memory organization in a multicore system: Conclusion.

From the programmer's point of view, access to local and remote memory is transparent. NUMA could be implemented in an SMP system without any problem. However, the OS must assign memory efficiently to take advantage of these technologies.
In the case of SMP, memory administration is easy to implement, while in NUMA it is not. The system has to assign memory depending on the CPU where the process is running. Every CPU has its own memory bank. System performance is poor if there are more remote accesses than local ones.
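
The idea of assigning memory according to the CPU can be sketched in C with one memory bank per CPU. This is a toy model and not Toro's allocator: NUM_CPUS, BANK_SIZE, alloc_local and is_local are hypothetical names, and the banks are plain arrays rather than real NUMA nodes.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_CPUS  2
#define BANK_SIZE 4096

/* Hypothetical model: one memory bank per CPU with a bump pointer. */
_Alignas(16) unsigned char bank[NUM_CPUS][BANK_SIZE];
size_t bank_used[NUM_CPUS];

/* Allocate from the bank that belongs to the requesting CPU. */
void *alloc_local(int cpu, size_t size)
{
    if (cpu < 0 || cpu >= NUM_CPUS || size > BANK_SIZE)
        return NULL;
    /* round up so every block stays 16-byte aligned */
    size = (size + 15u) & ~(size_t)15u;
    if (size > BANK_SIZE - bank_used[cpu])
        return NULL;             /* local bank exhausted */
    void *p = &bank[cpu][bank_used[cpu]];
    bank_used[cpu] += size;
    return p;
}

/* Returns 1 if the address lies in the bank of the given CPU. */
int is_local(int cpu, const void *p)
{
    uintptr_t addr = (uintptr_t)p;
    uintptr_t start = (uintptr_t)&bank[cpu][0];
    return addr >= start && addr < start + BANK_SIZE;
}
```

In this model a process running on CPU 1 calls alloc_local(1, n), so everything it allocates is local to CPU 1. When the local bank is full the sketch simply fails. A real system would have to decide whether to fall back to a remote bank and accept the slower access.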
Windows has supported NUMA since Windows Server 2003 and Linux since the 2.6.x kernels. Both of them provide system calls to exploit NUMA.
TORO kernel is optimized for NUMA technologies, keeping modern processors in mind. The only way to support NUMA is with dedicated buses. In high-performance environments these improvements must not be overlooked.

Matias E. Vara

Sunday, March 13, 2011

e1000 driver for TORO

I have just started implementing an e1000 driver for Intel Gigabit and compatible network cards. I am using the Minix 3 source as a reference and QEMU as the emulator (it has supported e1000 emulation since version 0.12.0). The detection procedure is complete, as you can see in the picture, and it has been uploaded to SVN.

Matias E. Vara

Sunday, March 06, 2011

Memory organization in a multicore system II

Continuation of the article Memory organization in a multicore system.

Non uniform memory architectures.

In a NUMA system each processor is assigned a memory region that it can access faster than the others; however, every processor can still access every memory location. Message passing is used for remote memory access. The programmer sees a contiguous memory space, and the hardware provides that abstraction.

The pioneer of NUMA technology was Sequent Computer Systems, which introduced NUMA in the 1990s. The company was later acquired by IBM and the technology was implemented in Power processors.

In addition, IBM made its own NUMA implementation called SE (Shared Everything). This implementation is present in Power6 processors.

The Intel NUMA implementation is called QuickPath Interconnect. It allows memory to be shared between processors and it is transparent to the operating system. Each processor has a point-to-point controller.

The AMD implementation uses fast links called "HyperTransport links". In this implementation each processor has a memory controller and local memory. The processors are connected to each other through a coherent HyperTransport link. Furthermore, each processor has a bi-directional non-coherent bus for I/O devices.

With a point-to-point controller, a processor can access its own memory region faster than the others, and there is significant latency if it tries to access remote memory. Thus we have two kinds of memory: local and remote.

Matias E. Vara

Saturday, January 15, 2011

Memory organization in a Multicore system

This paper is part of the final project called "Parallel Algorithm with TORO kernel", Electronic Engineering, Universidad Nacional de La Plata. In the coming months I will publish more papers about my final project. Enjoy!

Memory organization in a Multicore system

Currently, "uniform memory access" is the most common way to access memory (see SMP). In this kind of architecture, every processor can read every byte of memory, and the processors are independent. A shared bus is used and the processors compete for it, but only one can read or write at a time. In these environments only one processor can access a given byte at a given time. For programmers, memory access is transparent.

In 1995 Intel made the first SMP processor, called Pentium Pro. Its memory bus was called the Front Side Bus.

It is a bi-directional bus, very simple and very cheap, and in theory it scales well.

Intel's next step was to split the FSB into two independent buses, but cache coherency became a bottleneck.

In 2007 a bus per processor was implemented.

This kind of architecture is used by Intel's Atom, Celeron, Pentium and Core 2 processors.

In a system with many cores, the traffic through the FSB is heavy. The FSB doesn't scale and it has a limit of 16 processors per bus. So the FSB is a wall for the new multicore technology.

We can have a CPU that executes instructions quickly, but we waste time if we cannot fetch and decode them just as quickly. In the best case, we lose one more cycle reading from memory.

Since 2001 the FSB has been replaced by point-to-point interconnects such as HyperTransport or Intel QuickPath Interconnect. This changed the memory model to non-uniform memory access.

Matias E. Vara www.torokernel.org