ModOS - Modular Operating Systems for Large Scale and Distributed Infrastructures
Date
2022-05-24
Authors
Schubert, Lutz
Publication Type
Dissertation
Abstract
Modern-day compute systems are equipped with multi-core processors that benefit from (1) applications running concurrently and (2) multi-threaded processes. Executing applications over multiple processors is particularly vital for High Performance Computing, where any delay in execution can lead to considerable cost. Multi-core processing incurs additional cost for cache consistency, and traditional Operating Systems are moreover not adjusted to multi-core processing, which leads to overhead that can be avoided.
This thesis focuses on analyzing which aspects of an Operating System lead to unnecessary overhead in a multi-core setting and how these can be avoided:
Traditional Operating Systems are large, complex, and frequently monolithic blocks designed for executing multiple applications on a single processing unit. They rely on the fact that a standard processor is fast enough to allow for preemptive execution of an application without noticeable delay. In fact, processors are so fast that they allow the Operating System to perform major management and maintenance tasks in the background without disrupting the interactivity of applications. More processing units (cores) in one processor allow even more management tasks to be executed without interruption.
The Operating System provides resources to applications and ensures that no conflicts arise from concurrent requests, in addition to graphics handling and similar services. These functions are executed by a central instance that arbitrates between conflicting requests, e.g. for memory acquisition. Similarly, processes are maintained in a single table so that process spaces do not overlap. This poses few problems when running multiple applications.
As opposed to other applications, HPC applications consist of multiple coupled threads that execute in unison. Any interference can delay execution and thus impact performance; implicitly, no other applications should run in the background. The same applies to Operating System tasks (unless triggered by the application itself).
A typical HPC application will spawn (and synchronize) as many threads as is sensible, multiple times during execution. To compensate for workload jitter and potential wait time, more threads are instantiated than processing units are available. This means that thread instantiation (and destruction) is performed millions of times, and that system calls are triggered at (ideally!) the same time by all threads. This can delay total execution time by around 4%. Notably, threads of an HPC application hardly need any OS functionality in the first place.
Therefore, the architecture of Operating Systems has to be completely revised for optimal multi-threaded execution and minimal overhead. Specifically, this means:
#1 replication of the OS per core, to minimize overhead for system call handling and unnecessary wait time.
#2 reduction of the OS to core functions. Since each thread may place different demands on the OS, different instances have to be supported.
#3 distribution of resource management without introducing conflicts.
In this thesis, a modular and adaptive architecture will be presented that can be easily adjusted to different circumstances without having to adjust the code every time. What is more, the modular approach allows for distribution of functionalities that are needed multiple times in large scale settings. For example, different memory management modules can be instantiated to reduce the invocation time across complex bus architectures in large scale manycore processors, or a single communication module can handle messaging over the external I/O etc.
This approach also allows for easy adaptation to new platforms as well as usage contexts, and can even support the realization of large-scale, widely distributed virtual shared memory functionality at minimal cost and effort.
A reference implementation of the OS architecture has shown considerably higher performance than Linux and L4, with up to 20 times faster thread maintenance. Though the principles even allow for multiprocessor execution, only multicore scalability has been tested, with more than 200 threads on a 61-core processor (Xeon Phi).
In conclusion, the work demonstrates how and why an Operating System architecture (and implementation) has to change to cater for large-scale manycore processors with or without cache consistency. The OS specifically targets data-parallel HPC applications, but the principles are generally applicable to any type of application that requires multiple threads. Its use in embedded systems has also been demonstrated. Tests indicate a considerable performance improvement, though at the cost of backward compatibility with Linux: applications have to be (re)written explicitly for this new OS environment.
Faculties
Fakultät für Ingenieurwissenschaften, Informatik und Psychologie
Institutions
Institut für Organisation und Management von Informationssystemen
License
CC BY-NC-SA 4.0 International
Keywords
Operating System, Hochleistungsrechnen, Verteiltes System, High performance computing, Distributed systems