ModOS - Modular Operating Systems for Large Scale and Distributed Infrastructures

Date

2022-05-24

Authors

Schubert, Lutz

Publication Type

Dissertation

Abstract

Modern-day compute systems are equipped with multi-core processors that benefit from (1) applications running concurrently and (2) multi-threaded processes. Executing applications over multiple processors is particularly vital for High Performance Computing (HPC), where any delay in execution can lead to considerable cost. Multi-core processing incurs additional cost for cache consistency, but traditional Operating Systems are also not adjusted to multi-core processing, which leads to overhead that can be avoided. This thesis analyzes which aspects of an Operating System cause unnecessary overhead in a multi-core setting and how this overhead can be avoided.

Traditional Operating Systems are large, complex, and frequently monolithic blocks designed for executing multiple applications on a single processing unit. They rely on the fact that a standard processor is fast enough to allow preemptive execution of an application without noticeable delay. In fact, processors are so fast that they allow the Operating System to perform major management and maintenance tasks in the background without disrupting the interactivity of applications, and additional processing units (cores) in one processor allow even more management tasks to be executed without interruption. The Operating System provides resources to applications and ensures that no conflicts arise from concurrent requests, in addition to graphics handling and similar services. These functions are executed by a central instance that arbitrates between conflicting requests, e.g. for memory acquisition. Similarly, processes are maintained in a single table so that no overlapping process spaces are generated. For running multiple independent applications, this causes few problems.

HPC applications, in contrast, consist of multiple coupled threads that execute in unison. Any interference can delay execution and thus impact performance; implicitly, no other applications should be running in the background.
This also applies to Operating System tasks (unless triggered by the application). A typical HPC application spawns (and synchronizes) as many threads as sensible, multiple times during execution. To compensate for workload jitter and potential wait time, more threads are instantiated than processing units are available. This means that thread instantiation (and destruction) is performed millions of times, and that system calls are, ideally, triggered at the same time by all threads. This can lead to a delay of 4% of total execution time. Notably, threads of an HPC application hardly need any OS functionality in the first place. The architecture of Operating Systems therefore has to be completely revised for optimal multi-threaded execution and minimal overhead. Specifically, this means: (1) replication of the OS per core, to minimize overhead for system call handling and unnecessary wait time; (2) reduction of the OS to core functions; since each thread may pose different needs towards the OS, different instances have to be supported; (3) distribution of resource management without resulting conflicts.

This thesis presents a modular and adaptive architecture that can easily be adjusted to different circumstances without having to adjust the code every time. Moreover, the modular approach allows distributing functionalities that are needed multiple times in large-scale settings. For example, different memory management modules can be instantiated to reduce invocation time across complex bus architectures in large-scale manycore processors, or a single communication module can handle messaging over the external I/O. This approach also allows easy adaptation to new platforms as well as usage contexts, and can even support the realization of large-scale, widely distributed virtual shared memory functionalities at minimal cost and effort.
A reference implementation of the OS architecture has shown considerably higher performance than Linux and L4, with up to 20 times faster thread maintenance. Although the principles also allow for multiprocessor execution, only multicore scalability has been tested, with more than 200 threads on a 61-core processor (Intel Xeon Phi). In conclusion, the work demonstrates how and why an Operating System architecture (and implementation) has to change to cater for large-scale manycore processors with or without cache consistency. The OS specifically targets data-parallel HPC applications, but the principles are generally applicable to any type of application that requires multiple threads; its usability in embedded systems has by now been shown as well. Tests indicate a considerable performance improvement, though at the cost of backward compatibility with Linux: applications have to be (re)written explicitly for this new OS environment.

Faculties

Fakultät für Ingenieurwissenschaften, Informatik und Psychologie

Institutions

Institut für Organisation und Management von Informationssystemen

Keywords

Operating System, High performance computing, Distributed systems