On-board computers (OBCs) are at the centre of space-faring systems. They provide computational performance with high availability and dependability. However, they typically consist of expensive, slow, fault-tolerant hardware designed to cope with errors or failures during a mission. Commercial-off-the-shelf (COTS) components offer higher performance but lack these fault-tolerance mechanisms. The ScOSA (Scalable On-board Computing for Space Avionics) architecture combines COTS and rad-hard components in a distributed system, providing more computing performance than current OBCs while maintaining their dependability properties.
ScOSA uses a middleware to manage the COTS components as a distributed system of nodes. In the event of a node failure, the middleware mitigates the effects by reconfiguring the system to a pre-determined configuration that excludes the failed node. These configurations are computed offline, and their memory usage grows exponentially with the number of nodes in the system, which limits the system's scalability. This paper presents an online reconfiguration algorithm as a solution to this scalability problem. When a node fails, the online algorithm makes scheduling decisions at run-time, eliminating the need for pre-determined configurations. The novel online scheduling mechanism consists of six phases and combines fault tolerance, parallelism, and the use of the real-time state of the system; it is a step towards higher dependability in distributed on-board computing. The online reconfiguration is evaluated against the offline reconfiguration in terms of time and network traffic, showing that it is not only capable of generating configurations dynamically but also provides a solution to the scalability problem.
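The contrast between offline and online reconfiguration can be illustrated with a minimal Python sketch. It is not the six-phase algorithm of the paper. It assumes, for illustration only, that the offline approach stores one configuration per non-empty set of surviving nodes, and it substitutes a simple greedy least-loaded reassignment for the online scheduler. The names `count_offline_configurations` and `reconfigure_online` are hypothetical.

```python
def count_offline_configurations(num_nodes):
    """Number of stored configurations under the ASSUMPTION that one
    configuration is kept per non-empty subset of surviving nodes."""
    return 2 ** num_nodes - 1


def reconfigure_online(assignment, failed_node):
    """Hypothetical greedy stand-in for an online scheduler.

    assignment maps node name -> list of task names. Tasks of the failed
    node are moved one by one to the surviving node that currently holds
    the fewest tasks (ties broken by node name). The input is not mutated.
    """
    survivors = {
        node: list(tasks)
        for node, tasks in assignment.items()
        if node != failed_node
    }
    if not survivors:
        raise RuntimeError("no surviving nodes to host the tasks")

    for task in assignment.get(failed_node, []):
        # Pick the least-loaded survivor using the current run-time state.
        target = min(survivors, key=lambda n: (len(survivors[n]), n))
        survivors[target].append(task)
    return survivors


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, "nodes ->", count_offline_configurations(n), "stored configurations")

    current = {"a": ["t1"], "b": ["t2", "t3"], "c": ["t4", "t5"]}
    print(reconfigure_online(current, "c"))
```

Under this assumption the stored table doubles with every added node, whereas the online variant keeps only the current assignment and computes the new one when the failure occurs.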
Towards the Online Reconfiguration of a Dependable Distributed On-Board Computer