Elektron: the Journal of the South African Institute of Electrical Engineers, Jan. 2000.

Operating Systems for Safety-Critical Applications

by Dr. Yinong Chen, Programme for Highly Dependable Systems, University of the Witwatersrand

Introduction

The areas of computer application are far wider than people normally imagine. Besides visible applications such as word processing and Internet access on a desktop computer, computer applications cover a vast spectrum of areas, from nuclear reactor and aircraft control systems to vehicle antilock brake systems (ABS) and electronic toys like play stations. Some of these applications are safety-critical, that is, catastrophic consequences may occur if a computer in the control system becomes faulty.

Design errors and operational faults are in general not avoidable. All we can do is make the probability of system failure as low as possible, that is, make the system as dependable as possible. Dependability has been defined as that property of a computer system such that reliance can justifiably be placed on the service it delivers [1]. Dependability covers a wide range of attributes such as reliability, availability, safety and security. Safety is the attribute of non-occurrence of catastrophic consequences on human life or the environment. A safety-critical system is one in which safety must be assured. A safety-critical system must have a predictable failure probability.

This article discusses issues related to computer software and operating systems in safety-critical systems.

Software in Safety-Critical Systems

Because of the potentially catastrophic consequences, every component in a safety-critical system must have been proved correct or shown to have a dependability that complies with the safety standard. For this reason, commercial off-the-shelf (COTS) operating systems are normally not acceptable for safety-critical systems.
Even a simple operating system is too complex to be verified or to be proved to meet the dependability requirement. Traditionally, application programs have had to run on a "bare" machine, giving the application software designers total control over, and visibility of, the entire software system. As the complexity of the system and the software increases, it becomes extremely inconvenient and difficult to write correct code for a bare machine, where the programmers have to worry about task scheduling, memory sharing and input/output management. This has led to the introduction of a small "runtime kernel" that provides the necessary operating system functions. This kernel must not include any functions that are not needed by the particular application, so that it remains small enough for full correctness verification.

Can We Rely on COTS Operating Systems?

Growing demand for functionality drives the software complexity in safety-critical systems to an extent where a full operating system becomes necessary. Fig.1 shows the size of the software, measured in words of executable code, used in Airbus civil aircraft [2]. A software system consisting of 10 million words is very difficult, even impossible, to develop in the traditional way on a bare machine without a proper operating system environment.

Fig.1 Growing complexity of software in Airbus

To address this problem, a natural thought is to use existing operating systems. The question is: can we rely on COTS (commercial off-the-shelf) operating systems for building dependable systems? The advantages of using COTS operating systems are obvious: they reduce the cost and time of software development. The problem is that the development of COTS operating systems does not necessarily take into account the conditions and requirements of a safety-critical system.
According to research results from the Institute for Complex Engineered Systems at Carnegie Mellon University, conventional COTS operating systems are not adequate for building safety-critical systems. Fifteen operating systems from ten vendors were tested [3]. The failure rates (the number of failures detected divided by the total number of tests conducted) of these systems are shown in Fig.2. The failure rate ranges from 10% to 23%. Note that the inputs chosen for the tests are not the normal inputs that an operating system is designed to handle. They are unexpected inputs that may only arise when an operator error or system error occurs. According to the results, AIX 4.1 has the lowest failure rate. The free Linux operating system also behaves well in the tests.

Fig.2 Failure rates of ten commercial operating systems

Is There a Solution?

We must never give up hope of finding a solution. The computer scientists in the Dependable Computing Group at LAAS-CNRS in France, who pioneered the research in this area, came up with a solution using microkernel technology. The latest generation of operating systems is developed as middleware on top of a microkernel. The idea is to use microkernel technology to re-develop a highly dependable operating system with fault-tolerant mechanisms integrated into the system, instead of building these mechanisms on top of an operating system [4]. The advantages are that one level in the hierarchy is saved and that the complexity can be reduced by implementing only those functions that are needed by the specific application.

Another solution follows from the study by the researchers at Carnegie Mellon University. They found that different operating systems fail on different inputs. Their idea for reducing the failure rate is to run the same application on multiple operating systems. The outputs of these replicated applications are compared against each other, and the majority is then used as the final output.
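The comparison step described above can be sketched in a few lines of Python. This is only a minimal illustration of majority voting over replicated outputs, not the mechanism used in the Carnegie Mellon study; the function name majority_vote and the convention that a failed replica reports None are assumptions made for this sketch.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of all replicas.

    Assumption of this sketch: a replica that crashed or hung is
    represented by None. If no value is reported by more than half of
    the replicas, None is returned to signal that there is no majority.
    """
    # Count only the replicas that actually produced a result.
    votes = Counter(value for value in outputs if value is not None)
    if not votes:
        return None
    value, count = votes.most_common(1)[0]
    # The majority is taken over all replicas, including failed ones.
    return value if count > len(outputs) / 2 else None


# The same call replicated on three different operating systems.
print(majority_vote([42, 42, None]))  # prints 42: one replica failed, two agree
print(majority_vote([42, 17, None]))  # prints None: no two replicas agree
```

Requiring a strict majority of all replicas, rather than of the surviving ones, is a deliberately conservative choice here: a single surviving replica can never outvote two failed ones.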
The experimental results exhibit a significant reduction of the failure rate, as shown in Fig.3. Initially, the failure rates range from 10% to 23%. As the number of operating systems used increases, the average failure rate decreases.

Fig.3 Multi-version comparisons reduce the failure rate

Higher dependability is important not only for safety-critical systems. A more dependable system also increases productivity and user satisfaction in conventional systems. One of the research projects in the Programme for Highly Dependable Systems at Wits University has been using the dependable computing concept to improve the availability of Internet services [5]. The idea is to build a distributed system using COTS operating systems on which fault-tolerant mechanisms are implemented, as shown in Fig.4. As explained in the previous sections, such a system is not appropriate for safety-critical applications because of the unpredictable failure rate at the COTS operating system level. Our target application, however, is commercial Internet servers, which are not safety-critical but for which high availability is extremely important.

Fig.4 Distributed operating system with fault-tolerant extension

References

[1] Laprie, Dependability of Computer Systems: from Concepts to Limits, IFIP International Workshop on Dependable Computing and its Applications, Johannesburg, January 1998, pp. 108-126 (also see www.cs.wits.ac.za/research/workshop/programme.html).
[2] Potocki de Montalk J.P., Computer software in civil aircraft, Microprocessors and Microsystems, 17 (1) 1993, pp. 17-23.
[3] Koopman P.J., De Vale J., Comparing the robustness of POSIX operating systems, IEEE 29th Annual International Symposium on Fault-Tolerant Computing, Madison, June 1999, pp. 30-37.
[4] Salles F., Arlat J., Fabre J-C., Can we rely on COTS microkernels for building fault-tolerant systems? The 33rd Meeting of IFIP 10.4 WG, Cape Town, January 1998, pp. 13-20.
[5] Chen Y., Hazelhurst S., Galpin V., Mateer R., Mueller C., Modelling software development of a decentralised virtual service redirector for Internet applications, The 7th IEEE Workshop on Future Trends of Distributed Computing Systems, Cape Town, December 1999, pp. 235-241.