Software Safety
The complexity of each new generation of commercial and military aircraft is growing exponentially, driven by an ever-increasing demand for new functionality. Software offers the lowest-cost option for implementing new functions in modern aerospace systems. As a result, software complexity has grown beyond human comprehension. This article offers my thoughts on what this means for system safety.
Background
Aircraft, spacecraft, and modern cars are all good examples of embedded real-time systems. Over the last few decades, we have witnessed exponential growth in the number of sensors, in software size, and in the complexity of embedded and support systems. The required number of source lines of code (SLOC) is closely related to overall complexity, counting both code embedded on board and software required in ground support systems. The chart at the top of the article reflects the exponential growth of SLOC in major aircraft development programs over time.¹
For the purposes of my observations, the Apollo program (1968) offers a good historical reference: the “complex” system used for the moon landing required around 8,500 lines of code and roughly one-thousandth of the RAM and CPU power of similar systems today. A more recent example, shown as the last entry in the chart, is the F-35 Joint Strike Fighter (JSF) program. The JSF requires around 20 million lines of code, including 9.5 million on board the aircraft alone. I also note that this reflects 40 percent growth since critical design review in 2005.
Impact on system safety
Exponential software growth leads to an explosion of development effort and cost, and to extensive program delays. Those are critical aspects of any new program; however, as mentioned in the first paragraph, in this article I want to focus on system safety. The notion that complexity means an increased level of risk will not come as a surprise to any design engineer or program manager. But what are the critical aspects in terms of system safety management?
The current state of software engineering, assuming good software design and best industry development practices, means that there is on average 1 residual defect in every 1,000 lines of code. Here, a residual defect is one that remains in the code after all required testing has been completed. Our experience also shows that the number of defects drops by an order of magnitude at each step up the severity scale: for a 1 million SLOC project we will find roughly 900 benign, 90 medium-severity, and 9 potentially catastrophic residual defects. Again, these numbers apply to embedded real-time systems after all integration and test phases are completed.
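The back-of-envelope arithmetic above can be sketched in a few lines of code. This is purely illustrative: the rate of one residual defect per 1,000 SLOC and the order-of-magnitude drop per severity level are the article's rules of thumb, not measured constants, and the function name and severity labels are my own.

```python
def residual_defects(sloc, defects_per_ksloc=1.0):
    """Estimate residual defects by severity for a project of `sloc` lines.

    Assumes roughly one residual defect per 1,000 SLOC, split so that
    each severity level has ten times fewer defects than the one below
    it (weights 0.9, 0.09, 0.009 reproduce the 900/90/9 split per
    1 million SLOC quoted in the text).
    """
    total = sloc / 1000 * defects_per_ksloc
    estimates = {}
    share = 0.9
    for level in ("benign", "medium", "catastrophic"):
        estimates[level] = round(total * share)
        share /= 10
    return estimates

# For a 1 million SLOC project:
print(residual_defects(1_000_000))
# For the F-35's ~9.5 million on-board SLOC:
print(residual_defects(9_500_000))
```

Run against 9.5 million SLOC, the same rule of thumb yields on the order of 85-95 potentially catastrophic residual defects, which is the range discussed next.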
Going back to the F-35 (JSF) example, the fact that our advanced engineering tools and processes deliver on-board embedded software containing between 90 and 100 critical residual errors is a sobering reality. To make matters worse, complex systems do not necessarily require critical software errors to experience a catastrophic failure. Relatively benign errors can link up and lead to a major loss (e.g. the Mars Polar Lander incident).
Another critical system safety aspect is the way software errors manifest in real systems. Contrary to traditional thinking, software does not fail the way mechanical components do: we cannot assign expected probabilities to software errors surfacing. Software is pure mathematical logic. It will perform exactly as specified and coded. Traditional risk matrices do not apply here.
In summary, we need to stop pretending that traditional system safety practices can meet the challenge of ever-increasing software complexity. To put it bluntly, we need a new approach to managing system safety for software-intensive systems, like modern commercial and military aircraft, spacecraft, or even the family car.
Footnotes
1. SLOC data and the illustration sourced from the Aerospace Vehicle Systems Institute (AVSI) at Texas A&M University. SLOC data reflects the total number of on-board and ground-support lines of code required for operating the air vehicle.