Re: [sc] Could anybody advise .....

From: Gerard Le Lann <Gerard.Le_Lann_at_xxxxxx>
Date: Fri, 06 Nov 2009 19:08:39 +0100
Message-ID: <4AF46627.1040904@xxxxxx>
Thierry.Coq@xxxxxx wrote:

>  quantitative probabilistic assessment of software failures that could have predicted, for example, Ariane 501, based on the data available before the flight?

Sorry to disagree.
Many experts and contributors to this list (myself included) hold the 
radically opposite opinion regarding the cause and nature of the failure 
of the Ariane 5/501 maiden flight: that failure has *no* relationship at 
all with software.
Aerospatiale and CNES had all the data needed to avoid that failure with 
*absolute* certainty, for the simple reason that it is *they* who made 
the choice (among others) to commission a satellite launcher that would 
sustain horizontal accelerations (from nominal alignment) up to 5 times 
those specified for Ariane 4.
Such a choice has to do with satellite launcher system engineering 
(obviously), pertaining to the very first lifecycle phase, the 
requirements capture phase. Once such a choice had been made (factor 5), 
had "good" system engineering processes been followed in the subsequent 
phases, the Ariane 5 contractors responsible for the on-board 
command-and-control computer-based system (2 SRI and 2 OBC computers + 
software) would have known right away that they had to provide 3 
extra bits of memory for storing the value of integer BH (biais 
horizontal, i.e. horizontal bias). Trivially, an integer that is 15 bits 
long (Ariane 4) needs 18 bits of storage when multiplied by 5, since 
2^2 < 5 <= 2^3, so the multiplication can add up to 3 bits.
That was totally overlooked -- obviously, a (computer-based) system 
engineering fault -- leading to an overflow of the 16-bit register (sign 
included) containing BH, 37 seconds after lift-off, and this is why 
flight 501 blew up.
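
To make the bit arithmetic concrete, here is a minimal sketch in C 
(purely illustrative: the flight software was written in Ada, and the 
variable names and the worst-case figure of 32767 -- the largest 15-bit 
magnitude -- are my own assumptions for the example):

  #include <stdint.h>
  #include <stdio.h>

  /* A 15-bit magnitude (Ariane 4 worst case) multiplied by 5 needs
   * 18 magnitude bits, because 2^2 < 5 <= 2^3 adds up to 3 bits. */
  int main(void) {
      int32_t a4_worst = 32767;        /* largest 15-bit magnitude */
      int32_t a5_worst = 5 * a4_worst; /* 163835 */

      printf("Ariane 4 worst case: %d -- fits a 16-bit signed register\n",
             (int)a4_worst);
      printf("Ariane 5 worst case: %d -- exceeds 2^17 = 131072, "
             "so 18 magnitude bits are required\n", (int)a5_worst);
      return 0;
  }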
As far as I know:
1) The concept of horizontal velocity of a satellite launcher is not a 
software concept,
2) Labelling as a "software fault" the erroneous dimensioning of a 
buffer/register/memory cell (a dimensioning that depends solely on the 
external physical world) leads to the interesting conclusion that 
everything that is "in touch" with software is software -- computers, for 
example; hence, a failure due to having selected a computer that is too 
slow would be a software fault; so would a failure due to the overflow of 
a buffer meant to store incoming requests under overloads (the concept of 
overloads is akin to the concept of waiting queues, and as far as I 
know, the skills required in queuing theory have little to do with the 
skills required in the software domain),
3) A posteriori analysis revealed that the code of the conversion 
procedure (which converts horizontal velocity (floating point) into 
integer BH) is fault free; it computed the correct value for BH at 
T0+37, but found not enough bits to store it (see the sketch after 
this list); why should we blame "the software" when the code is correct?
4) Factor 5 being ignored, the choice of a particular implementation 
technology for instantiating that same conversion procedure is irrelevant 
vis-à-vis the analysis of the cause; had the conversion procedure been 
implemented in optoelectronics or mechanics or tupperware (thank you 
Peter!), Ariane 5/501 would have exploded as well; one should not 
confuse causes (factor 5 ignored, hence buffer overflow) and 
consequences (the exception raised by "the software" upon detection of 
the overflow).
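
As promised under point 3, here is a hedged sketch of such a conversion 
(again in C with names of my own choosing; the flight code was Ada, and 
the error-return style is merely an analogue of Ada's Operand Error 
exception). The procedure is itself fault free -- it computes the correct 
integer -- and the failure arises only because the 16-bit destination 
cannot hold an Ariane 5 magnitude:

  #include <stdint.h>
  #include <stdio.h>
  #include <math.h>

  /* Convert a floating-point value to a 16-bit signed integer.
   * Returns 0 on success, -1 when the (correctly computed) value
   * does not fit the 16-bit destination. */
  static int convert_bh(double horizontal_bias, int16_t *bh_out) {
      double v = nearbyint(horizontal_bias);
      if (v > (double)INT16_MAX || v < (double)INT16_MIN)
          return -1;             /* correct value, nowhere to put it */
      *bh_out = (int16_t)v;
      return 0;
  }

  int main(void) {
      int16_t bh;
      /* An Ariane 4-scale value converts and is stored... */
      printf("30000.0  -> %s\n",
             convert_bh(30000.0, &bh) == 0 ? "stored" : "overflow");
      /* ...a 5x larger Ariane 5-scale value overflows, although the
       * computed value itself is correct. */
      printf("150000.0 -> %s\n",
             convert_bh(150000.0, &bh) == 0 ? "stored" : "overflow");
      return 0;
  }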

If interested, you can find detailed reports on this topic published 
circa 1999, including contributions to this list.

Gérard

PS: The Inquiry Board report includes the following 
recommendation:
"(next time) conduct complete software inspection and simulation". This 
is rather astonishing, for at least two reasons:
* Acquiring the necessary "factor 5" knowledge requires no software 
inspection, nor any sort of simulation; moreover, as long as the 
"factor 5" knowledge is ignored, against which specification would 
software -- however fully inspected -- be declared "correct"?
* This is yet another example of the biased view according to which one 
can build correct systems simply by conducting a posteriori verification 
of software programs; such verifications can be conducted in reference 
to specifications, notably "high-level" specifications; two questions 
are almost never raised:
-- Where do such specifications come from?
-- How can we tell whether specification S (to be implemented in a 
verifiable manner) is a good/correct specification of a sub-problem 
which is provably raised by "my" overall composite (real world) 
requirements specification? Tony Hoare himself has recently pointed out 
that this is now the weakest link (in applied and theoretical 
computing science).



Thierry.Coq@xxxxxx wrote:
> Dear sirs,
> 
> I'm willing to learn. In your previous post, you may be referring to John D. Musa's work (RIP), but in his work the quality of the probabilistic assessment is heavily dependent on the operational profiles selected, as you allude to. In other words, on the quality of the requirements.
> 
> So I repeat my questions:
> 
> Do you have a precise and documented reference, within the 61508 standard, for the quantitative probabilistic assessment of software failures?
> 
> Are there references, in any standard, for quantitative probabilistic assessment of software failures that could have predicted, for example, Ariane 501, based on the data available before the flight? 
> 
> Are there references, in any standard, for the quantitative probabilistic assessment of common mode failures of hardware? 
> 
> Best regards,
> Thierry 
> +33 (0)6 80 44 57 92 
> 
> 
> 
> -----Original Message-----
> From: safety-critical-request@xxxxxx [mailto:safety-critical-request@xxxxxx] On Behalf Of Prof. Dr. Peter Bernard Ladkin
> Sent: jeudi 5 novembre 2009 20:35
> To: safety-critical@xxxxxx
> Subject: Re: [sc] Could anybody advise .....
> 
> 
> On Nov 5, 2009, at 8:06 PM, <Thierry.Coq@xxxxxx> wrote:
> 
> 
>>Let's go back to the reference document.
> 
> 
> Good idea.
> 
> 
>>§3.6.5 : Random Hardware Failure, Note 2:
>>NOTE 2 - A major distinguishing feature between random hardware  
>>failures and systematic failures (see 3.6.6), is that system failure  
>>rates (or other appropriate measures), arising from random hardware  
>>failures, can be predicted with reasonable accuracy but systematic  
>>failures, by their very nature, cannot be accurately predicted. That  
>>is, system failure rates arising from random hardware failures can  
>>be quantified with reasonable accuracy but those arising from  
>>systematic failures cannot be accurately statistically quantified  
>>because the events leading to them cannot easily be predicted
> 
> 
> Since software failures are regarded in 61508 as systematic failures,  
> it follows that
> 
> 
>>>A major distinguishing feature between random hardware failures and  
>>>software failures ...... is that system failure rates ... arising  
>>>from random hardware failures, can be predicted with reasonable  
>>>accuracy but software failures, by their very nature, cannot be  
>>>accurately predicted. That is, system failure rates arising from  
>>>random hardware failures can be quantified with reasonable accuracy  
>>>but those arising from software failures cannot be accurately  
>>>statistically quantified because the events leading to them cannot  
>>>easily be predicted
> 
> 
> I cite from an e-mail from perhaps the world's leading expert on  
> software reliability:
> 
> 
>>Systematic failures, particularly software failures, are  
>>"systematic" only in the sense that when exactly the same conditions  
>>(external and internal) apply, they are reproducible. If a  
>>particular input causes failure once, it will always cause failure.  
>>But this is a *very* uninteresting notion of systematic. We are  
>>interested in the failure behaviour of software. There is  
>>uncertainty about which inputs cause failure, and when they will  
>>occur. That's why you need a probabilistic treatment, and why  
>>notions of "failure rate", "probability of failure on demand" are  
>>needed. ......... All the theory has been around for over thirty  
>>years.
> 
> 
> 
> It follows that the contrast in §3.6.5 is spurious.
> 
> PBL
> 
> Peter Bernard Ladkin, Professor for Computer Networks and Distributed  
> Systems,
> University of Bielefeld, 33594 Bielefeld, Germany
> www.rvs.uni-bielefeld.de +49 521 880 73 19
> 
Received on Fri 06 Nov 2009 - 18:08:49 GMT