What’s your next move when things don’t make sense?
By Randall Noon, P.E.
Failure is fundamentally a cause-and-effect relationship. This is represented by the following logic expression, where “A” is the cause and “B” is the failure, or effect:
A → B    (i)
In a simple failure, “A” can be a single factor, “A1”, acting alone that causes “B”. For example, if a bearing has been in service too long, but all other design and environmental conditions are good, the bearing will eventually fail due to age. In this case, “A1” is a simple causative factor related to service time.
In a slightly more complicated failure, “A” can be a group of factors, labeled “A1, A2, A3,” etc., acting in a linear sequence, one after another like falling dominoes, that eventually cause “B” to occur. Beginning with “A1”, removal of any one of the “Ai” factors can interrupt the chain of events that leads to “B”. There are several root cause methodologies based upon this “domino” theory of failure. The following expression represents this:
A1 → A2 → A3 → … → B    (ii)
For example, if a bearing is in service too long and fails due to age, the machine in which the bearing is used may also cease operating, which in turn causes the production line to stop. Using expression (ii), the service time is “A1”, the bearing is “A2”, the machine is “A3”, and the failure of the production line is “B”.
Alternatively, “A” can be a group of factors acting in parallel with one another, like the famous fire triangle of oxygen, fuel and an ignition source. All three factors must be present at the same time for a fire to occur. Expression (iii) represents this:
(A1 AND A2 AND A3) → B    (iii)
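The difference between the domino form and the parallel form can be sketched in a few lines of Python. This is only an illustration under simple assumptions: each factor is either present (True) or absent (False), and the function names `first_broken_link` and `parallel_failure` are made up for this sketch.

```python
from typing import Dict, Optional


def first_broken_link(chain: Dict[str, bool]) -> Optional[str]:
    """Domino form: walk the factors in order.

    Returns the name of the first absent factor (the chain stops there),
    or None if every domino falls and "B" occurs.
    """
    for name, present in chain.items():
        if not present:
            return name
    return None


def parallel_failure(factors: Dict[str, bool]) -> bool:
    """Parallel form: "B" occurs only if all factors are present together."""
    return all(factors.values())


# Bearing example: service time -> bearing -> machine -> production line stops.
bearing_chain = {"A1 service time": True, "A2 bearing": True, "A3 machine": True}

# Fire triangle with the ignition source taken away.
fire = {"oxygen": True, "fuel": True, "ignition": False}
```

Calling `first_broken_link(bearing_chain)` returns None because nothing interrupts the chain, while `parallel_failure(fire)` is False because one side of the triangle is missing.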
As the complexity of failure increases, “A” can be composed of a combination of both linear and parallel factors acting at various times. When such a failure is diagrammed, it resembles a logic switching circuit—similar to the example illustrated in Fig. 1.
Figure 1 depicts a nondescript failure, “B”, which has four causative factors: “A1”, “A2”, “A3”, and “A4”. Starting at the bottom and working up through the logic diagram, there are three failure scenarios that can result in “B”. They are as follows:
First Failure Scenario: Causal factors “A1” and “A2” both occur. With both factors present, the event path proceeds through gate 1, the green AND gate, to the next stage, gate 2, the yellow OR gate. Because gate 2 indicates that the event path can proceed if either “A3” occurs or both “A1” and “A2” occur, the event path continues through gate 2 and causes “A4” to occur. When condition “A4” occurs, the event path continues again and causes “B”.
In shorthand notation, this failure pathway is represented as follows:
(A1 AND A2) → A4 → B
Second Failure Scenario: Causal factor “A3” occurs first. With this factor present, the event path proceeds through gate 2 to “A4”. (“A1” or “A2” may or may not be present; it doesn’t matter.) The preceding event, “A3”, then causes condition “A4” to occur, which in turn causes “B”.
In similar fashion, the shorthand notation for this failure pathway is as follows:
A3 → A4 → B
Third Failure Scenario: This one is simple. For whatever reason, condition “A4” occurs first without any precedents. The occurrence of “A4” then causes “B”.
The simple shorthand for this failure pathway is as follows:
A4 → B
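The three scenarios can be checked against the gate logic with a short Python sketch. It is one reading of the diagram as described in the text, not a reproduction of Fig. 1. The parameter names `a4_alone` (A4 arising with no precedent) and `a4_removed` (A4 engineered out) are assumptions introduced for illustration.

```python
from itertools import product


def failure_b(a1, a2, a3, a4_alone=False, a4_removed=False):
    """Return True if failure "B" occurs for the given factor states.

    a4_alone   -- assumed flag: A4 occurs without any precedent (third scenario)
    a4_removed -- assumed flag: A4 has been eliminated, so nothing reaches B
    """
    if a4_removed:
        return False
    gate1 = a1 and a2       # gate 1 (AND): both A1 and A2 are needed
    gate2 = gate1 or a3     # gate 2 (OR): gate 1 output or A3 is enough
    a4 = gate2 or a4_alone  # A4 is triggered through gate 2 or arises alone
    return a4               # A4 then causes B


# Combinations of (A1, A2, A3) that lead to B when A4 does not arise alone.
triggering = [s for s in product([False, True], repeat=3) if failure_b(*s)]
```

Under this reading, `triggering` holds every combination in which “A3” is present, plus the one combination in which “A1” and “A2” are both present without “A3”.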
Preventing failure via typical root cause analysis
Having laid out the failure scenarios for “B”, how can the failure be prevented using the usual type of root cause analysis? Consider the following possibilities.
1. The simplest way is to remove factor “A4”. Removal of this single factor, which is common to all the failure scenarios, definitely stops “B” from occurring. The other three causal factors can be allowed to occur or not occur. It makes no difference as long as “A4” is removed from the event path. But what if removing “A4” is a prohibitively expensive or risky option?
2. Removing either “A1” or “A2” could also prevent the failure if factor “A3” does not occur. Perhaps “A3” occurs so infrequently that it can be left in place and just “A1” or “A2” needs to be removed to prevent failure. Perhaps the chance is worth taking, especially if the removal of either “A1” or “A2” is relatively cheap and easy.
3. Likewise, removing “A3” alone could prevent failure “B” if there were assurances that “A1” and “A2” occurring at the same time is sufficiently infrequent. Since both “A1” and “A2” have to be present for “B” to occur without “A3”, the occurrence of one of the two factors can be tolerated without harm. Perhaps if “A1” were to occur, for example, there is sufficient time to fix it before “A2” can occur.
4. Removing all four identified causative factors, sometimes known as the shotgun approach to problem solving, can prevent “B” from occurring. However, this would likely be the most expensive approach.
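One way to compare these possibilities is to list the factors each failure scenario needs and see which scenarios survive once certain factors are removed. The Python sketch below does that. The scenario labels and the function name `surviving_scenarios` are illustrative, and removing “A1” stands in for the “A1 or A2” possibility.

```python
# Factors each failure scenario depends on, taken from the three pathways.
SCENARIOS = {
    "first": {"A1", "A2", "A4"},
    "second": {"A3", "A4"},
    "third": {"A4"},
}

# Factors removed under each of the four possibilities listed above.
OPTIONS = {
    1: {"A4"},
    2: {"A1"},  # removing "A2" instead gives the same result
    3: {"A3"},
    4: {"A1", "A2", "A3", "A4"},
}


def surviving_scenarios(removed):
    """Return the scenarios that can still occur after `removed` is eliminated."""
    return [name for name, needs in SCENARIOS.items() if not (needs & set(removed))]


for option, removed in OPTIONS.items():
    print(option, sorted(removed), "->", surviving_scenarios(removed))
```

Removing “A4” alone, or all four factors, leaves no scenario standing. Removing only “A1” or only “A3” leaves pathways open, which is the residual risk those possibilities accept.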
You have options
As the preceding example demonstrates, the prevention of failure “B” can be accomplished completely by two options: 1 and 4. On the other hand, perhaps failure “B” could also be reasonably prevented by assuming a small amount of risk with either option 2 or 3. So, which of the four options is best?
If you’re involved in a capital project, the standard approach is to perform a feasibility assessment and weigh the various costs and risks. Does your root cause process include a similar assessment to determine which corrective action strategy is the most cost-effective, or does it assume that elimination of the “root cause” is always the solution?
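A feasibility-style comparison of this kind might be sketched as follows. Every number here is hypothetical and chosen only to show the arithmetic. The yearly probabilities, the fix costs and the cost of a failure are all assumptions, as is the simplification that the factors arise independently of one another.

```python
# Hypothetical yearly probability that each factor arises on its own.
P = {"A1": 0.10, "A2": 0.10, "A3": 0.01, "A4": 0.001}

# Hypothetical one-time cost of each corrective option, and cost of one failure.
OPTIONS = {
    1: ({"A4"}, 500_000),
    2: ({"A1"}, 20_000),
    3: ({"A3"}, 50_000),
    4: ({"A1", "A2", "A3", "A4"}, 600_000),
}
FAILURE_COST = 1_000_000


def prob_b(removed, p=P):
    """Probability of "B" in a year, assuming independent factors."""
    if "A4" in removed:
        return 0.0  # every pathway to B runs through A4
    q = {k: (0.0 if k in removed else v) for k, v in p.items()}
    gate1 = q["A1"] * q["A2"]  # A1 and A2 both present
    # B is avoided only if none of the three triggers of A4 occurs.
    no_trigger = (1 - gate1) * (1 - q["A3"]) * (1 - q["A4"])
    return 1 - no_trigger


def expected_cost(option):
    """Fix cost plus the expected cost of failure over one year."""
    removed, fix_cost = OPTIONS[option]
    return fix_cost + prob_b(removed) * FAILURE_COST


best = min(OPTIONS, key=expected_cost)
```

With these made-up figures the cheapest partial fix comes out ahead of complete elimination. Different probabilities or costs could just as easily reverse the ranking, which is the reason for doing the assessment.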
Such questions bring us to a significant issue that fuels much discussion and unnecessary consternation in root cause analysis: Which causal factor is the “root cause” of failure “B”?
- Some methods suggest that the causative factor or factors that first set in motion the failure scenario reflect the real root cause. By this definition, “A1” and “A2” could be the root causes. Then again, “A3” could also be the root cause if “A3” occurs first and “A1” and “A2” do not occur.
- Some methods indicate that there should be one, and only one, “root cause,” and that it is the one causative factor that, when removed, completely stops the failure scenario. This would make the root cause in this case “A4”.
- Some methods indicate that the real “root cause” is the one factor or factors over which you have control that precludes the failure. This might mean that all the identified causative factors are root causes or perhaps just one of the four.
Depending upon which definition of “root cause” is used, any one of the four causal factors could be considered a “root cause.” Now comes an additional dilemma.
Many, if not most, root cause methods require a person to preclude failure from recurring by eliminating the root cause. In fact, the term “root cause analysis” itself suggests that the focus of the investigation is to find and eliminate the “root cause.” In shorthand notation, the solution strategy being assumed is this:
no A → no B
As demonstrated, however, in our simple example, finding and eliminating the “root cause,” as defined by whichever method is being applied, may not be the most cost-effective way to address the problem, especially if the failure can be caused by various combinations of the same factors. Removing the “root cause” that caused “B” this time may preclude one failure scenario yet leave others in place.
Further, the term “root cause analysis” itself is suggestive: It floats the idea that the goal of an investigation is to find and eliminate a “root cause” so that a specific failure will not recur. Unfortunately, this approach ignores two other potentially useful strategies.
If the goal is to preclude recurrence of the failure rather than just find the “root cause,” there are three strategies that can be employed.
1. Prevent “B” from occurring by eliminating “A”. This, of course, is the essence of many, if not most, of the root cause analysis methods in use.
2. Change the consequences of “B” so that when “A” occurs, “B” may still occur but the consequences are tolerable. In other words, instead of eliminating the cause, eliminate the result. Note that if the deleterious effects of “B” are eliminated, it is not even necessary to know what the root cause is.
3. Eliminate the link between “A” and “B”. Break the link, perhaps by an intervention strategy, between the two events so that they are independent events. Thus, if “A” occurs, “B” does not automatically occur.
How it all works
With respect to item 2 above, here is an example of how things can work. A nuclear station had regularly performed a required safety test of steam test stop valves in the middle of each run. During one such test, one of the re-heat test stop valves failed to operate. This occurred at the same time as a high-level alarm in the steam moisture separator. Because both the moisture high-level alarm and the stuck valve occurred at the same time (parallel events), the plant was required to SCRAM, that is, the reactor had to be shut down.
A subsequent investigation found that the particular test steam stop valve had jammed because of manufacturing debris in the valve. A small machining chip had caused the piston-type valve to bind. The valve was a commercial-grade item, but was unique in design. Various plans were studied to prevent machining chips from being present in the valve and various replacement valves were considered. However, all these measures were costly and still did not provide the kind of assurance needed to prevent a SCRAM.
A closer look at the required safety test found that while it had to be performed at least once every 18 months, and had always been done at mid-cycle, it could be conducted any time in the cycle. Thus, the most cost-effective fix was to move this testing to the end of the cycle—which cost nothing.
When the test was performed at the end of the run, if the valve jammed and there was a moisture alarm at the same time, the reactor would still be SCRAMed. But the reactor would be deliberately SCRAMed within minutes anyway for a planned maintenance outage. In other words, both the cause and effect were left in place, but the timing of the test was changed so that the consequence was no longer an issue.
With respect to item 3, de-linking cause and effect, consider this nationally famous example. In 1949, there were about 42,000 cases of debilitating and sometimes deadly polio. Many of you reading this are too young to recall the iron lungs, the leg braces and crutches associated with polio. In the early 1950s and continuing, there was a national program to vaccinate children against polio. First, it was the Salk vaccine, and later the oral Sabin vaccine. Eventually, the number of polio cases in the U.S. dropped to zero.
Vaccination didn’t get rid of polio germs, the root cause. The germs are still there. Vaccination didn’t cure the disease. Although treatment is better, there’s still no cure for polio. The consequence is still there. The vaccine did, though, break the link between cause and consequence, and polio no longer causes grief to 42,000 families a year.
Randy Noon is a Root Cause Team Leader at Nebraska’s Cooper Nuclear Station. A noted author and frequent contributor to MT, he has been investigating failures for more than three decades. Email: firstname.lastname@example.org.