Failure analysis is the process of determining how and why a component, product, or system failed. It’s essentially forensic investigation for engineering: examining a broken part to trace the chain of events that led to its breakdown, then using that knowledge to prevent it from happening again. The practice spans nearly every industry, from aerospace and oil refining to medical devices and consumer electronics, and it can save companies millions of dollars a year in avoided downtime and repeat failures.
Why Failure Analysis Matters
When something breaks, the instinct is to replace it and move on. Failure analysis pushes past that instinct to ask a harder question: what actually went wrong, and will it happen again? The answers protect both safety and budgets. Petroleum refineries that implement systematic failure analysis have reported cutting unplanned downtime by 37% and saving roughly $4.2 million annually. Across industries, companies report 25 to 40% reductions in maintenance costs after adopting structured failure analysis programs.
The hidden costs of skipping this step are steep. When you account for lost production, emergency repairs, liability exposure, and cascading damage to other components, the total expense of a failure typically runs 4 to 15 times the visible repair cost. A medical device manufacturer that committed to failure analysis cut failure-related costs by 62% and shaved eight weeks off its product development cycles. Understanding why things break also reduces repeat issues by up to 60% and extends the useful life of equipment substantially.
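To make the multiplier concrete, here is a minimal sketch that turns a visible repair cost into a rough range for the total cost of a failure, using the 4x to 15x range quoted above. The function name and the $20,000 repair figure are hypothetical, and real multipliers will vary by industry and failure type.

```python
def total_failure_cost_range(visible_repair_cost, low_multiplier=4, high_multiplier=15):
    """Return (low, high) estimates of the full cost of a failure.

    The default 4x-15x range is taken from the text; it is a rule of thumb,
    not a measured value for any particular plant or product.
    """
    if visible_repair_cost < 0:
        raise ValueError("repair cost must be non-negative")
    return (visible_repair_cost * low_multiplier,
            visible_repair_cost * high_multiplier)


# Hypothetical example: a $20,000 repair invoice.
low, high = total_failure_cost_range(20_000)
print(f"Estimated total cost: ${low:,.0f} to ${high:,.0f}")
```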
Beyond cost, failure analysis serves legal and regulatory purposes. When failures cause injury or property damage, investigators need to assign responsibility and determine whether a design, material, or manufacturing process was at fault. Meeting mandatory compliance requirements often depends on documenting what went wrong and proving that corrective steps were taken.
The Three Phases of an Investigation
A formal failure analysis follows three phases: collection, analysis, and solution. Each phase builds on the one before it, and skipping steps early on tends to produce unreliable conclusions later.
Collection
The investigation starts by assembling a team and clearly defining the problem. What exactly failed? When? Under what conditions? From there, the team gathers three types of data: physical evidence (the failed part itself, surrounding components, environmental samples), recorded evidence (maintenance logs, sensor data, design specifications), and personal testimony from anyone who witnessed the failure or operated the equipment. Preserving the physical evidence in its failed state is critical. Cleaning, reassembling, or further damaging a broken part before it’s examined can destroy the clues needed to reach an accurate conclusion.
Analysis
With data in hand, the team builds what’s called a cause chain. This is a sequence linking the immediate cause of the failure (the crack that propagated, the circuit that shorted) to contributing causes (a missed inspection, an environmental exposure) and ultimately to the root cause, the underlying condition that set everything in motion. The root cause might be a flawed design assumption, a material selection error, an inadequate maintenance schedule, or a manufacturing defect. The goal is to trace the chain all the way back so fixes target the origin of the problem, not just its symptoms.
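A cause chain can be recorded as a simple ordered list, which makes it easy to check that an investigation has actually reached a root cause. The sketch below is one possible representation; the `Cause` class, the level names, and the example entries are illustrative assumptions, not a standard format.

```python
from dataclasses import dataclass
from typing import List

# Levels follow the three links described in the text.
LEVELS = ("immediate", "contributing", "root")


@dataclass
class Cause:
    description: str
    level: str

    def __post_init__(self):
        if self.level not in LEVELS:
            raise ValueError(f"level must be one of {LEVELS}")


def find_root_causes(chain: List[Cause]) -> List[Cause]:
    """Return the entries in the chain that are marked as root causes."""
    return [cause for cause in chain if cause.level == "root"]


def describe(chain: List[Cause]) -> str:
    """Render the chain from the immediate cause back toward the root."""
    return " <- ".join(f"[{c.level}] {c.description}" for c in chain)


# Hypothetical chain for a fractured shaft.
chain = [
    Cause("Crack propagated through the shaft", "immediate"),
    Cause("Scheduled inspection was missed", "contributing"),
    Cause("Maintenance interval too long for the duty cycle", "root"),
]
print(describe(chain))
```

If `find_root_causes(chain)` comes back empty, the chain stops at a symptom and the analysis is not finished.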
Solution
The final phase identifies every possible way to break the cause chain, then selects and implements the most effective corrective or preventive actions. A solution might involve redesigning a component, changing a material, updating an inspection interval, or revising an operating procedure. The key distinction here is between corrective actions (fixing the specific failure) and preventive actions (ensuring similar failures don’t occur elsewhere in the system). Both matter, but preventive actions deliver the larger long-term payoff.
Common Root Cause Methods
There’s no single technique for tracing a failure to its root cause. Analysts choose from a toolkit of approaches depending on the complexity of the problem.
The “Five Whys” exercise is one of the simplest and most widely used. You state the problem, then ask “why?” repeatedly, with each answer becoming the subject of the next question, until you arrive at a foundational cause. It works well for straightforward failures with a relatively linear cause chain.
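The exercise can be sketched as a loop that follows each answer to the next question. In this hypothetical example the answers are supplied up front as a dictionary mapping each statement to its cause; in a real session they would come from the investigation team, and the pump scenario is invented for illustration.

```python
def five_whys(problem, answers, max_depth=5):
    """Follow 'why?' answers from a problem statement toward a foundational cause.

    `answers` maps a statement to the reason it happened. The walk stops when
    no further answer is recorded or after `max_depth` questions.
    """
    chain = [problem]
    current = problem
    for _ in range(max_depth):
        cause = answers.get(current)
        if cause is None:
            break
        chain.append(cause)
        current = cause
    return chain


# Hypothetical answers for a seized pump.
answers = {
    "Pump seized": "Bearing overheated",
    "Bearing overheated": "Lubricant ran low",
    "Lubricant ran low": "Scheduled refill was missed",
    "Scheduled refill was missed": "Refill task was not in the maintenance plan",
}

for depth, statement in enumerate(five_whys("Pump seized", answers)):
    print("  " * depth + statement)
```

The limit of five is a convention rather than a rule; `max_depth` can be raised when the chain is longer.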
The fishbone diagram (also called an Ishikawa diagram) takes a more visual approach. The failure is placed at the “head” of the fish, and potential contributing categories branch off the spine: materials, methods, machinery, manpower, measurement, and environment. Teams brainstorm possible causes within each category, which helps ensure nothing gets overlooked. This method is particularly useful when multiple factors may have combined to produce the failure.
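A fishbone diagram maps naturally onto a dictionary keyed by category. The sketch below uses the six categories named above and adds a helper that flags categories nobody has brainstormed yet; the function names and the weld example are assumptions made for illustration.

```python
# The six "bones" listed in the text.
CATEGORIES = ("materials", "methods", "machinery",
              "manpower", "measurement", "environment")


def new_fishbone(failure):
    """Create an empty diagram with the failure at the 'head'."""
    return {"failure": failure, "causes": {c: [] for c in CATEGORIES}}


def add_cause(diagram, category, cause):
    """Attach a brainstormed cause to one category branch."""
    if category not in diagram["causes"]:
        raise KeyError(f"unknown category: {category}")
    diagram["causes"][category].append(cause)


def unexplored_categories(diagram):
    """List categories with no candidate causes yet."""
    return [c for c, causes in diagram["causes"].items() if not causes]


# Hypothetical brainstorm for a cracked weld.
diagram = new_fishbone("Cracked weld on support bracket")
add_cause(diagram, "materials", "Filler metal from an unverified batch")
add_cause(diagram, "methods", "Preheat step skipped")
print(unexplored_categories(diagram))
```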
More complex failures may call for fault tree analysis, Pareto diagrams, or force field analysis. Fault trees work backward from the failure event, mapping every possible combination of conditions that could have produced it using logic gates. Pareto diagrams help prioritize which failure modes to address first by ranking them by frequency or impact. Force field analysis weighs the forces driving a proposed change against those resisting it, which helps when choosing among corrective actions. In practice, analysts often combine several of these tools in a single investigation.
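Two of these tools are easy to sketch in code. The example below evaluates a small fault tree built from AND and OR gates, then ranks failure modes Pareto-style by frequency with a cumulative percentage. The tree layout (nested tuples), the event names, and the failure counts are all hypothetical.

```python
from collections import Counter


def evaluate(node, occurred):
    """Return True if the fault-tree node is triggered by the occurred basic events.

    A node is either a basic event (a string) or a (gate, children) tuple,
    where gate is "AND" or "OR".
    """
    if isinstance(node, str):
        return node in occurred
    gate, children = node
    results = [evaluate(child, occurred) for child in children]
    if gate == "AND":
        return all(results)
    if gate == "OR":
        return any(results)
    raise ValueError(f"unknown gate: {gate}")


def pareto(failure_modes):
    """Rank failure modes by count, with cumulative percent of all failures."""
    counts = Counter(failure_modes)
    total = sum(counts.values())
    ranked, running = [], 0
    for mode, count in counts.most_common():
        running += count
        ranked.append((mode, count, round(100 * running / total, 1)))
    return ranked


# Hypothetical top event: a vessel leak occurs if the seal ruptures, OR if
# the relief valve sticks AND an overpressure event happens.
vessel_leak = ("OR", ["seal rupture",
                      ("AND", ["relief valve stuck", "overpressure"])])

print(evaluate(vessel_leak, {"overpressure"}))
print(evaluate(vessel_leak, {"overpressure", "relief valve stuck"}))

# Hypothetical failure log.
log = ["fatigue"] * 5 + ["corrosion"] * 3 + ["overload"] * 2
for mode, count, cumulative in pareto(log):
    print(f"{mode:10s} {count:3d} {cumulative:6.1f}%")
```

Production fault trees usually attach probabilities to basic events as well; this sketch only answers whether a given combination of events triggers the top event.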
How Materials Fail
Understanding common failure mechanisms helps analysts know what to look for when they examine a broken part. The most frequent categories include fatigue, corrosion, overload, and combinations of these.
Fatigue is the most common failure type in many engineering applications, particularly aerospace. It occurs when a material is subjected to repeated cyclic stress, even at levels well below its ultimate strength. Over enough cycles, microscopic cracks initiate and grow until the part fractures. Some materials, notably many steels and titanium alloys, show an endurance limit, a stress level below which they can theoretically survive an infinite number of cycles. Others, such as aluminum alloys, have no distinct limit and are instead rated by their fatigue strength at a specified number of cycles. When a corrosive environment is present, fatigue resistance drops sharply, and any endurance limit can be reduced or lost. This combined mechanism, called corrosion fatigue, causes premature fractures at stress levels that would otherwise be safe. Fatigue played a role in some of the most significant engineering disasters of the 20th century, including the failures of the de Havilland Comet, the first commercial jet airliner.
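The relationship between stress level and fatigue life is often approximated with a power law (a Basquin-type S-N curve), which is not described in the text above and is introduced here only as an illustration. The sketch below assumes the form stress_amplitude = coeff * N**exponent with a negative exponent, and models a corrosive environment simply as a lower endurance limit. All material constants are hypothetical, not data for any real alloy.

```python
import math


def cycles_to_failure(stress_amplitude, coeff, exponent, endurance_limit=None):
    """Estimate fatigue life from a power-law S-N curve.

    Assumes stress_amplitude = coeff * N**exponent (exponent < 0).
    Returns math.inf when the stress is at or below the endurance limit,
    for materials that have one.
    """
    if stress_amplitude <= 0:
        raise ValueError("stress amplitude must be positive")
    if endurance_limit is not None and stress_amplitude <= endurance_limit:
        return math.inf
    return (stress_amplitude / coeff) ** (1.0 / exponent)


# Hypothetical constants (MPa) for an illustrative steel.
COEFF, EXPONENT = 1000.0, -0.1

# In air: 200 MPa sits below the assumed 250 MPa endurance limit.
life_in_air = cycles_to_failure(200, COEFF, EXPONENT, endurance_limit=250)

# Corrosive service: assume the limit drops to 150 MPa, so life becomes finite.
life_corrosive = cycles_to_failure(200, COEFF, EXPONENT, endurance_limit=150)

print(f"In air:    {life_in_air} cycles")
print(f"Corrosive: {life_corrosive:,.0f} cycles")
```

The same stress amplitude that gives unlimited life in the first case gives a finite life in the second, which is the practical meaning of a reduced endurance limit.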
Analysts can distinguish corrosion fatigue from ordinary fatigue by examining the fracture surface. Normal fatigue typically produces a single dominant crack, while corrosion fatigue generates multiple parallel cracks, usually perpendicular to the direction of stress and often originating from corrosion pits on the surface. The striations (microscopic ridges left by each stress cycle) are also less pronounced in corrosion fatigue than in purely mechanical fatigue.
Overload failures happen when a single load exceeds the material’s capacity, either because the load was unexpectedly high or the material was weaker than specified. Creep, another mechanism, occurs when materials slowly deform under sustained stress at elevated temperatures, common in power generation equipment and jet engine components. Stress corrosion cracking happens when tensile stress and a corrosive environment act together on a susceptible material, producing cracks even without cyclic loading.
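The two routes to overload described above (a load higher than expected, or a material weaker than specified) can be illustrated with a simple axial-stress check. This is a minimal sketch assuming uniform tension on a constant cross-section; the function name and all numbers are hypothetical.

```python
def overload_check(load_n, area_mm2, specified_strength_mpa, actual_strength_mpa=None):
    """Compare applied axial stress with material strength.

    Stress is load / area; N per mm^2 is numerically equal to MPa.
    If a measured (actual) strength is supplied, it replaces the specified one.
    """
    if load_n <= 0 or area_mm2 <= 0:
        raise ValueError("load and area must be positive")
    stress = load_n / area_mm2
    strength = specified_strength_mpa if actual_strength_mpa is None else actual_strength_mpa
    return {
        "stress_mpa": stress,
        "safety_factor": strength / stress,
        "overloaded": stress > strength,
    }


# Hypothetical tie rod: 50 kN on 100 mm^2, specified strength 600 MPa.
as_designed = overload_check(50_000, 100, specified_strength_mpa=600)

# Same load, but testing shows the delivered material reaches only 450 MPa.
as_delivered = overload_check(50_000, 100, specified_strength_mpa=600,
                              actual_strength_mpa=450)

print(as_designed)
print(as_delivered)
```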
Diagnostic Tools and Techniques
Failure analysts rely on a range of examination tools, starting with the simplest and moving to more advanced methods as needed.
Visual inspection and optical microscopy almost always come first. Optical microscopes and 3D profilometers reveal surface features, crack patterns, and deformation that point toward specific failure modes. These observations guide decisions about which advanced techniques to use next.
Scanning electron microscopy (SEM) is a standard tool in failure analysis, especially for electronics and semiconductor devices. SEM provides much higher magnification and depth of field than optical microscopes, allowing analysts to examine fracture surfaces, identify the origin point of cracks, and characterize microscopic features like fatigue striations or corrosion products. When paired with energy-dispersive X-ray spectroscopy (EDS, also written EDX), SEM can also identify the elemental composition of contaminants, corrosion deposits, or foreign materials found on a failed part.
For semiconductor and microelectronics failures, the toolkit expands further. X-ray diffraction reveals crystal structure and can detect material defects. Transmission electron microscopy examines cross-sections of layered materials at near-atomic resolution, checking whether fabrication processes produced uniform layers and proper material distribution. Ultraviolet-visible spectroscopy measures the thickness of oxide layers on silicon wafers, verifying whether manufacturing steps like thermal oxidation were performed correctly. These tools transform what might seem like a mysterious electronic failure into a concrete, identifiable manufacturing or design problem.
Where Failure Analysis Is Used
Virtually any industry that designs, manufactures, or operates physical products uses failure analysis in some form. In aerospace, it’s mandatory after incidents and deeply embedded in maintenance programs. Fatigue has been the dominant failure type in aircraft for over fifty years, and decades of investigation by organizations like the UK’s Royal Aircraft Establishment (later renamed the Royal Aerospace Establishment) have shaped modern airworthiness standards.
In oil and gas, failure analysis targets pipeline cracking, pressure vessel failures, and equipment breakdowns that can cause environmental disasters or explosions. The medical device industry uses it both to improve product development and to meet regulatory requirements that demand documented investigation of any device failure. In electronics manufacturing, SEM-based failure analysis is a standard part of integrated circuit fabrication, catching defects before products reach consumers.
Even outside traditional engineering, the principles apply. The same root cause analysis methods used to investigate a cracked turbine blade are used in healthcare to analyze medical errors, in education to identify why student outcomes fall short, and in business operations to diagnose process breakdowns. The core logic is universal: define the problem, trace the causes, fix the system.

