Reducing System Failure Costs in Today’s Digitally Dependent World
As electronics systems become smaller, more complex and more deeply embedded in our daily lives, the consequences of failure have become far more serious. Finding and fixing these failures before they can cause costly downtime, product recalls and reputational damage requires a comprehensive, multidisciplinary approach to electronics system failure analysis that includes specialized tools and expertise.
AN INCREASINGLY DIFFICULT CHALLENGE
There are a number of factors contributing to the growing difficulty in electronic systems failure analysis. These include the increasing complexity of electronics systems, rapid pace of miniaturization, special characteristics of advanced technology processes, intermittent nature of system malfunctions, and challenges associated with the exotic materials used to design and manufacture today’s components.
System complexity is increasing at the board, IC, package and die level. Process technology has advanced from 10-micron processes in the early 1970s to today’s 28-nanometer nodes and below, enabling processors and system-on-chip (SoC) devices to grow to billions of transistors from millions in the 1980s and 1990s. At the same time, previously discrete components and independent subsystems are now being integrated, and we continue to see the rapid miniaturization of electronic components using FinFET, metal gate, low-k dielectric and other advanced process technologies. Packages are more complex, too, including SIP, MCM, SiSub, stacked die, TSV and Cu wire, and we are seeing more complex materials for packages and boards, as well as coatings and molding compounds. Finally, we are seeing a rising incidence of failures whose intermittent nature makes them extremely difficult to diagnose, regardless of their root cause.
Consider today’s typical system in a networking environment. A networking system might contain thousands of components of varying complexity on each of multiple boards, including many complex ICs and SoCs, and a large mix of RF, power supply, high-speed digital and storage media, all residing on a single system and each requiring specialized domain knowledge.
Automotive systems are similarly complex, and frequently contain 30 to 80 computers under the hood, including nearly 20 electronic control units (ECUs) for door locking and unlocking alone. Certain vehicles contain as many as 100 ECUs or more, according to Frost & Sullivan. Each of the car’s electronic systems or devices can contain 50 to 100 microprocessors and more than 100 sensors. Backup cameras and lane-change warning systems are already in widespread use, and automotive manufacturers are also looking at electronic assisted-driving and sensor-guided autopilot systems that will take the wheel for tasks like navigating bumper-to-bumper traffic, driving through toll booths, recognizing speed limits and road signs, finding a space in a crowded garage, or squeezing into a tight parking spot. These systems can encompass a dozen ultrasonic detectors, and multiple cameras and radar sensors.
With complexity comes a higher risk of failure, which increasingly has more expensive consequences. Following are examples showcasing the potential cost of failure in a world that grows increasingly dependent on electronic systems:
- Hardware failures are responsible for 72% of network downtime (source: “Understanding Network Failures in Data Centers: Measurement, Analysis and Implications,” Microsoft and University of Toronto, 2011).
- The cost of an unplanned data center outage can reach $11,000 per minute for organizations that depend on service delivery, including telecom providers and e-commerce companies (source: “Understanding the Cost of Data Center Downtime: An Analysis of the Financial Impact of Infrastructure Vulnerability,” Ponemon Institute and Emerson Network Power, 2011).
- Virgin Blue Holdings announced in 2010 that a complete outage of its reservations and check-in systems harmed profit by up to $20 million.
- Consumer Reports surveyed readers about product reliability in 2010 and found that 36 percent of laptop computers, 32 percent of desktop computers, 15 percent of LCD televisions and 10 percent of plasma TVs fail by their fourth year (source: “What Breaks, What Doesn’t?,” Consumer Reports, 2011).
- The Consumer Product Safety Commission (CPSC) recalled 59 million products during 2012.
- Microsoft announced in 2007 it would pay an estimated $1.05 to $1.15 billion to implement an additional warranty against failures of its Xbox 360, covering previously shipped consoles as well as new systems sold in the future.
- Electronics is expected to represent over 40% of the cost of a car in the future, up from over 25% today (source: “Frost & Sullivan Analysis of the Automotive Test Industry,” August 7, 2013).
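To put the per-minute downtime figure above in perspective, the following minimal sketch scales it to longer outages. It assumes cost grows linearly with outage duration, which is a simplification; the function name and the example durations are hypothetical and chosen only for illustration, and $11,000 per minute is the upper figure cited above, not a typical value.

```python
# Upper-bound figure cited above (Ponemon Institute / Emerson Network Power, 2011).
COST_PER_MINUTE_USD = 11_000


def outage_cost_usd(minutes: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Linear estimate (an assumption): cost = duration x cost per minute."""
    if minutes < 0:
        raise ValueError("outage duration cannot be negative")
    return minutes * cost_per_minute


if __name__ == "__main__":
    # Hypothetical outage durations, for illustration only.
    for minutes in (1, 60, 90):
        print(f"{minutes:>3} min outage: up to ${outage_cost_usd(minutes):,.0f}")
```

Under these assumptions, a one-hour outage at the cited rate would cost up to $660,000, before any recall, warranty or reputational costs are considered.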
SOLVING THE PROBLEM
There are very few options for comprehensive root cause failure analysis and resolution: primarily in-house testing, or third-party services that focus on only part of the problem, with no defined methodology for dealing with system-level failure analysis and debug. In-house teams lack the expertise and toolset to perform complete causal analysis for today’s advanced electronics products. While turning to third-party providers is a better choice, few historically have been able to perform more than a cursory investigation, and even fewer have had exposure to a broad enough range of failures to know where to start and which questions to ask. Optical inspection and task-based cross-section analysis are much easier to do, but they won’t identify the root cause in many cases and are not suited to the advanced technologies being used in products today.
The only way to solve the problem is by working with a provider that takes a comprehensive, multidisciplinary approach which includes both electrical and physical analysis to enhance identification of the root cause, the associated failure mechanism, and how to prevent future failures. The focus must be on the entire system, from electronics to materials, all the way down to failure mechanisms occurring at the IC transistor level. Fig. 1 shows the various levels of analysis that are required in order to find, analyze and resolve electronic system failure mechanisms and their root causes.
Figure 1 An effective failure analysis approach requires attention to each category of potential root causes and failure mechanisms.
Additionally, specialized expertise and equipment are required. Expertise must extend from the component to the system level, with a highly trained staff that has a proven track record conducting the full range of failure analysis investigations from design through production and field returns (see Fig. 2).
Figure 2 End-to-end failure analysis methodology
Following is a list of the analysis expertise required across the range of typical investigations:
- Solder joint integrity
- PCB and board failures
- Contamination and corrosion
- Electrical overstress
- Mold compound delamination
- Manufacturing defects
- Field/customer returns
- ESD failures
- X-ray and non-destructive inspection
- Optical inspection
- Materials characterization
- Electrical characterization
- Thermal resistance measurements
- Temperature mapping
- Resistive vias
- Lithography pattern defects
- Die attach fillet height
- Flip chip underfill voiding
- Gate oxide breakdown
Equipment is another key piece of the puzzle. This includes advanced toolsets that can require as much as $150 million in capital equipment investments. Choosing a provider that has a large and comprehensive set of equipment is critical in order to ensure the right solution for the problem, and to perform parallel processing of large projects with the ability to scale as scope and demand fluctuate. There is also a requirement for system redundancy, and for highly specialized equipment such as advanced high-resolution microscopy imaging systems (SEM, TEM, and dual-beam FIB) that facilitate analysis down to the component level. Additionally, it is critical to be able to characterize failures through laser timing probing that supports real-time, no-loading, non-contact signal waveform acquisition. The ability to localize failures down to a single device also requires nano-probing capabilities for advanced process nodes below 28nm, along with specialized software tools that enable the measurement of any feature of interest on TEM images.
Once the right expertise and tools are in place, optimal analysis requires a comprehensive methodology and plan that spans electrical and physical failure analysis. Fig. 3 shows the typical steps and workflow, starting with a definition of the electrical failure signature, and ending with identification of the failure mechanism and resolution of the problem.
Figure 3 An analysis plan must address the entire system, from electronics to materials, all the way down to failure mechanisms occurring at the IC transistor level.
Customization is also important. Ninety percent of today’s failure issues may be similar from one problem to another, but it’s the last 10 percent that makes all the difference. Every situation, customer, product, and failure mechanism has its own specific characteristics and issues. There is no “one size fits all” approach. Failure identification, analysis and resolution require a methodical approach that starts with asking the right questions up front and then customizing/designing the workflow. Once the workflow is identified, the solution can be quickly executed.
Electronic system failure is becoming increasingly expensive. At the same time, the process of finding and fixing these failures has grown in difficulty with the trend to smaller, more complex systems that are built using exotic materials and advanced technology processes. Failures have also become more intermittent in nature, and yet the stakes have never been higher to quickly find and fix them before they can cause costly downtime, recalls and reputational damage. This requires a comprehensive, multidisciplinary electronic system failure analysis methodology and workflow that considers all possible root causes from the component to system level, while leveraging extensive, specialized expertise and a variety of advanced equipment and toolsets.