Strategies for minimising failures


Derating

In the first part of the unit we touched on the possibility that components could be treated in such a way as to enhance their life expectancy. ‘Derating’ is the name normally given to operating a component inside its normal operating limits, in order to reduce the rate at which the component deteriorates.

Conceptually, it is easy to see that, whilst the component may be specified to operate at high voltage and high temperature, applying those conditions simultaneously would probably be worse than applying either one or the other. Similarly, if a component has a voltage rating such that it will start to fail at, say, 130% of maximum rating, reducing the applied voltage to substantially below the maximum permitted should reduce distress, and by doing so extend the life.

Also, given that reactions are known to proceed faster at higher temperatures, an insight originally shared by Arrhenius, one would predict reduced degradation, and hence extended life, from running a component at lower than its maximum category temperature.

That derating is a practical means of reducing failures is supported by much published literature.

Activity

Spend 20–30 minutes using the Web to find examples of manufacturers’ data sheets and other sources of recommendations about derating components in order to enhance reliability. Try looking for both integrated circuits and passive components.

Compare your findings with the comments below.

This is an activity where the information you collect will depend very much on the search terms you use – some sites will recommend derating as a matter of course; others concentrate on other meanings of the term. Particularly with power devices, derating will be interpreted mostly in terms of power derating, that is reducing the device power as the ambient temperature rises, in order to keep the junction temperature at a safe level.

As with all devices, the junction area runs substantially hotter than the case and ambient, depending on the thermal resistance between junction and ambient, produced by a series of poorly thermally conductive paths in the silicon, through the die bond, in the package, and finally through to any heat sink. Typically you will find an equation of the form:

TJ = TA + (PD × θJA)

where TJ = die junction temperature (°C)
TA = ambient temperature in the vicinity of the device (°C)
PD = total power dissipation (W)
θJA = thermal resistance junction-to-ambient (°C.W−1)
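The relationship between these quantities can be explored numerically. The sketch below assumes the usual form of the relation, TJ = TA + PD × θJA, and uses a hypothetical device (θJA of 50 °C/W and a 125 °C junction limit, neither taken from any data sheet) to show how the permissible power falls as the ambient temperature rises.

```python
def junction_temperature(t_ambient_c, power_w, theta_ja_c_per_w):
    """Estimated junction temperature in °C: TJ = TA + PD * theta_JA."""
    return t_ambient_c + power_w * theta_ja_c_per_w


def max_power(t_junction_max_c, t_ambient_c, theta_ja_c_per_w):
    """Largest dissipation (W) that keeps TJ at or below the given limit."""
    return max(0.0, (t_junction_max_c - t_ambient_c) / theta_ja_c_per_w)


# Hypothetical values for illustration only.
THETA_JA = 50.0   # °C/W, junction-to-ambient
TJ_MAX = 125.0    # °C, assumed junction temperature limit

for ambient in (25, 50, 75, 100):
    allowed = max_power(TJ_MAX, ambient, THETA_JA)
    print(f"TA = {ambient:3d} °C -> maximum PD = {allowed:.2f} W")
```

This is the calculation behind a typical power derating curve: above some ambient temperature the allowed dissipation falls linearly, reaching zero when the ambient equals the junction limit.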

Manufacturers such as National Semiconductor will sometimes point out that there is a strong relationship between junction temperature and failure rate, frequently modelling this as an Arrhenius curve, and predicting perhaps a 10:1 increase in failure rate for a rise in junction temperature from 130°C to 160°C, based on a 1 eV activation energy.
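To get a feel for the Arrhenius model, the ratio of failure rates at two temperatures can be computed directly. The sketch below assumes the common form of the acceleration factor, exp[(Ea/k)(1/T1 − 1/T2)] with temperatures in kelvin; the function name is illustrative and the only inputs are the figures quoted above.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(t_low_c, t_high_c, activation_energy_ev):
    """Ratio of failure rate at t_high_c to that at t_low_c (Arrhenius model)."""
    t_low_k = t_low_c + 273.15
    t_high_k = t_high_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / t_low_k - 1.0 / t_high_k
    )
    return math.exp(exponent)


ratio = arrhenius_acceleration(130.0, 160.0, 1.0)
print(f"Failure-rate ratio for 130 °C -> 160 °C at 1 eV: {ratio:.1f}")
```

With these inputs the ratio comes out at a little over 7, the same order as the 'perhaps 10:1' quoted. The result is very sensitive to the activation energy assumed, which is one reason why predictions of this kind should be treated as indicative.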

There will be other evidence of derating: for example, a ‘soft start’ circuit may be recommended for high-current devices, in order to prevent damage from inrush current.

When we did our own search, we found an interesting comment from Philips Semiconductors that “Exposure to limiting values for extended periods may affect device reliability”.

You may also have come across references to MIL-HDBK-217F, which predicts failure rates for different devices based on the severity of the application, generally using the Arrhenius model.

At the same time you may have read material that leaves you far from convinced that the MIL-HDBK-217F model is fully applicable to temperature, and may well not apply at all to other sources of stress. Lest you think that derating as a practice is not supported by theory, it is worth looking at an alternative view, which strongly supports at least a modicum of derating.

This approach is explained in the Reliability Analysis Center’s Mechanical Applications in Reliability Engineering. This refers to the ‘strength’ of a part, which is a random variable that can be represented by a statistical distribution. Likewise, the stress applied to a part is random, changing with temperature, vibration, transients, shock and other environmental factors, and can also be represented by a statistical distribution.

Figure 1: Relationship between part strength and part stress

after Reliability Analysis Center (RAC)

Figure 1 plots these two probability densities assuming a normal (Gaussian) distribution for both stress and strength – even though such a distribution is probably not totally realistic, the comments still apply. The classical approach has been to select every part to have enough ‘strength’ to handle the worst-case stress conditions, thereby reducing to a minimum the intersection (shaded) areas of the graphs where there is a slight chance that the stress applied to a part will exceed its strength. More recent approaches take into account the probabilistic nature of this ‘interference’ between the two distributions.
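The ‘interference’ between the two distributions can be quantified. The sketch below assumes that stress and strength are independent and normally distributed, in which case the margin (strength minus stress) is also normal and the chance of stress exceeding strength follows from the standard normal distribution. All numerical values are hypothetical and in arbitrary units.

```python
import math


def interference_probability(mean_strength, sd_strength, mean_stress, sd_stress):
    """Probability that stress exceeds strength (independent normal variables)."""
    margin_mean = mean_strength - mean_stress
    margin_sd = math.sqrt(sd_strength**2 + sd_stress**2)
    z = margin_mean / margin_sd
    # P(margin < 0) = Phi(-z), written using the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical part: stress averages 60 units with a spread of 5 units.
baseline = interference_probability(75.0, 5.0, 60.0, 5.0)
stronger = interference_probability(100.0, 5.0, 60.0, 5.0)  # higher mean strength
tighter = interference_probability(75.0, 2.0, 60.0, 2.0)    # less variation

print(f"baseline:         {baseline:.2e}")
print(f"stronger part:    {stronger:.2e}")
print(f"tighter spreads:  {tighter:.2e}")
```

Either moving the means further apart or narrowing the distributions reduces the overlap, which is the basis of the derating strategies discussed next.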

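Using this insight, the four basic strategies for stress derating can be seen to be:

increase the average strength
decrease the average stress
decrease the stress variation
decrease the strength variation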

All of these are possible, but variations are more difficult to control.

The purpose of derating is to protect against these variations, preventing small changes in operating characteristics (usually temperature) from creating large increases in failure rate. Given that the simplest approach is to increase average strength, this will normally be done by procuring a more capable component: for example, choosing a 100V capacitor rather than the 63V type for operation on a 60V line.

The amount of derating that is needed will depend on how well the designer can predict the variation in operating parameters, both before the part is assembled and in the operating environment over the lifetime of the part. Because the sources of variation are extremely difficult to quantify, engineering estimates based on past experience are often used to set the derating level needed.

Not every factor will affect every type of component – Table 1 shows the most significant causes of variation for the performance of different types of component.

Table 1: Principal sources of variation for different types of component

after Reliability Analysis Center (RAC)

source of variation   transistor  diode  IC  resistor  capacitor  inductor  relay
temperature           X           X      X   X         X          X         X
aging                 X           -      -   X         X          -         X
radiation             X           X      X   -         -          -         -
vibration/shock       -           -      -   X         X          X         X
humidity              -           -      -   X         X          -         -
life                  -           -      -   X         X          -         -
electrical stress     X           X      -   -         X          -         -

X: Significantly affected by environment; -: not significantly affected

The principal practical question is what constitutes reasonable derating parameters, and information founded on hard evidence is difficult to obtain. However, an indication of appropriate part derating parameters has been published by RAC, and part of this is reproduced as Table 2.

Table 2: Selected suggested part derating parameters

after Reliability Analysis Center (RAC)

Part type                     Derating parameter                        Severe         Benign
Aluminium electrolytic caps   Voltage (% max rated)                     70%            80%
                              Temperature (°C)                          Tmax − 20°C    Tmax − 20°C
Ceramic capacitors            Voltage (% max rated)                     60%            70%
                              Temperature (°C)                          Tmax − 10°C    Tmax − 10°C
Solid tantalum capacitors     Voltage (% max rated)                     70%            80%
                              Temperature (°C)                          Tmax − 20°C    Tmax − 20°C
                              Reverse voltage (% max fwd)               2%             2%
Signal diodes                 Forward current (% max rated)             90%            <100%
                              Reverse voltage (% max rated)             70%            80%
                              Max. junction temperature                 95°C           115°C
Chip resistors                Power dissipation (% max rated)           50%            70%
Digital MOS and bipolar ICs   Fanout (% max rated)                      90%            <100%
                              Frequency (% max rated)                   90%            <100%
                              Output current (% max rated)              90%            <100%
                              Max. junction temperature                 95°C           115°C
Linear MOS and bipolar ICs    Frequency (% max rated)                   90%            <100%
                              Output current (% max rated)              90%            <100%
                              Max. junction temperature                 95°C           115°C
General purpose relays        Contact current (continuous, % max rated),
                              varies with load type:
                                resistive                               75%            90%
                                capacitive                              75%            90%
                                inductive                               40%            75%
                                motor                                   20%            30%
                                fil. lamp                               10%            20%
                              Contact power (% max rated)               50%            70%
                              Temperature (°C)                          Tmax − 20°C    Tmax − 20°C

Note that the level of derating recommended for a severe environment is greater than for a benign environment, and that the maximum temperatures are also generally lower. There is some variation with device type, but many of the entries are quite generic. Also note that the list of parts contains mechanical components, and that these are often substantially different from electronic parts: a good example is the general purpose relay, where the derating recommended is strongly dependent on the type of load.
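Guidelines of this kind are straightforward to apply as a design check. The sketch below takes the capacitor voltage limits from Table 2 and tests the earlier example of a 60V line. It assumes, purely for illustration, that the capacitor is a ceramic type; the function and dictionary names are hypothetical.

```python
# Voltage derating limits (% of maximum rated voltage) from Table 2.
VOLTAGE_DERATING_PCT = {
    "aluminium electrolytic": {"severe": 70, "benign": 80},
    "ceramic": {"severe": 60, "benign": 70},
    "solid tantalum": {"severe": 70, "benign": 80},
}


def voltage_within_derating(part_type, rated_v, applied_v, environment):
    """True if the applied voltage is at or below the derated limit."""
    limit_pct = VOLTAGE_DERATING_PCT[part_type][environment]
    # Compare as applied * 100 <= pct * rated to avoid fractional rounding.
    return applied_v * 100 <= limit_pct * rated_v


for rated in (63, 100):
    for env in ("benign", "severe"):
        ok = voltage_within_derating("ceramic", rated, 60, env)
        verdict = "acceptable" if ok else "insufficient derating"
        print(f"{rated}V part on a 60V line, {env} environment: {verdict}")
```

On these figures the 63V part would be running at about 95% of its rating, well outside the guideline, whereas the 100V part sits at 60% and meets even the severe-environment limit.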

Most of the discussion has centred on catastrophic failures, so it is worth being reminded that system failures are sometimes caused by parametric drift. Designers are advised to make circuits as tolerant as possible of variations in part parameters. Whilst part data sheets indicate the expected level of parametric drift during environmental exposure, individual parts may vary much more. This means that, unless the design is sufficiently tolerant of drift, the product may not function properly, even though no catastrophic part failure has occurred.

A caveat

Derating is not always good news for the overall system. For example, using a component with a higher rating may mean using a larger case, creating space and weight problems. There may also be implications for cost. As with all design, there will be a trade-off between derating as much as possible and being able to meet manufacturing and marketing objectives.

Nor should it be assumed that working at low voltages is necessarily beneficial. We have already touched on the poor ‘dry circuit’ performance of devices with contacts, and low voltage failures are also experienced with various types of capacitors. Although capacitors go short-circuit, whereas connectors and switches become noisy or open-circuit, the source of the problem is the same: insufficient energy is available to clear the fault.


Electrical protection

Designing a circuit to give long life is partly a matter of using the right components to withstand the stresses applied. But it is also prudent to ensure that those components are guarded against excess current or voltage, and these are the topics of our next section.

Current overload

Whilst over-stress due to uneven heat distribution can only be guarded against by correct manufacture, other kinds of over-current protection can be effective. The table below compares three devices in common use:

Device                  Advantages                               Disadvantages
Fuse                    Will isolate high current                Once-only operation
                                                                 No fault indication
PTC resistor (note 1)   Returns to normal operation once cool    Restricted current range
                                                                 Long trip time
                                                                 No fault indication
Silicon switch          Controllable rise and fault times        Restricted current range
                        Fault indication can be built in

Note 1: A resistor with a Positive Temperature Coefficient of resistance. Typically these switch between low and high resistance states over a relatively short temperature range.

For reliable operation of electrolytic capacitors, it is also important to limit inrush current in applications such as switching circuits and charge/discharge circuits, and this is normally accomplished by building in appropriate series resistance. Note that, like semiconductors, electrolytic capacitor failures are likely to be short-circuit, rather than open circuit, and that the rest of the assembly may need to be protected against destructive failure. For this reason, some tantalum capacitors incorporate built-in fuses.

Transient voltages

To be effective, transient voltage suppression (TVS) devices must activate before system components react catastrophically to transient pulses, and must be capable of dissipating the resultant transient energy, whilst clamping the voltage to a safe level. This is normally done with devices that limit the over-voltage rise by shunting the transient away from protected components, usually to ground. The table below compares four devices in common use, listed in approximately descending order of transient energy clamping capability:

Device                   Advantages                            Disadvantages
Gas discharge arrestor   Very high current capability          High overshoot voltage (200V–2kV)
                         High insulation resistance            Slow response time (several µs)
                                                               Non-restoring under DC
                                                               Limited life
MOV (note 2)             High transient current capability     Clamp voltage 30+V
                         Clamp voltage up to 1.5kV             Performance gradually degrades
                                                               Difficult to make as SM part
TVS thyristor            Does not degrade                      Non-restoring under DC
                         High current handling                 Narrow clamp voltage range (28+V)
                         Fast response time
TVS diode                Does not degrade                      Limited surge current rating
                         Wide voltage range 3V to 400V         Low voltage types have high capacitance
                         Very fast response time (few ns)
                         Readily available as SM part

Note 2: Metal Oxide Varistor: made of a low cost ceramic-like material formed into a disc shape typically 3mm to 20mm diameter, and usually presented as a radial through-hole component.

Protection circuits

Board-level protection guards against residual transients from earlier stages of protection, system-generated transients, and ESD. Transients at this level range from tens of volts to several thousand volts, with peak currents usually of tens of amps. Board-level transient voltage protection is typically provided by:

This protection works on the general principles that:

An implementation for a generic operational amplifier is shown in Figure 2. Here the manufacturer suggests that the value of RS be determined empirically, notes that RFB may be required for high bias current amplifiers, and allows the use of Schottky diodes for faster, tighter clamping, provided that their capacitance and leakage current meet the circuit requirements.

Figure 2: Typical protection circuit for an operational amplifier

At the board level, there may also be a capacitor between the voltage line to be protected and ground, to absorb high frequency transients (‘buffering’). This is especially common in power supply connections to integrated circuits, where a 100nF low-inductance ceramic capacitor is often fitted. This should be located as close as possible to the device being protected.

Similar circuits may be incorporated in the actual integrated circuit. Figure 3 shows a simple resistor-diode configuration for CMOS; Figure 4 shows a more complex circuit incorporating thyristors, which is used to create high-immunity devices.

Figure 3: Input protection circuit for metal gate CMOS

Figure 4: Protection circuit for high-immunity applications

Such circuits are limited in capability, and their connection to the power supply is a particular problem, because destructive voltages can be back-driven into it by multiple transient events. Modifications are therefore needed for equipment inputs, where the transient suppression power requirements are more severe: voltages can exceed several kilovolts, and peak currents range from several hundred to several thousand amps.

Typically, the arrangement is similar in concept, but the diodes are replaced by a back-to-back pair of TVS parts, as shown in Figure 5. For multiple-input use, sets of TVS parts are supplied in integrated packages, but these are chip arrays, rather than monolithic ICs, in order to maintain isolation. Unfortunately, however, suitable TVS devices are not available for logic applications below 3.3V.

Figure 5: Protection circuit for an external bus application using transient voltage suppression devices

ESD management

So far, our discussion on electrical protection has focussed on problems either within the circuit or conducted into the circuit along the leads, and caused by such phenomena as switching transients. Particularly given the extra sensitivity of modern electronic components, it is worth reminding ourselves that very high electrostatic potentials can be generated by triboelectric effects on clothing, packaging material, and automatic handling and assembly equipment. If these are discharged into sensitive components, either directly by contact with their pins, or via conductors in the system, damage or destruction is likely.

Electrostatic discharge (ESD) can destroy some active components even when the circuits are not powered. MOS devices are particularly vulnerable, and require special protection, both on chip and externally. Antistatic precautions are therefore necessary during all operations involving handling, both during assembly and afterwards.

Although assembled components are generally less vulnerable, with some protection given by the circuits in which they are embedded, both test and maintenance procedures also have to take account of potential ESD sensitivity.

If you are unsure about how to implement protective measures against ESD, now would be a good time to read or review ESD control during manufacture.

Self Assessment Questions

You are laying out and advising on the design of a motor control circuit for use in an electrically noisy industrial environment, where it can also experience fluctuations in operating voltage. Identify and explain as many ways as possible of enhancing product reliability.

Compare your answer with this one.


Track design

Part of any attempt to minimise failure rate must be to examine the detail of the track design. Ensuring that the circuit operates correctly for reasons of signal integrity is beyond the scope of this module, but you will need to make sure that the tracks you design are able to withstand both normal operating currents and any anticipated higher currents during start-up or fault conditions. The key is to ensure that the temperature rise on the conductor is kept within bounds.

Activity

Before reading further, try to identify those factors that you believe determine the temperature rise of a conductor. Your list is likely to start with the three items on which IPC-2221 is based.

Review your answer as you read the Coretec paper at this link.

Having read the paper, you should have a better understanding of the issues. In this particular case it seems that the industry has been too conservative in its design of traces.
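For a first estimate, the conductor-sizing charts associated with IPC-2221 are often approximated by a curve fit of the form I = k × ΔT^0.44 × A^0.725, with the cross-sectional area A in square mils, and k taken as 0.048 for outer layers and 0.024 for inner layers. The sketch below uses that widely quoted approximation. The constants are an assumption from common design calculators, not values taken from the paper, and the result should be treated as indicative only.

```python
def ipc2221_current(width_mil, thickness_mil, temp_rise_c, internal=False):
    """Approximate track current (A) for a given temperature rise.

    Uses the commonly quoted curve fit to the IPC-2221 charts:
    I = k * dT**0.44 * A**0.725, with A in square mils.
    """
    k = 0.024 if internal else 0.048
    area_sq_mil = width_mil * thickness_mil
    return k * temp_rise_c**0.44 * area_sq_mil**0.725


# Illustrative case: 10 mil track in 1 oz copper (about 1.378 mil thick),
# allowing a 10 °C rise above ambient.
outer = ipc2221_current(10.0, 1.378, 10.0)
inner = ipc2221_current(10.0, 1.378, 10.0, internal=True)
print(f"outer layer: {outer:.2f} A, inner layer: {inner:.2f} A")
```

For this case the fit gives a little under 1A on an outer layer and half that on an inner layer. The three inputs (current, cross-section and temperature rise) are all that the approximation considers, which is part of the reason it can turn out to be conservative.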

Another design factor important for reliability is the spacing of conductors. Here the guidance document is Section 6.3 of IPC-2221. The clearance requirements depend on the peak voltage, but also on the environmental conditions, specifically low air pressure, and on whether or not the track is protected.

However, the kind of failure which IPC-2221 has in mind is corona and other forms of high-voltage, high-current breakdown, rather than the deterioration of surface insulation resistance. Be aware that there are many high impedance circuits where even a modest reduction in insulation resistance may result in parametric failure. In such circuits, spacings well above those in the standard are recommended.


Humidity protection

Of course you will remember that relatively quick breakdown may happen at comparatively small voltage gradients if the surface is full of moisture, has some ionic contamination, and a mobile metal species is available – watch those dendrites grow!

So how can we improve insulation resistance and high-voltage breakdown performance, and protect against electromigration? The answer has to be some kind of humidity protection. Sometimes this will take the form of an enclosure, though few enclosures are sufficiently hermetic to protect a board against the effects of hot, humid air. Another approach, which finds favour for severe applications such as automotive and military, is to provide the board with a conformal coating. A wide variety of materials is available, and several different application methods. Before you tackle the next activity, read the Concoat paper at this link.

Activity

What are the issues relating to the use of conformal coating that will impact on your design, as viewed from a process, manufacturing and servicing perspective?

Compare your answer with this one.


Protecting against mechanical failure

Enhancing robustness

Having selected the right components, placed them on a correctly designed board, and protected them against the electrical and moisture environment, we come to the final strategy for minimising failure, which is to protect against mechanical failure. In general, we are trying to enhance the robustness of the assembly without compromising weight or cost, and this task has both design and material choice implications. We are considering structural issues, because the printed circuit assembly is only part of a total product solution, and failure in the enclosure may well cause the assembly inside also to fail.

Issues of fatigue

An electronic product is more than just a board – frequently there will be an enclosure which needs to be fit for its intended purpose of protecting the electronics within, as well as meeting customer expectation for appearance. Whilst most enclosures are more than adequately strong, fatigue failure may be an issue with elements such as fastenings, hinges, and other elements which create an integrated structure from a set of components. The question is how much effort and cost it is worth putting into resolving potential problems. As with the strength:stress relationship discussed under derating, we can choose to design elements to withstand potential fatigue by ensuring that the stress never exceeds the critical stress, or else we can design for a limited ‘safe’ life.

Given wide variation in the sensitivity of structures to stress and environment it is not easy to ensure that failure will never occur. However, we can improve the quality of our product by taking time to consider as wide a range as possible of the events that are likely to happen, and the consequent fault modes. Depending on the application, we may also want to consider the likely abuse of the product – ask any maker of vending machines!

At a finer level of detail, we also need to consider stress concentrations in the enclosure, paying careful attention to the design of holes, fixings, corners and fillets, in order to control the stress distribution.

The distribution of stress, and the avoidance of stress concentrations, are equally important when mounting boards. The danger time is during initial assembly and disassembly for servicing. A realistic assessment must be made of how these tasks are going to be carried out and where the resulting stresses will occur. There are also tolerance issues to consider – will the board fit easily into its intended position, or be removed from it, without applying excessive force? Particularly with plastic parts, we need to be aware that some materials will shrink and distort with time, so that there is more than CTE difference to consider.

Vibration and resonance

Amongst other things, component assemblies can be subjected to vibration and shock during use, transport and maintenance. This can cause fracture due to fatigue or mechanical overstress, wear on components such as connectors, and the loosening of fastenings.

Vibration can be generated by reciprocating or rotating machinery, by wheel vibration on vehicles, and by acoustic noise. Vibration may happen at a fixed frequency, at different frequencies over time, or simultaneously over a range of frequencies, and can occur in or about different linear or rotating axes. The important measures of vibration are frequency, displacement (generally defined as peak-to-peak values), velocity and acceleration.

Shock is a particular type of vibration input, with relatively high intensity and frequency, but for short intervals, and the amplitude of any induced vibration is usually attenuated by inherent or applied damping.

Every structure has one or more resonant frequencies, and printed circuit boards are no exception. If the vibration input occurs at these frequencies, or at their harmonics, the displacements due to vibration will be maximised. The locations on the board at which zero vibration displacement occurs are called nodes, and the maximum displacement amplitudes occur at the antinodes. In order to avoid premature failure, we need to remove or at least damp any resonances.

The resonant frequency of a structure rises with its stiffness and falls with its inertia: for a simple mass-on-spring system it is proportional to the square root of the stiffness divided by the mass. In order to ensure that resonant frequencies are well above any input vibrations that may be applied, the structures need to be sufficiently stiff, especially when they contain relatively heavy parts, such as large components on circuit boards. This has an implication for the mechanical strength and rigidity of the board itself, as well as for the way in which the board is mounted. With a trend to thinner boards, it may be necessary to provide additional stiffening.
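The influence of stiffness and mass can be illustrated with the simplest possible model, a single mass on a spring, for which the natural frequency is f = (1/2π) × √(k/m). A real board has many modes and depends on how its edges are supported, so the sketch below is only a guide to the trends. The stiffness and mass values are hypothetical.

```python
import math


def natural_frequency_hz(stiffness_n_per_m, mass_kg):
    """Natural frequency (Hz) of a single mass-on-spring system."""
    return math.sqrt(stiffness_n_per_m / mass_kg) / (2.0 * math.pi)


# Hypothetical effective values for a small board assembly.
stiffness = 40_000.0  # N/m
mass = 0.10           # kg

base = natural_frequency_hz(stiffness, mass)
stiffer = natural_frequency_hz(2.0 * stiffness, mass)  # e.g. added stiffener
heavier = natural_frequency_hz(stiffness, 2.0 * mass)  # e.g. large component

print(f"baseline:        {base:.0f} Hz")
print(f"double stiffness: {stiffer:.0f} Hz")
print(f"double mass:      {heavier:.0f} Hz")
```

Doubling the stiffness raises the frequency by a factor of √2 rather than 2, and doubling the mass lowers it by the same factor, so substantial stiffening may be needed to move a resonance clear of the input vibration.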

Look after the surface

Failure often starts with surface defects, and protective techniques used for structural materials include surface treatment to relieve stresses (shot peening, heat treatment) and increasing surface toughness (nitriding of steels; heat treatment). Given the normally much smaller scale of electronic structures, this generally translates into choosing suitable stress-free materials and providing an effective protective coating.

With the pressure to cut costs, however, there are increasing moves to using ‘self-finished’ materials such as plastics and precoated metal, rather than passivating, plating or coating the finished part.

Care has to be taken in manufacture and maintenance to ensure that surfaces are not damaged by scratches, nicks or impact. Although surface damage will reduce the fatigue strength in a stressed component, the major issue with electronic products is the cosmetic impact of such damage. As a result designers have to think about how to protect surfaces and prevent damage. Typical solutions involve specifying final packaging in detail, and, on critical areas such as displays, using temporary films that are only removed by the end user.

Solder joint design

Although eventual mechanical failure is inevitable, we need to protect solder joints against excessive loading in order to achieve the desired service life. Avoiding board flexure is a major contributor. Other ways in which joints can be supported include:

Protecting components from stress

Up to the functional test stage, small boards are usually processed in panels, each of which contains several circuit boards. This reduces process time, and makes the assembly easier to handle. However, separating individual circuits from a mother board introduces additional stress near corners and edges, and ‘de-panelling’ is probably the stage at which most fractures occur. The extent of the problem depends on the method used, and whether or not the components have compliant leads: the problem is worst with chip capacitors.

Depending on design/production volume, the panel may just have been scored by the PCB manufacturer, or pressed or routed, so that the circuits are held together by narrow laminate webs or ‘tabs’. Pre-routing panels before assembly can minimise board deflection, but component defects will still be concentrated near the shear lines, even when care has been taken to support the assembly.

Quote

Most of the cases (of ceramic capacitor cracks) currently being examined are flexure failures, and this has been the most common cause of failure over the last couple of years. Board flexure is one of the main causes of failure – if you break out multi panels using bolt cutters across a bench, or just snap vee-scored boards, what stress is being applied?!

Bob Willis of Electronic Presentation Services (1998)

The key is to avoid stressing the board, so de-panelling should preferably not be undertaken by hand, as the operation is difficult to control, even when a suitable fixture is used. Suitable available methods include:

The second of these methods has the practical difficulty that it produces dust, which has to be extracted for safety reasons, is cosmetically undesirable and, if not removed, will inhibit the adhesion of any conformal coatings used.

Other options which are being explored include routing with high pressure water jet and laser cutting (for thin laminates).

Stress countermeasures

There are no industry standards which cover how much bending or deflection is allowed before component damage occurs. Some chip manufacturers suggest a 1500mm bend radius, which allows little or no bending in short segments, but relatively large deflections for longer boards. This is more realistic than a linear mm/m specification, which becomes unusable with assemblies of even moderate size, although Rawal of AVX suggests 0.1mm/cm or 3mm for 10cm.
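The effect of specifying a bend radius rather than a deflection per unit length can be checked with simple geometry. The sketch below assumes the board is bent into a circular arc and that the deflection is measured at mid-span relative to the line joining the two ends. That interpretation is an assumption for illustration, not part of any manufacturer's guideline.

```python
import math


def midspan_deflection_mm(board_length_mm, bend_radius_mm=1500.0):
    """Mid-span deflection of a board bent to a circular arc of given radius."""
    half_length = board_length_mm / 2.0
    return bend_radius_mm - math.sqrt(bend_radius_mm**2 - half_length**2)


for length in (25.0, 50.0, 100.0, 200.0, 300.0):
    deflection = midspan_deflection_mm(length)
    print(f"{length:5.0f} mm board: {deflection:.2f} mm deflection")
```

At a 1500mm radius a 100mm board deflects by less than 1mm, whereas a 300mm board deflects by about 7.5mm: the permitted deflection grows roughly with the square of the length, which is why a single linear figure does not scale well.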

Assemblers should audit the process so that they are aware of the ways in which unintended flexure might occur. Manufacturing countermeasures include:

Design methods available to reduce vulnerability to fracture include:

Pad design may also play a part: Murata found that capacitors are more at risk if they are not sitting squarely on their solder pads, so pads should be designed to be only marginally wider than the component, and the stencil arranged to avoid excess solder paste.

Figure 6: Guidelines for positioning MLCs on multi-assembly panels

Figure 7: Guidelines for positioning MLCs for minimum stress

Self Assessment Questions

Remind your colleagues of the likely failure mechanisms in ceramic chip capacitors in a surface mount assembly, and why these can have long term reliability implications. What precautions should you as a designer take to minimise the risk of component failure?

Compare your answer with this one.

A final thought

For large structures, civil engineers design for ‘fail safe’: that is, they make sure that the load can either be taken by other parts of the structure or the effect of failure otherwise allowed for, until the failed part can be detected, repaired or replaced. For example, look at how many wires join the top and bottom sections of the Millennium Bridge across the Tyne! Apart from the aesthetic aspects of the design, these afford multiple redundancy.

The idea of ‘fail safe’ also translates into ways in which we can tackle failure in electronics. Although this unit has concentrated on preventing failure, we have already seen some evidence of the fail-safe approach, such as capacitors with in-built fuses. A key tool here is Failure Mode, Effects and Criticality Analysis (FMECA) (or simply FMEA), which is a formalised design review technique that focuses the development of products and processes on areas that will reduce the risk of product field failure. There will be more on this in our module Design for eXcellence.