Design for Thermal Issues

Unit 3: The effects of heat

In this Unit we start by considering the changes in component characteristics with temperature, and their effect on equipment. We then consider the mechanical effects of temperature change, and how this impacts on reliability. As part of our study we will be looking at actions that can be taken during design and manufacturing to make sure that a product will be reliable over its lifetime.

Introduction

Some components on a PCB will generate little heat, others a great deal; a CPU in a PC can generate up to 100W! In addition to any internally generated heat, the temperature experienced by any part of a PCB will be affected by

If this heat is not removed, devices will become hotter, and their behaviour may change significantly, so that the circuit no longer behaves in the way it was designed to, and may even fail. Also, whilst high temperatures themselves can be damaging, repeated operational temperature variations cause thermally-induced mechanical stresses which can be even more harmful.

Such ‘thermal cycling’ may be quite severe, even in everyday applications. For example, circuitry housed near the engine of a car experiences both the high temperatures developed by the engine when the car is being driven, and rapid cooling when the engine is switched off. But repetitive thermal cycling occurs in all electronic devices to some extent; even a desktop PC is stressed every time it is switched on for a period and then turned off.

Quote

A computer that is turned on twice a day, every day for 15 years, will accumulate about 11,000 thermal fatigue cycles. A television that is turned on 10 times a day, every day for 15 years, will accumulate about 55,000 thermal fatigue cycles. An automobile that is started 10 times a day, every day for 20 years, would accumulate 73,000 thermal stress cycles. A satellite in orbit around the Earth experiences a thermal cycle about every 90 minutes. In 20 years it can accumulate about 117,000 thermal cycles.


Environment failure rate studies in military aircraft have shown that about 55% of all the electronic failures are related to thermal events such as high temperatures and thermal cycling.

Dave Steinberg, Preventing thermal cycling
and vibration failures in electronic equipment
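
The cycle counts in the quotation are simple products of cycles per day and service life, and can be checked in a few lines. This is only an illustrative sketch: the function name and the 365-day year (leap days ignored) are our assumptions, not part of Steinberg's text.

```python
# Rough check of the cycle counts quoted above.
# Assumption: a 365-day year, ignoring leap days.
DAYS_PER_YEAR = 365


def lifetime_cycles(cycles_per_day, years):
    """Total thermal cycles accumulated over a service life."""
    return cycles_per_day * DAYS_PER_YEAR * years


computer = lifetime_cycles(2, 15)
television = lifetime_cycles(10, 15)
automobile = lifetime_cycles(10, 20)
# One orbit every 90 minutes gives 24 * 60 / 90 = 16 cycles per day.
satellite = lifetime_cycles(24 * 60 / 90, 20)

for name, count in [("computer", computer), ("television", television),
                    ("automobile", automobile), ("satellite", satellite)]:
    print(f"{name}: {count:,.0f} cycles")
```

The results (10,950, 54,750, 73,000 and 116,800) agree with the rounded figures in the quotation.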

 



Effects of temperature change on device characteristics

Devices change with temperature by two different mechanisms:

The effects on intrinsic properties are generally reversible with temperature, unless damage is done to the device, although there is some time-displacement of the recovery of ceramic capacitors taken beyond their Curie temperature (see below). These device changes can be allowed for in the electronic design, and have no effect on the failure rate, although we will see issues of balance between device characteristics that may impact on circuit performance.

The timescale for changes in characteristics varies from extremely short-term to very long-term: devices subjected to pulses of power will display transient changes in characteristics; most components operated for an extended period, especially at high temperatures, will exhibit a greater or lesser degree of drift. Again, these changes are predictable, so can be allowed for when setting component tolerances during the simulation exercise that is part of any professional circuit design.

Resistors

The electrical resistance of a body depends on its dimensions and the materials from which it is made. Although the dimensions will vary with temperature, the factor that primarily determines the temperature coefficient of resistance (TCR) of the body is the temperature dependence of its conductivity. The conductivity of a material is given by:

\sigma  = q\mu

where q is the carrier concentration in coulombs/m³ and µ is the mobility of the carriers in m²V⁻¹s⁻¹. Mobility is a measure of the ease with which carriers can move through the lattice, and they do not move in a straight line, but are influenced by lattice defects, impurities, and grain boundaries. Mobility reduces with temperature because, as the temperature increases, the carriers become more active and undergo more collisions.

Within a conductor, all the atoms are ionised, and the supply of electrons is virtually constant with temperature; the mobility is due almost entirely to ionic scattering and depends on the characteristics of the particular material. The temperature coefficient of resistance is the slope of the curve of resistance against temperature normalised to the resistance at a reference temperature:

\alpha  = \frac{1}{R}\frac{{dR}}{{dT}}

where R is the resistance at temperature T and α is the TCR. To allow for some non-linearity, TCR is usually presented as an average over a range of temperature, calculated (in units of ppm/°C) from:

\alpha  = \frac{{R_{T2}  - R_{T1} }}{{R_{T1} (T_2  - T_1 )}} \times 10^6

where R_{T2} is the resistance at temperature T_2, and R_{T1} is the resistance at temperature T_1, the reference (lower) temperature. In some cases, particularly where the TCR is non-linear, two values are given, for a ‘hot TCR’ usually over 25°C to 125°C, and a ‘cold TCR’ between −40°C and 25°C. Because q is constant and µ reduces, the TCR of conductors is typically positive, and generally in the range 1,000–6,000ppm/°C. Some typical values are shown in the table below; they are relatively constant over the temperatures of general interest.

Material     Resistivity (µΩ-cm)    TCR (ppm/°C)
Alloy 42     66.5                   1,000
aluminium    2.83                   3,400
copper       1.72                   3,900
kovar        48.9                   3,700
nickel       7.80                   6,000
silver       1.63                   3,800
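
As a rough illustration of how the average TCR formula is used, the sketch below applies the copper value from the table to a hypothetical 1Ω conductor and then recovers the TCR from the two resistance values. The function names and the 1Ω starting value are our own assumptions, and a constant (linear) TCR is assumed over the whole range.

```python
def average_tcr_ppm(r_t1, r_t2, t1, t2):
    """Average TCR in ppm/°C between T1 (the reference) and T2."""
    return (r_t2 - r_t1) / (r_t1 * (t2 - t1)) * 1e6


def resistance_at(r_ref, tcr_ppm, t_ref, t):
    """Resistance at temperature t, assuming a constant (linear) TCR."""
    return r_ref * (1 + tcr_ppm * 1e-6 * (t - t_ref))


# Hypothetical copper conductor: 1.000 ohm at 25°C, TCR 3,900 ppm/°C (table).
r_cold = 1.000
r_hot = resistance_at(r_cold, 3900, 25, 125)
print(f"resistance at 125°C: {r_hot:.3f} ohm")

# Applying the average TCR formula to the two values recovers the TCR.
print(f"average TCR: {average_tcr_ppm(r_cold, r_hot, 25, 125):.0f} ppm/°C")
```

A 100°C rise therefore increases the resistance of a copper conductor by about 39%, which is far from negligible for a track carrying significant current.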

The materials used for resistors (other than wirewound types) generally have a more complex structure. That for a thick film resistor is indicated in Figure 1. Because there is a mixture of metallic and semiconductor contacts, the TCR is a non-linear function of temperature and may change polarity across the temperature range of interest. As a result, hot and cold TCRs for the same resistor may vary in magnitude as well as polarity. A typical value of TCR is ±100ppm/°C, but this varies with materials. As the semiconductor mechanisms become more dominant at low temperatures, the TCR is often significantly more negative at low temperatures.

Figure 1: Contacts within thick film resistive material

after Sergent and Krum, 1998

Metal film and carbon film resistors also have TCRs that can be very variable and are a function of both the material and the deposition parameters. The TCR becomes more nearly constant with temperature as the films become thicker, but you should not assume linearity.

The other resistor parameter of importance for the design is the stability of the resistor, which is usually defined as the permanent change that takes place in resistor value after exposure to elevated temperature for a period of time. In many instances storage at 150°C for 1,000 hours is taken as representing end-of-life.

{\rm{Stability (\% )}} = \frac{{\Delta R}}{R} \times 100

Most resistors are extremely stable, depending on technology and whether or not the component has been trimmed. A thick film resistor may drift less than 0.1% in 1,000 hours, whereas its trimmed counterpart might exhibit drift up to 0.5%.

We have seen that metals generally have a positive TCR, and that practical resistors have lower values of TCR produced by combining metallic and semiconductor elements. In consequence, practical resistors made of the same materials exhibit a spread in TCR between components and between batches. They may also have a spread in temperature of operation, depending on their position on the board. For many purposes, this is unimportant, but resistor pairs are often used as voltage dividers or for gain-setting. If we take an operational amplifier in a typical inverting configuration, as shown in Figure 2, the gain is set approximately by:

G = \frac{{V_o }}{{V_s }} = \frac{{R_o }}{{R_s }}\angle 180^\circ

where \angle 180^\circ indicates that phase reversal has taken place.

Figure 2: Basic inverting amplifier


If the two resistors ‘track’ perfectly, then the only effect on gain will be caused by any temperature changes to the open-loop gain of the operational amplifier. However, if the match is not perfect, there will be an error. We will be seeing this in an exercise further on in the Unit.
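
To give a feel for the size of the tracking error, the following sketch compares the gain magnitude R_o/R_s at a reference temperature with its value after both resistors have warmed up. It is a minimal illustration only: the resistor values, TCRs and temperature rise are hypothetical, each TCR is assumed constant, and the amplifier itself is treated as ideal.

```python
def resistance_at(r_ref, tcr_ppm, delta_t):
    """Resistance after a temperature rise delta_t, assuming a constant TCR."""
    return r_ref * (1 + tcr_ppm * 1e-6 * delta_t)


def gain_error_percent(r_o, r_s, tcr_o_ppm, tcr_s_ppm, delta_t):
    """Percentage change in |G| = R_o / R_s caused by a temperature rise."""
    g_ref = r_o / r_s
    g_hot = (resistance_at(r_o, tcr_o_ppm, delta_t)
             / resistance_at(r_s, tcr_s_ppm, delta_t))
    return (g_hot / g_ref - 1) * 100


# Hypothetical x10 stage: R_o = 100 kohm, R_s = 10 kohm, both warmed by 60°C.
tracking = gain_error_percent(100e3, 10e3, 100, 100, 60)    # identical TCRs
mismatched = gain_error_percent(100e3, 10e3, 100, 50, 60)   # 50 ppm/°C apart
print(f"perfect tracking:   {tracking:+.4f}%")
print(f"50 ppm/°C mismatch: {mismatched:+.4f}%")
```

To first order the error is simply the TCR difference multiplied by the temperature rise, here 50ppm/°C × 60°C = 3,000ppm, or about 0.3%. The absolute TCR of the pair does not matter, only the difference between the two.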



Capacitors

The chip ceramic capacitor is made with a range of materials and additives – its general construction is shown at this link. All these materials have a Curie point at which a phase change takes place, and this is associated with a change in dielectric constant. The types of dielectric used to make capacitors of higher value have higher and less stable dielectric constants, and a range of additives is used both to shift the Curie temperature and to depress the peak value of dielectric constant (Figure 3).

Figure 3: Temperature dependence of barium titanate

after Sergent and Krum, 1998

As you will see from Figure 3, NPO types have a stable dielectric constant throughout the normal range of working temperature, which corresponds with Figure 4, where the change of capacitance with temperature is shown as essentially zero, typically ±30ppm/°C.

Figure 4: MLCs: different dielectrics

As with resistors, the temperature coefficient of capacitance (TCC) is defined by:

{\rm{TCC}} = \frac{{C_{T2}  - C_{T1} }}{{C_{T1} (T_2  - T_1 )}}

where C_{T2} is the capacitance at temperature T_2, and C_{T1} is the capacitance at temperature T_1, the reference (lower) temperature, usually 25°C. High values of TCC may be expressed in % rather than parts per million – capacitors with high dielectric constant are far from stable! Because TCC is a function of temperature, it is often expressed as ‘hot TCC’ or ‘cold TCC’, using room temperature as the reference.

Different classes of materials are specified according to their temperature coefficients, and Figure 4 indicates typical behaviour. Dissipation factor is also strongly dependent on temperature, but is similar for all dielectric types. Typical values are 4% at −55°C, falling to 1.5–2% at 25°C and to 1% at 150°C.

Also related to the dielectric type is the ageing rate. After a capacitor has been taken above its Curie point, it will lose value at a constant rate per decade hour once its temperature has fallen below the Curie point. NPO capacitors do not exhibit this effect, because they are always used above their Curie point, but X7R dielectric might lose 1.5–3% per decade hour, and double this for Z5U formulations. This effect is fully reversible, so that every time the capacitor is taken above its Curie point (for example, when carrying out a rework soldering operation) the value will revert to its highest point. The loss of value that takes place is not affected by temperature unless the temperature of the part exceeds the Curie point.
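
A constant loss ‘per decade hour’ means that the capacitance falls linearly with the logarithm of elapsed time. The sketch below illustrates this for a hypothetical 100nF part; the function name, the 1-hour reference time and the 2% rate (taken from within the X7R range quoted above) are assumptions made for the example, not manufacturer data.

```python
import math


def aged_capacitance(c_ref, rate_percent_per_decade, hours, ref_hours=1.0):
    """Capacitance after ageing for `hours` since last exceeding the Curie point.

    c_ref is the value at ref_hours; the loss is assumed to be a constant
    percentage of c_ref for every tenfold increase in elapsed time.
    """
    if hours < ref_hours:
        raise ValueError("model only applies from the reference time onwards")
    decades = math.log10(hours / ref_hours)
    return c_ref * (1 - rate_percent_per_decade / 100 * decades)


# Hypothetical X7R part: 100 nF one hour after soldering, ageing at 2% per decade hour.
for hours in (1, 10, 100, 1000, 10000):
    print(f"{hours:>6} h: {aged_capacitance(100.0, 2.0, hours):.1f} nF")
```

Because the scale is logarithmic, most of the loss occurs early: the part loses as much between 1 and 10 hours as it does between 1,000 and 10,000 hours.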

Electrolytic capacitors have high values, but are typically used in configurations where the exact value is unimportant. However, a typical tantalum capacitor might change by ±10% over its working temperature range. Perhaps more important is the increase in leakage current that takes place, typically 10-fold over +25 to +85°C; above that temperature, reliability issues start to become important, depending on the capacitor construction.

Inductors, like resistors and capacitors, also change with temperature. Detailed consideration is beyond the scope of this module, but there is information on inductors and ferrite materials at this link. The soft ferrites used for coils and transformers have a permeability that is a function of temperature and frequency, but the variations with temperature are generally small over the range of interest.



Semiconductors

In contrast to a metal, the carrier concentration in a semiconductor is an exponential function of temperature and increases rapidly as the temperature rises, whilst the mobility is a function of different kinds of scattering and generally decreases. The conductivity of a semiconductor can be approximated by:

\sigma  = k e^{cT} T^{-3/2}

where k and c are material constants. This is shown normalised in Figure 5.

Figure 5: Normalised conductivity for a semiconductor as a function of temperature

after Sergent and Krum, 1998

Note that the change is non-linear, and the plot is of conductivity, so that the resistance decreases with temperature; that is, semiconductors have a negative TCR.

But temperature affects parameters other than the conductivity of the silicon itself. In an ideal diode, the current resulting from diffusion is given by:

I = I_s (e^{qV/kT} - 1)

where I is the diode current, I_s the reverse saturation current (the current created by the random thermal generation of hole-electron pairs), V the diode voltage, q the electron charge, k Boltzmann’s constant and T the absolute temperature. Manipulation of this equation, and normalising to obtain the relative change, generates a plot of percentage change against temperature as shown in Figure 6.

Figure 6: Percentage change in I_s vs temperature

after Sergent and Krum, 1998

At around room temperature (300K), the percentage change in I_s is about 8% per degree. Allowing for some non-ideal behaviour, the reverse saturation current at room temperature doubles with a change in temperature of approximately 10°C. Figure 7 shows the resultant voltage-current relationship for an ideal diode.

Figure 7: Current against temperature for a silicon diode

after Sergent and Krum, 1998
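
The ‘doubles every 10°C’ rule of thumb for reverse saturation current is easily turned into a scaling function. The sketch below is illustrative only: the function name and the 1nA starting value are hypothetical, and the rule itself is an approximation around room temperature rather than an exact law.

```python
def saturation_current(i_s_ref, t_ref_c, t_c, doubling_interval_c=10.0):
    """Scale a reverse saturation current, assuming it doubles every 10°C."""
    return i_s_ref * 2 ** ((t_c - t_ref_c) / doubling_interval_c)


# Hypothetical diode: I_s = 1 nA at 25°C.
for temp_c in (25, 45, 65, 85):
    i_s = saturation_current(1e-9, 25, temp_c)
    print(f"{temp_c}°C: {i_s * 1e9:.0f} nA")

# For comparison, the ideal figure of 8% per degree compounds over 10°C to
# a factor slightly above 2, which is why the practical rule is 'doubling'.
print(f"1.08 ** 10 = {1.08 ** 10:.2f}")
```

A 60°C rise corresponds to six doublings, so the leakage of this hypothetical diode increases 64-fold between 25°C and 85°C.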

Not only does the forward voltage of a diode change with temperature, but so do the characteristics of transistor junctions. Figure 8 illustrates the typical change with temperature in β, the common-emitter current gain, for a small signal transistor.

Figure 8: Normalised current gain against temperature for a small signal transistor

after Sergent and Krum, 1998

Unfortunately β is not the only parameter that changes, and this can have serious effects for bipolar power devices. The collector current also increases with temperature, which can cause instability in the circuit operating point, depending on the configuration.

In a circuit where a constant voltage is maintained across a base-emitter junction, elevated temperatures can produce a phenomenon called ‘thermal runaway’, in which the increased heat causes the collector current to rise; in turn, this causes the temperature to increase further, raising the current still more. The end result can be failure of the transistor as a result of excessive heat, excessive current, or both. These problems are usually minimised by placing a resistor in the emitter circuit. As the current increases, the voltage across the emitter resistor increases, lowering the effective base-emitter voltage, which acts to reduce the current, thus providing protective negative feedback.

In power applications, it is common to operate transistors in parallel to increase the available current. If two discrete transistors are not perfectly matched, both electrically and thermally, one will tend to carry more current than the other, resulting in a rise of temperature that further increases the imbalance. To avoid one transistor carrying the bulk of the current, resulting in failure, negative feedback is usually fitted in the form of low-ohm resistors in the emitter circuits. Unfortunately, this approach limits the available gain and the resistors dissipate a significant amount of power, which lowers the overall efficiency of the circuit.

Even when the power rating is minimal, differences in temperature between matched devices may impact on overall circuit performance. A simple example of this is the long-tailed pair shown in Figure 9.

Figure 9: Basic configuration of a long-tailed pair


Here the balance between the two halves of the circuit will be impaired if the two transistors forming the long-tailed pair are not in intimate thermal contact. Before the days of integrated circuits, when such configurations were created using discrete transistors, it was not uncommon for the two transistors to be coupled physically by a common heat sink. This is less of a problem with an integrated circuit implementation of the circuit, but bear in mind there can be substantial gradients within an integrated circuit die, as shown in Unit 2.

Figure 10: Typical output characteristics of a small-signal n-channel JFET

after Sergent and Krum, 1998

FETs have different operating characteristics from bipolar devices, being primarily voltage-driven as indicated by Figure 10. In the ohmic region, the current is given by:

I_D  = 2bwqN_D \mu _n \frac{{V_{DS} }}{L}

where b is the channel height, w the channel width, q the electron charge, N_D the donor concentration, µ_n the electron mobility, V_{DS} the drain-source voltage and L the channel length. The only temperature-sensitive term is the mobility, which varies with temperature as T^{-3/2}, as with bipolar devices.

In the saturation region, a parameter r_{DS(ON)} is used to describe the device characteristics of both JFETs and MOSFETs:

r_{DS(ON)}  = \frac{{V_{DS} }}{{I_D }} = \frac{L}{{2bwqN_D \mu _n }}

r_{DS(ON)} values may vary from a few ohms for small-signal devices to a few milliohms for power MOSFETs. This parameter increases with temperature because the mobility decreases.
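
Since the on-resistance is inversely proportional to mobility, and mobility varies as T^{-3/2}, the idealised model above implies that the on-resistance scales as T^{3/2} in absolute temperature. The sketch below applies this to a hypothetical 10mΩ device. The function name and device value are assumptions, and real devices depart from the ideal law, so a data sheet curve should always take precedence.

```python
def rds_on_at(rds_ref, t_ref_c, t_c):
    """Scale on-resistance with temperature, assuming mobility ~ T^(-3/2)."""
    t_ref_k = t_ref_c + 273.15   # convert to absolute temperature
    t_k = t_c + 273.15
    return rds_ref * (t_k / t_ref_k) ** 1.5


# Hypothetical power MOSFET: 10 milliohm at 25°C.
rds_25 = 10e-3
rds_125 = rds_on_at(rds_25, 25, 125)
print(f"on-resistance at 125°C: {rds_125 * 1e3:.1f} milliohm")
print(f"ratio to 25°C value: {rds_125 / rds_25:.2f}")
```

On this simple model the on-resistance rises by roughly half between 25°C and 125°C, so conduction losses at a given current rise by the same proportion.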

As the gate voltage increases, the channel width decreases to the point where it becomes a constant, which defines the saturation region. All styles of FET exhibit a decrease in drain current as the temperature increases, in line with the mobility. This makes them inherently more stable than bipolar transistors, with no danger of thermal runaway. Correspondingly, power MOSFETs may be operated in parallel in any number without current sharing issues.

Exercise

To gain an appreciation of the size of the errors involved, and of the range of integrated circuit parameters that are affected by temperature, visit the Analog Instruments web site, and use their Simple OpAmp Buffer Error Budget Calculator to examine the effect of a 25ppm/°C spread in TCR for the resistors forming a feedback loop in a ×20 amplifier configuration.

Scroll down the list of parameters, and try to identify other aspects of the component or design that are related to temperature.

Now look at our comments.

 



Dealing with changes in device characteristics

So far in this Unit we have looked at the way in which temperature has an effect on device characteristics, both reversibly in the short term and as long-term permanent drift. Both these are effects that can produce parametric failure¹ if not allowed for at the design stage.

¹ If you have forgotten what distinguishes a parametric failure from other types, please refer to our Introduction to failure mechanisms.

 

The design task should therefore always be accompanied by a simulation of the circuit performance, in order to determine appropriate specifications for components, including selection tolerances, drift with temperature, and drift with life. Of course, in order to do this accurately, we will probably need some information about the operating conditions that the components have to survive.

Note that no computer simulation can tell you how a particular component will perform over a range of temperatures; only experimental results can tell you that. For example, the only way to determine the TCR of a resistor is to measure its resistance at different temperatures in a laboratory. However, once this information is available, we can use circuit simulation to check the performance of the circuit over its specified range of operating conditions. And we can use thermal simulation to predict the local environment of the components with a given combination of ambient conditions in order to feed that circuit simulation with correct data.

As generally used, circuit simulation will deal well with changes that happen in the longer-term, such as a slow drift in component values. It can also allow for the spread in characteristics between components and between batches. However, the simulation may well not take into account parametric shifts as a result of extended life at temperature, where some of the population lie outside the specification limits. Nor do simulations usually take into account components that fail catastrophically, which is often a result of temperature cycling rather than high temperature operation, as we will see later in this Unit.

Activity

One has to be particularly careful about those aspects of the layout that may affect the temperature of components, in particular if there are hot spots. Read this case study as a reminder of why we use simulation to find the detail of what is going on, rather than limit ourselves to simplified calculations.

 

The positioning of components has a marked effect on the temperature profile across an assembly, because components both generate heat and impede airflow. So there will always be some thermal gradients across the board. At the layout stage, the attempt to place components in the best possible position thermally also has to be balanced against the circuit routing requirement, itself a compromise. Fortunately not all components generate significant heat, and not all components are temperature-sensitive. A typical tactic for the designer is to place the most sensitive components in cooler areas of the board, allowing components whose performance is less sensitive to temperature to be packed more closely together and placed in regions of the assembly that are expected to be hotter.

In some cases the absolute temperature of a component or components is of less importance than the temperature profile across or between them. This is of particular importance in analogue integrated circuits, where one chip can contain several identical circuits that are temperature-sensitive. A thermal gradient across such a die can cause the circuits within it to behave differently, whereas identical behaviour may be desired.

In other cases, several identical but physically separated components may experience different temperatures, because of a large thermal gradient across them, and this in turn can cause each component to behave slightly differently. This applies to all types of component, but especially to semiconductor devices. Whether this is of importance depends on the application; for example, differential amplifier circuits are very sensitive to thermal variation, as gain, bandwidth and offset current/voltages are often affected by temperature. However, it may not matter that they get hot (assuming of course that they remain within their maximum operating temperatures), so long as each amplifier experiences the same temperature.

What we want to achieve in such cases is thermal symmetry. Figure 11 uses thermal contours to show just the opposite, an example of thermal asymmetry between two ICs.

Figure 11: Asymmetric thermal profile across two ICs


Because the thermal gradient across each IC is different, they are less likely to behave as a matched pair, which may cause unwanted drift and error in signal output, depending on the application. Computer simulation is a valuable tool to predict such unwanted thermal variation before the board layout is finalised.



Mechanical effects of temperature change

Provided that drifts in device characteristics have been allowed for in the design phase, most of the changes we have considered so far have no adverse implication for reliability. However, temperature cycling, an inevitable result of operating a circuit, induces strains caused by the expansion mismatches between component parts of the structure. Whether at component or assembly level, the level of these strains will be determined both by the absolute temperature and by temperature differences within the structure.

Within the elastic range of the materials, the structure will revert to its original condition when the stress is removed. However, strain-induced changes are only totally reversible if the strain takes place within a linear part of the stress-strain curve; once materials start to yield, permanent distortion will occur, and eventual catastrophic failure of some sort is likely. This may not happen immediately, and the time to failure will depend on the number of stress cycles as well as their amplitude.

Another distinction between our earlier considerations and this next section is that the drift and other slow changes that at worst cause parametric failures are generally associated with the functional core of the electronic components themselves, whereas mechanical failure can be more dramatic and many of the causes relate to the joints, or to assembly features of the internal component.

Exercise

For your studies we have brought together three papers that we believe are important in understanding these issues.

Read Stress caused by thermal mismatch, to see the benefits of using compliant joints where possible. Why is it that the LCCC has the highest failure rate?

Now read our comments

 

Read Stress and its effect on materials, to see how stress leads to failure, particularly the fluctuating stresses that result in fatigue failure, and how defects in the structure can increase the rate of failure.

[If you need it, there is more basic information on stress and strain in Mechanical properties of metals]

 

Finally, read Estimating time-to-fail for an insight into how the number of cycles to failure is related to the shear strain.

 

Eventually, joints will inevitably fail, but this is actually unimportant; what matters is that the expected number of cycles to failure should be well in excess of the design life of the product. The expected time-to-failure is a complex function, involving:

It is the last of these that is the most significant, and also the one which the packaging engineer can most influence by appropriate thermal design.

When we come to look at enclosures in Unit 16, we will find other aspects of the environment that add complications to the design activity. For example, Steinberg has observed that temperature-induced effects may be compounded if severe vibration is also taking place. And it is not always easy to distinguish problems that are due to the temperature experience of a product from those due to other causes; Steinberg quotes a case where a problem was believed to be vibration-induced, but proved to originate from thermal expansion in the Z-axis that caused solder joint failure.

Quote

The typical solder joint failure, often experienced in systems that have been exposed to vibration and thermal cycling, is not really a vibration failure. Experience has shown that most solder joint failures that appear to occur during vibration are really thermal cycling failures. Thermal cycling will typically initiate the solder joint cracks. However, thermal cycling is usually very slow, perhaps one or two cycles per day. There is very little crack propagation with such slow cycles. Vibration can easily have over 100–200 cycles per second. Cracks can propagate very fast under these conditions. So the cracks that appear during vibration are usually caused by thermal cycling and propagated very rapidly during vibration.

Dave Steinberg, Preventing thermal cycling and
vibration failures in electronic equipment

 

We have seen how the failure rate will depend on a combination of the materials used, and the temperature cycling to which the product is subjected. And from this we will be able to devise methods for testing components and assemblies to give assurance that they will survive their thermal environment. Typically we will use a series of accelerated tests, as described in our paper Assessing product reliability.

Accelerated life testing aims to generate within a short time period the same amount of damage that takes a much longer period to occur in the actual operating environment. Typically the range of temperature cycling is increased, and the cycle time reduced. However, when trying to establish a laboratory thermal cycling test programme that produces the same types of solder joint failures that are experienced in the actual operating environment, solder creep presents a problem. The reason is that the more slowly solder is cycled, the weaker it gets, so a more rapid thermal cycle will result in a much longer fatigue life. Therefore, the data gathered in the accelerated thermal cycling fatigue life tests must be examined very carefully to make sure that there is correlation with fatigue failures produced in the actual operating conditions.

Solder tends to become more plastic with more creep as the temperature increases. At temperatures below 0°C, solder is quite rigid, with little creep and stress relaxation, whereas at around 125°C, a tin-lead solder can relax its stress levels to one-third of their value in around 2 minutes; Figure 12 shows the time it takes for a solder to creep and relax internal stresses at different temperatures.

Figure 12: Stress relaxation in tin-lead solder due to creep

after Steinberg, 2001

Notice that Figure 12 refers to tin-lead solder; the high-tin solders used to make products lead-free are considerably harder and more resistant to creep, with the result that this relaxation takes longer. In consequence, IPC are recommending that temperature cycling dwell times be increased to a minimum of 15 minutes, in order to ensure that joints are stressed to the fullest extent.

By contrast, the very rapid stress cycles caused by vibration do not appear to have any creep effects. This issue is confined to solder, as the mechanical properties of lead wires, usually made of copper, kovar or similar materials, are not affected by high-temperature exposure.

Whilst we have focused on solder joints, similar issues apply to the internal construction of a silicon integrated circuit. In a typical plastic-encapsulated device, low-expansion silicon is surrounded by a moulding compound with a much higher CTE², but a lower elastic modulus:

material             CTE (ppm/°C)    elastic modulus (GPa)
silicon              3               188
moulding compound    13–20           10–15

² See this link for background information on CTE and its measurement and this link (PDF file, 362KB) for more information and an extensive list of CTE values for electronic materials.

 

The transfer moulding encapsulation process takes place at a temperature of approximately 180°C. As the materials cool, the plastic contracts more than the silicon. This means that it exerts a compressive force on the sides of the silicon and a shear force along the upper and lower surfaces. These combine to make the corners vulnerable to cracking. Such stresses caused in manufacture, even if they do not cause failure at the time, become ‘frozen into’ the structure, which starts its operational life in a state of stress, and these stresses are compounded by additional stresses generated during operation.
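
The scale of the mismatch can be estimated from the CTE values in the table as the difference in CTE multiplied by the temperature change. The sketch below does this for cooling from the moulding temperature to room temperature. It is a simplified illustration: the function name and the 25°C end point are our assumptions, the CTEs are treated as constant over the whole range, and the result is the free (unconstrained) mismatch strain, not the stress actually developed, which also depends on geometry and elastic modulus.

```python
def mismatch_strain(cte_a_ppm, cte_b_ppm, t_start_c, t_end_c):
    """Free thermal mismatch strain between two bonded materials."""
    return abs(cte_a_ppm - cte_b_ppm) * 1e-6 * abs(t_start_c - t_end_c)


CTE_SILICON = 3                    # ppm/°C, from the table
MOULDING_TEMP_C = 180              # approximate moulding temperature
ROOM_TEMP_C = 25                   # assumed end point

# Moulding compound CTE spans 13-20 ppm/°C, so evaluate both ends of the range.
for cte_compound in (13, 20):
    strain = mismatch_strain(cte_compound, CTE_SILICON,
                             MOULDING_TEMP_C, ROOM_TEMP_C)
    print(f"compound CTE {cte_compound} ppm/°C: strain = {strain * 100:.2f}%")
```

The mismatch is therefore of the order of 0.15–0.26% before the package has seen any operational temperature cycling.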



Reliability

We have seen that components and assemblies have the potential to fail, both parametrically and catastrophically, as a consequence of temperature excursion, high temperature endurance and temperature cycling. Especially for life-critical applications, a designer will use every means at his/her disposal, including simulation, to extend the life of the product in order to avoid heat-related problems. But how reliable will the resulting design be?

Modelling failure

A detailed consideration of modelling failure and failure rate is beyond the needs of this module, but thermal engineers should be aware that there are always pressures to reduce the maximum operating temperatures, particularly of semiconductors. Equally, they need to understand the background to such pressures, particularly as they relate to MIL-HDBK-217.

Exercise

Read our note on Modelling failure rate, and tackle the SAQ at the end.

 

You will note from our paper that there have been many criticisms of the standard, all of which are based on the need to understand the mechanisms by which failure happens, rather than just apply an artificial model.

Even if we accept that a number of failure modes follow the Arrhenius model, so that a plot of time-to-failure on a log scale against reciprocal absolute temperature gives a straight line, different failure modes have different activation energies. In a typical integrated circuit, there is a broad range of activation energies for the various failure modes, as shown in the table below:

failure mode              typical activation energy (eV)
oxide defects             0.30
mask defects              0.50
ball bond lifts           0.35–0.44
electromigration          1.00–1.10
contamination             1.00–1.40
electrolytic corrosion    0.30–0.70

Some of these will be highly accelerated by a temperature increase, others more temperature-independent, so plotting the Arrhenius equation for each failure mode will give a number of different straight lines (Figure 13). Note that different failure mechanisms dominate at different temperatures – for accurate prediction of the failure rate at any temperature, the nature of the failures must be known.
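The effect of activation energy can be illustrated with the standard Arrhenius acceleration factor, AF = exp[(Ea/k)(1/T_use − 1/T_stress)], with temperatures in kelvin. The Python sketch below evaluates it for three activation energies taken from the table. The 60°C and 70°C temperatures are our own illustrative choice, and the function name is hypothetical rather than taken from any standard.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(ea_ev, t_use_c, t_stress_c):
    """Ratio of failure rate at t_stress_c to that at t_use_c (both in °C)."""
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    exponent = (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_stress_k)
    return math.exp(exponent)


# Illustrative 10°C rise from 60°C to 70°C (assumed temperatures)
factors = {ea: arrhenius_acceleration(ea, 60.0, 70.0) for ea in (0.30, 0.50, 1.00)}

for ea, factor in factors.items():
    print(f"Ea = {ea:.2f} eV: failure rate multiplied by {factor:.2f}")
```

For this 10°C rise the multiplier ranges from about 1.4 at 0.30 eV to about 2.8 at 1.00 eV, so a single ‘doubling’ figure can only be right for one particular activation energy and temperature range.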

Figure 13: Failure rate vs temperature for three different activation energies


 



Improving reliability

There are good economic reasons for improving reliability, as our next quotation suggests. There are also some strategies for improving the reliability of products, based on sensible design practices.

Quote

When a product fails, you must replace it or fix it. In either case, you must track it, transport it, and apologise for it. Losses will be much greater than the cost of manufacture, and none of this expense will necessarily recoup the loss to your reputation.

Taguchi and Clausing, Robust Quality,
Harvard Business Review, Jan-Feb 1990

 

Choosing parts with appropriate temperature ratings

Based on the construction of the component, both the embedded technology and the encapsulation, component manufacturers will specify a temperature rating for each of their parts. Lower category temperatures are typically dictated by the encapsulation; upper category ratings may be limited by the basic technology (as in the case of aluminium electrolytics) or by the packaging (as with plastic-encapsulated microcircuits). One of the challenges of thermal management is to make sure that all devices are operated within their temperature limits. For most purposes, the concentration is on the upper part of the temperature range, except in severe conditions.

Procuring components for operation at high temperatures is a particular challenge for defence users. Traditionally, integrated circuits were purchased in hermetic packages, able to provide a 150°C rating, and the components were subjected to rigorous quality approval and screening. Not only did this add to the price, but it restricted the availability of the components. There were also concerns that, because of the low volumes manufactured, the components would be less reliable than commercial off-the-shelf (COTS) devices, as these are made in quantity and any quality problems have to be resolved quickly for commercial reasons.

The situation has been compounded by the move to lead-free, and the need to specify different termination materials. In consequence, a great deal of attention has been paid to using COTS parts for military applications, in the process up-rating their specifications. Surprisingly, in most cases, operating plastic-encapsulated parts at higher temperatures is much less of an issue than it once was. However, in any of these applications, we have to remember that failures may happen, and allow for this appropriately. This will include a thermal modelling activity, in order to ascertain the worst-case temperature for critical components.

Another way that thermal modelling helps save money is suggested by Tony Kordyban, who makes the very valid point that there is no benefit to be gained by excessive cooling, because this just generates cost, with no apparent reliability advantage.

Quote

That ‘Rule of Thumb’ [that every 10°C increase in temperature cuts component life in half] was probably never true. It comes from the white-coated world of chemistry, where there is a general principle that chemical reactions go faster the higher the temperature. Years ago the military adapted that concept to predicting how temperature makes electronic components fail. They gathered tons of questionable data from the field, then correlated the data with this iffy assumption about chemical reaction rates and came up with the military handbook on electronic reliability (MIL-HDBK-217). It quickly became an industry standard because they wrote the use of it into all the procurement contracts for military hardware, so everybody knows it by heart. MIL-HDBK-217 is the source of the myth that component failure rates double with every 10°C increase in temperature. But most people don’t remember that even MIL-HDBK-217 states that long term nominal operating junction temperatures operate lower than 70°C have zero effect on reliability. So spending money or other resources to reduce junction temperature below 70°C will buy you nothing. The truth is that the temperature that starts hurting a component may be even higher than that. But it is different for different kinds of components.

What are those maximum operating limits? . . . rarely published . . . Everybody agrees that for every component there is some temperature above which it should never be operated . . . every person has a different idea of what that temperature is.

Tony Kordyban, Hot air rises and heat sinks

 

Supplementary information

If COTS issues are important for your work, you should be aware of the online resources at COTS Journal.

And if you are working at extremes of temperature, this link has information that will be a useful starting point.

 



Avoiding catastrophic failures

We have already seen the use of emitter resistors in bipolar power transistor circuits to help avoid catastrophic failure, and there are several other possible approaches.

One way of preventing failure due to overheating and thermal hot spots is to design a level of thermal management into the circuit. Some companies produce thermal management ICs, specifically designed to operate with silicon temperature sensors. These sensors are diodes that have a predictable and temperature-dependent current/voltage relationship; a constant voltage is applied across the diode, and the current flowing through it is measured to give information on the diode temperature.

Many modern CPUs use sensor diodes fabricated on the die itself to measure their core temperature. If this exceeds a pre-set threshold, the thermal management system may reduce the clock speed, thus reducing the CPU dissipation, or switch the processor into standby mode.
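The threshold logic described above can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the state names, the 85°C and 100°C thresholds, and the 5°C hysteresis band (added so that the system does not switch rapidly back and forth around the threshold) are not taken from any particular CPU or thermal management IC.

```python
def next_state(temp_c, state, throttle_at=85.0, standby_at=100.0, hysteresis=5.0):
    """Return 'full', 'throttled' or 'standby' for a measured core temperature.

    All thresholds are hypothetical values chosen for this example.
    """
    if temp_c >= standby_at:
        return "standby"
    if temp_c >= throttle_at:
        return "throttled"
    if state != "full" and temp_c > throttle_at - hysteresis:
        # Already in a reduced mode: stay throttled until the core has
        # cooled clearly below the throttling threshold.
        return "throttled"
    return "full"


readings = [70.0, 86.0, 83.0, 79.0, 101.0, 90.0]  # example core temperatures in °C
state = "full"
trace = []
for temp in readings:
    state = next_state(temp, state)
    trace.append(state)

print(trace)
```

In this sketch a reading of 83°C keeps the processor throttled if it was already throttled, but would leave it at full speed if it had not yet exceeded the 85°C threshold.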

Another approach is to implement power sharing, where the load is shared between several devices. For example, two identical systems can be provided: one is used until it reaches a critical temperature, when it is shut down and the second takes over, and so on. Apart from size and cost considerations, this only works if the systems have substantial thermal capacity and take some time to reach their critical temperature.



Reducing failure rate in the field

The traditional ‘bathtub’ curve showing the change of failure rate with time has rightly been criticised, especially as it relates to ‘wear-out’ failures. However, early failures are known to exist, so typical methods for improving reliability involve eliminating those defects that might cause early failure. These ‘stress screening’ strategies include high temperature storage (effective in removing failures caused by chemical reaction), temperature cycling or temperature shock (both of which simulate life and expose problems associated with CTE imbalance) and ‘burn in’, in which potential is applied to a circuit at elevated temperature for an extended period (up to 168 hours).

Supplementary Information

There is more information about this topic, including some illustrations of test sequences in our paper Improving reliability by screening.

 

Another common practice for improving the reliability of electronic circuits is ‘derating’, employing devices at below their rated limits. MIL-HDBK-217 uses an acceleration factor defined by

A = e^{m(p_1 - p_0)}

where p1 is the percentage of the maximum stress, p0 is a reference percentage of rated stress (usually 25%) and m is a component factor. The term p refers to the main stress variable that relates to the components being considered – applied voltage for capacitors, applied voltage or power dissipated for a resistor.
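As a rough illustration of how this factor behaves, the Python sketch below evaluates A for a few stress levels. The reference level p0 = 25% follows the text, but the component factor m = 0.05 per percentage point is a made-up value for demonstration only, and the function name is our own; real values of m depend on the component type.

```python
import math


def stress_acceleration(p1, p0=25.0, m=0.05):
    """Acceleration factor A = exp(m * (p1 - p0)).

    p1 and p0 are percentages of rated stress; m is a component factor.
    The default m is hypothetical and chosen only for illustration.
    """
    return math.exp(m * (p1 - p0))


for percent in (25.0, 50.0, 75.0, 100.0):
    print(f"{percent:5.1f}% of rated stress: A = {stress_acceleration(percent):.2f}")
```

At the reference stress the factor is exactly 1, and it grows exponentially as the applied stress approaches the rated limit, which is why even modest derating can have a noticeable effect on predicted failure rate under this model.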

Exercise

Certainly derating is good practice, though how much derating is necessary is a debatable point, as you will have seen from our discussion on MIL-HDBK-217!

Review our material on Derating, and don’t forget to spend a little time on the activity in it.

 

And finally . . .

We hope that this brief look at the effects of heat will have given you some insights into the importance of being aware of maximum steady-state temperatures, of temperature differentials, and the effect of temperature changes. Controlling these, or at least using calculation and modelling to understand the thermal situation, is our focus in the following Units and software simulations.


