Introduction to Fault Tolerant (FT) Systems
This is a short introduction to fault tolerant (FT) systems that explains the major differences between regular and mission-critical products. It is aimed at readers who want an introduction to fault tolerant design, and it also applies to the fault tolerant devices in the ÅAC RTU family.
Article table of contents
- Effects of ionizing radiation on electronics
- Challenges for electronic design
- Radiation hardening on component, design, and system level
- Radiation hardening case study on nanoRTU™ 200 series
Effects of ionizing radiation on electronics
The challenge for electronics in a radiation environment, such as space, is the presence of highly energetic gamma rays, protons, electrons, neutrons, and heavy nuclei.
Gamma rays cause a Total Ionizing Dose (TID), which slowly degrades electronics: first the power consumption increases, and later the functionality is destroyed. The space community still expresses TID in rad, whereas most other fields use the gray (1 Gy = 100 rad). As a reference, a dose of 400 rad is lethal to a human being. In space applications, the minimum required TID tolerance is typically 10,000 rad. ÅAC's fault tolerant products tolerate a minimum TID of 20,000 rad, and the radiation hardened (RH) structured ASICs tolerate 100,000 rad. Electronics can be shielded from TID, for instance with aluminum. A device that tolerates 20 krad at component level can tolerate 100 krad with normal shielding (4 mm Al).
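The dose figures above are simple arithmetic, sketched here in Python. The rad-to-gray factor is exact by definition. The shielding factor of 5 is only what the 20 krad to 100 krad example implies for 4 mm of aluminum, and the function names are illustrative.

```python
RAD_PER_GRAY = 100.0  # 1 Gy = 100 rad by definition


def rad_to_gray(dose_rad):
    """Convert a dose in rad to gray."""
    return dose_rad / RAD_PER_GRAY


def shielded_tolerance(component_tolerance_rad, shielding_factor):
    """Dose outside the shield at which the component reaches its own limit.

    Assumes the shield attenuates the dose by a constant factor, which is a
    simplification: real attenuation depends on the orbit and particle spectrum.
    """
    return component_tolerance_rad * shielding_factor


# Factor implied by the text: 20 krad at component level, 100 krad behind 4 mm Al.
AL_4MM_FACTOR = 100_000 / 20_000

print(rad_to_gray(20_000))                        # component tolerance in Gy
print(shielded_tolerance(20_000, AL_4MM_FACTOR))  # tolerance behind the shield, in rad
```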
Highly energetic particles cause other phenomena, called Single Event Latch-up (SEL) and Single Event Upset (SEU). The underlying processes are complex, but in the SEL case they put the device into a short circuit, and in the SEU case they cause electronic noise and signal distortion. A system can be protected from SEL by introducing Latch-up Current Limiters (LCL), which shut the power off when a short circuit occurs. SEUs usually generate bit flips in digital electronics, altering the contents of memories, processor caches, and instructions, and hence changing the behavior of the system. SEU effects cannot be shielded against because of the very high energy of the particles.
It may be tempting to fly commercial electronics in space, and it may just work in low Earth orbit (LEO), but the system-level reliability analysis will seldom give a figure higher than 50%. It is a 50-50 gamble whether the system will work, and the odds get worse as the mission gets longer. The core avionics system of a spacecraft should always consist of fault tolerant electronics that can mitigate SEU effects.
The susceptibility to Single Event Effects (SEE) is usually characterized by Linear Energy Transfer (LET) numbers. A device rated to a LET higher than 70 MeV·cm²/mg is typically good for LEO. Remember, however, that a digital component such as an FPGA can have different LET numbers for its different parts. The FPGA flash may have LET = 96 while the RAM has LET = 1, meaning that significant protection is needed on the RAM.
There is an important difference between space radiation and radiation in nuclear power plants. Because of Earth's magnetic field, most of the particles around Earth are protons and electrons, while neutrons dominate in nuclear power plants.
Challenges for electronic design
Minimize the digital SEU effects! There are three ways to select components. First, use rated components, where the manufacturer guarantees correct functionality. These are typically very expensive, on the order of $1,000 to $100,000. Second, use screened components, meaning that the component has been tested on Earth at radiation facilities. Third, use components with heritage, i.e. components that have flown before.
In some cases, especially for Class D (low-cost) missions, you can use a "best guess": a similar component has been screened, but not the one you have selected. Find out whether the component is made on the same process. Find out whether the manufacturer offers a space-qualified component, and check whether it uses the same die as the part that has been screened.
Be aware that you will likely have to make trade-offs unless you have an infinite amount of resources. This is especially important for nanosatellites. Softer EEE components need to be protected with redundancy and error correction.
Space agencies normally have a plethora of rules and standards to mitigate potential threats. They normally also require that a Part Stress Analysis (PSA) and a Worst Case Analysis (WCA) be performed. These are important. For nanosatellites, however, a few major rules can be applied to reduce cost. First, derate EEE components to 60%, i.e. use the components only up to 60% of their maximum specified rating. Second, derate connectors to 50% of their maximum specified rating. Together with thermal analysis, these rules of thumb bring the device to a higher reliability level than it would otherwise reach.
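The two derating rules can be checked mechanically. The sketch below is only an illustration: the function names and the example parts (a 50 V capacitor, a 3 A connector pin) are hypothetical and not taken from any ÅAC design.

```python
EEE_DERATING_PERCENT = 60        # EEE components: use at most 60% of the rating
CONNECTOR_DERATING_PERCENT = 50  # connectors: use at most 50% of the rating


def derated_limit(max_rating, percent):
    """Highest stress allowed on a part after derating."""
    return max_rating * percent / 100


def is_within_derating(applied, max_rating, percent):
    """True if the applied stress respects the derating rule."""
    return applied <= derated_limit(max_rating, percent)


# Hypothetical parts, for illustration only.
print(derated_limit(50, EEE_DERATING_PERCENT))                # limit for a 50 V capacitor
print(is_within_derating(2.0, 3, CONNECTOR_DERATING_PERCENT)) # 2 A on a 3 A connector pin
```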
Radiation hardening on component, design, and system level
Digital EEE parts, and FPGAs in particular, can be protected in several ways. Many FPGA manufacturers offer Triple Modular Redundancy (TMR). TMR converts each flip-flop of the design into three flip-flops and a voting circuit whose output is 1 only if two or all three of the flip-flops are 1. Because of this, TMR consumes many more flip-flops than an unprotected design. As a recommendation, do not fill an FPGA to more than half if you want to use TMR.
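The voter's truth table is a two-out-of-three majority function. A minimal software model is sketched below; in a real design the voter is FPGA logic inserted by the vendor tools, and the `TmrRegister` class is purely illustrative.

```python
def vote(a, b, c):
    """Bitwise two-out-of-three majority of three redundant copies."""
    return (a & b) | (a & c) | (b & c)


class TmrRegister:
    """Software model of one TMR-protected register (illustrative only)."""

    def __init__(self, value=0):
        self.copies = [value, value, value]

    def write(self, value):
        self.copies = [value, value, value]

    def read(self):
        return vote(*self.copies)


reg = TmrRegister(0b1011)
reg.copies[1] ^= 0b0100   # simulate an SEU flipping one bit in one copy
print(bin(reg.read()))    # the voter still returns the original value
```

A single upset in any one copy is outvoted by the other two. Two upsets hitting the same bit in different copies would defeat the voter.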
Another technique is to use Error Detection and Correction (EDAC) algorithms, which are especially suited to memories and data paths. Typically the data bus is made wider, for instance 32 bits of data plus 8 bits of EDAC on a 40-bit bus. EDAC can be sized according to need: one-bit error detection, one-bit error correction, one-bit error correction with two-bit error detection, two-bit error detection, and so on. ÅAC typically uses one-bit error correction and multiple-bit error detection within the RTU family. Even firmware data can be protected by adding EDAC during programming on Earth, before the firmware is sent to space. The FPGA algorithm then checks the firmware data on the fly during execution.
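As an illustration of one-bit correction with two-bit detection, here is a sketch of a textbook extended Hamming code over 32 data bits. This is an assumption for illustration, not the code ÅAC actually uses: the textbook code needs only 7 check bits, so one bit of an 8-bit EDAC field would be spare. All names are hypothetical.

```python
DATA_BITS = 32
CHECK_POSITIONS = [1, 2, 4, 8, 16, 32]        # Hamming check bits
CODE_LEN = DATA_BITS + len(CHECK_POSITIONS)   # positions 1..38
DATA_POSITIONS = [p for p in range(1, CODE_LEN + 1) if p not in CHECK_POSITIONS]


def _syndrome(bits):
    """XOR of the positions (1..CODE_LEN) that hold a 1."""
    s = 0
    for pos in range(1, CODE_LEN + 1):
        if bits[pos]:
            s ^= pos
    return s


def encode(data):
    """Return a 39-element bit list; index 0 is the overall parity bit."""
    bits = [0] * (CODE_LEN + 1)
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (data >> i) & 1
    s = _syndrome(bits)
    for pos in CHECK_POSITIONS:
        bits[pos] = 1 if s & pos else 0       # drives the syndrome to zero
    bits[0] = sum(bits) % 2                   # makes the total parity even
    return bits


def decode(bits):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(bits)
    s = _syndrome(bits)
    parity_odd = sum(bits) % 2 == 1
    if s == 0 and not parity_odd:
        status = "ok"
    elif parity_odd and s <= CODE_LEN:
        bits[s] ^= 1                          # s == 0 means the parity bit itself
        status = "corrected"
    else:
        return None, "uncorrectable"          # double error, or invalid position
    data = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return data, status


word = encode(0xDEADBEEF)
word[5] ^= 1                                  # inject a single bit flip
print(decode(word))                           # data recovered, status 'corrected'
```

A nonzero syndrome with even overall parity means two bits flipped, which the code can report but not repair.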
The same algorithm can be used to protect external processor SDRAM through a scrubbing sequence. When the processor or sequencer is not using the external memory, a hardware scrubber steps through the memory and checks each address with EDAC. If it finds an error, it corrects it and writes the corrected value back. If it cannot correct the error, it notifies the processor so that action can be taken.
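The scrubbing loop itself is simple, as the software sketch below suggests. To keep the example short, the EDAC check is replaced by a stand-in code that stores each word three times and votes between the copies. The stand-in and every name here are assumptions for illustration; a real scrubber would run in hardware and use the EDAC described above.

```python
def check_cell(cell):
    """Stand-in EDAC: a cell holds three copies of one word.

    Returns (status, repaired_cell) with status 'ok', 'corrected' or
    'uncorrectable'.
    """
    a, b, c = cell
    if a == b == c:
        return "ok", cell
    if a == b or a == c:
        return "corrected", (a, a, a)
    if b == c:
        return "corrected", (b, b, b)
    return "uncorrectable", cell


def scrub(memory, check, notify):
    """Run one scrub pass over memory and return the number of repaired cells.

    Correctable errors are written back; uncorrectable ones are reported
    through notify(address) so that the processor can act on them.
    """
    repaired = 0
    for address, cell in enumerate(memory):
        status, fixed = check(cell)
        if status == "corrected":
            memory[address] = fixed
            repaired += 1
        elif status == "uncorrectable":
            notify(address)
    return repaired


memory = [(5, 5, 5), (7, 7, 3), (1, 2, 3)]
bad_addresses = []
print(scrub(memory, check_cell, bad_addresses.append))  # number of repaired cells
print(memory[1], bad_addresses)
```

Writing the corrected value back matters: it stops single-bit errors from accumulating into multi-bit errors that the code can no longer correct.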
For less critical parts a simple parity algorithm can be used. Each byte is protected by a parity bit (8 + 1). This is very simple, but it detects only errors that flip an odd number of bits, roughly half of all errors, and it offers no correction. Even so, finding half of the errors in a less critical path significantly improves the system reliability analysis.
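A minimal sketch of the 8 + 1 scheme follows, assuming even parity and a 9-bit storage layout with the parity bit in the least significant position. Both choices are assumptions made for the example.

```python
def parity_bit(byte):
    """Even-parity bit: 1 if the byte has an odd number of 1 bits."""
    return bin(byte & 0xFF).count("1") % 2


def protect(byte):
    """Return a 9-bit word: 8 data bits followed by the parity bit."""
    return ((byte & 0xFF) << 1) | parity_bit(byte)


def is_valid(word):
    """A stored 9-bit word is valid if it holds an even number of 1 bits."""
    return bin(word & 0x1FF).count("1") % 2 == 0


stored = protect(0x5A)
print(is_valid(stored))                # True: no error
print(is_valid(stored ^ 0b000000100))  # False: a single bit flip is detected
print(is_valid(stored ^ 0b000000110))  # True: a double bit flip goes unnoticed
```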
Cyclic redundancy checks (CRC) can be used to verify the integrity of data streams. CRC is typically used in conjunction with parity or EDAC to further improve reliability and the detection of faulty data. Data transmitted inside a system should be sent with CRC protection.
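A bitwise CRC-8 can be sketched in a few lines. The polynomial x^8 + x^2 + x + 1 (0x07) with a zero initial value is assumed here because it is the one commonly labeled CRC-8 CCITT. The text does not say which parameters are used, and the `frame` and `check` helpers are hypothetical.

```python
def crc8(data, poly=0x07, init=0x00):
    """Bitwise CRC-8, most significant bit first, no final XOR."""
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def frame(payload):
    """Append the CRC so the receiver can verify the payload."""
    return payload + bytes([crc8(payload)])


def check(framed):
    """With a zero initial value, a valid frame has a CRC of zero."""
    return crc8(framed) == 0


message = frame(b"telemetry")
print(check(message))                             # intact frame
print(check(bytes([message[0] ^ 0x01]) + message[1:]))  # one bit corrupted
```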
Additional protection can be achieved by keeping redundant firmware images, each protected by CRC and EDAC.
It is very important to handle ALL detected errors! An error manager is a logic entity, for instance implemented in an FPGA, that implements the rules for the consequences of each detected error. These actions can be resetting the processor, rebooting with redundant firmware, discarding data, raising a processor interrupt for fast action, or switching to a completely redundant processor unit.
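One way to picture an error manager is as a policy table that maps every error source to an action. The sketch below is a software model only. The error kinds and the policy are hypothetical; the actions are the ones listed above. The constructor refuses a policy that leaves any error kind unhandled, which is one way to enforce the rule that all detected errors get handled.

```python
from collections import Counter
from enum import Enum, auto


class Error(Enum):
    EDAC_UNCORRECTABLE = auto()
    PARITY = auto()
    CRC_MISMATCH = auto()
    WATCHDOG = auto()


class Action(Enum):
    RESET_PROCESSOR = auto()
    BOOT_REDUNDANT_FIRMWARE = auto()
    DISCARD_DATA = auto()
    RAISE_INTERRUPT = auto()
    SWITCH_TO_REDUNDANT_UNIT = auto()


class ErrorManager:
    def __init__(self, policy):
        # Every error kind must have a defined consequence.
        missing = [e.name for e in Error if e not in policy]
        if missing:
            raise ValueError("no action defined for: " + ", ".join(missing))
        self.policy = dict(policy)
        self.counts = Counter()

    def report(self, error):
        """Count the error and return the action the policy prescribes."""
        self.counts[error] += 1
        return self.policy[error]


# Hypothetical policy, for illustration only.
POLICY = {
    Error.EDAC_UNCORRECTABLE: Action.RESET_PROCESSOR,
    Error.PARITY: Action.RAISE_INTERRUPT,
    Error.CRC_MISMATCH: Action.DISCARD_DATA,
    Error.WATCHDOG: Action.BOOT_REDUNDANT_FIRMWARE,
}

manager = ErrorManager(POLICY)
print(manager.report(Error.CRC_MISMATCH))
```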
Radiation hardening case study on nanoRTU™ 200 series
The nanoRTU™ 200 series is a processor module with modest performance, intended for simple tasks. To minimize power consumption and complexity, a PIC microprocessor was chosen. The first thing to observe is that there are no fault tolerant PIC processors, so a soft core in VHDL was chosen from the Open Core community. However, the downloaded core had significant flaws and had to be heavily modified to work correctly. Once the PIC16 soft core was certified, a fault tolerance improvement program was started. As a result, EDAC (one-bit correction, multiple-bit error detection) was implemented on the PIC program RAM data flow, parity was added to the PIC data RAM and the user RAM, and finally a redundant, CRC-protected firmware boot sequence was added. All flip-flops of the FPGA are protected with TMR, and any event that would cause an entire IO bank of the FPGA to flip is handled with cross connection. Together, this creates a fault tolerant PIC16 core called ÅAC PIC16F84-FT, which is fully implemented in the ÅAC nanoRTU™ 212 FT interface module. The error manager implemented on the nanoRTU™ 212 FT interface module has the following behavior:
- 1 bit correction and multiple error detection EDAC on PIC program RAM. Multiple errors lead to automatic reset of the device.
- Parity error checking on PIC data RAM. Parity error leads to automatic reset of the device.
- Parity error checking on 1 KByte user RAM. Parity error triggers user interrupt for user action.
- Triple Modular Redundancy (TMR) on all FPGA flip-flops
- FPGA SEU bank flip detection. Bank SEU leads to automatic reset of the device.
- Watchdog. Watchdog tripping leads to automatic reset of the device.
- Cyclic redundancy checking (CRC-8 CCITT) during boot on static firmware and in-flight upgradable firmware
- Cyclic redundancy checking (CRC-8 CCITT) of extensible Transducer Electronic Datasheet (xTEDS)
- The nanoRTU™ first tries to reboot from the in-flight upgradable firmware. This only happens if the in-flight upgradable firmware exists and its CRC checks out; otherwise the device automatically falls back to the static firmware and checks its CRC. The device tries to boot the static firmware even if the static firmware has a CRC error. It will, however, flag that a CRC error has been detected. If the device is functioning, this lets the user transmit this information to the spacecraft fault detection, isolation, and recovery (FDIR) manager. In addition, the FPGA provides registers that count the number of errors of each kind, and these are not reset by the FPGA soft reset.
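The boot sequence in the last item can be sketched as a small selection function. This is a reading of the description above, not the actual firmware. The `Image` type and the function names are hypothetical, the CRC-8 polynomial 0x07 is assumed, and since the text does not say whether a failed check of the upgradable image is flagged, the sketch flags only a static firmware CRC error.

```python
from collections import namedtuple

# Hypothetical firmware image: payload bytes plus the CRC stored with them.
Image = namedtuple("Image", ["payload", "stored_crc"])


def crc8(data, poly=0x07):
    """Bitwise CRC-8, most significant bit first, zero initial value."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc


def crc_ok(image):
    return crc8(image.payload) == image.stored_crc


def select_boot_image(upgradable, static):
    """Return (image_name, crc_error_flag).

    upgradable may be None when no in-flight firmware has been uploaded.
    The static image is booted even if its CRC fails, but the flag is set.
    """
    if upgradable is not None and crc_ok(upgradable):
        return "upgradable", False
    return "static", not crc_ok(static)


static = Image(b"static firmware", crc8(b"static firmware"))
corrupted_upgrade = Image(b"new firmware", 0x00 if crc8(b"new firmware") else 0x01)
print(select_boot_image(corrupted_upgrade, static))  # falls back to the static image
```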
The ÅAC PIC16F84-FT core can be licensed from ÅAC. The core and the nanoRTU™ 212 FT product have been validated by the US Air Force Research Laboratory (AFRL) and thoroughly tested by the Space Dynamics Laboratory (SDL) at Utah State University.
