OBDP 2021 - Session 9: COTS Processors Hardening and Fault Tolerance Improvement

Tracks

Day 3 - Advances in Data Processing Devices and Equipment

Wednesday, June 16, 2021

4:30 PM - 5:50 PM

Speaker

Attendee169

multiMIND – high performance processing system for robust NewSpace payloads

4:30 PM - 4:50 PM

Abstract Submission

The increasing demand for onboard computation creates a need for high performance processing systems on board nano to small satellites. Modern NewSpace applications, especially signal processing, image processing and artificial intelligence, require an amount of processing that is not feasible with currently available space grade solutions. In this paper, Thales Alenia Space in Germany presents its answer to this need: the multiMIND processing system, a highly flexible, multi-mission solution with a modular software framework. The development and demonstration are supported by the European Space Agency in the frame of the ARTES Competitiveness & Growth activity “multiMIND on EIVE”.
The latest COTS Xilinx Zynq UltraScale+ MPSoC (multi-processor system on chip) combines a large FPGA with multi-core ARM processors and delivers the required processing performance, whereas robustness against radiation effects is ensured by dedicated radhard island circuitry. This radhard circuit serves as a supervisor for the radiation-sensitive COTS parts as well as a reliable interface to other satellite subsystems. The design is completed by other elements such as Linux, software components and COTS FPGA IP cores, allowing for fast development in the demonstration mission as well as cost reduction in next generation missions.
The generic core is interfaced with mission specific data acquisition boards, such as RF frontends and digitizers, RF agile processors or optical sensors, using standard high speed lanes with a total data rate exceeding 100 Gbit/s. Internal high speed communication also allows integration of plug-in accelerators and dedicated companion boards. Dedicated interface boards even allow COTS FMC boards to be flown with minimal adaptation.
Since the supervisor section manages all space-specific challenges, users can make wide use of terrestrial and COTS-proven routines, accelerating the design cycle and providing a fast “highway to space” for their designs. Radiation data and characterization, especially for leading-edge COTS chips, are not available, so these components require active SEE mitigation. The supervisor performs autonomous latch-up monitoring and clearing, including smart current analysis to catch lower current-step or micro-latching events. Relevant telemetry is monitored, analysed and provided to the satellite platform. The supervisor also handles controlled initialization, reboot or reset if necessary and informs the platform accordingly. It buffers and manages the TM/TC flow and can update the MPSoC firmware and software autonomously.
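
The abstract does not describe how the supervisor's current analysis works. As a rough illustration only, the Python sketch below shows one way a monitor could separate a classic latch-up (hard over-current) from a small, sustained current step. The class name, thresholds, window length and return strings are all hypothetical and are not taken from multiMIND.

```python
from collections import deque
from statistics import median


class LatchupMonitor:
    """Illustrative current monitor; every threshold here is an assumed value."""

    def __init__(self, hard_limit_ma=2000.0, step_ma=50.0, window=8, persist=3):
        self.hard_limit_ma = hard_limit_ma    # classic latch-up: trip immediately
        self.step_ma = step_ma                # rise above baseline treated as suspicious
        self.persist = persist                # consecutive suspicious samples required
        self.history = deque(maxlen=window)   # recent healthy samples forming the baseline
        self.suspicious = 0

    def update(self, current_ma):
        """Classify one current sample as 'ok', 'micro_latchup' or 'latchup'."""
        if current_ma >= self.hard_limit_ma:
            return "latchup"
        if len(self.history) < self.history.maxlen:
            self.history.append(current_ma)   # still learning the baseline
            return "ok"
        baseline = median(self.history)
        if current_ma - baseline >= self.step_ma:
            # Suspicious samples are kept out of the baseline so that a
            # persistent step cannot gradually hide itself.
            self.suspicious += 1
            return "micro_latchup" if self.suspicious >= self.persist else "ok"
        self.suspicious = 0
        self.history.append(current_ma)
        return "ok"
```

A real supervisor would also have to account for legitimate load changes (for example, a new processing task starting), which this sketch deliberately ignores.
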
The first application of the multiMIND framework is the E/W-band demonstration mission EIVE (Exploratory In-Orbit Verification of an E/W-Band Satellite Communication Link), where camera data handling and high speed data downlink will be demonstrated; it is scheduled to fly in 2022. This paper will also show how the multiMIND framework can be adapted for modern RF SDR, signal / spectrum intelligence, radar or optical applications.

Attendee3

Dependable MPSoC framework for mixed criticality applications

4:50 PM - 5:10 PM

Abstract Submission

System-on-chip devices such as the Zynq UltraScale+ combine multi-core processors (PS), programmable logic (PL) and peripherals in a single device. For space applications, these devices offer unprecedented functional integration and performance in a smaller form factor.

Challenges arise in mixed-criticality use-cases, such as onboard computers with payload integration, where the dependability of critical functions cannot be compromised by adjacent functions. To improve the adoption of the MPSoC technology in space, avionics developers should consider providing a baseline design framework with native functionality and the possibility for users to further exploit other MPSoC resources. For both, a safe level of dependability shall be guaranteed, which can be balanced with performance according to the use-case.

One possible solution is to exploit native MPSoC isolation and fault detection features. However, this is often not sufficient, especially when use-cases require data to be shared across domains with different criticality levels, which may lead to failure propagation.

EVOLEO and AIRBUS, in the frame of the ESA GSTP project CHICS, are developing an ADHA-compatible, radiation-tolerant 3U computer based on the Zynq UltraScale+ for mixed-criticality space applications. The solution considers a clear separation between platform and payload functions within the MPSoC. It is oriented towards parallel but independent developments for platform and payload functions, which are often the responsibility of different entities.

The platform side includes lockstep ARM R5 cores and peripherals such as DDR4, SpaceWire router, CAN and UART. The solution offers IP cores, drivers, handlers and software to command and control these peripherals.

The payload side considers four ARM A53 cores with XEN Hypervisor and cache coloring, high speed serial transceivers, and dedicated DDR4 and PL resources. These elements are connected via a generic AXI infrastructure. Multiple user applications (SW or logic based) can be easily integrated.
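
Cache coloring partitions a shared cache by giving each domain only physical pages that map to its own group of cache sets. The Python sketch below illustrates the usual page-color arithmetic; the function names are hypothetical, and the cache geometry in the usage note (1 MiB, 16-way, 4 KiB pages) is an assumption for illustration, not a figure from the abstract.

```python
def num_colors(cache_bytes, ways, page_bytes=4096):
    """Number of page colors: the size of one cache way divided by the page size."""
    way_bytes = cache_bytes // ways
    return max(1, way_bytes // page_bytes)


def page_color(phys_addr, cache_bytes, ways, page_bytes=4096):
    """Color of the physical page that contains phys_addr."""
    return (phys_addr // page_bytes) % num_colors(cache_bytes, ways, page_bytes)


def pages_for_domain(first_page, page_count, allowed_colors, cache_bytes, ways,
                     page_bytes=4096):
    """Page numbers in [first_page, first_page + page_count) whose color is allowed."""
    colors = num_colors(cache_bytes, ways, page_bytes)
    return [p for p in range(first_page, first_page + page_count)
            if p % colors in allowed_colors]
```

Under the assumed geometry, `num_colors(1 << 20, 16)` gives 16 colors, so two guests assigned disjoint color sets never compete for the same cache sets.
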

Both criticality sides are connected via a bespoke “secure data exchange unit” in the PL. Besides the data exchange, this unit supports an “exchange monitor” for fault detection and isolation with the goal of avoiding fault propagation between criticality levels.

The complexity of this exchange monitor is scalable to user needs and available FPGA resources. It can range from ECC and context-aware limit-type checks up to machine learning algorithms. All alarms are collected and managed by a configurable FDIR function running on the secure side.

The consortium explores this concept in a scenario with a SAVOIR OBC on the secure side and a GNSS receiver and star tracker on the user side. Position, velocity and time (PVT) are shared with an AOCS application running on the secure side. The exchange monitor detects any out-of-range PVT values before these are fed into the AOCS algorithms.
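
A limit-type check of the kind described for the PVT scenario could look roughly like the Python sketch below. In the framework the exchange monitor sits in the PL; this is a software model only. The `Pvt` type, the alarm names and every limit (altitude band, maximum speed, maximum time step) are assumptions chosen for a generic low Earth orbit, not values from the CHICS project.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000.0  # mean radius, sufficient for a coarse sanity check


@dataclass(frozen=True)
class Pvt:
    position_m: tuple    # ECEF x, y, z in metres
    velocity_mps: tuple  # ECEF vx, vy, vz in metres per second
    time_s: float        # timestamp in seconds


def check_pvt(sample, previous=None, min_alt_m=200e3, max_alt_m=2000e3,
              max_speed_mps=9000.0, max_dt_s=2.0):
    """Return a list of alarm names; an empty list means the sample may pass."""
    values = (*sample.position_m, *sample.velocity_mps, sample.time_s)
    if not all(math.isfinite(v) for v in values):
        return ["non_finite"]
    alarms = []
    altitude = math.hypot(*sample.position_m) - EARTH_RADIUS_M
    if not min_alt_m <= altitude <= max_alt_m:
        alarms.append("altitude_out_of_range")
    if math.hypot(*sample.velocity_mps) > max_speed_mps:
        alarms.append("speed_out_of_range")
    if previous is not None:
        dt = sample.time_s - previous.time_s
        if not 0.0 < dt <= max_dt_s:   # time must advance, but not jump
            alarms.append("time_step_out_of_range")
    return alarms
```

Only samples with an empty alarm list would be forwarded to the AOCS application; anything else would be reported to the FDIR function instead.
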

Recurrence-oriented HW/SW frameworks are a key enabler for exploring the functional integration capabilities of state-of-the-art SoCs. They can simplify payload integration into avionics systems, reducing overall costs.

Fault detection and isolation may be handled by this infrastructure, providing baseline levels of dependability with upper dependability levels still limited by the device itself.

Attendee118

IP to detect and diagnose errors in COTS microprocessors through the Trace Interface

5:10 PM - 5:30 PM

Abstract Submission

The use of commercial-off-the-shelf (COTS), cutting-edge processing systems in space applications has received much attention due to an increasingly competitive commercial space sector. Such components would increase on-orbit processing capabilities to unprecedented levels, bringing a great competitive advantage, but assuring reliability under harsh space conditions is a challenge.

Single-event effects (SEEs) are a major concern in processors. When using COTS components, available SEE protections are limited and knowledge about the behavior of the device under radiation is poor. Typically, only limited actions can be performed on the hardware to enhance radiation hardness. For that reason, systems based on COTS processors usually rely on software-level hardening, modifying the code to increase robustness at the cost of significant performance penalties. Moreover, software hardening can only protect software-accessible resources, so other processor resources may be left unprotected.

To achieve fault tolerance, processing systems based on COTS must be designed to implement error detection and recovery capabilities. However, few details are typically available about the internal architecture or implementation of COTS components, and the observability of the processor internal state is usually low. In addition, complex processing systems may present varied failure modes that need to be tackled in different ways, especially when considering different criticality levels.

We present a solution that tackles both the radiation hardening and the testability challenges of COTS microprocessors. We have developed a lightweight IP core in HDL that leverages the information available at the trace interface to detect and diagnose errors in microprocessors. The trace interface is a resource commonly found in modern microprocessors to support application development. It provides relevant data about execution with low latency in a non-intrusive manner. When development finishes, the trace circuitry is no longer in use and can be reused for other purposes. The presented IP can detect errors online, during processor execution, and obtain error evidence and traceability with low impact on system design and no performance penalty, supporting several development phases:

-Design: providing error detection and diagnosis capabilities during development to identify flaws in the system and enhance a given application to meet dependability requirements.

-Device evaluation: detecting and classifying errors in different devices, allowing severity evaluation to provide objective criteria on component selection. Not only for COTS but also for RadHard devices, it could help to understand and mitigate complex failure modes.

-Operation: working side by side with a microprocessor to check the integrity of the executed application in real time, raise an alert upon error, and provide diagnosis information to perform the necessary corrective action with low latency, achieving fault tolerance.
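
The abstract does not state which checks the IP applies to the trace stream. One generic technique that trace data makes possible is control-flow checking: each observed transfer is compared against the set of legal transfers extracted from the program offline. The Python sketch below models that idea only; it is not the method of the IP described here, and the function name, the graph representation and the addresses in the usage note are hypothetical.

```python
def check_trace(trace, legal_edges, entry):
    """Check a traced sequence of basic-block addresses against a control-flow graph.

    trace       -- block addresses in the order the trace interface reported them
    legal_edges -- dict mapping a block address to the set of its legal successors
    entry       -- address of the block where execution is known to start

    Returns None if every transfer is legal, otherwise a tuple
    (index, source, destination) describing the first illegal transfer,
    which can serve as diagnosis information.
    """
    current = entry
    for index, block in enumerate(trace):
        if block not in legal_edges.get(current, ()):
            return (index, current, block)
        current = block
    return None
```

For example, with `legal_edges = {0x100: {0x120, 0x140}, 0x120: {0x160}, 0x140: {0x160}, 0x160: {0x100}}`, a trace that jumps from `0x120` straight to `0x140` is reported together with the offending source and destination.
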

This solution is currently available at ARQUIMEA as an IP core compatible with the ARM Cortex-A9 processor. It has been functionally validated on a Xilinx Zynq device under radiation testing (TRL3-4), obtaining a high error detection rate (up to 99.9%) and useful diagnosis information. The IP features a low pin count and a parametric design ready to be implemented in any FPGA with a low footprint. Efforts are ongoing to extend IP compatibility to a wider range of technologies and processor cores, including Microchip and NanoXplore FPGAs.

Attendee181

System-level hardening techniques used in the COTS-based data processing unit

5:30 PM - 5:50 PM

Abstract Submission

CubeSat missions, in most cases, utilize commercial off-the-shelf (COTS) components. COTS components are vulnerable to radiation effects such as single event effects (SEE) and total ionizing dose (TID) damage. These effects decrease overall system reliability and can lead to permanent damage to components. One method of mitigating the risk is system-level hardening.

The most commonly used hardening techniques are: fault detection, isolation, and recovery (FDIR) mechanisms implemented in a ruggedized controller that controls state-of-the-art systems on a chip (SoCs); error-correcting code (ECC) or triple modular redundant (TMR) memories; and configuration RAM (CRAM) scrubbing in the SoCs. In some cases, partial or full redundancy of selected components is implemented. These techniques were reviewed, and a selection of them was implemented in KP Labs' Leopard data processing unit (DPU). The proposed solutions may be reused in other missions to meet mission reliability, availability, and safety requirements.

The Leopard data processing unit (DPU) is part of KP Labs' Intuition-1 mission, scheduled for launch in 2023. Intuition-1 will be a 6U-class CubeSat carrying a specialized hyperspectral camera that covers the 470-900 nm spectral range in 150 spectral bands. The primary purpose of the mission is to demonstrate the reduction of the spatial resolution of hyperspectral images (HSI), hyperspectral band selection, and segmentation of HSI using neural network-based in-orbit processing hardware, the Leopard DPU. Implementing the algorithms on board the satellite will allow data to be decimated quickly, reducing the radio air time required to download the data to Earth.

The Leopard DPU consists of two redundant processing nodes controlled by a shared supervisor. Both elements are built using COTS components. Each processing node utilizes a state-of-the-art Xilinx Zynq UltraScale+ MPSoC, 16 GB of DDR4 memory with ECC, 4 GB of NAND flash memory, and two 256 GB solid-state drives (SSD). The Leopard DPU uses Xilinx's Vitis AI development environment to support networks built with mainstream AI frameworks such as TensorFlow and Caffe. The Zynq's bootloaders and a basic Linux image are stored in triple modular redundant (TMR) QSPI NOR flash memory on the supervisor board. Moreover, the supervisor board implements basic safety features for the two processing nodes, selects the Linux images to be loaded, and multiplexes platform interfaces.
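
As a minimal illustration of how triple modular redundant storage tolerates upsets, the Python sketch below performs a bitwise majority vote over three copies of an image. It is a generic software model, not a description of how the Leopard supervisor reads its flash; the function name and return format are hypothetical.

```python
def tmr_vote(copy_a, copy_b, copy_c):
    """Bitwise majority vote over three equally long byte strings.

    Returns (voted, mismatches): the voted bytes, plus the offsets at which
    the three copies disagreed and would need to be rewritten.
    """
    if not (len(copy_a) == len(copy_b) == len(copy_c)):
        raise ValueError("all three copies must have the same length")
    voted = bytearray(len(copy_a))
    mismatches = []
    for offset, (a, b, c) in enumerate(zip(copy_a, copy_b, copy_c)):
        # A bit is set in the result when at least two copies have it set.
        voted[offset] = (a & b) | (a & c) | (b & c)
        if not (a == b == c):
            mismatches.append(offset)
    return bytes(voted), mismatches
```

Because the vote is taken per bit, upsets in different bit positions of different copies are still corrected; only two copies corrupted at the same bit defeat the scheme.
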

The DPUs are powered by highly integrated power management integrated circuits (PMICs) surrounded by multiple sensors. The supervisor monitors currents, voltages, and temperatures to implement FDIR techniques for detecting single event latch-ups and high current events. Single event upsets (SEUs) are corrected at several levels: the Zynq's configuration RAM (CRAM) is scrubbed by a soft error mitigation module, and the DDR4 uses an ECC mechanism that detects and corrects SEU-induced errors in memory.
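
To make the ECC principle concrete, the Python sketch below implements a Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword. This is a toy example chosen for brevity: the abstract does not say which code the DDR4 ECC uses, and memory controllers typically protect much wider words. The function names are hypothetical.

```python
DATA_POSITIONS = (3, 5, 6, 7)    # 1-based codeword positions that carry data bits
PARITY_POSITIONS = (1, 2, 4)     # 1-based codeword positions that carry parity bits


def _syndrome(word):
    """XOR of the 1-based positions of all set bits; zero for a valid codeword."""
    syndrome = 0
    for position in range(1, 8):
        if (word >> (position - 1)) & 1:
            syndrome ^= position
    return syndrome


def hamming74_encode(nibble):
    """Encode a 4-bit value into a 7-bit Hamming codeword."""
    if not 0 <= nibble <= 0xF:
        raise ValueError("nibble must be in the range 0..15")
    word = 0
    for i, position in enumerate(DATA_POSITIONS):
        if (nibble >> i) & 1:
            word |= 1 << (position - 1)
    # Set exactly the parity bits that cancel the syndrome of the data bits.
    data_syndrome = _syndrome(word)
    for position in PARITY_POSITIONS:
        if data_syndrome & position:
            word |= 1 << (position - 1)
    return word


def hamming74_decode(word):
    """Return (nibble, corrected_position); corrected_position is 0 if no error was seen."""
    syndrome = _syndrome(word)
    if syndrome:
        word ^= 1 << (syndrome - 1)   # the syndrome is the position of the flipped bit
    nibble = 0
    for i, position in enumerate(DATA_POSITIONS):
        if (word >> (position - 1)) & 1:
            nibble |= 1 << i
    return nibble, syndrome
```

A scrubber built on such a code would periodically read each word, decode it, and write the corrected codeword back so that single upsets do not accumulate into uncorrectable double errors.
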

The supervisor board is composed of a radiation-hardened Vorago Cortex-M0 microcontroller and an accompanying ProASIC3 FPGA. The main tasks of the supervisor are controlling the DPUs, multiplexing platform interfaces such as the high-speed X-Band and S-Band links, and routing CubeSat Space Protocol (CSP) packets.

Session Chairs

Attendee2

Attendee21