Session 5b: Evaluation and Benchmarks of Processing Devices and Systems
Tracks
Day 2 - On-Board Processing Benchmarks and AI Acceleration
Tuesday, June 15, 2021
3:05 PM - 4:05 PM
Speaker
Dr. Patricia Lopez Cueva
Thales Alenia Space France
Evaluation of new generation rad-hard many-core architecture for satellite payload applications
3:05 PM - 3:25 PM
Abstract Submission
The computing power required by space applications, in particular by payload functions, is constantly rising, the objective being to carry out more processing operations on board and thereby achieve higher spatial, spectral, temporal and radiometric resolution. Moreover, next-generation projects are seeking a breakthrough by embedding innovative processing to gain efficiency, flexibility and autonomy, which puts further emphasis on the need for powerful processing components.
Two kinds of solutions are available to meet the need for on-board processing power: radiation-tolerant COTS hardware, which is subject to component unavailability issues, or space-dedicated hardware with lower capabilities. The RC64 component from Ramon.Space offers a good compromise: it is radiation-hardened (Rad-Hard) and provides significant processing capacity with 64 DSP cores.
The RC64 component is a many-core radiation-hardened processor, built around spatial DSPs [1], and designed for applications requiring high computing power and very high speed processing [2][3]. Its massively parallel architecture offers processing performance close to that of FPGAs or specialised ASICs, while also providing the flexibility of software programming and low power consumption. However, the architecture requires algorithms to be rethought in order to make the most of the RC64 capabilities.
Over the last few years, CNES has been evaluating this component for different space applications. As part of this effort, a study was launched to assess the suitability of the RC64 component for highly complex algorithms.
The algorithm used in this study is based on the latest CCSDS 123 standard [4], which introduces, among other new features, near-lossless compression. However, this standard was designed in a sequential way, so the execution model has to be modified for parallel operation; this was a required step before any attempt at an efficient implementation on a many-core processor.
The first step was therefore to become familiar with the component through micro-benchmarks, in order to learn good practices and gain the technical knowledge needed to use the RC64 programming model efficiently, and to understand its architectural specificities: the impact of running DSPs in parallel, the memory hierarchy and the hardware multi-tasking scheduler all had to be analysed to achieve good processing performance. We then applied this knowledge to the CCSDS algorithm, testing different configurations of both the prediction and the encoding steps, which affect parallelisation performance and memory usage, in order to produce a preliminary implementation of CCSDS-123.0-B-2 that aims for efficient performance and scalability on the RC64.
In this paper, we give an overview of the RC64 architecture and describe the methodology used to parallelise the application efficiently on this target. We then present the results of the implementation as well as the conclusions of the evaluation.
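The near-lossless mode mentioned above bounds the reconstruction error of each sample by a chosen maximum. The following is a minimal sketch of a uniform quantizer of that kind applied to a prediction residual. The function names and the value of `max_error` are illustrative assumptions; the predictor, the sample representatives and the entropy coder of CCSDS 123.0-B-2 are not shown, and this is not the implementation evaluated on the RC64.

```python
def quantize_residual(residual: int, max_error: int) -> int:
    """Map a prediction residual to a quantizer index.

    The step size is 2 * max_error + 1, so the reconstruction error
    never exceeds max_error. With max_error == 0 this is lossless.
    """
    sign = -1 if residual < 0 else 1
    return sign * ((abs(residual) + max_error) // (2 * max_error + 1))


def dequantize_residual(index: int, max_error: int) -> int:
    """Rebuild the (approximate) residual from its quantizer index."""
    return index * (2 * max_error + 1)


if __name__ == "__main__":
    max_error = 2  # illustrative value, not taken from the abstract
    for residual in range(-10, 11):
        index = quantize_residual(residual, max_error)
        rebuilt = dequantize_residual(index, max_error)
        # The absolute error stays within the chosen bound.
        assert abs(residual - rebuilt) <= max_error
```

Setting `max_error` to 0 makes the index equal to the residual, which corresponds to the lossless case.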
[1] CEVA DSP processors for MacSpace, MacSpace Symposium and Summer Seminar, 2016
[2] RC64: High Performance Rad-Hard Manycore, Aerospace Conference, 2016 IEEE
[3] MacSpace / RC64 architecture, MacSpace Symposium and Summer Seminar, 2016
[4] Low-Complexity Lossless and Near-Lossless Multispectral & Hyperspectral Data Compression, Recommended Standards. CCSDS-123.0-B-2. Blue Book, 2019
Dr. Constantin Papadas
ISD SA
Summary of multiple benchmarks on the High Performance Data Processor (HPDP)
3:25 PM - 3:45 PM
Abstract Submission
The purpose of this communication is to discuss the suitability of the HPDP device for power-efficient, on-board data processing in future satellites. Five use cases will be presented. These use cases have been analysed in the course of a GSTP de-risk activity or in the early phases of scientific missions. The performance figures reported below were obtained on the HPDP evaluation board connected to a laptop.
Moon Asteroid Strike (MAS): The MAS algorithm is meant to detect and count the asteroids striking the Moon's surface. The idea is to count collision flashes on the Moon's surface and to extract, from their intensity and duration, meaningful information about the collision energy and the mass of the asteroid. From the computational point of view, the algorithm involves DMA engines, filtering, storage of the last 7 frames and comparison. The computational flow is shown hereinafter. The obtained performance is the on-the-fly analysis of 116 HD fps at the expense of 1.65 W power consumption. This implementation is meant to be used in the Lumio mission.
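To illustrate the "storage of the last 7 frames and comparison" step, here is a minimal sketch, assuming that a flash is a pixel that exceeds the per-pixel mean of the stored frames by more than a threshold. That decision rule, the threshold value and all function names are assumptions made for illustration; the abstract does not specify the comparison actually used on the HPDP.

```python
from collections import deque

HISTORY_LENGTH = 7  # the abstract mentions storage of the last 7 frames


def count_flash_pixels(frame, history, threshold):
    """Count pixels brighter than the mean of the stored frames by > threshold."""
    if len(history) < HISTORY_LENGTH:
        return 0  # not enough history yet to estimate the background
    count = 0
    for row_idx, row in enumerate(frame):
        for col_idx, value in enumerate(row):
            background = sum(old[row_idx][col_idx] for old in history) / len(history)
            if value - background > threshold:
                count += 1
    return count


def detect_flashes(frames, threshold=50):
    """Return the indices of frames that contain at least one flash pixel."""
    history = deque(maxlen=HISTORY_LENGTH)  # oldest frame is dropped automatically
    flagged = []
    for index, frame in enumerate(frames):
        if count_flash_pixels(frame, history, threshold) > 0:
            flagged.append(index)
        history.append(frame)
    return flagged


if __name__ == "__main__":
    dark = [[10, 10], [10, 10]]
    flash = [[10, 200], [10, 10]]
    frames = [dark] * 8 + [flash] + [dark]
    print(detect_flashes(frames))
```

A bounded `deque` is used here so that the frame store never grows beyond seven entries, which mirrors the fixed-size frame storage described in the abstract.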
Vessel Detection (VD) from EO images: The VD algorithm is meant to detect vessels in EO images. First, the VD algorithm applies a Sobel filter, an edge detection filter that amplifies the various features of the image. The Sobel operator consists of a pair of 3x3 convolution kernels and is designed to compute the 2D spatial gradient of an image. Next, 6 kernels of 20x20 pixels each are applied to the image and, after comparison with thresholds, the detected vessels are reported. This implementation can process 9.6 HD fps at the expense of 1.65 W power consumption.
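The Sobel step can be sketched as follows. This is a plain software illustration using the standard pair of 3x3 Sobel kernels, with the gradient magnitude approximated as |Gx| + |Gy|; that approximation and the function name are assumptions, and the 20x20 kernels and thresholds of the VD algorithm are not covered.

```python
# Standard 3x3 Sobel kernels for the horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image):
    """Return |Gx| + |Gy| for each interior pixel; border pixels are left at 0."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for ky in range(3):
                for kx in range(3):
                    pixel = image[y + ky - 1][x + kx - 1]
                    gx += SOBEL_X[ky][kx] * pixel
                    gy += SOBEL_Y[ky][kx] * pixel
            out[y][x] = abs(gx) + abs(gy)
    return out


if __name__ == "__main__":
    # A vertical edge: dark on the left, bright on the right.
    image = [[0, 0, 10, 10] for _ in range(4)]
    for row in sobel_magnitude(image):
        print(row)
```

On a uniform image every output value is zero, while pixels next to an intensity step produce a large response, which is what makes the subsequent thresholding meaningful.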
As a further step, a machine learning version of this application has been developed in TensorFlow. Targeting >90% correct answers for the VD application, the NN is trained outside the HPDP device using the TensorFlow framework and Python, and the final NN is mapped onto the HPDP to perform the pattern recognition. For this application, a convolutional network with 1 hidden layer and one final dense layer of neurons was chosen. In total, the network has about 5 thousand parameters and is mapped onto 2-3 different configurations of the HPDP array. The different configurations are applied on the fly with a delay of less than 0.5 ms per reconfiguration. The obtained results are better than those presented above in terms of accuracy, the processed fps figure is similar, and the power consumption remains the same.
AES 256: Two versions of the encryption algorithm have been developed: AES256 without CBC (Cipher Block Chaining) and AES256 with CBC. First, the key expansion is calculated in the FNC0 and passed via a FIFO to the array. The key expansion process generates the expanded version of the 256-bit key, which is used for the actual encryption. The expanded key is used repeatedly throughout the encryption process and is stored in the FIFO of the array in order to avoid additional delays. The data to be encrypted are fetched into the incoming stream by the DMA in groups of 4 bytes. On the output side, the cipher likewise sends data to the on-chip SRAM in packets of 4 bytes. Two instances of the AES256 IP without CBC (resp. one instance of the AES256 IP with CBC) can fit in the array, giving a total throughput of 11.7 MB/s (resp. 5 MB/s) at the expense of 1.65 W power consumption.
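The difference between the two versions lies in how blocks are combined, which the sketch below illustrates. It assumes a placeholder `toy_block_encrypt` function standing in for the AES-256 block encryption; that function is NOT AES and is not secure, and all names here are hypothetical. Only the chaining structure is the point.

```python
BLOCK_SIZE = 16  # AES operates on 128-bit blocks


def toy_block_encrypt(block: bytes, key: bytes) -> bytes:
    """Placeholder for the AES-256 block encryption. NOT AES, NOT secure.

    Only the first 16 key bytes are used by this toy transformation.
    """
    return bytes(((b ^ k) + 1) % 256 for b, k in zip(block, key))


def split_blocks(data: bytes):
    if len(data) % BLOCK_SIZE:
        raise ValueError("data length must be a multiple of the block size")
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def encrypt_without_chaining(data: bytes, key: bytes) -> bytes:
    """Each block is encrypted independently of the others."""
    return b"".join(toy_block_encrypt(block, key) for block in split_blocks(data))


def encrypt_cbc(data: bytes, key: bytes, iv: bytes) -> bytes:
    """Each plaintext block is XORed with the previous ciphertext block first."""
    previous = iv
    out = []
    for block in split_blocks(data):
        mixed = bytes(p ^ c for p, c in zip(block, previous))
        previous = toy_block_encrypt(mixed, key)
        out.append(previous)
    return b"".join(out)


if __name__ == "__main__":
    key = bytes(range(32))   # 256-bit key (illustrative)
    iv = bytes(BLOCK_SIZE)   # all-zero IV, for demonstration only
    data = b"A" * 32         # two identical plaintext blocks
    print(encrypt_without_chaining(data, key).hex())
    print(encrypt_cbc(data, key, iv).hex())
```

Without chaining, identical plaintext blocks give identical ciphertext blocks and the blocks can be processed independently; with CBC, each block depends on the previous ciphertext block, so the blocks of one stream must be processed in order.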
Image Compression: For the compression of multi-spectral imagery, the lossless flavor of the CCSDS 123.0-B-2 standard has been implemented. The compression is split into two sequential configurations, and the attained throughput is 1 Gb/s of image data consumed. In this configuration, the power consumption of the device is 1.65 W and no external processing capability is required. ISD is currently working on similar lossless algorithms featuring a low-entropy encoder, targeting the TRUTHS mission.
Mr. Vasileios Leon
National Technical University of Athens
Systematic Evaluation of the European NG-LARGE FPGA & EDA Tools for On-Board Processing
3:45 PM - 4:05 PM
Abstract Submission
The proliferation of demanding workloads in on-board systems for space applications, such as Vision-Based Navigation (VBN), has led to a new era of embedded on-board processing. To enable new applications and reconfigurable high-performance computing within a restricted power envelope, the space industry is examining alternative platforms and technologies. Among the existing embedded platforms, FPGAs have gained popularity due to their attractive performance-per-power ratio, and they are therefore already being considered for future space missions either as main accelerators or as framing processors. In this context, the new European space-grade family of FPGAs, named BRAVE and provided by NanoXplore, is expected to play a key role owing to its radiation hardness, high density, and reconfiguration features, as well as its software tools, which provide end-to-end FPGA development and seamless chip configuration.
The BRAVE family of FPGAs constitutes a promising addition to the currently limited pool of space-grade FPGAs. Most of these FPGAs are inferior to their Commercial Off-The-Shelf (COTS) counterparts in terms of either performance or resource availability. NanoXplore provides various BRAVE FPGAs ranging from low-end to high-end, i.e., NG-MEDIUM (65nm), NG-LARGE (65nm), and NG-ULTRA (28nm), which are Radiation-Hardened By Design (RHBD) and incorporate the traditional FPGA programmable logic resources (LUTs, DFFs, DSPs, RAMBs, etc.). NG-LARGE and NG-ULTRA include ARM processors, with the latter implementing a full SoC. Moreover, the BRAVE family provides features essential for embedded computing in space, such as the SpaceWire interface for fast I/O and chip configuration, and memory scrubbing to ensure continuous correct functionality.
In this paper, we evaluate the NG-LARGE FPGA by (i) assessing the development and configuration tools, i.e., NXmap3 and NxBase2, respectively, (ii) assessing the hardware components (board+chip), and (iii) performing high-performance DSP benchmarking with realistic workloads of space applications. Our work is associated with the QUEENS-FPGA and QUEENS2 activities of ESA and aims to examine the maturity of the software tools and the chip’s performance, targeting its future usage as an on-board processor. To evaluate the capabilities of such a new device, we develop a quality assessment methodology based on systematic and disciplined testing at different stages (Synthesis, Placement, Routing, HW Execution). The methodology includes comparisons with the predecessor NG-MEDIUM and with other competitor FPGAs. The benchmarking involves VHDL kernels from the computer vision and signal processing domains, representing the performance requirements of current and future spacecraft/rovers. Overall, the contribution of this work is twofold: (i) the presentation of a methodology for evaluating new devices/tools throughout the entire development & execution process, and (ii) the demonstration of the NG-LARGE capabilities as an on-board processor. Preliminary results show that NG-LARGE provides resource utilization similar to that of the competitors, and in some cases even better (e.g., in RAMBs). In terms of performance, NG-LARGE offers sufficient throughput, outperforming the space-grade CPUs. Specifically, the throughput is 130 Megasamples/sec for signal filtering, 5-10 Frames/sec for feature detection on Megapixel images, and ~7 sec per high-definition image for depth extraction.
Session Chairs
Mickaël BRUNO
CNES
Clément Coggiola
CNES