Session 6b: AI Inference Frameworks and Acceleration on Space Devices
Tracks
Day 2 - On-Board Processing Benchmarks and AI Acceleration
Tuesday, June 15, 2021 | 4:55 PM - 5:55 PM
Speakers
Mr. Jason Vidmar
Xilinx
Space DPU: Constructing a Radiation-Tolerant, FPGA-based Platform for Deep Learning Acceleration on Space Payloads
4:55 PM - 5:15 PM
Deep learning techniques based on neural networks are of interest for a variety of on-board processing applications in Space, spanning areas such as scientific analysis, object detection, and image classification (e.g., cloud detection). However, the impact of single event effects on high-performance hardware platforms capable of accelerating compute-intensive deep neural networks is an area of concern. In this paper, we present the architecture and design of a reconfigurable, FPGA-based platform that combines Xilinx's Deep Learning Processing Unit (DPU) for Convolutional Neural Networks with reliability enhancements for Space deployments. These enhancements include integrating the DPU with Xilinx's Triple Modular Redundancy (TMR) MicroBlaze subsystem, which provides a fault-tolerant, fail-safe host processor for dynamically deploying and managing neural networks in on-board processing applications, as well as Xilinx's Single Event Upset Mitigation (SEM) IP for real-time detection and correction of soft errors. Furthermore, a Fault-Aware Training technique developed by Xilinx Research Labs is applied and tested on selected neural networks for image classification and object detection: ResNet-18 and Tiny-YOLOv2, respectively. The platform was beam-tested on a commercially available Xilinx KCU105 board, demonstrating a substantial reduction in classification degradation relative to standard, non-mitigated approaches, and the architecture is directly transferable to Xilinx's Space-grade, Radiation-Tolerant Kintex UltraScale XQRKU060 devices for production deployments. We will summarize results from the existing work and outline approaches for scaling and encapsulating the design into a Vitis Target Platform for compatibility with Xilinx's mainstream Vitis Acceleration flow and Vitis AI tools.
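As a rough sketch of how a compiled network is typically deployed and managed at run-time through the mainstream Vitis AI flow mentioned above, the Python fragment below loads a compiled .xmodel and runs one inference on the DPU via the Vitis AI runtime (VART). The file name is hypothetical, the int8 tensor types are an assumption, and exact module and call details may differ on the Space DPU platform described in this abstract.

    import numpy as np
    import xir   # Vitis AI graph representation (assumed available on the target)
    import vart  # Vitis AI runtime (VART)

    # Load a network already compiled for the DPU (hypothetical file name).
    graph = xir.Graph.deserialize("resnet18_kcu105.xmodel")

    # The DPU executes the DPU-mapped subgraph; the host processor handles the rest.
    subgraphs = graph.get_root_subgraph().toposort_child_subgraph()
    dpu_subgraph = next(s for s in subgraphs
                        if s.has_attr("device") and s.get_attr("device") == "DPU")

    runner = vart.Runner.create_runner(dpu_subgraph, "run")

    # Allocate host buffers matching the DPU tensor shapes (quantized int8 assumed).
    in_tensor = runner.get_input_tensors()[0]
    out_tensor = runner.get_output_tensors()[0]
    in_buf = np.zeros(tuple(in_tensor.dims), dtype=np.int8)
    out_buf = np.zeros(tuple(out_tensor.dims), dtype=np.int8)

    # Submit one inference job and wait for completion.
    job_id = runner.execute_async([in_buf], [out_buf])
    runner.wait(job_id)
    print("top-1 class:", int(np.argmax(out_buf)))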
Mr. Ken O'Neill
Microchip Technology
Using the VectorBlox software development kit to create programmable AI/ML applications in radiation-tolerant RT PolarFire FPGAs
5:15 PM - 5:35 PM
In this paper, we describe the VectorBlox Accelerator software development kit (SDK) and how it is used to optimize and convert trained Artificial Intelligence (AI) models, targeting power-optimized 28nm FPGAs. Neural networks are sourced from a variety of supported input frameworks, such as TensorFlow, Caffe, ONNX, and PyTorch. The VectorBlox Accelerator SDK performs a three-step conversion flow to optimize the networks, calibrate and scale them to 8-bit representation, and finally create an image for implementation on the FPGA.
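As a framework-agnostic illustration of the calibrate-and-scale-to-8-bit step (not the VectorBlox Accelerator SDK's actual API), the NumPy sketch below derives a per-tensor scale factor from calibration data and maps floating-point weights and activations to int8; the tensor shapes are arbitrary examples.

    import numpy as np

    def calibrate_scale(samples: np.ndarray) -> float:
        # Symmetric per-tensor scale: map the observed dynamic range onto int8.
        max_abs = np.max(np.abs(samples))
        return max_abs / 127.0 if max_abs > 0 else 1.0

    def quantize_int8(x: np.ndarray, scale: float) -> np.ndarray:
        return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

    # Example: calibrate one layer's weights and a batch of activation samples.
    weights = np.random.randn(64, 3, 3, 3).astype(np.float32)
    activations = np.random.rand(8, 224, 224, 3).astype(np.float32)  # calibration images

    w_scale = calibrate_scale(weights)
    a_scale = calibrate_scale(activations)
    w_q = quantize_int8(weights, w_scale)
    a_q = quantize_int8(activations, a_scale)
    # The accelerator then computes int8 MACs; results are rescaled by w_scale * a_scale.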
Power-efficient implementation of the neural network on the FPGA is achieved by a soft IP core called CoreVectorBlox. This soft IP core comprises a RISC-V processor and firmware, a vector processor, and a convolutional neural network accelerator, which consists of a two-dimensional array of processing elements, making use of the multiply-accumulate blocks in the RT PolarFire FPGA.
By implementing neural networks in a matrix processor programmed in the fabric of the new radiation-tolerant RT PolarFire FPGA, the networks can be iterated and changed without resynthesizing the FPGA, resulting in convenient, programmable low-power AI applications that can be dynamically changed at run-time.
Examples of performance and utilization for a variety of neural networks sourced from TensorFlow, Caffe, ONNX, and PyTorch will be provided, for implementations both with and without triple modular redundancy, which may be desired for radiation mitigation purposes.
The radiation-tolerant RT PolarFire FPGA will be described, with emphasis on radiation test data and schedules for qualification and flight models.
Mr. Ran Ginosar
Ramon Space
Ramon Space RC64-based AI/ML Inference Engine
5:35 PM - 5:55 PM
AI and machine learning can be put to many good uses in Space. Examples in remote sensing include cloud detection, image understanding and analytics, image-based decision making, targeted imaging, precision agriculture, supply-chain analytics, disaster monitoring, alerting and analysis, change detection, and anomaly detection and tracking. In telecom satellites, examples include spectrum analysis, responsive and predictive interference detection and mitigation, anomaly detection, network and traffic management, capacity and spectrum planning, and user management. Defense applications include battle management and threat analysis. Across many Space segments, AI and ML are used for cybersecurity and Space debris avoidance.
Most AI/ML applications in Space employ standard methods of machine learning and neural network model development. Models are developed on the ground, typically using cloud computing and big data for training, or on personal computers and small workstations. Most often, a standard "framework," or specification language, is employed for model specification and development. Common frameworks include Google's TensorFlow, the Keras Python front-end to TensorFlow, Torch and PyTorch, Caffe, Theano, and toolboxes for MATLAB and Mathematica. Once the model is trained and ready for inference, it can be uploaded to Space for on-board execution, processing on-board data. This seamless mode of operation allows high-frequency upload of updated models and provides flexibility in whether the inference model is executed on the ground or in Space.
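For concreteness, a minimal Keras sketch of this ground-based workflow is shown below: a toy model is defined with a standard framework, trained on archived data, and saved as the artifact that would later be uplinked for on-board inference. The architecture, file name, and task (cloud detection) are purely illustrative.

    import tensorflow as tf
    from tensorflow import keras

    # Toy classifier defined with a standard framework (illustrative only).
    model = keras.Sequential([
        keras.layers.Input(shape=(224, 224, 3)),
        keras.layers.Conv2D(8, 3, activation="relu"),
        keras.layers.GlobalAveragePooling2D(),
        keras.layers.Dense(2, activation="softmax"),  # e.g. cloud / no cloud
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    # model.fit(train_images, train_labels, epochs=5)  # ground-based training on archived data
    model.save("cloud_detector.h5")  # trained artifact to be uplinked for on-board execution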
To support model-based and framework-based execution, a framework interpreter, or Inference Engine (IE), is employed. The IE receives the model, written in the source specification language such as Keras Python, and executes it. The models, especially in deep learning, are layered and consist of a predefined set of kernels, such as 1D/2D/3D convolution layers, separable and depthwise convolutions, pooling layers, normalization layers, fully connected layers, flattening layers, and softmax layers. The IE implements the layers in parametric form. Execution of the model proceeds in stages: in each stage, one layer is executed by invoking the appropriate kernel, customizing it with the parameters included in the model, and processing the data. A key advantage of this mode of operation is that no programming effort is required to execute each model.
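A minimal Python sketch of this staged, interpreted execution is given below; the kernel names and the model encoding are illustrative only, not the RC64 IE's actual interfaces.

    import numpy as np

    # Illustrative parametric kernels (placeholders for the optimized on-board kernels).
    def dense(x, weights, bias):
        return x @ weights + bias

    def relu(x, **_):
        return np.maximum(x, 0.0)

    def softmax(x, **_):
        e = np.exp(x - np.max(x, axis=-1, keepdims=True))
        return e / np.sum(e, axis=-1, keepdims=True)

    KERNELS = {"dense": dense, "relu": relu, "softmax": softmax}

    def run_model(model, x):
        # Execute stage by stage: one layer per stage, customized by its parameters.
        for layer in model:
            kernel = KERNELS[layer["type"]]
            x = kernel(x, **layer.get("params", {}))
        return x

    # A toy two-layer classifier expressed as data, not code: no per-model programming needed.
    model = [
        {"type": "dense", "params": {"weights": np.random.randn(16, 4), "bias": np.zeros(4)}},
        {"type": "relu"},
        {"type": "softmax"},
    ]
    print(run_model(model, np.random.randn(1, 16)))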
Framework-based, interpreted model execution faces two key challenges: computational load and data movement. The most computationally challenging kernels are convolutions, which boil down to a large number of vector dot (inner) products. The RC64 manycore DSP is designed specifically for this type of work, having 256 multiplier-accumulators (MACs) operating in parallel at very high power efficiency. But all arguments, both data (activations) and coefficients (parameters), need to be streamed continuously from memory. To fully utilize RC64's high level of parallelism, careful programming is required to orchestrate data movement, caching, pre-fetching, and off-loading as efficiently as possible. Fortunately, this effort is done once and is embedded in the optimized kernels.
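To make the reduction of convolution to vector dot products concrete, the generic sketch below lowers a 2D convolution to an im2col patch matrix followed by a single matrix multiply, the form a parallel MAC array executes efficiently; it is an illustration, not RC64 kernel code.

    import numpy as np

    def conv2d_as_matmul(image, kernels):
        # image: (H, W, C_in); kernels: (K, K, C_in, C_out); stride 1, no padding.
        H, W, C_in = image.shape
        K, _, _, C_out = kernels.shape
        out_h, out_w = H - K + 1, W - K + 1

        # im2col: every output position becomes one row of patch values.
        patches = np.empty((out_h * out_w, K * K * C_in), dtype=image.dtype)
        for i in range(out_h):
            for j in range(out_w):
                patches[i * out_w + j] = image[i:i + K, j:j + K, :].ravel()

        # One large matrix multiply = many vector dot products, ideal for a MAC array.
        result = patches @ kernels.reshape(K * K * C_in, C_out)
        return result.reshape(out_h, out_w, C_out)

    out = conv2d_as_matmul(np.random.rand(32, 32, 3), np.random.rand(3, 3, 3, 8))
    print(out.shape)  # (30, 30, 8)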
RC64, originally designed as a parallel DSP processor, has turned out to be extremely efficient at inference execution. Large inference tasks are easily distributed over many RC64 chips, interconnected by very high-speed SpaceFibre links for data exchange.
Session Chairs
Enrico Magli
Politecnico Di Torino
David Steenari
ESA