By: Girish Shirasat, Concurrent Technologies
The original article published on the RTC Magazine site can be found here.
With the ever-increasing trend for processor boards to include more general purpose cores for better parallel processing capacity, along with greater fabric interconnect bandwidth, it is possible to construct very flexible HPEC systems.
A High Performance Computer (HPC) situated in a data center environment is typically made up of a collection of racks and cabinets connected by Ethernet, InfiniBand or both. These run applications on top of a wide array of standards-based communication middleware that is widely extensible and capable of physical distribution. Each HPC rack consists of several server elements, and each of these is likely to include a combination of Intel CPU and accelerator devices operating in a heterogeneous fashion to provide the best possible performance for the application type. When additional performance is required, another server rack can be added to increase the cluster size, and because the interconnect mechanism is based on Ethernet or similar, the application gains more resources and operates faster. While adding more server racks can be both inconvenient and costly, it is possible because physical space is usually available to rent.
The opposite is usually true for High Performance Embedded Computers (HPEC), where the physical space envelope for a system is pre-defined and often cannot be extended. For example, in a cockpit or in-vehicle environment the space may be fixed years in advance, so as much functionality as possible needs to be fitted into the available size envelope, and this often includes more specialized devices such as high speed analogue I/O, FPGAs for pre-processing, CPUs for analysis, storage, display and control. Legacy HPEC systems often consisted of physically or logically separate elements, but there is a trend to migrate more functionality to heterogeneous CPU and GPGPU resources, which are relatively easy to program. In an HPEC environment the interconnect is also likely to be different, with serial fabrics like Ethernet, PCI Express and RapidIO all being used to connect the elements and provide the real-time performance and low latency necessary for a high performance distributed system. While these fabrics offer key technical advantages, they lack support for standards-based user space APIs that could be used as the basis for communication middleware, which can make it a challenge to design a distributed HPEC system around them.
What is Standard Communication Middleware?
Communication middleware is a software component that sits between an application and an operating system, providing a standards-based abstraction through which applications communicate with each other. The communication paths can be between processes on a single processor module or across multiple modules. Communication middleware simplifies distributed application design by providing a coherent, unified view of the system communication infrastructure, abstracting both the operating system communication interfaces and the fabric communication complexities, and leaving the system architect free to focus on the actual task in hand: designing the distributed application. Figure 1 shows where the communication middleware fits into a distributed system architecture.
Figure 1: Distributed System Architecture

Some of the most commonly used communication middleware in HPEC systems are Sockets, OpenDDS and OpenMPI.
The Sockets application programming interface is one of the most ubiquitous communication middleware elements and is usually based on the Berkeley sockets standard. It can be considered analogous to assembly-level programming when writing a distributed application: it provides a relatively low-level programming interface and would be the preferred choice when strict performance requirements must be met using limited system resources. Sockets are based on a connection-oriented, client/server design paradigm, which requires the client to know the address of the server in order to communicate. Almost all operating systems support sockets as standard in their distribution.
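As a minimal sketch of this client/server model using Python's standard socket API (the port choice, payload and echo behavior are arbitrary illustrations), note how the client must be given the server's address before it can communicate:

```python
# Minimal Berkeley-style socket client/server: the client must know the
# server's address in advance in order to connect.
import socket
import threading

def serve(sock: socket.socket, received: list) -> None:
    """Accept one connection, record the payload and echo it back upper-cased."""
    conn, _addr = sock.accept()
    with conn:
        data = conn.recv(1024)
        received.append(data)
        conn.sendall(data.upper())

# Server side: bind to an OS-chosen free port, listen, accept on a thread.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
server.listen(1)
addr = server.getsockname()          # the actual (host, port) the client needs
received = []
t = threading.Thread(target=serve, args=(server, received))
t.start()

# Client side: connect to the known server address and exchange a message.
with socket.create_connection(addr) as client:
    client.sendall(b"sonar frame 42")
    reply = client.recv(1024)

t.join()
server.close()
print(reply)  # b'SONAR FRAME 42'
```

Even this trivial exchange shows why sockets are "assembly level": connection setup, addressing and buffer handling are all the application's responsibility.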
OpenDDS is an open source implementation of the Data Distribution Service (DDS), a communication middleware standardized by the Object Management Group (OMG). DDS is based on the publish/subscribe design paradigm, in which a system contains multiple, loosely coupled data sources and sinks. The sources announce themselves as publishers of data on a DDS network while the sinks subscribe to these publishers to consume the data. The subscribers do not need to know the address of the publishers; the DDS network takes care of data delivery based on the type of data each subscriber is requesting. Multiple publishers and subscribers can join or leave the DDS network without application connection management implications. Beyond OpenDDS, there are many proprietary, optimized versions of DDS for embedded systems with additional functional attributes, such as making efficient use of the extra cores available in modern processors.
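The publish/subscribe pattern underneath DDS can be sketched in plain Python. The toy Bus class below is purely illustrative of the design paradigm, not the OpenDDS API: the point is that subscribers register interest in a topic, never in a publisher's address.

```python
# Toy in-process "DDS network": samples are routed by topic name, so
# publishers and subscribers stay completely decoupled from each other.
from collections import defaultdict
from typing import Any, Callable

class Bus:
    def __init__(self) -> None:
        # Maps topic name -> list of subscriber handler callables.
        self._subs = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Any], None]) -> None:
        self._subs[topic].append(handler)

    def publish(self, topic: str, sample: Any) -> None:
        # Deliver the sample to every current subscriber of the topic.
        for handler in self._subs[topic]:
            handler(sample)

bus = Bus()
seen = []

# Two independent sinks subscribe to the same topic; neither knows
# anything about the source that will eventually publish on it.
bus.subscribe("sonar/raw", lambda s: seen.append(("display", s)))
bus.subscribe("sonar/raw", lambda s: seen.append(("storage", s)))

# A source joins later and publishes; delivery is by topic, not by address.
bus.publish("sonar/raw", [0.1, 0.4, 0.9])
print(seen)
```

A real DDS implementation adds discovery, typed topics and quality-of-service policies on top of this basic decoupling, but the addressing model is the same.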
OpenMPI is an open source implementation of the Message Passing Interface (MPI), a language-independent standard communication protocol for writing distributed applications. It supports both the point-to-point and the collective design paradigms and provides a high-level programming abstraction for sending and receiving messages without worrying about connection management, multiplexing and so on. This design paradigm is widely used for compute-intensive applications. One example of the functionality offered by MPI is the ability to distribute a block of data to different local or remote processes for split computation and then combine the results, each step with just a single API call.
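That scatter/compute/combine pattern can be sketched in plain Python without an MPI runtime. The helper names below only mirror the semantics of the MPI_Scatter and MPI_Reduce collectives; they are not OpenMPI or mpi4py calls.

```python
# Sketch of MPI-style collective semantics: split a block of data across
# "ranks", compute locally on each chunk, then combine at the root.
from typing import Callable

def scatter(data: list, n_ranks: int) -> list:
    """Split data into n_ranks near-equal chunks (cf. MPI_Scatter)."""
    k, r = divmod(len(data), n_ranks)
    chunks, start = [], 0
    for rank in range(n_ranks):
        size = k + (1 if rank < r else 0)   # first r ranks get one extra item
        chunks.append(data[start:start + size])
        start += size
    return chunks

def gather_reduce(partials: list, op: Callable) -> float:
    """Combine per-rank partial results at the root (cf. MPI_Reduce)."""
    total = partials[0]
    for p in partials[1:]:
        total = op(total, p)
    return total

samples = list(range(1, 9))              # e.g. a block of preprocessed samples
chunks = scatter(samples, n_ranks=3)     # [[1, 2, 3], [4, 5, 6], [7, 8]]
partials = [sum(x * x for x in c) for c in chunks]   # local compute per rank
energy = gather_reduce(partials, lambda a, b: a + b)
print(chunks, energy)                    # total sum of squares = 204
```

With real OpenMPI the split, the per-process delivery and the combination each collapse into a single library call, which is precisely the abstraction the standard provides.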
The choice of which communication middleware to use depends heavily upon the type of application being designed. It is quite possible that one might design a system using a combination of communication middleware depending upon the data being processed.
Enabling Open Standard Communication Middleware for HPEC
Unlike Ethernet, where most operating systems provide a socket API as a standard way of accessing the fabric, PCI Express and RapidIO lack this standardized API support. Hence a system architect wanting to design a multiprocessor HPEC system around PCI Express or RapidIO fabrics needs to design a proprietary API on top of them, leading to a significant rise in the total cost of ownership of the product along with the risk of using a proprietary solution that is not easily portable.
To address this need and encapsulate the complexities of communication over a PCI Express or RapidIO fabric, Concurrent Technologies developed Fabric Interconnect Networking Software (FIN-S), which virtualizes these fabrics so that any existing Ethernet application can run on them with no change. This high level of portability is achieved alongside high bandwidth, low CPU utilization and low latency.
Because FIN-S allows any socket-based application to run on top of PCI Express and RapidIO fabrics, higher level communication middleware like OpenMPI and DDS can be used out of the box along with the low level socket APIs. With OpenMPI, users can use the standard TCP/IP Byte Transfer Layer (BTL) to write their compute-intensive distributed applications.
How Does Communication Middleware Map onto an HPEC System?
Having covered how FIN-S enables the use of different communication middleware over PCI Express and RapidIO fabrics, let's consider an example of a simple sonar array processing element mapped onto a MicroTCA system constructed of AdvancedMC modules, and then look at the data delivery model using DDS middleware to give a practical context for how standard communication middleware can map onto an HPEC system.
Figure 2 shows a simplified composition of a multi-processor sonar system comprising an input module, which might contain a DSP/FPGA combination to do the data pre-processing work. On the output side, the beamforming module may have a similar composition to generate the correct signals to be transmitted. The processing algorithms involved in sonar processing depend on the number of elements in the array as well as the field of view, and they are very computationally intensive. In this example, the system algorithms would be carried out by multiple Concurrent Technologies AM C1x/msd modules, which are based on 4th generation Intel Core i7 processors, with offload acceleration carried out by AG A1x/msd modules based on NVIDIA Tegra K1 devices. These General Purpose Graphics Processing Units (GPGPUs) are easily programmed using CUDA or OpenCL and consist of multiple cores capable of up to 1.4 TFLOPS of processing performance in a module about the size of a postcard.
Figure 2: Example Sonar Processing System
In a MicroTCA system, all the cards are connected to each other by a MicroTCA Carrier Hub (MCH); in this example the data plane would be RapidIO Gen 2, with Gigabit Ethernet links on the control plane. Additionally, the AM C1x/msd denoted as SysCon in Figure 2 acts as a system controller and has the additional responsibility of interfacing with the input and output cards along with control, display and data storage. The additional AM C1x/msd and AG A1x/msd cards act as co-processors, working alongside the system controller in algorithm computation and load sharing.
With the system architecture of our sonar processing element defined, let's now look at the data flow model of this system using the DDS middleware in Figure 3. The AM C1x/msd system controller SBC runs an input data handling process, which is responsible for streaming the preprocessed data from the module that listens to the sonar data (Figure 4). A subset of this data may be consumed by a compute process on the system controller as required for coordination and management.
Figure 3: Sonar Processing System Data Flow Model
Figure 4: AM C1x/msd Module 1
The compute processes, which are responsible for carrying out all the algorithmic computation on the data, subscribe to it, split it up into manageable portions and offload it by publishing to the GPGPU resources. The GPGPU elements form the workhorses of the entire system and can easily be scaled to cope with higher numbers of sonar array elements or more complex beam phasing capabilities. Once the GPGPU offload engines have processed their data, they publish their results for reassembly by the AM C1x/msd modules, which then publish the final beamforming output.
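The split/offload/reassemble flow can be sketched in plain Python. Everything below is hypothetical and illustrative only: `gpgpu_kernel` stands in for a real CUDA/OpenCL kernel, and the key point is that each chunk is tagged with its index so results can be reassembled in order even when the offload engines return them out of order.

```python
# Sketch of the split -> offload -> reassemble pattern: tag each chunk
# with its position so out-of-order results still reassemble correctly.

def gpgpu_kernel(chunk):
    """Stand-in for a GPGPU kernel: scale each sample (illustrative only)."""
    return [x * 2.0 for x in chunk]

def offload_and_reassemble(data, chunk_size):
    # Split: tag each chunk with its offset in the input stream.
    tagged = [(i, data[i:i + chunk_size])
              for i in range(0, len(data), chunk_size)]
    # Offload: process in reversed order to simulate out-of-order completion.
    results = [(i, gpgpu_kernel(chunk)) for i, chunk in reversed(tagged)]
    # Reassemble: sort by tag, then flatten back into one output stream.
    out = []
    for _i, chunk in sorted(results):
        out.extend(chunk)
    return out

beam = offload_and_reassemble([1.0, 2.0, 3.0, 4.0, 5.0], chunk_size=2)
print(beam)  # [2.0, 4.0, 6.0, 8.0, 10.0]
```

In the DDS version of this flow, the tagged chunks and tagged results would simply be samples on offload and result topics, with the reassembly step subscribing to the latter.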
Depending upon the system load, more co-processor boards can be added to the system, up to the maximum allowed by the chassis configuration. There is no need to change the control and monitoring application on the system controller or co-processor devices, nor any need to know the actual addresses of the publishers and subscribers; additional compute nodes become part of the DDS network as soon as they boot and can start contributing by sharing the system load.
By virtualizing PCI Express and RapidIO fabrics as Ethernet, FIN-S enables the use of standard off-the-shelf communication middleware, allowing the system architect to focus on designing portable, high performance, fabric-agnostic distributed applications rather than expending time and resources on fabric-level inter-board communication, thus reducing the total cost of ownership of the system.