Will RapidIO Become The Fabric For ARM-Based Servers?

By: Chris A. Ciufo, Editor-in-Chief, Embedded; Extension Media

The original article was published on the EECatalog site.

Rick O’Connor and I go way back—all the way back to Ottawa, Canada where he was CTO at Tundra Semiconductor, freshly spun out of Newbridge Networks. He’s an expert on interface ICs, from the VMEbus chips I was buying from him for our VME boards, to all manner of fabric interconnects.

Rick, now the Executive Director for RapidIO.org, was there at the beginning of RapidIO, moved away for a decade and is now back from “retirement” to oversee RapidIO’s possible leap from 4G/LTE basestations into the data center bolted to server-class 64-bit ARM®-based SoCs.

I caught up with Rick in January at a technology insiders’ conference, then spoke with him at length for this article. Edited excerpts follow.

Chris (C2) Ciufo: What’s new with RapidIO?

Rick O’Connor: As I presented at the Embedded Tech Trends conference in Phoenix in January, more than 100 million 10-20 Gbps RapidIO ports have shipped worldwide, and the technology is in 100 percent of 4G/LTE basestations worldwide. [Refer to figure.]

If you pull out your 4G/LTE-capable cell phone and call me on my 4G/LTE-capable cell phone, the tower that your phone is connected to is RapidIO based, and our call will go through a RapidIO fabric at your end and at my end. This is true whether it’s Nokia, Ericsson, Huawei, Alcatel-Lucent, ZTE, Great Dragon Telecom…all the 4G/LTE basestation providers. In fact, there are more 10 Gbps RapidIO ports in the world than there are 10 Gbps Ethernet ports.

C2: Rick, why did you come back to RapidIO.org?

O’Connor: Maybe we’ll start with when I joined, in fall 2013. It came out of initial conversations with some of the RapidIO people I’ve known at companies such as IDT, Freescale and TI. They came to me and suggested that I “hang out with them.” I had done this in the early days, and “we came, we conquered and we got as much market share as you can get in the wireless space.” That’s done. I wasn’t interested in being a caretaker for the technology and “riding” RapidIO into the sunset. I like to be active and doing things, with my hand on the steering wheel.

Their reaction was interesting: they didn’t want a caretaker. They described an opportunity in medium- to large-scale high-performance computing (HPC) and data analytics, where the technology is being pulled into the data center. I found that interesting, and I realized that if there is any chance of wresting market share in that space from Intel and the x86 architecture, we’ve got to have ARM at the table. We’ve got to develop RapidIO as the scale-out fabric for clustered ARM processors, not just because a handful of companies have done it on their own in the wireless infrastructure space, but by building out a stronger ecosystem of ARM partners around a unified RapidIO fabric.

RapidIO Overview Slide

So around this time last year (in 2014) I set about having a dialog with ARM. Over the course of the year, culminating in the late summer or early fall, ARM came to the conclusion that it wanted to join RapidIO.org, and that “yes” RapidIO is a very interesting technology. One of the company’s lead architects made the comment to me that if they were to set off to define an off-chip fabric interface for multiprocessor clustering, they’d probably come up with something that looks very much like RapidIO. So…why not RapidIO? From then on, it was just a question of going through the business and legal aspects of an IP company like ARM joining an open industry trade group.

C2: That’s not without precedent for ARM.

O’Connor: No, it’s not. But we had AMD join, and Marvell also joined. These were basically the “new” ARM licensees. ARM and AMD joined at about the same time; in fact, weeks apart. Each understood that the other one was coming. For all intents and purposes they joined at the same time and we announced that they joined together at the Linley Processor [forum] in October 2014. Marvell joined in November, which is interesting because there was a Marvell architect in the audience at the Linley event, and he approached me at that time after my RapidIO presentation there.

Even before these events, one of the reasons that ARM thought this was interesting, beyond the technology benefits of RapidIO, is because there was already a very healthy subset of ARM licensees within the RapidIO community. They included some pretty significant ARM partners and licensees: Freescale, Texas Instruments, Cavium, Altera, Xilinx and others. We added to the list ARM, AMD and Marvell. This represents a good cluster of ARM processor companies.

And more are coming, although I can’t say who, just yet. What we wanted was enough critical mass up front when we announced that we were forming a task group for AXI and the AXI Coherency Extensions (ACE). That coherency task group is working on defining how the ARM AXI4/AXI5 ACE on-chip coherency protocols map onto, and can be carried off-chip over, a RapidIO fabric, both for homogeneous processor systems and for larger heterogeneous processor systems.

C2: So we’re all in agreement, define a heterogeneous system.

O’Connor: A really good example of a heterogeneous system [see figure] is PayPal’s use of the HP Moonshot m800 cartridge platform. Each cartridge uses four TI KeyStone II SoCs, each KeyStone II has four ARM Cortex-A15 cores plus eight C66x DSP cores, and you can put 45 of these cartridges in a Moonshot chassis. The whole system is clustered over RapidIO. It was obviously in place for a while before we started the coherency task group (it isn’t a coherent system), but it is indicative of: a) the use of ARM devices in a compute-intensive analytics application, complete with floating-point offload DSPs; and b) the use of RapidIO to cluster those resources.

RapidIO - HP Moonshot Proliant m800 Overview Slide

Part of the proof point is that the HP system has been released to production, and there are others whose names are not as recognizable as HP’s. Those proof points are part of the reason there is investment going on within the RapidIO community that will bring more capability to ARM-based devices. It also speaks to the investment companies are making in defining coherency scale-out within RapidIO.org.

C2: What are the goals of the coherency task group?

O’Connor: Companies have been telling me that RapidIO can be perfect for the data center, but in order for it to actually find its way into the data center (and have its perfection shine), one needs to address the hegemony that Intel has within the data center.

“Coherent” means maintaining consistent data across a broader multiprocessing system. Multiple copies of that data are kept in local caches so that local processing resources can access it very quickly, and all of those cached copies must remain coherent, or consistent, with one another. The question is how to do that at scale.

The traditional 2P/4P configuration, with two or four CPU sockets, describes a tightly coupled architecture with very short stub lengths and a protocol that connects those processors together. There’s also the notion of maintaining lookup tables and other resources coherently in a larger heterogeneous system. That might be called a “loosely coupled” system, but you still have multiple cached copies of lookup and other data spread throughout the system.

So one of the things RapidIO allows is maintaining many tens of thousands of nodes/processors in one large fabric. Within that fabric, RapidIO allows multiple sub-domains, and within those sub-domains there can be coherency and consistency. That is, multiple sub-domains can each be coherent unto themselves, yet all participate in the larger RapidIO fabric, sharing resources through traditional message-passing paradigms.
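The sub-domain arrangement O’Connor describes can be illustrated with a toy model: nodes within a sub-domain see one coherent copy of shared data, while traffic between sub-domains crosses the fabric as explicit messages. This is a minimal sketch with invented names, not a real RapidIO API.

```python
# Illustrative only: coherent sub-domains joined by a message-passing fabric.
class SubDomain:
    """A group of nodes that stay cache-coherent with one another."""
    def __init__(self, name, nodes):
        self.name = name
        self.nodes = nodes
        self.shared = {}          # one coherent copy visible to all nodes

    def write(self, key, value):
        self.shared[key] = value  # every node in this domain sees the update

    def read(self, key):
        return self.shared.get(key)


class Fabric:
    """The larger switched fabric connecting many sub-domains."""
    def __init__(self):
        self.domains = {}
        self.mailboxes = {}       # per-domain queues for message passing

    def attach(self, domain):
        self.domains[domain.name] = domain
        self.mailboxes[domain.name] = []

    def send(self, src, dst, payload):
        # Cross-domain traffic: explicit messages, not coherent loads/stores.
        self.mailboxes[dst].append((src, payload))

    def receive(self, name):
        return self.mailboxes[name].pop(0)


fabric = Fabric()
compute = SubDomain("compute", nodes=["cpu0", "cpu1", "cpu2", "cpu3"])
io = SubDomain("io", nodes=["nic0", "nic1"])
fabric.attach(compute)
fabric.attach(io)

compute.write("table", [1, 2, 3])                     # coherent inside "compute"
fabric.send("compute", "io", compute.read("table"))   # shared across the fabric
src, data = fabric.receive("io")
```

The point of the sketch is the asymmetry: writes inside a sub-domain are implicitly visible to its nodes, while anything crossing sub-domain boundaries travels as a message over the fabric.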

C2: Is RapidIO deficient, hence the task group?

O’Connor: It’s not that RapidIO is missing something in a glaring way; rather, a coherency protocol called the Globally Shared Memory (GSM) specification was established in RapidIO from the beginning. Some of the original PowerPC devices architected to use RapidIO implemented the GSM spec. Coherency protocols exist within RapidIO and within ARM’s AXI and ACE bus structure, so the work we’re doing is mapping the on-chip ACE protocol to the RapidIO GSM protocol so that coherent ARM-based traffic can be carried across the RapidIO fabric. There may be a need to augment the existing GSM spec with a few other transaction types to complete the [ARM] mapping, but the fundamental capability is there.

Both coherency architectures, in the RapidIO environment and in the AMBA® environment, have their origin in the Stanford DASH cache-coherent non-uniform memory access (ccNUMA) architecture for multiprocessors, so many of the inherent primitives on both sides are quite similar and compatible. I’m not saying that it’s easy or that it’s a straightforward mapping; there’s still work to be done and tradeoffs to be made. But that’s the work the task group is doing.
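The common DASH ccNUMA ancestry O’Connor points to centers on directory-based coherence: a home directory tracks which nodes cache a line and invalidates the other sharers when one node writes. The sketch below is a bare-bones illustration of that idea; the names are invented and do not come from the RapidIO GSM or AMBA ACE specifications.

```python
# Minimal directory-based coherence sketch in the DASH ccNUMA spirit.
class Directory:
    def __init__(self):
        self.memory = {}     # home memory: line -> value
        self.sharers = {}    # line -> set of nodes caching the line

    def read(self, node, line):
        # Record the reader as a sharer, then supply the line.
        self.sharers.setdefault(line, set()).add(node)
        return self.memory.get(line)

    def write(self, node, line, value):
        # Invalidate every other sharer before granting ownership,
        # so no stale cached copy survives the write.
        invalidated = self.sharers.get(line, set()) - {node}
        self.sharers[line] = {node}
        self.memory[line] = value
        return invalidated   # in hardware these would be invalidate messages


d = Directory()
d.memory["x"] = 0
d.read("cpu0", "x")
d.read("cpu1", "x")
stale = d.write("cpu0", "x", 42)   # cpu1's cached copy must be invalidated
```

Mapping ACE onto GSM amounts to translating transactions of roughly this shape (reads that register sharers, writes that trigger invalidations) from one protocol’s encoding to the other’s.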

C2: The results of this mapping, once successful, would mean what? That ARMv8 instructions will issue data and processor commands across a homo- or heterogeneous, geo-local or -distant RapidIO fabric with data center processing nodes?

O’Connor: Yes. There are two things to think about. There’s the notion of “scale out” and “scale up.”

Scale out is multiple processors, each running its own OS and potentially wanting to share overlapping, coherent space. Scale up, on the other hand, is multiple processors running a single OS instance. When you scale out, you grow the system by simply adding more nodes, and the existing nodes aren’t affected because each is a self-contained environment. You get more availability and more scalability.

The 2P/4P case where you have a server and you’re just connecting two or four sockets together is where you’re trying to build, in effect, one big processor and you’ll run a single OS image on it—this is scale up. In contrast, just to put a finer point on it, scale out is where multiple node processors are arranged across a fabric with switches like that HP m800 Moonshot product we discussed earlier. There are multiple boards and multiple SoCs per board, all connected either in a switched fabric or in a 2D torus, and each of those resources is running its own OS.
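The contrast between the two growth strategies can be captured in a toy model: a scale-up system adds sockets under one OS image, while a scale-out system adds self-contained nodes, each with its own OS, to a fabric. Class and attribute names here are illustrative assumptions, not anything from the interview.

```python
# Toy model contrasting scale-up (2P/4P, one OS image) with
# scale-out (fabric of self-contained nodes, one OS each).
class ScaleUpSystem:
    """2P/4P style: sockets tightly coupled under a single OS image."""
    def __init__(self):
        self.sockets = []
        self.os_images = 1            # one OS spans every socket

    def add_socket(self, cpu):
        # Growing means joining the same coherence domain; practical
        # limits (stub lengths, protocol) keep socket counts small.
        self.sockets.append(cpu)


class ScaleOutSystem:
    """Fabric style: each node is self-contained and runs its own OS."""
    def __init__(self):
        self.nodes = []

    def add_node(self, cpu):
        # Growing means adding another node to the fabric; existing
        # nodes are unaffected because each is self-contained.
        self.nodes.append({"cpu": cpu, "os_image": 1})

    @property
    def os_images(self):
        return len(self.nodes)


up = ScaleUpSystem()
for cpu in ["socket0", "socket1", "socket2", "socket3"]:
    up.add_socket(cpu)

out = ScaleOutSystem()
for cpu in ["node0", "node1", "node2", "node3"]:
    out.add_node(cpu)
```

Four sockets in the scale-up system still mean one OS image; four nodes in the scale-out system mean four independent OS images sharing a fabric, which is the Moonshot pattern described above.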

C2: What do you think ARM is trying to do with RapidIO?

O’Connor: It’s trying to cover the waterfront. ARM wants to be able to build a 2P/4P configuration because that’s what many customers want to do. With what we’re trying to do with RapidIO, we’d be able to offer the scale up for a very tightly coupled homogeneous 2P/4P configuration. And at the same time, we’ll be able to offer a heterogeneous, tens-of-thousands-of-nodes architecture that scales out like the HP Moonshot platform. Because of the flexibility and performance of RapidIO it’s really the only architecture—short of growing your own protocol—that has that flexibility to cover the waterfront.

C2: So RapidIO would allow ARM to do both?

O’Connor: Yes.

C2: What architecture does Intel offer today that RapidIO is trying to beat?

O’Connor: Intel has had proprietary front-side bus architectures for a long time that let you connect server-class processors together, and they’ve evolved over the years. As ARM comes in with cores that match the requirements of a server-class processor, having some kind of front-side interconnect is important for building 2P/4P systems and for scale-out systems.

You could ask if ARM has something like this already, but they’ve sort of stayed away from this space. Historically, ARM has done a very good job of being a processor vendor while leaving the peripherals up to their partners for SoC implementation. ARM’s business is the processor.

The challenge in this particular space, regardless of server architecture, is that some partners will roll their own. I call these “YAFIs”: yet another fabric interface. One could certainly design something that performs well, but I’ve seen that many of these YAFIs are the same, only different. So why not consider RapidIO?

Beyond the technical merits of doing something with RapidIO, the ecosystem enablement that comes from a standard front-side interconnect is what will really help the ARM architecture take any kind of position in the x86-dominated data center.

C2: What are some of the RapidIO ecosystem benefits?

O’Connor: Part of the attraction is to be able to build heterogeneous systems; maybe you’re going to get I/O processors from X and compute processors from Y, and maybe a floating point processor from Z. Wouldn’t it be nice if all of these had a nice low latency, scalable, unified fabric for interconnection? This allows IP providers to focus on their “secret sauce” just like the ARM partnership with its customers has always allowed.

C2: But ARM needs to offer some sort of standardization to make this a reality in this next market?

O’Connor: That’s exactly it. Everybody’s better off with a standard front-side interconnect. Is there really a point of differentiation for an SoC vendor on the interconnect? Or should it just be table stakes, so that everybody can participate and grow the opportunity for a broader ecosystem? That lets ARM partners focus instead on their accelerators, encryption engines, I/O processors, and whatever their secret sauce is. In fact, if the front-side interconnect becomes a point of differentiation, it could split the ARM ecosystem and decrease the likelihood of ARM making any significant penetration. In that latter case [proprietary, splintered front-side implementations] you’re locked into a single vendor again.

C2: Why is the world anxious for an ARM-based alternative to Intel?

O’Connor: It’s this: price, power and performance. There’s a valid argument that the Intel Architecture (IA) parts perform wonderfully, but they cost too much and consume far too much power.

But the next layer is that there are some really challenging performance, or rather power-density, hurdles the industry has to deal with when using Intel processors in the data center. One of the beautiful things about the ARM architecture is that it spans medium to low performance points at very, very low power. Put another way, it delivers high performance at really low power in smartphones, and now the roadmap goes all the way up to 64-bit ARM cores for server-class machines.

This is attractive and really does cover the waterfront from a price, performance and power standpoint. Add on a RapidIO unified fabric into this mix, and the market may respond, “this is something very interesting.”

My belief is: there is an opportunity, it is interesting, there are compelling performance arguments to be made, and there’s certainly an appetite for something that’s not IA-based. What we’re really talking about is gauging the momentum behind this market appetite.

RapidIO Executive Director, Rick O'Connor - Quote

C2: If your theory is correct, ARM should be all over RapidIO.org.

O’Connor: ARM’s position and participation in this work is similar to how it treats any ARM partner and that partner’s peripherals and IP that may go into an SoC. ARM wants to make sure that it’s done right and that it’s a really good [implementation], but as with any of the other peripherals that might surround an ARM core, the company doesn’t stand on the top of a mountain and anoint anything as “The Thing.”

ARM does a good job at being agnostic. Clearly the fact that it’s at the table and participating and it’s public about its participation is, for ARM, a fairly strong endorsement. But you’ll never see—at least in the near term—ARM pointing any of its partners to RapidIO for SoC use.

C2: Is there any public reference to Intel participating in RapidIO?

O’Connor: If you connect the dots of a bunch of acquisitions that happened over the last 12 months you might be able to arrive at that conclusion.

C2: I first read up on RapidIO during the “fabric wars” some 10 years ago. Where can someone go to learn more about today’s RapidIO?

O’Connor: Start at our website www.rapidio.wpengine.com/technology-comparison. From there, you can navigate to several technology descriptions, or check out the “Resources” tab from the main page and you’ll find a bunch of white papers.