
Accelerating parallel data processing using optically tightly coupled FPGAs

Open Access

Abstract

A cutting-edge field programmable gate array (FPGA) card can be equipped with high-bandwidth inputs and outputs by high-density optical integration, e.g., onboard Si-photonics transceivers or co-packaged optics. It is possible to implement highly parallel data processing by using tightly coupled FPGAs. We previously proposed a lightweight, fully connected inter-FPGA network called OPTWEB for efficient collective communications. OPTWEB discards the traditional packet communication mechanism and introduces dedicated memory processing to minimize communication overhead. In this paper, we provide the design guidelines for a parallel data application on FPGAs connected by OPTWEB. We then present a generalized diameter-two hypercube network topology for connecting multiple FPGAs, which enables larger applications. In the guidelines, data to be exchanged should be directly processed by the dedicated memory on each FPGA, and parallelism should be implemented to match the collective communications provided by OPTWEB. We illustrate two case studies, one on parallel counting sort and the other on sharing averaged parameters, on eight custom Stratix10 MX2100 FPGA cards with 800 Gbps network bandwidth. The parallel counting sort operates up to 5.3 times faster than a comparable CPU cluster with the conventional InfiniBand network, and with sharing averaged parameters, up to 250 times faster operation is achieved. The two applications utilize a total of 80.1% and 60.6% of the adaptive logic modules of the FPGA, respectively.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. INTRODUCTION

Parallel data processing using field programmable gate arrays (FPGAs) has been attracting attention as an economical, low-power computing platform for extracting information from large amounts of data, which has now become essential. Parallel data processing includes not only artificial intelligence learning/inference but also database processing. In both applications, memory and network inputs/outputs (I/Os) are crucial components, and they should be designed with care so as not to create a performance bottleneck.

The latest FPGAs are equipped with high-bandwidth memory 2 (HBM2), and research is underway to speed up data processing targeting high-bandwidth memory [1,2]. However, the throughput of data flow processing is usually limited by both the memory bandwidth and the network bandwidth of an FPGA. For example, the latest Ethernet I/O is limited to the aggregation of a 400 Gbps network bandwidth over a Peripheral Component Interconnect Express (PCIe) bus.

In addition to the bandwidth, memory capacity must also be considered. HBM2 has a 4–8 GiB capacity thanks to its current on-chip integration technology, and the latest HBM2e has at most twice that capacity. Only a few HBM2s can be mounted on a single compute element.

The sharing of memory data among multiple FPGAs is essential for highly parallel data processing. The heart of parallel data processing on multiple FPGAs is the inter-FPGA network using remote direct memory access (RDMA), which enables efficient collective communications among multiple FPGAs. A typical collective communication is all-to-all communication, where each FPGA sends an individual message to all FPGAs. An example of an application using all-to-all communication is found in parallel data sort (see Subsection 3.B). Another typical collective communication is all-reduce, where partial computational results are collected from all FPGAs and the results are combined into one global result shared by all FPGAs. An example of an all-reduce application is found in training for deep learning. The parameters of each FPGA are usually averaged by summation and the results are then shared by all FPGAs [3,4] (see Subsection 3.C).
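As a concrete illustration of these two patterns, the hedged Python sketch below simulates their semantics with plain lists standing in for FPGAs; the function names and data layout are illustrative and unrelated to any FPGA implementation.

```python
# Hypothetical pure-Python sketch (no MPI) of the two collective patterns
# described above; "FPGA" here is just a list index, and the data layout is
# illustrative, not the OPTWEB memory mapping.

def all_to_all(send):
    """send[i][j] is the block FPGA i addresses to FPGA j.
    Returns recv where recv[j][i] == send[i][j] (a global transpose)."""
    n = len(send)
    return [[send[i][j] for i in range(n)] for j in range(n)]

def all_reduce_average(local_params):
    """Each FPGA holds a parameter vector; all FPGAs end up with the
    element-wise average (the 'sharing averaged parameters' pattern)."""
    n = len(local_params)
    dim = len(local_params[0])
    avg = [sum(p[k] for p in local_params) / n for k in range(dim)]
    return [avg[:] for _ in range(n)]  # every FPGA gets a copy of the result

if __name__ == "__main__":
    send = [[f"{i}->{j}" for j in range(3)] for i in range(3)]
    print(all_to_all(send))  # each row j now holds the blocks sent to FPGA j
    print(all_reduce_average([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))
```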

We previously proposed a lightweight, fully connected inter-FPGA network called OPTWEB for efficient collective communications. OPTWEB discards the traditional packet communication mechanism and introduces dedicated memory processing. These two features enable us to minimize the communication overhead and hardware resource utilization [5,6]. Specifically, OPTWEB reduces the utilization of adaptive logic modules (ALMs) required for high-bandwidth communication. OPTWEB is an FPGA-to-FPGA synchronization circuit network using a simple single-cycle block control signal. Since the block signal only connects two devices, a fully connected network topology is designed with advanced onboard Si-photonics transceivers. OPTWEB fully utilizes the large number of external I/O channels and the large number of memory channels with the introduction of HBM2 to form a fully connected network topology.

In this paper, we provide design guidelines for parallel data processing on FPGAs with OPTWEB. These guidelines are based in part on the domain-specific architecture (DSA) approach, which the computer architecture community uses to achieve high efficiency. In the guidelines, data to be exchanged should be directly processed on the dedicated communication memories of each FPGA, and natural parallelism should be implemented to match the collective communications. This paper attempts to generalize the application design by integrating the insights from our previous case study [7]. To enable large-scale applications, a larger number of FPGAs should be interconnected. We thus extend OPTWEB to support a generalized hypercube (GHC) network topology based on our prior study [6].

The contributions of this study are as follows.

  • We provide design guidelines for applications using OPTWEB. We then illustrate two case studies, one on parallel counting sort and one on sharing averaged parameters, using eight custom Stratix10 MX2100 FPGA cards that form OPTWEB to aggregate an 800 Gbps network bandwidth.
  • In parallel counting sort, we report a 5.3 times improvement in execution time relative to a comparable CPU cluster using InfiniBand (IB) [7]. The total utilization of ALMs was 81% in the custom Stratix10 MX2100 FPGA. Our analysis showed that the GHC decreased the sorting throughput by 35% compared to an ideal fully connected network topology with the same number of FPGAs.
  • In sharing averaged parameters, we report that the FPGA cluster with OPTWEB improved the parallel data application by up to 250 times. The total utilization of ALMs was 61%. Our analysis showed that the GHC improved sharing averaged parameters by up to 4.2 times compared to an ideal fully connected network topology for a node degree of 64.

In Section 2 of this paper, we provide an overview of the background. In Section 3, we present the design guidelines for parallel data processing using OPTWEB and describe the two case studies. In Section 4, we discuss the empirical evaluation results on eight FPGA cards implemented with OPTWEB. We conclude in Section 5 with a brief summary and mention of future work.

2. BACKGROUND INFORMATION

A. Existing Inter-FPGA Networks

In traditional computer systems using FPGAs, data is packetized and communicated by the IB host channel adapter (HCA) or Ethernet network interface card on the host’s PCIe bus. HCAs and network interface controllers are connected to electrical switches [Fig. 1(a)]. In the latest InfiniBand, the Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) has optimized the efficiency of all-reduce thanks to the co-design of InfiniBand switches and HCAs [8]. However, due to the bandwidth constraints of the PCIe bus, it would be difficult to fully utilize the high potential bandwidth of an FPGA. Start-up latency is also large due to the increased overhead on the PCIe bus [9]. Even with SHARP, the performance improvement of collective communication on the traditional inter-FPGA network would be limited.

Fig. 1. Types of inter-FPGA networks: (a) Traditional packet, (b) indirect packet, (c) direct packet, (d) OPTWEB.

An alternative way to provide inter-FPGA networks is to add a communication intellectual property (IP) such as Ethernet to an FPGA to avoid the constraints from the host and PCIe. In this approach, an Ethernet port is directly attached to each FPGA [Fig. 1(b)]. The hardware resources required to support Ethernet increase in proportion to the bandwidth. Even with the latest and largest FPGAs, the hardware resources are insufficient to build a high-bandwidth system with HBM2 on top of a complicated packet-network structure [5,10–12].

A direct inter-FPGA network is another approach. FPGAs are directly connected by dedicated links. Various FPGA-accelerator systems have recently appeared, such as Project Catapult v1 [13], the Flow in Cloud project [14], the EuroEXA project [15], Novo-G# cluster [16,17], and the Cygnus supercomputer [18]. The network topology of these systems is a low-radix network topology (e.g., a 2D mesh) [Fig. 1(c)]. The low-radix network has the disadvantage of a large diameter and large average shortest path length. It is well known that the throughput and latency performance of a low-radix network is inferior to that of a high-radix network.

To mitigate the hardware resource problems of indirect inter-FPGA networks [Fig. 1(b)] and the communication latency of direct inter-FPGA networks [Fig. 1(c)], we previously proposed a lightweight, fully connected inter-FPGA network using onboard Si-photonics transceivers [Fig. 1(d)]  [5].

B. Aspect of DSA

The existing inter-FPGA networks rely on a general-purpose packet structure, as described above. We thus explore alternative network architectures with a lower overhead. In this context, DSA provides efficient design options for inter-FPGA networks.

DSA has attracted attention in the field of computer architecture as a way to design energy-efficient digital systems [19]. DSA typically increases the number of arithmetic operations per instruction from one to hundreds for efficiency [19], as seen in the TPU, Catapult v1, and Darwin [20]. DSA emerged in response to the predicted end of Moore’s law, expected around 2020. Since the number of transistors on a chip will no longer increase after Moore’s law ends, it will be difficult to improve the performance of on-chip computing platforms. In such a scenario, specialization would increase performance and energy efficiency, but at the cost of reduced program productivity.

The key factors of DSA are five techniques for performance and efficiency gains [19]: (1) dedicated memories to minimize the communication overhead, (2) the resources saved from dropping advanced microarchitectural optimizations, (3) the most accessible form of parallelism that matches the domain, (4) data specialization, and (5) domain-specific programming language.

The end of Moore’s law will bring a similar performance problem to the endpoints of the optical network of an FPGA, because those interfaces are built from electronic circuits. Current communication systems do not yet incorporate DSA. To the best of our knowledge, OPTWEB can be considered the first inter-FPGA network that exploits some of the essence of DSA.

C. OPTWEB

OPTWEB utilizes the single program multiple data model to perform parallel processing of large amounts of data. It efficiently supports unicast and six collective communications (barrier, scatter, gather, broadcast, all-to-all, and all-gather). It avoids a complicated packet-network structure by means of the dedicated links of a fully connected network topology. In this subsection, we focus on the key design features; more detail is provided in [5].

The basic block diagram of OPTWEB is shown in Fig. 2. For high-bandwidth communication among FPGAs, it is necessary to read and write from all memory channels simultaneously. There are two independent 256-bit-wide pseudo memory controller channels (M) in each memory channel ${\rm{M}}\# n$ ($n = 0,\; \ldots ,\;N - 1$) of HBM2, where $N$ is the number of FPGAs. These two pseudo-channels in each ${\rm{M}}\# n$ are connected to the data read section ${\rm{R}}\# n$ ($n = 0,\; \ldots ,\;N - 1$) and data write section ${\rm{W}}\# n$ ($n = 0,\; \ldots ,\;N - 1$) of each direct memory access controller (DMAC) through a $2 \times 1$ bidirectional memory-mapped switch. As many DMACs as there are FPGAs are prepared, and M_DMAC is the mechanism that controls them all at once.

Fig. 2. Block diagram of an FPGA on OPTWEB with three FPGAs.

The distributed switch is the mechanism that sets the route in the FPGA according to the type of communication. To support broadcast and all-gather communications, the data from ${\rm{R}}\# n$ is replicated for all FPGAs by the distributor and then selected by the following $N \times 1$ one-way switch. A unidirectional $N \times N$ switch is used to route the communication data before forwarding it to the M_DMAC. Synchronization intellectual properties (sync. IPs) that perform simple control between FPGAs are placed on this distributed switch. Each sync. IP handles a single clock signal as a control signal between FPGAs and is used for synchronization and prior confirmation in rendezvous communication. This simple control mechanism enables lightweight communication. The link IP handles communication on the link between FPGAs, e.g., Intel’s Seriallite3 (SL3). SL3 can pass stream bus data through the sync. IP to the other side as is.

The command sequencer manages all of these functions and executes the communication between FPGAs according to the flow in Fig. 2 of [5] to ensure reliable communication. The data processing core uses data from multiple memory channels to perform highly efficient data flow processing. The data processing core accesses M_DMAC via a changeover switch directly to reduce the amount of required ALMs. It is also connected to the command sequencer prepared by OPTWEB and handles communication and data processing sequentially.

We previously designed and evaluated OPTWEB using four custom Stratix10 MX2100 FPGAs with 45 Gbps/link. Conventional hardware circuit design tools (e.g., Intel’s Quartus Prime Pro) were used to investigate the behavior of OPTWEB in our implementation on the custom Stratix10 MX2100 FPGA. Each custom FPGA card had eight high-density optical transceivers ($4 \times 25\;{\rm{Gbps}}$ transmitting and $4 \times 25\;{\rm{Gbps}}$ receiving) called optical I/O cores [21] as pluggable embedded optical modules, which were mounted near the FPGA. The optical transceivers were connected to GXT transceiver channels on the FPGA. The GXT channels, in turn, are implemented on tiles integrated on the FPGA package, and they are connected to the FPGA fabric via Intel’s embedded multi-die interconnect bridge technology. In contrast, existing pluggable ports such as those for small form-factor pluggable modules are very large, and only a few ports can be mounted on an FPGA card. Since commercial optical fibers support long distances, e.g., 30 m, OPTWEB supports inter-rack scale communication. An execution time of less than 1 µs was obtained for all-to-all communication. Interestingly, this is equivalent to a unicast communication between two FPGAs.

From the viewpoint of DSA, OPTWEB exploits both dedicated memories to minimize the communication overhead and low communication resources by discarding the conventional packet network architecture. They respectively correspond to points (1) and (2) in the DSA guidelines discussed in Subsection 2.B. The application design guidelines are inspired by points (1), (3), and (4) (see Subsection 3.A). The domain-specific programming environment to port algorithms to FPGAs, which corresponds to point (5), will be our future work (see Section 5).

D. Network Topology of OPTWEB

We can extend OPTWEB to provide high scalability by supporting indirect paths. The two-hop indirect paths guarantee the reachability of a network in which the maximum shortest path length between two FPGAs is two. The diameter is the maximum shortest path length between FPGAs. A diameter-two network topology can be constructed by the GHC. The GHC is a $k$-ary $r$-cube network, which uses a complete connection in each dimension [22].

Note that a fully connected network topology can be considered as a GHC ($r = 1$), and a GHC ($r = 2$) can be regarded as a simple extension of the fully connected OPTWEB, as shown in our prior work [6].

In this study, we consider a fully connected network topology for high performance and a GHC ($r = 2$) for high scalability. The GHC ($r = 2$) network is equivalent to a diameter-two Dragonfly network [23]. Each FPGA in the GHC ($r = 2$) is labeled with two-dimensional coordinates ($x$, $y$). FPGAs labeled with the same symbol in the same dimension are fully connected to each other. For example, FPGAs (0, 0), (0, 1), and (0, 2) are fully connected and FPGAs (0, 0), (1, 0), and (2, 0) are also fully connected on a $3 \times 3$ GHC ($r = 2$), as shown in Fig. 3(a).
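The connectivity rule can be stated compactly in code. The following hedged Python sketch enumerates the direct neighbors of an FPGA on a GHC ($r = 2$); the grid size and node labels are chosen only for illustration.

```python
# Hedged sketch of GHC (r = 2) connectivity as described above: FPGAs sharing
# a coordinate in either dimension are fully connected.

def ghc2_neighbors(node, size_x, size_y):
    """Return the directly connected FPGAs of `node` = (x, y) on a
    size_x x size_y GHC (r = 2)."""
    x, y = node
    same_y = [(i, y) for i in range(size_x) if i != x]   # same y coordinate
    same_x = [(x, j) for j in range(size_y) if j != y]   # same x coordinate
    return same_y + same_x

if __name__ == "__main__":
    # 3 x 3 example from Fig. 3(a): the degree is (3-1) + (3-1) = 4 per FPGA
    print(ghc2_neighbors((0, 0), 3, 3))
    # -> [(1, 0), (2, 0), (0, 1), (0, 2)]
```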

Fig. 3. Examples of a two-dimensional GHC network: (a) $3 \times 3$ and (b) $4 \times 2$.

The network sizes constructed by the GHC ($r = 2$) can be larger than those of existing parallel FPGA clusters (e.g., 64 FPGAs on the Cygnus supercomputer [24]). We measured the start-up latency of direct and indirect communications of OPTWEB on the custom Stratix10 MX2100 FPGAs. A direct communication consumes 0.8 µs and 332.4 µs for 0.125 KiB and 1024 KiB data, respectively, while an indirect communication consumes 2.0 µs and 665.0 µs. We feel that these start-up latencies are sufficiently low when an indirect path is taken.

3. APPLICATION DESIGN

A. Guidelines for Parallel Data Processing

Parallel applications should exploit the following three primary techniques for performance and efficiency gains on OPTWEB.

Dedicated communication memory: Data that will be exchanged should be directly processed on the dedicated communication memory of each FPGA. Key data structures should be stored in the dedicated communication memory. Global access patterns using RDMA should be optimized to make the best use of the high memory bandwidth with the low latency overhead. Since the communication is end-to-end, traffic balancing within a memory channel and memory allocation should be considered.

Data specialization: Special operations directly support collective communications, including all-to-all communication. OPTWEB provides a unique feature in that the latency overhead of collective communications is the same as that of the unicast when the fully connected network topology is used. A specialized data structure to directly issue collective communications should be utilized.

Natural parallelism: The most accessible form of parallelism that matches the collective communications provided by OPTWEB should be implemented.

We designed two parallel applications on an FPGA cluster using OPTWEB in accordance with these guidelines.

B. Parallel Counting Sort

1. Assumption

First, we describe parallel counting sort. For ease of understanding, the number of FPGAs is set to three on a fully connected network topology. We consider two-column data consisting of a key and a value in the following explanation. Assume that the number of key types is less than or equal to the square of the number of nodes, i.e., the number of memory channels in the entire system. This fundamental constraint simplifies the counting sort because the data sorted to each memory channel does not need to be sorted again within the same memory channel. Removing this sorting step reduces both the circuit scale of the FPGA and the sorting time. In Fig. 4, there are nine types of keys (A to I). Both the keys and the values are placed in contiguous memory address spaces, and the order of the values corresponds to that of the keys. The values in Fig. 4 show an example of the initial data status, 0–11, together with their keys. Within an FPGA, the data is placed in multiple dedicated communication memories, as shown on the right side of Fig. 4. To identify the FPGA number of the initially placed data, the frame of the data initially stored in FPGA #0 is marked in red, FPGA #1 in black, and FPGA #2 in blue.

Fig. 4. Initial status of data placement in an FPGA.

2. Procedure

Parallel counting sort mainly consists of three steps, as shown in Fig. 5. Each FPGA has the same sorter. The sorter in each FPGA can continuously sort data read from the multiple memories simultaneously. Then the sorted data can also be written continuously to multiple memories in parallel. The sorter also has a function to count the frequency distribution of the key data.

Fig. 5. Behavior of parallel counting sort. (a), (c) Local sort. (b) All-to-all communication.

Step 1 [Fig. 5(a)]: In each FPGA, both key and value data are individually sorted by the sorter. In each local sort, a part of the key is first read out and temporarily recorded. The corresponding value data is read out and sorted along with the previously read key information. The key data are classified into $N$ groups, where $N$ is the number of FPGAs. For example, there are three groups in Fig. 5—A to C, D to F, and G to I—that are then put into M#0, M#1, and M#2, respectively. Similarly, the value data is sorted according to the type of key data. Eventually, each ${\rm{M}}\# n$ contains data from the same group of keys.

Step 2 [Fig. 5(b)]: We perform all-to-all communication to exchange the locally sorted key and value data in step 1. In the source FPGAs, the data have been stored in each memory that corresponds to their destination FPGA in the previous step. It is not necessary to switch routes between memories and network ports during the all-to-all operation. Each FPGA then collects the same type of data with the same key groups. In Fig. 5(b), FPGA #0 collects data with key types A to C, FPGA #1 collects data with key types D to F, and FPGA #2 collects data with key types G to I. In the destination FPGAs, the data are stored in a memory that corresponds to their source FPGA. In the same way, value data related to the same key is collected.

Step 3 [Fig. 5(c)]: Each FPGA sorts the value data again locally. The value data is sorted according to the type of key data exchanged in step 2. Then it uses the sorter to place the value data into a single ${\rm{M}}\# n$ for each key type. Finally, each ${\rm{M}}\# n$ obtains the value data for each sorted key.

We show how to extend the parallel counting sort on the GHC ($r = {{2}}$) network using the same application circuits in the FPGA as those on the fully connected network topology. A GHC ($r = {{2}}$) requires five steps for parallel counting sort. First, we perform steps 1 and 2 among FPGAs with the same $y$ coordinate. Next, steps 1 to 3 are executed among FPGAs with the same $x$ coordinate.
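As an illustration of the three steps on the fully connected topology, the hedged Python sketch below simulates them with plain lists standing in for FPGAs and memory channels; the FPGA count, the key alphabet, the example data set, and the final in-channel ordering are illustrative simplifications of the hardware flow, not the circuit itself.

```python
# Hedged simulation of the three fully connected steps. Key types <= N^2 and
# one key group per memory channel follow the text; sorting within a channel
# here is a software simplification.

N = 3  # number of FPGAs
KEY_TYPES = "ABCDEFGHI"  # 9 key types <= N^2

def key_group(key):
    return KEY_TYPES.index(key) // N  # group 0: A-C, 1: D-F, 2: G-I

def local_sort(pairs):
    """Step 1: classify (key, value) pairs into N groups, one per memory channel."""
    channels = [[] for _ in range(N)]
    for key, value in pairs:
        channels[key_group(key)].append((key, value))
    return channels

def all_to_all(per_fpga_channels):
    """Step 2: channel g of every source FPGA is delivered to FPGA g."""
    return [[per_fpga_channels[src][dst] for src in range(N)] for dst in range(N)]

def final_sort(received):
    """Step 3: within each FPGA, place values per key type (simplified here
    to a plain in-memory sort of the received pairs)."""
    return sorted(p for block in received for p in block)

if __name__ == "__main__":
    data = [[("A", 0), ("D", 1), ("G", 2), ("B", 3)],
            [("E", 4), ("H", 5), ("C", 6), ("F", 7)],
            [("I", 8), ("A", 9), ("E", 10), ("I", 11)]]
    step1 = [local_sort(d) for d in data]
    step2 = all_to_all(step1)
    print([final_sort(r) for r in step2])  # FPGA 0 holds A-C, FPGA 1 D-F, FPGA 2 G-I
```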

3. How Parallel Sort Follows the Guidelines

The respective data structures of the key and value are stored in the dedicated communication memory for efficient all-to-all communication with low communication overhead. Specialized key and value sizes are fit to the dedicated communication memory. The processing granularity of a single FPGA is fully exploited by combining local sort and all-to-all communication provided by OPTWEB. In this context, natural parallelism is exploited.

C. Sharing Averaged Parameters

1. Assumption

Sharing averaged parameters is a way to update parameters by averaging the parameter values of each FPGA and then sharing them across all FPGAs. It is frequently used during training in synchronized data-parallel deep learning applications to update the parameters that have been learned at each node. In our design, the parameters are placed in multiple memory channels, as shown on the left side of Fig. 6. The mapping of updated data to the memory channels is designed to efficiently issue the all-gather communication. The updated data are placed in the same memory channel and then shared by all FPGAs with no data conflicts in a specific memory channel. Figure 6 shows an example of the mapping. Data ${{\textbf{a}}_{{\rm{update}}}}$ is placed in M#0, data ${{\textbf{b}}_{{\rm{update}}}}$ in M#1, and data ${{\textbf{c}}_{{\rm{update}}}}$ in M#2. An averaging data circuit is provided to obtain the average value of the parameters of each FPGA for the update.

Fig. 6. Behavior of sharing averaged parameters. (a) Naïve approach, (b) aggressive approach.

2. Procedure

The simplest implementation of sharing averaged parameters is to gather the parameter data of each FPGA and share it among all FPGAs by means of all-gather communication. It then computes the average parameters. However, the total communication data is too large to straightforwardly perform the all-gather communication.

Therefore, in our design, we first perform all-to-all communication. This reduces the amount of communication data to $1/N$, where $N$ is the number of FPGAs. The parameters are then averaged by the averaging data circuit. Second, all-gather communication is issued to share the parameters among all FPGAs. The data amounts of both the all-to-all and all-gather communications thus decrease to $1/N$ in total.

There are two ways to implement sharing averaged parameters in OPTWEB. A naïve approach is to perform data processing and inter-FPGA communication independently, as shown in Fig. 6(a). This is similar to the method used for the parallel counting sort. Initially, all-to-all communication is performed on the data distributed among all FPGAs. The all-to-all communication aggregates the relevant parameter data for each FPGA by $t = 1$. At $t = 1$, the incoming data are input to the averaging data computation circuit from each ${\rm{M}}\# n$ and the average value is obtained. At $t = 2$, the average data is stored in the memory area managed by a single ${\rm{M}}\# n$ in an FPGA. Finally, all-gather communication is performed to copy this average data and distribute it by $t = 3$. In each step, data is input to and output from memory simultaneously. An M_DMAC thus issues three operations on the data in memory (all-to-all, data averaging, and all-gather).

Another approach is to aggressively perform computation at the same time as the communication process, as shown in Fig. 6(b). In OPTWEB, when the data size is large, rendezvous communication is used. The receiver sends a control signal for one clock cycle in advance to notify the sender that it is ready. In all-to-all communication, this notification is generated in all FPGAs. This makes the time difference between FPGAs as small as 0.4 µs in the custom Stratix10 MX2100 FPGA with OPTWEB [5]. This time difference can be compensated by the receive buffer of OPTWEB, and no modification to the internal circuitry (e.g., the data processing core) is required. As shown in Fig. 6(b), the first step is to send out the data along with the all-to-all communication. When each FPGA receives the data, the distributed switch puts the data directly into the averaging data circuit. The results of the averaging data computation are then put into memory by $t = 1$. The data is shared by the all-gather communication by $t = 2$. Since the number of data inputs and outputs is reduced to just two, averaged parameters can be shared faster in the aggressive approach.
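As a concrete illustration of this flow, the hedged Python sketch below simulates the all-to-all, averaging, and all-gather phases on plain lists; the parameter shapes, the function name, and the single-process simulation are assumptions for illustration, and the overlap of computation and communication in the aggressive approach is not modeled.

```python
# Hedged sketch of the parameter-averaging flow: all-to-all to collect each
# FPGA's 1/N slice from everyone, local averaging, then all-gather.

def share_averaged_parameters(params):
    """params[i] is FPGA i's full parameter vector (length divisible by N).
    Returns the globally averaged vector that every FPGA ends up holding."""
    n = len(params)
    chunk = len(params[0]) // n

    # All-to-all: FPGA j receives slice j of every FPGA's parameters.
    slices = [[p[j * chunk:(j + 1) * chunk] for p in params] for j in range(n)]

    # Local averaging on each FPGA (in the aggressive approach this happens
    # while the all-to-all data is still streaming in).
    averaged = [[sum(col) / n for col in zip(*recv)] for recv in slices]

    # All-gather: concatenate the averaged slices; every FPGA gets the result.
    return [v for part in averaged for v in part]

if __name__ == "__main__":
    params = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]  # N = 2 FPGAs
    print(share_averaged_parameters(params))  # [2.0, 3.0, 4.0, 5.0]
```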

As with the parallel counting sort, the sharing averaged parameters can also be extended to the GHC ($r = {{2}}$) network topologies. First, each flow shown in Fig. 6 is executed among FPGAs with the same $y$ coordinate and then re-executed among FPGAs with the same $x$ coordinate. The GHC ($r = {{2}}$) doubles the required number of steps compared to a fully connected network topology with the same number of FPGAs.

3. How Sharing Averaged Parameters Follows the Guidelines

The target data format in the averaging data computation matches the dedicated communication memory. This follows the use of dedicated communication memory in the guidelines in Subsection 3.A. The aggressive approach performs the averaging data computation when performing all-to-all communication. This corresponds to the data specialization in the guidelines. The averaging data operation exploits all-to-all communication. Thus, the natural parallelism of the computation is fully utilized by the collective communication provided by OPTWEB.

4. EVALUATION

A. Conditions

We fabricated custom FPGA cards that can be inserted into a PCIe slot, as shown on the upper-right side of Fig. 7. Each card has a single Stratix10 MX2100 FPGA with two HBM2s and eight optical I/O cores [25], providing multiple optical I/Os in one FPGA card, e.g., 32 or 64 (${{16}} \times {{50}}\;{\rm{Gbps/link}}$ or ${{32}} \times {{25}}\;{\rm{Gbps/link}}$, physical, bidirectional) [5]. We prepared a fiber sheet that allows a fully connected network of up to 16 FPGAs at physically 50 Gbps/link (${{2}} \times {{25}}\;{\rm{Gbps/link}}$) optical interconnections. This sheet can connect to the FPGA card via an 8-core multi-fiber push on (MPO) connector and provide an eight-FPGA fully connected network at 100 Gbps/link (${{4}} \times {{25}}\;{\rm{Gbps/link}}$) interconnections between FPGAs, as shown in Fig. 7.

Fig. 7. Experimental setup of OPTWEB using eight FPGAs.

The internal data lane width was set to 256 bits to match the HBM2 channel. With the internal operating frequency set to 200 MHz, it can read and write at 51.2 Gbps per ${\rm{M}}\# n$. Intel’s Seriallite3 was used as the link IP for each link. The effective communication efficiency of this IP is as high as 90%, so the data transfer rate between memory channels becomes 45 Gbps/link when the physical bandwidth is 50 Gbps/link. The parallel counting sort therefore ran on eight FPGAs interconnected at 45 Gbps/link. In contrast, the amount of hardware for the application circuit was tiny for sharing averaged parameters (see Subsection 4.C), so we used as many ALMs as possible for OPTWEB. Our OPTWEB implementation can connect up to 16 FPGAs, so we doubled the bandwidth (90 Gbps between two FPGAs) for sharing averaged parameters on eight FPGAs. These two application designs indicate that it is possible to reduce ALM usage by introducing a conversion circuit in ${\rm{M}}\# n$ that reduces the number of internal data lanes at the cost of the total network bandwidth of an FPGA.

We used Intel Quartus Prime Pro version 19.4 to synthesize our designs developed using the Verilog hardware description language. As a reference, the same number of CPU (Xeon Gold 5215, 2.5 GHz) servers with eight 187.2 Gbps bandwidth DDR4 memories were also prepared and connected to a single IB switch (SB7700) at 100 Gbps/link via an IB adapter (CX555A-ECAT). The number of CPU cores was varied from one to eight. CentOS Linux v7.8.2003 was used as the operating system for the CPU servers. OpenMPI v4.1 was used as the interface for collective communications in the CPU servers.

Table 1 summarizes the application operating frequency and the total communication bandwidth per FPGA for memory and network. It includes the loopback within the FPGA. Parallel counting sort requires large-scale circuits, which causes hardware resource constraints; as such, the operating frequency and data transfer bandwidth were lower. Note that we previously evaluated the performance of OPTWEB for unicasts, collective communications, and barrier on a four-FPGA cluster with 45 Gbps/link bandwidth in our prior work [5]. In this study, we extended OPTWEB for a larger FPGA cluster (eight).

Table 1. Operating Frequency, Memory, and Data Transfer Bandwidth of OPTWEB

B. Preliminary

We evaluated the performance of all-to-all communication on an eight-FPGA cluster with up to 90 Gbps/link bandwidth between two FPGAs. The all-to-all communication is frequently performed in parallel counting sort and sharing averaged parameters.

1. Fully Connected Network Topology

Figure 8 shows the execution time required for all-to-all communication on a fully connected network topology. It was measured four times and the worst value was taken. This result is consistent with the start-up latency in [5]. In the case of CPU(1 core) $+$ IB, at least several tens of milliseconds were always required. The case of CPU(8 cores) $+$ IB resulted in a lower latency, but still took several hundred microseconds.

Fig. 8. Execution time of all-to-all communication in both OPTWEB and CPU clusters.

When the data size increased to 2 GiB/node, which was the maximum size using a single HBM2, the CPU(1 core) $+$ IB was faster, at 585 ms. When the data size increased to 4 GiB/node, which was the maximum size using two HBM2s, the CPU(8 cores) $+$ IB was faster, at 1094 ms. On the other hand, FPGA $+$ OPTWEB significantly reduced the execution time due to the high bandwidth of the FPGA I/O and the low start-up latency of OPTWEB. We also measured the start-up latency for unicast and found that it was 0.8 µs, which is similar to that of all-to-all communication. For the exchange of the block control signal, a round-trip time of less than 1 µs was obtained.

Figure 9 shows our analysis of the aggregate throughput of all-to-all communication corresponding to the plots in Fig. 8. Although the primary concern of application users is the execution time, we show the aggregate throughput so as to clarify the contribution of the network part. The aggregate throughput is defined as the communication data amount from each node divided by the worst value of the start-up latency of all nodes. Interestingly, we can see in Fig. 9 that OPTWEB fully uses all links during the all-to-all execution. OPTWEB provides a 720 Gbps network bandwidth on the FPGA with 16 memory controller channels, and the all-to-all throughput gets close to the network bandwidth in 512 KiB or larger data sizes. When using eight memory controller channels, a 360 Gbps network bandwidth is the upper bound. The all-to-all throughput also gets close to 360 Gbps.

Fig. 9. All-to-all aggregate throughput in both OPTWEB and CPU clusters.

Our current implementation assumes that the communication data does not exceed the memory capacity. In this case, we expect the FPGA cluster with OPTWEB to always outperform the CPU cluster. In contrast, existing CPU servers can use larger amounts of memory with fewer memory controllers. When the data size exceeds the memory capacity of OPTWEB, existing CPU servers may outperform the FPGAs with OPTWEB.

We next discuss the results with small data size. We also evaluated the time of all-gather communication to clarify the start-up latency in the all-to-all communication of CPU(1 core) $+$ IB, as shown in Fig. 10. Unlike all-to-all communication, all-gather communication includes no data relocation. The execution time of the all-gather communication for 4 KiB was several tens of microseconds. By comparing the all-to-all and all-gather results, we can see that reordering the data accounts for a significant part of the start-up latency in all-to-all communication. For CPU(8 cores) $+$ IB, the start-up latency decreased when the data size was small. This is because each core manages one link and has sufficient buffers to temporarily store communication data.

Fig. 10. Execution time of all-gather communication in both OPTWEB and CPU clusters.

Next, we discuss the results for large data sizes. The CPU(8 cores) $+$ IB was faster with the data size of 4 GiB/node because each CPU core manages one link (eight cores to eight links), resulting in higher communication efficiency. Here, note that the all-to-all throughput is the data size per node divided by the execution time. When the data size was 2 GiB/node, the all-to-all throughput was only 29.4 Gbps/node ($= {{2}}\;{\rm{GiB/585}}\;{\rm{ms}}$). When the data size was 4 GiB, the all-to-all throughput was up to 31.4 Gbps/node (4 GiB/1094 ms). Both results are about one-third of IB’s 100 Gbps/node network bandwidth. This is due to the multiple communication steps required for collective communication. In MPI, a collective communication is performed in $\log_2 N$ unicast-communication steps, where $N$ is the number of nodes.
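As an illustration of this log-step behavior, the hedged Python sketch below simulates a recursive-doubling all-gather, one of the standard MPI collective algorithms; the algorithm actually chosen by OpenMPI depends on message size and node count and may differ.

```python
# Hedged sketch of the log2(N)-step pattern mentioned above, using recursive
# doubling for all-gather. N must be a power of two here.

def allgather_recursive_doubling(blocks):
    """blocks[i] is node i's own block; returns what every node holds at the end."""
    n = len(blocks)
    held = [{i: blocks[i]} for i in range(n)]      # node i starts with its own block
    step = 1
    while step < n:                                # log2(N) exchange rounds
        new = [dict(h) for h in held]
        for i in range(n):
            partner = i ^ step                     # exchange with node i XOR 2^k
            new[i].update(held[partner])
        held = new
        step *= 2
    return held

if __name__ == "__main__":
    print(allgather_recursive_doubling(["a", "b", "c", "d"])[0])
    # node 0 ends with {0: 'a', 1: 'b', 2: 'c', 3: 'd'} after log2(4) = 2 steps
```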

On the other hand, in the all-to-all communication of OPTWEB with a 45 Gbps/link (360 Gbps/FPGA), each DMAC writes data from another FPGA into memory and completes data allocation independently. This design resulted in the low execution time of 0.72 µs, which is similar to the 0.7 µs start-up latency at four FPGAs reported in [5]. Due to this low start-up latency, the all-to-all throughput was high for a wide range of data sizes. Even with 64 KiB/node of data, we achieved 231 Gbps/FPGA ($= {{64}}\;{\rm{KiB/2}.{27}}\;{\rm{\unicode{x00B5} s}}$). At 2 GiB, the network throughput reached 358 Gbps/FPGA ($= {{2}}\;{\rm{GiB/48}}\;{\rm{ms}}$).

The execution time in OPTWEB with a 90 Gbps/link (720 Gbps/FPGA) with 1 KiB data was 0.89 µs. This result indicates that doubling the link bandwidth does not cause excessive start-up latency. Furthermore, with 4 GiB data, the communication fully utilized the bandwidth of all optical links, resulting in 48 ms. The all-to-all throughput was 716 Gbps/FPGA ($= {{4}}\;{\rm{GiB/48}}\;{\rm{ms}}$). This was 22.8 times ($= {{716}}\;{\rm{Gbps/31}.{4}}\;{\rm{Gbps}}$) faster than the IB of 100 Gbps/FPGA. The result means that 99% of the total network bandwidth of OPTWEB was utilized ($= {{716}}\;{\rm{Gbps/720}}\;{\rm{Gbps}}$).

Through this evaluation, we confirmed that OPTWEB can dramatically reduce the execution time of all-to-all communication for any data size.

2. GHC Network Topology

All-to-all communication is achieved by executing multiple collective communications on a GHC ($r = {{2}}$). We implemented the ${{4}} \times {{2}}$ GHC ($r = {{2}}$) network shown in Fig. 3(b). In this case, the degree in the $x$ coordinate ${d_x}$ is three, that in the $y$ coordinate ${d_y}$ is one, and the total degree $d$ is four. On the experimental system in Fig. 7, we formed the ${{4}} \times {{2}}$ GHC ($r = {{2}}$) by disabling some of the links in the fully connected network topology. If the disabled links were reused as extra links in the $y$-coordinate connections, the $y$-coordinate bandwidth could be doubled. In the current setup, however, it was difficult to change the wiring among FPGAs due to the fixed configuration of the fiber sheet and the bundled MPO connectors shown in Fig. 7.

The operation performs all-to-all communications within the same $y$-coordinate FPGAs followed by sendrecv communications within the same $x$-coordinate FPGAs. Since two FPGAs have the same $x$-coordinate on a ${{4}} \times {{2}}$ GHC ($r = {{2}}$), sendrecv operation is performed instead of all-to-all operation.
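The two-phase routing can be sketched as follows. The hedged Python code below only models which FPGA forwards which block, first among FPGAs sharing the same $y$ coordinate and then among FPGAs sharing the same $x$ coordinate; timing, link bandwidth, and the reduction of the second phase to a sendrecv between two FPGAs are not modeled, and the node labels and payloads are illustrative.

```python
# Hedged sketch of the two-phase all-to-all on a GHC (r = 2): routing only.

def ghc2_all_to_all(send, size_x, size_y):
    """send[(sx, sy)][(dx, dy)] is the block addressed from source to destination.
    Returns recv[(dx, dy)][(sx, sy)] after the two phases."""
    nodes = [(x, y) for x in range(size_x) for y in range(size_y)]
    # Phase 1: forward each block along the x dimension (same y coordinate)
    # to the intermediate FPGA (dx, sy).
    staged = {n: {} for n in nodes}
    for (sx, sy), blocks in send.items():
        for (dx, dy), payload in blocks.items():
            staged[(dx, sy)][((sx, sy), (dx, dy))] = payload
    # Phase 2: forward along the y dimension (same x coordinate) to (dx, dy).
    recv = {n: {} for n in nodes}
    for node, blocks in staged.items():
        for (src, dst), payload in blocks.items():
            recv[dst][src] = payload
    return recv

if __name__ == "__main__":
    nodes = [(x, y) for x in range(4) for y in range(2)]      # 4 x 2 GHC (r = 2)
    send = {s: {d: f"{s}->{d}" for d in nodes} for s in nodes}
    recv = ghc2_all_to_all(send, 4, 2)
    print(recv[(3, 1)][(0, 0)])  # block from (0, 0) to (3, 1), staged at (3, 0)
```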

The measured execution time of all-to-all communication is plotted in Fig. 11. The throughput in the fully connected network became 44 Gbps per link, which indicates full utilization of the bandwidth of the link. On the other hand, in the case of the ${{4}} \times {{2}}$ GHC ($r = {{2}}$) network, the throughput was just 7.3 Gbps. This is 6 times smaller than the fully connected network topology.

Fig. 11. All-to-all execution time of 8-FPGA network topologies.

Note that OPTWEB controls the arrival timing of data for the next communication. If the data size is small, it is temporarily stored in a buffer in the FPGA. If the data size is large, it is executed as a rendezvous type that starts communication when it is ready to receive, so there is no data loss.

Using the experimental results as a basis, we constructed an analytical performance model of all-to-all for generalization.

The analytical time and throughput for all-to-all communication on the fully connected network topology (FC) and the GHC ($r = {{2}}$) are summarized in Table 2, where $d$ is the network degree. The degree is the maximum number of neighboring FPGAs of an FPGA. For a GHC ($r = {{2}}$), the degree is ${d_x} + {d_y}$, where the degrees of the $x$ and $y$ dimensions are ${d_x}$ and ${d_y}$, respectively. $N$ is the number of FPGAs on FC and GHC ($r = {{2}}$), $\alpha$ is the overhead time of a 1-hop direct communication, $\beta$ is the inverse of the link bandwidth, and $L$ is the message size per FPGA.

Table 2. Analytical Model of All-to-All Communications on FC and GHC ($r = {{2}}$)

The FPGA throughput, normalized FPGA throughput, and normalized system throughput are defined as the message size divided by the time for all-to-all when the message size is large enough, the FPGA throughput divided by the FPGA bandwidth, and the product of the normalized FPGA throughput and the number of FPGAs, respectively. Therefore, the normalized FPGA throughput and the normalized system throughput are normalized by the FPGA bandwidth, or the bandwidth per FPGA. In this analytical model, all-to-all is assumed to be performed within each dimension sequentially.

In the all-to-all experiment, the degree $d$, the number of FPGAs $N$, and the link bandwidth $1/\beta$ are the same on both the FC and GHC ($r = {{2}}$) networks, with $d = 4$. The FPGA throughput on GHC ($r = {{2}}$) is expected to be $1/(d + 2) = 1/6$ of that on FC, as shown in Table 2. The notation used to develop the model is provided in Appendix A. The experimental results shown in Fig. 11 were in exact agreement with the analytical expectation. If the disabled links are used as extra links in the $y$ coordinate, the FPGA throughput on GHC ($r = {{2}}$) is expected to be $2/(d + {d_y} + 3) = 1/4$ of that on FC.

Figure 12 shows the normalized system throughput for all-to-all on FC and GHC ($r = {{2}}$) as a function of degree. The $x$ axis represents the number of transceivers per FPGA, i.e., the degree. The $y$ axis, i.e., the normalized system throughput, indicates how many times larger the system throughput is than the bandwidth of a single FPGA. The normalized system throughput on GHC ($r = {{2}}$) was calculated with ${d_x} = {d_y} = d/2$ when $d$ is even and $|{d_x} - {d_y}| = 1$ otherwise. While the system throughput on FC grows linearly, that on GHC ($r = {{2}}$) grows quadratically. Our analysis showed that the GHC decreased the sorting throughput by 35% compared to an ideal fully connected network topology with the same number of FPGAs (see Subsection 4.C). These results demonstrate high scalability for all-to-all on GHC ($r = {{2}}$).

Fig. 12. Normalized system throughput for all-to-all on fully connected and GHC ($r = {{2}}$) networks.

When the degree per FPGA is the same, GHC has better performance. For example, when ${d_x} = {d_y} = d/{{2}}$ and $d$ is large, the normalized throughput can be approximated as ${d^2}/{{16}}$, which is 16 times faster when $d = {{16}}$.

C. Parallel Counting Sort

1. Fully Connected Network Topology

We implemented a parallel counting sort that consists of local sorting operations at each FPGA. We prepared 128M 32-bit integer data for both key and value for the parallel counting sorting. Since OPTWEB took a 256-bit width of the data lane of a single pseudo channel in HBM2, 64 keys or 64 values were input to the sorter at a time. The sorter checked the three bits of each input key for identifying the destination nodes. It then classified the keys/values (32 bits, one word) into eight types according to the number of nodes and stored them in a buffer. The data stored in each buffer was written to the memory in eight keys/values per clock.
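The classification step can be sketched as follows. In this hedged Python sketch, which three key bits the hardware examines is not stated above, so the most-significant three bits are assumed purely for illustration, and the per-clock buffer handling is simplified away.

```python
# Hedged sketch of the classification step: three bits of each 32-bit key
# select one of eight destination FPGAs / buffers.

NUM_NODES = 8

def destination(key32):
    """Map a 32-bit key to one of eight destination buffers via three key bits."""
    return (key32 >> 29) & 0b111  # assumed: most-significant three bits

def classify(keys, values):
    buffers = [[] for _ in range(NUM_NODES)]
    for k, v in zip(keys, values):
        buffers[destination(k)].append((k, v))
    return buffers

if __name__ == "__main__":
    keys = [0x00000001, 0x20000002, 0xE0000003]   # land in buffers 0, 1, 7
    values = [10, 20, 30]
    print([len(b) for b in classify(keys, values)])  # [1, 1, 0, 0, 0, 0, 0, 1]
```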

As shown in Table 1, the operating frequency decreased to 125 MHz due to the complex application logic. The data transfer bandwidth per port in a distributed switch is 32 Gbps ($= 51.2\;{\rm{Gbps}} \times 125/200$), and the effective link bandwidth in OPTWEB was 32 Gbps/link (256 Gbps/FPGA). In this case, the bisection network bandwidth became 512 Gbps. This was roughly equivalent to the 400 Gbps bisection bandwidth of a comparable CPU cluster with IB at 100 Gbps/node. The data for sorting was placed on HBM2 for the FPGAs and on DDR4 memories for the CPUs. These results confirm that the sort completed successfully.

Figure 13 shows the execution time of the parallel sorting on eight nodes. In the case of the CPU(1 core) server, the execution time was 0.51 s. In the CPU(8 cores) server, the execution time decreased to 0.165 s. Despite an eightfold increase in the number of processes, the speedup was limited to approximately three times. The performance does not scale because of the high all-to-all communication load on the CPU cluster. In this evaluation, 64 MiB/node of data was placed for key and value, respectively. As shown in Fig. 8, the execution time was about 49 ms, in which start-up latency was dominant. The all-to-all throughput was as small as 10.9 Gbps/FPGA and does not take advantage of the IB’s bandwidth (100 Gbps/node).

Fig. 13. Sort execution time in OPTWEB and CPU clusters.

On the other hand, with FPGA $+$ OPTWEB, the execution time was 0.03 s. FPGA $+$ OPTWEB achieved a 17 times speedup over CPU(1 core) and a 5.3 times speedup over CPU(8 cores). The speedup over CPU(1 core) $+$ IB was more than 8 times, which suggests that the FPGA alone may perform the sorting faster than the CPU(8 cores). We conclude that OPTWEB provided the scalability of eight-FPGA computing. We analyzed the sorting throughput, defined as the total amount of key and value data divided by the processing time, and confirmed that OPTWEB provided a sorting throughput of 277 Gbps. This is because using OPTWEB eliminates the all-to-all communication bottleneck seen in CPU clusters.

Figure 14 shows the amount of hardware resources used in this parallel sort. In the sorter, eight keys/values were input from each ${\rm{M}}\# n$ of HBM2 with a 256-bit-wide data lane, so 64 keys/values were needed for parallel processing on eight FPGAs. Due to the high number of data lanes, the sorter was both complex and large, consuming 38.7% of the ALMs. However, OPTWEB is lightweight compared to a conventional packet network. We successfully integrated the sorter and OPTWEB into the FPGA, resulting in a total ALM utilization of 81.0%. The M20K, a high-speed memory used for buffers, accounted for only 18.1% of the total.

Fig. 14. Hardware utilization of parallel counting sort.

Note that the ALM utilization of OPTWEB was higher than that reported in [5]. This is due to the functionality newly introduced into the DMAC for the sorter: an additional buffer was added to efficiently retrieve data even if data fluctuations broke the data continuity.

The distributed switches and application logic were implemented according to the number of data lanes from ${\rm{M}}\# n$. To implement various applications, it is important to reduce the amount of ALMs used. As shown in Fig. 14, it is possible to reduce the usage of ALMs by introducing a conversion circuit in ${\rm{M}}\# n$ that reduces the number of internal data lanes in OPTWEB.

2. GHC Network Topology

Table 3 summarizes the sorting throughput for each network topology. In OPTWEB, the sorting performance bottleneck is caused by the number of memory I/O operations. In the case of parallel counting sort for both the key data and the value data on a fully connected network, a total of 11 consecutive memory I/O operations are performed for a specific data area: once for the preliminary key data information extraction, four times for the local sort shown in Fig. 5(a), twice for the all-to-all communication of the key/value data shown in Fig. 5(b), and finally four times for the local sort shown in Fig. 5(c). Each local sort is performed on the two-column data, since the key data must be input even when sorting on one column. On the other hand, GHC ($r = {{2}}$) requires the steps shown in Figs. 5(a) and 5(b) for the $x$ and $y$ axes, respectively, so a total of 17 memory I/O operations occur. All the all-to-all communications of the sort are performed within one dimension, which leads to a one-step operation on both FC and GHC ($r = {{2}}$). Therefore, we can roughly estimate the sorting throughput from the number of memory I/O operations. A GHC ($r = {{2}}$) has 65% ($= 11/17$) of the FPGA throughput of an ideal FC with the same number of FPGAs. Since the system throughput is the product of the FPGA throughput and the number of FPGAs, the system throughput on GHC ($r = {{2}}$) relative to that on FC is $11({d_x} + 1)({d_y} + 1)/[17(d + 1)]$, which is larger than one when ${d_x},{d_y} \ge 2$. We conclude that we can obtain a scalable sorting throughput with GHC ($r = {{2}}$).

Table 3. Sorting Throughput for Each Network Topology

D. Sharing Averaged Parameters

1. Fully Connected Network Topology

We implemented the sharing averaged parameters as follows. First, we designed a circuit to perform the averaging data computation on incoming data from eight memory channels of HBM2 corresponding to the eight FPGAs. Next, we integrated OPTWEB and the averaging data circuit into the FPGA via a changeover switch. We designed two circuits for the averaging data computation, one for the naïve approach and one for the aggressive approach, as shown in Fig. 6. The operating frequency of the internal bus was 200 MHz, and OPTWEB provided 90 Gbps/link (720 Gbps/FPGA) using two HBM2s, as shown in Table 1.

The computation of the sharing averaged parameters handled 64 pieces of 32-bit input data simultaneously. Eight of them, the $x$th inputs ($x = 1, 2, \ldots, 8$) from each ${\rm{M}}\# n$, were used as the input unit of one averaging data circuit, and eight such averaging data circuits were used. Since each averaging data circuit computed the average of eight data items, seven digital signal processors (DSPs) were used so as to maintain the same throughput as the input. In contrast, for the CPU $+$ IB, we used the sum computation of the all-reduce operation of OpenMPI v4.1. We successfully verified the results of the sharing averaged parameters.
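A minimal sketch of one such averaging unit is shown below, assuming a balanced tree of seven additions followed by a divide-by-eight; the fixed-point formats and the exact DSP mapping of the actual circuit are not modeled.

```python
# Hedged sketch of one averaging unit: eight inputs are summed with a tree of
# seven additions (mirroring the seven DSPs mentioned above) and divided by 8.

def average8(x):
    assert len(x) == 8
    # Stage 1: four additions
    s1 = [x[0] + x[1], x[2] + x[3], x[4] + x[5], x[6] + x[7]]
    # Stage 2: two additions
    s2 = [s1[0] + s1[1], s1[2] + s1[3]]
    # Stage 3: one addition, then a divide-by-8 (a shift in hardware)
    return (s2[0] + s2[1]) / 8

if __name__ == "__main__":
    print(average8([1, 2, 3, 4, 5, 6, 7, 8]))  # 4.5
```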

Figure 15 shows the execution time for each data size. With the conventional CPU $+$ IB, the time for the all-reduce operation was shorter than that for all-to-all communication. However, even when the data size was as small as 1 KiB/node, it still required about 100 µs. Here, the application throughput is the data size per node divided by the execution time. The 100 µs latency prevented a high throughput for this application. When the data size exceeded 32 MiB/node, the throughput tended to saturate for both CPU(1 core) and CPU(8 cores), although the execution time was shorter for CPU(8 cores). As a result, the time required to process a data size of 2 GiB/node was 787 ms, and the application throughput was 21.8 Gbps/node ($= {{2}}\;{\rm{GiB/787}}\;{\rm{ms}}$) with CPU(8 cores).

Fig. 15. Execution time of parallel averaged parameters with eight nodes.

On the other hand, a significant reduction in the execution time was achieved with both the naïve and aggressive approaches on OPTWEB. With the aggressive approach, the processing time was reduced to about 2/3 of the naïve approach for all data sizes. The execution time was 2 µs for 1 KiB/FPGA, 3 µs for 32 KiB/FPGA, and 48 ms for 2 GiB/node.

With CPU $+$ IB, the execution time was shorter than that of the all-to-all communication shown in Fig. 8 because this parallel data processing uses all-gather communication instead of all-to-all. The start-up latency of all-gather communication with 1 KiB was only 51 µs. The speedup with CPU(8 cores) $+$ IB is due to the efficient reduction operation on the CPU. However, even with CPU(8 cores) $+$ IB, the data processing efficiency, which is the application throughput divided by the network bandwidth per node, was only 21.8% (21.8 Gbps/100 Gbps).

The aggressive approach in OPTWEB consumed 2 µs for 1 KiB/FPGA, which is nearly the sum of the start-up latencies of all-to-all and all-gather communication. This indicates that the switching time between communication and data processing can be ignored when using OPTWEB. This low start-up latency allowed efficient communication even at 32 KiB/node, resulting in an execution time of 3 µs. The link throughput during this application was as high as 89 Gbps/FPGA ($= {{32}}\;{\rm{KiB/3}}\;{\unicode{x00B5}\text{s}}$). For the largest data size, 2 GiB/node, the aggregate throughput reached 358 Gbps/FPGA, which corresponds to 49.7% of the network utilization. Since two collective communications are performed sequentially, we can conclude that parallel processing that fully utilizes the network bandwidth has been achieved.

Figure 16 shows the speedup factors of the aggressive and naïve approaches relative to CPU $+$ IB. A speedup was obtained over a wide range of data sizes thanks to the highly efficient collective communication of OPTWEB. At 64 KiB/node, the maximum speedup factor of the aggressive approach reached 250, and even at the maximum data size of 2 GiB/node, a speedup factor of 16.4 was obtained. Another finding is that the naïve approach was always inferior to the aggressive approach. Since the naïve approach has three steps, its operational overhead is larger than that of the two-step aggressive approach.

Fig. 16. Speedup factor of OPTWEB.

Figure 17 shows the throughput measured while executing the parallel averaged parameters. These measurements correspond to the results in Fig. 15. We estimated the throughput by dividing the amount of data placed in each FPGA by the execution time. It is obvious that the parallel averaged parameters used all of the links of OPTWEB, i.e., close to 800 Gbps on the FPGA, when the data size was larger than 512 KiB/FPGA. We conclude that sharing averaged parameters has a severe performance bottleneck in terms of network bandwidth, and that the data processing time is trivial.

Fig. 17. Throughput of parallel averaged parameters with eight nodes.

Figure 18 shows the hardware resources used for sharing averaged parameters. The ALM utilization of the averaging data circuit was smaller than that of the sorter because it consists of simple wiring and DSPs. This allowed OPTWEB to be configured with twice the bandwidth used in the parallel sort. Even in the 720 Gbps/FPGA configuration, the total ALM utilization was only 60.6%. M20K usage also increased with the bandwidth but was still low, at 30.5%. The averaging data process is a simple calculation, so DSP usage was negligible, at only 2.8%. We thus conclude that our implementation fully exploited the performance of OPTWEB with moderate hardware resources.

Fig. 18. Hardware utilization of sharing averaged parameters.

2. GHC Network Topology

A comparison of the application performance of FC and GHC for the aggressive approach in Subsection 3.C is provided in Table 4. After dividing the data into two parts, each FPGA communicates the first half of the data among FPGAs with the same $y$ coordinate. Each FPGA communicates the second half of the data by all-to-all among FPGAs with the same $x$ coordinate. At the same time, the results are exchanged. The results can also be used for data processing through all-gather communication in each communication group.

Table 4. Analytical Model of Sharing Averaged Parameters on FC and GHC ($r = {{2}}$)

The analytical model of sharing averaged parameters on FC and GHC ($r = {{2}}$) is summarized in Table 4. Compared to the throughput in Table 2, we can see that the throughput of sharing averaged parameters is half that of all-to-all for both FC and GHC ($r = {{2}}$). Therefore, the process of sharing averaged parameters is also scalable using the GHC ($r = {{2}}$), similar to all-to-all, as shown in Fig. 19. We found that the GHC leads to an improvement of up to 4.2 times for sharing averaged parameters over an ideal fully connected network topology for a node degree of 64.

Fig. 19. Normalized system throughput for sharing parameter averaging on FC and GHC ($r = 2$).

5. CONCLUSION

In this work, we demonstrated the speedup of parallel data processing using eight FPGAs interconnected by OPTWEB. OPTWEB is a lightweight, fully connected inter-FPGA network for efficient collective communications. It discards the traditional packet communication mechanism and introduces dedicated memory processing to minimize communication overhead.

We provided design guidelines for applications using OPTWEB based in part on the concept of DSA. The key factors are (1) using dedicated communication memory, (2) implementing data specialization, and (3) exploiting natural parallelism. More precisely, the data to be exchanged should be processed directly in the dedicated communication memory on each FPGA, and the natural parallelism of the application should be matched to the collective communications provided by OPTWEB.

We illustrated two case studies, one on parallel counting sort and one on sharing averaged parameters, on eight custom Stratix10 MX2100 FPGAs. Each custom FPGA card had eight high-density optical transceivers (${4} \times {25}\;{\rm Gbps}$ transmitting and ${4} \times {25}\;{\rm Gbps}$ receiving) and an optical I/O core developed by PETRA. In parallel counting sort, OPTWEB achieved a speedup of 5.3 times over the conventional CPU $+$ IB on a fully connected network topology. We also considered supporting a GHC network topology to obtain high scalability. Our analysis showed that GHC ($r = 2$) decreases the sorting throughput by 35% compared to an ideal fully connected network topology for any network size. For sharing averaged parameters, we demonstrated a maximum speedup of 250 times over the CPU $+$ IB and a speedup of 16.4 times at the maximum data size. Our analysis showed that the GHC improves sharing averaged parameters by up to 4.2 times over a fully connected network topology with the same node degree.

Our future work is to provide a productive software environment for porting algorithms to multiple FPGAs with OPTWEB. Regarding the FPGA development environment, applications can be created with high-level synthesis tools such as OpenCL and HLS, and we plan to provide a software environment based on high-level synthesis. Our current study demonstrated that parallel applications can fully utilize OPTWEB. In addition, there is still room to explore the trade-off between operating frequency and hardware resources in the OPTWEB implementation. This exploration will be important when the software environment is developed, and it relates to the programming-language aspect of the DSA design guidelines discussed in Subsection 2.B.

APPENDIX A: NOTATION

$N$: network size
$r$: network dimension on GHC
$d$: network degree
$d_x$: degree of the $x$ dimension on GHC
$d_y$: degree of the $y$ dimension on GHC
$L$: message length
$\alpha$: overhead time for (1-hop) direct communication
$\beta$: reciprocal of the link bandwidth

Funding

New Energy and Industrial Technology Development Organization (JPNP13004).

Acknowledgment

This paper is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO). The authors thank Mr. Fujiwara of NEC Platforms, Ltd., and Mr. Ishida of NIPPON SYSTEMWARE CO., LTD., for their technical suggestions and support on the synthesis and implementation of the custom FPGA.

REFERENCES

1. J. Fang, Y. T. B. Mulder, J. Hidders, J. Lee, and H. P. Hofstee, “In-memory database acceleration on FPGAs: a survey,” VLDB J. 29, 33–59 (2019).

2. H. Miao, M. Jeon, G. Pekhimenko, K. S. McKinley, and F. X. Lin, “StreamBox-HBM: stream analytics on high bandwidth hybrid memory,” in Architectural Support for Programming Languages and Operating Systems (2019), pp. 167–181.

3. B. Klenk, N. Jiang, G. Thorson, and L. Dennison, “An in-network architecture for accelerating shared-memory multiprocessor collectives,” in ACM/IEEE 47th Annual International Symposium on Computer Architecture (2020), pp. 996–1009.

4. M. Cho, U. Finkler, M. Serrano, D. Kung, and H. Hunter, “BlueConnect: Decomposing all-reduce for deep learning on heterogeneous network hierarchy,” IBM J. Res. Dev. 63, 1:1–1:11 (2019).

5. K. Mizutani, H. Yamaguchi, Y. Urino, and M. Koibuchi, “OPTWEB: a lightweight fully connected inter-FPGA network for efficient collectives,” IEEE Trans. Comput. 70, 849–862 (2021).

6. Y. Urino, K. Mizutani, T. Usuki, and S. Nakamura, “Wavelength-routing interconnect ‘Optical Hub’ for parallel computing systems,” in International Conference on High Performance Computing in Asia-Pacific Region (2020), pp. 81–91.

7. K. Mizutani, H. Yamaguchi, Y. Urino, and M. Koibuchi, “Accelerating parallel sort on tightly-coupled FPGAs enabled by onboard Si-photonics transceivers,” in Optical Fiber Communication Conference (2021), paper Th5H.1.

8. R. L. Graham, L. Levi, D. Burredy, G. Bloch, G. Shainer, D. Cho, G. Elias, D. Klein, J. Ladd, O. Maor, A. Marelli, V. Petrov, E. Romlet, Y. Qin, and I. Zemah, “Scalable hierarchical aggregation and reduction protocol (SHARP)™ streaming-aggregation hardware design and evaluation,” in International Conference on High Performance Computing (2020), pp. 41–59.

9. A. Li, S. L. Song, J. Chen, J. Li, X. Liu, N. Tallent, and K. Barker, “Evaluating modern GPU interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect,” IEEE Trans. Parallel Distrib. Syst. 31, 94–110 (2020).

10. A. Mondigo, T. Ueno, K. Sano, and H. Takizawa, “Comparison of direct and indirect networks for high-performance FPGA clusters,” in Applied Reconfigurable Computing. Architectures, Tools, and Applications (2020), pp. 314–329.

11. J. Stern, Q. Xiong, A. Skjellum, and M. C. Herbordt, “A novel approach to supporting communicators for in-switch processing of MPI collectives,” in Workshop on Exascale MPI (2018).

12. Q. Xiong, C. Yang, P. Haghi, A. Skjellum, and M. Herbordt, “Accelerating MPI collectives with FPGAs in the network and novel communicator support,” in 28th IEEE International Symposium on Field-Programmable Custom Computing Machines (2020), p. 215.

13. A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray, M. Haselman, S. Hauck, S. Heil, A. Hormati, J. Y. Kim, S. Lanka, J. Larus, E. Peterson, S. Pope, A. Smith, J. Thong, P. Y. Xiao, and D. Burger, “A reconfigurable fabric for accelerating large-scale datacenter services,” IEEE Micro 35, 10–22 (2015).

14. K. Azegami, K. Musha, K. Hironaka, A. B. Ahmed, M. Koibuchi, Y. Hu, and H. Amano, “A STDM (static time division multiplexing) switch on a multi FPGA system,” in IEEE 13th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip (2019), pp. 328–333.

15. J. Lant, J. Navaridas, M. Jujan, and J. Goodacre, “Toward FPGA-based HPC: advancing interconnect technologies,” IEEE Micro 40, 25–34 (2020).

16. J. Sheng, Q. Xiong, C. Yang, and M. C. Herbordt, “Application-aware collective communication (extended abstract),” in International Symposium on Field-Programmable Custom Computing Machines (2016), p. 197.

17. A. D. George, M. C. Herbordt, H. Lam, A. G. Lawande, J. Sheng, and C. Yang, “Novo-G#: Large-scale reconfigurable computing with direct and programmable interconnects,” in IEEE High Performance Extreme Computing Conference (2016).

18. N. Fujita, R. Kobayashi, Y. Yamaguchi, T. Ueno, K. Sano, and T. Boku, “Performance evaluation of pipelined communication combined with computation in OpenCL programming on FPGA,” in IEEE International Parallel and Distributed Processing Symposium Workshops (2020), pp. 450–459.

19. J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 6th ed., The Morgan Kaufmann Series in Computer Architecture and Design (Morgan Kaufmann, 2017).

20. Y. Turakhia, G. Bejerano, and W. J. Dally, “Darwin: a genomics coprocessor,” IEEE Micro 39, 29–37 (2019).

21. AIO CORE, http://www.aiocore.com/.

22. L. N. Bhuyan and D. P. Agrawal, “Generalized hypercube and hyperbus structures for a computer network,” IEEE Trans. Comput. C-33, 323–333 (1984).

23. J. Kim, W. J. Dally, S. Scott, and D. Abts, “Technology-driven, highly-scalable dragonfly topology,” in International Symposium on Computer Architecture (2008), pp. 77–88.

24. Center for Computational Sciences, University of Tsukuba, “Overall specification of Cygnus,” https://www.ccs.tsukuba.ac.jp/wp-content/uploads/sites/14/2018/12/About-Cygnus.pdf.

25. T. Nakamura, K. Yashiki, K. Mizutani, T. Nedachi, J. Fujikata, M. Tokushima, J. Ushida, M. Noguchi, D. Okamoto, Y. Suzuki, T. Shimizu, K. Tanaka, A. Ukita, Y. Ibusuki, M. Kurihara, K. Kinoshita, T. Horikawa, H. Yamaguchi, J. Tsuchida, Y. Hagiwara, and K. Kurata, “Fingertip-size optical module, ‘Optical I/O Core’, and its application in FPGA,” IEICE Trans. Electron. E102-C, 333–339 (2019).

jocn-14-2-A166-i001 Kenji Mizutani received his Ph.D. in semiconductor engineering from Nagoya University, Nagoya, Aichi, Japan in 2003. He joined NEC Corporation in 2003 and was engaged in research on optical devices and optical telecommunication systems. Since 2014, he has been working at the Photonics and Electronics Technology Research Association (PETRA) as a senior researcher developing optical interconnect technology.

jocn-14-2-A166-i002 Hiroshi Yamaguchi received his B.E. in electronic engineering and M.E. in electrical engineering from Tokyo Denki University, Tokyo, Japan in 1986 and 1989, respectively. He joined NEC Corporation (Fuchu, Japan) in 1989, where he has been engaged in the development of electrical circuits. He is currently a chief researcher for the Photonics and Electronics Technology Research Association (PETRA), Fuchu, Japan, on a temporary basis.

jocn-14-2-A166-i003 Yutaka Urino received his B.E. in communication engineering and M.E. in electronic engineering from Tohoku University in 1985 and 1987, respectively. He joined NEC Corporation (Kawasaki, Japan) in 1987, where he was engaged in the research and development of optical waveguide devices and subsystems. He is currently a chief researcher for the Photonics and Electronics Technology Research Association (PETRA) on a temporary basis. He has won the best paper award at two international conferences (OEC’88 and OECC’98) and is one of the co-authors of Silicon Photonics III published by Springer. His current research interests include silicon photonics, optical interconnects, and parallel computing. He is a topical editor of Optical Review.

jocn-14-2-A166-i004 Michihiro Koibuchi received B.E., M.E., and Ph.D. degrees from Keio University, Yokohama, Kanagawa, Japan in 2000, 2002, and 2003, respectively. He is currently an associate professor at the National Institute of Informatics and SOKENDAI, Tokyo, Japan. His research interests include computer architecture and interconnection networks. He has published over 100 refereed technical conference and journal papers (nine in IEEE Transactions on Parallel and Distributed Systems, five in IPDPS, four in HPCA, and four in IEEE Transactions on Computers). He is a senior member of the IEEE, the IEEE Computer Society, IPSJ, and IEICE.



