
Efficient implementation of the Shack–Hartmann centroid extraction for edge computing

Open Access

Abstract

Adaptive optics (AO) is an established technique to measure and compensate for optical aberrations. One of its key components is the wavefront sensor (WFS), which is typically a Shack–Hartmann sensor (SH) capturing an image related to the aberrated wavefront. We propose an efficient implementation of the SH-WFS centroid extraction algorithm, tailored for edge computing. In the edge-computing paradigm, the data are elaborated close to the source (i.e., at the edge) through low-power embedded architectures, in which CPU computing elements are combined with heterogeneous accelerators (e.g., GPUs, field-programmable gate arrays). Since the control loop latency must be minimized to compensate for the wavefront aberration temporal dynamics, we propose an optimized algorithm that takes advantage of the unified CPU/GPU memory of recent low-power embedded architectures. Experimental results show that the centroid extraction latency obtained over spot images up to $700 \times 700$ pixels wide is smaller than 2 ms. Therefore, our approach meets the temporal requirements of small- to medium-sized AO systems, which are equipped with deformable mirrors having tens of actuators.

© 2020 Optical Society of America

1. INTRODUCTION

Light beams emanating from an object are distorted by perturbations present in the optical path between the object and the observer. Such perturbations affect the phase of each beam, leading to an aberrated wavefront that, in turn, deteriorates the image quality of the observed object [1].

Adaptive optics (AO) is a widely known technique whose objective is to obtain diffraction-limited images of the observed object. An AO system is composed of one or more wavefront sensors (WFSs), one or more deformable mirrors (DMs), and a controller. Following the phase conjugation principle, the controller shapes the DM into the conjugate of the wavefront measured by the WFS, compensating for the aberrations. Extensive literature in a wide range of fields including astronomy [2], ophthalmology [3], microscopy [4], communication [5], and high-power lasers [6] has demonstrated the practical value of AO for improving image quality despite optical and/or atmospheric aberrations.

The most common WFS is the Shack–Hartmann WFS (SH-WFS), which measures the wavefront local gradients by spatially sampling the incoming beam with a lenslet array, with each lenslet focusing its local subaperture onto a CCD or CMOS camera pixel array. Several algorithms deal with the centroid extraction from the spot image; the choice among them depends on the dynamic range of the centroids (e.g., spots overlapping neighboring reference spots), the image quality (e.g., faint, non-uniformly distributed spots), and the software complexity [7–9].

Since the AO control loop frequency should be at least one order of magnitude higher than the cutoff frequency of the wavefront aberration dynamics, the WFS measurement latency must be minimized either by sacrificing wavefront spatial resolution (e.g., a smaller beam aperture leads to a shorter exposure time because of the higher photon flux and fewer pixels to be transferred), or by choosing a fine-tuned extraction algorithm implemented on a suitable architecture. Under normal conditions, each subaperture is completely independent of the others. Hence, it is natural to tackle the SH-WFS centroid extraction problem by exploiting the throughput-oriented single-instruction–multiple-data (SIMD) paradigm, so that the centroids can be computed in parallel using the same set of instructions.

Graphics processing unit (GPU) and field-programmable gate array (FPGA) architectures are examples of architectures well suited to SIMD computation. However, while the FPGA excels in raw performance compared to the GPU, the latter offers programming flexibility comparable to that of latency-oriented central processing unit (CPU) architectures. In fact, most GPU development tools are extensions of those used for programming CPU code. Therefore, both GPU and CPU solutions offer easier debugging than FPGAs and faster deployment for experimental testing.

Wavefronts related to the aberrated eye pupil can be reconstructed from SH-WFS images using a desktop-class GPU, achieving lower latency than the corresponding CPU implementation [10]. The wavefront can also be estimated from SH-WFS images by means of a GPU-accelerated neural network, with detection accuracy near that of traditional methods when properly trained [11]. The large amount of data coming from the SH-WFS arrays of the extremely large telescope (ELT) class can be tackled by GPU clusters to accelerate the gradient extraction [12,13]. Aside from real-time computation, introductory work on the tomographic reconstruction of atmospheric turbulence through neural networks on GPUs has also been done [14], and several GPU-accelerated optics simulation frameworks have been developed [15–17]. Some AO-related algorithms implemented on GPUs have been demonstrated to perform even better than their FPGA counterparts [18].

However, little attention has been given to the implementation of the SH-WFS algorithms for edge computing [19], which is an effective solution for small- to medium-sized telescopes [2]. Many solutions for edge computing combine a hardware accelerator with the input sensor. Examples are FPGAs embedded with camera sensors (i.e., smart cameras) [20], which can process the wavefront with minimal latency [21,22]. Nonetheless, heterogeneous architectures in which CPU processing elements are combined with GPU accelerators are the preferred alternative to FPGA-based smart cameras when dealing with hard-to-implement visual computing algorithms [23–25]. In addition, recent heterogeneous low-power architectures (e.g., NVIDIA Jetson TX2 and Nano) reduce the memory transfer overhead by implementing a unified CPU/GPU memory. They are becoming pervasive in edge computing due to their compact footprint, power and energy efficiency, and competitive high-performance computing (HPC) capabilities [26].

In this paper, we propose a SH-WFS centroid extraction algorithm that is tailored for edge computing on low-power CPU/GPU devices with unified memory. We show that the time needed to complete the centroid extraction from the image acquisition is small enough to satisfy the AO closed-loop latency constraint. Our contribution is the implementation of such an algorithm that:

  • Can efficiently run on portable, low-power, energy-efficient devices, enabling at-the-edge AO control;
  • Is flexible with respect to the WFS hardware configuration;
  • Guarantees lower latency than the on-board CPU counterparts.

The paper is organized as follows. Section 2 briefly recalls how a SH sensor works, whereas Section 3 introduces the moment-based centroid extraction algorithm. Section 4 explains the parallel implementation of the centroid extraction algorithm, whose experimental results are shown in Section 5. In Section 6, some conclusions are drawn together with a future work plan.

2. SH-WFS OPERATION PRINCIPLE

The SH-WFS is a sensor that measures wavefront distortions by computing the local gradients of the wavefront. Figure 1 shows the optical principle that allows the SH-WFS to spatially and temporally sample the incoming wavefront. The $l$-th lenslet of the lenslet array focuses the local wavefront into a light spot on the pixel array of the capture device. Assuming a point-like light source (e.g., a distant star), the centroid position ${c_l} \in {{\mathbb R}^2}$ of the spot is related to the spatial displacement of the incoming aberrated wavefront with respect to the flat wavefront (no aberrations).

Fig. 1. Operation principle of the Shack–Hartmann WFS.

Let $\rho \in {\mathbb R}$ be the diameter of the telescope circular aperture. The grid-shaped lenslet array $L$ inscribes the wavefront of the incoming light beam, with the number of lenslets in each row given by

$${w_L} = \left\lfloor {\frac{\rho}{{{d_L}}}} \right\rfloor ,$$
where ${d_L} \in {\mathbb R}$ is the size of each lenslet, as shown in Fig. 1. The operator $\lfloor z \rfloor$ denotes the floor function, which returns the greatest integer smaller than or equal to $z \in {\mathbb R}$.

The lenslet focal length $f \in {\mathbb R}$ determines the size of the pixel region onto which the $l$-th spot is focused. A large focal length favors centroid sensitivity, whereas a small focal length increases the centroid dynamic range. Spots might be imaged anywhere in the pixel array, resulting in arbitrarily large and overlapping pixel regions. However, if $f$ is sufficiently small in relation to the expected spatial variance of the incoming wavefront, then the pixel regions are disjoint. This is a valid assumption when measuring a wavefront already compensated for by an AO control system.

Each pixel region is square since the lenslets are arranged in a grid. The number of pixels contained in a row of a pixel region is

$$d = \frac{{{d_L}}}{{{d_P}}},$$
where ${d_P} \in {\mathbb R}$ is the pixel width (assuming no dead zone among regions).

Let $p = ({x_p},{y_p})$ be a pixel position, with ${x_p},{y_p} \in {\mathbb N}$ being the Cartesian coordinates of the pixel in the pixel array. Each lenslet $l$ focuses the spot into the pixels in the $l$-th pixel region:

$${P_l} = \{p \;|\; \lfloor {x_l} \rfloor \le {x_p} \lt \lfloor {x_l} + d \rfloor ,\; \lfloor {y_l} \rfloor \le {y_p} \lt \lfloor {y_l} + d \rfloor \} ,$$
where $({x_l},{y_l}) \in {{\mathbb R}^2}$ is the pixel coordinate of the bottom-left corner of the pixel region. The mapping from the lenslet array into the pixel array is calibrated by identifying the pixel position $({x_0},{y_0})$ located at the bottom-left pixel of the bottom-left lenslet in the lenslet grid. (Throughout the paper, 1D lenslet and pixel arrays are indexed as row-major arrays starting from zero.) Then, the bottom-left pixel coordinate of the $l$-th pixel region is calculated as
$$({x_l},{y_l}) = \left({\left\lfloor {x_0} + d\left\lfloor \frac{l}{{w_L}} \right\rfloor \right\rfloor ,\;\left\lfloor {y_0} + d\left(l \bmod {w_L}\right) \right\rfloor} \right).$$
Merging all pixel regions yields the pixel region of interest (RoI) $P$, which is a square region having width
$${w_P} = d{w_L},$$
as illustrated in Fig. 2. (The pixel RoI width ${w_P} \in {\mathbb N}$ is expressed in pixel units, whereas the lenslet RoI width ${w_L} \in {\mathbb N}$ is expressed in lenslet units.)
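As a concrete illustration of Eqs. (1)–(5), the sketch below evaluates the lenslet grid width, the pixel region width, the RoI width, and the bottom-left corner of a pixel region in plain C++ host code; the struct name, the field names, and the numerical values in main() are ours and purely illustrative.

#include <cmath>
#include <cstdio>

// Hypothetical helper evaluating Eqs. (1)-(5): it maps the l-th lenslet of a
// w_L x w_L grid to the bottom-left pixel of its pixel region P_l.
struct ShGeometry {
    double rho;     // telescope aperture diameter
    double d_L;     // lenslet size
    double d_P;     // pixel size
    double x0, y0;  // calibrated bottom-left pixel of lenslet 0

    int    gridWidth()   const { return (int)std::floor(rho / d_L); }          // Eq. (1)
    double regionWidth() const { return d_L / d_P; }                           // Eq. (2)
    int    roiWidth()    const { return (int)(regionWidth() * gridWidth()); }  // Eq. (5)

    // Eq. (4): bottom-left pixel coordinate of the l-th pixel region
    // (row-major lenslet indexing starting from zero, as in the text).
    void regionCorner(int l, int &xl, int &yl) const {
        const int    wL = gridWidth();
        const double d  = regionWidth();
        xl = (int)std::floor(x0 + d * (l / wL));   // integer division acts as the floor
        yl = (int)std::floor(y0 + d * (l % wL));
    }
};

int main() {
    // Illustrative numbers: a 7 mm aperture, 1 mm lenslets, and 0.1 mm pixels
    // give the 7-lenslet grid and 10-pixel pitch used in the Fig. 3 example.
    ShGeometry g{7.0e-3, 1.0e-3, 1.0e-4, 0.0, 0.0};
    int xl, yl;
    g.regionCorner(20, xl, yl);
    std::printf("w_L=%d  d=%.1f  w_P=%d  corner of P_20=(%d,%d)\n",
                g.gridWidth(), g.regionWidth(), g.roiWidth(), xl, yl);
    return 0;
}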
Fig. 2. SH-WFS spot image. The pixel regions ${P_l}$ are adjacent to each other, and together form the pixel RoI $P$.

Comparing the measured centroid position ${c_l}$ with its reference centroid, i.e., the centroid measured when the wavefront is flat, yields the incoming wavefront phase gradients (slopes) from which the wavefront phase can be reconstructed by zonal or modal techniques [1]. The centroids can be extracted from the WFS spot image by a moment-based method [7], which is used to calculate the center of gravity (CoG) of each spot on the pixel array.

3. MOMENT-BASED CENTROID EXTRACTION

Let $I(p) \in {\mathbb R}$ be the measured intensity value at a RoI pixel $p \in P$. The image moments for the $l$-th spot image are defined as

$$m_l^{ij} = \sum\limits_{p \in {P_l}} x_p^i y_p^j I(p),$$
where $i,j \in {\mathbb N}$ are the moments order over the Cartesian directions. The $l$-th centroid position ${c_l}$ is the CoG of the $l$-th spot calculated using the image moments as
$${c_l} = \left({\frac{{m_l^{10}}}{{m_l^{00}}},\frac{{m_l^{01}}}{{m_l^{00}}}} \right).$$

According to the pixel region defined in Eq. (3), all the centroids ${c_l}$ of the spot image can be extracted by implementing Algorithm 1 [27].

Algorithm 1. Lenslet-Wise Centroid Extraction
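Since the listing of Algorithm 1 is not reproduced in this text-only version, the following sequential sketch shows how Eqs. (3), (6), and (7) combine into a lenslet-wise extraction; the function name and signature are ours, the image is assumed to be the row-major eight-bit array introduced in Section 4.A, and an integer lenslet pitch is assumed for brevity.

#include <cstdint>
#include <utility>
#include <vector>

// Sequential lenslet-wise centroid extraction (sketch of Algorithm 1).
// img is the row-major spot image of width wP; d is the pixel region width,
// wL the lenslet grid width, and (x0, y0) the calibrated bottom-left pixel.
std::vector<std::pair<double, double>>
extractCentroids(const std::vector<uint8_t> &img, int wP,
                 int d, int wL, int x0, int y0)
{
    std::vector<std::pair<double, double>> centroids(wL * wL);
    for (int l = 0; l < wL * wL; ++l) {            // outer loop over lenslets
        const int xl = x0 + d * (l / wL);          // Eq. (4), integer pitch
        const int yl = y0 + d * (l % wL);
        double m00 = 0.0, m10 = 0.0, m01 = 0.0;
        for (int yp = yl; yp < yl + d; ++yp)       // inner loops over P_l, Eq. (3)
            for (int xp = xl; xp < xl + d; ++xp) {
                const double I = img[xp + wP * yp];   // Eq. (9) row-major index
                m00 += I;                             // Eq. (6) image moments
                m10 += xp * I;
                m01 += yp * I;
            }
        // Eq. (7): CoG of the l-th spot (assumes m00 > 0, i.e., a non-dark region).
        centroids[l] = {m10 / m00, m01 / m00};
    }
    return centroids;
}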

The lower bound of the computational cost for the centroid extraction algorithm is given by the analysis of the two nested for-loops. Since the pixel regions ${P_l}$ covering the lenslet array $L$ are adjacent, disjoint, and completely contained in the pixel RoI $P$, the inner loop iterates over each pixel $p \in P$. The pixel RoI $P$ is square, and hence its size is $\lfloor d w_L \rfloor^{2}$. A total of five operations are computed for each pixel $p$ to update the moments. The outer loop iterates over each lenslet $l \in L$, where the lenslet grid $L$ is a square of size $w_L^2$. Each centroid ${c_l}$ requires two instructions to be calculated from the moments, as shown in Eq. (7).

Therefore, the lower bound complexity depends on the lenslet grid width ${w_L}$ and the pixel region width $d$:

$$\Omega ({w_L},d) = 2w_L^2 + 5{\lfloor {d{w_L}} \rfloor ^2}.$$

Since the moment partials of Eq. (6) (i.e., the addends of the sum) are mutually independent, they can be computed concurrently and then summed up to yield the lenslet moment.

Since the CoG method to extract the centroids is sensitive to the WFS signal-to-noise ratio (SNR), standard image-processing techniques could be implemented to enhance the SNR such as (i) thresholding pixels’ intensities below the noise level, (ii) correcting the gamma function of the intensity by applying a power-law transformation, and (iii) excluding pixels near the lenslet edges (i.e., windowing) [28].
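As an illustration of points (i) and (ii), such a per-pixel transform could be written as a small device function to be applied right after each intensity is read; the threshold and gamma values are hypothetical and would have to be tuned to the actual sensor (point (iii), windowing, additionally requires the pixel position within the lenslet).

#include <math.h>

// Sketch of per-pixel SNR enhancement: intensities below the noise threshold
// are clipped to zero, and the remainder is gamma-corrected by a power-law
// transformation. The parameter values are illustrative, not from the paper.
__device__ __forceinline__ float enhanceIntensity(float I,
                                                  float threshold, // e.g., measured noise floor
                                                  float gamma)     // e.g., 0.8f
{
    if (I < threshold) return 0.0f;              // (i) thresholding
    return powf(I / 255.0f, gamma) * 255.0f;     // (ii) power-law (gamma) correction
}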

4. CUDA IMPLEMENTATION

CUDA and OpenCL are the two dominant frameworks for parallel programming. They offer the same capabilities, e.g., exploiting unified memory, albeit with different hardware terminology and code syntax. OpenCL is an open standard compatible with a wide range of GPU and multi-core CPU architectures, whereas CUDA is proprietary and compatible only with NVIDIA architectures. On the other hand, since the CUDA framework is specific to NVIDIA architectures, it allows a tighter implementation than OpenCL and a deeper integration with the development tools.

We rely on the CUDA parallel framework to fully exploit the Jetson architecture [29]. In the CUDA programming model, a device (GPU) utilizes its threads and memory to execute kernels (i.e., functions) called by the host (CPU). Threads are organized in blocks, and blocks are contained in a grid. The scheduler maps blocks onto multiple streaming multiprocessors, and each thread in a block is mapped to a core. Up to 32 threads (i.e., one warp of threads) can be scheduled to concurrently execute the same instructions (i.e., SIMD).

Each thread can efficiently access register, local, or shared memory. While the first two are limited to the thread scope, the latter is available to all threads in a block and, hence, can be used for efficient communication among threads. The register memory is faster than local memory, but it is also the least available. Register, local, and shared memories are device-side resources meant to store temporary results. The host must access the device’s slower global memory to store the dataset to be processed by the kernel and read the results.

A. Data Levels

Since the spot image is stored into memory as a row-major linear array, the pixel expressed in Cartesian coordinate $p = ({x_p},{y_p})$ is mapped into the array index

$$\phi = {x_p} + {w_P}{y_p}.$$
By accessing contiguous elements along the rows, neighboring data chunks are cached in fast memory, leading to lower transfer latency. To leverage the memory cache, the spot image is partitioned into data levels.

The topmost level represents the entire spot image in which the pixel RoI $P$, i.e., the lenslet grid, is immersed. The lenslet grid is divided into rows that are spanned by lenslet groups. Each lenslet group $\bar l$ has the same power-of-two size, up to 16 lenslets wide, to satisfy the condition for the reduction operation on its elements. Multiple lenslet groups are concatenated to cover all lenslets in a row; any positions left over in the last lenslet group are zero-padded. A lenslet $l$ is partitioned into a power-of-two number of stacked row groups. The size of each row group $\phi _{\bar r}^l$ is given by the rounded-up ratio between the number of pixel rows in a lenslet and the number of row groups assigned to a lenslet. The remaining rows of the last row group fall outside the lenslet and are therefore excluded from the computation. Each row $\phi _r^{l,\bar r}$ in a row group is composed of pixels. Pixels in a row are covered by pixel groups, with each pixel group $\phi _{\bar p}^{l,\bar r,r}$ containing four pixels $\phi _p^{l,\bar r,r,\bar p}$ to optimize memory transfer. Since pixel groups are aligned to the beginning of the spot image array, leading and trailing pixel groups may contain pixels outside the scope of the actual lenslet. The intensities of such pixels are read as zero so that they do not contribute to the final result.

Let the pixel partial be the set of partial moments ${m^{00}}$, ${m^{10}}$, and ${m^{01}}$ calculated at the pixel position $\phi _p^{l,\bar r,r,\bar p}$:

$$\mu _p^{l,\bar r,r,\bar p} = \{{m^{00}}(\phi _p^{l,\bar r,r,\bar p});\;\; {m^{10}}(\phi _p^{l,\bar r,r,\bar p});\;\; {m^{01}}(\phi _p^{l,\bar r,r,\bar p})\} .$$
The sum between two pixel partials $\mu_i^{l,\bar r,r,\bar p} +\mu_j^{l,\bar r,r,\bar p}$ is the set defined as the piece-wise sum of their respective partial moments. Summing together all the pixel partials associated with a pixel group yields the pixel group partials $\mu_{\bar p}^{l,\bar r,r}$. The row partial $\mu_r^{l,\bar r}$ of a row group is obtained by adding the pixel group partials calculated on its underlying pixel groups. The row group partial $\mu_{\bar r}^l$ of a lenslet is given by summing up all the row partials in a row group. Finally, the lenslet moment ${\mu _l}$ of the lenslet grid is the sum of the row group partials contained in the $l$-th lenslet.
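In code, a pixel partial and its piece-wise sum can be represented by a small plain struct reused at every data level; the type and function names below are our illustrative choices, not the paper's.

// Set of partial moments (m00, m10, m01) of Eq. (10), accumulated at any data
// level, from a single pixel up to a full lenslet. The piece-wise sum is the
// binary operator used throughout the reduction.
struct Partial {
    float m00, m10, m01;
};

__host__ __device__ __forceinline__ Partial operator+(Partial a, Partial b) {
    return Partial{a.m00 + b.m00, a.m10 + b.m10, a.m01 + b.m01};
}

// Pixel partial of the pixel at Cartesian coordinates (xp, yp) with intensity I.
__host__ __device__ __forceinline__ Partial pixelPartial(int xp, int yp, float I) {
    return Partial{I, xp * I, yp * I};
}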

Figure 3 describes how a single pixel $p$ of the spot image is indexed through the data levels. The lenslet grid is seven lenslets wide, and the lenslet size is $10 \times 10\; {\rm pixels}$. With the lenslet group size fixed to four lenslets, it takes two lenslet groups to cover a row of lenslets, with one remaining lenslet. The fifth lenslet group accesses lenslets 18, 19, and 20, zero padding the others. Each lenslet is partitioned into two row groups, and hence each row group spans five rows, with no rows left out. The lenslet pitch is 10 pixels and can be covered by up to four pixel groups, depending on the memory alignment. In the figure, the fourth pixel row needs only two pixel groups. The remaining pixel intensities of the second pixel group are read as zeros (and no further pixel groups are drawn). The selected pixel $p$ is indexed as $\phi _1^{20,1,4,2}$ and produces the partial pixel moment $\mu_1^{20,1,4,2}$. Adding together the partial moments from the bottom to the top of the data hierarchy levels yields the lenslet moment ${\mu _{20}}$.

Fig. 3. Diagram of the data hierarchy levels as seen from the CUDA extraction algorithm. On the right: partials obtained by reducing the current level data. With the exception of the lenslet moment, numerical indices are relative to the level.

B. Optimized GPU Data Transfer through Coalesced Memory Accesses

In most architectures, the global memory and the CPU memory are physically decoupled. This means that data have to be transferred between memories, with each transfer increasing the temporal overhead over the execution time. To overcome such overheads, latency-hiding techniques can be exploited. However, the kernel execution time must be comparable to the transfer time to take advantage of those techniques. Since the host and device on a Jetson architecture physically share the same global memory, allocated memory can be addressed by both the host and device by pinning it (page-locked mapped memory, also called zero-copy), hence avoiding any transfer overhead.

Remark 1 Page-locked memory on the Jetson TX2 GPU is not cached when accessed by the CPU. As a consequence, host-side memory reading is not optimized. However, the CPU accesses the memory only to write the intensity values acquired from the sensor and read back the computed commands to be sent to the actuators. Therefore, there is no performance penalty using page-locked memory instead of other addressing options.
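A minimal host-side sketch of this zero-copy allocation with the standard CUDA runtime calls for page-locked mapped memory is shown below; error checking is omitted and the function name is ours.

#include <cstdint>
#include <cuda_runtime.h>

// Allocate the spot image buffer as page-locked, mapped (zero-copy) memory so
// that the same physical memory is addressable by both CPU and GPU on the
// Jetson, avoiding any explicit host-to-device copy. Depending on the CUDA
// version, cudaSetDeviceFlags(cudaDeviceMapHost) may have to be called before
// any other CUDA operation to enable mapped allocations.
uint8_t *allocateSpotImage(size_t nPixels, uint8_t **devicePtr)
{
    uint8_t *hostPtr = nullptr;
    cudaHostAlloc(reinterpret_cast<void **>(&hostPtr), nPixels,
                  cudaHostAllocMapped);
    // Device-side alias of the same buffer, to be passed to the kernel.
    cudaHostGetDevicePointer(reinterpret_cast<void **>(devicePtr), hostPtr, 0);
    return hostPtr;  // the camera driver writes the intensities here
}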

Up to 128 bytes of data in global memory can be accessed in one transaction. To fully exploit the cache, threads in a warp should ideally access consecutive single precision words (four bytes) starting from a 128-byte-aligned address to realize a coalesced memory transfer. Since the spot image is encoded into a row-major eight-bit array with stride divisible by four, each thread of a warp reads adjacent words of four bytes, hence accessing four pixels (a full pixel group) at once.

The highest bandwidth throughput is achieved when a warp reads 32 pixel groups, i.e., 128 bytes. To maximize caching performance, the warp operates on the contiguous rows of a lenslet group instead of a single lenslet row. This way, a warp services multiple lenslets. The lenslet pitch $d$ determines the maximum number of pixel groups needed to cover a single lenslet row. Since pixel groups are read from contiguous memory, their memory alignment must be accounted for. The worst case scenario is that the first pixel of the row addresses the last memory location of a packet, hence loading the full packet while requiring only one pixel out of four. Similarly, the logical address of the last pixel of the row can point to the first memory location of the packet, again ignoring the remaining three pixels. The maximum number of packets is therefore calculated by considering those extra six offset pixels:

$${n_{\rm{packets}}} = \left\lfloor {\frac{{d + 3 + 3}}{4}} \right\rfloor .$$
The number of packets determines the optimal lenslet group size:
$${d_{\bar L}} = 2^{\left\lfloor {\log}_2 \left\lfloor \frac{32}{{n_{\rm{packets}}}} \right\rfloor \right\rfloor}.$$
Since the lenslet groups are consecutively stacked to cover a row of lenslets of the lenslet grid starting from the leftmost lenslet, the lenslets of the last rightmost lenslet group outside the lenslet grid are ignored.
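Equations (11) and (12) translate directly into a few lines of host code; the helper names are ours, and for the $d = 10$ pitch of the example in Section 4.C they yield ${n_{\rm packets}} = 4$ and a lenslet group of eight lenslets for a full 32-thread warp (the figure uses eight threads per warp, hence a group of two).

#include <cmath>

// Eq. (11): worst-case number of 4-pixel packets needed to cover one lenslet
// row of d pixels, accounting for up to 3 leading and 3 trailing offset pixels.
int numPackets(double d) {
    return (int)std::floor((d + 3.0 + 3.0) / 4.0);
}

// Eq. (12): optimal (power-of-two) lenslet group size, i.e., how many lenslet
// rows a 32-thread warp can service at once (assumes numPackets(d) <= 32).
int lensletGroupSize(double d) {
    const int perWarp = 32 / numPackets(d);  // integer division acts as the floor
    return 1 << (int)std::floor(std::log2((double)perWarp));
}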

C. Data Reduction

Parallel data reduction summarizes (by using a commutative binary operator) all the homogeneous data of a dataset by exploiting communication and synchronization functions among concurrent execution units (i.e., CUDA threads). Given ${2^n}$ elements in a dataset mapped to a pool of ${2^n}$ execution units, $n \in {\mathbb N}$, the elements of each disjoint pair of execution units in the pool are reduced concurrently into intermediate results, which are half the size of the original dataset. This process is iterated over such intermediate results, with each iteration halving their size. The reduction result is the intermediate result at the last iteration, reached when there are no pairs left. Since operations are done concurrently in a logarithmic-tree fashion, the computational cost of reducing ${2^n}$ elements is $O(n)$. If the number of elements is not a power of two, then the dataset is padded with elements whose values are neutral with respect to the binary operator considered.

In the CUDA programming model, the parallel reduction can be efficiently implemented at warp level through shuffle primitives, which are low-level CUDA communication and synchronization functions [30]. The shuffle instructions let threads access the registers of other threads scheduled in the same warp, despite registers being local to the threads. By exploiting such primitives, the parallel reduction operates over registers, which are the fastest type of memory in the CUDA architecture.

Since a warp consists of 32 concurrent threads, the dataset to be reduced must be 32 elements large to achieve peak efficiency. However, several smaller datasets can be operated on at once by combining them into a 32-element dataset. To do so, the datasets must have the same power-of-two size, padding any remainder with neutral elements. Then, they are concatenated into the full dataset, which must also be padded with neutral elements if its size is not a power of two.

The full dataset is processed via the XOR scheme (butterfly accessing pattern [30]) to reduce the sub-datasets concurrently. Each thread participating in the shuffle stores the sum of its value with that of the thread addressed by the bitwise XOR between the caller thread index and the mask value. The reduction can then be performed on all datasets without cross-talk by specifying the mask value at each iteration of the algorithm. Assuming ${2^m}$ datasets (i.e., the lenslet group size), the reduction algorithm returns the results after ${\log}_2 (32) - m + 1$ iterations.
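A device-side sketch of this butterfly reduction over the Partial type introduced in Section 4.A is given below; it relies only on the standard __shfl_xor_sync primitive, and the argument subsetSize (the size of each concatenated sub-dataset, i.e., 32 divided by the lenslet group size) is our naming.

// Butterfly (XOR) shuffle reduction of the Partial moments held by the 32
// threads of a warp. The warp holds concatenated sub-datasets of subsetSize
// elements each (subsetSize is a power of two); the masks never exceed
// subsetSize/2, so threads of different sub-datasets never exchange values.
// After the loop, every thread holds the full sum of its own sub-dataset.
__device__ __forceinline__ Partial warpReducePartial(Partial v, int subsetSize)
{
    const unsigned fullMask = 0xffffffffu;  // all 32 lanes participate
    for (int mask = subsetSize / 2; mask > 0; mask /= 2) {
        v.m00 += __shfl_xor_sync(fullMask, v.m00, mask);
        v.m10 += __shfl_xor_sync(fullMask, v.m10, mask);
        v.m01 += __shfl_xor_sync(fullMask, v.m01, mask);
    }
    return v;
}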

Figure 4 shows how the pixels of the spot image are reduced to lenslet moments ${\mu _l}$.

Fig. 4. Diagram of the data reduction. In this example, since the pixel region width (i.e., lenslet pitch) is $d = 10$, six pixel groups (${\bar p_0}$ to ${\bar p_5}$) must be read from global memory in order to cover the two lenslet rows. The rows are partitioned into two row groups of five rows each, which are processed independently. Each thread reduces its pixel group into a partial moment. Assuming eight threads per warp, the pixel group partials of two lenslet rows are reduced at once using the shuffle reduction. The resulting row partial moments are accumulated into the row group partial moments. When all row groups are processed, their results are reduced in shared memory, yielding the two lenslet moments, and then copied into global memory.

As described in Section 4.B, a warp covers the adjacent pixel rows of the lenslets contained in the lenslet group $\bar l$, one row per lenslet. Each thread in the warp sequentially reduces the pixel partials $\mu_p^{l,\bar r,r,\bar p}$ into the pixel group partial $\mu_{\bar p}^{l,\bar r,r}$. Then, the warp performs a parallel shuffle reduction over the dataset built on the pixel group partials, yielding the row partials $\mu_r^{l,\bar r}$.

Row partials within a row group $\phi _{\bar r}^l$ are processed one after the other by a single warp. To increase occupancy, multiple warps can be associated with a lenslet group, one for each row group. To do so, threads in a block are partitioned into warps by using cooperative groups, a CUDA feature that lets warps be synchronized independently, as sketched below. All row group partials $\mu_{\bar r}^l$ are then stored in shared memory and reduced in parallel into lenslet partials ${\mu _l}$.
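A minimal sketch of this partitioning with the CUDA cooperative groups API follows; the kernel body is reduced to the partitioning and synchronization calls, and the identifier names are ours.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Inside the extraction kernel: split the block into warp-sized tiles, one per
// row group, so that each warp reduces its row group independently and the
// final shared-memory reduction is synchronized at block scope.
__global__ void partitionSketch()
{
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);

    const int rowGroup = threadIdx.x / warpSize;  // which row group this warp serves
    const int lane     = warp.thread_rank();      // lane index within the warp

    // ... per-warp row group reduction would go here ...

    warp.sync();   // synchronize only the threads of this warp
    block.sync();  // block-wide barrier before the shared-memory reduction
    (void)rowGroup; (void)lane;
}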

In the example in Fig. 4, the lenslet moments of lenslets 0 and 1, which belong to the same lenslet group, are extracted at the same time. Since the lenslet pitch is $d = 10$, the maximum number of packets for each row is ${n_{\rm{packets}}} = 4$, and the optimal lenslet group size is ${d_{\bar L}} = 2$ according to Eq. (12) (assuming eight threads per warp for the sake of space). The pixel group memory locations needed for the two lenslet rows, ${\bar p_0}$ to ${\bar p_5}$, are loaded from global memory into each thread. The memory locations from ${\bar p_0}$ to ${\bar p_3}$ contain the pixel groups $\phi _0^{0,0,r}$ to $\phi _3^{0,0,r}$, whereas those from ${\bar p_3}$ to ${\bar p_5}$ contain the pixel groups $\phi _0^{1,0,r}$ to $\phi _2^{1,0,r}$. Since pixel groups are contiguous, ${\bar p_3}$ is used by both lenslets and is therefore cached. Each thread demuxes and sums its pixels into group partials. The last pixel group partial does not belong to the currently considered lenslets and is therefore set to zero. The shuffle reduction yields the row partials of both lenslets, which are then added to the other row partials (the accumulators are denoted by $\sigma _0^0$ and $\sigma _0^1$). The sum of all row partials gives the zeroth row group partials of both lenslets and is stored into shared memory. Since the lenslet is partitioned into two row groups, the number of threads scheduled for a block is set to 16 so that two cooperative groups of eight threads can be formed. Hence, the zeroth and first row group partials are calculated concurrently. Reducing the row group partials stored in shared memory yields the two lenslet moments.

D. Kernel Algorithm

Algorithm 2 illustrates the steps of each concurrent thread to extract the centroids of a spot image. Lines 6–11 calculate the pixel group partials. To obtain the row partial, the butterfly shuffle reduction is carried out in line 12. It is worth remarking that rows from other lenslets in the lenslet group are also reduced at the same time. The rows in the row group are scanned in lines 3–13, accumulating the row partials into the row group partial at each iteration. The shared reduction over the row group partials performed in line 16 yields the lenslet moments.

Algorithm 2. CUDA Centroid Extraction
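Since the listing of Algorithm 2 is likewise not reproduced here, the kernel below gives a functionally equivalent but simplified sketch that reuses the Partial helpers and warpReducePartial sketched above: one 32-thread warp per lenslet, with every thread striding over the pixels of its region. It deliberately omits the lenslet grouping, the 4-pixel coalesced loads, and the row-group/shared-memory stage of the actual implementation, so it should be read as an illustration of the output, not as the authors' optimized kernel.

#include <cstdint>
#include <cuda_runtime.h>

// Simplified centroid extraction: one block of 32 threads (= one warp) per
// lenslet. Each thread strides over the pixels of the lenslet region P_l,
// accumulates a Partial, and the full-warp butterfly reduction yields the
// lenslet moments; lane 0 writes the centroid of Eq. (7).
__global__ void centroidKernelSimplified(const uint8_t *img, int wP, int d,
                                         int wL, int x0, int y0,
                                         float2 *centroids)
{
    const int l    = blockIdx.x;    // lenslet index (grid is wL * wL blocks)
    const int lane = threadIdx.x;   // blockDim.x == 32
    if (l >= wL * wL) return;

    const int xl = x0 + d * (l / wL);   // Eq. (4), integer pitch assumed
    const int yl = y0 + d * (l % wL);

    Partial acc{0.0f, 0.0f, 0.0f};
    for (int k = lane; k < d * d; k += 32) {   // stride over P_l
        const int xp = xl + (k % d);
        const int yp = yl + (k / d);
        acc = acc + pixelPartial(xp, yp, (float)img[xp + wP * yp]);  // Eq. (9)
    }
    acc = warpReducePartial(acc, 32);   // whole warp reduces one lenslet

    if (lane == 0)   // Eq. (7): CoG from the accumulated moments
        centroids[l] = make_float2(acc.m10 / acc.m00, acc.m01 / acc.m00);
}

// Launch sketch, with the zero-copy device pointers of Section 4.B:
//   centroidKernelSimplified<<<wL * wL, 32>>>(devImg, wP, d, wL, x0, y0, devCentroids);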

Remark 2 Algorithm 2 does not implement the image-processing techniques described in Section 3 for enhancing the SNR. However, since such techniques operate on each pixel individually, they can be introduced by transforming the intensities right after reading them from memory, keeping the same level of concurrency. The added computational cost is therefore negligible. Alternatively, per-pixel processing can be implemented by a look-up table intensity transformation (e.g., fine-tuning the camera intensity mapping).

5. EXPERIMENTS

In these experiments, the centroid extraction is performed on images stored in memory. This choice is motivated by the fact that the exposure and transfer of the image are performed before processing, and hence they have no impact on the execution time of the centroid extraction implementation.

The moment-based centroid extraction routine assumes that the captured image contains one intensity spot per lenslet, as shown in Fig. 2. However, since all pixels must be accessed and processed (with no additional data-dependent conditions), the information contained in the spot image does not affect the execution time. A pool of eight-bit white images ($I = 255$ for all pixels) is fed to both the proposed implementation of Algorithm 2 and the sequential implementation of Algorithm 1 to test the correctness of our approach. The spot images used in the benchmark experiments, instead, are synthesized as eight-bit images with each pixel having random intensity ($I \in [0,255]$) to prevent memory caching optimizations.

The target platform is the NVIDIA Jetson TX2, which integrates a 256-core Pascal GPU, a dual-core NVIDIA Denver 2 CPU, and a quad-core ARM Cortex-A57 CPU. The test-bench runs on the stock Linux distribution that comes with the NVIDIA Jetpack 4.2.1 firmware. The GPU frequency is locked at 1.3 GHz, and each CPU core frequency is locked at 2 GHz.

A test run measures the time elapsed from the issue of the extraction command to the transfer of all extracted centroids, averaged over 50 executions on the same spot image.
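Such a test run can be reproduced with CUDA events; the harness below is a sketch that times the simplified kernel introduced in Section 4.D over repeated launches (the names are ours, and with zero-copy buffers the synchronization after the launch already covers the read-back of the centroids).

#include <cstdint>
#include <cuda_runtime.h>

// Average latency of the centroid extraction over repeated launches on the
// same spot image, mirroring the 50-execution test-run protocol.
float averageLatencyMs(const uint8_t *devImg, float2 *devCentroids,
                       int wP, int d, int wL, int x0, int y0, int runs = 50)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float totalMs = 0.0f;
    for (int i = 0; i < runs; ++i) {
        cudaEventRecord(start);
        centroidKernelSimplified<<<wL * wL, 32>>>(devImg, wP, d, wL,
                                                  x0, y0, devCentroids);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        totalMs += ms;
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return totalMs / runs;
}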

Table 1. Execution Times of CPU and GPU (${t_{\rm{CPU}}},{t_{\rm{GPU}}}$) and Relative Speed-Up for the Configurations of RoI Width ${w_P}$ and Pixel Region Width $d$, Along with the Number of Extracted Centroids $w_L^2$.

Fig. 5. Execution times of the centroid extraction for a $100 \times 100$ RoI of the spot image.

Fig. 6. Execution times of the centroid extraction for a $200 \times 200$ RoI of the spot image.

Fig. 7. Execution times of the centroid extraction for a $500 \times 500$ RoI of the spot image.

Fig. 8. Execution times of the centroid extraction for a $700 \times 700$ RoI of the spot image.

Fig. 9. Execution times of the centroid extraction for a $1000 \times 1000$ RoI of the spot image.

Different WFS optical aperture diameters $\rho$, lenslet sizes ${d_L}$, and pixel sizes ${d_P}$ are captured by the pixel region width $d$ and the pixel RoI width ${w_P}$ parameters (as shown in Fig. 2). The RoI ranges from $100 \times 100$ to $1000 \times 1000\; {\rm pixels}$, while the pixel region resolution ranges from $3 \times 3$ to $28 \times 28$ pixels. The result of each test run is presented in Table 1, where the CPU and GPU execution times ${t_{\rm{CPU}}},{t_{\rm{GPU}}}$ for a given combination of $d$ and ${w_P}$ are compared to calculate the speed-up:

$$\text{Speed-up} = \frac{{t_{\rm{CPU}}}}{{t_{\rm{GPU}}}}.$$
For every configuration except the $100 \times 100$ RoI size, the GPU implementation achieves a speed-up over the CPU implementation ranging from 2 to 13. Figures 5–9 show the execution time for all parameter combinations. The algorithm executes in less than 1 ms for pixel region widths $d$ larger than five pixels. Small pixel regions mean that more lenslets fit in the same RoI, and this leads to a larger number of centroids, increasing the execution time. In the case of the smallest RoI size (Fig. 5), the overhead latency of issuing a CUDA kernel launch (experimentally measured to be 50 µs on average with an empty kernel) completely dominates the GPU execution time. The relatively high execution time variance of the CPU implementation is due to the underlying operating system task scheduler behavior.

6. CONCLUSION

The NVIDIA Jetson platform is a CPU/GPU hybrid platform that, while compact and power efficient, is powerful enough to justify its use in edge computing and HPC. Thanks to its unified memory architecture, the latency introduced by copying data is avoided; hence, images are processed as soon as they are transferred from the camera. The Jetson platform is positioned as an alternative to FPGA-based smart cameras, with the advantage of being easier to program and more flexible.

In small-scale AO systems, the SH-WFS is usually implemented on CPU or FPGA architectures. However, despite being low-latency, CPU solutions are not portable, and FPGA designs lead to long development times. The experimental results obtained on the Jetson CPU/GPU platform show that the time required for the centroid extraction is less than 1 ms for pixel region widths larger than five pixels, and hence compatible with the AO closed-loop latency constraint. Our approach suggests that an embedded GPU architecture is a valid alternative to FPGA-based SH-WFS solutions. The parallel capabilities of the device can be leveraged to develop advanced wavefront reconstruction schemes, e.g., extended-source and high-dynamics sensing.

As future work, it is worth highlighting that the full AO control loop can be implemented on the Jetson, leveraging its GPU computational power to produce the commands for the DM of small- to medium-sized AO systems. Since DMs in such systems have tens to a few hundred actuators, the resolution of the SH grid needed to image the DM into the measured wavefront ranges from hundreds to thousands of lenslets. Depending on the AO setup requirements, the proposed centroid extraction algorithm can be adapted to extract the same number of centroids from a small image with a small lenslet pitch (e.g., in low-light conditions) or from a larger image with a larger pitch (i.e., to improve detection accuracy). Taking into account the transfer delay of a USB3 camera and the actuator interface, the wavefront-to-command latency of a standard proportional–integral array (e.g., one PI regulator per wavefront mode) can be estimated to be smaller than 2 ms. More effective control techniques such as modal control and predictive/optimal control can take advantage of the GPU parallelism to become computationally feasible. Furthermore, the spatiotemporal dynamics of atmospheric aberrations can be learned and predicted by on-board machine learning algorithms, routinely updating the controller parameters.

Disclosures

The authors declare no conflicts of interest.

REFERENCES

1. R. Tyson, Adaptive Optics Engineering Handbook (CRC Press, 1999), Vol. 10.

2. M. Quintavalla, J. Mocci, R. Muradore, A. Corso, and S. Bonora, “Adaptive optics on small astronomical telescope with multi-actuator adaptive lens,” Proc. SPIE 10524, 1052414 (2018). [CrossRef]  

3. P. Zhang, J. Mocci, D. J. Wahl, R. K. Meleppat, S. K. Manna, M. Quintavalla, R. Muradore, M. V. Sarunic, S. Bonora, E. N. Pugh Jr., and R. J. Zawadzki, “Effect of a contact lens on mouse retinal in vivo imaging: effective focal length changes and monochromatic aberrations,” Exp. Eye Res. 172, 86–93 (2018). [CrossRef]  

4. M. Quintavalla, P. Pozzi, M. Verhaegen, H. Bijlsma, H. Verstraete, and S. Bonora, “Adaptive optics plug-and-play setup for high-resolution microscopes with multi-actuator adaptive lens,” Proc. SPIE 10498, 104981X (2018). [CrossRef]  

5. S. A. Moosavi, M. Quintavalla, J. Mocci, R. Muradore, H. Saghafifar, and S. Bonora, “Improvement of coupling efficiency in free-space optical communication with a multi-actuator adaptive lens,” Opt. Lett. 44, 606–609 (2019). [CrossRef]  

6. M. Negro, M. Quintavalla, J. Mocci, A. G. Ciriolo, M. Devetta, R. Muradore, S. Stagira, C. Vozzi, and S. Bonora, “Fast stabilization of a high-energy ultrafast OPA with adaptive lenses,” Sci. Rep. 8, 14317 (2018). [CrossRef]  

7. S. Thomas, T. Fusco, A. Tokovinin, M. Nicolle, V. Michau, and G. Rousset, “Comparison of centroid computation algorithms in a Shack-Hartmann sensor,” Mon. Not. R. Astron. Soc. 371, 323–336 (2006). [CrossRef]  

8. F. Kong, M. C. Polo, and A. Lambert, “Centroid estimation for a Shack-Hartmann wavefront sensor based on stream processing,” Appl. Opt. 56, 6466–6475 (2017). [CrossRef]  

9. A. Vyas, M. B. Roopashree, and B. R. Prasad, “Optimization of existing centroiding algorithms for Shack-Hartmann sensor,” arXiv preprint arXiv:0908.4328 (2009).

10. J. Mompeán, J. L. Aragón, P. M. Prieto, and P. Artal, “GPU-based processing of Hartmann–Shack images for accurate and high-speed ocular wavefront sensing,” Future Gener. Comput. Syst. 91, 177–190 (2019). [CrossRef]  

11. L. Hu, S. Hu, W. Gong, and K. Si, “Learning-based Shack-Hartmann wavefront sensor for high-order aberration detection,” Opt. Express 27, 33504–33517 (2019). [CrossRef]  

12. D. Perret, M. Lainé, J. Bernard, D. Gratadour, and A. Sevin, “Bridging FPGA and GPU technologies for AO real-time control,” Proc. SPIE 9909, 99094M (2016). [CrossRef]  

13. M. Lainée, A. Sevin, D. Gratadour, J. Bernard, and D. Perret, “A GPU based RTC for E-ELT adaptive optics: real time controller prototype,” in Adaptive Optics for Extremely Large Telescopes (2018), p. AO4ELT5.

14. R. Á. F. Díaz, J. L. C. Rolle, N. R. Gutiérrez, and F. J. de Cos Juez, “Using GPUs to speed up a tomographic reconstructor based on machine learning,” in International Joint Conference SOCO’16-CISIS’16-ICEUTE’16: San Sebastián, Spain, October 19th-21st, 2016 Proceedings (Springer, 2016), Vol. 527, p. 279.

15. F. Ferreira, D. Gratadour, A. Sevin, and N. Doucet, “COMPASS: an efficient GPU-based simulation software for adaptive optics systems,” in International Conference on High Performance Computing and Simulation (HPCS) (2018), pp. 180–187.

16. J. Beck and J. P. Bos, “Open source acceleration of wave optics simulations on energy efficient high-performance computing platforms,” Proc. SPIE 10204, 102040F (2017). [CrossRef]  

17. F. L. Rosa, J. G. Marichal-Hernandez, and J. M. Rodriguez-Ramos, “Wavefront phase recovery using graphic processing units (GPUs),” Proc. SPIE 5572, 262–272 (2004). [CrossRef]  

18. V. Venugopalan, “Evaluating latency and throughput bound acceleration of FPGAs and GPUs for adaptive optics algorithms,” in IEEE High Performance Extreme Computing Conference (HPEC) (2014), pp. 1–6.

19. W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, “Edge computing: vision and challenges,” IEEE Internet Things J. 3, 637–646 (2016). [CrossRef]  

20. F. Dias, F. Berry, J. Sérot, and F. Marmoiton, “Hardware, design and implementation issues on a FPGA-based smart camera,” in First ACM/IEEE International Conference on Distributed Smart Cameras (IEEE, 2007), pp. 20–26.

21. R. Ragazzoni, C. Arcidiacono, E. Diolaiti, J. Farinato, A. M. Moore, and R. Soci, “A smart fast camera,” Proc. SPIE 5492, 121–127 (2004). [CrossRef]  

22. M. Thier, R. Paris, T. Thurner, and G. Schitter, “Low-latency Shack-Hartmann wavefront sensor based on an industrial smart camera,” IEEE Trans. Instrum. Meas. 62, 1241–1249 (2012). [CrossRef]  

23. S.-H. Lee and C.-S. Yang, “A real time object recognition and counting system for smart industrial camera sensor,” IEEE Sens. J. 17, 2516–2523 (2017). [CrossRef]  

24. M. Carraro, M. Munaro, and E. Menegatti, “Cost-efficient RGB-D smart camera for people detection and tracking,” J. Electron. Imaging 25, 041007 (2016). [CrossRef]  

25. Z. Zhao, Z. Jiang, N. Ling, X. Shuai, and G. Xing, “ECRT: an edge computing system for real-time image-based object tracking,” in 16th ACM Conference on Embedded Networked Sensor Systems (ACM, 2018), pp. 394–395.

26. Y. Ukidave, D. Kaeli, U. Gupta, and K. Keville, “Performance of the NVIDIA Jetson TK1 in HPC,” in IEEE International Conference on Cluster Computing (IEEE, 2015), pp. 533–534.

27. J. Mocci, M. Quintavalla, C. Trestino, S. Bonora, and R. Muradore, “A multiplatform CPU-based architecture for cost-effective adaptive optics systems,” IEEE Trans. Ind. Inf. 14, 4431–4439 (2018). [CrossRef]  

28. A. M. Nightingale and S. V. Gordeyev, “Shack-Hartmann wavefront sensor image analysis: a comparison of centroiding methods and image-processing techniques,” Opt. Eng. 52, 071413 (2013). [CrossRef]  

29. J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel programming with CUDA,” Queue 6, 40–53 (2008). [CrossRef]  

30. NVIDIA Corporation, NVIDIA CUDA C Programming Guide, Version 11.0 (2020).
