FASFLNet: feature adaptive selection and fusion lightweight network for RGB-D indoor scene parsing

Xiaohong Qian; Xingyang Lin; Lu Yu; Wujie Zhou; Wujie Zhou

doi:10.1364/OE.480252

1. Introduction

Scene parsing aims to identify different objects in an image at the pixel level using annotations. It has been widely used in many applications, such as virtual reality, medical outcome analysis, and road scene analysis [1–3]. Service robots primarily work indoors; consequently, they must be able to understand indoor scenes efficiently and accurately. This can be achieved through scene parsing, which is a basic computer vision task for pixel-level classification of input images. However, indoor scenes differ from outdoor scenes because they are unordered and complex. Hence, conventional scene-parsing methods based on manual feature extraction cannot achieve satisfactory results for indoor scene parsing. Deep learning has resulted in significant breakthroughs in large-scale image classification, and further developments in scene parsing are underway. Researchers have improved the convolutional neural network (CNN) structure of the image recognition task to increase its suitability for segmentation tasks. In particular, the emergence of fully convolutional neural networks (FCNs) has substantially improved the accuracy of scene parsing task results [4].

Several excellent CNN architectures directly use RGB images as input and have achieved impressive results in open competitions [4,5]. Currently, RGB-D images can be easily obtained by sensors such as RealSense, Kinect, and Xition. Depth information can be essential for describing the layout of 3-D scenes, and properly captured depth feature maps can show the geometric structure of images [6–9]. Studies show that significant improvement in scene parsing can be obtained by fusing depth information with RGB images.

Early fusion models simply fed the concatenated RGB and depth channels into a CNN [10]. Such models may not fully explore the geometric structure information provided by the depth data. Subsequently, several proposed methods have employed a two-channel fusion network that processes the modalities using separate but identical encoders and subsequently fuses multimodal features in a single decoder [11–15]. Late fusion models [11], [12] combine the multimodal features at the end of the encoders using a combination operation such as element-wise concatenation or summation. Although late-fusion-based models tend to provide relatively good results, their overall efficiency remains low. The direct summation or concatenation leads to CNN features confusion, whereas convolution leads to inadequate fusion due to inadequate consideration of the macro-level or micro-level features of objects.

Instead of fusing in the early or later stage, hierarchical fusion models fuse the CNN features at multiple levels. These models typically fuse multi-level CNN features from one modality to another modality in a bottom-up manner [13] or fuse multi-level features in a top-down manner [14,15]. Although these models have achieved promising results, they do not fully exploit the interdependencies of the multimodal features. It is essential for the models to interact with and inform each other to reduce ambiguity in scene parsing. The construction of an effective selection and fusion mechanism for scene parsing remains an open problem.

In this study, we propose a feature adaptive selection and fusion network (FASFNet) for RGB-D indoor scene parsing. In our proposed FASFNet, MobileNetV2 [16] constitutes the backbone to achieve rapid inference speed and high feature extraction capability. To prevent severe image spatial resolution reduction from excessive down sampling, we referenced the operations in DeepLabV3 [17] and employed output_stride = 16. We only normalized the input RGB and depth images and performed no additional post-processing. A depth map was used to supplement the mainline through an adaptive special feature selection module (FSM), rather than using simple element-wise addition. Finally, we fused the extracted features from top-bottom layers through an adaptive feature fusion module (FFM) and subsequently integrated the features at different layers for pixel-level classification. The following are the main contributions of this study:

1) We used a lightweight backbone (MobileNetV2) as a feature extractor, which makes the proposed FASFLNet far more efficient than other methods. We also used an adaptive FSM for feature aggregation.
2) We fused extracted features from top-bottom layers through an adaptive fusion strategy (FFM) and subsequently integrated the features at different layers for final pixel-level classification, which had an effect similar to that of pyramid supervision.
3) The proposed FASFLNet for RGB-D indoor scene parsing outperforms existing state-of-the-art (SOTA) models and obtains efficient and accurate results on the NYU V2 and SUN datasets.

2. Related work

With the emergence of multi-modal data, several methods have adopted different modalities to obtain multiple cues for improving the results of scene parsing. For example, RGB-D and RGB-T scene parsing methods have been proposed.

3.1 RGB-D scene parsing

Indoor scene parsing has been studied for several years, and numerous models have been introduced. In the early years, RGB-D indoor scene-parsing models [10–12] based on early fusion and late fusion techniques made some progress. However, hierarchical fusion models have recently become the most popular approaches for RGB-D indoor scene parsing [13–43]. Hazirbas et al. [13] proposed an encoder-decoder structure named FuseNet, which is similar to SegNet [18], with the exception that its encoder adds depth information to the mainline via an element-wise summation strategy (This strategy is also used in Ref. [19]). Jiang et al. [20] proposed RedNet, which uses ResNet [21] as its backbone. It has extra skip connections and multi-scale outputs for pyramid supervision. Dai et al. [22] also employed ResNet for feature extraction in their ResFusion method. ResFusion uses a spatial pyramid pooling approach to fully exploit different sub-region features. It also adopts multiple features fusion models and various auxiliary loss streams. Wang et al. [23] employed the encoder–decoder structure for both the RGB and depth streams in their proposed method, with transformation applied at the end of the encoding process. Li et al. [24] proposed Long Short-Term Memorized Context Fusion, a novel method that extracts and fuses contextual features from two streams. Ma et al. [25] trained a deep network to predict semantic cues in a self-supervised manner. Li et al. [26] proposed a two-stream network that first learns deep features at various levels, and then learns to integrate features from low-level to high-level under the guidance of semantics. Liu et al. [27] adopted conditional random fields (CRF) as a post-processing strategy and proposed combining a pairwise potential approach with a normal kernel to excavate the geometric spatial structure. Hu et al. [34] presented an attention complementary module for RGB-D indoor scene parsing that selectively gathers feature maps from RGB and depth branches. Deng et al. [35] introduced a residual fusion module for RGB-D indoor scene parsing. Yuan et al. [36] proposed an indoor scene-parsing network that incorporates RGB and depth feature maps to build scene parsing. Xiong et al. [37] introduced a variational context-deformable block to learn receptive fields in a structured fashion. Chen et al. [38] introduced an efficient and unified cross-modality guided encoder for RGB-D indoor scene parsing. Lin et al. [39] introduced a switchable context block to facilitate RGB-D indoor scene parsing. Zhou et al. [40] presented a three-branch self-attention method for RGB-D indoor scene parsing including two asymmetric encoders and a cross-modal distillation stream. Wang et al. [41] proposed a component aware feature fusion module for RGB-D indoor scene parsing. Song et al. [42] proposed an alternative strategy for indoor scene parsing. Du et al. [43] proposed a translate-to-recognize model for indoor scene parsing. Zhou et al. [44] proposed a progressive guided fusion and depth enhancement network for RGB-D indoor scene parsing. Fang et al. [45] proposed depth removal distillation for RGB-D scene parsing. Zhou et al. [46] proposed co-attention network for RGB-D scene parsing.

3.2 RGB-T scene parsing

In recent years, with the popularity of thermal infrared sensors, the strategy of integrating RGB and thermal infrared data has been used for numerous computer vision tasks [47–54]. Unlike depth data, thermal infrared data is insensitive to illumination conditions. Thermal infrared data can also be used to complement RGB for scene parsing. Ha et al. [47] introduced MFNet for urban scene parsing using both RGB imaging and thermal infrared cameras. Sun et al. [48] proposed a scene parsing model that adopts two instances of ResNet [21] for feature extraction and developed a novel decoder network to restore the features resolution. Shivakumar et al. [49] introduced a dual-stream network that performed fast scene parsing by combining RGB information and thermal infrared data. Dutta et al. [50] proposed an efficient network that uses multispectral information for scene parsing. Lyu et al. [51] introduced a multi-modal fusion model for scene parsing of RGB-T images. Sun et al. [52] proposed the FuseSeg model, which employs DenseNet [55] as an encoding backbone for scene parsing.

The RGB-D and RGB-T scene parsing methods presented above use CNNs with a large number of parameters and long inference times as feature extractors; examples include VGGNet [56] and ResNet [21]. Therefore, they are not suitable for several battery powered applications that must process images at relatively high speeds [57–64].

3. Proposed FASFNet

The block diagram of our FASFLNet is shown in Fig. 1. MobileNetV2 constitutes the backbone for feature extraction. The features that originate from the depth image are considered as replenishment; a specific FSM is used to incorporate these features into the mainline. Features extracted from the mainline are refined and fused from top-bottom through an adaptive fusion strategy (FFM). Finally, pixel-level classification is performed by integrating the multi-level features.

Fig. 1. Proposed feature adaptive selection and fusion network (FASFNet).

Download Full Size | PDF

3.1 Backbone

MobileNet [65] uses depth-wise separable convolution as the basic block, which is a depth-wise convolution method followed by a point-wise convolution. Subsequently, with the introduction of assistance in depth-wise operation, the numbers of multi-adds and parameters are reduced significantly with only a minor reduction in accuracy. A second version of MobileNet uses a layer module referred to as the inverted residual with linear bottleneck and obtains improved results. It appears to be a form of decompression–filter–compression for information. Similar to DeepLabv3, we adopted atrous convolution, which is a powerful method to control the resolution of feature maps and employed output_stride = 16 for denser feature maps. The configuration of the backbone is summarized in Table 1.

Table 1. MobileNetV2-backbone: Each line describes a sequence of one model, repeated n times. c denotes the output channels of the layer. All spatial convolution operators use 3 × 3 kernels. The first layer has a stride s and all others utilize stride 1. The expansion factor t is used to indicate the input size

View Table | View all tables in this article

3.2 Feature selection module (FSM)

Depth images containing spatial information such as the shape and scale of objects can improve the efficiency of CNNs. Inspired by the structure of selective kernel (SK) convolution [66], we performed an automatic selection operation using the aforementioned FSM, which adaptively selects different spatial scales of depth information. Thus, we use a gate to control the outflow of depth cues to the next layer. Further, we utilized only two operations: Fuse and Select. As depicted in Fig. 2, in Fuse, we first aggregated the information of the two streams using element-wise summation. Then, a simple global average pooling operator F_gp was used to generate channel-wise statistics s. To enable adaptive selection, a more compact feature z was obtained by a sequence of fully connected layers, batch normalization, and a ReLU6 activation function, which is employed in MobileNetV2. We used a reduction ratio of r = 2 to reduce the dimension of z. Unlike SK convolution, we only used the gate to control the depth information in Select, because we utilize depth information as the replenishment. A sigmoid operator was used on the channel-wise digits d, which were generated by a soft attention vector. Subsequently, we merged the features from the depth branch and RGB branch through element-wise summation. The feature map fused by the FSM was used as the input for the next CNN layer and the FFM. Considering that MobileNetV2 uses ReLU6 as the activation function, we also used ReLU6 in other parts of the proposed FASFNet.

Fig. 2. FSM of FASFNet.

Download Full Size | PDF

3.3 Feature fusion module (FFM)

Because of the structure of MobileNetV2, features extracted from the mainline are compressed. To enhance the expression of low-level features, we fused features from top-bottom, which can be considered as refinement [16]. We integrated the features from two adjacent layers using the aforementioned adaptive fusion strategy (FFM). As shown in Fig. 3, the adaptive FFM takes the high-level and low-level features generated by the higher-level FFM as inputs. For improved fusion of different feature maps, we used atrous spatial pyramid pooling (ASPP), which can capture multi-scale information effectively, to further enhance the expression of low-level CNN features. We reduced the number of weights and the computational requirements of the ASPP by using one shortcut and three 3 × 3 depth-wise convolutions with rates = (1, 2, 4). We then concatenated these features followed by a point-wise 1 × 1 convolution and a dropout operation. Owing to the down sampling in the backbone, which reduces the sizes of feature maps, the high-level feature maps should be bilinearly up sampled to the size of low-level CNN features if necessary. Subsequently, the enhanced high-level and low-level CNN features are aggregated by concatenation and one 1 × 1 convolution, which is used for compression. Finally, we enhanced the expression of low-level feature maps by element-wise summation of the fused features.

Fig. 3. FFM of FASFNet.

Download Full Size | PDF

3.4 Concatenation and classifier

Through the visualization of the features in Fig. 1, it is observed that high-level CNN features contain textual semantic cues, which is important for the accurate classification of objects. On the other hand, the low-level CNN features contain spatial and edge cues, which is helpful for accurate segmentation of the boundaries between different objects. Because low-level and high-level CNN features are complementary for effective segmentation, we concatenate all of these multi-level CNN features for final pixel-level classification. The final classifier is composed of two convolutions. We used a point-wise convolution followed by batch normalization and a ReLU to perform a channel-wise shuffle. We then bilinearly up sampled the features by a factor of two. Subsequently, we performed pixel-level classification using a convolution operator with a kernel size of 3 × 3, and restored the output to the size of the raw image using bilinear up sampling.

4. Experimental results and analyses

4.1 Experimental setup

Datasets: To verify the efficiency of our FASFNet, we performed experiments using the NYU V2 dataset [67] and SUN RGB-D dataset [68]. The NYU comprises 1,449 pairs of RGB-D images, and all pixels are labeled according to 40 categories. We selected 795 images as the training set, and the remaining 654 images for evaluation. The SUN is a large-scale dataset with 10,335 RGB-D images. All of the RGB-D images are annotated into 37 categories. In line with previous research [30–34], we utilized 5,285 images for training, and used the remaining for evaluation.

Evaluation metrics: Following Refs. [69–73], we employed three widely used metrics, namely global accuracy (pixel Acc.), mean IoU (mIoU), and mean accuracy (mAcc), to evaluate the accuracy of the architecture. mAcc is the mean of the pixel accuracy values across all classes. Global accuracy denotes the ratio of the number of correctly classified pixels over the total number of pixels. IoU denotes the intersection of the prediction and annotation regions over their union, whereas the mIoU is the average IoU value among all classes.

Training and inference: During training, we applied synchronous transforms to the inputs, including the RGB image, depth image, and the corresponding annotations for data augmentation. The operations included random cropping, random scaling, flipping, random brightness adjustment, and contrast and saturation adjustments. Subsequently, inputs were decoded into 32-bit floating point raw pixel values in the range [0, 1]; we then normalized the inputs by subtracting the mean and dividing by the variance.

For all experiments, we used MobileNet [65] pretrained on the ImageNet dataset as the backbone network. As the depth map contains one channel, we averaged the three channels in the first layer of MobileNet for the one-channel depth branch. For an image X, let the corresponding ground truth be Y and the predicted map be Ŷ. For a pixel px, let us assume that the network predicts ŷ_px, and let the ground truth value be y_px. Therefore, the cross-entropy loss function for scene parsing can be applied as:

(1)$$CE({X,Y,\hat{Y}} )={-} \sum\limits_{px \in X} {{y_{px}}\log ({{{\hat{y}}_{px}}} )} . $$

Implementation details: We applied the PyTorch deep learning technology to train and test the FASFNet. For all the experiments, training considered a batch size of 2 and was performed on a single NVIDIA TITAN V graphics processor and 12 GB memory. The backbone was pretrained on the ImageNet database, and the other parameters were initialized as in [74]. We used the class weighting strategy described in [75] and the Adam optimization algorithm with (β₁, β₂) = (0.9, 0.999) to train the proposed FASFNet. The initial learning rate was set to 5e^-4 and decayed according to a “poly” learning rate policy, under a polynomial learning rate policy with the initial learning rate multiplied by ${\left( {1 - \frac{{\textrm{ep}}}{{\max \_ep}}} \right)^{0.9}}$ over 300 epochs for the NYUDv2 and 200 epochs for the SUN datasets. In the evaluation phase, we only resized inputs in fixed sizes of 480 × 640 and applied normalization on the inputs without extra data augmentation.

4.2 Results on NYU dataset

We compared the results of different SOTA approaches; specifically, those proposed by: Long et al. [4], Ma et al. [25], Yuan et al. [19], Liu et al. [27], Fayyaz et al. [30], Liu et al. [31], He et al. [29], Lin et al. [32], Hu et al. [34], Wang et al. [33], Yuan et al. [36], Chen et al. [38], Xiong et al. [37], Lin et al. [39], Zhou et al. [40], Fang et al. [45] and Zhou et al. [46]. The experimental results listed in Table 2 indicate that our FASFLNet achieved the best results on the NYU V2 dataset. The results verify the validity and superiority of the FASFNet.

Table 2. Comparison of the NYU V2 testing results

View Table | View all tables in this article

We used the strategies described above to train the proposed FASFLNet in Fig. 1. Then, we further verified the effectiveness of the important modules through ablation experiments.

4.3 Results on SUN dataset

We performed further experiments on the SUN dataset to verify the effectiveness of our FASFNet. The SUN dataset contains RGB-D images from several different datasets. Compared to NYU dataset, the SUN dataset has more complex scenes and conditions, which we deemed more suitable to evaluate the generality of the FASFNet. We kept all hyper-weights the same as those for the NYU dataset, except for the number of epochs. The experimental results are listed in Table 3. Our proposed FASFLNet outperformed most of the SOTA models.

Table 3. Comparison of the SUN RGB-D testing results

View Table | View all tables in this article

4.4 Ablation study

To evaluate the effectiveness of the important modules in FASFNet, we performed ablation experiments on the NYU dataset. For this analysis, three variations were devised. We designed three performance comparison approaches, namely, A, B, and C.

As shown in Fig. 4 (a), we built Model A, which replaces the FSM with simple element-wise summation and directly concatenates the features of different layers without feature fusion (without FFM).

Fig. 4. Ablation study results.

Download Full Size | PDF

To verify the effectiveness of the FSM in FASFNet, we replaced five simple element-wise summation operations with five FSMs in Model B (as shown in Fig. 4 (b)), based on Model A.

To demonstrate the effectiveness of top-down fusion using the FFM, as shown in Fig. 4 (c), we designed Model C, which concatenates the features of different layers with FFM.

Table 4 lists the performance comparison of models A, B, C, and FASFNet. The results of Model A and Model B demonstrate that selecting depth information with FSM can improve accuracy. The results also show that using depth information as a supplement and extracting features from mixed information can boost performance. From the comparative results of Model A and Model C, this comparison well demonstrates the successful application of FFM. The experimental results show that the enhanced low-level features from top-bottom can improve the accuracy of the model. A comparison of results of model A and FASFLNet shows that both the FSM and FFM effectively improves the FASFLNet performance. It is also shown that the decision whether to use or not use pretrained backbones or not, significantly influences the accuracy of the model. From another perspective, it is clear that the feature extraction capability of backbones significantly influences the results of the method. The predictions for various test images are shown in Fig. 5.

Fig. 5. Experimental results of different models for scene segmentation.

Download Full Size | PDF

Table 4. Effect of important modules in the proposed FASFNet

View Table | View all tables in this article

To verify the influence of the backbone on FASFLNet, we compared the prediction performances of other popular backbones (e.g., VGGNet, and ResNet). Table 5 shows the experimental results. To achieve a trade-off between parameter and accuracy, the proposed model seems to be a good choice.

Table 5. Experimental results of FASFLNet based on different backbones

View Table | View all tables in this article

4.5 Time-complexity analysis

We also analyzed the time complexity stemming from real-time requirements in actual situations on the NYU V2 dataset. We concluded that less inference time results in better performance. We calculated computational complexity (i.e., the average runtime for a sample) in the testing stage. The test platform was equipped with a Hexa-core Intel Core i5-8500 CPU @ 3.00 GHz and an NVIDIA Titan V GPU. The inference costs were 0.021 s per image, demonstrating the time-saving capability of the proposed FASFLNet. Additionally, the size of the parameters for the entire network was relatively small (only 12.03 MB in fp32), which provides greater efficiency than other methods using backbones with large numbers of parameters (e.g., VGG and ResNet) (shown in Table 6). Overall, the proposed FASFLNet provides a low-complexity solution for high-performance indoor scene parsing.

Table 6. Results from time complexity analysis

View Table | View all tables in this article

5. Conclusion

We proposed an architecture using lightweight backbones for RGB-D indoor scene parsing. Because depth images provide spatial information such as the shape and scale of objects, we used depth information as supplemental information and performed feature-level adaptive fusion between depth and RGB branches using an FSM. Further, we fused the feature maps of different layers from top to bottom using an FFM and concatenated multi-level features for pixel-level classification.

In future, we intend to use more efficient modules for feature fusion and special backbones for RGB and depth images.

Funding

National Natural Science Foundation of China (61502429).

Disclosures

The authors declared that they have no conflicts of interest to this work.

Data Availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. K. Xiang, K. Yang, and K. Wang, “Polarization-driven semantic segmentation via efficient attention-bridged fusion,” Opt. Express 29(4), 4802–4820 (2021). [CrossRef]

2. W. Zhou, Y. Zhu, J. Lei, J. Wan, and L. Yu, “CCAFNet: Crossflow and cross-scale adaptive fusion network for detecting salient objects in RGB-D images,” IEEE Trans. Multimedia 24(20), 2192–2204 (2022). [CrossRef]

3. S. Minaee, X. Liang, and S. Yan, “Modern Augmented Reality: Applications, Trends, and Future Directions,” arXiv, arXiv preprint arXiv:2202.09450, (2022).

4. J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431–3440.

5. W. Zhou, X. Fan, L. Yu, and J. Lei, “MISNet: Multiscale cross-layer interactive and similarity refinement network for scene parsing of aerial images,” IEEE J. Sel. Top. Appl. Earth Observations Remote Sensing (2023).

6. J. Wu, W. Zhou, X. Qian, J. Lei, L. Yu, and T. Luo, “MENet: Lightweight multimodality enhancement network for detecting salient objects in RGB-Thermal images,” Neurocomputing 527, 119–129 (2023). [CrossRef]

7. W. Zhou, Y. Lv, J. Lei, and L. Yu, “Global and local-contrast guides content-aware fusion for RGB-D saliency prediction,” IEEE Trans. Syst. Man Cybern, Syst. 51(6), 3641–3649 (2021). [CrossRef]

8. Y. Cai, W. Zhou, L. Zhang, L. Yu, and T. Luo, “DHFNet: dual-decoding hierarchical fusion network for RGB-thermal semantic segmentation,” Vis. Comput.

9. W. Zhou, Q. Guo, J. Lei, L. Yu, and J.-N. Hwang, “IRFR-Net: Interactive recursive feature-reshaping network for detecting salient objects in RGB-D images,” IEEE Transactions on Neural Networks and Learning Systems, early access, August 20 (2021).

10. C. Couprie, C. Farabet, L. Najman, and Y. Lecun, “Indoor semantic segmentation using depth information,” in Int. Conf. Learn. Represent., (2013).

11. W. Zhou, Y. Yue, M. Fang, X. Qian, R. Yang, and L. Yu, “BCINet: Bilateral cross-modal interaction network for indoor scene understanding in RGB-D images,” Inf. Fusion 94, 32–42 (2023). [CrossRef]

12. Y. Cheng, R. Cai, Z. Li, X. Zhao, and K. Huang, “Locality-sensitive deconvolution networks with gated fusion for rgb-d indoor semantic segmentation,” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3029–3037.

13. C. Hazirbas, L. Ma, C. Domokos, and D. Cremers, “Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture,” in Proc. ACCV 213–228 (2016).

14. W. Zhou, E. Yang, J. Lei, and L. Yu, “FRNet: Feature Reconstruction Network for RGB-D Indoor Scene Parsing,” IEEE J. Sel. Top. Signal Process. 16(4), 677–687 (2022). [CrossRef]

15. S.-J. Park, K.-S. Hong, and S. Lee, “RDFNet: RRG-D multi-level residual feature fusion for indoor semantic segmentation,” in IEEE Int. Conf. Comput. VisC, 2017, pp. 4980–4989.

16. M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4510–4520.

17. L. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv, arXiv:1706.05587, (2017).

18. V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell. 39(12), 2481–2495 (2017). [CrossRef]

19. J. Yuan, K. Zhang, Y. Xia, et al., “A fusion network for semantic segmentation using RGB-D data,” in Proc. ICGIP10615–1061523 (2018).

20. J. Jiang, L. Zheng, F. Luo, and Z. Zhang, “RedNet: Residual encoder-decoder network for indoor rgb-d semantic segmentation,” arXiv, arXiv:1806.01054, (2018).

21. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp.770–778.

22. J. Dai and X. Tang, “ResFusion:deeply fused scene parsing network for RGB-D images,” IET Computer Vision. 12(8), 1171–1178 (2018). [CrossRef]

23. J. Wang, Z. Wang, D. Tao, S. See, and G. Wang, “Learning common and specifific features for rgb-d semantic segmentation with deconvolutional networks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2016, pp. 664–679.

24. Z. Li, Y. Gan, X. Liang, Y. Yu, H. Cheng, and L. Lin, “LSTM-CF: Unifying context modeling and fusion with lstms for RGB-D scene labeling,” in Proceedings of the European Conference on Computer Vision (ECCV), 2016, pp. 541–557.

25. L. Ma, J. Stückler, C. Kerl, and D. Cremers, “Multi-view deep learning for consistent semantic mapping with rgb-d cameras,” in Proc. IROS598–605 (2017).

26. Y. Li, J. Zhang, Y. Cheng, K. Huang, and T. Tan, “Semantics-guided multi-level RGB-D feature fusion for indoor semantic segmentation,” in Proc. ICIP1262–1266 (2017).

27. H. Liu, W. Wu, X. Wang, and Y. Qian, “RGB-D joint modelling with scene geometric information for indoor semantic segmentation,” Multimedia Tools Appl. 77(17), 22475–22488 (2018). [CrossRef]

28. X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun, “3D graph neural networks for rgbd semantic segmentation,” In Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5199–5208.

29. Y. He, W. C. Chiu, M. Keuper, and M. Fritz, “STD2P: RGBD semantic segmentation using spatio-temporal data-driven pooling,” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4837–4846.

30. M. Fayyaz, M. H. Saffar, M. Sabokrou, M. Fathy, F. Huang, and R. Klette, “STFCN:spatio-temporal fully convolutional neural network for semantic segmentation of street scenes,” in Proc. ACCV493–509 (2016).

31. F. Liu, G. Lin, and C. Shen, “Discriminative training of deep fully connected continuous CRFs with task-specific loss,” IEEE Trans. on Image Process. 26(5), 2127–2136 (2017). [CrossRef]

32. D. Lin, G. Chen, D. Cohenor, et al., “Cascaded feature network for semantic segmentation of RGB-D images,” in Proc. ICCV1320–1328 (2017).

33. W. Wang and U. Neumann, “Depth-aware CNN for rgb-d segmentation,” In Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 135–150.

34. X. Hu, K. Yang, L. Fei, and K. Wang, “ACNET: Attention Based Network to Exploit Complementary Features for RGBD Semantic Segmentation,” 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 2019, pp. 1440–1444.

35. L. Deng, M. Yang, T. Li, Y. He, and C. Wang, “RFBNet: deep multimodal networks with residual fusion blocks for RGB-D semantic segmentation,” arXiv, arXiv:1907.00135, (2019).

36. J. Yuan, W. Zhou, and T. Luo, “DMFNet: Deep Multi-Modal Fusion Network for RGB-D Indoor Scene Segmentation,” IEEE Access 7, 169350–169358 (2019). [CrossRef]

37. Z. Xiong, Y. Yuan, N. Guo, and Q. Wang, “Variational Context-Deformable ConvNets for Indoor Scene Parsing,” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020, pp. 3991–4001.

38. X. Chen, K.-Y. Lin, J. Wang, W. Wu, C. Qian, H. Li, and G. Zeng, “Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), (2020).

39. D. Lin, R. Zhang, Y. Ji, P. Li, and H. Huang, “SCN: Switchable Context Network for Semantic Segmentation of RGB-D Images,” IEEE Trans. Cybern. 50(3), 1120–1131 (2020). [CrossRef]

40. W. Zhou, J. Yuan, J. Lei, and T. Luo, “TSNet: Three-stream Self-attention Network for RGB-D Indoor Semantic Segmentation,” IEEE Intell. Syst. 36(4), 73–78 (2021). [CrossRef]

41. A. Wang, J. Cai, J. Lu, and T. J. Cham, “Modality and component aware feature fusion for rgb-d scene classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006, pp. 5995–6004.

42. X. Song, L. Herranz, and S. Jiang, “Depth CNNs for RGB-D scene recognition: Learning from scratch better than transferring from RGB-CNNs,” in Thirty-first AAAI Conference on Artificial Intelligence, 2017, pp. 4271–4277.

43. D. Du, L. Wang, H. Wang, K. Zhao, and G. Wu, “Translate-to-recognize networks for rgb-d scene recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11836–11845.

44. W. Zhou, E. Yang, J. Lei, J. Wan, and L. Yu, “PGDENet: Progressive Guided Fusion and Depth Enhancement Network for RGB-D Indoor Scene Parsing,” IEEE Transactions on Multimedia.

45. T. Fang, Z. Liang, X. Shao, Z. Dong, and J. Li, “Depth Removal Distillation for RGB-D Semantic Segmentation,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 2405–2409.

46. H. Zhou, L. Qi, H. Huang, X. Yang, Z. Wan, and X. Wen, “CANet: Co-attention network for RGB-D semantic segmentation,” Pattern Recognition 124, 108468 (2022). [CrossRef]

47. Q. Ha, K. Watanabe, T. Karasawa, Y. Ushiku, and T. Harada, “MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), pp. 5108–5115, Sep. 2017.

48. Y. Sun, W. Zuo, and M. Liu, “RTFNet: RGB-thermal fusion network for semantic segmentation of urban scenes,” IEEE Robot. Autom. Lett. 4(3), 2576–2583 (2019). [CrossRef]

49. S. S. Shivakumar, N. Rodrigues, A. Zhou, I. D. Miller, V. Kumar, and C. J. Taylor, “PST900: Rgb-thermal calibration, dataset and segmentation network,” arXiv, arXiv:1909.10980, (2019).

50. A. Dutta, B. Mandal, S. Ghosh, and N. Das, “Using thermal intensities to build conditional random fields for object segmentation at night,” In 2020 4th International Conference on Computational Intelligence and Networks (CINE), 2020, pp. 1–6.

51. Y. Lyu, I. Schiopu, and A. Munteanu, “Multi-modal neural networks with multi-scale RGB-T fusion for semantic segmentation,” Electron. Lett. 56(18), 920–923 (2020). [CrossRef]

52. Y. Sun, W. Zuo, P. Yun, H. Wang, and M. Liu, “FuseSeg: Semantic Segmentation of Urban Scenes Based on RGB and Thermal Data Fusion,” IEEE Transactions on Automation Science and Engineering, (2020).

53. W. Zhou, Y. Lv, J. Lei, and L. Yu, “Embedded Control Gate Fusion and Attention Residual Learning for RGB–Thermal Urban Scene Parsing,” IEEE Trans. Intell. Transport. Syst. (2023).

54. W. Zhou, J. Liu, J. Lei, J.-N. Hwang, and L. Yu, “GMNet: Graded-feature multilabel-Learning network for RGB-Thermal urban scene semantic segmentation,” IEEE Trans. on Image Process. 30, 7790–7802 (2021). [CrossRef]

55. G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 4700–4708, Jul. 2017.

56. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv, arXiv:1409.1556, (2014).

57. W. Zhou, S. Dong, J. Lei, and L. Yu, “MTANet: Multitask-aware network with hierarchical multimodal fusion for RGB-T urban scene understanding,” IEEE Trans. Intell. Veh., early access, April 5 (2022).

58. T. Gong, W. Zhou, X. Qian, J. Lei, and L. Yu, “Global contextually guided lightweight network for RGB-thermal urban scene understanding,” Eng. Appl. Artif. Intell. 117, 105510 (2023). [CrossRef]

59. W. Zhou, C. Liu, J. Lei, L. Yu, and T. Luo, “HFNet: Hierarchical feedback network with multilevel atrous spatial pyramid pooling for RGB-D saliency detection,” Neurocomputing 490, 347–357 (2022). [CrossRef]

60. G. Xu, W. Zhou, X. Qian, L. Ye, J. Lei, and L. Yu, “CCFNet: Cross-Complementary Fusion Network for RGB-D Scene Parsing of Clothing Images,” J. Vis. Commun. Image Represent. 90, 103727 (2023). [CrossRef]

61. W. Zhou, C. Liu, J. Lei, and L. Yu, “RLLNet: a lightweight remaking learning network for saliency redetection on RGB-D images,” Sci. China Inf. Sci. 65(6), 160107 (2022). [CrossRef]

62. J. Jin, W. Zhou, R. Yang, L. Ye, and L. Yu, “Edge Detection Guide Network for Semantic Segmentation of Remote-sensing Images,” IEEE Geosci. Remote Sens. Lett. 205000505 (2023). [CrossRef]

63. J. Wu, W. Zhou, X. Qian, J. Lei, L. Yu, and T. Luo, “MFENet: Multitype fusion and enhancement network for detecting salient objects in RGB-T images,” Digital Signal Process. 133, 103827 (2023). [CrossRef]

64. W. Zhou, Y. Zhu, J. Lei, R. Yang, and L. Yu, “LSNet: Lightweight Spatial Boosting Network for Detecting Salient Objects in RGB-Thermal Images,” IEEE Trans. Image Process., (2023).

65. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv, arXiv:1704.04861, (2017).

66. X. Li, W. Wang, X. Hu, and J. Yang, “Selective kernel networks,” In Proc. CVPR510–519 (2019).

67. G. Lin, A. Milan, C. Shen, and I. Reid, “Refinenet: Multi-path refinement networks with identity mappings for high-resolution semantic segmentation,” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1925–1934.

68. Shuran Song, Samuel P Lichtenberg, and Jianxiong. Xiao, “SUN RGB-D: A RGB-D scene understanding benchmark suite,” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 567–576.

69. W. Zhou, J. Jin, J. Lei, and L. Yu, “CIMFNet: Cross-layer interaction and multiscale fusion network for semantic segmentation of high-resolution remote sensing images,” IEEE J. Sel. Top. Signal Process. 16(4), 666–676 (2022). [CrossRef]

70. W. Zhou and J. Hong, “FHENet: Lightweight Feature Hierarchical Exploration Network for Real-Time Rail Surface Defect Inspection in RGB-D Images,” IEEE Trans. Instrum. Meas. 72, 1–8 (2023). [CrossRef]

71. Z. Qiu, Y. Zhuang, H. Hu, and W. Wang, “Using Stacked Sparse Auto-Encoder and Superpixel CRF for Long-Term Visual Scene Understanding of UGVs,” IEEE Trans. Syst. Man Cybern, Syst. 50(4), 1331–1342 (2020). [CrossRef]

72. W. Zhou, L. Yu, Y. Zhou, W. Qiu, M. Wu, and T. Luo, “Local and global feature learning for blind quality evaluation of screen content and natural scene images,” IEEE Trans. on Image Process. 27(5), 2086–2095 (2018). [CrossRef]

73. W. Zhou, Q. Guo, J. Lei, L. Yu, and J.-N. Hwang, “ECFFNet: Effective and consistent feature fusion network for RGB-T salient object detection,” IEEE Trans. Circuits Syst. Video Technol. 32(3), 1224–1235 (2022). [CrossRef]

74. J. Ma, W. Zhou, J. Lei, and L. Yu, “Adjacent bi-hierarchical network for scene parsing of remote sensing images,” IEEE Geosci. Remote Sens. Lett.

75. K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proc. ICCV1026–1034 (2015).

Input	Operator	t	c	n	s	Output
480 × 640 × 3	conv2d	-	32	1	2	-
240 × 320 × 32	bottleneck	1	16	1	1	-
240 × 320 × 16	bottleneck	6	24	2	2	120 × 160 × 24
120 × 160 × 24	bottleneck	6	32	3	2	60 × 80 × 32
60 × 80 × 32	bottleneck	6	64	4	2	30 × 40 × 64
30 × 40 × 64	bottleneck, rate = 2	6	96	3	1	30 × 40 × 96
30 × 40 × 96	bottleneck, rate = 4	6	160	3	1	30 × 40 × 160

Models	mAcc	Pixel Acc.	mIoU
Long et al. [4]	46.1	65.4	34.0
Liu et al. [27]	51.7	70.3	41.2
Ma et al. [25]	51.78	70.66	40.07
Yuan et al. [19]	49.9	68.5	37.4
He et al. [29]	53.8	70.1	40.1
Lin et al. [32]	-	-	47.7
Liu et al. [31]	39.0	63.1	29.5
Fayyaz et al. [30]	42.6	62.1	30.9
Yuan et al. [36]	59.27	74.42	46.79
Hu et al. [34]	-	-	48.3
Wang et al. [33]	56.3	-	43.9
Xiong et al. [37]	63.5	-	50.7
Lin et al. [39]	-	-	50.7
Chen et al. [38]	-	77.9	52.4
Zhou et al. [40]	59.6	73.5	46.1
Fang et al. [45]	-	51.0	38.2
Zhou et al. [46]	64.6	77.1	51.5
Ours	64.9	78.6	52.6

Models	mAcc	Pixel Acc.	mIoU
Hazirbas et al. [13]	48.3	76.27	37.29
Jiang et al. [20]	60.3	81.3	47.8
Park et al. [15]	60.1	81.5	47.7
Li et al. [24]	48.1	-	-
Qi et al. [28]	57	-	45.9
He et al. [29]	41.2	65.5	32.9
Hu et al. [34]	-	-	48.1
Wang et al. [33]	53.5	-	42.0
Lin et al. [32]	-	-	48.1
Lin et al. [39]	-	-	49.5
Chen et al. [38]	-	82.5	49.4
Fang et al. [45]	-	48.9	39.5
Zhou et al. [46]	-	85.2	50.6
Ours	59.7	85.5	48.6

Model	FSM	FFM	Pixel acc.	Mean acc.	mIoU
Model A			74.9	62.5	48.8
Model B	√		77.8	64.6	51.1
Model C		√	78.0	64.4	51.4
W/o pretrain	√	√	75.4	60.5	47.5
Ours	√	√	78.6	64.9	52.6

Variants	mAcc	PixAcc	mIoU	Parameter(M)↓
Ours (VGGNet)	78.8	65.2	52.6	38.1
Ours (ResNet)	78.9	66.3	53.1	98.3
Ours (MobileNetV2)	78.6	64.9	52.6	12.03

FASFLNet: feature adaptive selection and fusion lightweight network for RGB-D indoor scene parsing

Abstract

1. Introduction

2. Related work

3.1 RGB-D scene parsing

3.2 RGB-T scene parsing

3. Proposed FASFNet

3.1 Backbone

3.2 Feature selection module (FSM)

3.3 Feature fusion module (FFM)

3.4 Concatenation and classifier

4. Experimental results and analyses

4.1 Experimental setup

4.2 Results on NYU dataset

4.3 Results on SUN dataset

4.4 Ablation study

4.5 Time-complexity analysis

5. Conclusion

Funding

Disclosures

Data Availability

References

Data Availability

Cited By

Figures (5)

Tables (6)

Equations (1)

Optics Express

Models	Parameter(M)↓	FPS↑
Jiang et al. (ResNet50) [20]	82.0	26.5
Hu et al. (ResNet50) [34]	116.6	21.2
Park et al. (ResNet101) [15]	443.8	11.4
Chen et al. (ResNet101) [38]	110.6	16.6
Zhou et al. (VGGNet) [6]	41.9	19.1
Ours (MobileNetV2)	12.03	32