
Object-based color constancy in a deep neural network

Open Access

Abstract

Color constancy refers to our capacity to see consistent colors under different illuminations. In computer vision and image processing, color constancy is often approached by explicit estimation of the scene’s illumination, followed by an image correction. In contrast, color constancy in human vision is typically measured as the capacity to extract color information about objects and materials in a scene consistently throughout various illuminations, which goes beyond illumination estimation and might require some degree of scene and color understanding. Here, we pursue an approach with deep neural networks that tries to assign reflectances to individual objects in the scene. To circumvent the lack of massive ground truth datasets labeled with reflectances, we used computer graphics to render images. This study presents a model that recognizes colors in an image pixel by pixel under different illumination conditions.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. INTRODUCTION

While color perception is deeply grounded in trichromacy, it is not simply the excitation of the three retinal cone types that determines color appearance. Rather, the visual system takes the distribution of colors in the whole visual field into account to estimate the color of objects in a way that is mostly independent of the particular illumination. This is called color constancy, and it allows us to treat objects as if their color was a stable object property, similar to size [1–3]. Psychophysical experiments have shown that several different cues can contribute to achieving color constancy (for reviews, see [2,4,5]). Among others, the local contrast across edges is fairly constant across different illuminations, the global average color can be used as a normalization factor to cancel the illumination, and the color of the brightest patch in a scene can provide information about the predominant illumination. These cues have also been used computationally to solve color constancy, such as in the gray-world [6] and white-patch algorithms [7–9].

While these algorithms are relatively simple and straightforward to implement, there are also more sophisticated techniques based on learning, such as gamut mapping algorithms [10–13], a Support Vector Regression (SVR)-based algorithm [14], Bayesian frameworks [15,16], multi-layer perceptron neural networks [17], and convolutional neural networks (CNNs). In most learning methods, colors are estimated based on the spatial distribution of colors in the image.

Recently, deep learning architectures using CNNs have been used to solve many hitherto intractable problems, such as object recognition [18,19]. In the realm of color vision, this has been applied to compare artificial neural architectures to physiological and psychophysical data [20–22], to add missing color to grayscale images [23], or to approach color constancy [24]. Typically, this is done by estimating the illuminant(s) in a scene and then correcting the image to achieve a proper white balance [25–30]. In our own previous work [24], we estimated the reflectance of a single object in the scene, which was rendered with one of 1600 Munsell reflectances under an unknown illumination.

Here, we extend this approach to estimate the reflectance of every pixel in an image. We developed a deep neural network (DNN) that can separately recognize the color in each small part of the image under different illumination conditions. The network tries to simultaneously solve two tasks. First, it needs to segment the scene into different objects, and second, it needs to assign a constant reflectance to all pixels belonging to the same object. For this purpose, we used a U-Net model [31] with ResNet [32] architecture, similar to what has been used before on segmentation alone.

One critical precondition for exploring color constancy in DNNs is the availability of huge datasets labeled with ground truth colors. While there are some datasets of hyperspectral images and controlled laboratory images available, their number is not large enough for training deep networks. Therefore, we used computer graphics to render a huge dataset of images instead.

2. MATERIALS AND METHODS

The main goal of this research is to develop a model that recognizes the color of every point of an input image in a way that is independent of the lighting conditions. Input images are rendered under different lighting conditions and contain objects made of different materials.

A. Munsell and CIE-XYZ Coordinates

This research uses two color coordinate systems. One is the Munsell system, named after the collection of chips put forward by Munsell [33,34] to evenly cover the whole color space. Hue, value, and chroma are the three coordinates used to index each Munsell chip. Value describes how light a Munsell chip appears and is a compressive function of chip luminance under CIE illuminant C [35]. In terms of surface reflectance, it corresponds roughly to the amount of light reflected by the Munsell chip. It ranges from one to nine, with one being the darkest and nine the lightest. The hue circle has 10 primary hues: red, yellow-red, yellow, green-yellow, green, blue-green, blue, purple-blue, purple, and red-purple. Each of these hues is divided into 10 intermediate hues, but for practical purposes, typically only four are used [36]. These are the 40 hues we use here as well. Chroma measures the distance to gray and represents the “purity” of the color. The chroma range is between zero and 16. As chroma increases (in steps of two), the surface reflectance spectrum becomes less flat, and the chip becomes more colorful. While the Munsell system naturally imposes a cylindrical coordinate system, the arrangement of the actual chips is not perfectly cylindrical, because of gamut limits and other physical constraints. High chromas cannot be achieved with certain hues and values. Therefore, the whole set of physical chips in the Munsell Book of Color Glossy Collection has just 1600 chips as opposed to $9 \times 40 \times 9 = 3240$ chips. When we use the term “color” in the following to describe our stimuli, we actually refer to the spectral reflectances of these 1600 Munsell chips.
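To make the chip count concrete, the following minimal sketch (Python; the hue labels follow standard Munsell notation, but the enumeration itself is purely illustrative) builds the nominal grid of $9 \times 40 \times 9 = 3240$ coordinate combinations; the 1600 physical chips are the gamut-limited subset of this grid.

```python
# Nominal Munsell coordinate grid underlying the network's output labels.
# Values 1..9, 40 hues (10 hue families x 4 steps in standard Munsell
# notation), chroma 0..16 in steps of two. Only 1600 of these 3240
# nominal combinations exist as physical chips, because of gamut limits
# and because neutral (chroma 0) chips carry no hue.
values = list(range(1, 10))                        # 9 lightness values
hue_families = ["R", "YR", "Y", "GY", "G", "BG", "B", "PB", "P", "RP"]
hues = [f"{step}{family}" for family in hue_families
        for step in (2.5, 5, 7.5, 10)]             # 40 hues
chromas = list(range(0, 17, 2))                    # 9 chroma levels

nominal_grid = [(h, v, c) for h in hues for v in values for c in chromas]
print(len(nominal_grid))                           # -> 3240
```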

While the Munsell chips at the output of our model are based on reflectances, the input to our model is based on the signals entering the eyes. For these input images, we used the CIE 1931 XYZ human color matching functions, so that we would be within a linear transformation of cone excitation values. Because of the nature of our networks, any linear transformation of the input signal will not affect our results.

B. Dataset

The physically based spectral rendering engine Mitsuba was used in this study to generate our dataset. Mitsuba simulates the physics of light interacting with surfaces, allowing for accurate renderings, and was developed for research purposes. Mitsuba can also take physically measured spectral data, such as the spectra of reflectances, lights, and surfaces, as input parameters. We used Mitsuba's multispectral capabilities with the reflectance spectra of the 1600 Munsell chips for our dataset. For illumination, we used the power spectra of 17 lights (Fig. 1). We used a neutral D65 illumination and 16 chromatic illuminations along four different directions in color space, two of them along the daylight locus and two of them in an orthogonal direction [37]. These were located at distances of five, 10, 15, and 20 $\Delta {\rm E}$ units from the CIE L*a*b* gray point in the chromaticity plane. For objects, a collection of mesh datasets published by Evermotion [38] was used. There were a total of 2115 different meshes, including both man-made and natural objects. For generating images, we used the Mitsuba spectral renderer with 20 wavelength channels and saved the images in a three-dimensional floating point CIE-XYZ color space.
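As an illustration of how the illuminant set is organized, the sketch below (Python) lays out the 17 illumination conditions as offsets from the D65 gray point in the CIE L*a*b* chromaticity plane. The direction vectors are placeholders chosen for clarity; the actual illuminant spectra follow the chromatically biased illuminations of [37].

```python
import numpy as np

# Hedged sketch: chromaticities of the 16 chromatic illuminants as offsets
# from the D65 gray point in the a*b* plane. The direction vectors below
# are illustrative assumptions, not the measured spectra used in the paper.
gray_ab = np.array([0.0, 0.0])        # a*, b* of the neutral (D65) point
deltaE_steps = [5, 10, 15, 20]        # distances from gray in Delta E units

directions = {                        # assumed unit vectors in the a*b* plane
    "bluish":    np.array([0.0, -1.0]),
    "yellowish": np.array([0.0,  1.0]),
    "greenish":  np.array([-1.0, 0.0]),
    "reddish":   np.array([1.0,  0.0]),
}

illuminants = {"D65": gray_ab}
for name, u in directions.items():
    for d in deltaE_steps:
        illuminants[f"{name}_{d}"] = gray_ab + d * u

print(len(illuminants))               # -> 17 illumination conditions
```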


Fig. 1. Sample of a rendered image. Each scene layout has 17 illumination conditions: one D65 illumination plus four reddish, four bluish, four greenish, and four yellowish illuminations, at four different distances from D65 for each color direction.


Our dataset contains three different scene layouts (Fig. 2), two of which were created in SketchUp and converted to Mitsuba. The first scene layout is very simple, with 11 cup-shaped objects of fixed colors added to the environment as color checkers. In each scene, the material of the cups was randomly set to diffuse, plastic, metal, or glass. The second scene layout has a background consisting of colorful cubes and spheres, whose colors change randomly in each scene. The third scene layout shows a kitchen, where the color of the walls and cupboards changes in each scene and is randomly selected from desaturated colors with chroma less than four.


Fig. 2. Some examples of rendered images containing similar objects in different scene layouts.


Pixel-wise labels are generated by rendering each scene twice, once for the image and once for the label. In the label rendering, each pixel contains the ID number of the object to which it belongs instead of color information. From this output, together with subsequent postprocessing, image labels are created that specify the color and material of each pixel.
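A minimal sketch of this postprocessing step (Python); the lookup table mapping object IDs to assigned Munsell coordinates is a hypothetical stand-in for the scene description used during rendering.

```python
import numpy as np

# Hedged sketch of the label postprocessing: the second render gives an
# integer object ID per pixel, and a lookup table (assumed here) maps each
# object ID to the Munsell chip indices assigned to that object when the
# scene was built.
def make_label_maps(id_map: np.ndarray, chip_lut: dict) -> np.ndarray:
    """id_map: (H, W) integer array of object IDs.
    chip_lut: {object_id: (hue_idx, value_idx, chroma_idx)}.
    Returns an (H, W, 3) array of per-pixel class indices."""
    labels = np.zeros(id_map.shape + (3,), dtype=np.int64)
    for obj_id, (h, v, c) in chip_lut.items():
        mask = id_map == obj_id
        labels[mask] = (h, v, c)
    return labels

# Toy example with a 2x2 ID map and two objects.
ids = np.array([[1, 1], [2, 2]])
lut = {1: (12, 4, 3), 2: (30, 7, 1)}
print(make_label_maps(ids, lut)[0, 0])   # -> [12  4  3]
```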

The training dataset consists of 1900 objects randomly selected out of the 2115 objects, and the test dataset contains the remaining 215 objects. The reflectance spectra of 1600 glossy Munsell color chips were used for the objects; 330 of these colors are World Color Survey (WCS) colors, which we used in the test dataset, while the rest were used in the training dataset. The test dataset used 17 illuminations, including the D65 illumination, and nine of these illuminations were used in the training dataset. Objects are made of diffuse, plastic, and metal materials. Within one scene layout and a specific illumination, every color was used once for each material. Each object has a fixed position in each scene layout, and the rotation of objects in one scene is the same under different illuminations. However, each object appeared more than once in the dataset, each time with a different rotation and color (Fig. 2).

In the training dataset, each scene layout includes 3810 scenes of various objects and colors, each rendered under nine illuminations. As a result, the training dataset has 34,290 images for each scene layout. There are three scene layouts in the dataset, so in total there are 102,870 images in the training dataset. The test dataset has 16,830 images under 17 different illuminations for each scene layout; therefore, the test dataset contains 50,490 images. The objects and colors are the same for each specific scene across the different scene layouts, but the scale and location vary. Figure 2 shows two specific scenes in different scene layouts.

C. Model

A model capable of labeling every point in the input image is required for this project, which can be achieved with segmentation models. U-Net models are widely used for semantic and instance segmentation. The input to such a model is an image with one or multiple channels of color coordinates. The output shows which pixel or part of the input belongs to which object; each object is represented by a layer in the output, and each pixel in the output corresponds to a portion or pixel of the input image. The model comprises an encoder and a decoder, and various DNN models can be used for these two parts. As the encoder, we used a ResNet with 50 layers (Fig. 3).

The model aims to recognize which Munsell chip corresponds to the hue, value, and chroma of the input, independent of the illumination. The input of this model is an image in the CIE-XYZ color space; the input image therefore has three layers. The output image has 58 layers, consisting of three one-hot vectors: the first vector has length nine and represents the value, the second has 40 entries and represents the hue, and the third has length nine and represents the chroma.
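The following sketch illustrates this input/output structure in PyTorch, using the segmentation_models_pytorch library as one possible implementation of a U-Net with a ResNet-50 encoder; the paper does not specify the framework or the training loss, so the per-head cross-entropy shown here is an assumption.

```python
import torch
import segmentation_models_pytorch as smp  # assumed implementation; the paper
                                            # does not name the framework used

# U-Net with a ResNet-50 encoder: 3-channel CIE-XYZ input, 58 output
# channels = 9 (value) + 40 (hue) + 9 (chroma) per-pixel class logits.
model = smp.Unet(encoder_name="resnet50", encoder_weights=None,
                 in_channels=3, classes=58)

def color_constancy_loss(logits, value_gt, hue_gt, chroma_gt):
    """Split the 58 channels into three heads and sum per-head pixel-wise
    cross-entropy losses (an assumption; the paper's exact loss is not given)."""
    value_logits, hue_logits, chroma_logits = torch.split(logits, [9, 40, 9], dim=1)
    ce = torch.nn.functional.cross_entropy
    return ce(value_logits, value_gt) + ce(hue_logits, hue_gt) + ce(chroma_logits, chroma_gt)

xyz = torch.rand(1, 3, 256, 256)   # dummy CIE-XYZ input image
logits = model(xyz)                # shape: (1, 58, 256, 256)
```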

3. RESULTS

Unless otherwise specified, all results shown here are based upon three trained models, each trained with two scene layouts and tested with the remaining scene layout. Figure 4 illustrates some example inputs and outputs of the model during the test phase. The first and third columns are the input images to the model, the second column is the model's output for the first column, and the fourth column is the model's output for the third column. Each row corresponds to a different illumination, indicated by the colored bar on the left side of that row. In the input images, the color codes of the upper pixels of the yellow object are entirely different from those of its other pixels. However, the model uniformly recognizes all the pixels of this object as yellow. The color codes of the pixels belonging to the purple bag also differ completely across illuminations, yet the model recognizes the bag's color relatively well, with only a small error.


Fig. 3. Model architecture. U-Net model with ResNet-50 encoder and decoder.


Our model has the ability to recognize color under illuminations that were never observed during training. During training, the model sees images under D65 light as well as under two of the four chromatic distances from D65 for each of the red, blue, green, and yellow directions. During testing, it also receives images under illuminations that were never seen during training. The error histograms of the model under three different illuminations are shown in Fig. 5. Note that hue is circular between one and 100 and a circular error metric is used; for example, if the hue error is 2.5, it means that a neighboring Munsell chip has been detected for that color. The value (brightness) is between one and nine, and the chroma is between zero and 16. The error function for each object is defined in Eq. (1), where $p_i$ is the predicted hue, value, or chroma for one pixel, and GT is the ground truth for that object. Hue, value, and chroma total pixelwise errors are calculated separately using this formula:

$${\rm Total\ pixelwise\ error} = \sum_{i = 0}^{n} \left| p_i - {\rm GT} \right|.\tag{1}$$
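A small sketch of this error measure (Python), including the circular treatment of hue described above; the array layout is assumed for illustration.

```python
import numpy as np

def circular_hue_error(pred_hue, gt_hue, period=100):
    """Circular distance on the 1..100 Munsell hue scale (maximum error 50)."""
    d = np.abs(np.asarray(pred_hue, dtype=float) - gt_hue)
    return np.minimum(d, period - d)

def total_pixelwise_error(pred, gt, circular=False):
    """Eq. (1): sum of absolute per-pixel deviations from the object's ground
    truth; `pred` holds one coordinate (hue, value, or chroma) for every pixel
    of the object, `gt` is that object's scalar ground truth."""
    err = circular_hue_error(pred, gt) if circular else np.abs(np.asarray(pred) - gt)
    return float(err.sum())

# A predicted hue of 98 against ground truth 2 counts as an error of 4
# (the scale wraps around), not 96.
print(circular_hue_error(98, 2))   # -> 4.0
```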

Fig. 4. Example inputs and outputs of the model under different illuminations. The left image of each pair was rendered under the colored illumination given by the colored bar at the left of each row. The right image of each pair shows the output of the model, which should be of a constant color for each object and constant across all illuminations.



Fig. 5. Histogram of errors. To evaluate the model, (A) images with D65 illumination are used in the gray histogram, (B) images with light blue illumination are used in the pale blue histogram, and (C) images with blue illumination are used in the dark blue histogram. The illumination used in the bottom histogram was not among the illuminations of the training dataset. Note that the bin widths of the histograms correspond to the spacing of physical Munsell chips.


The first row of Fig. 5 shows the error histograms of the model for two illuminations that are part of both the training and test datasets. The second row, in contrast, shows the error histograms for an illumination that does not exist in the training dataset. The results for this new illumination are as good as those for the illuminations present in the training dataset. Thus, the model is capable of generalizing towards new illuminations.

To systematically test how well the model generalizes, we trained a version of the model with D65 illumination only (naive model) and compared it to models trained with nine different illuminations (color constancy model). When the model was trained only under D65 illumination and then tested under other illuminations, its performance deteriorated significantly. Figure 6 shows the comparison between the naive model and the color constancy model and illustrates their generalization ability. According to this result, the model trained under just D65 illumination lacks generalization capability, and its error grows as the illumination color deviates from D65. We observed this not only for the model trained with D65 illumination: in general, when a model was trained with only one specific illumination, differences between test and training illuminations increased the error.
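A sketch of how such per-illumination curves can be computed (Python): test images are grouped by the ΔE distance of their illuminant from D65, and the total pixelwise error is averaged within each group. The data structures are assumptions for illustration.

```python
from collections import defaultdict
import numpy as np

# Hedged sketch of the Fig. 6 analysis: average the total pixelwise error of
# each test image within groups defined by the illuminant's distance from D65
# (0, 5, 10, 15, 20 Delta E). The (error, distance) pairs are assumed to come
# from evaluating a trained model on the test set.
def mean_error_per_illuminant_distance(per_image_errors, per_image_distances):
    groups = defaultdict(list)
    for err, dist in zip(per_image_errors, per_image_distances):
        groups[dist].append(err)
    return {dist: float(np.mean(errs)) for dist, errs in sorted(groups.items())}

# Toy example: errors that grow with distance from D65, as for a naive model.
errors = [1.0, 1.1, 2.3, 3.9, 5.2]
dists = [0, 5, 10, 15, 20]
print(mean_error_per_illuminant_distance(errors, dists))
```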


Fig. 6. Mean of total pixelwise error for each group of illuminations. The purple lines represent the error of our model trained using nine different illuminations (color constancy model). The black line shows the error of the model trained only on D65 illumination (naive model).


A total of 1600 colors were used in this study; 330 of these colors are from the WCS and are shown in Fig. 7(A). These 330 colors were used only in the test dataset. Figures 7(B)–7(D) show the average error, as specified by Eq. (1), for each of these colors. Figure 7(B) shows that the model has difficulty estimating the hue of very dark or very bright colors, shown in the top and bottom rows of Fig. 7(A). This matches observations on human observers, who also have difficulty distinguishing the hues of very dim or very bright colors. Errors for value and chroma are more evenly and unsystematically distributed across the WCS colors.


Fig. 7. Median of error for each WCS color. (A) Guide to show the color of each square. (B) Hue errors, (C) value errors, and (D) chroma errors. In (B)–(D), darker squares indicate more error, and lighter squares indicate less error.


Since the object colors cover the whole Munsell space, we can investigate more systematically whether there is a systematic relation among hue, value, and chroma errors. In Fig. 8, the first row shows the relative hue mean square error for each hue, value, and chroma. The second row displays the relative value mean square error for each hue, value, and chroma, and the third row shows the relative chroma mean square error for each hue, value, and chroma. The percentages represent mean square errors relative to the maximum possible mean square errors. The maximum error for hue is 50, since hue ranges from one to 100 and is circular; the maximum possible error for value is eight, and for chroma it is 16. Two kinds of errors can be differentiated, because the model is supposed to assign a constant color to all pixels within each object, and it is supposed to assign the correct color to each object across illuminations. We can decompose the total mean square error into these two components, which are shown in Fig. 8.


Fig. 8. Hue, value, and chroma relative mean square errors of color constancy model for each hue, value, and chroma. Percentages represent mean square errors relative to maximum mean square errors along each dimension. The gray portions of bars indicate within-objects MSE and remaining bars show object-wise MSE. The sum of these two parts is total pixelwise MSE. (A)–(C) Hue relative mean square error from different views: (A) percentage of hue MSE for each hue, (B) percentage of hue error for each value, and (C) percentage of hue error for each chroma. (D)–(F) Value relative mean square error from different views: (D) percentage of value error for each hue, (E) percentage of value error for each value, and (F) percentage of value error for each chroma. (G)–(I) Chroma relative mean square error from different views: (G) percentage of chroma error for each hue, (H) percentage of chroma error for each value, and (I) percentage of chroma error for each chroma. (J) Number of chips per chroma is not uniform; this histogram illustrates the number of Munsell chips for each chroma.


In this figure, each bar has two parts: the lower part is non-gray, and the upper part is gray. The non-gray part of each bar shows the relative object-wise mean square error [Eq. (2)], which is the error due to mis-estimation of the color. The gray part shows the relative within-object mean square error [Eq. (3)], which is the error due to incorrect segmentation. The sum of these two parts is the relative pixel-wise mean square error [Eq. (4)]. In these formulas, $p_i$ is the predicted hue, value, or chroma for one pixel, GT is the hue, value, or chroma ground truth for one object, and $P$ is the mean of the predicted hue, value, or chroma over that object:

$${\rm Objectwise\ MSE} = \left( P - {\rm GT} \right)^2,\tag{2}$$
$${\rm Within\ Object\ MSE} = \sum_{i = 0}^{n} \left( p_i - P \right)^2,\tag{3}$$
$${\rm Total\ pixelwise\ MSE} = \sum_{i = 0}^{n} \left( p_i - {\rm GT} \right)^2.\tag{4}$$
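The sketch below (Python) implements Eqs. (2)–(4) for a single object and one color coordinate, and notes the standard decomposition that relates the three quantities.

```python
import numpy as np

def error_decomposition(pred, gt):
    """Eqs. (2)-(4) for one object and one coordinate (hue, value, or chroma):
    `pred` holds the predicted coordinate for each pixel of the object, `gt`
    is the object's ground-truth coordinate."""
    pred = np.asarray(pred, dtype=float)
    P = pred.mean()                                  # mean prediction over the object
    objectwise_mse = (P - gt) ** 2                   # Eq. (2): color mis-estimation
    within_object_mse = np.sum((pred - P) ** 2)      # Eq. (3): segmentation error
    total_pixelwise_mse = np.sum((pred - gt) ** 2)   # Eq. (4)
    # Standard variance decomposition relating the three quantities:
    # sum_i (p_i - GT)^2 = sum_i (p_i - P)^2 + n * (P - GT)^2
    return objectwise_mse, within_object_mse, total_pixelwise_mse

pred = [4.0, 4.0, 5.0, 6.0]   # toy per-pixel value predictions for one object
print(error_decomposition(pred, gt=5.0))   # -> (0.0625, 2.75, 3.0)
```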

In Fig. 8, each bar represents the relative mean square error of all objects in that group. For example, to show the mean square error of hue at chroma 12, the relative mean square error of hue is calculated for all objects whose chroma is 12 and averaged over them. Not surprisingly, hue errors are largest for very dark or very light Munsell chips or for chips with low chroma, where hues are also most difficult to identify in Fig. 7(A). Interestingly, the hue error does not seem to be constant across different hues; it is substantially larger for bluish hues in Fig. 8(A). In contrast, value and chroma errors are more evenly distributed across hues. With respect to segmentation, the contribution of segmentation errors was quite stable across hue, value, and chroma, with average relative contributions of 24% for value and 38% for chroma; for hue, it was higher, at 52%.

The test dataset contains 330 colors from the WCS that were absent from the training dataset. Twenty-five objects were randomly selected for each color, and the model's predicted color, averaged over each object, was computed. In Fig. 9, each big square corresponds to one color from the WCS. There are 25 small squares in each large square, and each small square shows the color predicted by the model for one of those objects. It should be noted that this result was obtained on images consisting of objects, colors, and layouts that the model had never seen during training. This shows that the model has the potential to generalize well.


Fig. 9. Colors predicted by the model. The large squares represent WCS colors, and 25 objects of each color are randomly selected. The small squares represent the predicted color for those objects.


4. DISCUSSION

The objective of this study was to develop a DNN model capable of assigning a constant color to each individual object in an input image, pixel by pixel and irrespective of illumination changes. We developed a model that simultaneously segments the image into different objects and directly assigns a reflectance to each pixel. For this purpose, we extended a DNN model that is usually used for segmentation to specify the hue, value, and chroma of each pixel as its output. For the types of rendered scene layouts we used, the model works fairly well: the average hue error is 3.24 Munsell steps out of 100, the average chroma error is 1.47 out of 16, and the average value error is 1.09 out of nine. These errors are at the level of neighboring Munsell chips in collections of physical samples, which have a spacing of 2.5 in hue, two in chroma, and one in value. The DNN error measures combine errors in segmentation, when different color coordinates are assigned to different pixels within an object, and errors in color estimation, when the object is assigned a wrong color. The relative contribution of segmentation errors was 24% for value and 38% for chroma; for hue, it was higher, at 52%. We would have expected a better performance for hue, due to the relative stability of hue within each object, while chroma and luminance vary naturally due to shading and material properties.

DNN models are now widely used to solve a large variety of tasks, such as object recognition, that previously seemed intractable. However, one prerequisite for using DNNs is a huge dataset of training examples with ground truth information. While large databases of images exist and have been widely used to train DNNs [19,39,40], these images are of little use for training DNNs for color constancy. First of all, the images are usually taken with consumer cameras using automatic white balance, which already corrects for the illumination. Second, information needs to be available about the reflectance of the objects, which requires at least multispectral imaging and an estimate of the illumination. While some such images are available [41–43], their number is too limited to train neural networks. Because of these issues, we chose to generate our own images using computer graphics, an approach that we already used in our earlier work [20].

Our approach is somewhat different from other neural network approaches. Often, rather shallow networks are used to extract the illumination of an image; in a second step, this estimate is then used to correct the input image. Here, we use the type of deep network that has previously been used for object recognition and image segmentation. We think that this might be more similar to how the visual system deals with color constancy. Human observers are rather poor at identifying illuminants or detecting illuminant changes. Instead, constancy might be achieved in the visual system through illumination-invariant detection of chromatic edges by double-opponent neurons (for a review, see [44]), or by normalization across larger image regions (e.g., [45]).

So far, our model is severely limited by being trained only on computer-generated images. In principle, it can be applied to photographs of natural scenes, but it is not possible to evaluate the color error due to the lack of ground truth information. In the near future, we plan to pre-train our model on a segmentation task [46] and then use transfer learning for the color constancy task. This might help to improve the segmentation for more natural inputs.

Funding

European Research Council Advanced Grant Color 3.0 (884116).

Acknowledgment

We are grateful to Alban Flachot for helpful discussions, and to Arash Akbarinia for computational support.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are publicly available at [47]. The full dataset including all images is 480 GB in size and may be obtained from the authors upon reasonable request.

REFERENCES

1. K. R. Gegenfurtner and D. C. Kiper, “Color vision,” Annu. Rev. Neurosci. 26, 181–206 (2003). [CrossRef]  

2. D. H. Foster, “Color constancy,” Vision Res. 51, 674–700 (2011). [CrossRef]  

3. M. Ebner, Color Constancy (Wiley, 2007), Vol. 7.

4. A. Hurlbert, “Challenges to color constancy in a contemporary light,” Curr. Opin. Behav. Sci. 30, 186–193 (2019). [CrossRef]  

5. H. E. Smithson, “Sensory, computational and cognitive components of human colour constancy,” Philos. Trans. R. Soc. B 360, 1329–1346 (2005). [CrossRef]  

6. G. Buchsbaum, “A spatial processor model for object colour perception,” J. Franklin Inst. 310, 1–26 (1980). [CrossRef]  

7. E. H. Land and J. J. McCann, “Lightness and retinex theory,” J. Opt. Soc. Am. 61, 1–11 (1971). [CrossRef]  

8. B. Funt and L. Shi, “The rehabilitation of MaxRGB,” in Color and Imaging Conference (Society for Imaging Science and Technology, 2010), Vol. 2010, pp. 256–259.

9. H. R. Vaezi Joze and M. S. Drew, “White patch gamut mapping colour constancy,” in 19th IEEE International Conference on Image Processing (2012), pp. 801–804.

10. D. A. Forsyth, “A novel algorithm for color constancy,” Int. J. Comput. Vis. 5, 5–35 (1990). [CrossRef]  

11. K. Barnard, “Improvements to Gamut mapping colour constancy algorithms,” in Computer Vision - ECCV, Lecture Notes in Computer Science (Springer, 2000), pp. 390–403.

12. G. D. Finlayson, S. D. Hordley, and I. Tastl, “Gamut constrained illuminant estimation,” Int. J. Comput. Vis. 67, 93–109 (2006). [CrossRef]  

13. A. Gijsenij, T. Gevers, and J. van de Weijer, “Generalized Gamut mapping using image derivative structures for color constancy,” Int. J. Comput. Vis. 86, 127–139 (2010). [CrossRef]  

14. Z. Tang, H. Liu, J. Yuan, C. Li, and Y. Zheng, “Estimating illumination chromaticity based on structured support vector machine,” in International Conference on Computer, Mechatronics and Electronic Engineering (CMEE) (2017), pp. 1–9.

15. C. Rosenberg, A. Ladsariya, and T. Minka, “Bayesian color constancy with non-Gaussian models,” in Advances in Neural Information Processing Systems (MIT, 2003), Vol. 16.

16. P. V. Gehler, C. Rother, A. Blake, T. Minka, and T. Sharp, “Bayesian color constancy revisited,” in IEEE Conference on Computer Vision and Pattern Recognition (2008), pp. 1–8.

17. R. Stanikunas, H. Vaitkevicius, and J. J. Kulikowski, “Investigation of color constancy with a neural network,” Neural Netw. 17, 327–337 (2004). [CrossRef]  

18. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Commun. ACM 60, 84–90 (2017). [CrossRef]  

19. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: a large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition (2009), pp. 248–255.

20. A. Flachot and K. R. Gegenfurtner, “Color for object recognition: Hue and chroma sensitivity in the deep features of convolutional neural networks,” Vision Res. 182, 89–100 (2021). [CrossRef]  

21. I. Rafegas and M. Vanrell, “Color encoding in biologically-inspired convolutional neural networks,” Vision Res. 151, 7–17 (2018). [CrossRef]  

22. M. Engilberge, E. Collins, and S. Süsstrunk, “Color representation in deep neural networks,” in IEEE International Conference on Image Processing (ICIP) (2017), pp. 2786–2790.

23. R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in European Conference on Computer Vision (Springer, 2016), pp. 649–666.

24. A. Flachot, A. Akbarinia, H. H. Schütt, R. W. Fleming, F. A. Wichmann, and K. R. Gegenfurtner, “Deep neural models for color classification and color constancy,” J. Vis. 22(4):17 (2022). [CrossRef]  

25. S. Bianco, C. Cusano, and R. Schettini, “Color constancy using CNNs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2015), pp. 81–89.

26. H.-H. Choi and B.-J. Yun, “Deep learning-based computational color constancy with convoluted mixture of deep experts (CMoDE) fusion technique,” IEEE Access 8, 188309 (2020). [CrossRef]  

27. Z. Lou, T. Gevers, N. Hu, and M. P. Lucassen, “Color constancy by deep learning,” in Proceedings of the British Machine Vision Conference (BMVC) (BMVA, 2015), pp. 76.1–76.12.

28. S. W. Oh and S. J. Kim, “Approaching the computational color constancy as a classification problem through deep learning,” Pattern Recognit. 61, 405–416 (2017). [CrossRef]  

29. O. Sidorov, “Artificial color constancy via GoogLeNet with angular loss function,” Appl. Artif. Intell. 34, 643–655 (2020). [CrossRef]  

30. J. Xiao, S. Gu, and L. Zhang, “Multi-domain learning for accurate and few-shot color constancy,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020), pp. 3258–3267.

31. O. Ronneberger, P. Fischer, and T. Brox, “U-net: convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention (Springer, 2015), pp. 234–241.

32. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv:1512.03385 (2015).

33. A. H. Munsell, “A pigment color system and notation,” Am. J. Psychol. 23, 236–244 (1912). [CrossRef]  

34. D. Nickerson, “History of the Munsell color system and its scientific application,” J. Opt. Soc. Am. 30, 575–586 (1940). [CrossRef]  

35. G. Wyszecki and W. S. Stiles, Color Science (Wiley, 1982), Vol. 8.

36. P. Kay and T. Regier, “Resolving the question of color naming universals,” Proc. Natl. Acad. Sci. USA 100, 9085–9089 (2003). [CrossRef]  

37. S. Aston, A. Radonjić, D. H. Brainard, and A. C. Hurlbert, “Illumination discrimination for chromatically biased illuminations: implications for color constancy,” J. Vis. 19(3):15 (2019). [CrossRef]  

38. https://evermotion.org.

39. J. Mehrer, C. J. Spoerer, E. C. Jones, N. Kriegeskorte, and T. C. Kietzmann, “An ecologically motivated image dataset for deep learning yields better models of human vision,” Proc. Natl. Acad. Sci. USA 118, e2011417118 (2021). [CrossRef]  

40. M. N. Hebart, A. H. Dickter, A. Kidder, W. Y. Kwok, A. Corriveau, C. V. Wicklin, and C. I. Baker, “THINGS: a database of 1,854 object concepts and more than 26,000 naturalistic object images,” PLoS One 14, e0223792 (2019). [CrossRef]  

41. S. M. C. Nascimento, K. Amano, and D. H. Foster, “Spatial distributions of local illumination color in natural scenes,” Vision Res. 120, 39–44 (2016). [CrossRef]  

42. F. Yasuma, T. Mitsunaga, D. Iso, and S. K. Nayar, “Generalized assorted pixel camera: postcapture control of resolution, dynamic range, and spectrum,” IEEE Trans. Image Process. 19, 2241–2253 (2010). [CrossRef]  

43. Y. Monno, H. Teranaka, K. Yoshizaki, M. Tanaka, and M. Okutomi, “Single-sensor RGB-NIR imaging: high-quality system design and prototype implementation,” IEEE Sens. J. 19, 497–507 (2019). [CrossRef]  

44. R. Shapley and M. J. Hawken, “Color in the cortex: single- and double-opponent cells,” Vision Res. 51, 701–717 (2011). [CrossRef]  

45. S. G. Solomon and P. Lennie, “Chromatic gain controls in visual cortical neurons,” J. Neurosci. 25, 4779–4792 (2005). [CrossRef]  

46. A. M. Hafiz and G. M. Bhat, “A survey on instance segmentation: state of the art,” Int. J. Multimed. Inf. Retr. 9, 171–189 (2020). [CrossRef]  

47. H. Heidari-Gorji and K. R. Gegenfurtner, “PixelwiseColorConstancy,” GitHub (2023), https://github.com/haamedh/PixelwiseColorConstancy.
