Method for optimizing channelized quadratic observers for binary classification of large-dimensional image datasets

M. K. Kupinski; E. Clarkson

doi:10.1364/JOSAA.32.000549

1. INTRODUCTION

Our work is motivated by a challenge that is common in many imaging applications: sorting image data between two classes of objects (e.g., signal present and signal absent) when linear classifiers do not perform well enough for the application. An optimal quadratic classifier requires either a training set of images from each class or prior knowledge of the first- and second-order statistics of the image data from each class. The first-order statistics are the average images from each class and the second-order statistics are the covariance matrices from each class. If a training set of images is available the first- and second-order sample statistics can be used. Optimal quadratic classifiers are difficult to compute in imaging applications because of the large number of measurements made by most imaging systems. A single image can contain a few million elements and the number of elements in the covariance matrix is equal to the square of this number. When working with the covariance matrix, storing it can be challenging, inverting it can be impractical, and accurately estimating it from finite training data can even be impossible. Our work addresses this big data problem by using a quadratic classifier on images that have been reduced in size by a linear transformation; we will refer to this as a channelized quadratic observer (CQO). This approach demands answering the following question: which linear transform is best for computing a quadratic classifier for a given imaging application? To address this question we have developed a new method for optimizing CQOs for binary classification of large-dimensional image datasets.

To introduce the detection method, begin by considering the relationship between an image and an object as

g = H f + n .

Here, $g$ is an $M \times 1$ vector of measurements made by an imaging system that is represented as a continuous-to-discrete operator $H$ ; the measurements of the continuous object $f$ are corrupted by measurement noise $n$ . We will consider post-processing signal detection. That is to say, the forward imaging model $H$ is fixed and can even be unknown since only the statistics of the image data will be used. We are interested in linear combinations of the image data of the form

v = Tg,

where

T

is an

L \times M

matrix and compression is achieved since

L < M

. Using terminology from the perception literature, each row of

T

is referred to as a channel [1]. In this paper,

v

will be called a channelized image or a channelized data vector. Mathematical observers for detection or classification tasks operate on image data

g

, or preferably channelized image data

v

, to form a scalar-valued decision variable. Objective assessment of image quality [2] quantifies the ability of an observer to use image data for performing the scientific task of interest, e.g., detection, classification, or estimation. Channelized data are preferable for mathematical observers since calculating a decision variable usually involves the estimation of parametric likelihoods [3]. In the channelized representation this estimation can be much more accurate given common constraints on finite training data [4,5]. Computational needs are lower because the inverse of a covariance matrix, required for likelihood evaluations, is now

L \times L

instead of

M \times M

.

In this work we will present gradient-based optimization methods for finding the solution to $T$ (once $L$ is selected) that maximizes detection task performance of the ideal observer (i.e., the likelihood ratio) given Gaussian statistics on the channel outputs, $v$ , for both classes. We will consider the first- and second-order statistics of each class to be different, in general, which leads to a quadratic relationship between the likelihood ratio and the image data; we call this a quadratic observer. When the second-order statistics are equal the ideal observer is linear and the optimal solution for $T$ is the Hotelling observer (i.e., a prewhitened match filter). This equal covariance assumption is valid when the two classes differ by the addition of a signal that is weak enough, relative to other sources of variability, so that it does not affect the covariance matrix. When the means are equal but the covariances are different we show a new result: that the same optimal $T$ solution is achieved using optimization with respect to the Bhattacharyya distance, Jeffrey’s divergence, and the area under the curve (AUC). This equal mean assumption is valid in ultrasound imaging [6 –8] and in many texture discrimination tasks.

The next section is devoted to a review of related work. Assumptions and notation are established in Section 3. Then we show an analytic gradient, with respect to the linear channels, for the following: Section 4) Kullback–Liebler (KL) divergence [9]; Section 5) the symmetrized KL divergence (also referred to as Jeffrey’s divergence (J) in information theory [10]); Section 6) the Bhattacharyya distance [11] (also called G(0) in [12]); and Section 7) the area under the ideal-observer receiver operating characteristic (ROC) curve, also known as the AUC [13,14]. We will see by the end of these Sections that the J and G(0) metrics are maximized at the same set of channels that maximizes the AUC when there is no signal in the mean. This results in an important surrogate figure of merit (SFOM) [15] for imaging applications since J is much easier to compute than the AUC.

In Section 8 we focus on the specific case of no signal in the mean and compute the Hessian of J, with $L$ number of linear channels. We use this Hessian to show that there are no more than $L + 1$ local maxima in this case, one of which is the global maximum. This is useful information for a gradient-based algorithm to find the global maximum.

In Section 9 the results of a simulation study are presented. Here, the AUC is computed for the ideal and Hotelling observers and compared to CQO performance. The CQO is computed in two different ways: 1) an iterative algorithm based on a gradient search to maximize J, and 2) an observer introduced in Section 5 that is based on an eigenanalysis of the covariance matrices and is optimal when there is no signal in the mean. The similar task performance between the J-based CQO (J-CQO) and the much less tractable Eigen-CQO indicates strong potential for the J-CQO method in more realistic imaging applications.

2. RELATED WORK

A. Channels in Imaging

For medical imaging applications image channels have been used to approximate Hotelling observers with channelized Hotelling observers (CHO) [16], both of which are linear observers. Channels have also been used to approximate the ideal observer, which is not necessarily linear in the image data [17,18]. These channelized ideal observers (CIOs) have been explored for both standard channels used for CHOs and channels derived from the imaging system operator [19,20].

In computer vision the interpretation of a single image is decomposed into subimage detection and classification tasks that are customized to the desired data product. Here, an image channel can be related to the original image data by both linear and nonlinear transforms; even the channel outputs themselves can be combined to further reduce the original data to features. A succinct review of different types of image channels and feature selection methods in this community is provided in [21].

Across imaging applications, the selection of channels can be motivated by maximizing the performance of the channelized observer; the channels are then referred to as efficient channels [2]. On the other hand anthropomorphic channels are designed to approximate human observer performance and are often based on properties of the human visual system [22 –24]. The kernel trick used for support vector machines (SVM) employs a nonlinear function of the data and a linear discriminant on that function’s output [25]. SVM seeks nonlinear channels, either efficient or anthropomorphic depending on the application, for a linear observer. This work concentrates on efficient linear channels for quadratic observers.

B. Eigenanalysis for Compression

In this paper, we consider the task of detection among two classes and study the eigenanalysis of the two covariance matrices $K_{1}$ and $K_{2}$ . When these covariance matrices are unequal the data is called heteroscedastic. Covariance matrix eigenanalysis is rarely practical for modern imaging systems since an image is comprised of several million elements. We will show a new and computational feasible approach for optimizing channel selection, which we denote J-CQO, and compare its detection task performance with an eigenanalysis approach, which we denote Eigen-CQO. We will also show that Eigen-CQO is optimal when the data is heteroscedastic and the mean images are equal. In [26] Fukunaga and Koontz were the first to suggest covariance matrix eigendecomposition for detection and classification tasks; this approach is called the Fukunaga and Koontz transform (FKT). The FKT uses a matrix $P$ to transform the data so that $P (K_{1} + K_{2}) P^{†}$ equals the identity matrix. This equality guarantees that both covariance matrices of the transformed data will have the same eigenvectors. Furthermore, the sum of the two eigenspectra, when eigenvalues associated with the same eigenvector are added, is equal to one. Consequently, the transformation by $P$ makes the strongest eigenvectors of one class the weakest eigenvectors of the other. In [27] the statistical properties of the FKT are studied and it is reported that, with certain eigenspectrum assumptions, the FKT is the optimal low-rank approximation to the optimal classifier for zero mean, heteroscadastic, and normally distributed data. FKT is widely used in pattern recognition; its adaptation is called a tuned basis function (TBF) in [28] for finding an optimal solution to $\tilde{T}$ to use in the quadratic test statistic, $g^{†} \tilde{T} g$ .

In Section 5, we will show that with the normality assumptions used in this paper, and when there is no signal in the mean, an optimal channel matrix for the J FOM can be constructed by using $L$ eigenvectors of $K_{2}^{- 1} K_{1}$ for the channels, with corresponding eigenvalues $κ_{l}$ . These eigenvectors are chosen to have the $L$ largest values of $κ_{l} + κ_{l}^{- 1}$ and we call this observer the Eigen-CQO. Using these same assumptions on the statistics of the data, it was shown in [29] that the Eigen-CQO maximizes the Bhattacharya distance between two classes. Multiple discriminant analysis (MDA), described in [30], is similar to Eigen-CQO but instead of $K_{1}$ and $K_{2}$ an intraclass and an extraclass scatter matrix are estimated from finite data; the imaging task is to distinguish among more than two classes. MDA is achieved when the scatter matrices are estimated from $N$ image samples, the FKT eigenspectrum is calculated from a QR decomposition on an $N \times M$ matrix, and the final classification decision is formed using a 1-nearest neighbor (1-NN) algorithm.

C. Information Metrics in Imaging

A SFOM is a quantity that correlates with ideal-observer task performance, as measured by the AUC or some other task-based figure of merit (FOM), but that is easier to compute [31]. Statistical distances and divergences used in information theory can be powerful SFOM in imaging if they are proven, analytically or empirically, to correlate with task performance. In [32] it is recognized that J is an alternative FOM to linear discriminant analysis (LDA), which is needed when second-order statistics are unequal among the classes. However, instead of optimizing J directly an upper-bound that is quadratic in the channels is used. Once a channel solution is obtained the classification decision is formed using a 1-NN algorithm and accuracy is reported, as opposed to the channelized ideal observer and ROC analysis used in this work. The KL-divergence and Chernoff distances were used in [33] to quantify the effect of compression when the matrix $T$ is populated with random entries and the statistics are described by uncorrelated normal distributions. These results describe upper- and lower-bounds for the J divergence, which are valid with high probability for random channel matrices. In this paper, we are interested in the exceptions to these bounds, i.e., the solutions to $T$ that produce large values of J and are unlikely samples from a random distribution.

D. Relation to Compressive Sensing

In the unrealistic limit of infinite computational resources, and perfect knowledge of the image statistics, efficient channels for mathematical observers are not necessary. As pointed out in [34] “Conventional sensing systems typically first acquire data in an uncompressed form (e.g., individual pixels in an image) and then perform compression for storage or transmission. In contrast, compressive sensing (CS) researchers would like to acquire data in an already compressed form, reducing the quantity of data that need be measured in the first place.” This work on channelized observers is a post-processing method that is complementary to CS. Given the large popularity of this area of research we will remark on the relationship to CS for the sake of clarity and context.

For CS we expand the object function as a finite series in terms of basis functions

f (r) = \sum_{n = 0}^{N} f_{n} ψ_{n} (r) .

The imaging equation [see Eq. (1)] is then replaced with

g = Hf + n .

The $M \times N$ matrix $H$ is called the sensing matrix or system matrix. The astonishing breakthrough of CS was based on optimal signal recovery by selecting $H$ to take advantage of sparsity constraints on $f$ [35]. This leads to asymptotic relationships between residuals of the recovered signal and stochastic models for $H$ , $f$ , and $n$ [36]. However, even in some of the early CS studies, it was recognized that improved signal recovery could be achieved when knowledge of the underlying signal of interest was used to select $H$ [37]. Research in this field has expanded to include evaluating candidate solutions to $H$ with respect to information-based metrics from communications theory, such as Shannon information [38]. It was pointed out in [39] that an information-theoretic metric is particularly attractive as it allows an upper-bound on the task-specific performance of an imager, independent of the algorithm used to extract the relevant information. Although this might be advantageous for hardware design, for post-processing of the image data an algorithm, i.e., mathematical observer, is required. Metrics related to estimation, classification, and detection tasks have also been investigated for CS applications; these metrics are then a function of both $H$ and the mathematical observer [40]. To compare multiple solutions for $H$ a constraint must be used, such as equivalent number of photon counts, so that the noise statistics are part of the trade-off with the measurement scheme and the compression ratio. Increasing the quantity of measurements of $f$ naturally transmits more information about the object, but this increase in the dimensionality of $H$ is penalized by a larger measurement noise contribution. The extreme case of a single measurement of $f$ integrates all available photons and suffers minimal measurement deviation due to noise. The optimal solution for $H$ will depend upon the FOM, the statistics of $f$ , and the statistics of $n$ . By contrast, this work concentrates exclusively on post-processing for detection; a test statistic is calculated from the image data via Eq. (9). We seek the optimal solution for $T$ , which depends upon the chosen FOM for the detection task and the statistics of $g$ . An important distinction between these two problems is that in post-processing $T$ is applied to $Hf$ and to $n$ . Thus, the noise statistics of the channelized image data depends on the channel matrix $T$ .

3. FORMULATION OF THE PROBLEM

As a first step, we will introduce some notation and describe the problem that we are considering. All vectors are column vectors unless otherwise specified. The $M$ -dimensional vector $g$ will represent the input to the classifier, which will classify this vector as either belonging to the population corresponding to the probability density function (PDF) $p r_{1} (g)$ or the population corresponding to the PDF $p r_{2} (g)$ . This vector may be a direct image, a reconstructed image, or the raw data being produced by an imaging system.

The ideal classifier uses the log-likelihood ratio

λ_{I} (g) = \ln [p r_{1} (g)] - \ln [p r_{2} (g)],

as a decision variable and compares the result to a threshold. If the decision variable is above the threshold, then the data vector is assigned to

p r_{1} (g)

, and otherwise it is assigned to

p r_{2} (g)

. This observer maximizes the AUC, as well as other task-based FOMs [2,14]. In imaging applications the dimension

M

of the vector

g

is very large. We will assume for

i = 1,2

that

p r_{i} (g)

is a PDF with mean

{\bar{g}}_{i}

and covariance matrix

K_{i}

. This creates two problems when we try to implement the ideal observer for the classification task. The first problem is computational; even for Gaussian PDFs we will have to invert two

M \times M

covariance matrices

K_{i}

, which may not be feasible if, for example, the input images contain millions of pixels. The second problem is that, if we are estimating the image statistics from data, which is often the case, we will need a very large number of samples to get reliable estimates. For example, in the Gaussian case, the number of samples needs to be at least

M

to get invertible estimates of the

K_{i}

, and typically needs to be an order of magnitude greater to get reliable estimates. This provides a motivation for trying to reduce the dimension of the data vector before implementing the ideal observer.

The data reduction will be implemented by an $L \times M$ dimensional matrix $T$ via the equation $v = Tg$ . We refer to the vector $v$ as the channelized data vector and the rows of $T$ as the channels for the data reduction. The number $L$ is the dimension of the channelized data and satisfies $L ≪ M$ . We will always assume that $T$ is a full rank matrix so that the channels are linearly independent. We will assume that the PDFs for the channelized data for the two populations are given by normal distributions

p r_{i} (v) = {[{(2 π)}^{L} \det ({TK}_{i} T^{†})]}^{- \frac{1}{2}} \times \exp [- \frac{1}{2} {(v - {T \bar{g}}_{i})}^{†} {({TK}_{i} T^{†})}^{- 1} (v - {T \bar{g}}_{i})],

for

i = 1,2

. The channelized data may be at least approximately Gaussian distributed even if the original data are not. In practice, almost all useful channel matrices are fully populated. Therefore, if the original data components are sufficiently independent from each other, then we can argue from the central limit theorem that the channel outputs will be approximately Gaussian distributed for almost all channel matrices [41]. In this case, we would expect the methods developed in this work to produce channel matrices that are very near optimal in terms of the FOMs discussed below.

Given a fixed set of $L$ channels the optimal classifier will compute the channelized log-likelihood

λ (T, g) = \ln [p r_{1} (Tg)] - \ln [p r_{2} (Tg)],

and compare this decision variable to a threshold. For computational purposes we define, for

i = 1,2

Q_{i} (v) = {(v - {T \bar{g}}_{i})}^{†} {({TK}_{i} T^{†})}^{- 1} (v - {T \bar{g}}_{i}) .

Then we have

λ_{CQO} (T, g) = \frac{1}{2} [Q_{2} (Tg) - Q_{1} (Tg) - \ln \det ({TK}_{1} T^{†}) + \ln \det ({TK}_{2} T^{†})],

where the notation denotes the CQO. The last two terms are usually dropped, since they do not depend on the data. We will not be doing this because we want to vary the channel matrix to search for an optimal set of channels for a given

L

and therefore we need an unmodified likelihood ratio. All of the FOMs we will be considering depend on moments of

λ_{CQO} (T, g)

, which we will be referring to as

λ (v)

.

For any function $f (v)$ we will use the notation ${〈 f (v) 〉}_{i}$ for the expectation of $f (v)$ in each class,

{〈 f (v) 〉}_{i} = \int_{R^{L}} f (v) p r_{i} (v) d^{L} v .

The following FOMs will be examined and optimal channels will be determined for each. The FOMs are listed in increasing order of the difficulty of the channel optimization calculations. The first two FOMs are Kullback–Liebler (KL) divergences,

F_{1} (T) = {〈 λ (v) 〉}_{1} = D_{K L} (p r_{1} ‖ p r_{2}),

F_{2} (T) = - 〈 λ (v) 〉_{2} = D_{K L} (p r_{2} ‖ p r_{1});

we will refer to these FOMs as KL1 and KL2, respectively. (Note that, from here onward,

p r_{i}

will refer to the corresponding PDF for the channelized data.) These two FOMs are measures of the difference between two PDFs and are used extensively in information theory. The third FOM is the symmetrized KL divergence, commonly referred to as J,

F_{3} (T) = {〈 λ (v) 〉}_{1} - {〈 λ (v) 〉}_{2} .

This symmetric measure of the difference between two PDFs has been related to the ideal-observer AUC in applications where ${\bar{g}}_{1} = {\bar{g}}_{2}$ , such as ultrasound imaging [6 –8]. This will be an important special case in our development, below. The fourth FOM is called G(0) and has also been related to ideal-observer detectability [12,42]. This FOM is given by

F_{4} (T) = - 4 \ln {〈 \exp [- \frac{1}{2} λ (v)] 〉}_{1}, = - 4 \ln {〈 \exp [\frac{1}{2} λ (v)] 〉}_{2}, = 4 D_{B} (p r_{1}, p r_{2}) .

In this expression $D_{B} (p r_{1}, p r_{2})$ is the Bhattacharyya distance between the two PDFs, which also has applications in information theory. The fifth and final FOM we will consider for optimizing the channel matrix is the AUC for the channelized ideal observer. It is not obvious, but it can be shown [12] that this FOM can be written as

F_{5} (T) = 1 - \frac{1}{4 π} \int_{- \infty}^{\infty} {| {〈 \exp [(\frac{1}{2} + i α) λ (v)] 〉}_{2} |}^{2} \frac{d α}{α^{2} + \frac{1}{4}} .

This is not the usual expression for the ideal-observer AUC, but it will be useful for our calculations. The optimization problems we are considering may now be stated.

Optimization Problem $j$ : With the normality assumption above, and for a given value of $L$ , what channel matrix $T$ will maximize $F_{j} (T)$ ?

A proper FOM for channel optimization should satisfy

F (MT) = F (T)

for any invertible

L \times L

matrix

M

. This is because the channels for

MT

are all linear combinations of the channels for

T

, and vice-versa. We will see below that the FOMs

F_{j} (T)

defined above all satisfy this requirement. This condition implies that the FOM can be defined as a function on the Grassmanian manifold

Gr (L, M)

[43]. Since this is a compact manifold and all the FOMs defined above are continuous functions, we know that there is a maximum for each

F_{j} (T)

. Methods have been developed for optimization on Grassmannians that use coordinate systems on the manifold itself [44]. However, the calculations described below do not follow this approach, but instead simply compute the gradient of

F_{j} (T)

with respect to

T

and set it equal to zero. In a sense, we may regard the entries of the matrix

T

as a set of homogeneous coordinates for the corresponding point on

Gr (L, M)

and we have found that, for the functions that we are optimizing, these coordinates are useful.

In all that follows we will define the signal in the mean as $s = {\bar{g}}_{1} - {\bar{g}}_{2}$ . If $K_{1} = K_{2} = K$ , then the ideal linear observer is equivalent to one that computes the test statistic

λ_{H} (g) = {(K^{- 1} s)}^{†} g,

and compares to a threshold. In the perception literature this is called the Hotelling observer and in other applications is called a prewhitened matched filter or the Fisher discriminant. This test statistic is a sufficient statistic for the detection task when the data are Gaussian and the covariances are equal. A single channel directly reduces the data to a scalar, i.e.,

L = 1

. A set of channels will be optimal, in this case, if some linear combination of them is equal to the Hotelling template

{(K^{- 1} s)}^{†}

. In other words, if there is an

L

-dimensional vector

c

such that

c^{†} T = {(K^{- 1} s)}^{†}

. Note that if

T

is replaced by

MT

as above, then we can replace

c

with

{(M^{†})}^{- 1} c

and this condition will still be satisfied. Thus, optimality is a property of a point on

Gr (L, M)

and does not depend on the specific matrix

T

used to represent that point. Having disposed of this special case, we will now assume for the remainder of this paper that

K_{1} \neq K_{2}

so that the ideal observer is a quadratic observer.

4. KL1 AND KL2

First, we compute the FOM, a computation presented in the Appendices. To describe the end result we introduce notation for the covariance matrices of the channelized data for each data class

C_{i} (T) = {TK}_{i} T^{†} .

Then we define the matrix function

R_{12} (T) = C_{2}^{- 1} (T) C_{1} (T),

where the inverse exists since

T

is full rank. Next, we define the scalar function

H_{2} (T) = s^{†} T^{†} C_{2}^{- 1} (T) Ts .

The KL1 FOM can now be written as

2 F_{1} (T) = - L + tr [R_{12} (T)] + H_{2} (T) - \ln \det [R_{12} (T)] .

The second term in this KL divergence is dominated by the largest eigenvalues of $R_{12} (T)$ . The third term is a Hotelling trace term in the channelized data space for the second-hypothesis covariance. The last term is dominated by the smallest eigenvalues of $R_{12} (T)$ if we assume that they are near zero. Now, we wish to maximize this expectation over all possible channel matrices of a given dimension. To do this we will compute the gradient $\nabla F_{1} (T)$ of $F_{1} (T)$ and set it equal to zero. This gradient is an $L \times M$ matrix-valued function of $T$ and is defined by the relation

{\frac{d}{d t} F_{1} (T + t E) |}_{t = 0} = tr [E^{†} \nabla F_{1} (T)],

which holds for all

L \times M

matrices

E

. We will use the notation

Δ_{1} (T) = \nabla F_{1} (T)

. From Appendix A we then have

Δ_{1} (T) = C_{2}^{- 1} (T) T (K_{1} + {ss}^{†}) [I - T^{†} C_{2}^{- 1} (T) {TK}_{2}] - C_{2}^{- 1} (T) {TK}_{2} + C_{1}^{- 1} (T) {TK}_{1} .

At present we do not have an analytic solution to $Δ_{1} (T) = 0$ unless $s = 0$ , which is a special case that we will examine below. The simplest iterative algorithm to try is

T^{(n + 1)} = T^{(n)} + ϵ Δ_{1} (T^{(n)}),

where the iterative process ends at

n = \tilde{n}

for some stopping criteria

T^{(\tilde{n})} \approx T^{(\tilde{n} + 1)}

. The decision variable for this CQO is then calculated by

λ_{CQO} (T^{(\tilde{n})}, g) .

We have successfully used this type of algorithm for the J-FOM; see the Simulations section for these results. Two interesting properties of the gradient, which follow from the invariance property of the FOM, are $Δ_{1} (T) T^{†} = 0$ and $Δ_{1} (MT) = {(M^{†})}^{- 1} Δ_{1} (T)$ . Note that the second property implies that the iteration depends explicitly on the starting matrix $T^{(0)}$ . If a different matrix ${MT}^{(0)}$ that represents the same subspace is used, then the iteration follows a different path on the Gassmannian $Gr (L, M)$ . This has not proved to be a problem in implementation so far, but an algorithm that does not have this property is

T^{(n + 1)} = T^{(n)} [I + ϵ {(T^{(n)})}^{†} Δ_{1} (T^{(n)})] .

This algorithm follows the same path in $Gr (L, M)$ no matter which channel matrix is used to represent the initial subspace.

When the signal is not in the mean $s = 0$ , an analytic solution to the gradient equation $Δ_{1} (T) = 0$ is available.

Proposition 1: With the normality assumptions used in this paper, and when there is no signal in the mean, an optimal channel matrix for the KL1 FOM $F_{1} (T)$ can be constructed by using $L$ eigenvectors of $K_{2}^{- 1} K_{1}$ for the channels, with corresponding eigenvalues $κ_{l}$ . These eigenvectors are chosen to have the $L$ largest values of $κ_{l} - \ln κ_{l}$ . If this channel matrix is denoted $T_{eig}$ then the decision variable for this Eigen-CQO is calculated by

λ_{CQO} (T_{eig}, g) .

Proof: With no signal in the mean the gradient equation can now be written as

[R_{12} (T) + I] {({TK}_{2} T^{†})}^{- 1} {TK}_{2} = [R_{12} (T) + I] {({TK}_{1} T^{†})}^{- 1} {TK}_{1} .

The eigenvalues of $R_{12} (T)$ are the same as the eigenvalues of

C_{2}^{\frac{1}{2}} (T) R_{12} (T) C_{2}^{- \frac{1}{2}} (T) = C_{2}^{- \frac{1}{2}} (T) C_{1} (T) C_{2}^{- \frac{1}{2}} (T),

which is symmetric and positive definite. These eigenvalues are therefore real and positive. This implies that

R_{12} (T) + I

is invertible and may be removed from the gradient equation. After doing this, taking the adjoint, and performing some additional matrix algebra we have

K_{2}^{- 1} K_{1} T^{†} = T^{†} R_{12} (T) .

Now let $M$ be an invertible $L \times L$ matrix. If the channel matrix $T$ produces the maximum value for $F_{1} (T)$ , then the matrix $MT$ does also. Setting the gradient equal to zero for this matrix gives

K_{2}^{- 1} K_{1} T^{†} M^{†} = T^{†} M^{†} [{(M^{†})}^{- 1} R_{12} (T) M^{†}],

which is indeed equivalent to the original gradient equation for

T

. Since, as noted above,

R_{12} (T)

is related to a symmetric positive definite matrix by a similarity transformation, it is diagonalizable and, as we said before, the eigenvalues are all positive real numbers. If we choose

M

to diagonalize

R_{12} (T)

, then we have

K_{2}^{- 1} K_{1} T^{†} M^{†} = T^{†} M^{†} Λ (T),

where

Λ (T)

is a diagonal matrix. Therefore, the columns of

{(MT)}^{†}

are eigenvectors of

K_{2}^{- 1} K_{1}

and the numbers on the main diagonal of

Λ (T)

are eigenvalues of

K_{2}^{- 1} K_{1}

. Now, we will replace

T

with

MT

in

F_{1} (T)

and find that, when the gradient is zero at

T

, we have

F (T) = F (MT) = \sum_{l} (κ_{l} - \ln κ_{l}),

where the sum is over the eigenvalues that appear in

Λ (T)

. Therefore, to maximize

F_{1} (T)

we should choose the rows of

T

to be the

L

eigenvectors

t_{l}

of

K_{2}^{- 1} K_{1}

that have the

L

largest values for

κ_{l} - \ln κ_{l}

. ▪

We will find similar answers for other FOMs, the only difference being which eigenvectors to choose. Note that the analytic solution depends on inverting $K_{2}$ and finding the eigenvectors for $K_{2}^{- 1} K_{1}$ . Since these are both $M \times M$ matrices, this may be computationally difficult. On the other hand, the iterative algorithms only require inverting $C_{1} (T)$ and $C_{2} (T)$ , which are $L \times L$ matrices. Thus, even when the signal in the mean is zero, we may want to use an iterative algorithm for practical reasons. There is a problem with local maxima for the iterative algorithms presented above and this will be discussed in detail for the J FOM, below.

With obvious notational changes, the results for the KL2 FOM can be written down from the KL1 FOM by interchanging $1 \leftrightarrow 2$ in the subscripts. We then have

2 F_{2} (T) = - L + tr [R_{21} (T)] + H_{1} (T) - \ln \det [R_{21} (T)] .

The gradient of $F_{2} (T)$ is given by

Δ_{2} (T) = C_{1}^{- 1} (T) T (K_{2} + {ss}^{†}) [I - T^{†} C_{1}^{- 1} (T) {TK}_{1}] - C_{1}^{- 1} (T) {TK}_{1} + C_{2}^{- 1} (T) {TK}_{2} .

We can use this in iterative algorithms, as we did for the KL1 FOM. For no signal in the mean, there is again an analytic solution to the optimization problem.

Proposition 2: With the normality assumptions used in this paper, and when there is no signal in the mean, an optimal channel matrix for the KL2 FOM $F_{2} (T)$ can be constructed by using $L$ eigenvectors of $K_{2}^{- 1} K_{1}$ for the channels, with corresponding eigenvalues $κ_{l}$ . These eigenvectors are chosen to have the $L$ largest values of $κ_{l}^{- 1} + \ln κ_{l}$ .

Proof: We only need to note that $K_{1}^{- 1} K_{2} = {(K_{2}^{- 1} K_{1})}^{- 1}$ and repeat the proof of Proposition 1 with the index change $1 \leftrightarrow 2$ . ▪

Note that the optimal KL1 and KL2 channel matrices both draw from the eigenvectors of $K_{2}^{- 1} K_{1}$ but may choose different channels due to the different emphasis on using the large as opposed to the small eigenvalues of this matrix.

5.

J

Since $F_{3} (T) = F_{1} (T) + F_{2} (T)$ the symmetrized KL divergence is given by

2 F_{3} (T) = - 2 L + tr [R_{12} (T)] + H_{2} (T) + tr [R_{21} (T)] + H_{1} (T) .

The gradient of this FOM is

Δ_{3} (T) = C_{1}^{- 1} (T) T (K_{2} + {ss}^{†}) [I - T^{†} C_{1}^{- 1} (T) {TK}_{1}] + C_{2}^{- 1} (T) T (K_{1} + {ss}^{†}) [I - T^{†} C_{2}^{- 1} (T) {TK}_{2}] .

This gradient can be put into either of the iterative algorithms and is the one used in the simulations, for reasons to be discussed below. When there is no signal in the mean the equation $Δ_{3} (T) = 0$ can again be solved analytically.

Proposition 3: With the normality assumptions used in this paper, and when there is no signal in the mean, an optimal channel matrix for the J FOM $F_{3} (T)$ can be constructed by using $L$ eigenvectors of $K_{2}^{- 1} K_{1}$ for the channels, with corresponding eigenvalues $κ_{l}$ . These eigenvectors are chosen to have the $L$ largest values of $κ_{l} + κ_{l}^{- 1}$ .

Proof: With no signal in the mean the equation $Δ_{3} (T) = 0$ can be written as

[R_{12} (T) - R_{21} (T)] {({TK}_{1} T^{†})}^{- 1} {TK}_{1} = [R_{12} (T) - R_{21} (T)] {({TK}_{2} T^{†})}^{- 1} {TK}_{2} .

We will assume that the matrix in the square brackets is invertible and check this assumption later. This leads to the familiar equation

K_{2}^{- 1} K_{1} T^{†} = T^{†} R_{12} (T) .

As in Proposition 1, this leads to the result that an optimal channel matrix may be constructed from $L$ eigenvectors of $K_{2}^{- 1} K_{1}$ with corresponding eigenvalues $κ_{l}$ . The value of the J FOM at this matrix is

F_{3} (T) = \frac{1}{2} \sum_{l} (κ_{l} + \frac{1}{κ_{l}}) - L .

This expression leads to the last statement in the proposition.

Now we need to go back and check the invertibility assumption for the matrix $R_{12} (T) - R_{21} (T)$ . For positive $κ$ , the minimum value for $κ + κ^{- 1}$ occurs at $κ = 1$ . If the corresponding eigenvector ends up in the channel matrix $T$ , then $R_{12} (T) - R_{21} (T)$ will not be invertible, since this eigenvector would be a null vector of this matrix. However, there would be no point in including such a channel since it would not help to distinguish the two hypotheses. We can see this by writing the eigenvector equation for the channels as $K_{1} t_{l} = κ_{l} K_{2} t_{l}$ . We can use the properties of symmetric matrices to show that these eigenvectors may be chosen to satisfy the dual orthogonality properties,

t_{l} K_{2}^{†} t_{l^{'}} = δ_{l l^{'}}, t_{l} K_{1}^{†} t_{l^{'}} = κ_{l} δ_{l l^{'}} .

Thus, the optimal channelized vector components are independent under both hypotheses, since they are Gaussian distributed in both cases. Also, if $κ_{l} = 1$ , then the corresponding component has the same variance under both hypotheses. Thus, that component will not appear in the log-likelihood, and we do not need to use it. The end result is that we can remove this row from the channel matrix $T$ without any loss of information and reduce the number of channels by one. If, in the process of choosing eigenvectors that maximize $κ_{l} + κ_{l}^{- 1}$ , we are forced to choose one with $κ_{l} = 1$ , then we do not need $L$ channels at all. In this case there would be no loss of information in using $L - 1$ channels. Of course, there is also no harm in including such a channel and, in this sense, the statement of the proposition still holds. ▪

Note that the set of eigenvectors we choose from for optimizing J, is the same as for optimizing KL1 and KL2 but, again, we may make different choices for the $L$ eigenvectors to use for the channel matrix $T$ . We will see that the last two FOMs lead to the same choice of the $L$ eigenvectors for the optimal channel matrix as the J FOM.

6. G(0)

The expectation that appears in the G(0) FOM can be computed after some effort. We know that $F_{4} (T) = - 4 \ln I (T)$ , where

I (T) = \int_{R^{L}} {[p r_{1} (v) p r_{2} (v)]}^{\frac{1}{2}} d^{L} v .

We need to complete the square on the exponent that appears in the integrand. First, we define a symmetric matrix $C (T)$ by $2 C^{- 1} (T) = C_{1}^{- 1} (T) + C_{2}^{- 1} (T)$ and the vector $c (T)$ by $2 c (T) = C (T) [C_{1}^{- 1} (T) {T \bar{g}}_{1} + C_{2}^{- 1} (T) {T \bar{g}}_{2}]$ . Next, we define the function $H (T)$ by

4 H (T) = 2 c^{†} (T) C^{- 1} (T) c (T) - {\bar{g}}_{1}^{†} T^{†} C_{1}^{- 1} (T) {T \bar{g}}_{1} - {\bar{g}}_{2}^{†} T^{†} C_{2}^{- 1} (T) {T \bar{g}}_{2} .

Then we have the following relation,

{[v - c (T)]}^{†} C^{- 1} (T) [v - c (T)] - H (T) = \frac{1}{2} {(v - {T \bar{g}}_{1})}^{†} C_{1}^{- 1} (T) (v - {T \bar{g}}_{1}) + \frac{1}{2} {(v - {T \bar{g}}_{2})}^{†} C_{2}^{- 1} (T) (v - {T \bar{g}}_{2}) .

After performing the integral we have an expression for the G(0) FOM:

F_{4} (T) = - 2 \ln \det C (T) + \ln \det C_{1} (T) + \ln \det C_{2} (T) + H (T) .

Taking the gradient of this expression is very complicated and we will not pursue it further. However, when there is no signal in the mean $H (T) = 0$ and the gradient calculation simplifies considerably. This equal mean case was considered in [29] and includes a proof that the optimal channel matrix for the G(0) FOM $F_{4} (T)$ can be constructed by using $L$ eigenvectors of $K_{2}^{- 1} K_{1}$ for the channels, with corresponding eigenvalues $κ_{l}$ . These eigenvectors are chosen to have the $L$ largest values of $κ_{l} + κ_{l}^{- 1}$ .

This means that a channel matrix that is optimal for the G(0) FOM is also optimal for the J FOM, and vice versa. This happens even though these two SFOMs are clearly different. Both of these FOMs have been put forward as good approximations to the square of the ideal-observer detectability $d_{A} (T)$ [12], which is defined by the relation

F_{5} (T) = \frac{1}{2} + \frac{1}{2} \erf [\frac{1}{2} d_{A} (T)],

where

\erf [\cdot]

is the error function. It remains to be seen which of these two FOMs gives a better approximation to this detectability.

7. AUC

For the AUC FOM we will only consider the case where there is no signal in the mean. The result in this case is familiar by now.

Proposition 4: With the normality assumptions used in this paper, and when there is no signal in the mean, an optimal channel matrix for the AUC FOM $F_{5} (T)$ can be constructed by using $L$ eigenvectors of $K_{2}^{- 1} K_{1}$ for the channels, with corresponding eigenvalues $κ_{l}$ . These eigenvectors are chosen to have the $L$ largest values of $κ_{l} + κ_{l}^{- 1}$ .

Proof: We will assume that ${\bar{g}}_{1} = {\bar{g}}_{2} = 0$ . This involves no loss of generality since we could subtract the known mean channelized data vector from the channelized data vector as a preprocessing step that would not affect the channelized ideal observer (CIO) AUC. As noted above, the CIO AUC can be written as

F_{5} (T) = 1 - \frac{1}{4 π} \int_{- \infty}^{\infty} {| {〈 \exp [(\frac{1}{2} + i α) λ (v)] 〉}_{2} |}^{2} \frac{d α}{α^{2} + \frac{1}{4}} .

The expectation factors into two terms, one from the Gaussian normalization factors and one from the expectation integral

{〈 Λ^{\frac{1}{2} + i α} (v) 〉}_{2} = N (α, T) I (α, T) .

The normalization component is

N (α, T) = {[\frac{1}{\sqrt{{(2 π)}^{L} \det C_{1} (T)}}]}^{\frac{1}{2} + i α} \times {[\frac{1}{\sqrt{{(2 π)}^{L} \det C_{2} (T)}}]}^{\frac{1}{2} - i α} .

The square magnitude of this factor

{| N (α, T) |}^{2} = \frac{1}{{(2 π)}^{L}} [\frac{1}{\sqrt{\det C_{1} (T)}}] [\frac{1}{\sqrt{\det C_{1} (T)}}]

does not depend on

α

and therefore factors out of the

α

integral. The integral factor

I (α, T)

is given by

\int_{R^{L}} \exp {- \frac{1}{4} v^{†} [C_{1}^{- 1} (T) + C_{2}^{- 1} (T)] v - \frac{i α}{2} v^{†} [C_{1}^{- 1} (T) - C_{2}^{- 1} (T)] v} d v .

Now we replace $T$ with $MT$ . This does not change the CIO AUC since the variable $v$ in the integral will be replaced by $u = Mv$ and the Jacobian factor from the change of variable in $I (α, T)$ will cancel out with a $| \det (M) |$ factor from $N (α, T)$ . We may choose $M$ to diagonalize both channel covariance matrices with ${MC}_{2} (T) M^{†} = I$ and ${MC}_{1} (T) M^{†} = D (T)$ . The diagonal matrix $D (T)$ has the eigenvalues of $R_{12} (T)$ along the diagonal. These diagonal entries are also the eigenvalues of $C_{2}^{- \frac{1}{2}} (T) C_{1} (T) C_{2}^{- \frac{1}{2}} (T)$ , which is symmetric and positive definite. Therefore, the diagonal entries of $D (T)$ are real and positive. We will call these numbers $μ_{l}$ for $l = 1, \dots, L$ . Now we have for the normalization component

{| N (α, MT) |}^{2} = \frac{1}{{(2 π)}^{L} \sqrt{\prod_{l = 1}^{L} μ_{l}}} .

We may also write this equation as

{| N (α, MT) |}^{2} = \frac{1}{{(2 π)}^{L} \sqrt{\det R_{12} (T)}} .

For the integral factor we have

I (α, MT) = \int_{R^{L}} \exp [- \frac{1}{4} \sum_{l = 1}^{L} (\frac{1}{μ_{l}} + 1) u_{l}^{2} - \frac{i α}{2} \sum_{l = 1}^{L} (\frac{1}{μ_{l}} - 1) u_{l}^{2}] d^{L} u .

This is a product of one dimensional integrals and can be computed analytically with the result

I (α, MT) = π^{\frac{L}{2}} \prod_{l = 1}^{L} {[\frac{1}{4} (\frac{1}{μ_{l}} + 1) + \frac{i α}{2} (\frac{1}{μ_{l}} - 1)]}^{- \frac{1}{2}} .

We may also write this equation as

I (α, MT) = \frac{{(2 π)}^{\frac{L}{2}}}{\sqrt{\det [(\frac{1}{2} + i α) R_{21} (T) + (\frac{1}{2} - i α) I]}},

since

R_{21} (T) = R_{12}^{- 1} (T)

. Combining the results so far we have

{| {〈 Λ^{\frac{1}{2} + i α} (v) 〉}_{2} |}^{2} = {[P (α, T)]}^{- \frac{1}{2}}

with

P (α, T) = \det P (α, T)

and

P (α, T)

is the matrix given by

[(\frac{1}{2} + i α) R_{12} (T) + (\frac{1}{2} - i α) I] \times [(\frac{1}{2} + i α) R_{21} (T) + (\frac{1}{2} - i α) I] .

This matrix simplifies to

P (α, T) = (\frac{1}{4} + α^{2}) [R_{12} (T) + R_{21} (T)] + 2 (\frac{1}{4} - α^{2}) I .

We now have

F_{5} (T) = 1 - \frac{1}{4 π} \int_{- \infty}^{\infty} {[P (α, T)]}^{- \frac{1}{2}} \frac{d α}{α^{2} + \frac{1}{4}},

which we can also write as

F_{5} (T) = 1 - \frac{1}{4 π} \int_{- \infty}^{\infty} \prod_{l = 1}^{L} [(\frac{1}{4} + α^{2}) (μ_{l} + \frac{1}{μ_{l}}) {+ 2 (\frac{1}{4} - α^{2})]}^{- \frac{1}{2}} \frac{d α}{α^{2} + \frac{1}{4}} .

To compute the gradient of $F_{5} (T)$ we need the gradient of $P (α, T)$ . This can be found in the Appendices. The end result of setting $Δ_{5} (T) = 0$ is the familiar equation

K_{2}^{- 1} K_{1} T^{†} = T^{†} R_{12} (T) .

We may conclude, as before, that the $μ_{l}$ are eigenvalues of $K_{2}^{- 1} K_{1}$ and that we may take the rows of the channel matrix $T$ to be the corresponding eigenvectors.

We will replace $μ_{l}$ with $κ_{l}$ now and pick the $κ_{l}$ to minimize the integral expression

\int_{- \infty}^{\infty} \prod_{l = 1}^{L} [(\frac{1}{4} + α^{2}) (κ_{l} + \frac{1}{κ_{l}}) {+ 2 (\frac{1}{4} - α^{2})]}^{- \frac{1}{2}} \frac{d α}{α^{2} + \frac{1}{4}} .

The corresponding eigenvectors of $K_{2}^{- 1} K_{1}$ will give the rows of the optimum channel matrix. This integral expression can also be written as

\int_{- \infty}^{\infty} \prod_{l = 1}^{L} {[\frac{1}{4} \frac{{(κ_{l} + 1)}^{2}}{κ_{l}} + α^{2} \frac{{(κ_{l} - 1)}^{2}}{κ_{l}}]}^{- \frac{1}{2}} \frac{d α}{α^{2} + \frac{1}{4}} .

Now, define

y_{l} = \frac{4 κ_{l}}{{(κ_{l} + 1)}^{2}},

and write the integral expression we want to minimize by our choice of the

κ_{l}

as

\int_{- \infty}^{\infty} \prod_{l = 1}^{L} \sqrt{\frac{y_{l}}{1 + 4 α^{2} (1 - y_{l})}} \frac{d α}{α^{2} + \frac{1}{4}} .

Now, there are two facts we can use to solve our problem. A little algebra shows that $0 < y_{l} \leq 1$ since all of the $κ_{l}$ are positive. Within this range the function

f (y) = \frac{y}{1 + 4 α^{2} (1 - y)}

is monotonic, increasing for any

α

. Therefore, we should choose the eigenvalues

κ_{l}

that produce the

L

minimum values for

y_{l}

. But

\frac{1}{y_{l}} = \frac{1}{4} (κ_{l} + \frac{1}{κ_{l}} + 2) .

Therefore, we may conclude that to maximize

F_{5} (T)

we should choose the rows of

T

to be the eigenvectors of

K_{2}^{- 1} K_{1}

whose corresponding eigenvalues

κ_{l}

produce the

L

largest values for

κ_{l} + κ_{l}^{- 1}

. If we go back to Appendix B and examine the matrices

U_{1} (T)

and

U_{2} (T)

for this channel matrix we find that they are diagonal and invertible as long as

κ_{l} \neq 1

. If any of these eigenvalues are unity, then this situation can be dealt with as in Proposition 3. ▪

The optimal channel matrix for the AUC FOM is the same optimal channel matrix we arrived at for the J and G(0) FOMs. Thus, these quantities can be regarded as surrogate FOMs for the CIO AUC in the sense that they are optimized at the same channel matrix that optimizes the AUC FOM, which is the preferred FOM since it is directly related to task performance. On the other hand the J and G(0) FOMs are also easier to compute than the AUC FOM. The J FOM, in particular, is much easier to deal with than the AUC FOM and, since it leads to the same optimal channel matrix, we have used the J FOM in our simulations. This leads to an examination of the possibility of local maxima for the J FOM, which we examine in the next section.

8. LOCAL MAXIMA FOR J WITH NO SIGNAL IN THE MEAN

To determine the local maxima we need the Hessian for $F_{3} (T)$ . This Hessian evaluated at $T$ is a linear operator $D (T)$ from the space of $L \times M$ matrices to itself. If $E$ is an arbitrary $L \times M$ matrix, then $D (T)$ is determined uniquely as the operator that satisfies

{\frac{d^{2}}{d t^{2}} F_{3} (T + t E) |}_{t = 0} = tr [E^{†} D (T) E]

for all

E

. In this notation

D (T) E

is the

L \times M

matrix that results from the linear operator

D (T)

acting on

E

. Specifically, we are interested in the eigenvalues of the Hessian at the points where the gradient of

F_{3} (T)

is zero.

Proposition 5: With the normality assumptions used in this paper, and when there is no signal in the mean, any local maximum for the J FOM $F_{5} (T)$ can be constructed by using $L$ eigenvectors of $K_{2}^{- 1} K_{1}$ for the channels, with corresponding eigenvalues $κ_{l}$ . For an integer $N$ with $0 \leq N \leq L$ these eigenvectors must be chosen to have the $N$ largest values of $κ_{l}$ that are also $> 1$ , and the $L - N$ smallest values of $κ_{l}$ that are also $< 1$ . Thus, barring degeneracies, there are no more than $L + 1$ local maxima, one of which is the global maximum. ▪

Proof: We will fix the channel matrix $T$ and the perturbation matrix $E$ and write $C_{i} (T + t E) = C_{1} (t)$ and $R_{i j} (T + t E) = R_{i j} (t)$ . Then we have $F_{3} (t) = tr [R_{12} (t)] + tr [R_{21} (t)]$ with the relations $C_{2} (t) R_{12} (t) = C_{1} (t)$ and $C_{1} (t) R_{21} (t) = C_{2} (t)$ . The basic ingredients of the calculation are the following equations:

C_{j} (0) = {TK}_{j} T^{†},

C_{j}^{'} (0) = {EK}_{j} T^{†} + {TK}_{j} E^{†},

C_{j}^{''} (0) = 2 {EK}_{j} E^{†},

R_{i j}^{'} (0) = C_{j}^{- 1} (0) [C_{i}^{'} (0) - C_{j}^{'} (0) R_{i j} (0)],

R_{i j}^{''} (0) = C_{j}^{- 1} (0) [C_{i}^{''} (0) - C_{j}^{''} (0) R_{i j} (0) - 2 C_{j}^{'} (0) R_{i j}^{'} (0)],

F_{3}^{''} (0) = tr [R_{12}^{''} (0)] + tr [R_{21}^{''} (0)] .

With these equations we can compute $tr [E^{†} D (T) E]$ for any channel matrix $T$ and perturbation matrix $E$ .

When we are at a point where the gradient is zero, then, as noted above, we may assume that $K_{1} T^{†} = K_{2} T^{†} Λ$ where $Λ$ is a diagonal matrix. We also may assume the dual orthogonality conditions [see Eq. (41)] on the rows of $T$ , which we can write as $C_{1} (0) = Λ$ and $C_{2} (0) = I$ .

To proceed further it is useful to define the matrix $T_{0}$ as the $(M - L) \times M$ matrix whose rows form the set of eigenvectors of $K_{2}^{- 1} K_{1}$ that do not appear in the given channel matrix $T$ . We also define $Λ_{0}$ as the diagonal matrix with the corresponding eigenvalues $κ_{0 m}$ on the diagonal. Now we have the orthogonality relations $T_{0} K_{j} T^{†} = 0$ and ${TK}_{j} T_{0}^{†} = 0$ . We may also assume that these eigenvectors satisfy $T_{0} K_{1} T_{0}^{†} = Λ_{0}$ and $T_{0} K_{2} T_{0}^{†} = I$ . We may then write a decomposition of the perturbation matrix $E = BT + B_{0} T_{0}$ where $B$ is $L \times L$ and $B_{0}$ is $L \times (M - L)$ . Then $R_{12} (0) = Λ$ and $R_{21} (0) = Λ^{- 1}$ and

C_{1}^{'} (0) = B Λ + Λ B^{†}

C_{2}^{'} (0) = B + B^{†}

C_{1}^{''} (0) = 2 B Λ B^{†} + 2 B_{0} Λ_{0} B_{0}^{†}

C_{2}^{''} (0) = 2 {BB}^{†} + 2 B_{0} B_{0}^{†} .

Placing these in the formula above for $F_{3}^{''} (0)$ we find that all of the terms with $B$ in them cancel, as they should since a perturbation of the form $E = BT$ would give $F_{3} (t) = F_{3} (0)$ for all $t$ in an interval about 0. Therefore, we may restrict our perturbations to those of the form $E = B_{0} T_{0}$ , since these are the directions that move us away from the point corresponding to $T$ on the Grassmann manifold $Gr (L, M)$ . The final result is

tr [E^{†} D (T) E] = 2 tr [(Λ^{- 1} - Λ) (B_{0} B_{0}^{†} - Λ^{- 1} B_{0} Λ_{0} B_{0}^{†})] .

Using the orthogonality relations of the eigenvectors we can also write this as

tr [E^{†} D (T) E] = 2 tr [(B_{0} T_{0})^{†} (Λ^{- 1} - Λ) (B_{0} - Λ^{- 1} B_{0} Λ_{0}) T_{0} K_{2}] .

Therefore, the Hessian operator at a point on $Gr (L, M)$ corresponding to a channel matrix $T$ where the gradient of $F_{3} (T)$ is zero is given by

D (T) B_{0} T_{0} = (Λ^{- 1} - Λ) (B_{0} - Λ^{- 1} B_{0} Λ_{0}) T_{0} K_{2} .

We want to know the eigenvalues of this operator.

First, we will show that, without loss of generality, we may assume that $K_{2}$ is the identity matrix. If this is not the case define a new data vector by $g^{'} = K_{2}^{- \frac{1}{2}} g$ . Then, in the obvious notation, we have new covariance matrices $K_{2}^{'} = I$ and $K_{1}^{'} = K_{2}^{- \frac{1}{2}} K_{1} K_{2}^{- \frac{1}{2}}$ . If $T$ is a channel matrix, define $T^{'} = {TK}_{2}^{\frac{1}{2}}$ . Then we have for channel outputs $v = Tg = T^{'} g^{'}$ . It may also be easily verified that

F_{3}^{'} (T^{'}) = tr [(T^{'} K_{2}^{'} T^{' †})^{- 1} (T^{'} K_{1}^{'} T^{' †})] + tr [{(T^{'} K_{1}^{'} T^{' †})}^{- 1} (T^{'} K_{2}^{'} T^{' †})] = F_{3} (T),

which we would expect since the channel outputs are the same. Therefore,

T

is a local maximum for

F_{3} (T)

if and only if

T^{'}

is a local maximum for

F_{3}^{'} (T^{'})

. Also note that the eigenvalues of

K_{2}^{- 1} K_{1}

are the same as those for

{({K^{'}}_{2})}^{- 1} {K^{'}}_{1}

, although the eigenvectors are different.

Dropping the primes we will now assume that $K_{2} = I$ . Let us define an $L \times (M - L)$ matrix by ${[B_{0}^{p q}]}_{l m} = δ_{p l} δ_{q m}$ . We then have

D (T) B_{0}^{p q} T_{0} = (\frac{1}{κ_{p}} - κ_{p}) (1 - \frac{κ_{0 q}}{κ_{p}}) B_{0}^{p q} T_{0} .

We have a local maximum if and only if all of the eigenvalues of $D (T)$ are negative. Therefore, the necessary and sufficient conditions for a local maximum are:

1) For all $κ_{p} < 1$ we have $κ_{0 q} > κ_{p}$ for all $q$
2) For all $κ_{p} > 1$ we have $κ_{0 q} < κ_{p}$ for all $q$ .

Therefore, the $κ_{p}$ at a local maximum must consist of the $N$ largest eigenvalues of $K_{2}^{- 1} K_{1}$ that are also $> 1$ , and the $L - N$ smallest eigenvalues that are also $< 1$ , for some $N$ with $0 \leq N \leq L$ . Thus, barring degeneracies, there are no more than $L + 1$ local maxima, one of which is the global maximum. ▪

In principle running a gradient-based algorithm repeatedly from different starting points would eventually locate all of the local maxima, and thus be able to determine the global maximum. The corresponding channel matrix will also provide a global maximum for the G(0) and AUC FOMs.

9. SIMULATIONS

In this section, the AUC of the J-CQO calculated from a gradient-based search [see Eq. (25)] is compared to three conventional observers: ideal [see Eq. (5)], Hotelling [see Eq. (17)], and Eigen-CQO [see Eq. (27)] under several simulated imaging conditions which correspond to various differences in the first- and second-order statistics of the two classes. The two classes of images are simulated on a $50 \times 50$ pixel grid. Here, $M = 50^{2}$ and this number is selected to accommodate the computational time required for the Ideal, Hotelling, and Eigen-CQO which all require an $M \times M$ matrix inverse. By contrast the J-CQO only requires inversion of $L \times L$ matrices where $L$ is the number of channels.

The covariance matrix of each class is chosen to be circulant with a Gaussian kernel and parameterized by a correlation length as in

{[K_{i}]}_{n, m} = {(\frac{\bar{b}}{SNR})}^{2} \times \exp (- \frac{{((n_{x} - m_{x}) \mod 50)}^{2} + {((n_{y} - m_{y}) \mod 50)}^{2}}{2 σ_{i}^{2}})

where

σ_{i}

is the correlation length of the

i

th class,

n

and

m

are two-dimensional pixel indices, the SNR is set to 100 and the average value of the background

\bar{b}

is 25 through the simulation study. The mean of the first class is constant across the image plane, i.e., all elements of

{\bar{g}}_{1}

equal

\bar{b}

. The correlation length of the second class is always

σ_{2} = 0.3

pixels. This relatively simple model of the covariance matrices is chosen to provide an analytic solution for the Eigen-CQO decision variable.

The observers’ performance is compared for different choices of ${\bar{g}}_{2}$ and $σ_{1}$ . The case of signal present in the mean is simulated by

{\bar{g}}_{2} = {\bar{g}}_{1} + A_{s} s

where the support of

s

is one pixel in the middle of the image. The signal strength is expressed as contrast which is defined as

(A_{s} + \bar{b}) / \bar{b}

; a contrast of one indicates the signal is absent. Differences in the covariance matrices of the two classes are expressed as the difference in their correlation lengths,

Δ σ = σ_{1} - σ_{2}

. Example images from each class are displayed side by side in Fig. 1. In Fig. 1(a)

Δ σ = 0.2

pixels and in Fig. 1(b)

Δ σ = 0.5

pixels.

Fig. 1. Top image is a sample from the first class and the bottom image is a sample from the second class. The images look more similar in (a) since $Δ σ$ is smaller compared to (b). Movies for (a) and (b) are generated by repeatedly sampling from the ensemble (Media 1 and Media 2). (a) $Δ σ$ = 0.2 pixels. (b) $Δ σ$ = 0:5 pixels.

Download Full Size | PDF

The AUC is estimated from the percent correct of 1,000 independent sample images [2]. Five independent trials of the AUC estimates are calculated and the mean and standard deviation are displayed as error plots in Figs. 2, 4, and 5. The $k$ th sample image from the $i$ th class is computed from

g_{i}^{k} = {\bar{g}}_{i} + K_{i}^{1 / 2} w

where

w

is a zero mean image of uncorrelated white noise of unit variance.

Fig. 2. AUC comparison for (a) equal and (b) unequal mean images for the two classes. The AUC is calculated at various values of $Δ σ$ , the correlation length difference between the covariance matrices of the two classes. The number of channels is constant at $L = 25$ and the compression ratio $M / L = 100$ . (a) Signal absent in the mean, (b) signal present in the mean.

Download Full Size | PDF

Fig. 3. Eigenspectrum of $K_{2}^{- 1} K_{1}$ for six values of $Δ σ$ . The highest 25 values of $κ + 1 / κ$ are marked in red (in the top left part of each curve) to denote the Eigen-CQO selections. The number of channels is constant at $L = 25$ . $Δ σ$ = 0.1 pixels (a), 0.15 pixels (b), 0.2 pixels (c), 0.25 pixels (d), 0.30 pixels (e), 0.35 pixels (f).

Download Full Size | PDF

Fig. 4. Observer comparison for unequal mean images and varying the magnitude of the mean difference. The number of channels and second-order statistics are held constant; $L = 25$ and $Δ σ = 0.15$ pixels.

Download Full Size | PDF

Fig. 5. Observer comparison for (a) equal and (b) unequal mean images under both classes and varying the number of channels. The second-order statistics are constant; $Δ σ = 0.15$ pixels. (a) Signal absent in the mean, (b) signal present in the mean.

Download Full Size | PDF

In Fig. 2 the AUC is calculated as a function of correlation length differences between the two classes while the number of channels, $L$ , remains constant. This is repeated for both signal absent and present in the image means; see Figs. 2(a) and 2(b) respectively. In Fig. 2(a) all observers achieve an AUC of 0.5 as expected when $Δ σ = 0$ and $g_{1} = g_{2}$ since there is no difference between the two classes in this case. The AUC of the ideal observer saturates to 1.0 when a relatively small correlation length difference is introduced. The Hotelling template is zero when the means of the two classes are equal so its AUC remains at 0.5 and is invariant to the correlation length difference. As $Δ σ$ increases the performance of both the J-CQO and Eigen-CQO increase. The AUC of the Eigen-CQO yields better performance as predicted in Proposition 4 but in this simulation the difference is less than about 0.02 in AUC and the performance gap decreases as $Δ σ$ increases. Both observers saturate to an AUC of 1.0 when $Δ σ$ is sufficiently large.

As shown in Fig. 2(b) the Eigen-CQO is not necessarily better than the J-CQO when the image means are different. Here, the AUC is evaluated over the same range of $Δ σ$ but the AUC is not 0.5 at $Δ σ = 0$ since $g_{1} \neq g_{2}$ . At $Δ σ = 0$ all observers have approximately the same AUC of 0.68 except the Eigen-CQO which is close to 0.50. This is expected since the Eigen-CQO uses only second-order statistics of the image which are identical in this case. The Hotelling observer AUC is not expected to change as a function of $Δ σ$ since it only uses first-order statistics of the images (i.e., an affine transform) to calculate a decision variable. The change in Hotelling AUC in Fig. 2(b) is due to sample statistics from estimating the AUC using 5 independent trials of 1,000 images from each class. As expected the ideal observer’s AUC is equal to or greater than the AUC for the signal absent case since this task is easier. Again, the Eigen-CQO and J-CQO performance both increase with $Δ σ$ but when the signal is present in the mean the J-CQO performance is better for the more difficult tasks. In addition to this performance increase a critical advantage of J-CQO over Eigen-CQO is tractability of the calculation for high-dimensional images. Here, images are simulated on only a 50x50 grid to allow reasonable calculation times of the Eigen-CQO. For larger images calculating the channels for CQO from the gradient-based method would remain feasible while the $M \times M$ matrix inverse required for the Eigen-CQO would become infeasible.

To understand the computation of the Eigen-CQO, Fig. 3 presents the eigenvalues of $K_{2}^{- 1} K_{1}$ for several values of $Δ σ$ . Recalling Eq. (27) the channels for Eigen-CQO are computed by selecting the eigenvectors that correspond to the largest values of $κ + 1 / κ$ . Comparing the eigenspectrum in Fig. 3 for six $Δ σ$ values reveals that as the difference between the two classes increases the abundance of small eigenvalues that contribute to the $κ + 1 / κ$ rank-ordering also increases. If the number of channels were increased eventually larger eigenvalues would also be included in the $κ + 1 / κ$ rank-ordering. These trends are specific to the circulant Gaussian covariance matrices used in this example [see Eq. (86)]. Note that according to Eq. (41) each $κ_{l}$ is the ratio between the $l$ th channel output variances under each of the two classes. If these variances are equal then the channel output offers no discrimination when $g_{1} = g_{2}$ . The maximum discrimination occurs when this ratio is either large or small compared to one. In theory different combinations of eigenvalues from the high and low end of the eigenspectrum result in 26 (i.e., $L + 1$ ) local maxima as discussed in Section 8. In this simulation study independent random samples were used as the initial solution in the iterative algorithm (see Fig. 6 for both mean equal and unequal cases) and resulted in different values of J which confirms the presence of local maxima.

Fig. 6. J versus iteration number in Eq. (24) for (a) equal and (b) unequal means. Gradient term in Eq. (24) versus iteration number for (c) equal and (d) unequal means. The second-order statistics are constant; $Δ σ = 0.25$ pixels. Discrepancies for different runs (using independent random samples for the initial solution) of the iterative gradient-based search algorithm indicate the presence of local maxima. (a), (c) Signal absent in the mean. (b), (d) Signal present in the mean.

Download Full Size | PDF

To investigate the change in observer performance as a function of signal strength, Fig. 4 shows all observer’s AUC as the signal contrast changes. The number of channels for the J-CQO and the Eigen-CQO is held constant. Per our definition of signal contrast, 1.0 is equivalent to signal absent, hence the Hotelling AUC is 0.5 at this point. However at this point the J-CQO is about 0.81 since $Δ σ$ is held constant at 0.15 pixels. The J-CQO AUC is higher than Hotelling, both increase with signal contrast, and their values become increasingly similar as the contrast is increased. The AUC for both observers eventually saturates to 1.0 as the contrast is increased. The Eigen-CQO is constant with respect to signal contrast (within fluctuations due to sample statistics) since it uses only second-order statistics to calculate a decision variable and the second-order statistics are not changing. Another critical advantage of the J-CQO over the Eigen-CQO is that both first- and second-order statistics of the images are combined to calculate a decision variable.

Figure 5 is an investigation into the effect of number of channels on Eigen-CQO and J-CQO for both (a) signal absent and (b) present in the mean. The number of channels range from 1 to 50 which corresponds to a compression ratio of 2500 and 50, respectively. As expected, the ideal observer’s AUC is saturated to 1.0 and the Hotelling AUC is 0.50 when the signal is absent in the mean. In Fig. 5(a) for the signal absent case both the Eigen-CQO and J-CQO AUC increase with number of channels, the rate of this increase is approximately the same for both, and the Eigen-CQO performs slightly better as predicted in Proposition 4. A very weak signal contrast of 1.01 is used in Fig. 5(b) for the signal present case. Here, the J-CQO’s AUC is higher than the Eigen-CQO over the selected range of number of channels. As the number of channels increases the Eigen-CQO and J-CQO performance become increasingly similar. Since the J-CQO implementation requires selecting the number of channels prior to image processing the sensitivity of performance to number of channels will be an important quantity to determine experimentally when using the J-CQO in a new imaging application.

The channels of the J-CQO are calculated using a nonlinear iterative algorithm with an additive update term in Eq. (24). For both signal present and absent in the mean, Figs. 6(c) and 6(d) show the magnitude of the gradient, with respect to iteration number, multiplied by a scalar step-size factor $ϵ$ which is used to scale the additive update. The calculation was repeated three times for independent random initial values for the channel matrix entries. Smooth convergence toward a negligible gradient is observed for all three runs. Figs. 6(a) and 6(b) shows the scalar-valued quantity to be maximized (i.e., J) as a function of iteration number, also for signal absent and present and three independent runs each. All three runs converge quickly over this range of iterations but to slightly different values of J for each run. For the case of signal absent in the mean [Fig. 6(a)] the three trial runs of J-CQO are compared to the Eigen-CQO value of J which, as expected, is greater.

Figure 7(a) is the eigenspectrum of $R_{12}$ for the equal means case. Here, $L = 25$ and for the equal means case the 25 eigenvalues for each independent run are so nearly equal that they are plotted on top one another in the figure. As shown in Eq. (40) the value of $(1 / 2) (κ_{l} + (1 / κ_{l})) - L$ is equal to the contribution that the channel corresponding to the eigenvalue $κ_{l}$ makes to the J FOM. Therefore, the plot of $(κ_{l} + (1 / κ_{l}))$ versus $κ_{l}$ can be interpreted as rank-ordering the importance of each eigenvector of $R_{12}$ with respect to the J FOM. It is interesting to note from Eq. (40) that if one channel is added with a value of $(κ_{l} + (1 / κ_{l}))$ equal to 2 then the value of J would not change by adding that channel. In Fig. 7(a) none of these values are equal to 2 and therefore deleting any of the channels would decrease the value of J.

Fig. 7. Eigenspectrum of $R_{12}$ [see Eq. (19)] for (a) equal and (b) unequal means and $Δ σ = 0.25$ pixels and $L = 25$ . Discrepancies for different runs (using independent random samples as the initial solution) of the iterative gradient-based search algorithm indicate the presence of local maxima. When a signal in the mean is introduced an eigenvalue close to one appears in the eigenspectrum. The corresponding channel is the matrix-vector product of $T^{†}$ and the corresponding eigenvector. This channel is shown in (c) to demonstrate that the signal in the mean affects the J-CQO channels. (a) Signal absent in the mean and (b) signal present in the mean. (c) $T$ .

Download Full Size | PDF

Figure 7(b) is the eigenspectrum of $R_{12}$ for the unequal means case. When the two classes have different mean values the eigenspectrum changes, as compared to the equal means case, most notably by a single eigenvalue that appears close to one. The disagreement between each independent run is most notable in the selection of this eigenvalue closest to one. The corresponding channel is the matrix-vector product of $T^{†}$ and the eigenvector. For one of the independent runs this channel is given in Fig. 7(c) to show the dependence of the J-CQO channels on the difference in the means between the two classes, which is isolated in a single pixel at the center of the image.

In Fig. 8 the top row displays the channels solutions (i.e., the rows of the matrix $T$ ) for the J-CQO iterative algorithm for three values of the correlation length difference between the two classes with equal means. These solutions are nonunique since $T$ and $MT$ give the same value for J [see Eq. (16)]. To allow visual comparisons between different $Δ σ$ the bottom row of Fig. 8 shows the eigenvectors of $R_{12}$ [see Eq. (19)] back projected by $T^{†}$ . This corresponds to choosing the matrix $M$ that diagonalizes $R_{12}$ therefore the channel matrices in the top row produce the same value of $J$ as the corresponding channel matrices in the bottom row. The Eigen-CQO channels, which are optimal for the equal means case, are expected to be spatial frequency channels since the matrix $K_{1}^{- 1} K_{2}$ is stationary in this example [see Eq. (86)]. The appearance of the J-CQO channels in the bottom row of Fig. 8 are encouraging since they exhibit obvious frequency and orientation characteristics.

Fig. 8. Top row displays the channels (i.e., rows of the matrix $T$ ) for the J-CQO iterative algorithm for three values of the correlation length difference between the two classes with equal means. $L = 25$ and each row of the channel matrix is displayed as a single image in a $5 \times 5$ tiling. The bottom row shows the eigenvectors of $R_{12}$ [see Eq. (19)] back projected by $T^{†}$ . (a) J-CQO T ( $Δ σ = 0.2$ pixels), (b) J-CQO T ( $Δ σ = 0.5$ pixels), (c) J-CQO T ( $Δ σ = 1$ pixel), (d) backprojection of $R_{12}$ eigenvectors by $T^{†}$ ( $Δ σ = 0.2$ pixels), (e) backprojection of $R_{12}$ eigenvectors by $T^{†}$ ( $Δ σ = 0.5$ pixels), (f) backprojection of $R_{12}$ eigenvector by $T^{†}$ ( $Δ σ = 1$ pixel).

Download Full Size | PDF

10. CONCLUSIONS

In this paper, we have shown a promising new method for computing optimized channels for quadratic observers utilizing high-dimensional image data. We have assumed Gaussian statistics for the channelized image data, but the method for calculating channels is applicable in general. Gradient-based algorithms for determining the channels have been presented for five different information-based FOMs: KL1, KL2, J, G(0), and AUC. Analytic solutions for the optimum channels for each of the five FOMs have been derived for the case of equal mean data for both classes. The optimum channels for the J, G(0), and AUC FOMs under the equal mean condition are the same. This result is critical for implementing J as a surrogate FOM for AUC since the J FOM is much easier to compute. Since the algorithm is gradient-based, local maxima are an important consideration. For the J FOM, with the equal mean condition, we have shown that there are no more than $L + 1$ local maxima where $L$ is the number of channels. In principle all of the local maxima of the equal-mean J FOM could be located by using multiple random starting locations in the algorithm. In a simulation study we have calculated the performance of ideal and Hotelling observers and compared these to CQO performance. Optimal channel selection for the CQO has been calculated from two different methods: eigenanalysis and an iterative gradient-based algorithm. The gradient-based algorithm is an important new contribution since it makes optimal CQO implementation possible without inverting an $M \times M$ matrix for eigenanalysis.

Implementing the CQO in a new imaging application requires a set of channels and the first- and second-order statistics of channelized image data from both classes. A common technique to estimate the channelized first- and second-order statistics is from finite training data for both classes. Here, the dimensionality reduction from $M$ number of original measurements to $L$ number of channels is a critical advantage of the J-CQO method since estimating first- and second-order statistics from uncompressed image data would require larger sample sizes. Another important advantage of the CQO is that we only need to invert two $L \times L$ matrices instead of inverting two $M \times M$ matrices. In this simulation study we have shown excellent CQO performance for compression ratios ( $M / L$ ), on the order of 100. In practice, this compression ratio will depend upon the difficulty level of the task and performance requirements of the observer.

The J-CQO algorithm has the advantages of the CQO plus the property that the channels can be estimated using image samples from both classes. This can make binary classification possible in an exploratory imaging application where the relationship between the signal to be detected and the image data is not well understood. In such an application the J-CQO channel solution gives insight into which measurements are most useful for the detection task. The iterative solution for the J-CQO channels [see Eqs. (38) and (24)] does not require estimates of the image covariances $K_{1}$ and $K_{2}$ as an intermediate step. At each step in the iteration, training images from each class are channelized (i.e., compressed) using the candidate channel matrix solution. The sample means and sample covariance matrices for the channel outputs are computed. The sample cross-covariance matrices between the channel outputs and the original data are also computed (these are the matrices ${TK}_{1}$ and ${TK}_{2}$ ). The sample channelized means, sample cross-covariance matrices, and inverses of the sample channelized covariance matrices are then used to calculate the gradient of J with respect to the channel matrix $T$ . As long as the number of independent training images from each class exceeds the number of channels, the sample channelized covariance matrices will be full rank and therefore invertible. The novelty of this approach is that a prior description of the signal is not required. Instead, training data from each class can be used to estimate the channels.

We are currently working on analytic results for the case of nonequal means. Our simulation results showed that J-CQO performed better than Eigen-CQO for the case of nonequal means; see Figs. 2(b) and 5(b). In this situation, results in Fig. 7(c) indicate that to, achieve optimal-performance, Hotelling-type channels that are not eigenvectors of $K_{2}^{- 1} K_{1}$ are necessary. We are also developing analytic results and algorithms for optimizing channelized observer performance for estimation tasks. In this case, we use the Fisher information as a FOM. To meet the demands of a wider variety of imaging applications we will extend these CQO methods to classification tasks where there are more than two classes. Results in [45] indicate that human visual adaptation is nonlinear, this motivates testing the predictive power of J-CQO performance to human performance for texture discrimination.

APPENDIX A: KL1 AND KL2

To compute $F_{1} (T)$ , first we find

{〈 Q_{1} (v) 〉}_{1} = tr 〈 C_{1}^{- 1} (T) (v - {T \bar{g}}_{1}) {(v - {T \bar{g}}_{1})}^{†} 〉_{1} = L .

For the other expectation we need we have, after some computation,

{〈 Q_{2} (v) 〉}_{1} = tr {C_{2}^{- 1} (T) [C_{1} (T) + {Tss}^{†} T^{†}]} .

Using properties of the trace we then have

2 {〈 λ (v) 〉}_{1} = - L + tr [C_{2}^{- 1} (T) C_{1} (T)] + s^{†} T^{†} C_{2}^{- 1} (T) Ts - \ln \det [C_{2}^{- 1} (T) C_{1} (T)] .

This leads to the expression in the main text,

2 F_{1} (T) = - L + tr [R_{12} (T)] + H_{2} (T) - \ln \det [R_{12} (T)] .

To compute $\nabla F_{1} (T)$ we start with

D_{1} (T) = \frac{d}{d t} tr {[R_{12} (T + t E)] |}_{t = 0} .

After some computation we find that $D_{1} (T)$ is given by

2 tr {E^{†} [- R_{12} (T) C_{2}^{- 1} (T) {TK}_{2} + C_{2}^{- 1} (T) {TK}_{1}]} .

Next, we have

D_{2} (T) = \frac{d}{d t} \ln \det {[R_{12} (T + t E)] |}_{t = 0} .

A useful fact here is that

\frac{d}{d t} \ln \det [R_{12} (T + t E)] = tr [R_{12}^{- 1} (T + t E) \frac{d}{dt} R_{12} (T + t E)] .

After some computation we find that $D_{2} (T)$ is given by

2 tr {E^{†} [- C_{2}^{- 1} (T) {TK}_{2} + C_{1}^{- 1} (T) {TK}_{1}]} .

Finally, we have

D_{3} (T) = {\frac{d}{d t} H_{2} (T + t E) |}_{t = 0},

which eventually simplifies to

2 tr {E^{†} [- C_{2}^{- 1} (T) {Tss}^{†} T^{†} C_{2}^{- 1} (T) {TK}_{2} + C_{2}^{- 1} (T) {Tss}^{†}]} .

The gradient $\nabla F_{1} (T)$ is therefore

C_{2}^{- 1} (T) T (K_{1} + {ss}^{†}) [I - T^{†} C_{2}^{- 1} (T) {TK}_{2}] - C_{2}^{- 1} (T) {TK}_{2} + C_{1}^{- 1} (T) {TK}_{1} .

APPENDIX B: AUC

We calculate the following derivatives first:

\frac{d}{d t} {R_{12} (T + t E) |}_{t = 0} = C_{2}^{- 1} (T) [- ({EK}_{2} T^{†} + {TK}_{2} E^{†}) R_{12} (T) + ({EK}_{1} T^{†} + {TK}_{1} E^{†})],

and

{\frac{d}{d t} R_{21} (T + t E) |}_{t = 0} = C_{1}^{- 1} (T) [- ({EK}_{1} T^{†} + {TK}_{1} E^{†}) R_{21} (T) + ({EK}_{2} T^{†} + {TK}_{2} E^{†})] .

The calculation of the required gradient is now straightforward, if a bit tedious. We define

Q_{1} (α, T) = P^{- 1} (α, T) C_{1}^{- 1} (T) = {[C_{1} (T) P (α, T)]}^{- 1}

and

Q_{2} (α, T) = P^{- 1} (α, T) C_{2}^{- 1} (T) = {[C_{2} (T) P (α, T)]}^{- 1} .

These are symmetric matrices. We also define

S_{1} (α, T) = R_{21} (T) Q_{1} (α, T) = {[C_{1} (T) P (α, T) R_{12} (T)]}^{- 1}

and

S_{2} (α, T) = R_{12} (T) Q_{2} (α, T) = {[C_{2} (T) P (α, T) R_{21} (T)]}^{- 1} .

These are also symmetric matrices. Then, the gradient of $P (α, T)$ is $P (α, T)$ times the matrix function

2 (\frac{1}{4} + α^{2}) {[Q_{2} (α, T) - S_{1} (α, T)] {TK}_{1} + [Q_{1} (α, T) - S_{2} (α, T)] {TK}_{2}} .

Now define

U_{1} (T) = \int_{- \infty}^{\infty} {[P (α, T)]}^{- \frac{1}{2}} [Q_{2} (α, T) - S_{1} (α, T)] d α,

and

U_{2} (T) = \int_{- \infty}^{\infty} {[P (α, T)]}^{- \frac{1}{2}} [Q_{1} (α, T) - S_{2} (α, T)] d α .

Then, setting the gradient of AUC $(T)$ equal to zero is equivalent to

U_{1} (T) {TK}_{1} = - U_{2} (T) {TK}_{2} .

We will assume that $U_{1} (T)$ is invertible and check this assumption in the proof of Proposition 4. Then we have the familiar equation

K_{2}^{- 1} K_{1} T^{†} = - T^{†} U_{2} (T) U_{1}^{- 1} (T) .

This means that

- U_{2} (T) U_{1}^{- 1} (T) = R_{12} (T) .

ACKNOWLEDGMENTS

This material is based upon work supported by the National Science Foundation under CHE-1313892 and the National Institutes of Health under R01-EB000803 and P41-EB002035. The authors would like to thank Harrison Barrett and Luca Caucci for their valuable feedback. The comments by our anonymous reviewers improved this manuscript and we appreciate their work.

REFERENCES

1. D. J. Field, “Relations between the statistics of natural images and the response properties of cortical cells,” J. Opt. Soc. Am. A 4, 2379–2394 (1987). [CrossRef]

2. H. Barrett and K. Myers, Foundations of Image Science (Wiley, 2013).

3. C. K. Abbey, H. H. Barrett, and M. P. Eckstein, “Practical issues and methodology in assessment of image quality using model observers,” Proc. SPIE 3032, 182–194 (1997).

4. M. Kupinski, E. Clarkson, and J. Hesterman, “Bias in Hotelling observer performance computed from finite data,” Proc. SPIE 6515, 65150S (2007).

5. B. D. Gallas, “Variance of the channelized-hotelling observer from a finite number of trainers and testers,” Proc. SPIE 5034, 100–111 (2003).

6. N. Nguyen, C. Abbey, and M. Insana, “Objective assessment of sonographic quality I: Task information,” IEEE Trans. Med. Imaging 32, 683–690 (2013). [CrossRef]

7. C. Abbey, R. Zemp, J. Liu, K. Lindfors, and M. Insana, “Observer efficiency in discrimination tasks simulating malignant and benign breast lesions imaged with ultrasound,” IEEE Trans. Med. Imaging 25, 198–209 (2006). [CrossRef]

8. N. Nguyen, C. Abbey, and M. Insana, “Objective assessment of sonographic quality II: Acquisition information spectrum,” IEEE Trans. Med. Imaging 32, 691–698 (2013). [CrossRef]

9. S. Kullback and R. Leibler, “On information and sufficiency,” Ann. Math. Statistics 22, 79–86 (1951). [CrossRef]

10. H. Jeffreys, “An invariant form for the prior probability in estimation problems,” Proc. R. Soc. London A 186, 453–461 (1946). [CrossRef]

11. A. Bhattacharyya, “On a measure of divergence between two statistical populations defined by their probability distributions,” Bulletin Cal. Math. Soc. 35, 99–109 (1943).

12. H. Barrett, C. Abbey, and E. Clarkson, “Objective assessment of image quality. III. ROC metrics, ideal observers, and likelihood-generating functions,” J. Opt. Soc. Am. A 15, 1520–1535 (1998). [CrossRef]

13. W. Peterson, T. Birdsall, and W. Fox, “The theory of signal detectability,” Trans. IRE Prof. Group Inf. Theory 4, 171–212 (1954). [CrossRef]

14. C. E. Metz, “Receiver operating characteristic analysis: a tool for the quantitative evaluation of observer performance and imaging systems,” J. Am. Coll. Radiol. 3, 413–422 (2006). [CrossRef]

15. E. Clarkson, “Asymptotic ideal observers and surrogate figures of merit for signal detection with list-mode data,” J. Opt. Soc. Am. A 29, 2204–2216 (2012). [CrossRef]

16. K. J. Myers and H. H. Barrett, “Addition of a channel mechanism to the ideal-observer model,” J. Opt. Soc. Am. A 4, 2447–2457 (1987).

17. S. Park, M. A. Kupinski, E. Clarkson, and H. Barrett, “Efficient channels for the ideal observer,” Proc. SPIE 5372, 12–21 (2004).

18. S. Park, H. Barrett, E. Clarkson, M. Kupinski, and K. Myers, “Channelized-ideal observer using Laguerre-Gauss channels in detection tasks involving non-Gaussian distributed lumpy backgrounds and a Gaussian signal,” J. Opt. Soc. Am. A 24, B136–B150 (2007).

19. J. Witten, S. Park, and K. Myers, “Partial least squares: a method to estimate efficient channels for the ideal observers,” IEEE Trans. Med. Imaging 29, 1050–1058 (2010). [CrossRef]

20. S. Park, J. Witten, and K. Myers, “Singular vectors of a linear imaging system as efficient channels for the Bayesian ideal observer,” IEEE Trans. Med. Imaging 28, 657–668 (2009). [CrossRef]

21. P. Dollár, Z. Tu, P. Perona, and S. Belongie, “Integral channel features,” BMVC 2, 5 (2009).

22. H. Barrett, J. Yao, J. Rolland, and K. Myers, “Model observers for assessment of image quality,” Proc. Natl. Acad. Sci. USA 90, 9758–9765 (1993). [CrossRef]

23. J. Brankov, Y. Yang, L. Wei, I. El-Naqa, and M. Wernick, “Learning a channelized observer for image quality assessment,” IEEE Trans. Med. Imaging 28, 991–999 (2009).

24. D. J. Field, “What is the goal of sensory coding?” Neural Computation 6, 559–601 (1994). [CrossRef]

25. B. Schölkopf, A. Smola, and K.-R. Müller, “Nonlinear component analysis as a kernel eigenvalue problem,” Neural Computation 10, 1299–1319 (1998). [CrossRef]

26. K. Fukunaga and W. L. Koontz, “Application of the Karhunen-Loeve expansion to feature selection and ordering,” IEEE Trans, Comput. C-19, 311–318 (1970). [CrossRef]

27. X. Huo, “A statistical analysis of Fukunaga-Koontz transform,” IEEE Signal Process. Lett. 11, 123–126 (2004). [CrossRef]

28. A. Mahalanobis, R. R. Muise, S. R. Stanfill, and A. Van Nevel, “Design and application of quadratic correlation filters for target detection,” IEEE Trans. Aerosp. Electron. Syst. 40, 837–850 (2004). [CrossRef]

29. K. Fukunaga, Introduction to Statistical Pattern Recognition (Academic, 1990).

30. S. Zhang and T. Sim, “Discriminant subspace analysis: A Fukunaga-Koontz approach,” IEEE Trans Pattern Anal. Mach, Intell. 29, 1732–1745 (2007). [CrossRef]

31. E. Clarkson and F. Shen, “Fisher information and surrogate figures of merit for the task-based assessment of image quality,” J. Opt. Soc. Am. A 27, 2313–2326 (2010). [CrossRef]

32. F. De la Torre and T. Kanade, “Multimodal oriented discriminant analysis,” in Proceedings of the 22nd International Conference on Machine Learning (ACM, 2005), pp. 177–184.

33. T. Wimalajeewa, H. Chen, and P. K. Varshney, “Performance limits of compressive sensing-based signal classification,” IEEE Trans. Signal Process. 60, 2758–2770 (2012). [CrossRef]

34. W. R. Carson, M. Chen, M. R. Rodrigues, R. Calderbank, and L. Carin, “Communications-inspired projection design with application to compressive sensing,” SIAM J. Imag. Sci. 5, 1185–1212 (2012). [CrossRef]

35. E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Inf. Theory 52, 489–509 (2006). [CrossRef]

36. D. L. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory 52, 1289–1306 (2006). [CrossRef]

37. S. Ji, Y. Xue, and L. Carin, “Bayesian compressive sensing,” IEEE Trans. Signal Process. 56, 2346–2356 (2008). [CrossRef]

38. M. A. Neifeld, A. Ashok, and P. K. Baheti, “Task-specific information for imaging system analysis,” J. Opt. Soc. Am. A 24, B25–B41 (2007). [CrossRef]

39. A. Ashok, P. K. Baheti, and M. A. Neifeld, “Compressive imaging system design using task-specific information,” Appl. Opt. 47, 4457–4471 (2008). [CrossRef]

40. M. A. Davenport, M. B. Wakin, and R. G. Baraniuk, “Detection and estimation with compressive measurements,” Technical Report TREE 0610 (Department of ECE, Rice University, 2006).

41. A. DasGupta, Asymptotic Theory of Statistics and Probability (Springer, 2008).

42. E. Clarkson and H. Barrett, “Approximations to ideal-observer performance on signal-detection tasks,” Appl. Opt. 39, 1783–1793 (2000). [CrossRef]

43. D. Husemoller, Fibre Bundles (Springer, 1994).

44. A. Edelman, T. A. Arias, and S. T. Smith, “The geometry of algorithms with orthogonality constraints,” SIAM J. Matrix Anal. Appl. 20, 303–353 (1998). [CrossRef]

45. S. M. Smirnakis, M. J. Berry, D. K. Warland, W. Bialek, and M. Meister, “Adaptation of retinal processing to image contrast and spatial scale,” Nature 386, 69–73 (1997). [CrossRef]

Method for optimizing channelized quadratic observers for binary classification of large-dimensional image datasets

Abstract

1. INTRODUCTION

2. RELATED WORK

A. Channels in Imaging

B. Eigenanalysis for Compression

C. Information Metrics in Imaging

D. Relation to Compressive Sensing

3. FORMULATION OF THE PROBLEM

4. KL1 AND KL2

5.

6. G(0)

7. AUC

8. LOCAL MAXIMA FOR J WITH NO SIGNAL IN THE MEAN

9. SIMULATIONS

10. CONCLUSIONS

APPENDIX A: KL1 AND KL2

APPENDIX B: AUC

ACKNOWLEDGMENTS

REFERENCES

Supplementary Material (2)

Cited By

Figures (8)

Equations (111)

Journal of the Optical Society of America A