Cause-aware failure detection using an interpretable XGBoost for optical networks

Chunyu Zhang; Danshi Wang; Lingling Wang; Luyao Guan; Hui Yang; Zhiguo Zhang; Xue Chen; Min Zhang

doi:10.1364/OE.436293

1. Introduction

The rapid growth of data traffic has stimulated the development of highly flexible and large-scale optical networks [1,2]. Network failures are generally caused by performance degradation of optical transmission lines, network equipment, etc., and may lead to quality-of-service degradation or even to service interruption with data loss [3,4]. When a monitored object deteriorates, the resulting service interruption can therefore be avoided by detecting and isolating the fault in time. As optical networks grow in scale, the challenge for intelligent operation and maintenance is not only to detect faults but also to find their causes, so that faults can be isolated and the faulty system restored as soon as possible. Currently, network operators perform failure detection with threshold-based systems. Owing to the heterogeneity of equipment and the dynamics of the network, it is difficult to develop a threshold system that suits the whole network. Recently, machine learning (ML) has been widely used in the failure management of satellite [5–7], cellular [8–10], and optical communication systems [11–13]. The basic idea of applying ML to failure management is to model specific tasks, including failure detection, prediction, diagnosis, location, and recovery, and to learn the decision model automatically from data rather than relying on expert experience.

With the recent developments of ML, neural networks (NNs) have shown considerable potential in the fields of optical performance monitoring [14,15], digital signal processing [16,17], network resource allocation [18,19], and failure management of optical networks [20,21]. In [22,23], a self-taught anomaly detection framework based on a hybrid unsupervised and supervised ML method is proposed; it uses an unsupervised density-based algorithm to learn abnormal network behaviors and supervised deep neural networks for online anomaly detection, and achieves low false positive and false negative rates. In [24], an NN-based ML method is proposed that analyzes the bit error rate (BER) at the receiver to identify optical equipment failures, i.e., whether a BER anomaly is caused by a filter or an amplifier failure. Moreover, different ML algorithms are compared in terms of model accuracy and complexity when performing failure detection. In [25], a failure location method for optical networks based on a knowledge graph (KG) and a graph neural network (GNN) is proposed, in which the generalized GNN performs relational reasoning on alarm KGs to identify the root alarms. In [26], we propose optical equipment failure prediction based on a bi-directional gated recurrent unit, which predicts the operating state of optical transport network (OTN) equipment with high accuracy and achieves low false negative and false positive rates. In [27,28], a cognitive failure detection architecture is proposed, in which an NN-based classifier performs failure detection for optical network infrastructure and significantly improves the response time of proactive failure detection compared with traditional threshold-based failure detection. In particular, NNs can perform failure detection by implicitly learning from features such as the operating parameters of network infrastructure [27,28].
However, it is difficult to interpret or analyze why and how NNs work during execution because the NN serves as a black-box [29,30]. Therefore, it is necessary to find an alternative, more suitable technique for failure detection.

Among the ML methods applied in practice, the extreme gradient boosting (XGBoost) algorithm based on decision trees has attracted considerable research interest in the industrial machinery, power system, and industrial infrastructure domains [31–33] owing to its good decision performance, fast computing speed, strong portability, cloud integration, etc. In particular, the feature importance ranking provided by XGBoost can interpret the relationship between output results and input features, and helps network operators understand the failure detection results. XGBoost is an ensemble algorithm that adopts the classification and regression tree (CART) as its base learner [34]. Similar to NNs, XGBoost connects simple subunits to form a system with large model complexity and strong learning ability. Unlike traditional NNs, the base learner of XGBoost is composed of a root node, branches, and leaf nodes, and XGBoost improves the performance of the model by greedily adding trees. When a CART decision tree is constructed, the feature and splitting point that bring the greatest gain to the loss function are selected to split the node [34]. Moreover, the search for the best split is carried out in parallel, which speeds up model training. In [35], we proposed an interpretable failure cause analysis scheme that uses the supervised XGBoost algorithm. The scheme trains the model of optical equipment on labeled data, and the features most relevant to equipment failure are found from the feature importance measurement parameter (i.e., weight) of XGBoost.

In this paper, based on our previous work [35], we make an effort to further investigate and extend this work by describing more comprehensively the cause-aware mechanism of XGBoost for optical equipment failure detection, and testing the feature importance ranking results under three common global feature importance measurement parameters. Especially, by analyzing the feature importance ranking of XGBoost under three different global feature importance measurement parameters, SHapley Additive exPlanations (SHAP) is applied to find the highest-correlation features with equipment failure by the ranking of feature importance based on SHAP value from a global perspective; at the same time, the relationship between input features and detected results is explained based on SHAP value from a local perspective. Moreover, two types of OTN boards with the characteristics of balanced and unbalanced data are studied to evaluate the performance of proposed scheme, and experimental results show that the failure detection scheme based on XGBoost can achieve low false negative and false positive rates.

2. Operating principle of failure detection using interpretable XGBoost

In this section, we first give the definition of XGBoost for failure detection in optical network. Then, we introduce the interpretable XGBoost for failure detection, including the principle of XGBoost for failure detection and XGBoost for the failure cause identification.

2.1 Problem definition

We propose a cause-aware failure detection scheme for optical network equipment based on interpretable XGBoost, including data pre-processing, model training, failure detection and analysis, as shown in Fig. 1(a). The main application scenario of the proposed scheme is a software-defined networking (SDN) metropolitan area network, where the SDN architecture includes the infrastructure layer, the control layer and the application layer. In the infrastructure layer, the metropolitan area network is a mesh topology constructed from OTN nodes, and the equipment reports its performance data to the SDN controller through the southbound interface. In the control layer, the SDN controller stores the data collected through the southbound interface in a database, and starts intelligent failure detection and identification of failure causes. Finally, the SDN controller sends control messages to the OTN nodes according to the detection and identification results, and these nodes switch the service to a safe path to prevent service interruption.

Fig. 1. Cause-aware failure detection of equipment based on the interpretable XGBoost in optical network.

The XGBoost algorithm is used for failure detection and failure cause analysis of optical equipment (OTN boards). Figure 1(b) shows how the instances are organized into the XGBoost tree structure through node splitting, which makes it possible to identify the failure cause by explaining the relationship between input features and failure detection results. Failure detection is modeled as a two-class classification problem, in which the model learns the mapping between input features and labels from the physical parameters collected by the equipment. The physical parameters of the equipment include environment temperature, laser bias current, laser temperature offset, input optical power, central wavelength, unavailable time, etc. These parameters are the input of the two-class classification model, and the output is the operating state of the equipment, which is either normal or faulty. To quantify the cause awareness of failure detection, our goal is to find the input features most relevant to equipment failure, and thereby enable the identification of the failure cause.

2.2 Interpretable XGBoost for failure detection

In this section, we first describe the principle of XGBoost for failure detection, and then elaborate the principle of XGBoost for failure causes identification based on the feature importance ranking in XGBoost.

2.2.1 XGBoost for failure detection

Failure detection of equipment is modeled as a two-class classification problem, which XGBoost is used to solve. XGBoost is an improved boosting algorithm. Its essence is to greedily add a CART tree that fits the residual error between the decision result of the previous t-1 trees and the true value of the training sample. For a given instance, the decision rules in each tree classify it into a leaf, and the corresponding leaf score is obtained. During training, the instance space is divided along the feature dimensions, and each instance falls into one leaf in each tree. After k trees have been trained, the output score is the cumulative sum of the leaf scores over all trees. Because failure detection is a two-class classification problem, this score must be transformed by the function corresponding to the learning task (i.e., the objective function) to obtain the output probability of the corresponding category (i.e., normal or faulty) of the instance. The category of the sample can therefore be obtained from the leaf nodes, and the construction of a tree is shown in Fig. 2. In XGBoost, the objective function selected for failure detection is “binary:logistic”, and the conversion function of the leaf score is shown in Eq. (1).

$$f(x)=\frac{1}{1+e^{-x}} \tag{1}$$

where x is the cumulative leaf score of the instance over all trees, and f(x) is the output probability of the corresponding category (i.e., normal or faulty) of the instance.
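As a minimal sketch of Eq. (1), the snippet below sums hypothetical per-tree leaf scores for one instance and converts the total into a fault probability. The score values, the function name `fault_probability`, and the 0.5 decision cut-off are illustrative assumptions, and any constant base score that a library may add is ignored.

```python
import math

def fault_probability(leaf_scores):
    """Sum the per-tree leaf scores of one instance and apply Eq. (1)."""
    x = sum(leaf_scores)                 # cumulative score over the k trees
    return 1.0 / (1.0 + math.exp(-x))    # logistic transform, Eq. (1)

# Hypothetical leaf scores returned by k = 3 trees for a single board sample.
example_scores = [0.8, 0.5, -0.1]
p_fault = fault_probability(example_scores)
# Assumed decision rule: report a fault when the probability reaches 0.5.
state = 1 if p_fault >= 0.5 else 0
```

A cumulative score of zero maps to a probability of exactly 0.5, so the sign of the summed leaf scores decides the predicted state under this assumed cut-off.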

Fig. 2. Construction of XGBoost tree.

It can be concluded that the score of the leaf node to which an instance is mapped determines the category of the instance, that is, the operating state of the equipment. How is the score of a leaf node obtained? Once the tree structure is determined, the leaf node score follows from the derivation of the objective function. The objective function of XGBoost is shown in Eq. (2).

$$\mathcal{L}(\phi) = \sum_i l\left(\hat{y}_i, y_i\right) + \sum_k \Omega(f_k) \tag{2a}$$

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \|\omega\|^2 \tag{2b}$$

where l is a differentiable convex loss function that describes how well the model fits the data by measuring the difference between the output ŷ_i and the target y_i; the second term Ω controls the complexity of the model, in which the number of leaves T and the leaf node weights ω are penalized through the hyperparameters γ and λ, respectively. Each f_k is the functional representation of an independent tree structure q with leaf node weights ω.

Defining I_j = {i | q(x_i) = j} as the instance set of leaf node j, the objective function at the t-th iteration can be transformed into a function of the leaf node weights ω, as shown in Eq. (3).

$$\widetilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} \omega_j^2 = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) \omega_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) \omega_j^2 \right] + \gamma T \tag{3}$$

where g_i and h_i are first and second order gradient statistics on the loss function.

For the fixed tree structure q(x), we can obtain the optimal solution of Eq. (3) about ω, that is, the minimum value of objective function at the leaf node j, as shown in Eq. (4).

$$\omega_j = -\frac{G_j}{H_j + \lambda} \tag{4a}$$

$$Obj = -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \tag{4b}$$

where ω_j is the weight (i.e., the score) of leaf node j, G_j and H_j are the sums of g_i and h_i over the instances in I_j, and Obj is the value of the objective function.
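The following sketch evaluates Eq. (4a) and Eq. (4b) for a single hypothetical leaf. It assumes the log loss that goes with the “binary:logistic” objective, for which g_i = p_i - y_i and h_i = p_i(1 - p_i); the function names and the sample values are illustrative only.

```python
import math

def logistic_grad_hess(y, raw_score):
    """First and second derivatives of the log loss w.r.t. the raw score."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p - y, p * (1.0 - p)

def leaf_weight(G, H, lam):
    """Optimal leaf score, Eq. (4a)."""
    return -G / (H + lam)

def structure_objective(leaf_stats, lam, gamma):
    """Objective of a fixed tree, Eq. (4b); leaf_stats holds one (G_j, H_j) per leaf."""
    return -0.5 * sum(G * G / (H + lam) for G, H in leaf_stats) + gamma * len(leaf_stats)

# Hypothetical leaf holding three fault samples (y = 1) whose current raw score is 0.
stats = [logistic_grad_hess(1, 0.0) for _ in range(3)]
G = sum(g for g, _ in stats)   # -1.5
H = sum(h for _, h in stats)   #  0.75
w = leaf_weight(G, H, lam=1.0)
```

Because all three samples are faults that the current model scores at probability 0.5, the resulting leaf weight is positive, which pushes the cumulative score of these instances toward the fault class.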

As shown above, the leaf node scores can be obtained once the tree structure is determined. How is the tree structure determined? In the actual training process, XGBoost does not enumerate all possible tree structures when the t-th tree is built; instead, a greedy method is used to perform node splitting. Starting from a tree of depth 0 (a single leaf), it tries to split each leaf node in the tree; each split turns the original leaf node into left and right child leaf nodes, and the instance set of the original leaf node is distributed to the two children according to the decision rule of the node. For each candidate split, we check whether it brings a gain to the loss function. The gain is defined in Eq. (5), and the feature and splitting point that maximize the gain are selected to split the node.

$$Gain = Obj_{L+R} - \left(Obj_L + Obj_R\right) = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma \tag{5}$$

where L and R are the instance subsets of the left and right child nodes after the instance set I is split, and G_j and H_j are the sums of the first and second partial derivatives of the samples contained in node j, respectively.
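Eq. (5) can be checked numerically with a few lines of code. The sketch below is illustrative: the function name `split_gain` and the gradient sums are assumptions, and the defaults λ = 1 and γ = 0 are chosen here for convenience rather than taken from the tuned model.

```python
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Loss reduction of one candidate split, Eq. (5)."""
    def score(G, H):
        return G * G / (H + lam)
    parent = score(G_L + G_R, H_L + H_R)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - parent) - gamma

# Hypothetical gradient sums of the left and right children.
gain = split_gain(-1.5, 0.75, 1.0, 0.5)
# A split is only worth keeping when its gain is positive.
worth_splitting = gain > 0
```

Raising γ lowers every candidate gain by the same amount, so a larger γ prunes splits whose loss reduction is small.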

Moreover, XGBoost performs node splitting in parallel, and the principle of parallel node splitting is shown in Fig. 1(b). XGBoost pre-sorts each feature by its feature values and saves the result as a block. Each block includes the feature values, the gradient values (g_i, h_i), and an index that points from each feature value to its corresponding gradient values. The feature pre-sorting is carried out only once, and the gradient information can be obtained directly through the index when nodes are split. Because each feature has been stored in advance as a block, XGBoost can use multiple threads to calculate the optimal splitting point of each feature in parallel, which not only greatly improves the splitting speed of nodes, but also makes it easier to scale to large training sets.
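The sorted scan described above can be sketched in plain Python. This is a simplified illustration, not the library's block implementation: the function names are hypothetical, the sort is repeated on every call instead of being cached, and Python threads only mirror the structure of the parallel search without giving a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def best_split_for_feature(values, g, h, lam=1.0, gamma=0.0):
    """Scan one feature in sorted order; return (best gain, threshold)."""
    # XGBoost sorts each feature once and reuses the order;
    # this sketch re-sorts on each call to stay short.
    order = sorted(range(len(values)), key=values.__getitem__)
    G, H = sum(g), sum(h)
    G_L = H_L = 0.0
    best_gain, best_threshold = float("-inf"), None
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        G_L += g[i]
        H_L += h[i]
        if values[i] == values[nxt]:
            continue                      # no threshold separates equal values
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L ** 2 / (H_L + lam) + G_R ** 2 / (H_R + lam)
                      - G ** 2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = 0.5 * (values[i] + values[nxt])
    return best_gain, best_threshold

def best_split(columns, g, h, lam=1.0, gamma=0.0):
    """Evaluate every feature column in its own task; return (index, gain, threshold)."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(
            lambda col: best_split_for_feature(col, g, h, lam, gamma), columns))
    j = max(range(len(columns)), key=lambda k: results[k][0])
    return j, results[j][0], results[j][1]
```

The running sums G_L and H_L are what the index into the gradient values buys: each candidate threshold costs one addition per statistic instead of a fresh pass over the instances.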

2.2.2 XGBoost for the failure cause identification

The failure detection procedure above shows that the construction of an XGBoost tree is driven by node splitting on instance features. Node splitting in XGBoost therefore explains the relationship between input features and failure detection results, and this relationship can be measured quantitatively by the feature importance ranking of the XGBoost algorithm. With the help of the feature importance of XGBoost, we can find the features that are more relevant to equipment failure. The three common feature importance measurement parameters of the XGBoost algorithm are weight, cover and gain [36]. Weight is the number of times a feature is used to split nodes across all trees. Cover is the average second-order gradient sum of the training data passing through the split points where the feature is used. Gain is the average loss reduction brought by a feature when it is used for node splitting.

To show how the feature importance is obtained under the parameters weight, cover and gain, the calculation of the feature importance score is described in detail for the laser bias current (LBC). It can be seen from Fig. 2 that LBC is used to split a node three times, so its weight is three. Let cover₁ be the cover of the first node split on LBC, that is, the sum of the second derivatives (h_i) of the samples passing through that node, and let cover₂ and cover₃ be the corresponding sums for the second and third splits. The cover of LBC in this tree is then given by Eq. (6). Similarly, let gain₁ be the loss reduction of the first split on LBC, calculated according to Eq. (5), and let gain₂ and gain₃ be the loss reductions of the second and third splits. The gain of LBC in this tree is then given by Eq. (7). Under each of these feature importance measurement parameters, a higher feature importance score indicates a higher correlation between the feature and equipment failure, which enables the identification of the failure cause.

$$\textit{cover}_{LBC} = \left(\textit{cover}_1 + \textit{cover}_2 + \textit{cover}_3\right)/3 \tag{6}$$

$$\textit{gain}_{LBC} = \left(\textit{gain}_1 + \textit{gain}_2 + \textit{gain}_3\right)/3 \tag{7}$$
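The three scores in Eq. (6) and Eq. (7) can be reproduced from a list of split records. The sketch below is an illustration under stated assumptions: the record format `(feature, gain, cover)`, the function name, and all numbers are hypothetical, and "ET" is used here only as a shorthand for environment temperature.

```python
from collections import defaultdict

def feature_importance(splits):
    """splits: (feature, gain, cover) for every split node in all trees."""
    count = defaultdict(int)
    gain_sum = defaultdict(float)
    cover_sum = defaultdict(float)
    for feature, gain, cover in splits:
        count[feature] += 1
        gain_sum[feature] += gain
        cover_sum[feature] += cover
    return {
        f: {"weight": count[f],                    # number of splits
            "gain": gain_sum[f] / count[f],        # average gain, Eq. (7)
            "cover": cover_sum[f] / count[f]}      # average cover, Eq. (6)
        for f in count
    }

# Hypothetical split records: LBC splits three times, ET once.
splits = [("LBC", 6.0, 40.0), ("ET", 9.0, 50.0), ("LBC", 3.0, 20.0), ("LBC", 1.5, 15.0)]
scores = feature_importance(scores_input := splits)
```

In this made-up example LBC ranks first by weight while ET ranks first by gain and cover, which illustrates that the three parameters need not agree. With a trained model, the xgboost Python package exposes the same quantities through `Booster.get_score(importance_type=...)` with the values `"weight"`, `"gain"` and `"cover"`.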

3. Process of XGBoost in failure detection

In this section, we first describe the input and output of the problem (i.e., failure detection), then introduce training of the optimal failure detection model by adjusting parameters, and finally give the performance evaluation metrics of training model.

3.1 Input and output

The data used in this work came from an actual optical transmission network, and were collected from 18 nodes for 47 consecutive days with a time step of 1 day. Each sample consists of the monitoring object, start time, end time, minimum performance value, average performance value, maximum performance value, and performance event. Performance events are divided into counting and analog performance events according to the properties of the underlying physical parameters, as shown in Table 1. The counting performance events mainly include various error parameters, and their values generally have no units. Analog performance events mainly include equipment working temperature, transmitted optical power, received optical power and other parameters, and their values have clear physical units. Generally, a counting performance event directly reflects the transmission quality of the OTN system, while an analog performance event reflects it indirectly. Among the monitored performance events, the unavailable time (UAS) is the amount of time during which the number of erroneous bits in a certain time span exceeds a certain threshold. If UAS > 80,000, we assume that the equipment is in the fault state and the corresponding sample is a fault sample. Furthermore, we denote fault samples and normal samples as “1” and “0”, respectively.

Table 1. OTN performance events

An analysis of the statistical characteristics of the sample features shows that the data contain missing and redundant entries. After data preprocessing, 15 features of five performance events (the average, maximum and minimum values of environment temperature, laser bias current, laser temperature offset, input optical power, and output optical power) are used as the input features of the XGBoost algorithm and are denoted F0-F14. Data preprocessing mainly includes data cleaning, feature selection and feature normalization. Data cleaning deals with missing and redundant data; feature selection follows the principle of retaining as many features as possible; feature normalization eliminates the influence of different units and scales, and is carried out by Min-Max scaling. After data preprocessing, the input features can be used by the XGBoost algorithm to estimate the operating state of the equipment.
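Min-Max scaling maps each feature column to the interval [0, 1]. A minimal sketch, assuming one list of readings per feature; the function name, the sample values, and the handling of a constant column are illustrative choices.

```python
def min_max_scale(column):
    """Rescale one feature column to [0, 1] with Min-Max scaling."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)   # constant feature: assumed to map to 0
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical input optical power readings in dBm.
scaled = min_max_scale([-12.0, -8.0, -10.0])
```

Tree splits depend only on the ordering of feature values, so the scaling matters mainly for the SVM and LR baselines, whose decision functions are sensitive to feature scale.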

3.2 Training the best failure detection model

Once the input features are fixed, the performance of the model depends on the parameters of the training algorithm. For the XGBoost algorithm, a variety of hyperparameter combinations are tested by cross-validation to obtain the classifier with the best classification performance; in addition, the logistic regression (LR) algorithm, which is easier to interpret, and the widely used support vector machine (SVM) algorithm are tuned in the same way for comparison.

XGBoost has three types of parameters: general parameters, booster parameters, and task parameters. The general parameters determine the booster used for boosting, which is usually “gbtree” or “gblinear”. The gblinear booster uses a linear model for boosting, and gbtree uses a tree-based model. gblinear is usually selected when the function to be learned is linear, and gbtree is selected in most other cases. Moreover, gbtree gives better model training results than gblinear here, so we choose gbtree as the booster. The booster parameters depend on the selected booster and control the performance of the model. They include “gamma”, “min_child_weight”, “max_depth”, “learning_rate”, “n_estimator”, etc. Gamma, min_child_weight and max_depth are mainly used to control overfitting. Task parameters depend on the learning scenario and specify the objective function used for the classification or regression task.

Before the parameters are adjusted, “binary:logistic” is selected as the objective function according to the learning task of failure detection, gbtree is selected as the booster, and the other parameters are left at their default values. During tuning, we first consider different numbers of trees (i.e., n_estimator), ranging from 10 to 200 with a step size of 10, and obtain the best value of n_estimator by cross-validation. Around the n_estimator value obtained in this step, the step size is reduced, and n_estimator and learning_rate are adjusted simultaneously by grid search. The range of learning_rate is 0.01-0.2, and the step size is 0.01. Then, we consider the maximum depth of a tree (i.e., max_depth), ranging from 1 to 10 with a step size of 1, and obtain the optimal value of max_depth by cross-validation. In this way, we obtain a set of optimal parameters based on cross-validation.
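The staged tuning procedure can be sketched independently of any ML library by passing in a scoring callback. Everything below is a hedged illustration: `evaluate` stands in for a cross-validated score, `toy_score` is a made-up surrogate, and the refinement window and finer step in stage 2 are assumptions because the text only states that the step size is reduced. The key `n_estimator` follows the text; the scikit-learn wrapper of XGBoost names this parameter `n_estimators`.

```python
from itertools import product

def staged_search(evaluate, window=8, fine_step=2):
    """Three-stage tuning; evaluate(params) returns a score to maximize."""
    # Stage 1: coarse scan of the number of trees (10 to 200, step 10).
    n_best = max(range(10, 201, 10), key=lambda n: evaluate({"n_estimator": n}))
    # Stage 2: finer n_estimator grid combined with learning_rate (0.01 to 0.2, step 0.01).
    # The window and the finer step are assumptions, not values from the paper.
    n_grid = range(max(1, n_best - window), n_best + window + 1, fine_step)
    lr_grid = [round(0.01 * k, 2) for k in range(1, 21)]
    n_best, lr_best = max(
        product(n_grid, lr_grid),
        key=lambda p: evaluate({"n_estimator": p[0], "learning_rate": p[1]}))
    # Stage 3: maximum tree depth (1 to 10, step 1) with the values found so far.
    fixed = {"n_estimator": n_best, "learning_rate": lr_best}
    depth_best = max(range(1, 11), key=lambda d: evaluate({**fixed, "max_depth": d}))
    return {**fixed, "max_depth": depth_best}

def toy_score(params):
    """Stand-in for a cross-validated score; peaks at n=56, lr=0.1, depth=4."""
    n = params["n_estimator"]
    lr = params.get("learning_rate", 0.3)
    depth = params.get("max_depth", 6)
    return -(abs(n - 56) / 100 + abs(lr - 0.1) + abs(depth - 4) / 10)

best = staged_search(toy_score)
```

Tuning in stages keeps the number of evaluations far below that of a full three-dimensional grid, at the cost of possibly missing interactions between max_depth and the other two parameters.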

For the SVM algorithm, the kernel function and the penalty factor (C) are the important hyperparameters, so we adjust them simultaneously based on cross-validation. The kernel function is used to find a separating hyperplane under different data distributions, and we consider kernel functions chosen from the set {linear, polynomial, sigmoid, radial basis function}. C is the penalty coefficient, which trades off the two competing goals of “correct classification of training samples” and “margin maximization of the decision function”. To find a balance point, C is varied from 5 to 100 with a step size of 5. The optimal parameters of the SVM algorithm are found by grid search over the combinations of kernel function and C. The LR algorithm has good interpretability. For the LR algorithm, we adjust the penalty (penalty term) and C (the reciprocal of the regularization coefficient λ). The penalty is chosen from {L1, L2}. The smaller C is, the stronger the regularization. We set the range of C to 0.1-1 with a step size of 0.1. Based on grid search, we adjust the penalty and C simultaneously and find the optimal parameters of the LR algorithm.

3.3 Performance evaluation metrics

To compare the algorithms quantitatively, we measure the quality of failure detection with performance evaluation metrics. When an ML algorithm is applied to a classification problem, the common evaluation metrics include accuracy, precision, recall, F1 score, false negative rate, false positive rate, etc. The description of these metrics when applied to equipment failure detection is shown in Table 2. TN, TP, FN and FP denote true negative, true positive, false negative and false positive, respectively. In failure detection, a true negative is a normal sample that is correctly judged as normal; a true positive is a fault sample that is correctly judged as faulty; a false negative is a fault sample that is misjudged as normal; a false positive is a normal sample that is misjudged as faulty. Based on the features of the collected samples and the applicable scenarios of the classification metrics, we use four evaluation metrics (accuracy, F1 score, false negative rate and false positive rate) to quantitatively analyze the performance of the algorithms.
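The four metrics can be computed directly from the confusion counts. The sketch below uses the standard textbook definitions, which are assumed to match Table 2; the function name and the example labels are illustrative.

```python
def detection_metrics(y_true, y_pred):
    """Confusion-matrix metrics with the fault state (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0   # guard against empty classes

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "f1": ratio(2 * precision * recall, precision + recall),
        "false_negative_rate": ratio(fn, fn + tp),   # missed faults
        "false_positive_rate": ratio(fp, fp + tn),   # false alarms
    }

# Hypothetical ground truth and detector output for eight board samples.
m = detection_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
```

The false negative rate counts faults that go undetected, while the false positive rate counts normal boards flagged as faulty, so the two capture different operational costs.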

Table 2. Definition of performance metrics used in failure detection

4. Results of failure detection based on XGBoost

In this section, we give the results of failure detection and feature importance ranking based on XGBoost. First, the original data and the data used by the XGBoost algorithm after data preprocessing are described in detail. Then, the failure detection performance of XGBoost is compared with that of two commonly used ML algorithms (i.e., LR and SVM) based on accuracy, F1 score, false negative and false positive rates. Finally, the importance ranking of the input features is obtained from the feature importance of XGBoost, and the features related to equipment failure are found.

4.1 Datasets

The data used in the experiment are the performance data of OTN boards, which come from a metropolitan area network managed by an operator. We first analyze the data of OTN boards (type 1), which are relatively balanced. The original data comprise 50,823 samples collected over 47 days. The performance events of a monitoring object on the same day are not stored in the same row, so data aggregation is necessary. After data aggregation, the number of samples is 14,515 and the number of boards of the same type is 319. Compared with our previous work [35], the data are processed further to ensure their reliability. Using the Pandas library of Python, we delete the boards whose label (i.e., UAS) is missing on more than 2 of the 47 days, and then delete duplicate records. After data cleaning, 237 boards of the same type with 11,139 samples remain for the XGBoost algorithm. In feature selection, among the counting performance events only the UAS is retained as the label and the other error indicator columns are deleted; among the analog performance events, we delete the feature columns whose data are missing for more than half of the data collection period. After data cleaning, feature selection and feature normalization, a total of 11,139 samples from 237 boards of the same type over 47 days are used for failure detection and cause analysis.
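The aggregation and cleaning steps can be sketched as follows. This is a simplified plain-Python illustration rather than the Pandas workflow mentioned above: the record fields `board`, `day`, `event` and `value` are hypothetical, and each event carries a single value instead of separate minimum, average and maximum values.

```python
from collections import defaultdict

UAS_THRESHOLD = 80_000        # fault label threshold from Sec. 3.1
MAX_MISSING_LABEL_DAYS = 2    # boards missing the label on more days are dropped

def aggregate(rows):
    """Merge per-event rows into one record per (board, day)."""
    table = defaultdict(dict)
    for row in rows:
        # Re-assigning the same key also collapses duplicated records.
        table[(row["board"], row["day"])][row["event"]] = row["value"]
    return table

def clean_and_label(table, n_days):
    """Drop boards with too many unlabeled days; return (board, day, features, label)."""
    labeled_days = defaultdict(set)
    for (board, day), events in table.items():
        if "UAS" in events:
            labeled_days[board].add(day)
    boards = {board for board, _ in table}
    kept = {b for b in boards
            if n_days - len(labeled_days[b]) <= MAX_MISSING_LABEL_DAYS}
    samples = []
    for (board, day), events in sorted(table.items()):
        if board in kept and "UAS" in events:
            features = {k: v for k, v in events.items() if k != "UAS"}
            label = int(events["UAS"] > UAS_THRESHOLD)
            samples.append((board, day, features, label))
    return samples
```

Calling `clean_and_label(aggregate(rows), n_days=47)` on raw event rows would yield one labeled sample per board and day, with UAS removed from the features so that the label cannot leak into the inputs.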

After data preprocessing, five performance events remain as input features: environment temperature, input optical power, output optical power, laser bias current and laser temperature offset. Each performance event contributes its average, maximum and minimum values, which gives the 15 input features F0 to F14 of the XGBoost model. The sample labels are the fault state (“1”) or the normal state (“0”); there are 7134 samples with label “1” and 4005 samples with label “0”, as shown in Table 3.

Table 3. Features of input and labels of output

4.2 Results of failure detection

After data pre-processing, the whole data set is divided into a training set and a test set in the proportion 38:9; that is, the data of the first 38 days are used as the training set and the data of the last nine days as the test set. The performance of XGBoost, SVM and LR is evaluated based on F1 score, accuracy, false negative and false positive rates.

To determine the optimal parameters of XGBoost and of the comparison algorithms SVM and LR, we adjust the parameters on the training set based on cross-validation. More specifically, we divide the training set into a training subset and a validation subset in the proportion 29:9; that is, the first 29 days of the training set are used for model training, and the next nine days are used for model validation. The best parameters of the algorithms determined in this way are shown in Table 4.
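Both splits are chronological, so they can be expressed with one helper that cuts the data at a day index. A minimal sketch, assuming each sample carries a hypothetical `day` field numbered from 1 to 47:

```python
def chronological_split(samples, n_first_days):
    """Samples whose day index is <= n_first_days go left, the rest go right."""
    left = [s for s in samples if s["day"] <= n_first_days]
    right = [s for s in samples if s["day"] > n_first_days]
    return left, right

# Hypothetical samples: one record per day for days 1..47.
samples = [{"day": d} for d in range(1, 48)]
train, test = chronological_split(samples, 38)        # 38:9 split
fit, validation = chronological_split(train, 29)      # 29:9 split inside the training set
```

Splitting by time rather than at random keeps every validation and test day later than the data the model was fitted on, which matches how a detector would be used in operation.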

Table 4. Best parameters of LR, SVM, XGBoost

After the optimal parameters are determined, the XGBoost, SVM and LR models are retrained on the training set, and the performance of the three algorithms is evaluated on a fixed test set. The results on the training set are shown in Fig. 3.

Fig. 3. (a) Accuracy and F1 score, (b) false negative rate and false positive rate of different algorithms.

As shown in Fig. 3, XGBoost has the best performance in accuracy, F1 score, false negative and false positive rates. In particular, Fig. 3(a) shows that the accuracy and F1 score of XGBoost are more than 10% higher than those of SVM and LR. There are two reasons for this. On the one hand, as an ensemble model, XGBoost iteratively adds trees to fit the decision residuals, which improves on the performance of the weak classifiers; on the other hand, XGBoost uses reasoning similar to that of a human when performing classification tasks. Moreover, Fig. 3(b) shows that the false negative and false positive rates of XGBoost are also the lowest, which reduces the economic losses that missed faults and false alarms cause to network operators.

The SVM algorithm achieved good performance in [3], whereas its detection performance on the OTN board (type 1) is clearly lower than that of the XGBoost algorithm. The data used in our experiment and in [3] come from the same type of board but are not the same data. We therefore verify the performance of the two algorithms on the data of [3]. The optimal SVM and XGBoost models are obtained by model tuning as described in Sec. 3.2, and their detection performance on the training set is shown in Fig. 4.

Fig. 4. Performance of failure detection of SVM and XGBoost.

It can be seen from Fig. 4 that both SVM and XGBoost achieve good classification performance on the data of [3]. The classification accuracy of SVM is 97.57% and that of XGBoost is 99.82%. To explore why the SVM results in Fig. 3(a) and Fig. 4 differ, we analyzed the data used in [3] and found that they are balanced, whereas the data used in our experiments are only relatively balanced (7134 fault samples versus 4005 normal samples). We therefore oversampled the normal data in our experiment and used the resampled data to evaluate classification performance, and found that the classification accuracies of SVM and XGBoost were 96.68% and 99.84%, respectively. This indicates that the classification performance of SVM is strongly affected by the data imbalance of the OTN board data set, while the classification performance of XGBoost is hardly affected by it. Consequently, we continue to use the data without oversampling to evaluate the detection performance of the XGBoost algorithm on the test set.
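The text does not specify how the normal data were oversampled. One common option, shown here purely as an assumed example, is random oversampling, which duplicates randomly chosen minority-class samples until both classes have the same size; the function name and the seed are illustrative.

```python
import random

def oversample_minority(samples, labels, seed=0):
    """Randomly duplicate samples of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)             # fixed seed for reproducibility
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        # Draw with replacement from the class until it reaches the target size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels
```

Applied to the training data described above, such a routine would raise the number of normal samples to that of the fault samples; it should be applied to training data only, so that the evaluation data keep their original class proportions.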

Based on the trained XGBoost model, we obtain its accuracy on the test set, where accuracy means the proportion of normal and fault samples that are correctly detected. By feeding the sample data of the last nine days into the trained XGBoost model, we obtain the detection results for the OTN boards. We then counted the data of these nine days in the test set and calculated the average detection accuracy of the model by comparing the detected state of each OTN board with its real operating state. The numbers of detected and real operating states and the detection accuracy over the last nine days are shown in Fig. 5. It can be seen from Fig. 5 that the operating state of most boards is correctly detected, and the average detection accuracy is higher than 99%. Therefore, the failure detection method based on XGBoost can realize intelligent failure detection.

Fig. 5. Number of detected boards compared with actual boards and detection accuracy based on XGBoost.

4.3 Failure cause identification by feature importance ranking

As shown in Sec. 4.2, the failure detection model based on XGBoost achieves good detection performance. In addition, because XGBoost splits nodes during tree construction, an importance score for each input feature can be obtained from the common feature importance measures in XGBoost (i.e., weight, cover, and gain). In this section, we consider the feature scores under these three measures in order to find the input features that are highly correlated with OTN board failure and thereby drive the identification of failure causes.

The input feature importance scores obtained by XGBoost based on weight, cover, and gain are shown in Fig. 6. It can be seen from Fig. 6 that the importance ranking of the input features differs among the three measures. For weight, the feature with the highest score is the average value of the laser bias current (F3), whereas for cover and gain it is the minimum value of the environment temperature (F2). Expert experience suggests that the environment temperature is the first external factor to check when equipment fails. However, because the rankings under weight, cover, and gain are different, this inconsistent feature attribution means we cannot infer that the top-scoring feature under any one measure is the most important feature in failure detection.

Fig. 6. Input feature importance score based on (a) weight, (b) cover, (c) gain in XGBoost.

Moreover, Fig. 6 shows that the importance score of F2 is the highest under cover and gain but relatively low under weight. Generally, the cover and gain produced by a split at the root node are higher than those produced by splits near the leaves. Therefore, we can conclude that F2 mostly performs node splitting at the root node. Combined with industry experience, we can conclude that the feature scores obtained with gain and cover are more likely to reveal the features that matter for the failure detection of this kind of board. However, the importance rankings under weight, cover, and gain do not tell us whether an input feature is positively or negatively correlated with failure detection from the perspective of global samples (i.e., all the training samples), nor do they let us interpret the relationship between the input features and the output for each sample from the perspective of a local sample (i.e., a single training sample).
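How a feature can rank first under gain and cover yet low under weight can be seen in a small sketch. It assumes the usual XGBoost definitions (weight is the number of splits that use a feature, while gain and cover are averaged over those splits) and uses entirely hypothetical split records rather than values from the trained model.

```python
from collections import defaultdict

# Hypothetical split records (feature, gain, cover) gathered over all trees.
# F2 splits once at a root (large gain and cover); F3 splits often, deeper down.
splits = [
    ("F2", 120.0, 1000.0),
    ("F3", 8.0, 400.0),
    ("F3", 5.0, 250.0),
    ("F3", 3.0, 120.0),
    ("F9", 20.0, 600.0),
    ("F9", 6.0, 150.0),
]


def importance_scores(split_records):
    """Return (weight, gain, cover) dictionaries keyed by feature name."""
    count = defaultdict(int)
    gain_sum = defaultdict(float)
    cover_sum = defaultdict(float)
    for feature, gain, cover in split_records:
        count[feature] += 1
        gain_sum[feature] += gain
        cover_sum[feature] += cover
    weight = dict(count)  # number of splits using the feature
    gain = {f: gain_sum[f] / count[f] for f in count}    # average gain per split
    cover = {f: cover_sum[f] / count[f] for f in count}  # average cover per split
    return weight, gain, cover


def top_feature(scores):
    return max(scores, key=scores.get)


weight, gain, cover = importance_scores(splits)
print("top by weight:", top_feature(weight))
print("top by gain:  ", top_feature(gain))
print("top by cover: ", top_feature(cover))
```

In this toy case F3 leads under weight while F2 leads under gain and cover, which is the same kind of disagreement reported for Fig. 6.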

5. Feature attribution consistency using SHAP

In this section, SHAP is introduced to address the inconsistency of feature attribution under the weight, cover, and gain measures of XGBoost. Feature attribution inconsistency means that a feature plays a key role in the model but receives a low importance value from a given importance measure. For example, F2 (the minimum value of the environment temperature) is assigned a low importance value under weight in XGBoost. SHAP, by contrast, guarantees consistent feature attribution in theory. Specifically, SHAP obtains a consistent attribution by calculating the contribution (SHAP value) of each feature to the output value of the XGBoost-based failure detection model. Therefore, SHAP is introduced to assist XGBoost in identifying failure causes, and the flow chart of introducing SHAP into the failure detection framework is shown in Fig. 7.

Fig. 7. Flow chart of SHAP assisting XGBoost in performing failure cause identification.

5.1 Principle of feature attribution consistency using SHAP

SHAP is based on the Shapley value, a game-theoretic concept put forward by the economist Lloyd Shapley. The method helps operators understand a detection result by letting them calculate how much each feature contributes to it. Moreover, SHAP theoretically ensures the consistency of feature attribution because it satisfies three properties: local accuracy (for each sample, the baseline plus the sum of the feature attributions equals the model output, as expressed in Eq. (8)), consistency (if the model changes so that the marginal contribution of a feature increases or stays the same, its attribution value also increases or stays the same), and missingness (a missing feature receives no attribution). These properties make SHAP a powerful tool for confirming the feature that is most relevant to equipment failure.

The core idea of SHAP is to calculate the marginal contribution of each input feature to the model output, expressed as the SHAP value of that feature. Measuring the influence of an input feature on the classification value of XGBoost through its SHAP value reveals that influence from both the global and the local point of view. SHAP is an additive feature attribution method in which all features are regarded as “contributors”. For each training sample, the model outputs a classification value, and the SHAP value is the “contribution” of each feature in the sample to that classification value. Let the i-th sample be $x_i$, its j-th feature be $x_{i,j}$, and its classification value be $y_i$, and let the baseline of the whole model (i.e., the mean output value of XGBoost over all samples) be $y_{base}$. The relationship between the SHAP values and $y_i$ is then given by Eq. (8).

$$y_i = y_{base} + f(x_{i,1}) + f(x_{i,2}) + \cdots + f(x_{i,j}) + \cdots \tag{8}$$

Here $f(x_{i,j})$ is the SHAP value of $x_{i,j}$. Intuitively, $f(x_{i,1})$ is the contribution of the first feature in the i-th sample to the final output value $y_i$. When $f(x_{i,1}) > 0$, the feature increases the output value, that is, its contribution to the output value is positive; when $f(x_{i,1}) < 0$, the feature reduces the output value, that is, its contribution is negative. The SHAP value is calculated according to Eq. (9).

$$f(x_{i,j}) = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|!\,(M - |S| - 1)!}{M!} \left[ f_x(S \cup \{j\}) - f_x(S) \right] \tag{9}$$

where N is the set of all features in the training set and M is its size, with M = 15 in this paper. S is a subset of N that excludes feature j, and $|S|$ is its size, so $|S| \le M - 1 = 14$. $f_x(S)$ is the average output over the samples, computed from the tree structure, when only the feature set S is used; $f_x(S \cup \{j\})$ is the same average when feature j is added to the feature set S. The factor $|S|!\,(M - |S| - 1)!/M!$ is the weight applied to the difference between the outputs with and without feature j for the corresponding feature subset S.
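As a quick sanity check on the weighting factor in Eq. (9), the weights over all subsets of the remaining $M - 1$ features sum to one. The short derivation below groups the subsets by their size $s = |S|$; the index $s$ is introduced here only for this illustration.

$$\sum_{S \subseteq N \setminus \{j\}} \frac{|S|!\,(M - |S| - 1)!}{M!} = \sum_{s=0}^{M-1} \binom{M-1}{s} \frac{s!\,(M - s - 1)!}{M!} = \sum_{s=0}^{M-1} \frac{(M-1)!}{M!} = \sum_{s=0}^{M-1} \frac{1}{M} = 1$$

With M = 15, each subset size from 0 to 14 therefore carries a total weight of 1/15, and the sum for one feature runs over $2^{14} = 16384$ subsets. For example, the empty subset has weight $0!\,14!/15! = 1/15$.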

The SHAP value of feature j is obtained from Eq. (9). Many different feature combinations can be drawn from the feature set N to form a subset S. The SHAP value of j is therefore a comprehensive score over all possible feature subsets, which accounts for the influence of the other features on the contribution of j. When $f(x_{i,j}) = 0$, feature j is not on the decision path, that is, j does not affect the decision value.
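Equations (8) and (9) can be checked numerically on a toy problem. The sketch below enumerates every subset by brute force, which is feasible only for a handful of features; practical SHAP implementations for tree models avoid this exponential enumeration. The value function `toy_value` is a hypothetical stand-in for $f_x(S)$ and is not derived from the trained XGBoost model.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value_fn):
    """Exact Shapley values by enumerating every subset S of N minus {j}, as in Eq. (9)."""
    m = len(features)
    phi = {}
    for j in features:
        others = [f for f in features if f != j]
        total = 0.0
        for size in range(m):  # |S| ranges over 0 .. M-1
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value_fn(s | {j}) - value_fn(s))
        phi[j] = total
    return phi


def toy_value(subset):
    """Hypothetical f_x(S): model output when only the features in S are known."""
    score = 0.1  # baseline output with no features
    if "F0" in subset:
        score += 0.5
    if "F9" in subset:
        score += 0.2
    if "F0" in subset and "F9" in subset:
        score += 0.1  # interaction between the two features
    return score  # F6 never changes the output


features = ["F0", "F6", "F9"]
phi = shapley_values(features, toy_value)
y_base = toy_value(frozenset())
y_i = toy_value(frozenset(features))

print({name: round(value, 3) for name, value in phi.items()})
print("Eq. (8) holds:", abs(y_base + sum(phi.values()) - y_i) < 1e-9)
```

The interaction term is shared equally between F0 and F9, and F6, which never influences the output, receives a SHAP value of zero, matching the remark about features that are not on the decision path.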

5.2 Feature attribution based on SHAP value

The global and local feature importance rankings based on SHAP values are shown in Fig. 8(a) and Fig. 8(b), respectively. The horizontal axis of Fig. 8(a) represents, for each feature, the mean absolute SHAP value over all training samples, which reflects the importance of the feature from a global perspective. It can be seen from Fig. 8(a) that the average environment temperature (F0) has the largest SHAP value; that is, removing the environment temperature feature changes the amplitude of the model output the most. It can therefore be concluded that the environment temperature is the most important feature in failure detection of the OTN board (type 1). Consequently, if this OTN board is detected to have failed, the environment temperature should be checked first.
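The global ranking described above, the mean of the absolute SHAP values per feature, reduces to a few lines. The SHAP matrix in this sketch is hypothetical (four samples, three features) and only illustrates the aggregation step.

```python
# Hypothetical SHAP values: one row per sample, one column per feature.
feature_names = ["F0", "F6", "F9"]
shap_rows = [
    [0.9, -0.2, 0.3],
    [-0.7, 0.1, 0.2],
    [0.8, -0.1, -0.4],
    [0.6, 0.0, 0.1],
]


def global_importance(rows, names):
    """Rank features by the mean absolute SHAP value over all samples."""
    n_samples = len(rows)
    scores = {
        name: sum(abs(row[k]) for row in rows) / n_samples
        for k, name in enumerate(names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = global_importance(shap_rows, feature_names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking absolute values matters: a feature that pushes some samples up and others down would otherwise average out to a small number even though it strongly affects individual decisions.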

Fig. 8. Feature importance ranking (a) from a global perspective, (b) from a local perspective.

For a randomly selected sample that is correctly classified as a failure, the influence of the input features on the model output can be read from Fig. 8(b). Red indicates that a feature increases the classified value, that is, it contributes positively to the classified value of this sample, while blue indicates the opposite. Features such as the average value of the environment temperature (F0) and the average value of the input optical power (F9) increase the classified value, whereas features such as the average laser temperature offset (F6) reduce it. Thus, the average environment temperature (F0) identified by SHAP remains the most important feature of the OTN board (type 1) from a local perspective. Industry experience indicates that an environment temperature anomaly is generally caused by a fan failure, so if this OTN board is detected to have failed, the cause may be a fan failure. At the same time, the local view shows whether each input feature plays a positive or negative role in the classified value of the sample.

6. Failure detection performance under unbalanced data

The data used for failure detection analysis on the OTN board (type 1) are relatively balanced. Considering the diversity of data in existing networks, we evaluate the failure detection performance (i.e., F1 score) of the XGBoost algorithm under data imbalance, and collect data from another type of OTN board (type 2) in the same network environment. After preprocessing the collected data, a total of 36,425 samples from 775 boards of the same type over 47 days are used for failure detection. There are 8,420 fault samples and 28,005 normal samples, which are labeled “1” and “0”, respectively. The input features of the failure detection model are the average, maximum, and minimum values of environment temperature, laser bias current, laser temperature offset, input optical power, and output optical power, which are represented by F0 to F14. The representation and order of F0–F14 are consistent with Table 3 in Sec. 4.1.

To evaluate the failure detection performance of the XGBoost algorithm under data imbalance, we choose the F1 score as the evaluation metric. At the same time, SHAP is introduced to confirm the features most related to OTN board failure and thereby enable the identification of the failure cause. As before, we divide the data into a training set, a validation set, and a test set with a ratio of 29:9:9, and obtain the optimal failure detection model by cross-validation. The confusion matrix of the optimal detection model on the test set is given in Table 5. From the confusion matrix, the accuracy, F1 score, false negative rate, and false positive rate for this kind of OTN board are 99.48%, 98.87%, 2.06%, and 0.06%, respectively. Compared with the metrics for the relatively balanced OTN board (type 1) analyzed in Sec. 4, the F1 score, accuracy, and false negative rate of XGBoost degrade under unbalanced data, but the F1 score is still higher than 98%. Therefore, it can be concluded that XGBoost can still achieve a high F1 score for failure detection under data imbalance.
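The reported metrics follow directly from the confusion-matrix counts in Table 5 (1572 true positives, 33 false negatives, 3 false positives, 5367 true negatives) and the standard metric definitions. The function name in this sketch is an assumption for illustration.

```python
def detection_metrics(tp, fn, fp, tn):
    """Compute the evaluation metrics from confusion-matrix counts (fault = positive)."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fn + fp),
        "false_negative_rate": fn / (fn + tp),
        "false_positive_rate": fp / (fp + tn),
    }


# Counts taken from Table 5 (test set of the type 2 OTN board).
metrics = detection_metrics(tp=1572, fn=33, fp=3, tn=5367)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f}%")
```

Running it reproduces the values quoted in the text: 99.48% accuracy, 98.87% F1 score, a 2.06% false negative rate, and a 0.06% false positive rate.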

Table 5. Confusion matrix of the test set

In addition, the feature importance ranking obtained with SHAP is shown in Fig. 9, which gives the global feature importance ranking of the OTN boards (type 2) under unbalanced and balanced data. Fig. 9 is composed of all samples, with each sample represented by a point. For each feature, the horizontal position of a point is the SHAP value of that feature for the sample, and the color of the point encodes the feature value, running from blue (low) to red (high). It can be concluded from Fig. 9 that feature F1 (the maximum value of the environment temperature) contributes the most to the XGBoost output, that is, F1 has the greatest influence on the XGBoost decision, followed by the minimum value of the input optical power (F11). Environment temperature and input optical power are therefore relatively important features in the failure detection of this kind of OTN board. Moreover, the higher the value of F1, the greater its SHAP value, and the maximum SHAP value of F1 is near 2.1. This also means that, under the premise of correct classification, a sample with a larger F1 value has a higher probability of being correctly classified. Similarly, the smaller the minimum input optical power (F11), the higher the probability of correct classification. We can also observe the influence of some important outliers on the detected values of the model. It can be seen from Fig. 9(a) that although the average value of the output optical power (F12) is not the most important feature globally, it is the most important feature for some samples.

Fig. 9. Feature importance ranking from a global perspective (all sample points) on (a) the unbalanced data set and (b) the balanced data set after under-sampling.

Moreover, based on the SHAP values, we can identify the feature attribution of the two types of OTN boards, namely F0 and F1, respectively. The feature attribution of the two types of boards is different, and so is the class balance of their data: the OTN board (type 1) data are relatively balanced, whereas the OTN board (type 2) data are unbalanced. To explore whether the difference in feature attribution is caused by this imbalance, we output the feature importance ranking of the OTN board (type 2) under balanced data. Specifically, we kept all the failure data and undersampled the normal data so that the failure data and the normal data both contained 8,420 samples. The resulting feature importance ranking based on SHAP values on the balanced data is shown in Fig. 9(b). It can be seen from Fig. 9 that the feature importance ranking based on SHAP values differs only slightly between the balanced and unbalanced data, and that F1 (the maximum value of the environment temperature) contributes the most to the output value of the XGBoost model in both cases. Therefore, it can be concluded that data imbalance does not affect the feature attribution for this OTN board.
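The balancing step, keeping every failure sample and randomly undersampling the normal samples to the same count, might look like the following sketch. Only the class counts (8,420 fault and 28,005 normal samples) come from the text; the function name, the fixed seed, and the use of uniform random sampling are assumptions, since the sampling procedure is not described in detail.

```python
import random


def undersample_majority(labels, majority_label, seed=0):
    """Keep every minority row and an equally sized random subset of majority rows.

    Returns the sorted indices of the rows to keep.
    """
    rng = random.Random(seed)
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    kept_majority = rng.sample(majority, len(minority))
    return sorted(minority + kept_majority)


# Label vector with the class counts reported for the type 2 board.
labels = [1] * 8420 + [0] * 28005
kept = undersample_majority(labels, majority_label=0)
balanced = [labels[i] for i in kept]
print(balanced.count(1), balanced.count(0))
```

Returning indices rather than copies makes it straightforward to select the matching feature rows before recomputing the SHAP values on the balanced subset.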

According to the feature importance ranking based on SHAP values, if this OTN board is detected to have failed, features such as environment temperature and input optical power should be checked. In addition, industry experience indicates that an environment temperature anomaly is usually caused by the fan and that abnormal input optical power is usually related to the receiving port of the board. Therefore, if this kind of board is abnormal, the cause of failure may be the fan or the receiving port of the board.

7. Conclusion

In this paper, a cause-aware failure detection scheme based on interpretable XGBoost was proposed. Compared with NNs, XGBoost improves the interpretability of failure detection through its ranking of feature importance, from which the possible failure causes can be inferred. The experimental results showed that the proposed scheme achieved higher detection accuracy and F1 score than SVM and LR, as well as lower false negative and false positive rates, and that the feature importance ranking of XGBoost revealed the features highly correlated with equipment failure. Moreover, considering the inconsistency of feature attribution under the weight, cover, and gain measures of XGBoost, SHAP was applied to obtain a consistent feature attribution by calculating the contribution (i.e., SHAP value) of each input feature to the detection result of XGBoost. The features most related to the failure of the two types of OTN boards were confirmed by the feature importance ranking based on SHAP values: the average value of the environment temperature (F0) and the maximum value of the environment temperature (F1), respectively. Industry experience indicates that a temperature anomaly is usually caused by the fan. Hence, it can be deduced that the failure of both types of OTN boards may be caused by a fan failure. Moreover, considering the diversity of data in existing networks, we evaluated the detection performance on two types of OTN boards whose data were relatively balanced and unbalanced, respectively. Experimental results showed that the average detection accuracy and F1 score of the proposed scheme based on XGBoost were both higher than 98%. Therefore, the proposed scheme can not only effectively realize failure detection but also enable the identification of the failure cause.

Funding

National Natural Science Foundation of China (No. 61871415, No. 61975020); State Key Laboratory of Information Photonics and Optical Communications (No. IPOC2020ZT05); the Key Laboratory Fund (No. 6142104190207).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. P. Lu, L. Zhang, X. Liu, J. Yao, and Z. Zhu, “Highly-Efficient Data Migration and Backup for Big Data Applications in Elastic Optical Inter-Data-Center Networks,” IEEE Netw. 29(5), 36–42 (2015).

2. Z. Zhu, W. Lu, L. Zhang, and N. Ansari, “Dynamic Service Provisioning in Elastic Optical Networks with Hybrid Single-/Multi-Path Routing,” J. Lightwave Technol. 31(1), 15–22 (2013).

3. Z. Wang, M. Zhang, D. Wang, C. Song, M. Liu, J. Li, L. Lou, and Z. Liu, “Failure prediction using machine learning and time series in optical network,” Opt. Express 25(16), 18553–18565 (2017).

4. J. Borland, “Analyzing the internet collapse,” MIT Technol. Rev. (2008).

5. S. K. Ibrahim, A. Ahmed, M. A. E. Zeidan, and I. E. Ziedan, “Machine learning techniques for satellite fault diagnosis,” Ain Shams Eng. J. 11(1), 45–56 (2020).

6. L. Bai, C. Wang, Q. Xu, S. Ventouras, and G. Goussetis, “Prediction of channel excess attenuation for satellite communication systems at Q-band using artificial neural network,” IEEE Antennas Wirel. Propag. Lett. 18(11), 2235–2239 (2019).

7. T. Liu, K. Kang, and H. Sun, “Fault prediction for satellite communication equipment based on deep neural network,” in Proc. ICVRIS (2018), pp. 176–178.

8. D. Mulvey, C. H. Foh, M. A. Imran, and R. Tafazolli, “Cell fault management using machine learning techniques,” IEEE Access 7, 124514–124539 (2019).

9. Y. Kumar, H. Farooq, and A. Imran, “Fault prediction and reliability analysis in a real cellular network,” in Proc. 13th Int. Wireless Commun. Mobile Comput. Conf. (IWCMC) (2017), pp. 1090–1095.

10. F. Ahmed, J. Erman, Z. Ge, A. X. Liu, J. Wang, and H. Yan, “Detecting and localizing end-to-end performance degradation for cellular data services,” in Proc. IEEE INFOCOM (2016), pp. 1–9.

11. D. Wang, Z. Zhang, M. Zhang, M. Fu, J. Li, S. Cai, C. Zhang, and X. Chen, “The Role of Digital Twin in Optical Communication: Fault Management, Hardware Configuration, and Transmission Simulation,” IEEE Commun. Mag. 59(1), 133–139 (2021).

12. B. Shariati, M. Ruiz, J. Comellas, and L. Velasco, “Learning From the Optical Spectrum: Failure Detection and Identification,” J. Lightwave Technol. 37(2), 433–440 (2019).

13. F. Musumeci, C. Rottondi, G. Corani, S. Shahkarami, F. Cugini, and M. Tornatore, “A tutorial on machine learning for failure management in optical networks,” J. Lightwave Technol. 37(16), 4125–4139 (2019).

14. Z. Wan, Z. Yu, L. Shu, Y. Zhao, H. Zhang, and K. Xu, “Intelligent optical performance monitor using multi-task learning based artificial neural network,” Opt. Express 27(8), 11281–11291 (2019).

15. D. Wang, M. Wang, M. Zhang, Z. Zhang, H. Yang, J. Li, J. Li, and X. Chen, “Cost-effective and data size–adaptive OPM at intermediated node using convolutional neural network-based image processor,” Opt. Express 27(7), 9403–9419 (2019).

16. Q. Xiang, Y. Yang, Q. Zhang, and Y. Yao, “Joint and accurate OSNR estimation and modulation format identification scheme using the feature-based ANN,” IEEE Photonics J. 11(4), 1–11 (2019).

17. S. T. Ahmad and K. P. Kumar, “Radial basis function neural network nonlinear equalizer for 16-QAM coherent optical OFDM,” IEEE Photonics Technol. Lett. 28(22), 2507–2510 (2016).

18. Y. Zhao, B. Yan, D. Liu, Y. He, D. Wang, and J. Zhang, “SOON: self-optimizing optical networks with machine learning,” Opt. Express 26(22), 28713–28726 (2018).

19. L. Xu, F. Qian, Y. Li, Q. Li, Y. W. Yang, and J. Xu, “Resource allocation based on quantum particle swarm optimization and RBF neural network for overlay cognitive OFDM system,” Neurocomputing 173, 1250–1256 (2016).

20. K. S. Mayer, J. A. Soares, R. P. Pinto, C. E. Rothenberg, D. S. Arantes, and D. A. Mello, “Soft Failure Localization Using Machine Learning with SDN-based Network-wide Telemetry,” in Proc. Eur. Conf. Opt. Commun., Brussels, Belgium (2020).

21. H. Yang, B. Wang, Q. Yao, A. Yu, and J. Zhang, “Efficient hybrid multi-faults location based on hopfield neural network in 5G coexisting radio and optical wireless networks,” IEEE Trans. Cognit. Commun. Netw. 5(4), 1218–1228 (2019).

22. X. Chen, B. Li, R. Proietti, Z. Zhu, and S. B. Yoo, “Self-Taught Anomaly Detection With Hybrid Unsupervised/Supervised Machine Learning in Optical Networks,” J. Lightwave Technol. 37(7), 1742–1749 (2019).

23. X. Chen, B. Li, M. Shamsabardeh, R. Proietti, Z. Zhu, and S. B. Yoo, “On real-time and self-taught anomaly detection in optical networks using hybrid unsupervised/supervised learning,” in Proc. Eur. Conf. Opt. Commun., Rome, Italy (2018).

24. S. Shahkarami, F. Musumeci, F. Cugini, and M. Tornatore, “Machine-Learning-Based Soft-Failure Detection and Identification in Optical Networks,” in Proc. Opt. Fiber Commun., San Diego, CA, USA (2018), Paper M3A.5.

25. Z. Li, Y. Zhao, Y. Li, S. Rahman, X. Yu, and J. Zhang, “Demonstration of Fault Localization in Optical Networks Based on Knowledge Graph and Graph Neural Network,” in Proc. Opt. Fiber Commun., San Diego, CA, USA (2020), Paper Th1F.5.

26. C. Zhang, D. Wang, L. Wang, J. Song, S. Liu, J. Li, L. Guan, Z. Liu, and M. Zhang, “Temporal data-driven failure prognostics using BiGRU for optical networks,” J. Opt. Commun. Netw. 12(8), 277–287 (2020).

27. D. Rafique, T. Szyrkowiec, H. Grießer, A. Autenrieth, and J. P. Elbers, “Cognitive Assurance Architecture for Optical Network Fault Management,” J. Lightwave Technol. 36(7), 1443–1450 (2018).

28. D. Rafique, T. Szyrkowiec, H. Grießer, A. Autenrieth, and J. P. Elbers, “TSDN-enabled network assurance: A cognitive fault detection architecture,” in Proc. Eur. Conf. Opt. Commun., Gothenburg, Sweden (2017).

29. R. Agarwal, N. Frosst, X. Zhang, R. Caruana, and G. E. Hinton, “Neural additive models: Interpretable machine learning with neural nets,” arXiv preprint arXiv:2004.13912 (2020).

30. D. P. Kuttichira, S. Gupta, C. Li, S. Rana, and S. Venkatesh, “Explaining black-box models using interpretable surrogates,” in Pacific Rim International Conference on Artificial Intelligence, Springer, Cham (2019).

31. J. Xie, Z. Li, Z. Zhou, and S. Liu, “A Novel Bearing Fault Classification Method Based on XGBoost: The Fusion of Deep Learning-Based Features and Empirical Features,” IEEE Trans. Instrum. Meas. 70, 1–9 (2021).

32. M. Chen, Q. Liu, S. Chen, Y. Liu, C. H. Zhang, and R. Liu, “XGBoost-Based Algorithm Interpretation and Application on Post-Fault Transient Stability Status Prediction of Power System,” IEEE Access 6, 21020–21031 (2018).

33. D. Zhang, L. Qian, B. Mao, C. Huang, B. Huang, and Y. Si, “A Data-Driven Design for Fault Detection of Wind Turbines Using Random Forests and XGboost,” IEEE Access 7, 13149–13158 (2019).

34. T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (2016), pp. 785–794.

35. C. Zhang, D. Wang, C. Song, L. Wang, J. Song, L. Guan, and M. Zhang, “Interpretable learning algorithm based on XGBoost for fault prediction in optical network,” in Proc. Opt. Fiber Commun., San Diego, CA, USA (2020), Paper Th1F.3.

36. J. Friedman, T. Hastie, and R. Tibshirani, “The Elements of Statistical Learning,” Springer Series in Statistics, Springer, Berlin, Germany (2001).

Evaluation metrics	Definition	Meaning in failure detection
Accuracy	$\mathrm{Accuracy} = \frac{TN + TP}{TN + TP + FN + FP}$	Proportion of correctly judged fault and normal samples among all samples.
Precision	$\mathrm{Precision} = \frac{TP}{FP + TP}$	Proportion of correctly judged fault samples among all samples judged as fault.
Recall	$\mathrm{Recall} = \frac{TP}{FN + TP}$	Proportion of correctly judged fault samples among all fault samples.
F1 score	$\mathrm{F1\ score} = \frac{2TP}{2TP + FN + FP}$	Harmonic mean of precision and recall.
False negative rate	$\mathrm{False\ negative\ rate} = \frac{FN}{FN + TP}$	Proportion of fault samples misjudged as normal among all fault samples.
False positive rate	$\mathrm{False\ positive\ rate} = \frac{FP}{TN + FP}$	Proportion of normal samples misjudged as fault among all normal samples.

	Performance events	Average	Maximum	Minimum	Unit
Feature	Environment temperature	F0	F1	F2	°C
	Laser bias current	F3	F4	F5	mA
	Laser temperature offset	F6	F7	F8	°C
	Input optical power	F9	F10	F11	dBm
	Output optical power	F12	F13	F14	dBm
	Unavailable time	UAS			s
Label	Unavailable time (UAS) > 80,000			Fault state (“1”)
Label	Unavailable time (UAS) < 80,000			Normal state(“0”)

Confusion matrix		Detected label
Confusion matrix		1	0
True label	1	1572	33
True label	0	3	5367
F1 score		0.9887

Classification	Performance events
Counting performance events	Unavailable time (UAS), forward error correction (FEC)…
Analog performance events	Center wavelength, center wavelength shift, working temperature, transmitted optical power, received optical power…

ML algorithm	Parameter	Value
LR	Penalty	“L2”
LR	Regularization coefficient (C)	0.9
SVM	Kernel function	“radial basis function”
SVM	Penalty factor (C)	50
XGBoost	Number of trees (n_estimators)	100
XGBoost	Learning rate (learning_rate)	0.1
XGBoost	Maximum tree depth (max_depth)	7