Heart rate estimation from facial videos with motion interference using T-SNE-based signal separation

Hequn Wang; Xuezhi Yang; Xuezhi Yang; Xuenan Liu; Dingliang Wang

doi:10.1364/BOE.457774

1. Introduction

Heart rate (HR) is one of the essential indicators of cardiovascular health [1]. Furthermore, heart rate detection is a significant way to obtain the health of the human body, analyze changes in human emotions, and detect human stress. Heart rate monitoring daily can effectively prevent cardiovascular disease. Traditional heart rate measurement techniques are based on contact sensors, such as electrocardiogram (ECG) probes and pulse oximeters. Although the contact device measures the heart rate more accurately, direct contact with the skin may cause skin irritation for some people [2–4].

Therefore, the study of non-contact heart rate measurement is of great significance. Remote photoplethysmography (RPPG) is the primary non-contact heart rate measurement method for extracting pulse signals from facial video. The change of blood volume in the capillaries underneath the skin leads to a slight change in the skin color, which consumer-level cameras can capture. These small changes can realize heart rate detection. In recent years, some RPPG core algorithms have been proposed. In 2008, Verkruysse et al. first proposed remote photoplethysmography [5]. They pointed out that consumer-level cameras can measure PPG pulse signals by using ambient light as an illumination source. They also believed that the green channel has more robust heart rate information than other channels. Blind source separation (BSS) has been a more popular method in recent years. In 2010, Poh et al. proposed the independent component analysis (ICA) method [6]. In 2011, Lewandowska et al. proposed the principal component analysis (PCA) method [7]. Both methods can extract periodic pulse signals from the RGB signals. When the BSS processes some periodic signals, such as head circular motion or running, the method has a significant error in estimating the heart rate. In 2013, De Haan and Jeanne et al. proposed the CHROM method [8], which calculates the pulse signal as a linear combination of the chrominance signals, assuming a standardized skin color to white balance the camera. However, this method does not perform well when dealing with distorted signals due to insufficient light. In 2016, Wang et al. proposed the POS method [9], which extracts the pulse signal using the skin color orthogonal plane in the RGB space normalized in time, and uses the skin reflection model to convert the three-dimensional raw RPPG signals into the two-dimensional signals. However, in the case of extensive head motion interference, the noise signal caused by the face angle change is enormous. The orthogonal plane of POS cannot deal with these noises. In 2019, Yu et al. proposed rPPGNet [10], which is composed of a spatiotemporal convolution network, attention mechanism, and partition constraint module. The attention mechanism helps to adaptively select skin regions. Partition constraint is to help the model better learn RPPG features. Deep learning can extract color change features from facial video, and build a more refined model to predict heart rate more accurately by learning color change features. However, due to the limitation of data sets, the robustness of the model needs to be tested. In 2020, Wang et al. proposed the singular spectrum analysis (SSA) method [11]. The baseline offset and high-frequency random noise are removed through the preprocessing process, and the singular spectrum of the signal includes the base components of the signal. An anti-interference base expression model of the facial pulse signal selects the base components and simultaneously removes the motion interference base components and aperiodic irregular noise. When dealing with continuous motion interference, this method cannot obtain accurate periodic components and extract the base component of the facial pulse wave, so its performance is poor.

This paper proposes a novel method to solve the above problems. Our method is based on the conditional probability distribution model, which uses the similarity between data points to separate the pulse signal from the raw RPPG signals. A more accurate pulse signal is extracted to estimate the heart rate. It has an excellent anti-interference effect. The framework of TSS is shown in Fig. 1. Firstly, after consumer-level cameras record the facial video, the face area of the subjects is selected through face tracking to remove the rigid motion interference and the background interference. The raw RPPG signals are extracted by spatial pixel averaging for each video frame. A self-adaptive characteristic matrix generated by T-SNE is then used to separate the reconstructed RGB signals. T-SNE expresses the similarity between data points according to conditional probability. The similarity between the data points of the pulse signal is very high. In contrast, the similarity between the data points of the pulse signal and the motion noise signal is very low. The characteristic matrix can effectively remove the motion interference noise. Finally, according to the spectral characteristics of the signal, the pulse signal will be selected. The TSS is tested on a self-collected dataset and two public datasets (UBFC-RPPG [12] and VIPL-HR [13]). The experimental results show that the proposed method significantly outperforms other state-of-the-art methods in heart rate estimation in the motion interference scenario.

Fig. 1. Overall framework of the proposed method. Firstly, the RGB channels are extracted by face tracking. Then the pulse signal will be extracted using the TSS method. Finally, the heart rate is estimated by Temporal filtering and Fourier transform.

Download Full Size | PDF

2. Method

In this section, the framework of the proposed method will be introduced in detail. We use the following mathematical conventions. Vectors and matrices are represented in boldface characters, and row vectors are represented by $\boldsymbol v$.

2.1 Face detection and tracking

After using consumer-level cameras to record facial video, the face area of the subject is selected through face tracking to reduce the influence of background noise. Previous studies have shown that both sides’ forehead and cheek areas can get better pulse signals [14]. In this paper, we use the open-source face detector SeetaFace [15] to detect the face area of each frame and detect five facial landmarks (two eyes’ centers, nose tip, and two mouth corners). To make full use of the ROI area containing the pulse wave signal in the face, we continue to divide the face area. According to the coordinates of the five facial landmarks, the face rectangle will be divided into several parts. As shown in Fig. 1, the blue rectangle contains the middle part of the eyes and mouth, which is the part of the both sides cheeks. Due to the difference in hair length of each person, we discard the forehead as the ROI area.

Finally, the output will get a stable facial video, and the raw RPPG signals $\boldsymbol {H}\left ( N \right )$ are obtained through spatial pixel averaging, as shown in Eq. (1):

(1)$$H\left( N \right)=\left\{ R\left( N \right),G\left( N \right),B\left( N \right) \right\}=\left\{ \frac{\sum\nolimits_{x,y\in ROI}{ROI\left( x,y,N \right)}}{M} \right\}$$

where $\boldsymbol {ROI}\left ( x,y,N \right )$ denotes the ROI area of the Nth frame image, and $\boldsymbol {M}$ is the total number of pixels.In addition, the raw RPPG signals are also the raw RGB signals.

In Fig. 1, the red rectangle selects the face region, the green dots mark the five facial landmarks, the yellow line divides the face region into several parts according to the five feature points, and the blue rectangle is the selected ROI area. The experimental results show that using the face detector SeetaFace as the preprocessing method to remove the non-face area can effectively remove the background noise, and the best ROI area can be obtained by face area segmentation.

2.2 TSS method

A novel method (TSS) that uses the T-SNE algorithm for signal separation is proposed to extract the pulse signal from the raw RPPG signals. The main steps of the proposed TSS method include a self-adaptive characteristic matrix generation, raw RGB signals reconstruction, and pulse signal vector selection.

T-distributed stochastic neighbor embedding (T-SNE) [16–19] is a statistical method for nonlinear dimensionality reduction of data. In the low-dimensional space, the T-SNE algorithm employs a student t-distribution with one degree of freedom to calculate the similarity between two points; in the high-dimensional space, the similarity between two points is measured by Gaussian probability distribution. Before introducing the T-SNE algorithm, we first explain the SNE algorithm. The basic principle of SNE is first to select the nearest neighbor samples through the complexity factor. The high-dimensional Euclidean distance between adjacent data points is transformed into conditional probability to represent the sample similarity. Finally, the Kullback-Leibler divergence objective function calculates the low-dimensional data. Suppose $\boldsymbol {{x}_{i}}$, $\boldsymbol {{x}_{j}}$ are the two points of N high-dimensional data. The conditional probability $\boldsymbol {{p}_{j|i}}$ is used to represent the similarity between the data $\boldsymbol {{x}_{i}}$ and $\boldsymbol {{x}_{j}}$, reflecting the probability that $\boldsymbol {{x}_{i}}$ picks $\boldsymbol {{x}_{j}}$ as its nearest neighbor. Mathematically, the conditional probability $\boldsymbol {{p}_{j|i}}$ is given by:

(2)$${{p}_{j|i}}=\frac{\exp \left( -\left\| {{x}_{i}}- \right.{{\left. {{x}_{j}} \right\|}^{2}}/2{{\sigma }_{i}}^{2} \right)}{\sum_{k\ne i}{\exp \left( -\left\| {{\left. {{x}_{i}}-{{x}_{k}} \right\|}^{2}}/ \right.2{{\sigma }_{i}}^{2} \right)}}$$

where $\boldsymbol {{\sigma }_{i}}$ is the Gaussian variance centered on datapoint $\boldsymbol {{x}_{i}}$ and is determined by the binary search algorithm. The area with high data density should choose smaller $\boldsymbol {{\sigma }_{i}}$.

In the low-dimensional space, the datapoints $\boldsymbol {{x}_{i}}$ and $\boldsymbol {{x}_{j}}$ are mapped to $\boldsymbol {{y}_{i}}$ and $\boldsymbol {{y}_{j}}$. Likewise, the similarity between $\boldsymbol {{y}_{i}}$ and $\boldsymbol {{y}_{j}}$ is measured by $\boldsymbol {{q}_{j|i}}$ :

(3)$${{q}_{j|i}}=\frac{\exp \left( -\left\| {{y}_{i}}- \right.{{\left. {{y}_{j}} \right\|}^{2}} \right)}{\sum_{k\ne i}{\exp \left( -\left\| {{\left. {{y}_{i}}-{{y}_{k}} \right\|}^{2}} \right. \right)}}$$

However, the tail of the Gaussian distribution is low, the fitting result of the Gaussian distribution deviates from the position where most samples are located. In contrast, the T distribution is not sensitive to abnormal points due to the heavy-tailed distribution, and the fitting results are more reasonable. Therefore, T distribution is introduced based on SNE. This processing solves the crowding problem of SNE data points, and the overall characteristics of the data are better captured. In low-dimensional space, the joint probability distribution $\boldsymbol {{p}_{ij}}$ is defined by Eq. (4); in high-dimensional space, the joint probability distribution $\boldsymbol {{q}_{ij}}$ is defined by Eq. (5):

(4)$${{p}_{ij}}=\frac{{{p}_{i|j}}+{{p}_{j|i}}}{2}$$

(5)$${{q}_{ij}}=\frac{{{\left( 1+\left\| {{y}_{i}}- \right.{{\left. {{y}_{j}} \right\|}^{2}} \right)}^{{-}1}}}{\sum_{k\ne l}{{{\left( 1+\left\| {{\left. {{y}_{k}}-{{y}_{l}} \right\|}^{2}} \right. \right)}^{{-}1}}}}$$

A natural measure of the faithfulness with which $\boldsymbol {{q}_{ij}}$ models $\boldsymbol {{p}_{ij}}$ is the Kullback-Leibler divergence. The best-simulated point is obtained by minimizing gradient descent’s Kullback-Leibler divergence of all data points. The cost function $\boldsymbol C$ is defined as:

(6)$$C=KL\left( P||Q \right)=\sum_{i}{\sum_{j}{{{p}_{ij}}\log \frac{{{p}_{ij}}}{{{q}_{ij}}}}}$$

where T-SNE is aiming at minimizing the cost function $\boldsymbol C$, and it is typically minimized by descending along the gradient. The gradient of the Kullback-Leibler divergence is given by:

(7)$$\frac{\delta C}{\delta {{y}_{i}}}=4{{\sum_{j}{\left( {{p}_{ij}}-{{q}_{ij}} \right)\left( {{y}_{i}}-{{y}_{j}} \right)\left( 1+||{{y}_{i}}-{{y}_{j}}|{{|}^{2}} \right)}}^{{-}1}}$$

Eq.(2)-(7) realizes the dimensionality reduction of high-dimensional data by T-SNE. Our method is based on such a model. Assuming that $\boldsymbol {H}\left ( N \right )$ is statistically independent and non-gaussian, it is defined that the observation signal $\boldsymbol {Z}\left ( N \right )$ is a linear mixture of the raw RPPG signals $\boldsymbol {H}\left ( N \right )$:

(8)$$Z(N)=W\cdot H(N)$$

where $\boldsymbol {W}$ is the mixing matrix to be obtained, and $\boldsymbol {Z}\left ( N \right )$ is composed of the pulse signal and the noise signal. At the same time, Eq. (8) is also a traditional blind source separation model.

The innovation of the proposed method is that we use conditional probability distribution as the basis of our research and use it to express the similarity between data points. On the contrary, most previous research methods build models based on the color change to extract pulse signals. Combining our experiments, if the acquired raw RPPG signals contain the jump signal caused by motion interference, we regard the jump value as an abnormal point. The remaining data are pulse signals. The similarity between the points of pulse data is very high. However, the similarity between the data of pulse signal and the data points caused by motion interference is tiny. Therefore, we can use T-SNE to process the raw RPPG signals, and finally generate a self-adaptive characteristic matrix, which represents the mapping of high-dimensional data to low-dimensional data. The distance of points with higher similarity in this space will be smaller; the distance of points with smaller similarity in this space will be farther. In addition, T- SNE also uses gradient descent to avoid their distances being too far away. The mapping can obtain the characteristics of the entire data and effectively deal with outliers.

So, the first step of our method is obtaining the adaptive mixing matrix $\boldsymbol {W}$. Based on Maaten’s previous research [20], T-SNE can be realized in MATLAB with very little code, and the T-SNE code has been open-sourced. The raw RPPG signals $\boldsymbol {H}\left ( N \right )$ are the input signal, which will be preprocessed by PCA [16] before running T-SNE. This preprocessing mainly speeds up the computation of pairwise distances between the data points and suppresses some noise without severely distorting the interpoint distances. Finally, the self-adaptive characteristic matrix $\boldsymbol {W}$ will be got.

Next step, the inverse matrix $\boldsymbol {{W}^{T}}$ of the self-adaptive characteristic matrix is multiplied with the raw RGB signals $\boldsymbol {H}\left ( N \right )$ to reconstruct RGB signals. The RGB signals reconstruction matrix is defined by $\boldsymbol M$. The row vector of $\boldsymbol M$ consists of the pulse signal and noise interference signal.

Finally, because the pulse signal vector index is unknown in the reconstruction matrix $\boldsymbol M$, we need to use a method to select the pulse signal vector quickly. Fig. 2(b) is the power spectrum of TSS signal, and Fig. 2(c) is the power spectrum of noise signal. Based on the previous research [21,22], through the analysis of the spectral power, we found that the spectral power of the pulse signal is concentrated in a small frequency band of the heart rate (yellow rectangle), and the spectral power of the noise is distributed on the passband of the filter (black rectangle). Therefore, we define an indicator $\boldsymbol Q$ to evaluate the quality of such spectrum structure. The evaluating indicator $\boldsymbol Q$ is the ratio of the power near the heartbeat frequency to the noise power in the filter passband, $\boldsymbol Q$ is defined as:

(9)$$Q=\frac{\int_{hr\text{-}a}^{hr+a}{(k(f))}df}{\int_{{{B}_{1}}}^{{{B}_{2}}}{(k(f))}df-\int_{hr\text{-}a}^{hr+a}{(k(f))}df}$$

where $\boldsymbol {k}\left ( f \right )$ denotes the power spectral density (PSD), [${hr}$-a, ${hr}$+a] denotes a small area around the heart rate (hr), which is the width of the yellow rectangle in Fig. 2, where we set a to 0.5. [$\boldsymbol {B}_{1}$, $\boldsymbol {B}_{2}$ ] is the passband of the bandpass filter ([0.8 Hz,4Hz]). Through a large amount of experimental data, it has been found that the row vector with the largest evaluating indicator $\boldsymbol Q$ is the pulse signal row vector.

Fig. 2. TSS method. (a) The proposed method separates the raw RPPG signals into the TSS signal and noise signals. The G channel contains complex motion interference. The TSS signal is the pulse signal selected according to the indicator Q. The noise signal is one of several noise signals; (b) Power spectrum of TSS signal; (c) Power spectrum of noise signal.

Download Full Size | PDF

After completing the above steps, the pulse signal row vector can be obtained. As shown in Fig. 2, we extract the pulse signal from a series of very complex interference signals. The specific algorithm flow is shown in Algorithm 1.

$\textbf {Algorithm 1:}$ T-SNE-based Signal Separation

$\textbf {Require:}$ $\boldsymbol {{T}_{o}}$, $\boldsymbol {{T}_{i}}$, $\boldsymbol {{T}_{p}}$, A video sequence containing N frames

$\textbf {Initialization:}$ $\boldsymbol {H}\left ( N \right )$ =[R(N),G(N),B(N)], $\textbf {PSD}$ = zeros (1, N), $\textbf {fr}$ = 30fps

$\textbf {1:}$ $\boldsymbol {W}$ = $\textbf {tsne}$($\boldsymbol {H}\left ( N \right )$,[], $\boldsymbol {{T}_{o}}$, $\boldsymbol {{T}_{i}}$, $\boldsymbol {{T}_{p}}$);$\leftarrow$ matrix $\boldsymbol {W}$ with dimension 3$\times$ ${{T}_{o}}$

$\textbf {2:}$ $\boldsymbol {M}$ = $\boldsymbol {{W}^{T}}$ $\times$ $\boldsymbol {H}\left ( N \right )$;$\leftarrow$ matrix $\boldsymbol {M}$ with dimension ${{T}_{o}}$ $\times$ $N$

$\textbf {3:}$ $\textbf {for}$ i = 1,2,3…, ${{T}_{o}}$ $\textbf {do}$

$\textbf {4:}$ $\quad$ $\textbf {PSD}$ [i] = $\boldsymbol {Q}$(abs(fft((ideal_passing($\boldsymbol {M}$(i, :),0.8,4,fr),N)))); $\leftarrow$ evaluating indicator $\boldsymbol {Q}$

$\textbf {5:}$$\textbf {end for}$

$\textbf {6:}$ [ _, $\textbf {index}$] = $\max$($\textbf {PSD}$);

$\textbf {7:}$ $\boldsymbol {T}$=$\boldsymbol {M}$[$\textbf {index}$,] $\leftarrow$ pulse signal row vector $\boldsymbol {T}$

$\textbf {8:}$ $\boldsymbol {O}$=normalization($\boldsymbol {T}$); $\leftarrow$ temporal normalization

$\textbf {Output: The pulse signal }$ $\boldsymbol {O} $

In this paper, we propose a novel method to deal with motion interference. The TSS first uses a self-adaptive characteristic matrix generated by T-SNE for raw RGB signals reconstruction and then uses an indicator to select the pulse signal quickly. In the experimental setup, we will detail the dimensionality size $\boldsymbol {{T}_{o}}$ of the final dimensionality reduction of T-SNE. The proposed TSS method is more accessible to implement and understand than other state-of-the-art methods. In addition, good experimental results can also be obtained.

3. Experiments

This section will introduce the experiments. The experiments in this paper are tested on a self-collected dataset and two public datasets, namely UBFC-RPPG [12] and VIPL-HR [13]. The remainder of this section is structured as follows. Section 3.1 will introduce the experimental setup for evaluating the proposed TSS method and the four statistical indicators for evaluating the experimental results. In sections 3.2–3.3, we will introduce the experimental results of the self-collected dataset. Section 3.4 will introduce the experimental results of the two public datasets.

3.1 Experimental setup

This section will introduce the experimental setup for evaluating the proposed TSS method. Firstly, we conduct experiments in a laboratory illuminated by fluorescent lights. A consumer-level webcam (Logitech C920) is used to record facial video with 30 frames per second, and the resolution is 640 $\times$ 480 pixels. Video is stored in AVI format under RGB color space. At the same time, the PPG sensor and finger clip instrument were used to measure real-time heart rate. The PPG sensor obtains the ground-truth pulse signal from the fingers at the frequency of 200Hz. 25 subjects (17 males and 8 females) participated in this experiment. The video duration varied according to the experimental scenario. We collected 100 videos as the data set of this experiment, including stationary experiment, head movement experiment, and exercise recovery experiment. It is worth noting that we used the fitness equipment in the cardiac rehabilitation center to cooperate with collecting videos in the exercise recovery experiment. As shown in Fig. 3, it is the background figure of the experimental setup and the screenshots of the subject’s head movement.

Fig. 3. Experimental setup diagram and screenshots of the subject’s head movement.

Download Full Size | PDF

In order to evaluate the accuracy of the proposed method, four statistical metrics were used to evaluate the error between the heart rate value ${HR}_{video}$ estimated by our method and the real-time ground-truth heart rate value ${HR}_{gt}$ measured by PPG sensor and finger clip instrument. These are Mean Absolute Error (MAE), Standard Deviation (SD), Average Root Square Error (RMSE) and Pearson correlation coefficient ($\rho$). The measuring error was computed as ${{HR}_{error} = {HR}_{video}-{HR}_{gt}}$, Eq. (10)–(13) constituted the evaluation metrics [22]. Because the real-time ground-truth heart rate value ${HR}_{gt}$ changes dynamically during recording video, we used a 10-second sliding window (9-second overlap) to get the real-time heart rate in the experiment. When estimating heart rate, it is necessary to perform temporal filtering and Fourier transform on the pulse signal. We manually set the parameters of the band-pass filter. Based on previous studies [23], it is found that the range of ordinary people’s heart rate is [48,240]bpm, so the band of the band-pass filter was set to [0.8Hz, 4Hz]. The frequency $(fr)$ corresponding to the peak in the spectral power is the heartbeat frequency, so the estimated heart rate value was calculated as $H{{R}_{video}}=60\times fr$.

(10)$$MAE=\frac{1}{N}\sum_{n=1}^{N}{\left| H{{R}_{error}} \right|}$$

(11)$$SD=\sqrt{\frac{1}{n-1}\sum_{n-1}^{N}{{{\left( H{{R}_{error}}-MAE \right)}^{2}}}}$$

(12)$$\text{RMSE=}\sqrt{\frac{1}{N}\sum_{n=1}^{N}{H{{R}_{error}}^{2}}}$$

(13)$$\rho =\frac{\operatorname{cov}(H{{R}_{video}},H{{R}_{gt}})}{\text{std}(H{{R}_{video}})\text{std}(H{{R}_{gt}})}$$

As shown in Algorithm 1, we need to set three parameters ($\boldsymbol {{T}_{o}}$, $\boldsymbol {{T}_{i}}$, and $\boldsymbol {{T}_{p}}$) in the T-SNE method. $\boldsymbol {{T}_{o}}$ [16] denotes the number of dimensions mapped to low-dimensional data. The value range of parameter $\boldsymbol {{T}_{o}}$ can be [50, 100]. This interval is obtained from our extensive experimental analysis. If $\boldsymbol {{T}_{o}}$ is set too small, the pulse signal extraction will be insufficient, and the signal still contains much noise; if $\boldsymbol {{T}_{o}}$ is set too large, multiple pulse signal vectors will be generated to estimate the heart rate. The indicator $\boldsymbol Q$ of multiple pulse signal vectors are almost close, resulting in increased computational cost. The parameter $\boldsymbol {{T}_{i}}$ is related to the input signal preprocessing, where the input data will be reduced to $\boldsymbol {{T}_{i}}$ by PCA [16], which is beneficial to speed up the T-SNE operation. In this paper, we set $\boldsymbol {{T}_{i}}$ to 3. The parameter $\boldsymbol {{T}_{p}}$ [16] represents the perplexity of the Gaussian distributions employed in the high-dimensional space and also represents a smooth measure of the effective number of neighbors. The performance of SNE is relatively robust to changes in the perplexity, and typical values are between 5 and 50. In this paper, we want to obtain as many features as possible in the normal pulse signal to remove noise. So we choose a lower perplexity (the value of $\boldsymbol {{T}_{p}}$ is 6), which means that we only consider the 6 nearest neighbors when fitting each data point to the target distribution.

In addition, the TSS method compared with other state-of-the-art methods that are ICA [6], CHROM [8], POS [9], SSA [11] and rPPGNet [10]. Notably, we use the MAHNOB-HCI dataset as the training set for rPPGNet. The MAHNOB-HCI dataset [24] includes 527 facial videos with corresponding physiological signals from 27 subjects (12 male and 15 female). Then, we directly test the pre-trained model on the self-collected dataset and two public datasets. The Bland-Altman plot was used to evaluate the consistency between the estimated heart rate ${HR}_{video}$ sequence and the ground-truth heart rate ${HR}_{gt}$ sequence, as shown in Fig. 7. The two dotted lines represent the confidence range [$\mu$-1.96$\sigma$, $\mu$+1.96$\sigma$], and only the points between the dotted lines are considered highly reliable.

3.2 Experimental results in the motion interference scenario

In this section, we mainly conduct experiments in a mixture of stationary and motion interference scenarios, and the duration of each video is about 30 seconds. The head motion includes left-right, up-down, and circular motion in the motion interference experiment. In order to test the anti-interference performance of our proposed method, we gradually increase the complexity of the motion, which is mainly divided into three situations: the first, we add some sudden movement to the stationary experiment. Sudden movement is a sudden tilt of the head within a second, which is displayed in the pulse wave diagram as jump signal; the second, we increase the length of the head movement time. The head movement is continuous (left-right or up-down motion). The subject’s head remains still for the first 15 seconds, and the subject’s head moves for the next 15 seconds, which can provide a good contrast effect; the third, we mainly conduct experiments in the scene of head circular motion. The difference between circular motion and continuous motion is that the signal generated by circular motion is periodic. It is found that the frequency of the head movement may coincide with the frequency of the filter, which will affect the results of the experiment, so the frequency of head motion will be gradually adjusted, which can also have a good contrast effect. This part of the experiment is also a more challenging experiment.

Table 1 shows the experimental results of the motion interference scenario. Our method is significantly better than other methods from the data, in which the TSS has the slightest error in the estimating heart rate. Next, we will analyze the experimental results from the above three situations respectively:

Table 1. Results of six methods in motion interference scenario

View Table | View all tables in this article

In the first situation, as shown in Fig. 4, the red rectangle is the jump signal that is abnormal points in the RGB signals. In the data of sudden movement of Table 1, the proposed TSS method reached 1.08 bpm on the MAE metric, and the Pearson correlation coefficient $\rho$ was as high as 0.95, improving the $\rho$ by 2% compared with the second-best rPPGNet method. In addition, the error values of SSA and POS methods are relatively similar, but the four metrics of the ICA method in this scenario are slightly inferior to other methods.

Fig. 4. Pulse diagrams of the sudden motion experiments in motion interference scenario. Red rectangle is the jump signal generated by the sudden motion.

Download Full Size | PDF

In the second situation, the raw RGB signals are chaotic in the experiment of continuous head movement. As shown in Fig. 5, the red rectangle is the signal caused by motion interference. In the continuous movement data of Table 1, our method has reached 2.39 bpm on the MAE metric, and the Pearson correlation coefficient $\rho$ of the TSS method reached 0.86, improving the $\rho$ by 5% compared with the second-best rPPGNet method. The MAE metric of the SSA method is 4.03 bpm. The POS is better than the CHROM and ICA in these evaluation metrics, where the MAE metric has increased by 1.95 bpm compared to the ICA. As shown in Fig. 5, our method has a better processing effect when comparing the PPG signal with the pulse signal extracted by TSS.

Fig. 5. Pulse diagrams of the continuous motion experiments in motion interference scenario. Red rectangle is the signal generated by continuous head movement.

Download Full Size | PDF

In the third situation, different circular motion frequencies are used for comparison experiments in the experiment of head circular movement. In the circular movement data of Table 1, our method still achieves amazingly experimental results in this scenario, which can better extract the pulse signal from the periodic signal. The MAE metric of TSS is 3.19 bpm, and the Pearson correlation coefficient $\rho$ of TSS reaches 0.71. As shown in Fig. 6, our method separates the pulse signal from the periodic signal. The rPPGNet method is inferior to TSS but better than other methods, and its MAE metric is 4.03 bpm. In addition, the SSA method and the POS method are better than the CHROM method. However, the performance of ICA in this scenario is inferior. Each evaluation metric is far inferior to other methods, which reached an astonishing 12.03 bpm on the MAE metric. The data also proves that the ICA method is unsuitable for processing periodic signals.

Fig. 6. Pulse diagrams of the circular motion experiments in motion interference scenario. Green rectangle is the signal generated by high-frequency head circular motion; red rectangle is the signal generated by low-frequency head circular motion.

Download Full Size | PDF

Fig. 7. Bland-Altman diagrams of our proposed method and other state-of-the-art methods in the self-collected dataset.

Download Full Size | PDF

3.3 Experimental results in exercise recovery scenario

In the exercise recovery scenario, the experiment was carried out in the cardiac rehabilitation center. The subjects first exercised with bicycle fitness equipment or elliptical trainer, which caused a massive change in heart rate in a short time. The time of heart rate recovery varies from person to person, so we recorded about 10 seconds of video to study the details of the heart rate recovery, which is of great significance to the study of heart rate variability (HRV) [25,26]. Our experiments found that the subject’s heart rate drops by about 10 bpm every 10 seconds when the heart rate value reaches about 150 bpm.

Table 2 shows the experimental results of the exercise recovery scenario. Our method achieves 4.89 bpm on the MAE metric, with the slightest error among all methods, and the Pearson correlation coefficient $\rho$ reaches 0.89. The rPPGNet is the second-best method with an MAE metric of 5.34 bpm. In addition, the SSA method is slightly better than the POS method, but we are surprised to find that the CHROM method has the worst effect in this scenario, in which the MAE metric reaches 10.84 bpm. The main reason is that the subject is short of breath after the exercise recovery, which causes much noise in the red channel and blue channel, and CHROM cannot handle unstable signals. Furthermore, as shown in Fig. 8, our method performs well in this scenario.

Fig. 8. Pulse diagrams in exercise recovery scenario.

Download Full Size | PDF

Table 2. Results of six methods in exercise recovery scenario

View Table | View all tables in this article

3.4 Experimental results on two public datasets

The UBFC-RPPG dataset [12] contains about 50 subjects divided into two scenarios. In the first scenario, participants were asked to sit still, but some videos present significant movement (especially at the beginning of the sequence). The scenario is composed of 8 videos. In the second scenario, 42 subjects were asked to play a time-sensitive mathematical game that aimed at augmenting their heart rate while allowing more movement. Each video is synchronized with a pulse oximeter finger clip sensor (Contec Medical CMS50E) for the ground truth. The videos are recorded at 30 frames per second with a resolution of 640 $\times$ 480 pixels. Each video is about a minute long. In this dataset, we discard subjects 11, 18, 20, and 24 because their data were partially missing.

VIPL-HR [13] is a public data set that records facial videos of various scenes for non-contact heart rate estimation. It contains 3130 videos of 107 subjects (752 near-infrared videos and 2378 visible light videos). These videos were recorded by three consumer-level cameras (Logitech C310, RealSense F200 and the front camera of the HUAWEI P9 smartphone) in nine different scenes, including head movement and uneven lighting. The ground-truth heart rate value was recorded by a pulse oximeter (CONTEC CMS60C BVP sensor). The head movement videos recorded by Logitech C310 under the "v2 source1" folder of this data set are used as the videos of our experiment. The frame rate of these videos is 25fps, the resolution is 960 $\times$ 720 pixels, and the video duration ranges from 20 seconds to 60 seconds. Each video contains different scenes of stationary, continuous head movement, and circular head movement. These videos are highly challenging.

Table 3 shows the experimental results on two public datasets, where our method also outperforms other state-of-the-art methods. In the UBFC dataset, the TSS performs the best among the four statistical metrics, in which the MAE indicator reaches 1.64 bpm and the Pearson correlation coefficient $\rho$ is 0.94. The rPPGNet is slightly inferior to our method, in which the MAE metric reaches 1.86 bpm, and the Pearson correlation coefficient $\rho$ is 0.93. SSA and POS are better than CHROM and ICA. However, the ICA performs the worst, in which the MAE is 3.89 bpm, and the Pearson correlation coefficient is 0.79. Interestingly, the video in this dataset emulates a normal human-computer interaction scenario, and the heart rate simultaneously changes dynamically. Our method performs well when dealing with the videos with dynamic changes in heart rate, which is the same conclusion as the exercise recovery scenario in the self-collected dataset. Also, it validates that the proposed method is relatively stable.

Table 3. Results of six methods in two pubilc datasets

View Table | View all tables in this article

In the VIPL-HR dataset, the MAE metric of the TSS method reaches 4.76 bpm, and the Pearson correlation coefficient $\rho$ of our method is as high as 0.75. The rPPGNet method still outperforms other methods with the MAE metric of 5.32 bpm. The performance of the POS method on this dataset is better than that of the SSA method. The main reason is that the head movement in these videos is continuous and complicated, which cannot be processed well by SSA. The experimental conclusion is consistent with the conclusion in their paper. In addition, the CHROM method performed the most poorly, reaching 8.93 bpm on the MAE metric. The main reason is that the raw G channel in the dataset is terrible. The interval of the estimated heart rate of the raw G channel is [50,70] bpm. By comparing the VIPL-HR data set with the self-collected data set, we have another conclusion that the video recorded by the Logitech C920 webcam is better than the video recorded by the C310. The Logitech C920 webcam can obtain better skin color changes of the face, so the MAE metric of the VIPL-HR data set is higher than the self-collected data set in the motion interference scenario.

4. Discussion

Comprehensively analyzing the experimental results on a self-collected dataset and public datasets, the advantage of our method over other methods is that our method is relatively stable. Our method is based on the blind source separation model, but the T-SNE algorithm performs well compared to traditional blind source separation methods.

Firstly, ICA and PCA are linear data dimensionality reduction methods. T-SNE is a nonlinear data dimensionality reduction method. T-SNE is sensitive to the selection of parameters, such as the $\boldsymbol {{T}_{p}}$ parameter, which affects the processing effect of our method. This parameter can be understood as the number of valid neighbors near a point. If we set the parameter very low in the T-SNE algorithm, it means that we want to extract the local features of the data; if we set this parameter very high, it means that we want to extract the global features of the data. So, the T-SNE algorithm has high flexibility. In contrast, the PCA method selects the direction with the most significant variance of the projected direction data as the main feature and extracts irrelevant attributes from the original data. However, this processing mainly extracts the data’s overall features and loses the data’s local features. The ICA method finds the most critical independent components in the signal, but the independent components are usually random, which makes the ICA method challenging to decompose the correct signal.

Secondly, according to the characteristics of the signal, the shortcomings of ICA and PCA lead to their poor performance when dealing with periodic signals. For example, in the experiment on head circle movement, it’s difficult to separate the pulse signal by these two methods. On the contrary, T-SNE can separate the pulse signal well, which also verifies the stability of our method.

The anti-motion interference ability of the TSS method is mainly due to the data dimensionality reduction of the T-SNE method. T-SNE is used to generate an adaptive feature matrix, which can well separate the pulse signal. This conclusion is validated by the self-collected and public data experimental results. Compared with other advanced methods, it achieves good results in various complex motion interference scenarios. But T-SNE also has some drawbacks. Since T-SNE uses gradient descent optimization to minimize the objective function, it is essential to choose appropriate parameter values. Otherwise, this will affect the processing speed of T-SNE and the effect of signal separation. For example, suppose we manually set the $\boldsymbol {{T}_{p}}$ parameter to a significant value of 50. When we run T-SNE, we find that the time to generate the adaptive matrix will be significantly longer because the amount of data in each processing process becomes larger. In addition, for the same data, the self-adaptive matrix generated by each run of T-SNE is also different. These are the points we need to pay attention to when using T-SNE.

5. Conclusion

In this paper, we propose a novel method for heart rate estimation that has good anti-motion interference performance. The primary process of the method: firstly, The face is detected and tracked to suppress head movements. Then TSS decomposes the raw RPPG signals into pulse-related vectors and noise vectors using the T-SNE algorithm. Finally, we select the vector with the most significant spectral peak as the pulse signal for heart rate measurement. TSS is tested on the self-collected dataset and two public datasets. The experimental results show that TSS significantly outperforms other state-of-the-art methods.

Furthermore, based on the principle of the T-SNE algorithm, further experiments show that the combination channels of RGB channels and HSV channels can also be used as the raw input signal. Although the proposed method achieves good results in heart rate estimation, it is still a big challenge for complex light interference and more violent sports interference scenes. At the same time, this is also the further work we need to solve.

Funding

Anhui Major Projects of Science and Technology (201903c08020010); Fundamental Research Funds for the Central Universities (PA2021GDSK0071).

Disclosures

The authors declare that there are no conflicts of interest related to this article

Data Availability

The self-collected dataset cannot be shared at this time due to privacy reasons, but can be made available on reasonable request. The public datasets underlying the results presented in this paper are available in Ref. [12,13,24].

References

1. J. Allen, “Photoplethysmography and its application in clinical physiological measurement,” Physiol. Meas. 28(3), R1–R39 (2007). [CrossRef]

2. I. Pavlidis, J. Dowdall, N. Sun, C. Puri, J. Fei, and M. Garbey, “Interacting with human physiology,” Comput. Vis. Image Underst. 108(1-2), 150–170 (2007). [CrossRef]

3. W. Liu, X. Fang, Q. Chen, Y. Li, and T. Li, “Reliability analysis of an integrated device of ECG, PPG and pressure pulse wave for cardiovascular disease,” Microelectron. Reliab. 87, 183–187 (2018). [CrossRef]

4. W. Zhong, K. J. Cruickshanks, C. R. Schubert, C. M. Carlsson, R. J. Chappell, B. E. Klein, R. Klein, and C. W. Acher, “Pulse wave velocity and cognitive function in older adults,” Alzheimer disease and associated disorders 28(1), 44–49 (2014). [CrossRef]

5. W. Verkruysse, L. O. Svaasand, and J. S. Nelson, “Remote plethysmographic imaging using ambient light,” Opt. Express 16(26), 21434–21445 (2008). [CrossRef]

6. M.-Z. Poh, D. J. McDuff, and R. W. Picard, “Non-contact, automated cardiac pulse measurements using video imaging and blind source separation,” Opt. Express 18(10), 10762–10774 (2010). [CrossRef]

7. M. Lewandowska, J. Rumiński, T. Kocejko, and J. Nowak, “Measuring pulse rate with a webcam—a non-contact method for evaluating cardiac activity,” in 2011 Federated Conference on Computer Science and Information systems (FedCSIS), (IEEE, 2011), pp. 405–410.

8. G. De Haan and V. Jeanne, “Robust pulse rate from chrominance-based RPPG,” IEEE Trans. Biomed. Eng. 60(10), 2878–2886 (2013). [CrossRef]

9. W. Wang, A. C. den Brinker, S. Stuijk, and G. De Haan, “Algorithmic principles of remote PPG,” IEEE Trans. Biomed. Eng. 64(7), 1479–1491 (2017). [CrossRef]

10. Z. Yu, W. Peng, X. Li, X. Hong, and G. Zhao, “Remote heart rate measurement from highly compressed facial videos: an end-to-end deep learning solution with video enhancement,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2019), pp. 151–160.

11. D. Wang, X. Yang, X. Liu, J. Jing, and S. Fang, “Detail-preserving pulse wave extraction from facial videos using consumer-level camera,” Biomed. Opt. Express 11(4), 1876–1891 (2020). [CrossRef]

12. S. Bobbia, R. Macwan, Y. Benezeth, A. Mansouri, and J. Dubois, “Unsupervised skin tissue segmentation for remote photoplethysmography,” Pattern Recognition Lett. 124, 82–90 (2019). [CrossRef]

13. X. Niu, H. Han, S. Shan, and X. Chen, “Vipl-hr: A multi-modal database for pulse estimation from less-constrained face video,” in Asian conference on computer vision, (Springer, 2018), pp. 562–576.

14. S. Fallet, V. Moser, F. Braun, and J.-M. Vesin, “Imaging photoplethysmography: What are the best locations on the face to estimate heart rate?” in 2016 Computing in Cardiology Conference (CinC), (IEEE, 2016), pp. 341–344.

15. X. Niu, S. Shan, H. Han, and X. Chen, “Rhythmnet: end-to-end heart rate estimation from face via spatial-temporal representation,” IEEE Trans. on Image Process. 29, 2409–2423 (2020). [CrossRef]

16. L. Van der Maaten and G. Hinton, “Visualizing data using T-SNE,” J. Mach. Learning Res. 9, 2579–2605 (2008).

17. L. Van Der Maaten, “Accelerating T-SNE using tree-based algorithms,” The J. Mach. Learning Res. 15(9), 3221–3245 (2014).

18. D. Kobak and P. Berens, “The art of using T-SNE for single-cell transcriptomics,” Nat. Commun. 10(1), 5416 (2019). [CrossRef]

19. A. Gisbrecht, A. Schulz, and B. Hammer, “Parametric nonlinear dimensionality reduction using kernel T-SNE,” Neurocomputing 147, 71–82 (2015). [CrossRef]

20. L. Van Der Maaten, “Learning a parametric embedding by preserving local structure,” in Artificial intelligence and statistics, (PMLR, 2009), pp. 384–391.

21. M. Kumar, A. Veeraraghavan, and A. Sabharwal, “Distanceppg: Robust non-contact vital signs monitoring using a camera,” Biomed. Opt. Express 6(5), 1565–1588 (2015). [CrossRef]

22. X. Liu, X. Yang, D. Wang, and A. Wong, “Detecting pulse rates from facial videos recorded in unstable lighting conditions: An adaptive spatiotemporal homomorphic filtering algorithm,” IEEE Trans. Instrum. Meas. 70, 1–15 (2021). [CrossRef]

23. X. Li, J. Chen, G. Zhao, and M. Pietikainen, “Remote heart rate measurement from face videos under realistic situations,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2014), pp. 4264–4271.

24. M. Soleymani, J. Lichtenauer, T. Pun, and M. Pantic, “A multimodal database for affect recognition and implicit tagging,” IEEE Trans. Affective Comput. 3(1), 42–55 (2012). [CrossRef]

25. M. Javorka, I. Zila, T. Balharek, and K. Javorka, “Heart rate recovery after exercise: relations to heart rate variability and complexity,” Braz. J. Med. Biol. Res. 35(8), 991–1000 (2002). [CrossRef]

26. V. Cornelissen, B. Verheyden, A. Aubert, and R. Fagard, “Effects of aerobic training intensity on resting, exercise and post-exercise blood pressure, heart rate and heart-rate variability,” J. Hum. Hypertens. 24(3), 175–182 (2010). [CrossRef]

Conditions	Method	MAE (bpm)	SD (bpm)	RMSE (bpm)	$ρ$
Sudden movement	ICA	3.03	5.46	5.97	0.76
	CHROM	2.98	5.06	5.49	0.74
	POS	2.12	4.95	4.96	0.84
	SSA	1.93	4.84	4.77	0.88
	rPPGNet	1.35	4.21	4.55	0.93
	TSS	1.08	3.43	4.04	0.95
Continuous movement	ICA	6.57	11.36	11.27	0.41
	CHROM	5.23	10.15	10.21	0.62
	POS	4.62	9.79	10.17	0.58
	SSA	4.03	9.21	9.05	0.68
	rPPGNet	2.91	5.89	7.44	0.81
	TSS	2.39	5.56	6.04	0.86
Circular movement	ICA	12.03	21.22	22.96	0.31
	CHROM	9.51	16.01	18.79	0.46
	POS	6.26	13.28	14.46	0.31
	SSA	5.81	10.47	10.04	0.51
	rPPGNet	4.03	9.57	9.98	0.63
	TSS	3.19	7.13	7.58	0.71

Method	MAE (bpm)	SD (bpm)	RMSE (bpm)	$ρ$
ICA	9.53	18.73	19.47	0.70
CHROM	10.84	23.67	23.51	0.66
POS	7.26	16.27	16.63	0.76
SSA	6.23	11.01	12.62	0.81
rPPGNet	5.34	10.85	11.72	0.85
TSS	4.89	10.04	10.94	0.89

Dataset	Method	MAE (bpm)	SD (bpm)	RMSE (bpm)	$ρ$
UBFC-RPPG	ICA	3.89	7.78	9.04	0.79
	CHROM	3.61	6.81	8.26	0.83
	POS	2.73	6.62	6.74	0.86
	SSA	2.35	5.98	6.04	0.89
	rPPGNet	1.86	4.03	4.27	0.93
	TSS	1.64	3.59	3.89	0.94
VIPL-HR	ICA	7.21	13.27	14.72	0.61
	CHROM	8.93	14.82	14.34	0.57
	POS	5.94	11.66	10.59	0.67
	SSA	6.59	12.59	13.76	0.65
	rPPGNet	5.32	10.68	11.79	0.71
	TSS	4.76	10.94	11.03	0.75

Conditions	Method	MAE (bpm)	SD (bpm)	RMSE (bpm)	$ρ$
Sudden movement	ICA	3.03	5.46	5.97	0.76
	CHROM	2.98	5.06	5.49	0.74
	POS	2.12	4.95	4.96	0.84
	SSA	1.93	4.84	4.77	0.88
	rPPGNet	1.35	4.21	4.55	0.93
	TSS	1.08	3.43	4.04	0.95
Continuous movement	ICA	6.57	11.36	11.27	0.41
	CHROM	5.23	10.15	10.21	0.62
	POS	4.62	9.79	10.17	0.58
	SSA	4.03	9.21	9.05	0.68
	rPPGNet	2.91	5.89	7.44	0.81
	TSS	2.39	5.56	6.04	0.86
Circular movement	ICA	12.03	21.22	22.96	0.31
	CHROM	9.51	16.01	18.79	0.46
	POS	6.26	13.28	14.46	0.31
	SSA	5.81	10.47	10.04	0.51
	rPPGNet	4.03	9.57	9.98	0.63
	TSS	3.19	7.13	7.58	0.71

Method	MAE (bpm)	SD (bpm)	RMSE (bpm)	$ρ$
ICA	9.53	18.73	19.47	0.70
CHROM	10.84	23.67	23.51	0.66
POS	7.26	16.27	16.63	0.76
SSA	6.23	11.01	12.62	0.81
rPPGNet	5.34	10.85	11.72	0.85
TSS	4.89	10.04	10.94	0.89

Heart rate estimation from facial videos with motion interference using T-SNE-based signal separation

Abstract

1. Introduction

2. Method

2.1 Face detection and tracking

2.2 TSS method

3. Experiments

3.1 Experimental setup

3.2 Experimental results in the motion interference scenario

3.3 Experimental results in exercise recovery scenario

3.4 Experimental results on two public datasets

4. Discussion

5. Conclusion

Funding

Disclosures

Data Availability

References

Data Availability

Cited By

Figures (8)

Tables (3)

Equations (13)

Biomedical Optics Express