1. Introduction
Currently, robust and trustworthy mechanisms are required to protect our private information, in particular when such information grants access to valuable goods or restricted places. In this context, digital methods for personal recognition are fundamental, and the most common practice is still the use of a Personal Identification Number (PIN) or simply a password.
Password-based approaches usually rely on something familiar to the individual, or on a secret that is written down somewhere, encrypted, and then digitally stored. This scenario is susceptible to several attacks that attempt to steal important data. For these reasons, more efficient ways to recognize an individual digitally have been investigated in the literature, and in this context, biometrics-based techniques are the most promising path.
Biometric approaches overcome the limitations of PIN/password approaches because they use human physiological or behavioral characteristics, which can be highly unique and are more difficult to copy or fake.
Governments and corporations fund many researchers, making biometrics a highly researched topic. The findings already have real-world usage, from simple smartphone access to terrorist identification in public spaces. Several biometric techniques achieve almost perfect accuracy in controlled environments [1], although this is not the case in unconstrained environments such as the mobile scenario [2] (also known as in-the-wild environments). Furthermore, in a real scenario, individuals are often far from surveillance cameras, subject to different light sources, and may even wear accessories (glasses, hats, make-up) to purposely deceive biometric systems. According to Neves et al. [3], there are currently no biometric systems capable of working with standard surveillance systems (a real scenario). Thus, the unconstrained scenario is the most actively researched one in the literature, and there are three significant paths to improve biometrics in these environments: improving anti-spoofing techniques, adding robustness to the digital representation of each modality, and using multimodal systems. In this work, we focus on the last one.
A multimodal system is one that uses two or more biometric modalities [4]. A biometric modality is a characteristic that can be used to identify or differentiate an individual; examples include the face, fingerprint, iris, voice, vital signals, and gait. A strong reason to use a multimodal system is that it is not possible to define the “best” biometric modality beforehand, because each modality has a scenario where it works best. Moreover, from ensemble classification theory, a diversity of sources can improve recognition rates. Hence, each biometric modality can counterbalance the pros and cons of the others, which may enhance overall performance.
In the literature, multimodal datasets are not as numerous as unimodal ones. Nonetheless, it is possible to find a few covering the standard (most common) modalities, such as West Virginia University Biometrics (WVU) [5], with six modalities: fingerprint, face, iris, palm-print, hand geometry, and voice; the Mobile Biometrics (MOBIO) dataset [6], with face and voice; and the Multimodal Database Captured with a Portable Handheld Device (MobBIO) [7], with face, eye, and voice, among others [8,9]. However, to the best of our knowledge, there are no datasets that mix the so-called standard modalities (face, eye, fingerprint) with novel and unusual biometrics, such as the ones provided by vital signals (for instance, the heartbeat signal). We believe that biometrics from vital signs can contribute to the robustness of multimodal systems, especially for authentication in a mobile scenario [10]. Nonetheless, such an investigation is not yet possible due to the lack of multimodal datasets. This fact motivates our work on chimeric datasets.
A chimeric dataset is built by creating a set of chimeric individuals, where each modality comes from a different dataset. More specifically, a chimeric individual is created by selecting the samples of a given individual from each kind of modality. The number of possible different chimeric datasets (sets of chimeric individuals) grows exponentially with the number of subjects in each considered unimodal dataset.
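As a rough illustration (our own back-of-the-envelope count, not a figure from the datasets used here): if each of $M$ modalities contributes $n$ individuals and individuals are matched one-to-one across modalities, fixing the ordering of the first modality leaves an arbitrary permutation for each of the remaining ones, giving $(n!)^{M-1}$ distinct chimeric datasets. Even for modest $n$ and $M = 3$, this rules out exhaustive evaluation and motivates sampling random combinations.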
According to Wayman [11], the modalities of an individual are correlated with each other; therefore, chimeric datasets are not the ideal condition. However, given the absence of multimodal datasets in the literature, chimeric datasets are the only option to investigate and understand the potential impact of using unusual/new modalities together to enhance biometric systems.
It is worth mentioning that there is no standard protocol in the literature guiding the creation of chimeric datasets; how to combine multiple biometric datasets is still a point of discussion. Thus, the present work addresses the creation of chimeric datasets for biometric purposes. The experiments conducted in this work reinforce the use of multimodal systems for authentication on mobile devices, such as smartphones and tablets. Results show the advantages of multimodal systems, which in this work combine three modalities described by deep representations: face, eye, and ECG. The most common manner of creating a chimeric dataset is by randomly associating individuals among different datasets.
Usually, an approach based on random association does not take into account the nature of the different modalities or the specific degree of difficulty of each modality. This may cause one modality to stand out over the others, offering an optimistic evaluation of an algorithm that is far from reality. It is therefore necessary to categorize the individuals according to their features. With this aim, a classification scheme categorizing individuals according to their features was proposed by Doddington et al. [12]. Even though the categorization came from the work proposed in [12], our approach relies on the work proposed by Ross et al. [13]. Based on the Doddington Zoo, those authors analyzed whether or not it was better to fuse two or more modalities and created several chimeric individuals to support their claims. Our approach uses the same concepts proposed by Ross et al. [13], aiming to add a constraint criterion that allows the creation of more homogeneous chimeric datasets. The objective of the present work is to promote a fair and controlled benchmark creation to compare methods.
We conduct several experiments addressing different scenarios, aiming to understand the effects of different types of fusion under a systematic and reproducible protocol. Furthermore, new periocular and face recognition models for verification in the open-gallery scenario are proposed, achieving a new State-Of-The-Art (SOTA) on the Face Recognition Grand Challenge (FRGC) dataset.
Therefore, we can summarize the contributions of this article as: (i) a fair and reproducible chimeric dataset creation protocol, (ii) an investigation of the fusion of two traditional modalities (face and eye) with a biometric modality from the heart (off-the-person ECG), (iii) a more feasible, application-oriented method, and (iv) new SOTA results for the FRGC dataset on periocular recognition in verification mode.
This work is organized as follows. Section 2 reviews works in the literature dealing with chimeric datasets. A review of the considered datasets and the state-of-the-art methods is given in Section 3. Section 4 presents the proposed methodology to create the chimeric dataset. The experiments are presented along with the results and discussion in Section 6. Finally, conclusions are highlighted in Section 7.
2. Related Chimeric Dataset Approaches
Due to the current lack of public datasets and benchmarks, evaluating multimodal systems is a challenge. Moreover, no dataset covers all possible combinations of biometrics. If one wants to study an unusual combination of modalities, there are a couple of alternatives. One is the creation of a new dataset, which is expensive and often difficult. Another option is to create a chimeric dataset. In this section, we describe a few works related to chimeric datasets, regardless of their biometric usage.
In [14], a chimeric dataset with two modalities was proposed: fingerprint and ECG. The ECG signal was used both for liveness detection and recognition, although the focus of that work was liveness detection. A significant contribution was the stopping criterion proposed to reduce the length of the signal sample. The authors reported an EER of 3% for liveness detection with thirty seconds of the acquired signal. The chimeric dataset was built randomly, pairing one subject from the ECG dataset with another subject from the fingerprint dataset. The ECG dataset was collected by the authors at the BioSec lab of the University of Toronto, while the fingerprints came from the LivDet2015 dataset [15]. Five hundred different subjects were created, and the process was repeated five times due to the randomness of the dataset construction.
Another chimeric approach involving the on-the-person ECG signal was proposed by Singh et al. [16], who also employed fingerprint and face for the recognition task. The ECG data came from the Physionet dataset, while the fingerprint and face data came from NIST. The authors investigated fusion at the score level, focusing on the weighted sum rule. The reported EERs for the unimodal scenario were 10.80% for the ECG, 4.52% for the face, and 2.12% for the fingerprint. After fusion, the EER dropped significantly to 0.22%.
Barra et al. [17] proposed a new chimeric dataset by merging the ECG signal with six bands of EEG signals. Their recognition approach rested on fiducial points of the ECG signal along with features extracted from the EEG spectrum. Five random non-overlapping segments from a total of 12 s of each signal were employed, in which one segment was randomly chosen for the probe set, while the remaining ones were used as the gallery. One individual from the ECG dataset was arbitrarily combined with another individual from the EEG dataset, resulting in 52 chimeric individuals. The authors also explored different EEG channels. The fusion was performed at the score level with the Euclidean distance as the distance metric. Three fusion rules were investigated: the sum, the product, and the weighted sum. The best result, reported on five-fold cross-validation, was the one considering the weighted sum and the EEG delta band, achieving an EER of 0.928%.
A behavioral biometric that is very common in the literature is the handwritten signature. In [18], an approach was proposed for feature-level fusion of signature and face. Wavelet-based features were used to represent both modalities. The chimeric dataset was created by randomly associating a face with a signature, which resulted in 30 chimeric individuals. A Hamming distance classifier was employed for the genuine-or-impostor decision. The reported accuracy was 97.5% for the Olivetti Research Ltd (ORL) dataset (face) plus the Caltech dataset (signature) and 98.88% for ORL plus the Ucoer dataset (signature). According to the authors, the fusion of the modalities delivered better effectiveness than either modality alone.
Two of the most promising modalities in the biometric scenario, the iris and the face, were merged in [19] using score-level fusion with a weighted sum rule. The weights were empirically chosen and were specific to each subset of data. The authors reported an accuracy of 99.4% on three well-established datasets: the Universiti Teknologi Malaysia Iris and Face Multimodal Datasets (UTMIFM), UBIRIS Version 2.0 (UBIRIS v.2), and ORL face. UTMIFM is a dataset containing both modalities, while the chimeric dataset was built stochastically on UBIRIS v.2 and ORL.
Due to the different natures of chimeric datasets, there is no standard protocol in the literature guiding their creation. Furthermore, in the works cited above, the reported process of creating the chimeric dataset was unclear and lacked detail, and most of the authors did not perform a proper statistical test.
In this work, we propose a comprehensible and reproducible protocol for the creation and evaluation of a chimeric dataset, presented in Section 4.
4. Multimodal Chimeric Dataset Creation and Evaluation Protocol
In this section, the methodology to create a chimeric dataset from multiple datasets is presented. We also recommend how to evaluate and fairly compare methods using a chimeric dataset. An overview of the methodology is shown in Figure 4. The process can be split into five steps: (1) cleaning and filtering the data (preprocessing), (2) feature extraction, (3) the Doddington Zoo criterion, (4) chimeric individual creation and modality fusion, and (5) feature matching and decision. The first three steps were executed independently for each modality, while the latter two were responsible for the fusion and the subject recognition.
4.1. Cleaning and Filtering Data
This step was executed over all unimodal datasets individually, as illustrated in Figure 4. It includes cleaning (or filtering), segmentation, and data normalization, preparing the dataset for the next steps and facilitating the learning and feature representation processes. For instance, it may be used to identify and remove outliers, excessive noise, or any incorrect or corrupted data segment. It is worth noting that each biometric modality has its own set of cleaning methods and approaches, since each one has its singularities. Furthermore, this step is not mandatory.
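As an illustration of this step (a minimal sketch, not the exact pipeline used in this work), a band-pass filter is a common way to clean an off-the-person ECG signal before segmentation; the cutoff values below are typical choices, not values taken from this paper:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass_ecg(signal, fs, low=0.5, high=40.0, order=4):
    # Band-pass the raw ECG: removes baseline wander (< ~0.5 Hz)
    # and high-frequency noise (> ~40 Hz) before segmentation.
    nyq = fs / 2.0
    b, a = butter(order, [low / nyq, high / nyq], btype='band')
    return filtfilt(b, a, signal)

# Example: clean a 10 s signal sampled at 1 kHz.
ecg = np.random.randn(10_000)  # placeholder for a raw ECG segment
clean = bandpass_ecg(ecg, fs=1000.0)
```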
4.2. Feature Extraction
This step aims to map a set of raw (or preprocessed) data to a more discriminative representation. It is crucial in a machine learning context: from this point forward, each instance of a modality is represented by a feature vector. A feature vector typically carries relevant, non-redundant information in a compact form. The features must favor the learning and generalization of machine learning models and facilitate the classification task.
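In this work, deep representations were extracted with CNNs and transfer learning using MatConvNet; the sketch below illustrates the same idea with a pretrained backbone in Python (the backbone choice and the 512-dimensional output are illustrative assumptions, not the models used in this paper):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pretrained backbone used as a fixed feature extractor (transfer learning).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # drop the classifier, keep the embedding
backbone.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract(img):
    # img: a PIL.Image; returns a 512-dimensional feature vector.
    return backbone(preprocess(img).unsqueeze(0)).squeeze(0).numpy()
```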
4.3. Doddington Zoo Criterion
In a recognition system, there are always individuals who are more likely to be confused with others, as well as individuals with little intra-class variability who are therefore easily identifiable. The former contribute more to false acceptance and false rejection errors and can thus distort results, especially when using the EER to analyze system performance [12,13].
In [12], an animal-based nomenclature, called the Doddington Zoo, was proposed to classify the individuals of a recognition system. According to Doddington et al. [12], the most common or default individuals, who are predominant in the population, are labeled as sheep. Individuals that are hard to recognize correctly, and therefore have larger intra-class variability, are called goats. Individuals who are easily imitated and have lower inter-class variability are classified as lambs. Finally, individuals who are easily confused with others or have the potential to impersonate others are called wolves. Since the spoofing scenario was not explored here, we did not consider the wolf category in the present work.
In [13], the authors explored the use of the Doddington Zoo nomenclature to assist in the decision of whether to merge two biometric modalities. They investigated several scenarios and concluded that fusion was the best option when the two modalities to be fused came from two individuals labeled as goats or from two individuals labeled as lambs. According to [13], other combinations of individuals may favor one modality over another.
In this work, for each round of evaluation, a new (random) chimeric dataset was created. Therefore, if one does not control the combination of individuals using a criterion such as the Doddington Zoo, one could create a dataset in which a specific modality has a disproportionate weight in the result, or a dataset with an excess of weak individuals (goat or lamb type). To avoid this situation, we used the criterion proposed by Doddington et al. [12] to label the individuals of the unimodal datasets, and we propose the following constraints for the combination:
A chimeric individual can only be created by combining individuals with the same label. Thus, the modalities that compose a chimeric individual must have the same level of difficulty, that is, the same animal category across all modalities. We considered that datasets containing large numbers of lambs and goats tend to make the verification process harder;
The chimeric dataset must have a fixed number of chimeric individuals of the three labels: sheep, goat, and lamb;
The number of sheep-type individuals should be the majority case.
With these restrictions, we intended to favor the creation of more homogeneous chimeric datasets, promoting reproducibility and comparison. Furthermore, from the unimodal point of view, the number of individuals categorized as goats or lambs was much lower than the number of individuals labeled as sheep. Other combination restrictions were explored in [13] for two modalities. We selected the combinations that favored the fusion of modalities and reduced the impact of a single modality standing out from the others.
To determine the category of the individuals, we used the methodology proposed in [13] and the training data. First, all individuals without at least one sample in every session were discarded. To determine who must be labeled as a goat, a few steps were taken: the intra-class distance was calculated for all users and sorted in ascending order; then, the last n individuals, representing all individuals above the 70th percentile, were selected. The rationale was to select the individuals with the largest distances among their intra-class samples. For the lambs, the mean inter-class distance was first calculated in a one-against-all comparison, and the results were sorted in ascending order; then, the first n individuals, representing the lower 30th percentile, were selected. The rationale of this approach was to select the individuals with the lowest distances to the others. Equation (1) was used for the percentile calculation. It returns the index associated with a specific percentile, and the individuals above (respectively, below) this index are considered goats (respectively, lambs):

$$ i = \left\lceil \frac{p}{100} \times N \right\rceil, \qquad (1) $$

in which $p$ is the percentile and $N$ is the number of individuals available in the dataset. It is worth highlighting that the individuals were indexed as integers and each individual was labeled as only one animal.
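A minimal sketch of this labeling procedure (assuming NumPy feature matrices and Euclidean distance; the use of mean pairwise distances is our reading of the description above, and the function names are illustrative):

```python
import numpy as np

def percentile_index(p, n):
    # Nearest-rank index for percentile p over n individuals (Equation (1)).
    return int(np.ceil(p / 100.0 * n))

def doddington_labels(features, ids, p=30):
    # features: (n_samples, dim) array; ids: (n_samples,) individual labels.
    subjects = np.unique(ids)
    intra, inter = {}, {}
    for s in subjects:
        own = features[ids == s]
        others = features[ids != s]
        # Mean pairwise intra-class distance for subject s.
        intra[s] = np.mean([np.linalg.norm(a - b) for a in own for b in own])
        # Mean one-against-all inter-class distance for subject s.
        inter[s] = np.mean(np.linalg.norm(others - own.mean(axis=0), axis=1))
    k = percentile_index(p, len(subjects))
    by_intra = sorted(subjects, key=lambda s: intra[s])  # ascending
    by_inter = sorted(subjects, key=lambda s: inter[s])  # ascending
    goats = set(by_intra[-k:])         # above the 70th percentile
    lambs = set(by_inter[:k]) - goats  # lower 30th percentile, one label each
    sheep = set(subjects) - goats - lambs
    return goats, lambs, sheep
```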
We have two possible scenarios: (1) modalities that come from different individuals and (2) modalities that come from the same individual. For the first scenario, the process was conducted separately for each modality, since there was no strong correlation among individuals. For the second scenario, where modalities belonged to the same dataset, we forced the chimeric individual to share the same label among all modalities. To define the category (goat/lamb) of those chimeric individuals, we first detected the categories separately in each modality and chose the individuals that shared the same category. However, this approach may result in a total number of individuals lower than the expected 10th percentile. To circumvent this problem, the following criteria were used: (1) individuals who had divergent labels, and were therefore excluded from the first round of chimeric individual composition, were reconsidered; (2) a new random permutation was created from this new set; and (3) the first n individuals were then used to complete the category.
Once the goats and lambs were defined, we generated a random permutation of them, selected a proportional 10th percentile of the valid individuals for each “animal”, and set the remaining individuals as sheep. With that, we aimed to generate a different chimeric dataset in each execution.
Once the goats, lambs, and sheep were determined for each modality, goats were combined with goats, lambs with lambs, and sheep with sheep. The number of each “animal” was limited by the modality with the fewest available individuals of that animal. For instance, if Modality 1 had 10 goats, 10 lambs, and 80 sheep and Modality 2 had 5 goats, 5 lambs, and 400 sheep, the chimeric dataset created over these two modalities would have 5 goats, 5 lambs, and 80 sheep.
4.4. Building the Chimeric Dataset
The first two steps were performed only once: for each modality, a specific feature extraction method was applied and the feature vectors were stored. These feature vectors were used to label the individuals of the unimodal datasets according to the Doddington Zoo criteria, so the chimeric individuals were created from pre-computed features. The proposed chimeric dataset creation is a guided stochastic process because of the Doddington Zoo criteria; thus, for each new experiment, a new chimeric dataset has to be created. To promote reproducibility, we stress that the method and the seeds used to generate the randomness should be stored and made available.
The process of creating the chimeric dataset can be formally expressed as follows. Let $D$ be a dataset scenario with $M$ different modalities, with $M \geq 2$, i.e., $D = \{D_1, D_2, \ldots, D_M\}$. Furthermore, let $D_i = \{I^i_1, I^i_2, \ldots\}$ be the set of individual samples of the modality $i$, where $I^i_j$ is the sample set of the individual $j$ of the modality $D_i$. Note that each modality may have a distinct number of individuals and each individual a different number of samples. Then, let $n_i$ be the number of individuals of the modality $i$ and $s^i_j$ the number of samples of each individual $j$ of modality $i$, with $1 \leq i \leq M$ and $1 \leq j \leq n_i$.
Then, a chimeric individual (or subject), denoted by $C_k$, is defined as follows,

$$ C_k = \left( I^1_{j_1}, I^2_{j_2}, \ldots, I^M_{j_M} \right), $$

for some $j_i \in \{1, \ldots, n_i\}$, $i \in \{1, \ldots, M\}$, chosen only once. Hence, the number of individuals of the chimeric dataset was equal to the number of individuals of the modality with the fewest individuals. Besides, the number of samples of each chimeric individual was limited by the lowest number of samples among the selected individuals of each modality; thus, the number of samples may vary per chimeric individual. We stress that this process was accomplished separately for each category (label) of the Doddington Zoo criteria, respecting the restrictions presented in Section 4.3.
To understand the creation process of a chimeric individual, consider a scenario with three datasets, each representing a different modality. One individual was randomly selected from each dataset and combined with two individuals from the other two datasets, respecting the labels and restrictions presented in Section 4.3, thus resulting in a new chimeric individual. The samples of each modality were combined sequentially: the first sample of Modality 1 was combined with the first sample of Modality 2, Modality 1’s second sample with Modality 2’s second sample, and so on. All data belonging to this new chimeric individual were then excluded from the next rounds of selection, ensuring that a single individual from a dataset A was assigned to only a single individual from a dataset B, and the process was repeated until no more individuals were available for selection in the datasets. As this association was made one by one (one individual of Modality 1 combined with one of Modality 2 and one of Modality 3, as illustrated in Figure 5), the number of chimeric individuals was limited by the dataset with the fewest individuals, and the number of samples per chimeric individual was limited by the most restrictive modality of that specific individual.
Figure 5 illustrates the proposed chimeric dataset protocol with a didactic example, in which the number of individuals in the Modality 2 dataset limited the total number of chimeric individuals generated, and the number of samples of Modality 3 limited the total number of samples per chimeric individual. The second individual of Modality 2 also limited the number of samples of one chimeric individual.
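A minimal sketch of this construction (the data layout and function name are illustrative assumptions; the per-category pairing, the one-time use of each unimodal individual, and the sequential sample pairing follow the protocol above):

```python
import random

def build_chimeric_dataset(modalities, seed):
    # modalities: one dict per modality mapping category
    # ('sheep' / 'goat' / 'lamb') -> {individual_id: [sample, ...]}.
    # Store `seed` with the results so the random draw is reproducible.
    rng = random.Random(seed)
    chimeric = []
    for category in ('sheep', 'goat', 'lamb'):
        pools = [list(m[category].items()) for m in modalities]
        for pool in pools:
            rng.shuffle(pool)
        # zip() stops at the shortest pool: the modality with the fewest
        # individuals of this category limits the number of chimeric
        # individuals, and each unimodal individual is consumed once.
        for group in zip(*pools):
            samples = [s for _, s in group]
            n = min(len(s) for s in samples)  # most restrictive modality
            # Pair samples sequentially: first with first, second with second...
            chimeric.append([tuple(s[k] for s in samples) for k in range(n)])
    return chimeric
```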
4.5. Feature Matching and Decision
Once the chimeric dataset was created, the data could be used to perform biometric recognition and, as the next natural step, to evaluate and compare methods. In biometric systems, a common evaluation mode is called verification, or positive identification, where the goal is to prevent more than one person from accessing the same information [28]. The verification mode is a challenging evaluation, since it resembles the open-world/gallery case. Thus, our protocol recommends the use of verification to compare works.

In the verification mode, the biometric system validates whether a subject is who he/she claims to be; that is, the input information is compared against the identity data previously enrolled in the system. Verification is widely used, and the most common forms are alphanumeric passwords and token cards. Considering verification by similarity, a verification system can be mathematically modeled as in Equation (2):
$$ (I, X_Q) \in \begin{cases} \omega_1, & \text{if } S(X_Q, X_I) \geq t, \\ \omega_2, & \text{otherwise,} \end{cases} \qquad (2) $$

where $I$ is the identity to be verified (an integer number denoting an individual label), $X_Q$ is the input feature vector, $\omega_1$ is the genuine subject class, $\omega_2$ is the impostor subject class, $S$ is a function that measures the similarity between two feature vectors of the same dimensionality, $X_Q$ and $X_I$ (a comparison between an instance pair), and $t$ is a pre-defined threshold [28]. The similarity $S$ was calculated using some distance metric. The most common distance metrics are the Euclidean, Manhattan, Mahalanobis, Spearman, and Hamming distances.
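A minimal sketch of this decision rule (assuming NumPy vectors and negative Euclidean distance as the similarity; the threshold value is a hypothetical one, to be calibrated on training scores):

```python
import numpy as np

def verify(x_query, x_enrolled, t):
    # Similarity as negative Euclidean distance; any of the metrics above
    # (Manhattan, Mahalanobis, Hamming, ...) fits the same interface.
    s = -np.linalg.norm(x_query - x_enrolled)
    return 'genuine' if s >= t else 'impostor'

print(verify(np.array([0.1, 0.9]), np.array([0.12, 0.88]), t=-0.5))
```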
It is important to note that the verification mode uses a threshold t in the comparison. Two samples from the same subject, even from the same sensor, may present variations between them. These variations can be generated by the subject himself/herself (for example, a different haircut or a different pose) or by sensor imperfections (noise). The threshold directly affects the system effectiveness, and finding the most appropriate one depends on the application. It is common in the literature to construct distribution curves (histograms of scores), as well as the Detection Error Trade-off (DET) curve, for method evaluation. From the DET curve and the score distribution curves, two metrics can be used to evaluate the methodology: decidability and the Equal Error Rate (EER). Both are widely used in the literature to compare methods on several publicly available datasets.
The decidability ($d$) indicates how far the intra-class (genuine) distribution scores are from the inter-class (impostor) ones [29]. The calculation relies on the means and standard deviations of the intra-class ($\mu_{intra}$ and $\sigma_{intra}$) and inter-class ($\mu_{inter}$ and $\sigma_{inter}$) distributions, and it can be expressed as:

$$ d = \frac{\left| \mu_{intra} - \mu_{inter} \right|}{\sqrt{\frac{1}{2}\left( \sigma_{intra}^2 + \sigma_{inter}^2 \right)}}. $$
The EER is related to the False Acceptance Rate (FAR) and the False Rejection Rate (FRR): it is the operating point where the FAR equals the FRR. The lower the EER, the better the system performs in the average case. The EER can be calculated from the DET curve, which plots the false rejection rate vs. the false acceptance rate, as the point where the curve crosses the FAR = FRR line (or the point closest to it).
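A minimal sketch of both metrics (assuming similarity scores where genuine pairs should score high; the threshold sweep is a simple approximation of reading the EER off the DET curve):

```python
import numpy as np

def decidability(genuine, impostor):
    # d from the means/stds of the genuine (intra) and impostor (inter) scores.
    g, i = np.asarray(genuine, float), np.asarray(impostor, float)
    return abs(g.mean() - i.mean()) / np.sqrt((g.var() + i.var()) / 2.0)

def eer(genuine, impostor):
    # Sweep thresholds over all observed scores; the EER is the point
    # where FAR ~= FRR (higher score = more similar here).
    g, i = np.asarray(genuine, float), np.asarray(impostor, float)
    best_gap, best_eer = 2.0, 1.0
    for t in np.sort(np.concatenate([g, i])):
        far = np.mean(i >= t)  # impostors accepted
        frr = np.mean(g < t)   # genuine attempts rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2.0
    return best_eer
```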
When working with multimodal datasets, such as the ones used in the proposed chimeric dataset, a criterion must be applied to merge the modalities. However, how to properly combine/fuse modalities is still an open problem in the literature, and fusion may happen at four different levels: sensor, feature, score, and decision. Fusion at the sensor level consists of merging two or more input data streams before any recognition process, for instance fusing infrared images with visible-spectrum images to perform face recognition. Fusion at the feature level is conducted after the feature extraction step. The merge may occur in several manners, the simplest being concatenation: n features of one modality concatenated with m features of another modality, resulting in a new feature vector of size n + m. This level has the biggest potential for optimization in multimodal systems, since more information is available.
Fusion at the score level also happens after the feature extraction step. The similarity scores are computed separately for each modality and combined employing a specific rule. The simplest and most common rules used to combine scores are the sum, product, max, min, and exponential sum; more complex strategies are also possible. Fusion at the decision level is less popular in the literature and is often associated with classifiers and voting strategies.
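A minimal sketch of feature-level and score-level fusion (vector sizes and score values are placeholders, not figures from this work):

```python
import numpy as np

rng = np.random.default_rng(0)
face_vec, eye_vec, ecg_vec = (rng.normal(size=256),
                              rng.normal(size=256),
                              rng.normal(size=128))

# Feature level: simple concatenation (n + m + ... dimensional vector).
fused_features = np.concatenate([face_vec, eye_vec, ecg_vec])

# Score level: per-modality similarity scores combined by a fixed rule.
scores = np.array([0.91, 0.78, 0.85])  # placeholder per-modality scores
fused_scores = {'sum': scores.sum(), 'product': scores.prod(),
                'min': scores.min(), 'max': scores.max()}
```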
4.6. Comparing Results
Due to the randomness of the chimeric dataset creation process, one must consider the law of large numbers. For a fair comparison, we therefore recommend executing the entire random process 30 times [30]. One must then compute the mean and standard deviation and perform a statistical test to compare the methods.
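A minimal sketch of this comparison (assuming one EER per run; Student's t-test is used here as in Section 6, and the function name is illustrative):

```python
import numpy as np
from scipy import stats

def compare_methods(eers_a, eers_b, alpha=0.05):
    # eers_a, eers_b: EERs from 30 independent chimeric-dataset draws each.
    print(f"A: {np.mean(eers_a):.3f} +/- {np.std(eers_a):.3f}")
    print(f"B: {np.mean(eers_b):.3f} +/- {np.std(eers_b):.3f}")
    _, p = stats.ttest_ind(eers_a, eers_b)
    return p < alpha  # True if the difference is statistically significant
```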
Besides, one should pay attention to whether the datasets have different acquisition sessions, from which two scenarios emerge: intra-session and inter-session. In this context, intra-session means all the images are taken within a short period of time, usually in the same place and at the same moment. Inter-session means the images are taken at larger time intervals (days or even weeks apart). The inter-session scenario introduces a “natural noise” due to the individual's own interference, such as changes in haircut, skin tone, accessories, and makeup, among others. The inter-session problem tends to be harder than the intra-session one, since the variation among samples of the same class (same individual) is higher. Since the proposed protocol aims to evaluate the robustness of the data representation (features), we recommend evaluating methods on at least two sessions, to assess the impact of intra-class variation over time.
To perform the evaluation, one score should be computed for each pair of samples of the dataset, in a one-against-all fashion. For the intra-session scenario, only one set of data (one session) is considered for both training and testing. For the inter-session scenario, data from two or more sessions should be used; thus, all pairs are necessarily created with data from different sessions.
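A minimal sketch of the pair generation for the two scenarios (session layout and function name are illustrative assumptions):

```python
from itertools import combinations, product

def make_pairs(session_a, session_b=None):
    # session_*: lists of (individual_id, feature_vector) tuples.
    # Intra-session: all pairs within one session.
    # Inter-session: only cross-session pairs, so every comparison
    # mixes acquisition sessions.
    if session_b is None:
        return list(combinations(session_a, 2))
    return list(product(session_a, session_b))
```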
6. Multimodal Chimeric Dataset Evaluation
In this section, we describe the experiments performed on the FEyG dataset. The analysis covered fusion scenarios with two and with three combined modalities. We also present the case in which the modalities were evaluated separately, in order to assess the impact of the fusion with respect to each individual modality.
We conducted the experiments on an Intel(R) Core i7-5820K CPU @ 3.30 GHz with 12 cores, 64 GB of DDR4 RAM, and a GeForce GTX TITAN X GPU. All the implementations were based on the MatConvNet toolbox [37] linked with NVIDIA CuDNN.
The datasets considered here did not have the same number of individuals: CYBHi had 63, and FRGC had over 300. Considering that, and after the preprocessing (filtering) steps, the final number of chimeric individuals was 56, determined by the number of individuals remaining in the base CYBHi dataset after preprocessing. The number of samples for each chimeric individual was also limited to the worst case (the individual with the minimum number of samples among all datasets). In our specific case, the number of chimeric individuals was limited by the number of valid ECG individuals (CYBHi), while the number of samples per chimeric individual was limited by the ECG or the face, depending on the combination.
Table 2 shows the results acquired for all modality combinations in three different scenarios: a test conducted on the data used for training (intra-session on the training data), a test on data never seen during training (intra-session on the test data), and a test of training data against test data (inter-session). In the last scenario, the test data act as the probe of a biometric system and the training data as the gallery.
To ensure a statistically fair comparison for each scenario, the experiments were executed 30 times (mean ± standard deviation), each execution using a different combination of the chimeric dataset. We used Student's t-test to compare the results, taking the EER in the intra-session and inter-session scenarios separately and comparing the lowest EER against the other results. The best results are highlighted in Table 2. We emphasize that, according to the performed test, there was no statistically significant difference among the results highlighted in red. In the next subsections, we explore all modality combinations.
6.1. One Modality
Since the model was overfitted to the training data, it was expected that the intra-session scenario on the same data used to construct the model would lead to better figures. In contrast, the inter-session scenario is the closest to the real one. In the inter-session scenario, each sample in the test dataset was compared to all samples in the training dataset; thus, from a biometric system point of view, the test samples played the role of the probe, while the training dataset played the role of the gallery. Since the inter-session samples were captured at intervals of days or months, more intra-class noise is expected, which explains the less favorable results in this scenario.
Analyzing the results for the modalities in isolation, we noticed that the standard deviation was larger for the ECG modality. We hypothesize that individuals of the lamb and goat types were more difficult in the ECG dataset than in the other datasets.
6.2. Two Modalities
Among all presented modality combinations, the fusion of the eye and the face was the most natural/practical: very often, when a picture of a face is captured, the eye is intrinsically captured as well. Both face and eye are well established as strong biometrics. The results presented here (Table 2) support the strength of this fusion, reducing the mean EER obtained with face recognition on FRGC by more than 45%, and by more than 90% for the eye.
Even though the fusion of the face with the eye is considered the most natural one, adding the ECG modality to the system can offer significant gains, and it seems feasible from a practical point of view. From Table 2, one can see that, in addition to reducing the error, the addition of the ECG reduced the standard deviation, yielding a more robust and accurate approach.
6.3. Three Modalities
The fusion of ECG with face, eye, or both is feasible, since it can be implemented in the real world with accessible mobile devices. The system may be built on a mobile platform, where the ECG is captured by off-the-person sensors, such as the one shown in Figure 1, and the eye/face by the camera of a smartphone.
This kind of fusion delivers robustness and enhanced recognition. It can also be used for spoofing detection [14,38]. There are also works that have explored the ECG signal as a way to detect whether a person is alive, also known as liveness detection [39].
The fusion of the ECG signal with the eye and the face delivers advantages regarding the verification rate (in terms of EER). The achieved results showed how two strong biometrics, such as face and eye, still benefited from fusion with the ECG signal. Since the acquisition of the eye is intrinsic to the acquisition of the face, it is natural to introduce the eye into the fusion of ECG and face. This addition (all three modalities combined) resulted in even better performance, as can be seen in Table 2.
Figure 9 shows the evolution resulting from the addition of the eye and, thereafter, the ECG. One can observe how the distance between the genuine and impostor distribution curves increased as modalities were included: the overlap between the curves decreased when the eye modality was combined with the face, and it became even smaller when the ECG signal was included.
We also note that the present study investigated the fusion considering only 56 individuals, which limits conclusions regarding scalability. It is important to evaluate the proposed methodology on large chimeric datasets (>1000 individuals).
6.4. Doddington Zoo Analysis
When datasets are created by a random procedure, homogeneity is not ensured; that is, for some individuals, one modality may weigh more in the result [40]. The Doddington Zoo criteria were used to obtain a more homogeneous chimeric dataset, ensuring that the combined modalities were at the same level of difficulty.
It is worth pointing out that this scenario does not necessarily represent the real one; however, it is closer to a worst-case scenario and is more challenging from a biometric perspective. Furthermore, it is important to emphasize that results on chimeric datasets may diverge from results on real datasets [41]. In addition, an evaluation protocol should allow a fair comparison between methods and should facilitate reproducibility. We believe that, with the criteria presented in Section 4.3, our protocol favors both reproducibility and a fair comparison among methods.
7. Conclusions
In this work, we proposed a simple and reproducible protocol for the creation of chimeric datasets. Moreover, we conducted a statistical evaluation covering three biometric modalities: the ECG signal, the eye (periocular region), and the face. Since multimodal datasets are scarce in the literature, the standardization of protocols is highly desirable.
The chosen modalities allowed the investigation of combining strong and popular modalities (face and eye) with less popular ones, such as the off-the-person ECG signal. Fusion was explored at two different levels: at the feature level by simple concatenation, and at the score level by the sum, min, and multiplication rules. All images and signals were represented by deep features extracted using CNNs and transfer learning.
Results for the multimodal approach indicated that the fusion of the modalities analyzed here is promising. Compared to the best unimodal approach, the multimodal approach improved the decidability by over 28% and 22% for the intra-session (unknown data) and inter-session scenarios, respectively. Regarding the EER, the multimodal approach reduced the error by over 11% for the intra-session scenario and by 6% for the inter-session one. We emphasize the multimodal approach of ECG, eye, and face, which achieved outstanding decidability and EER in both scenarios: a decidability of 7.20 ± 0.18 with an almost perfect verification system (i.e., an EER of 0.20% ± 0.06) in the intra-session scenario on unknown data, and a decidability of 7.78 ± 0.78 with an EER of 0.06% ± 0.06 in the inter-session scenario.
The proposed protocol aiming at the creation and evaluation of chimeric datasets forced the creation of homogeneous datasets and therefore allowed a fair comparison between methods. We believe that the protocol favored the reproducibility of the experiments.
Future work includes a better comprehension of the effects of the chosen fusion and of scalability (increasing the number of modalities and the feature vector size). Scalability can also be explored in terms of a higher number of individuals (e.g., more than 1000 chimeric individuals). Moreover, another point to study is the use of different distance metrics.