1. Introduction
Human activity recognition (HAR) is the task of automatically recognizing a person's physical activity. HAR is applied in many fields, such as health monitoring, smart homes, sports, security, context awareness, and indoor navigation. Starting in the 1990s [1], HAR has emerged as a key problem in ubiquitous computing, human-computer interaction, and human behavior analysis [2,3,4]. There, sensory data obtained from wearable sensors [5] are used to identify a human activity. Sensor data can be collected by utilizing the channel state information of Wi-Fi signals [6] or by leveraging the sensors of multiple wearable devices, such as inertial sensors (accelerometers, gyroscopes, and barometers) and ambient environment sensors (temperature and humidity) [7,8].
Focusing on indoor navigation, HAR approaches make it possible to classify the user dynamics (mode), for example walking or running, using the smartphone's inertial sensors. Knowledge of the user mode helps to improve the positioning accuracy. For example, in [9] statistical measures were employed to distinguish between walking and running user modes. Once the mode was classified, appropriate model parameters were selected. Later, four types of user modes, namely walking, running, bicycle, and vehicle, were addressed in [10].
An important branch of HAR in indoor navigation is smartphone location recognition (SLR) [11,12]. It refers to the process of identifying the location of a smartphone on the user during specific actions. For example, the smartphone is judged to be in Texting mode when the user holds the phone and writes a text message, or in Talking mode when the user holds the phone during a phone call. Identifying those smartphone modes helps to improve pedestrian dead reckoning performance, as was shown in [13]. In [14] both user and smartphone modes were addressed, including eight types of user modes and seven types of smartphone locations. There, a distinction was made between several types of the Texting operation, namely with one or two hands, and additionally, in the Swing mode, between a big and a small arm swing. In [15] a finite state machine was used to identify three smartphone locations: Swing, Texting, and Pocket. To assist in the integration of the accelerometers and gyroscopes, a threshold-based approach was used to distinguish between the presence and absence of a Swing mode [16]. Recently, smartphone mode recognition was used to improve heading determination performance [17].
Most of the papers addressing HAR or SLR use a dataset with specific known modes (for instance, Talking, Texting, or Swing) in a supervised learning approach. When the user encounters a previously unknown mode (for instance, Pocket), the classifier will perforce identify it as one of the original modes it was trained on. Such classification errors degrade the accuracy of the navigation solution. This problem is not unique to HAR or SLR, of course, and is known in the machine learning (ML) literature under a number of different names, guises, and variations, such as classification with a reject option, one-class classification, anomaly detection, and open set recognition, to name a few. We refer the reader to the recent survey [18] for a more detailed and nuanced exposition of the different variants, and to [19] for a theoretical analysis.
Consider the following common scenario: a user is walking with a smartphone, and a pedestrian dead reckoning (PDR) algorithm is applied to estimate the user position. In such an algorithm, different parameters are used depending on the user dynamics (Walking, Standing, Escalators, etc.) and smartphone locations (Texting, Talking, Pocket, etc.). For example, different user dynamics or smartphone locations will result in different PDR gain values or different network parameters for estimating the pedestrian position. Focusing on the smartphone location, SLR classifies it and the appropriate PDR parameters are selected. For unknown smartphone locations, that is, smartphone locations which are not defined in the PDR algorithm, it is desirable to take an average gain value over all defined modes to minimize the positioning errors. Otherwise, misclassifying the unknown mode as one of the known modes will result in a position error, as shown in [13] for a 21 m trajectory. For longer trajectories, the error is expected to increase. Thus, a methodology to cope with unknown modes needs to be incorporated in the PDR algorithm.
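The fallback described above, using a mode-specific gain for known smartphone locations and an average gain for unknown ones, can be sketched as follows. This is only an illustration: the gain values and the names `KNOWN_MODE_GAINS` and `select_pdr_gain` are hypothetical and are not taken from the paper or from [13].

```python
from statistics import mean

# Hypothetical per-mode PDR gains; placeholder values, not taken from the paper.
KNOWN_MODE_GAINS = {
    "Texting": 0.50,
    "Talking": 0.48,
    "Swing": 0.55,
    "Pocket": 0.45,
}


def select_pdr_gain(mode, gains=KNOWN_MODE_GAINS):
    """Return the gain of a known mode, or the average gain for an unknown mode."""
    if mode in gains:
        return gains[mode]
    # Unknown smartphone location: fall back to the mean over all defined modes.
    return mean(gains.values())
```

With such a rule, a sample flagged as unknown (for example, a phone carried in a bag) is not forced onto the gain of whichever known mode the classifier happens to pick.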
To fill this gap, in this paper we propose two end-to-end ML-based approaches to address the unknown smartphone location modes problem using only the smartphone's accelerometer measurements: (1) a supervised approach, which requires labelling of the known modes during the training phase, and (2) an unsupervised approach, which does not require labelled training data. In both approaches, a feature representation space is extracted and fed into a K-nearest neighbors algorithm to detect the unknown modes.
To enhance the efficiency and robustness of the proposed approaches, multiple datasets are used in the training and testing phases. The training process is based on four different datasets recorded by 23 people while the smartphone was placed in four different known smartphone locations: Texting, Pocket, Swing, and Talking. The test dataset contains two additional datasets recorded by 25 people (not present in the train dataset) with the following five smartphone locations treated as unknown modes: Body, Bag, Belt, Waist, and Upper-arm.
Although our focus is on unknown modes in the SLR problem, the proposed approaches can be easily adjusted to any other domain requiring unknown mode classification.
The rest of the paper is organized as follows: Section 2 reviews current approaches to handle unknown modes, while Section 3 presents the mathematical foundations required for the proposed approaches. In Section 4, the proposed approaches are described, while the datasets used to evaluate them are described in Section 5. Section 6 presents the experimental results of this research. Finally, Section 7 gives the conclusions of this work.
4. Proposed ML-Based Approaches: Anomaly Detection in Deep Feature Space
The goal of this paper is to provide an effective mechanism for the detection of accelerometer signals belonging to unknown modes. To that end, the SLR domain is chosen to derive and present the proposed approach, but it can be easily adapted to any other activity recognition task and to other domains as well. It is assumed that a separate model capable of classifying the known SLR modes exists, for example, as in [11]. There, the four known smartphone locations (classes) are: (1) Texting, (2) Swing, (3) Pocket, and (4) Talking. The central idea of our solution is as follows: a high-quality classification model for the known modes will have already learnt a good deep feature representation of the accelerometer readings. This representation is essentially a function $f:\mathbb{R}^{n}\rightarrow\mathbb{R}^{d}$, where
n is the input dimension of the signal (
in our case) and
d is the dimension of the hidden representation. The proposed approach is illustrated in
Figure 1.
Two different approaches to utilize the feature representation in order to classify unknown modes are suggested:
SUN: supervised unknown network: The penultimate layer of a network, trained in a supervised manner on data belonging to a number of known modes, is employed as the feature representation space. Then, a KNN algorithm is applied to determine if the signal belongs to a known or an unknown mode. Another possibility is to apply standard dimension reduction techniques (PCA or LDA) prior to applying KNN.
UUN: unsupervised unknown network: Novel features extracted from a latent representation using a variational recurrent auto-encoder (VRAE) architecture, trained in an unsupervised manner, are used as the feature representation space. Then, a KNN algorithm is applied to determine if the signal belongs to a known or an unknown mode. Unlike in the SUN approach, here the labels of the known modes are not needed in the training process.
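The optional dimension reduction step mentioned for SUN can be illustrated with a small PCA projection. The sketch below uses plain NumPy rather than any particular library's PCA class, and the function names `fit_pca` and `apply_pca` are hypothetical.

```python
import numpy as np


def fit_pca(features, n_components):
    """Estimate the mean and the leading principal directions of the features."""
    mean = features.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing explained variance.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]


def apply_pca(features, mean, components):
    """Project feature vectors onto the retained principal directions."""
    return (features - mean) @ components.T
```

The projection is fitted on feature vectors of the known modes only and then applied unchanged to every new feature vector before the KNN step.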
With the representation in hand (from either the SUN or the UUN approach), a robust and effective way to detect unknown modes is via anomaly detection in the representation space. To that end, the non-parametric method of [29], as implemented in the PyOD toolbox [35], is employed. In other words, we apply the KNN algorithm of [29] to the d-dimensional feature vector representing a signal instance and obtain either a TRUE or a FALSE answer. TRUE means that the instance is an anomaly, that is, the signal is from an unknown mode, while FALSE means that the signal belongs to a known mode. The SUN and UUN approaches are shown in Figure 2A.
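The TRUE/FALSE decision in the representation space can be sketched with a k-th nearest neighbor distance score. The code below is a NumPy stand-in written for illustration, not the PyOD implementation used in the paper; the function names, the value of `k`, and the `contamination` quantile are assumptions.

```python
import numpy as np


def knn_scores(train, queries, k=5):
    """Distance from each query vector to its k-th nearest training vector."""
    # Pairwise Euclidean distances, shape (n_queries, n_train).
    diffs = queries[:, None, :] - train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # k-th smallest distance in each row.
    return np.partition(dists, k - 1, axis=1)[:, k - 1]


def fit_threshold(train, k=5, contamination=0.05):
    """Pick a score threshold from the known-mode feature vectors alone."""
    # Use k + 1 so that each training point skips its zero distance to itself.
    scores = knn_scores(train, train, k + 1)
    return np.quantile(scores, 1.0 - contamination)


def is_unknown(train, queries, threshold, k=5):
    """True marks an anomaly (unknown mode); False marks a known mode."""
    return knn_scores(train, queries, k) > threshold
```

Because the detector stores only feature vectors of the known modes, newly encountered locations need no labels; they are flagged when they lie far from all stored vectors.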
Having defined the proposed approaches, we compare them to the Thresholding, Reject option, and Training with a background class approaches (as described in Section 2) in Table 1. Notice that when a new unknown mode is added, or when an unknown mode is converted to a known one, neither the SUN nor the UUN approach requires an additional retraining procedure.
Remark 1. The emphasis we place on the role of the deep feature representation as opposed to more sophisticated rejection mechanisms is inspired by the pervasive role deep features play in visual image understanding where it is by now well-established that good features allow strong generalization across tasks (cf., e.g., [36,37]), which is exactly what we seek here. This also allows for smooth handling of the appearance of new unknown modes since they usually do not affect the feature representation.
4.1. SUN Network Architecture
Motivated by [11], a supervised one-dimensional convolutional neural network (1D-CNN) model was trained on the four known modes. The CNN architecture used for training on the different SLR modes is shown in Figure 2B. The input to the network is the accelerometer measurements (specific force vector). The first layer is a 1D-CNN layer with 32 units and a ReLU activation. The second layer is identical to the first one. After a dropout of 0.6, the next layer is a 1D-Pooling layer of size two, followed by a flatten layer that sets the dimensions for the following two dense layers. The first dense layer has 32 units with a ReLU activation function, and the final dense layer has a Softmax activation to output the SLR classification result. That is, given the accelerometer readings, the network outputs one of the four known modes.
4.2. SUN Training Procedure
The network is trained with a minibatch size of 32 using the RMS propagation (RMSProp) [38] optimization algorithm, which divides the gradient by a running average of its recent magnitude. An initial learning rate of
, a discounting factor for the history/coming gradient of
, and zero momentum are applied. Dropout is applied after the convolutional neural network (CNN) layers, with probability
. The network is trained on the four SLR modes (Swing, Texting, Talking, Pocket), using the categorical cross entropy (CCE) loss function defined in [39] for single-label categorization. The network is trained for 12 epochs. The model was implemented using the Keras open-source neural network library in Python [40] and was trained on a single NVIDIA GeForce GTX 1080 GPU.
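To make the RMSProp rule concrete, the sketch below implements a single update without momentum in NumPy: the squared gradient is averaged with a discounting factor, and the gradient is divided by the square root of that average. The default values of `lr`, `rho`, and `eps`, as well as the function names, are illustrative assumptions and are not the settings used in the paper.

```python
import numpy as np


def rmsprop_step(w, grad, avg_sq, lr=1e-2, rho=0.9, eps=1e-7):
    """One RMSProp update with zero momentum."""
    # Running average of the squared gradient (recent gradient magnitude).
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2
    # Divide the gradient by the root of the running average.
    w = w - lr * grad / (np.sqrt(avg_sq) + eps)
    return w, avg_sq


def minimize_quadratic(w0, steps=500):
    """Toy usage: minimize f(w) = sum(w**2), whose gradient is 2 * w."""
    w = np.asarray(w0, dtype=float)
    avg_sq = np.zeros_like(w)
    for _ in range(steps):
        w, avg_sq = rmsprop_step(w, 2.0 * w, avg_sq)
    return w
```

In the toy problem, the normalization makes the step size roughly independent of the gradient scale, which is the property the optimizer description above refers to.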
4.3. UUN Network Architecture
The UUN approach uses the VRAE [41] network architecture. This model maps a time sequence to a single latent vector and enables efficient, large-scale unsupervised variational learning on time sequences, while trying to avoid the exploding gradients problem and to achieve better scores. The main concept behind it is to partition unlabeled time-series measurements into homogeneous clusters based on interpretable generative features.
The strength of the VRAE architecture is that it extends the standard variational auto-encoder (VAE) model by combining recurrent neural networks (RNNs), used as the network encoder and decoder, with stochastic gradient variational Bayes (SGVB) [42]. The VRAE architecture used for training on the different SLR modes is presented in Figure 2C. It receives the specific force vector measurements as input to the encoder RNN layer, which contains an unrolled long short-term memory (LSTM) block with a hidden size of 90, a depth of 3, and a dropout rate of
. Then, the RNN output is passed to the next layer (encoder to latent), where it is mapped to the mean and standard deviation using a linear unit activation. The hidden code layer has 20 dimensions and is distributed according to this mean and standard deviation; it serves as the feature representation of the entire input during training (i.e., the encodings). Next, the latent layer is passed through a linear unit activation to obtain the initial states for the decoder RNN layer. Decoder inputs are updated using backpropagation.
The UUN model was trained to minimize the following loss function over the known smartphone modes for accelerometer readings X:
$$\mathcal{L}(X) = \lambda_{1}\,\mathcal{L}_{SL1} + \lambda_{2}\,\mathcal{L}_{KL} \qquad (3)$$
where the loss function is a superposition of two loss measures, Smooth L1 and the Kullback-Leibler (KL) divergence, with corresponding gains $\lambda_{1}$ and $\lambda_{2}$, respectively.
The $\mathcal{L}_{SL1}$ loss measure is an auto-encoder loss that learns the identity function, so the sequences of input and output vectors must be similar. It is given by:
$$\mathcal{L}_{SL1}(x, y) = \frac{1}{n}\sum_{i=1}^{n} z_{i}$$
where $z_{i}$ is defined by:
$$z_{i} = \begin{cases} 0.5\,(x_{i} - y_{i})^{2}, & |x_{i} - y_{i}| < 1 \\ |x_{i} - y_{i}| - 0.5, & \text{otherwise} \end{cases}$$
and x (input) and y (target) are of arbitrary shapes with a total of n elements each. This loss was shown to be less sensitive to outliers than the mean square error loss and in some cases prevents exploding gradients [43].
The second part of the loss function in Equation (3) is the KL divergence, a loss measure between the distribution learned in the latent space and the normal distribution, defined by:
$$\mathcal{L}_{KL} = -\frac{1}{2}\sum_{j=1}^{L}\left(1 + \log\left(\sigma_{j}^{2}\right) - \mu_{j}^{2} - \sigma_{j}^{2}\right)$$
where L is the length of the continuous latent variable, $\mu$ is the variational mean, and $\sigma$ is the standard deviation evaluated at datapoint $x$.
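The two terms of the UUN loss, Smooth L1 reconstruction and KL divergence to a standard normal, can be sketched in NumPy using their standard textbook forms. The function names, the log-variance parameterization of the encoder output, and the default weights `w_rec` and `w_kl` are assumptions made for this illustration; the paper states only that both terms are weighted equally.

```python
import numpy as np


def smooth_l1(x, y):
    """Smooth L1 loss averaged over all n elements of x and y."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    # Quadratic for small errors, linear for large ones (less outlier-sensitive).
    z = np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    return float(z.mean())


def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, sigma^2) and N(0, 1), summed over latent dims."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return float(-0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var)))


def total_loss(x, y, mu, log_var, w_rec=1.0, w_kl=1.0):
    """Weighted sum of the reconstruction and KL terms (weights are hypothetical)."""
    return w_rec * smooth_l1(x, y) + w_kl * kl_to_standard_normal(mu, log_var)
```

For instance, `smooth_l1([0, 0], [0.5, 3])` averages a quadratic term of 0.125 and a linear term of 2.5, and the KL term vanishes when the latent mean is zero and the variance is one.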
4.4. UUN Training Procedure
The network is trained with a minibatch of size 32 and the Adam optimizer, with
,
and
. We use an initial learning rate of
and a
penalty of 0 for the training process. Dropout is applied in the encoder layer (that uses a multi-layer LSTM) with probability
. The network is trained on the four SLR modes (Swing, Texting, Talking, Pocket), using the loss function defined in Equation (3). Note that the mode labels are not used during the training process. Training is performed with
;
, in which case both losses are weighted equally during learning. The network weights are initialized following Glorot and Bengio [44], who proposed a properly scaled uniform distribution for initialization, and the network is trained for 90 epochs with gradient clipping enabled and a maximum gradient norm of 5 to prevent exploding gradients. The model was implemented using PyTorch [45] and trained on a single NVIDIA GeForce GTX 1080 GPU.
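Two ingredients of this training procedure, Glorot uniform initialization and gradient clipping by global norm, are simple enough to sketch in NumPy. The code is an illustration of the general techniques, not the PyTorch calls used by the authors; the function names and the small constant added to the norm are assumptions.

```python
import numpy as np


def glorot_uniform(fan_in, fan_out, rng):
    """Sample weights uniformly in [-limit, limit], limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def clip_grad_norm(grads, max_norm=5.0):
    """Rescale all gradients so that their joint L2 norm does not exceed max_norm."""
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    # Gradients already within the limit are left unchanged (scale = 1).
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm
```

Clipping rescales all gradients by one common factor, so the update direction is preserved while its magnitude is bounded by the chosen maximum norm (5 in the procedure above).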
5. Dataset
To evaluate the proposed approaches for SLR with unknown modes, six different datasets were used. Two of the datasets were constructed for evaluating deep-learning approaches in the SLR problem [11], while the other four datasets, found on the internet, were constructed for other applications. In all the datasets, the smartphone location was at least one of the following four possibilities: (1) pocket, (2) texting, (3) swing, and (4) talking, while the users were walking. No constraints were imposed on how the smartphone should be held in each location. For example, the talking operation can be performed while the user is holding the phone in the right hand close to the ear or in the left hand far from it. In all recordings, only the accelerometer readings are used for the proposed approaches.
For the training process, four different datasets, as described in Table 2, were employed.
For the SUN approach, the known labels were used during training, while for the UUN approach they were not. The first dataset, D1, contains recordings from a single user with two different smartphones in all four possible smartphone locations. A total of 164 min of recordings were made using sampling rates between 25 and 100 Hz. The data were recorded while the user was walking in inhomogeneous conditions, for example, varying walking speeds, uneven pavements, tight and sport trousers with front and back pocket locations, transitions between pavement and roads, varying hand swing (small to big), and texting and talking with a single hand (right and left) in different positions relative to the user.
The second dataset, HTA, was recorded by six people from Huawei's Tel-Aviv research center. Each person used a different smartphone during the recordings. The third dataset used in this research is RIDI [46]. This dataset was recorded for indoor navigation research and not for activity recognition. There, the goal was to estimate the change in acceleration by machine learning approaches and use it to correct the raw accelerometer data. Although that work was not related to SLR, the RIDI dataset was recorded by eight people using a smartphone in two locations: front pocket and texting. The fourth dataset, OXF, was recorded to examine the possibility of using deep learning methods to estimate the pedestrian position and heading [47]. There, a dataset with a duration of 240 min was recorded by seven people while the smartphone was in Pocket or Texting mode.
The test dataset contains five smartphone locations, not present in the training data, which are treated as unknown modes: Waist, Upper-arm, Belt, Body, and Bag. From RIDI [46], the Body and Bag modes were taken for the analysis. There, the smartphone was placed in a small bag that was held on the right leg and with a strap on the body. In [48], an HAR problem was addressed using wearable sensors located in seven positions on the user: chest, forearm, head, shin, thigh, upper arm, and waist. This dataset, denoted as WOB, contains recordings of 15 people (seven women and eight men), and here we employ only recordings taken from the Waist and Upper-arm locations as unknown modes. The third one, [49], also addressed an HAR problem with seven different modes, among them walking. There, five smartphones were placed on each user: right and left jeans pockets, right upper arm, wrist, and belt. The belt smartphone was pointed towards the right leg using a belt clip. Here, we employ the latter and the Body as unknown modes, and denote this dataset as PAR. The main parameters of those three datasets are given in Table 3.
7. Conclusions
SLR aims to identify the location of a smartphone in specific user actions. This task is critical for accurate indoor navigation using PDR. Common PDR approaches cannot handle unknown modes with desired accuracy, and therefore, their performance is degraded. In this paper, two end-to-end ML-based approaches to cope with unknown modes during the classification process were suggested and evaluated on the smartphone location recognition problem.
The first approach, SUN, used the feature representation space of a trained network as its basis, while the second approach, UUN, generated the feature representation space using a variational recurrent auto-encoder. Both approaches require only the smartphone's accelerometer measurements to perform the classification and unknown mode detection.
Multiple datasets were used in the training and testing phases. In training, four different datasets were used. They were recorded by 23 people while the smartphone was placed in four different known smartphone locations: Texting, Pocket, Swing, and Talking. The test dataset contained two additional datasets recorded by 25 people (not present in the train dataset) with the following five smartphone locations treated as unknown modes: Body, Bag, Belt, Waist, and Upper-arm.
Before examining the proposed approaches, it was explained and shown why a classification approach based on a background class cannot handle unknown modes effectively in the SLR problem. Then, the performance of a baseline thresholding approach [21] and of the proposed approaches was evaluated. The thresholding approach was chosen since it was shown in the literature to obtain good performance. Yet, in the SLR problem it failed to work and obtained poor accuracy. On the other hand, the proposed SUN approach achieved an accuracy of 93.12%, while UUN obtained 88.85% on the test dataset. The main advantage of the UUN approach is that it does not require labeling the known modes in the training process; however, its training time is 18 times longer than that of the SUN approach. This is attributed to the computational complexity of the model, which can be evaluated based on the total number of trainable model parameters: the SUN approach has 17,924 parameters, compared to 335,563 for the UUN approach. It was also shown that applying PCA or LDA in the SUN approach did not improve its performance; however, it reduced the inference time by a factor of approximately 20. Thus, there is a trade-off between accuracy and computational cost, which should be determined based on the required application.
Finally, the proposed approaches were derived as end-to-end ML approaches and thus can be easily adjusted and applied to other related fields addressing the problem of unknown mode detection.
Future work includes an analysis of encoding length of the UUN approach and its influence on the accuracy and the computational cost over different datasets. Additionally, different detectors can be examined, as well as different deep architectures for detecting Unknown modes.